Projects & Publications

Controllable Clustering with LLM-driven Embeddings

[Poster]

Because similarity in text is inherently subjective, fully unsupervised text clustering is unlikely to produce groupings that are relevant across a variety of use cases. Traditional techniques for guiding clustering rely on costly, time-consuming human feedback, pre-existing labels, or both. Leveraging recent advances in LLMs and decoder-only embedding models, this project presents two techniques for controlling text embeddings with minimal human input: instruction prefixing and LLM preprocessing. We evaluate clustering performance on datasets with multiple independent ground-truth labels, or perspectives, and find that these techniques can improve clustering for one perspective or use case, at the cost of reduced performance for another.

Bounded-Confidence Cascade Parameter Fitting

[Poster]

This project investigates bounded-confidence cascades as models of idea spread in social networks, with the aim of analyzing online political polarization using social media data. Fitting these models to data remains largely unexplored because of mismatches between real-world data and the model’s abstract foundations. To address this, the project develops tools for extracting follower and retweet networks from Twitter and for estimating key network parameters. Opinions are represented as single numerical values: probabilistic text classifiers, including logistic regression and LSTM models, assign opinion scores directly to text data.

Controllable Clustering with LLM-driven Embeddings

Inductive Orientation-derived Model-Agnostic Metrics

Probabilistic Abduction

Generating Opinion Distributions for Bounded-Confidence Models