Posts by Collection

portfolio

Controllable Clustering with LLM-driven Embeddings

Given the inherent subjectivity of similarity in text, fully unsupervised text clustering is unlikely to produce groupings that are relevant across a variety of use cases. Traditional techniques to guide clustering rely on costly, time-consuming human feedback and/or pre-existing labels. Leveraging recent advancements in LLMs and decoder-only embedding models, this project presents two techniques to control text embeddings with minimal human input: instruction prefixing and LLM preprocessing. We evaluate clustering performance on datasets with multiple independent ground-truth labels, or perspectives, and find that these techniques can improve clustering for one perspective or use case at the cost of performance on another.
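As a rough illustration of instruction prefixing, the sketch below prepends a perspective-specific instruction to each text before it would be embedded. The helper name `with_instruction`, the example reviews, and the instruction wording are all hypothetical, and the embedding model itself is left out; this is not the project's actual pipeline.

```python
from typing import Sequence


def with_instruction(texts: Sequence[str], instruction: str) -> list[str]:
    """Prepend the same natural-language instruction to every text.

    Hypothetical helper: the prefixed strings would be passed to an
    instruction-following embedding model, which is not shown here.
    """
    prefix = instruction.strip()
    return [f"{prefix}\n{text}" for text in texts]


# Made-up example data with two plausible perspectives: topic and sentiment.
reviews = [
    "Great battery, but it arrived late.",
    "Terrible screen, though shipping was fast.",
]

# One prefixed copy of the dataset per perspective we want to cluster by.
by_feature = with_instruction(reviews, "Represent the product feature discussed in this review.")
by_sentiment = with_instruction(reviews, "Represent the sentiment of this review.")
```

Each prefixed list would then be embedded and clustered separately, so the same documents can be grouped under different perspectives.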

Inductive Orientation-enabled Model Characterization

As models grow more complex, interpretable black-box characterization techniques are increasingly relevant. Based on the Algorithmic Search Framework, we present estimation methods for model-theoretic quantities, such as algorithm flexibility, sensitivity to data, and ability to specialize. We compute these quantities across a wide variety of classification algorithms, observing trends that match known heuristics and theoretical properties. We further use these metrics to compare algorithms with different architectures and hyperparameter configurations. These findings support using the metrics for model evaluation, comparison, and hyperparameter tuning.

Probabilistic Error Guarantees for Abductive Inference

Abductive reasoning is ubiquitous in artificial intelligence and everyday thinking; however, formal theories that provide probabilistic guarantees for abductive inference are lacking. I led the development of a general framework for selective abduction based on Bayesian Decision Theory. With this framework, I have derived probabilistic bounds for abductive success under two reward schemes: (1) rewarding the selection of the single most likely cause, or (2) rewarding the selection of any cause whose probability is above some threshold. The former relies purely on Bayesian probability, whereas the latter combines it with a search approach that builds on prior work on the Algorithmic Search Framework (ASF). By incorporating uncertainty in background knowledge, this work establishes probabilistic bounds on the success of selective abduction, leverages information-theoretic results from the ASF, and provides mathematical justifications for everyday abductive intuitions.
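The two success criteria can be made concrete with a small sketch. It assumes a known posterior over candidate causes, which is a simplification, since the project explicitly models uncertainty in background knowledge. The function names and the example numbers are hypothetical, and the sketch does not reproduce the derived bounds.

```python
def succeeds_most_likely(posterior: dict[str, float], selected: str) -> bool:
    """Criterion (1): success only if the selected cause is a most likely one."""
    return posterior[selected] == max(posterior.values())


def succeeds_above_threshold(
    posterior: dict[str, float], selected: str, threshold: float
) -> bool:
    """Criterion (2): success if the selected cause's probability exceeds the threshold."""
    return posterior[selected] > threshold


# Made-up posterior over causes of the observation "the lawn is wet".
posterior = {"rain": 0.6, "sprinkler": 0.3, "burst pipe": 0.1}

# "sprinkler" fails the stricter criterion but passes the threshold one.
strict_ok = succeeds_most_likely(posterior, "sprinkler")
lenient_ok = succeeds_above_threshold(posterior, "sprinkler", threshold=0.25)
```

Under criterion (2), several causes can count as correct at once, which is what makes the selection problem resemble a search for any element of a target set.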

Bounded-confidence Cascade Parameter Fitting

Bounded-confidence cascades simulate the spread of ideas on a social network. Fitting these models with social media datasets would let us study the mechanics of online political polarization. However, bounded-confidence model fitting is largely unexplored by the field due to incompatibility between messy, real datasets and the model’s abstract foundations.
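For readers unfamiliar with bounded-confidence dynamics, the sketch below shows one pairwise update in the style of the classic Deffuant model: two agents move toward each other only if their opinions differ by less than a confidence bound. This is a generic textbook rule used as a stand-in; the cascade model studied in this project may differ, and the names `epsilon` and `mu` are assumptions.

```python
def bounded_confidence_step(
    opinions: list[float], i: int, j: int, epsilon: float, mu: float = 0.5
) -> bool:
    """Apply one Deffuant-style pairwise update in place.

    Agents i and j interact only if their opinions differ by less than
    the confidence bound epsilon; each then moves a fraction mu of the
    gap toward the other. Returns True if an interaction occurred.
    """
    gap = opinions[j] - opinions[i]
    if abs(gap) >= epsilon:
        return False  # too far apart: neither agent is influenced
    opinions[i] += mu * gap
    opinions[j] -= mu * gap
    return True


# Made-up opinions on a [0, 1] scale.
opinions = [0.1, 0.3, 0.9]
close_pair = bounded_confidence_step(opinions, 0, 1, epsilon=0.25)  # within bound
far_pair = bounded_confidence_step(opinions, 0, 2, epsilon=0.25)    # outside bound
```

Fitting such a model to data amounts to estimating parameters like the confidence bound from observed interactions, which is where real datasets become awkward to match to the model's abstractions.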