PhD Student, Northeastern University
Practitioners of machine learning and related fields commonly seek embeddings of object collections into some Euclidean space. These embeddings are useful for dimensionality reduction, for data visualization, as concrete representations of abstract notions of similarity for similarity search, or as features for downstream learning tasks such as web search or sentiment analysis. A wide array of such techniques exists, ranging from traditional (PCA, MDS) to trendy (word2vec, deep learning).
Most existing techniques depend on exact numerical input of some form: object features, distance estimates, etc. For datasets derived from user behavior or opinion, however, such quantitative data is too sparse, too noisy, or too difficult to interpret for us to rely on exact numerical values. Instead of trusting the numerical values of similarity or distance estimates outright, one can rely on the ordering of those values, e.g. by sorting all items by their estimated similarity to a given query item and discarding the (unreliable) exact numerical similarities. The relationship between objects is thereby reduced to a set of ordinal triplets, such as "object a is closer to object b than to object c." One can then seek an embedding that preserves this ordinal information, and prove guarantees on the degree to which such an embedding must recover the true distances or similarities between items.
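As a concrete illustration of this pipeline, the following is a minimal sketch (not my actual research code) of ordinal embedding with NumPy: it derives triplets "a is closer to b than to c" from a hypothetical ground-truth configuration, then fits an embedding by gradient descent on a hinge loss over those triplets. All names and parameters (the margin, learning rate, and iteration count) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth points; we pretend their exact distances are
# unreliable and keep only ordinal comparisons between them.
n, dim = 20, 2
truth = rng.normal(size=(n, dim))

def sq_dists(X):
    # Matrix of squared Euclidean distances between the rows of X.
    G = X @ X.T
    sq = np.diag(G)
    return sq[:, None] - 2 * G + sq[None, :]

D = sq_dists(truth)

# Sample ordinal triplets (a, b, c): "a is closer to b than to c".
raw = np.array([rng.choice(n, size=3, replace=False) for _ in range(2000)])
a, b, c = raw[:, 0], raw[:, 1], raw[:, 2]
swap = D[a, b] > D[a, c]                     # reorder so b is the closer item
b[swap], c[swap] = c[swap].copy(), b[swap].copy()

# Fit an embedding X so that d(a,b)^2 + margin < d(a,c)^2 for each triplet,
# via full-batch gradient descent on the hinge loss
#   max(0, margin + d(a,b)^2 - d(a,c)^2).
X = rng.normal(scale=0.1, size=(n, dim))
lr, margin = 1.0, 0.1
for _ in range(1000):
    d = sq_dists(X)
    viol = d[a, b] + margin > d[a, c]        # triplets currently violated
    va, vb, vc = a[viol], b[viol], c[viol]
    grad = np.zeros_like(X)
    # Gradients of the hinge term: pull a and b together, push a and c apart.
    np.add.at(grad, va, 2 * (X[vc] - X[vb]))
    np.add.at(grad, vb, 2 * (X[vb] - X[va]))
    np.add.at(grad, vc, 2 * (X[va] - X[vc]))
    X -= lr * grad / len(raw)

# Fraction of sampled triplets the learned embedding satisfies.
d = sq_dists(X)
acc = float(np.mean(d[a, b] < d[a, c]))
```

Note that the procedure never touches the numerical distances of the ground truth after the triplets are formed; only the comparisons survive, which is exactly what makes this robust to noisy or uninterpretable similarity scores.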
My research has explored several key questions for producing large-scale ordinal embeddings, including:
(Summaries available upon request)