The talk is motivated by the question: how efficiently can we search distributions? The problem is modeled as follows: we are given knowledge of k discrete distributions v_i for 1 <= i <= k over the domain [n] = {1,...,n}, which we can preprocess. Then we get samples from an unknown discrete distribution p, also over [n]. The goal is to output the closest distribution to p among the v_i's in TV distance (up to some small additive error). State-of-the-art algorithms require Theta(log k) samples and run in near-linear time in k.
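For concreteness, here is a minimal Python sketch of the distance measure in question: the total variation (TV) distance between two distributions over [n], represented as length-n probability vectors. The function name and example values are illustrative, not from the talk.

```python
import numpy as np

def tv_distance(a, b):
    """TV distance between two distributions over [n], given as
    length-n probability vectors: half the l1 distance."""
    return 0.5 * np.abs(np.asarray(a) - np.asarray(b)).sum()

# Example: two distributions over [3] = {1, 2, 3}.
print(tv_distance([0.5, 0.3, 0.2], [0.4, 0.3, 0.3]))  # 0.1 (up to floating point)
```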
We introduce a fresh perspective on the problem and ask whether we can output the closest distribution in time sublinear in k. This question is particularly natural since the problem generalizes the traditional nearest neighbor search problem: if we take enough samples, we can learn p explicitly up to low TV distance and then find the closest v_i in o(k) time using standard nearest neighbor search algorithms.
However, this approach requires Omega(n) samples. Thus, it is natural to ask: can we obtain both a sublinear (in n) number of samples and sublinear (in k) query time? We present progress towards answering this question.
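As a point of reference, here is a hedged sketch of the naive baseline just described, assuming the v_i are given as explicit probability vectors: learn p via its empirical distribution (the step that costs Omega(n) samples), then linearly scan the v_i in Theta(k*n) time; standard nearest-neighbor indexing would replace the scan with an o(k)-time lookup. Function names and constants are illustrative, not from the paper.

```python
import numpy as np

def closest_distribution(vs, samples, n):
    """Naive baseline: learn p empirically, then linear-scan TV distances.

    vs      : (k, n) array, each row a known distribution over [n]
    samples : draws from the unknown p, encoded as integers in {0,...,n-1}
    Returns the index i minimizing the TV distance between the empirical
    estimate of p and v_i.
    """
    counts = np.bincount(np.asarray(samples), minlength=n)
    p_hat = counts / counts.sum()                  # empirical estimate of p
    tv = 0.5 * np.abs(vs - p_hat).sum(axis=1)      # TV distance to each v_i
    return int(np.argmin(tv))

# Toy usage: p is secretly v_7, so the baseline should return 7.
rng = np.random.default_rng(0)
vs = rng.dirichlet(np.ones(100), size=50)          # k=50 distributions over n=100
samples = rng.choice(100, size=20000, p=vs[7])
print(closest_distribution(vs, samples, 100))      # expect 7
```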
This is joint work with Anders Aamand, Alex Andoni, Justin Chen, Piotr Indyk, and Shyam Narayanan, and appeared at ICML 2023.
Sandeep Silwal is a final-year Ph.D. student at MIT. His research interests are broadly in algorithm design; recently, he has been working at the intersection of machine learning and classical algorithms, designing provable algorithms in various ML settings, such as efficient algorithms for processing large datasets, as well as using ML to inspire and motivate algorithm design.
LSRC D344, or join virtually via Zoom: https://duke.zoom.us/j/92510717076