Lunch will be served at 11:45 AM.
Efficiency is increasingly tied to quality in machine learning, with more efficient training algorithms leading to more powerful models trained on more data. However, today's most popular machine learning models are built on asymptotically inefficient primitives. For example, attention in Transformers scales quadratically in the input size, which makes it challenging for Transformers to use long context. In this talk, I discuss my work on improving the efficiency of the core primitives of machine learning, with an emphasis on hardware-aware algorithms and long-context applications. In the first half, I focus on replacing attention with gated state space models (SSMs) and convolutions, which scale sub-quadratically in context length. I describe H3 (Hungry Hungry Hippos), a gated SSM architecture that matches Transformers in quality up to 3B parameters and achieves 2.4x faster inference. In the second half, I focus on developing hardware-aware algorithms for SSMs and convolutions. I describe FlashFFTConv, a fast algorithm for computing SSMs and convolutions on GPUs by optimizing the Fast Fourier Transform (FFT). FlashFFTConv yields up to 7x speedup and 5x memory savings, even over vendor solutions from Nvidia. FlashFFTConv is now widely used in many gated SSM models, including language models, image generation models, and long-context DNA foundation models.
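To give a flavor of the key idea behind FFT-based long convolutions, here is a minimal NumPy sketch (not the FlashFFTConv implementation, which is a hardware-aware GPU kernel): the convolution theorem turns an O(n^2) causal convolution over a length-n sequence into O(n log n) work in the frequency domain. The function names `fft_conv` and `direct_conv` and the signal names `u` and `k` are illustrative, not from the talk.

```python
import numpy as np

def fft_conv(u, k):
    """Causal convolution of input u with kernel k via the FFT, O(n log n)."""
    n = len(u)
    # Zero-pad to 2n so circular convolution (what the FFT computes)
    # agrees with linear convolution on the first n outputs.
    fft_size = 2 * n
    u_f = np.fft.rfft(u, n=fft_size)
    k_f = np.fft.rfft(k, n=fft_size)
    # Pointwise multiply in frequency, invert, keep the causal part.
    return np.fft.irfft(u_f * k_f, n=fft_size)[:n]

def direct_conv(u, k):
    """Same causal convolution computed directly, O(n^2): y[i] = sum_j k[j] u[i-j]."""
    n = len(u)
    return np.array([sum(k[j] * u[i - j] for j in range(i + 1)) for i in range(n)])

rng = np.random.default_rng(0)
u = rng.standard_normal(256)
k = rng.standard_normal(256)
assert np.allclose(fft_conv(u, k), direct_conv(u, k))
```

The same O(n log n) structure is what lets SSM and long-convolution layers scale sub-quadratically in context length; FlashFFTConv's contribution is making this FFT path fast in practice on GPUs.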
Dan Fu is a PhD student in the Computer Science Department at Stanford University, where he is co-advised by Christopher Ré and Kayvon Fatahalian. His research interests are at the intersection of systems and machine learning. Recently, he has focused on developing algorithms and architectures to make machine learning more efficient, especially for enabling longer-context applications. His research has appeared as oral and spotlight presentations at NeurIPS, ICML, and ICLR, and he has received the best student paper runner-up award at UAI. Dan has also been supported by an NDSEG fellowship.