Lunch will begin at 11:45 a.m.
Abstract:
Diffusion transformers have revolutionized image generation, yet their scalability is fundamentally constrained by the quadratic cost of attention. This talk explores how intrinsic properties of images—their two-dimensional spatial layout and local smoothness—can be systematically leveraged to design hardware-aligned and efficient transformer architectures for diffusion models.
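To make the quadratic scaling concrete, here is a minimal back-of-the-envelope sketch in Python; the resolutions and the patch size of 16 are illustrative assumptions, not figures from the talk:

    # Entries in one self-attention score matrix for a square image,
    # assuming the image is split into non-overlapping patches (tokens).
    def attention_entries(image_side: int, patch_size: int = 16) -> int:
        n_tokens = (image_side // patch_size) ** 2
        return n_tokens ** 2

    for side in (256, 512, 1024):
        print(side, attention_entries(side))
    # 256 -> 65,536   512 -> 1,048,576   1024 -> 16,777,216

Doubling the resolution quadruples the token count and increases the attention cost sixteenfold.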
We begin with HilbertA, which reorganizes image tokens along Hilbert curves to align computation with the 2D structure of visual data while preserving spatial neighborhoods and maintaining coalesced GPU memory access. Next, ToMA exploits the local smoothness of images by formulating token merging as a submodular optimization problem, identifying and combining redundant visual tokens through GPU-friendly attention-like transformations. Finally, INTRA generalizes these insights into a broader Intra Sparse Pattern Design (ISPD) principle, introducing non-contiguous sparse attention patterns that capture structural priors in images while remaining computationally efficient and flexible across modalities.
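To give a flavor of the first idea, here is a minimal sketch of Hilbert-curve token reordering. It uses the standard xy-to-Hilbert-index mapping and illustrates only the underlying principle, not HilbertA's actual implementation; the grid size and token dimension are assumptions:

    import numpy as np

    def xy2d(n, x, y):
        """Hilbert-curve index of cell (x, y) on an n x n grid (n a power of two)."""
        d = 0
        s = n // 2
        while s > 0:
            rx = 1 if x & s else 0
            ry = 1 if y & s else 0
            d += s * s * ((3 * rx) ^ ry)
            if ry == 0:                       # rotate/flip the quadrant
                if rx == 1:
                    x, y = n - 1 - x, n - 1 - y
                x, y = y, x
            s //= 2
        return d

    def hilbert_permutation(side):
        """Permutation taking a row-major (side x side) grid to Hilbert order."""
        d = [xy2d(side, x, y) for y in range(side) for x in range(side)]
        return np.argsort(d)

    tokens = np.random.randn(64 * 64, 768)    # assumed 64 x 64 grid of 768-dim tokens
    tokens_hilbert = tokens[hilbert_permutation(64)]

Because the Hilbert curve never jumps across the grid, tokens that are close in 2D tend to remain close in the 1D sequence, so blocked or windowed attention over the reordered sequence respects spatial neighborhoods while keeping memory reads contiguous.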
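The abstract does not spell out ToMA's objective, but selecting representative tokens under a facility-location function is a classic instance of submodular optimization, so the following toy sketch conveys the flavor: greedily choose k representatives, then merge every token into its most similar one. All names and sizes here are illustrative assumptions, not ToMA's actual formulation:

    import numpy as np

    def merge_tokens(tokens, k):
        """Toy token merging (illustrative, not ToMA's formulation)."""
        x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = x @ x.T                               # pairwise cosine similarity
        coverage = np.zeros(len(tokens))            # best similarity achieved so far
        chosen = []
        for _ in range(k):                          # greedy submodular maximization
            gains = np.maximum(sim, coverage).sum(axis=1) - coverage.sum()
            gains[chosen] = -np.inf                 # never re-pick a representative
            chosen.append(int(np.argmax(gains)))
            coverage = np.maximum(coverage, sim[chosen[-1]])
        assign = np.argmax(sim[:, chosen], axis=1)  # most similar representative
        return np.stack([tokens[assign == j].mean(axis=0) for j in range(k)])

    tokens = np.random.randn(1024, 64)              # e.g. a 32 x 32 patch grid
    merged = merge_tokens(tokens, 256)              # 1024 tokens -> 256 tokens

Greedy maximization of a monotone submodular objective carries the classic (1 - 1/e) approximation guarantee, which is part of what makes this family of formulations attractive for principled token reduction.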
Together, these approaches illustrate how embracing the geometric and statistical regularities of images enables diffusion transformers that are both algorithmically efficient and hardware-aware, bridging the gap between visual understanding and scalable generative modeling.
Bio:
Shengjie Wang is an Assistant Professor of Computer Science at New York University Shanghai, with an affiliated appointment at NYU Tandon School of Engineering. He received his Ph.D. in Computer Science from the University of Washington, after which he worked as a Research Scientist at ByteDance. His research lies at the intersection of machine learning systems, AI for science, and submodular optimization, with a focus on developing efficient and principled learning frameworks that bridge algorithmic theory and large-scale applications in scientific domains.