275 Hutchison Rd, Rochester, NY 14620

Architecture-Tailored Parallelization for Accessible Large Model Era


Talk Abstract:  In this talk, I will introduce my work on machine learning (ML) parallelization, a critical endeavor to bridge the significant gap between diverse ML programs and multitiered computing architectures. Specifically, I will explore ML parallelization at three distinct yet interconnected levels. First, I will show that by leveraging the unexplored space of model partitioning strategies, distributed ML training can be up to 20x faster than existing systems by improving communication efficiency. I will highlight some innovative distributed ML systems, such as HET for sparse embedding models and Galvatron for dense Transformer models, respectively.

Second, I will discuss how to improve GPU utilization through ML parallelization. I will present SpecInfer, a system that reduces large language model (LLM) serving latency by 1.5-3.5x compared to existing systems by leveraging a novel tree-based speculative inference and verification mechanism. Third, I will demonstrate how ML parallelization popularizes LLMs by extending its boundaries throughout inter-cloud environments. I will describe SpotServe, the first LLM serving system on spot instances, handling preemptions with dynamic reparallelization, ensuring relatively low tail latency, and reducing monetary cost by 54%. Finally, I will conclude with a discussion on pushing my research forward to a holistic and unified infrastructure for democratizing ML.


Brief Bio: Xupeng Miao is currently a postdoc researcher at Carnegie Mellon University working with Prof. Zhihao Jia and Prof. Tianqi Chen. Before that, he received his Ph.D. degree from Peking University advised by Prof. Bin Cui. He is broadly interested in machine learning systems, data management, and distributed computing. His research has resulted in 30+ publications (with 13 first-authored papers) in top-tier conferences, including OSDI, ASPLOS, SIGMOD, VLDB, NSDI, NeurIPS and so on. Recently, he has focused on building efficient, scalable, and affordable software systems (e.g., FlexFlow Serve) for large language models. His work was recognized through the 2022 ACM China Doctoral Dissertation Award, the Best Scalable Data Science Paper Award of VLDB 2022, and the Distinguished Artifact Award of ASPLOS 2024.

Event Details

0 people are interested in this event

User Activity

No recent activity