Yulong Ao
Head of AI Framework Development at BAAI
Yulong Ao is the Head of AI Framework R&D at the Beijing Academy of Artificial Intelligence (BAAI). He holds a Ph.D. from the Chinese Academy of Sciences and completed his postdoctoral research at Peking University. He currently leads the development of several key components within FlagOS, an open-source unified AI software stack, including the integrated large-model training and inference framework FlagScale, the unified communication library FlagCX, and the unified plugin ecosystem FlagOS-UniDev. He was among the first in the industry to propose practical, production-ready technical solutions for cross-chip heterogeneous training and inference of large models. His research focuses on distributed systems and performance optimization in artificial intelligence, high-performance computing (HPC), and scientific computing. He previously worked at Huawei and Baidu, where he contributed to core technologies for large-model systems. In 2016, he was part of the team that won China's first ACM Gordon Bell Prize. He has published more than ten papers in top international conferences and journals, holds multiple domestic and international patents, and has participated in developing national and international standards related to operator interfaces and communication libraries.
Topic
Building a Unified and Efficient Multi-Chip Plugin Ecosystem for Large Model Frameworks with the FlagOS Stack
With the rapid growth of large model applications, frameworks for training, inference, and reinforcement learning are evolving quickly. However, differences across AI chips in operator implementations, communication mechanisms, and software stacks continue to drive up the cost of framework adaptation and deployment optimization. Developers often need to repeatedly port and optimize their frameworks for different hardware platforms, making it difficult to build a unified software ecosystem and achieve efficient cross-chip execution. This talk introduces a unified multi-chip plugin ecosystem for large model frameworks built on the FlagOS technology stack. Through a plugin-based architecture, FlagOS exposes its cross-chip unified high-performance operator and communication capabilities to mainstream large model frameworks in a low-intrusion manner. The ecosystem includes plugins such as Megatron-LM-FL, TransformerEngine-FL, vLLM-Plugin-FL, and VeRL-FL, enabling users to run the same codebase across different AI chips for training, inference, and reinforcement learning tasks while keeping their existing framework workflows. This approach enables zero-intrusion integration and a "develop once, run on multiple chips" capability. At the same time, the plugin ecosystem provides a standardized adaptation path for AI chip vendors, helping promote the collaborative development of a diverse and interoperable AI software ecosystem.

Outline:

1. Background and Challenges
   - High cost of cross-chip adaptation for large model frameworks
   - Ecosystem fragmentation caused by differences in operators and communication libraries

2. The FlagOS Technology Stack
   - Core components of FlagOS: unified operator library FlagGems, unified compiler FlagTree, unified communication library FlagCX, unified training and inference framework FlagScale
   - Design considerations for a unified multi-chip plugin ecosystem

3. Unified Multi-Chip Plugin Ecosystem
   - Training plugins: Megatron-LM-FL, TransformerEngine-FL
   - Inference plugin: vLLM-Plugin-FL
   - Reinforcement learning plugin: VeRL-FL

4. Practice and Ecosystem
   - Multi-chip training and inference practices
   - Chip vendor adaptation and ecosystem collaboration
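The "develop once, run on multiple chips" idea behind the plugin ecosystem can be illustrated with a minimal backend-dispatch sketch. This is purely a hypothetical illustration of the pattern: the names here (`register_backend`, `dispatch`, `chip_a`, `chip_b`, the `scale` op) are invented for this sketch and are not FlagOS, FlagGems, or FlagCX APIs.

```python
# Hypothetical sketch of chip-agnostic operator dispatch.
# A vendor plugin registers its operator implementations once;
# framework code then calls ops by name, unchanged across chips.
from typing import Callable, Dict, List

# Registry mapping a backend (chip) name to its operator table.
_BACKENDS: Dict[str, Dict[str, Callable]] = {}

def register_backend(name: str, ops: Dict[str, Callable]) -> None:
    """A chip vendor plugin registers its operator implementations."""
    _BACKENDS[name] = ops

def dispatch(backend: str, op: str, *args):
    """Framework code invokes an op; the active backend supplies the kernel."""
    return _BACKENDS[backend][op](*args)

# Two invented vendor plugins providing the same "scale" operator.
def _scale(x: List[float], s: float) -> List[float]:
    return [v * s for v in x]

register_backend("chip_a", {"scale": _scale})
register_backend("chip_b", {"scale": _scale})

# The same model code runs unchanged on either backend.
for chip in ("chip_a", "chip_b"):
    print(chip, dispatch(chip, "scale", [1.0, 2.0], 3.0))
```

In a real stack, the operator table would hold compiled kernels and the communication primitives for each chip, and the plugin would hook into the host framework's extension points rather than a global registry; the sketch only shows why a single registration step lets one codebase target multiple chips.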