Qinlong Wang | 2024 Machine Learning Summit

Qinlong Wang

Technical expert in AI Infrastructure at Ant Group

Qinlong Wang is a technical expert in AI Infrastructure at Ant Group, where he has long worked on AI infrastructure research and development and has led projects on elastic fault tolerance and automatic scaling for distributed training. Under his leadership, the utilization of Ant Group's hybrid cluster resources rose from under 20% to over 40%, and the share of effective training time for large-scale models rose to over 97%. He has contributed to open-source projects including ElasticDL and DLRover, and was recognized as a Vitality Open Source Contributor by the OpenAtom Foundation in 2023. He is currently the architect of DLRover, Ant Group's open-source AI Infrastructure project, where he focuses on building a stable, scalable, and efficient large-scale distributed training system.

Topic

DLRover Training Failure Self-Healing: Dramatically Improving Compute Efficiency for Large-Scale AI Training

Training today's large language models requires a large number of accelerators, such as GPUs and NPUs. Because GPU machines fail often, training jobs are frequently interrupted, which wastes completed computation, leaves the cluster idle, and costs a great deal of time and computing power. To address this, DLRover has open-sourced a training failure self-healing technology that minimizes the computing power lost to failures through fast node state detection, elastic scaling (adding and removing nodes), dynamic networking, and Flash Checkpoint. In Ant Group's training jobs at the scale of a thousand accelerators, with failures occurring about once a day, effective training time currently reaches 97% of total time. Beyond GPUs, DLRover failure self-healing also supports distributed training on Chinese domestic accelerators, such as Huawei's Ascend chips and Alibaba's T-Head chips. Project address: https://github.com/intelligent-machine-learning/dlrover
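To put the figures above in perspective, the short Python sketch below works out how much time per day can be lost while still reaching 97% effective training time. It is only illustrative arithmetic, not DLRover code: the function names are hypothetical, and it assumes that "effective training time" means the fraction of wall-clock time not lost to failures (detection, restart, and recomputation since the last checkpoint).

```python
def lost_time_budget_minutes(effective_ratio: float, window_hours: float = 24.0) -> float:
    """Minutes per window that may be lost while still meeting effective_ratio."""
    if not 0.0 <= effective_ratio <= 1.0:
        raise ValueError("effective_ratio must be between 0 and 1")
    return (1.0 - effective_ratio) * window_hours * 60.0


def effective_ratio(failures_per_window: float,
                    lost_minutes_per_failure: float,
                    window_hours: float = 24.0) -> float:
    """Fraction of the window spent on useful training.

    Assumes (for illustration) that every failure costs the same amount
    of lost time and that no other overhead is counted.
    """
    total_minutes = window_hours * 60.0
    lost_minutes = failures_per_window * lost_minutes_per_failure
    return max(0.0, 1.0 - lost_minutes / total_minutes)


if __name__ == "__main__":
    # Figures from the abstract: about one failure per day, 97% effective time.
    budget = lost_time_budget_minutes(0.97)
    print(f"Lost-time budget per day: {budget:.1f} minutes")
    # With one failure per day, the whole budget goes to that single failure.
    print(f"Effective ratio check: {effective_ratio(1, budget):.2%}")
```

Under these assumptions, one failure per day at 97% effective time leaves roughly 43 minutes per day for detecting the failure, restarting, and redoing lost work, which is why fast detection and fast checkpoint recovery matter at this scale.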

Boolan is a leading IT education and consulting company in China. Our core strength is our worldwide team of experts and the cutting-edge technology experience they have accumulated over decades. Adhering to the tenet of "Global Experts, Global Wisdom", we bring together the world's top IT experts to provide our customers with in-house training, technical conferences, software consulting, expert lectures, seminars, talent evaluation and certification, and other services. www.boolan.com
