Qinlong Wang

Technical expert in AI Infrastructure at Ant Group

Qinlong Wang is a technical expert in AI Infrastructure at Ant Group. He has long worked on AI infrastructure research and development there, leading projects on elastic fault tolerance and automatic scaling for distributed training. Under his leadership, the utilization of Ant Group's hybrid cluster resources rose from under 20% to over 40%, and the effective training time for large-scale models rose to over 97%. He has contributed to open-source projects including ElasticDL and DLRover, and was recognized as a Vitality Open Source Contributor by the OpenAtom Foundation in 2023. He is currently the architect of DLRover, Ant Group's open-source AI infrastructure project, where he focuses on building a stable, scalable, and efficient large-scale distributed training system.

Topic

DLRover Training Failure Self-Healing: Dramatically Improving Compute Efficiency for Large-Scale AI Training

Training today's large language models requires large numbers of accelerators, such as GPUs and NPUs. Because GPU machines have a high failure rate, frequent failures interrupt training, waste computation, and leave clusters idle, costing substantial time and compute. To address this, DLRover has open-sourced a training failure self-healing technology that minimizes the compute wasted on failures through fast node status detection, elastic scaling, dynamic networking, and Flash Checkpoint. In Ant Group's training jobs at the scale of a thousand accelerators, with failures occurring about once a day, effective training time currently reaches 97%. Beyond GPUs, DLRover failure self-healing also supports distributed training on Chinese-made accelerators, such as Huawei's Ascend chips and Alibaba's T-Head chips. Project address: https://github.com/intelligent-machine-learning/dlrover
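
To put the 97% figure in perspective, the arithmetic can be sketched as follows. This is only an illustration: the function name is hypothetical, and the assumption that all non-training time is caused by failures is ours, not DLRover's definition of effective training time.

```python
def lost_minutes_per_failure(effective_ratio: float,
                             failures_per_day: float = 1.0) -> float:
    """Average minutes of training time lost per failure.

    Assumption (for illustration only): every minute not spent
    training is caused by failures, i.e. detection, restart, and
    recomputation since the last checkpoint.
    """
    minutes_per_day = 24 * 60
    lost_per_day = (1.0 - effective_ratio) * minutes_per_day
    return lost_per_day / failures_per_day


# 97% effective training time with one failure per day,
# the figures quoted in the abstract.
budget = lost_minutes_per_failure(0.97, failures_per_day=1.0)
print(f"{budget:.1f} minutes lost per failure")
```

Under these assumptions, detection, restart, and recomputation together take about 43 minutes per failure at 97% effective training time, which is why faster failure detection and faster checkpoint recovery translate directly into a higher ratio.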

© boolan.com Boolan. All rights reserved.