Long Xing
MiniMax Technical Director
He has long worked at the intersection of AI and infrastructure. After joining MiniMax, he has been responsible for infrastructure, covering high-performance AI infrastructure, the large-scale pre-training framework, the DevOps platform, and SRE. Previously, at Cognizant he was responsible for the stability of large-scale GPU clusters and high-performance network upgrades, and at Baidu he led development of a large-scale Kubernetes cluster colocation system.
Topic
Challenges and Practices in AI Infra for Large Models
The growing size and complexity of today's pre-trained models place significant pressure on infrastructure. To meet these challenges and build larger high-performance training clusters, enterprises often need full-stack optimization, spanning innovations in algorithms, hardware architecture, and system design, to improve efficiency and performance. At the same time, high-performance compute faces growing supply pressure; hybrid cloud can mitigate the compute shortage, delivering gains in both cost and flexibility.