Zhiyuan Ai

CEO of Convergence Technology

Ph.D. from Tsinghua University, specialising in distributed system optimisation, parallel computing, and distributed storage. Formerly R&D lead for several departments at SHENXINSU (big data, digital, and AI applications), responsible for team management and key product R&D; he led the development and delivery of several multi-million projects and has extensive experience in productisation and real-world deployment.

Topic

How to Run a Hundred-Billion-Parameter Model on a Single GPU

Model quality has improved significantly, but in private deployments, a more capable model with lower latency and higher throughput usually means higher cost, especially for o1-class applications, which must reason continuously in order to iterate towards solving complex reasoning tasks. Inference has therefore become the core element of the large-model deployment stage, and reducing inference cost is the most critical issue in overcoming the challenges of putting large models into production. The industry has usually approached this by optimising GPU compute, but that approach has a bottleneck: the room for GPU-side optimisation is limited. Beyond the GPU, multiple other sources of compute can be invoked at the hardware level. Heterogeneous collaboration across the whole system of storage, CPU, and GPU can increase the available compute exponentially, and combined with high-performance operators that raise GPU utilisation, inference performance can be improved at least 10-fold. This talk focuses on how to make storage, CPU, and GPU collaborate heterogeneously across the whole system, and on how to exploit both storage capacity and the compute that storage can provide (a minimal illustrative sketch follows the outline below).

Results: running a hundred-billion-parameter model locally on a single consumer-grade GPU, and executing inference tasks with contexts up to 1M tokens on a single GPU at generation speeds of up to 16 tokens/s, both industry firsts. Using whole-system heterogeneous collaboration to improve inference performance also narrows the performance gap between domestic GPU products and NVIDIA products, making domestic alternative solutions more viable and helping to break the technology-chokepoint dilemma.

Outline:
1. The current status and background of Infra
2. Difficulties encountered in deploying large models: the impossible balance between quality, efficiency, and cost
3. Design of a whole-system heterogeneous collaborative inference framework spanning storage, CPU, and GPU
4. Does storage have compute power too? How to invoke it technically and maximise its utilisation efficiency: an analysis of the "trading storage for compute" technique
5. Results and cases achievable with the whole-system heterogeneous collaborative inference framework
6. Helping domestic alternative solutions break the chokepoint dilemma in the future
7. Outlook for the development of the Infra layer
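The framework itself is not described in this abstract, so the following is only a minimal sketch of the general idea under stated assumptions: keep a few "hot" layers resident on the GPU, run the remaining layers on the CPU, and spill per-layer KV-cache tensors to local storage (e.g. NVMe) so that a single consumer-grade GPU can serve a model far larger than its VRAM. All names here (HOT_LAYERS, KV_DIR, spill_kv, and so on) are hypothetical illustrations, not the speaker's actual API.

```python
import os
import torch
import torch.nn as nn

# Sketch of whole-system heterogeneous collaboration for inference:
# hot layers stay on the GPU, cold layers run on the CPU, and the KV
# cache is spilled to storage instead of occupying VRAM. All sizes,
# names, and the split point are illustrative assumptions.

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
N_LAYERS, HOT_LAYERS = 8, 2      # only 2 of 8 layers stay resident on the GPU
HIDDEN = 1024
KV_DIR = "kv_cache"              # hypothetical spill directory on NVMe
os.makedirs(KV_DIR, exist_ok=True)

# Toy stand-ins for transformer blocks, placed on GPU or CPU by index.
layers = [
    nn.Linear(HIDDEN, HIDDEN).to(DEVICE if i < HOT_LAYERS else "cpu")
    for i in range(N_LAYERS)
]

def spill_kv(layer_idx: int, kv: torch.Tensor) -> None:
    """Write one layer's KV tensor to storage instead of holding it in VRAM."""
    torch.save(kv.cpu(), os.path.join(KV_DIR, f"layer_{layer_idx}.pt"))

def load_kv(layer_idx: int, device: str) -> torch.Tensor:
    """Stream one layer's KV tensor back from storage when it is needed."""
    return torch.load(os.path.join(KV_DIR, f"layer_{layer_idx}.pt")).to(device)

@torch.no_grad()
def forward(x: torch.Tensor) -> torch.Tensor:
    for i, layer in enumerate(layers):
        device = next(layer.parameters()).device
        x = layer(x.to(device))  # activations follow the layer's device
        spill_kv(i, x)           # stand-in for caching K/V to storage
    return x

out = forward(torch.randn(1, HIDDEN))
print(out.shape, "| KV tensors spilled to disk:", len(os.listdir(KV_DIR)))
```

In a real system, the storage tier would also contribute compute ("trading storage for compute"), for example by performing attention over the spilled KV cache near the data rather than copying everything back to the GPU; the sketch above only shows the placement and spill/reload mechanics.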