Bin He
Head of Model Performance Optimization at Omni-Infer, Huawei R&D Engineer
He is a committer of the MTP SIG. He graduated from the University of Science and Technology of China and the University of Chinese Academy of Sciences. Since joining Huawei, he has worked for over a decade in computer networking and AI infrastructure, gaining extensive engineering experience in large-model inference optimization. He has been deeply involved in performance optimization for multiple open-source models, as well as the Pangu large model on the Ascend platform, supporting high-performance inference services and RL rollout.
Topic
Extreme Performance Optimization with Omni-Infer
Abstract: Omni-Infer is a powerful inference acceleration toolkit tailored for Ascend hardware platforms. This talk presents practical explorations of extreme performance optimization for both large language models (LLMs) and multimodal models, targeting high throughput and low latency. It covers key techniques such as operator fusion, multi-stream parallelism, advanced scheduling strategies, and speculative execution, along with real-world optimization case studies.
Outline:
Background
Case Studies: High Throughput & Low Latency Optimization
Future Directions