Bingning Wang

Head of Intelligent Pre-training, Baichuan

He received his PhD from the Institute of Automation, Chinese Academy of Sciences, and focuses on question answering systems and large language models. He has been a senior researcher at Sogou and Tencent and has rich experience in large-scale generative modelling, leading the release of large-scale Chinese Q&A datasets such as ReCO, ComQA, ChiQA, and T2Ranking, as well as the Baichuan series of pre-trained models. He has published 11 first-author papers at top international AI and natural language processing conferences such as ACL, SIGIR, and AAAI, and was a Best Paper Runner-Up at CIKM 2021. His PhD thesis, ‘Research on Key Technology of Machine Reading Comprehension’, won the Excellent Doctoral Dissertation Award of the Chinese Information Processing Society of China in 2019. He is an executive member of the Youth Working Committee of the Chinese Information Processing Society of China.

Topic

Transformer efficiency optimisation

In the past two years, large language model technology, represented by ChatGPT, has made remarkable progress. Relying only on the simple unsupervised objective of next-word prediction, large language models now reach or even exceed human-level performance on many tasks. The most important principle for improving current large language models is the scaling law, i.e., continuously expanding the number of model parameters and the amount of training data. However, we can still introduce optimisation techniques and tools to improve model quality at the same model size and data volume. Nowadays, many small models, such as those with 2B or 3B parameters, can already exceed the performance of earlier models with tens or even hundreds of billions of parameters. In this talk, I will present some recent pre-training techniques for improving Transformer efficiency, i.e., how to train a better model under the same resource budget. I will summarise current work on language model efficiency from three aspects: optimisation of the model structure, such as improvements to Attention; optimisation of the training scheme; and optimisation of the data.

Outline:
- Background of structural optimisation in the era of large models
- Attention improvements
- Keys to inference speed optimisation
- Introduction and application of the MoE structure (a minimal illustrative sketch follows below)
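To make the last outline item concrete, below is a minimal sketch of a top-1 routed mixture-of-experts feed-forward layer in PyTorch. It is an illustration of the general MoE idea, not the speaker's or Baichuan's implementation; the class and parameter names (MoEFeedForward, d_model, n_experts, etc.) are invented for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Illustrative top-1 routed mixture-of-experts feed-forward block."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_hidden),
                    nn.GELU(),
                    nn.Linear(d_hidden, d_model),
                )
                for _ in range(n_experts)
            ]
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)  # routing probabilities per token
        top_p, top_i = gate.max(dim=-1)           # choose one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e                     # tokens routed to expert e
            if mask.any():
                # scale each expert's output by its gate probability
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    tokens = torch.randn(16, 512)
    print(MoEFeedForward()(tokens).shape)  # torch.Size([16, 512])
```

The appeal of this structure is that only the selected expert runs for each token, so total parameter count grows with the number of experts while per-token compute stays roughly that of a single feed-forward block.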