Ganqu Cui
Young Scientist, Shanghai Artificial Intelligence Laboratory
Ganqu Cui is a young scientist at the Shanghai Artificial Intelligence Laboratory. He received his Ph.D. from the Department of Computer Science at Tsinghua University, supervised by Associate Professor Liu Zhiyuan. His research interests include large language model alignment and reinforcement learning. He has published more than ten papers at top international AI conferences and in journals, including ICML, NeurIPS, ICLR, ACL, and KDD, with more than 8,000 citations on Google Scholar.
Topic
PRIME: Process Reinforcement through Implicit Rewards
The release of OpenAI's o1 and DeepSeek-R1 models showed that reinforcement learning is a necessary path to higher-order reasoning ability, yet it remains little explored by the open-source community. We propose PRIME, an online reinforcement learning method built on scalable process rewards. Through implicit process rewards, PRIME addresses the three essential problems of using PRMs in large-model reinforcement learning: how to use them, how to train them, and how to scale them, while offering excellent ease of use and scalability. We trained Eurus-2 from Qwen2.5-Math-7B-Base using only 1/10 of Qwen's open-source data, and its mathematical ability exceeds that of larger models such as Llama-3.1-70B and GPT-4o. Notably, PRIME brings an absolute improvement of 16.7% to the model, far exceeding any open-source approach we know of.
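The core idea of the implicit process reward is that a reward model trained only with outcome-level supervision implicitly defines token-level rewards as the log-likelihood ratio between the reward model and a frozen reference model. The sketch below illustrates this computation with toy per-token log-probabilities; the function names and the value of beta are illustrative, not the talk's actual implementation.

```python
def implicit_process_rewards(logp_rm, logp_ref, beta=0.05):
    """Token-level implicit process rewards (illustrative sketch):
    r_t = beta * (log pi_phi(y_t | y_<t) - log pi_ref(y_t | y_<t)),
    where pi_phi is the implicit PRM (trained with an outcome-level
    objective) and pi_ref is a frozen reference model."""
    return [beta * (a - b) for a, b in zip(logp_rm, logp_ref)]


def outcome_reward(process_rewards):
    # Summing the token-level rewards recovers a sequence-level
    # (outcome) reward, which is what the PRM was trained against.
    return sum(process_rewards)


# Toy example: two tokens, per-token log-probs under each model.
rewards = implicit_process_rewards(
    logp_rm=[-1.0, -0.5], logp_ref=[-1.2, -0.4], beta=1.0
)
# Tokens the PRM likes more than the reference get positive reward.
```

This dense, per-token signal is what lets PRIME reward intermediate reasoning steps without any step-level human labels.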