Xin Pan
CTO of SHARGE
Specializes in R&D and applications of diffusion models and multimodal LLMs (MLLMs). 10+ years in AI engineering and algorithms: contributed to TensorFlow/TPU at Google Brain (CV/NLP/speech research), led the foundational overhaul of Baidu's PaddlePaddle, built Tencent's Wuliang recommendation system (serving 100M+ DAU), and spearheaded ByteDance's AIGC/vision foundation model platform (powering Douyin/TikTok/CapCut).
Topic
Multimodal techniques and applications
1. Historical Review: CV, NLP, Speech — from weak to strong, from multi-stage to end-to-end, from fragmentation to convergence
2. Introduction to Diffusion and Multimodal LLMs
   2.1 Evolution of Diffusion
   2.2 Evolution of MLLM
   2.3 Relationship between MLLM and Diffusion
3. Technical Challenges of Multimodal in Products
   3.1 Limitations and analyses of current MLLMs: reasoning, charts & multilingual, hallucination
   3.2 Directions for improvement
       3.2.1 Training multimodal models from scratch
       3.2.2 Better and modular encoders
       3.2.3 Vision replacing text
4. Applying Multimodal to Documents and Social Products
   4.1 Multimodal RAG; multimodal-conditioned generation
   4.2 MLLM and Diffusion co-design
5. Outlook
   5.1 Multimodal agents
   5.2 Co-evolution of humans and AI