Kaipeng Zhang
Researcher
Shanghai AI Lab

About Me


I am a researcher at Shanghai AI Lab, working on multimodal LLMs and image/video generation.

Before joining Shanghai AI Lab, I received my Ph.D. in 2022 from The University of Tokyo, supervised by Prof. Yoichi Sato; my M.S. in 2018 from National Taiwan University, supervised by Prof. Winston Hsu; and my B.E. in 2016 from Donghua University. I was a researcher at SenseTime and a consultant at ULSee. I also did research internships at Microsoft Research Asia, Tencent AI Lab, and MMLAB of SIAT. I currently work closely with Dr. Yu Qiao, Dr. Ping Luo, and Dr. Wenqi Shao.

I'm looking for interns interested in multimodal LLMs or image/video generation. Please send me your resume.

Recent Publications (in 2024)


[Preprint]

Diffree: Text-guided shape free object inpainting with diffusion model
Lirui Zhao*, Tianshuo Yang*, Wenqi Shao*, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang†, Rongrong Ji†
Image Generation Paper
ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality
Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou†, Kaipeng Zhang†, Bohan Zhuang
Image Generation Paper
Towards world simulator: Crafting physical commonsense-based benchmark for video generation
Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao†, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, Ping Luo†
Video Generation Paper
ZipVL: Efficient large vision-language models with dynamic token sparsification and KV cache compression
Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou†, Kaipeng Zhang†, Bohan Zhuang
Efficient MLLM Paper
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang†
Multimodal LLM Paper
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang†
Multimodal LLM Paper
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Yue Yang*, Shuibai Zhang*, Wenqi Shao†, Kaipeng Zhang†, Yi Bin, Yu Wang, Ping Luo†
Multimodal LLM Paper
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang†, Wenqi Shao†
Multimodal LLM Paper
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Quanfeng Lu, Wenqi Shao†, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo†
Multimodal LLM Paper
MLLMs-Augmented Visual-Language Representation Learning
Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang†, Yang You†
Multimodal LLM Paper
ImageBind-LLM: Multi-modality Instruction Tuning
Jiaming Han*, Renrui Zhang*, Wenqi Shao*, Peng Gao*, Peng Xu*, Han Xiao*, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue†, Hongsheng Li†, Yu Qiao†
Multimodal LLM Paper
Meta-Transformer: A Unified Framework for Multimodal Learning
Yiyuan Zhang*, Kaixiong Gong*, Kaipeng Zhang†, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue†
Multimodality Paper
HRVMamba: High-Resolution Visual State Space Model for Dense Prediction
Hao Zhang, Yongqiang Ma, Wenqi Shao, Ping Luo, Nanning Zheng†, Kaipeng Zhang†
Vision Paper
Adapting LLaMA Decoder to Vision Transformer
Jiahao Wang, Wenqi Shao†, Mengzhao Chen, Chengyue Wu, Yong Liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo†
Vision Paper
RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation
Junting Chen*, Yao Mu*, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding†, Ping Luo†
Embodied AI Paper


[Journal]

Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching
Hao Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng†, Ping Luo, Yu Qiao, Kaipeng Zhang†
Multimodality [IJCV 2024] Paper
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Peng Xu*, Wenqi Shao†*, Kaipeng Zhang*, Peng Gao*, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, Ping Luo†
Multimodal LLM [T-PAMI 2024] Paper
B-AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Black-box Adversarial Visual-Instructions
Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Nanning Zheng†, Kaipeng Zhang†
Multimodal LLM [T-IFS 2024] Paper
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
Wenqi Shao*, Yutao Hu*, Peng Gao*, Meng Lei*, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao†, Ping Luo†
Multimodal LLM [T-BigData 2024] Paper
HF-HRNet: a simple hardware friendly high-resolution network
Hao Zhang, Yujie Dun, Yixuan Pei, Shenqi Lai, Chengxu Liu, Kaipeng Zhang, Xueming Qian†
Vision [T-CSVT 2024] Paper
HRVMamba: High-Resolution Visual State Space Model for Dense Prediction
Hao Zhang, Yongqiang Ma, Kaipeng Zhang, Nanning Zheng†, Shenqi Lai†
Vision [Pattern Recognition 2024] Paper

[Conference]

SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge
Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang†
Multimodal LLM [NeurIPS 2024] Paper
ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models
Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao†, Kaipeng Zhang†
Multimodal LLM [NeurIPS 2024] Paper
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
Tianle Zhang, Langtian Ma, Yuchen Yan, Yuchen Zhang, Kai Wang, Yue Yang, Ziyao Guo, Wenqi Shao, Yang You, Yu Qiao, Ping Luo, Kaipeng Zhang†
Text-to-Video [NeurIPS 2024] Paper
Needle In A Multimodal Haystack
Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao†, Wenhai Wang†
Multimodal LLM [NeurIPS 2024] Paper
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
Le Zhuo*, Ruoyi Du*, Han Xiao*, Yangguang Li*, Dongyang Liu*, Rongjie Huang*, Wenze Liu*, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao†, Hongsheng Li†, Peng Gao†
Multimodal LLM [NeurIPS 2024] Paper
Towards Implicit Prompt For Text-To-Image Models
Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang† and Ping Luo†
Image Generation [ICML 2024] Paper
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Kaining Ying*, Fanqing Meng*, Jin Wang*, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Cunjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang† and Wenqi Shao†
Multimodal LLM [ICML 2024] Paper
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Peng Gao*†, Renrui Zhang*, Chris Liu*, Longtian Qiu*, Siyuan Huang*, Weifeng Lin*, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li† and Yu Qiao
Multimodal LLM [ICML 2024] Paper
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
Lirui Zhao*, Yue Yang*, Kaipeng Zhang‡*, Wenqi Shao‡*, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji†
Image Generation [CVPR 2024] Paper
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue†
Multimodal LLM [CVPR 2024] Paper
T3M: Text Guided 3D Human Motion Synthesis from Speech
Wenshuo Peng, Kaipeng Zhang†, Sai Qian Zhang†
Multimodality [NAACL Findings 2024] Paper
ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
Fanqing Meng, Wenqi Shao†, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo†
Multimodal LLM [ACL Findings 2024] Paper
Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching
Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang†, Yang You†
Dataset Distillation [ICLR 2024] Paper
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
Wenqi Shao*, Mengzhao Chen*, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo†
Efficient LLM [ICLR 2024] Paper
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
Peng Xu, Wenqi Shao†, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo†
Efficient LLM [ICLR 2024] Paper
Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification
Wenshuo Peng, Kaipeng Zhang†, Yue Yang, Hao Zhang, Yu Qiao
Multimodality [AAAI 2024] Paper
TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training
Yuqi Lin, Minghao Chen†, Kaipeng Zhang†, Hengjia Li, Mingming Li, Zheng Yang, Dongqin Lv, Binbin Lin, Haifeng Liu, Deng Cai
Multimodality [AAAI 2024] Paper
Align, Adapt and Inject: Audio-Guided Image Generation, Editing and Stylization
Yue Yang, Kaipeng Zhang†, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo†
Image Generation [ICASSP 2024] Paper

Education


Ph.D. in CS, The University of Tokyo, Tokyo, Japan
Apr. 2019 - Mar. 2022
M.S. in CS, National Taiwan University, Taipei, Taiwan
Sep. 2016 - Aug. 2018
B.Eng. in CS, Donghua University, Shanghai, China
Sep. 2012 - July 2016

Selected Awards and Competitions


  • WAIC Young Outstanding Paper Award, 2022
  • World's TOP 2% Scientists (published by Stanford University), 2020 & 2021 & 2022 & 2023
  • JSPS Research Fellowships for Young Scientists, 2020
  • Tencent Rhino-Bird Elite Training Program, 2020
  • MSRA Fellowship Nomination Award, 2019
  • Emotion Recognition in the Wild: Engagement Prediction (ICMI 2019 Grand Challenge), 3rd place
  • Emotion Recognition in the Wild: Group-based Cohesion Prediction (ICMI 2019 Grand Challenge), 2nd place
  • Disguised Faces in the Wild Challenge (in conjunction with CVPR 2018), 1st place
  • Emotion Recognition in the Wild: Group-level emotion recognition (ICMI 2018 Grand Challenge), 2nd place
  • Emotion Recognition in the Wild: Group-level emotion recognition (ICMI 2017 Grand Challenge), 1st place
  • ChaLearn Looking at People Challenge: Accessories Classification (in conjunction with CVPR 2016), 1st place
  • ChaLearn Looking at People Challenge: Smile and Gender Classification (in conjunction with CVPR 2016), 1st place
  • Outstanding Undergraduate Thesis, 2016

Academic Service


  • Senior program committee of IJCAI
  • Reviewer/Program committee of NeurIPS, ICML, ICLR, AAAI, ICCV, ECCV, CVPR, BMVC, WACV and ACCV
  • Reviewer of TPAMI, TIP, TCSVT, TNNLS, TMM, TIFS, Neurocomputing, Pattern Recognition, and SPL

Work Experience


Researcher, Shanghai AI Lab, OpenGVLab, Shanghai, China
May 2022 - Present
Researcher, SenseTime, Research Institute, Shenzhen, China
Sep. 2018 - Mar. 2019
Intern, MSRA, Visual Computing Group, Beijing, China
Jan. 2018 - Jul. 2018
Consultant, ULSee, Face Team, Hangzhou, China
Oct. 2016 - Mar. 2018
Intern, Tencent, AI Lab & AI Advertisement Department, Shenzhen, China
Jul. 2017 - Aug. 2017; Sep. 2020 - Feb. 2021
Visiting Student, Shenzhen Institutes of Advanced Technology, Multimedia Research Center, Shenzhen, China
Jul. 2015 - Aug. 2016