I am a Senior Principal Research Manager at Microsoft AI (from the DeepSpeed team to the Microsoft Superintelligence team), working on large-scale machine learning system optimization through innovative kernel, algorithm, scheduling, and compiler technologies. Previously, I was a Staff Research Engineer at Alibaba Cloud, where I led the development of optimizing compilers for GPU-based machine learning and spearheaded research on inference optimization for the Platform of Artificial Intelligence (PAI), Alibaba Cloud's premier SaaS/PaaS solution for end-to-end machine learning. Before Alibaba, I earned my Ph.D. in Computer Science from Tsinghua University in 2019, co-advised by Prof. Wenguang Chen and Prof. Jidong Zhai. In 2018, I was a visiting scholar at North Carolina State University under the supervision of Prof. Xipeng Shen.

My research interests span machine learning algorithm-system co-design, high-performance computing, and heterogeneous computing. Please feel free to contact me regarding potential research collaborations.

🔥 We are hiring! We offer world-class industrial LLM scenarios and research topics. Join us to make outstanding contributions to the development of AI technology and, in turn, to human progress.

Publications

[MLSys’26 (to appear)] “FlashAgents: Accelerating Multi-Agent LLM Systems via Streaming Prefill Overlap”. Taosong Fang, Zhen Zheng, Zhengzhao Ma, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun.

[MLSys’26 (to appear)] “Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost”. Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng†, Shuaiwen Leon Song†. [PDF | Code] († Corresponding author)

[MLSys’26 (to appear)] “MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design”. Zhen Zheng, Xiaonan Song, Chuanjie Liu. [PDF]

[MLSys’26 (to appear)] “BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching”. Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng. [PDF]

[FCS’26] “A Comprehensive Taxonomy of Prompt Engineering Techniques for Large Language Models”. Yaoyang Liu, Zhen Zheng, Feng Zhang, Jincheng Feng, Yiyang Fu, Jidong Zhai, Bingsheng He, Xiao Zhang, Xiaoyong Du. [PDF]

[ATC’25] “PluS: Highly Efficient and Expandable ML Compiler with Pluggable Graph Schedules”. Ruofan Wu, Zhen Zheng, Feng Zhang, Chuanjie Liu, Zaifeng Pan, Jidong Zhai, Xiaoyong Du. [PDF]

[EuroSys’25] “Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing”. Shulai Zhang, Quan Chen, Weihao Cui, Han Zhao, Chunyu Xue, Zhen Zheng, Wei Lin, Minyi Guo. [PDF]

[Preprint’24] “ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks”. Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao. [PDF]

[SC’24] “RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules”. Zaifeng Pan, Zhen Zheng, Feng Zhang, Bing Xie, Ruofan Wu, Shaden Smith, Chuanjie Liu, Olatunji Ruwase, Xiaoyong Du, Yufei Ding. [PDF | Code]

[ATC’24] “Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs”. Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song. (integrated into torch.ao; previous preprint: FP6-LLM) [PDF | Code | E2E]

[EuroSys’24] “WiseGraph: Optimizing GNN with Joint Workload Partition of Graph and Operations”. Kezhao Huang, Jidong Zhai, Liyan Zheng, Haojie Wang, Yuyang Jin, Qihao Zhang, Runqing Zhang, Zhen Zheng, Youngmin Yi, Xipeng Shen. [PDF | Code]

[OSDI’24] “MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures”. Donglin Zhuang*, Zhen Zheng*, Haojun Xia, Xiafei Qiu, Junjie Bai, Wei Lin, Shuaiwen Leon Song. (revised and resubmitted from OSDI’23; accepted at OSDI’24) [PDF | Code] (* Equal contribution)

[VLDB’24 (PVLDB’23)] “Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity”. Haojun Xia*, Zhen Zheng*, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song. [PDF | Code] (* Equal contribution)

[SIGMOD’24 (PACMMOD’23)] “BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach”. Zhen Zheng, Zaifeng Pan, Dalin Wang, Kai Zhu, Wenyi Zhao, Tianyou Guo, Xiafei Qiu, Minmin Sun, Junjie Bai, Feng Zhang, Xiaoyong Du, Jidong Zhai, Wei Lin. [PDF | Code]

[ASPLOS’23] “RECom: A Compiler Approach to Accelerating Recommendation Model Inference with Massive Embedding Columns”. Zaifeng Pan, Zhen Zheng, Feng Zhang, Ruofan Wu, Hao Liang, Dalin Wang, Xiafei Qiu, Junjie Bai, Wei Lin, Xiaoyong Du. [PDF | Code]

[TKDE’23] “Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device”. Feng Zhang, Ruofan Wu, Jiawei Guan, Zhen Zheng, Xiaoguang Guo, Xiao Zhang, Xiaoyong Du, Xipeng Shen. [PDF]

[ASPLOS’22] “AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-intensive ML Training and Inference on Modern SIMT Architectures”. Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, Shuaiwen Leon Song, Wei Lin. [PDF]

[ATC’22] “Whale: Efficient Giant Model Training over Heterogeneous GPUs”. Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, Wei Lin. [PDF | Code]

[WWW’22] “DREW: Efficient Winograd CNN Inference with Deep Reuse”. Ruofan Wu, Feng Zhang, Jiawei Guan, Zhen Zheng, Xiaoyong Du, Xipeng Shen. [PDF]

[TPDS’22] “Optimizing DNN Compilation for Distributed Training With Joint OP and Tensor Fusion”. Xiaodong Yi, Shiwei Zhang, Lansong Diao, Chuan Wu, Zhen Zheng, Shiqing Fan, Siyu Wang, Jun Yang, Wei Lin. [PDF]

[PPoPP’21] “Understanding and Bridging the Gaps in Current GNN Performance Optimizations”. Kezhao Huang, Jidong Zhai, Zhen Zheng, Youngmin Yi, Xipeng Shen. [PDF]

[PPoPP’21] “DAPPLE: A Pipelined Data Parallel Approach for Training Large Models”. Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin. [PDF | Code]

[CoNEXT’20] “Optimizing Distributed Training Deployment in Heterogeneous GPU Clusters”. Xiaodong Yi, Shiwei Zhang, Ziyue Luo, Guoping Long, Lansong Diao, Chuan Wu, Zhen Zheng, Jun Yang, Wei Lin. [PDF]

[PACT’20] “GOPipe: A Granularity-oblivious Programming Framework for Pipelined Stencil Executions on GPU”. Chanyoung Oh, Zhen Zheng, Xipeng Shen, Jidong Zhai, Youngmin Yi. [PDF]

[ASPLOS’19] “HiWayLib: A Software Framework for Enabling High Performance Communications for Heterogeneous Pipeline Computations”. Zhen Zheng, Chanyoung Oh, Jidong Zhai, Xipeng Shen, Youngmin Yi, Wenguang Chen. [PDF]

[MICRO’17] “VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU”. Zhen Zheng, Chanyoung Oh, Jidong Zhai, Xipeng Shen, Youngmin Yi, Wenguang Chen. [PDF | Code]

[SC’16] “Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer”. Haohuan Fu, Junfeng Liao, Wei Xue, Lanning Wang, Dexun Chen, Long Gu, Jinxiu Xu, Nan Ding, Xinliang Wang, Conghui He, Shizhen Xu, Yishuang Liang, Jiarui Fang, Yuanchao Xu, Weijie Zheng, Jingheng Xu, Zhen Zheng, Wanjing Wei, Xu Ji, He Zhang, Bingwei Chen, Kaiwei Li, Xiaomeng Huang, Wenguang Chen, Guangwen Yang. [PDF]