I am a Principal Research Engineer at Microsoft AI, working on large-scale machine learning system optimization through innovative kernel, compiler, compression, and scheduling technologies. Prior to that, I was a Staff Research Engineer at Alibaba Cloud, overseeing the optimizing compilers for machine learning on GPUs and leading research on machine learning inference optimization at the Platform of Artificial Intelligence (PAI, Alibaba Cloud’s end-to-end SaaS/PaaS for machine learning). Before Alibaba, I obtained my Ph.D. in Computer Science from Tsinghua University in 2019, co-advised by Prof. Wenguang Chen and Prof. Jidong Zhai. In 2018, I was a visiting scholar at North Carolina State University under the supervision of Prof. Xipeng Shen.

My interests include machine learning algorithm–system co-design, high-performance computing, and heterogeneous computing. Feel free to contact me about any form of research collaboration.

🔥 We are hiring! We work on world-class LLM industry scenarios and research topics. Join us to make outstanding contributions to AI technology and, through it, to human progress.

Projects

Machine Learning Optimizing Compiler

  • BladeDISC. A state-of-the-art optimizing compiler for end-to-end dynamic-shape machine learning programs, with advanced fusion and code-generation optimizations (the AStitch techniques) on multiple hardware backends.
  • RECom. An optimizing compiler that accelerates the expensive embedding-column processing in deep recommendation model inference on GPUs.
  • MonoNN. An optimizing compiler that fits an entire neural network into a single GPU kernel, drastically reducing non-computation overhead and opening up fine-grained optimization opportunities in the newly formed monolithic optimization space.

Machine Learning Kernel Library

  • Flash-LLM. A large language model (LLM) inference acceleration library for unstructured model pruning.
  • Quant-LLM/FP6-LLM. Efficient GPU support for LLM inference with FP6 quantization (end-to-end integration: DeepSpeed-FP6).

Heterogeneous Computing

  • VersaPipe. A framework for pipelined computing on GPUs.

Publications

[EuroSys’25] “Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing”. Shulai Zhang, Quan Chen, Weihao Cui, Han Zhao, Chunyu Xue, Zhen Zheng, Wei Lin, Minyi Guo. (to appear)

[Preprint’24] “ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks”. Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao. [PDF]

[SC’24] “RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules”. Zaifeng Pan, Zhen Zheng, Feng Zhang, Bing Xie, Ruofan Wu, Shaden Smith, Chuanjie Liu, Olatunji Ruwase, Xiaoyong Du, and Yufei Ding. [Code]

[ATC’24] “Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs”. Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song. (previous preprint name: FP6-LLM) [PDF | Code | E2E]

[EuroSys’24] “WiseGraph: Optimizing GNN with Joint Workload Partition of Graph and Operations”. Kezhao Huang, Jidong Zhai, Liyan Zheng, Haojie Wang, Yuyang Jin, Qihao Zhang, Runqing Zhang, Zhen Zheng, Youngmin Yi, Xipeng Shen. [PDF | Code]

[OSDI’24] “MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures”. Donglin Zhuang*, Zhen Zheng*, Haojun Xia, Xiafei Qiu, Junjie Bai, Wei Lin, Shuaiwen Leon Song. (revised and resubmitted from OSDI’23, accepted at OSDI’24) [PDF | Code]

[SIGMOD’24] “BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach”. Zhen Zheng, Zaifeng Pan, Dalin Wang, Kai Zhu, Wenyi Zhao, Tianyou Guo, Xiafei Qiu, Minmin Sun, Junjie Bai, Feng Zhang, Xiaoyong Du, Jidong Zhai, Wei Lin. [PDF | Code]

[VLDB’24] “Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity”. Haojun Xia*, Zhen Zheng*, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song. [PDF | Code]

[ASPLOS’23] “RECom: A Compiler Approach to Accelerating Recommendation Model Inference with Massive Embedding Columns”. Zaifeng Pan, Zhen Zheng, Feng Zhang, Ruofan Wu, Hao Liang, Dalin Wang, Xiafei Qiu, Junjie Bai, Wei Lin, Xiaoyong Du. [PDF | Code]

[TKDE’23] “Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device”. Feng Zhang, Ruofan Wu, Jiawei Guan, Zhen Zheng, Xiaoguang Guo, Xiao Zhang, Xiaoyong Du, Xipeng Shen. [PDF]

[ASPLOS’22] “AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-intensive ML Training and Inference on Modern SIMT Architectures”. Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, Shuaiwen Leon Song, Wei Lin. [PDF]

[ATC’22] “Whale: Efficient Giant Model Training over Heterogeneous GPUs”. Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, Wei Lin. [PDF | Code]

[WWW’22] “DREW: Efficient Winograd CNN Inference with Deep Reuse”. Ruofan Wu, Feng Zhang, Jiawei Guan, Zhen Zheng, Xiaoyong Du, Xipeng Shen. [PDF]

[TPDS’22] “Optimizing DNN Compilation for Distributed Training With Joint OP and Tensor Fusion”. Xiaodong Yi, Shiwei Zhang, Lansong Diao, Chuan Wu, Zhen Zheng, Shiqing Fan, Siyu Wang, Jun Yang, Wei Lin. [PDF]

[PPoPP’21] “Understanding and Bridging the Gaps in Current GNN Performance Optimizations”. Kezhao Huang, Jidong Zhai, Zhen Zheng, Youngmin Yi, Xipeng Shen. [PDF]

[PPoPP’21] “DAPPLE: A Pipelined Data Parallel Approach for Training Large Models”. Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin. [PDF | Code]

[CoNEXT’20] “Optimizing Distributed Training Deployment in Heterogeneous GPU Clusters”. Xiaodong Yi, Shiwei Zhang, Ziyue Luo, Guoping Long, Lansong Diao, Chuan Wu, Zhen Zheng, Jun Yang, Wei Lin. [PDF]

[PACT’20] “GOPipe: A Granularity-oblivious Programming Framework for Pipelined Stencil Executions on GPU”. Chanyoung Oh, Zhen Zheng, Xipeng Shen, Jidong Zhai, Youngmin Yi. [PDF]

[ASPLOS’19] “HiWayLib: A Software Framework for Enabling High Performance Communications for Heterogeneous Pipeline Computations”. Zhen Zheng, Chanyoung Oh, Jidong Zhai, Xipeng Shen, Youngmin Yi, Wenguang Chen. [PDF]

[MICRO’17] “VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU”. Zhen Zheng, Chanyoung Oh, Jidong Zhai, Xipeng Shen, Youngmin Yi, Wenguang Chen. [PDF | Code]

[SC’16] “Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway Taihulight Supercomputer”. Haohuan Fu, Junfeng Liao, Wei Xue, Lanning Wang, Dexun Chen, Long Gu, Jinxiu Xu, Nan Ding, Xinliang Wang, Conghui He, Shizhen Xu, Yishuang Liang, Jiarui Fang, Yuanchao Xu, Weijie Zheng, Jingheng Xu, Zhen Zheng, Wanjing Wei, Xu Ji, He Zhang, Bingwei Chen, Kaiwei Li, Xiaomeng Huang, Wenguang Chen, Guangwen Yang. [PDF]