Publications | Chen Chen

2026

EuroSys

Suika: Efficient and High-quality Re-scheduling of 3D-parallelized LLM Training Jobs in Shared Clusters

Yuxuan Wang, Yanbo Wang, Chen Chen, Chunyu Xue, Qizhen Weng, Yin Chen, Zeren Li, Xuqi Zhu, Yongqiang Yang, Quan Chen, and Minyi Guo

In European Conference on Computer Systems, 2026
EuroSys

Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design

Chunyu Xue, Weihao Cui, Quan Chen, Chen Chen, Han Zhao, Shulai Zhang, Linwei Wang, Yan Li, Limin Xiao, Weifeng Zhang, Jing Yang, Bingsheng He, and Minyi Guo

In European Conference on Computer Systems, 2026
ICLR

ASTRAEA: A Token-wise Acceleration Framework for Video Diffusion Transformers

Haosong Liu, Yuge Cheng, Wenxuan Miao, Zihan Liu, Aiyue Chen, Jing Lin, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, and Minyi Guo

In The Fourteenth International Conference on Learning Representations, 2026
VLDB

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, and Mao Yang

International Conference on Very Large Data Bases, 2026

2025

ToN

Mitigating Server-side Communication Bottlenecks in Distributed Learning with Round-Robin Participant Coordination

Jiayi Zhang, Chen Chen, Zuo Gan, Wei Wang, Bo Li, and Minyi Guo

IEEE Transactions on Networking, 2025
TCC

Castor: Optimizing Deep Learning Job Scheduling in Multi-Tenant GPU Clusters via Intelligent Colocation

Yizhou Luo, Jiaxin Lai, Shaohuai Shi, Chen Chen, Shuhan Qi, Jiajia Zhang, and Qiang Wang

IEEE Transactions on Cloud Computing, 2025

PDF
CloudCom

SemanticPrefetcher: Accelerate Data Lake Access with Semantics-Aware File Prefetching

Tianze Wang, Guanjie Wang, Mingyan Yang, Manqi Luo, Mingchuan Zou, Chen Chen, and Minyi Guo

In The 16th IEEE International Conference on Cloud Computing Technology and Science, 2025

Excellent Paper Award

Awarded the excellent paper (1/3) of IEEE CloudCom
arXiv

Efficient Unified Caching for Accelerating Heterogeneous AI Workloads

Tianze Wang, Yifei Liu, Chen Chen, Pengfei Zuo, Jiawei Zhang, Qizhen Weng, Yin Chen, Zhenhua Han, Jieru Zhao, Quan Chen, and Minyi Guo

arXiv preprint arXiv:2506.12370, 2025

PDF
arXiv

Efficient Serving of LLM Applications with Probabilistic Demand Modeling

Yifei Liu, Zuo Gan, Zhenghao Gan, Weiye Wang, Chen Chen, Yizhou Shan, Xusheng Chen, Zhenhua Han, Yifei Zhu, Shixuan Sun, and others

arXiv preprint arXiv:2506.14851, 2025

PDF
NeurIPS

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, and Lili Qiu

In The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025

PDF
TACO

Ares: Fair and Efficient Scheduling of Deep Learning Jobs with Elastic Fair Queuing

Yifei Liu, Chen Chen, Qiang Wang, Yu Feng, Weihao Cui, Quan Chen, and Minyi Guo

ACM Transactions on Architecture and Code Optimization, 2025

PDF
TSC

Trident: A Provider-Oriented Resource Management Framework for Serverless Computing Platforms

Botao Zhu, Yifei Zhu, Chen Chen, and Linghe Kong

IEEE Transactions on Services Computing, 2025

PDF
VLDB

RapidStore: An Efficient Dynamic Graph Storage System for Concurrent Queries

Chiyu Hao, Jixian Su, Shixuan Sun, Hao Zhang, Sen Gao, Jianwen Zhao, Chenyi Zhang, Jieru Zhao, Chen Chen, and Minyi Guo

International Conference on Very Large Data Bases, 2025

PDF
TACO

EDAS: Enabling Fast Data Loading for GPU Serverless Computing

Han Zhao, Weihao Cui, Quan Chen, Zijun Li, Zhenhua Han, Nan Wang, Yu Feng, Jieru Zhao, Chen Chen, Jingwen Leng, and others

ACM Transactions on Architecture and Code Optimization, 2025

PDF
ICDCS

LLMSched: Uncertainty-Aware Workload Scheduling for Compound LLM Applications

Botao Zhu, Chen Chen, Xiaoyi Fan, and Yifei Zhu

In IEEE International Conference on Distributed Computing Systems, 2025

PDF
ICDCS

FedSU: Communication-efficient Federated Learning with Speculative Updating

Wei Yu, Chen Chen, Qinbin Li, Jieru Zhao, Shixuan Sun, Bo Li, and Minyi Guo

In IEEE International Conference on Distributed Computing Systems, 2025

PDF
ISCA

Lumina: Real-Time Neural Rendering by Exploiting Computational Redundancy

Yu Feng, Weikai Lin, Yuge Cheng, Zihan Liu, Jingwen Leng, Minyi Guo, Chen Chen, Shixuan Sun, and Yuhao Zhu

In International Symposium on Computer Architecture, 2025

PDF
TACO

Taming Flexible Job Packing in Deep Learning Training Clusters

Pengyu Yang, Weihao Cui, Chunyu Xue, Han Zhao, Chen Chen, Quan Chen, Jing Yang, and Minyi Guo

ACM Transactions on Architecture and Code Optimization, 2025

PDF
IPDPS

Reducing the End-to-End Latency of DNN-based Recommendation Systems Deployed in GPU Pools

Guangqiang Luan, Pu Pang, Quan Chen, Guoyao Xu, Chi Zhang, Yanyi Zi, Yinghao Yu, Guodong Yang, Liping Zhang, Chen Chen, and Minyi Guo

In 39th IEEE International Parallel and Distributed Processing Symposium, 2025

PDF

2024

ENLSP

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, and Lili Qiu

In the 4th NeurIPS Workshop on Efficient Natural Language and Speech Processing (Spotlight), 2024

Best Paper Award PDF

Awarded the best paper by ENLSP
ICPP

FedCA: Efficient Federated Learning with Client Autonomy

Na Lv, Zhi Shen, Chen Chen, Zhifeng Jiang, Jiayi Zhang, Quan Chen, and Minyi Guo

In Proceedings of the 53rd ACM International Conference on Parallel Processing, 2024

PDF
IWQoS

PAS: Towards Accurate and Efficient Federated Learning with Parameter-Adaptive Synchronization

Zuo Gan, Chen Chen, Jiayi Zhang, Gaoxiong Zeng, Yifei Zhu, Jieru Zhao, Quan Chen, and Minyi Guo

In Proceedings of the IEEE/ACM International Symposium on Quality of Service, 2024

PDF
IWQoS Poster

Towards Efficient Compound Large Language Model System Serving in the Wild

Yifei Zhu, Botao Zhu, Chen Chen, and Xiaoyi Fan

In Proceedings of the IEEE/ACM International Symposium on Quality of Service, 2024

Best Poster Award PDF

The best poster for IWQoS 2024
OSDI

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu

In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 2024

Abs PDF

The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strengths of LLMs and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today’s public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications.
INFOCOM

DPBalance: Efficient and Fair Privacy Budget Scheduling for Federated Learning as a Service

Yu Liu, Zibo Wang, Yifei Zhu, and Chen Chen

In Proceedings of the IEEE Conference on Computer Communications, 2024

PDF
HPCA

An Optimizing Framework on MLIR for Efficient FPGA-based Accelerator Generation

Weichuang Zhang, Jieru Zhao, Guan Shen, Quan Chen, Chen Chen, and Minyi Guo

In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2024

Distinguished Artifact Award Runner-up PDF

The paper with the best artifact in HPCA 2024 (2 out of 75 accepted papers and 410 submissions)
ASPLOS

DataFlower: Exploiting the Data-flow Paradigm for Serverless Workflow Orchestration

Zijun Li, Chuhao Xu, Quan Chen, Jieru Zhao, Chen Chen, and Minyi Guo

In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

PDF Video

2023

ICCD

STAG: Enabling Low Latency and Low Staleness of GNN-based Services with Dynamic Graphs

Jiawen Wang, Quan Chen, Deze Zeng, Zhuo Song, Chen Chen, and Minyi Guo

In Proceedings of the IEEE International Conference on Computer Design, 2023

PDF
TCC

Accelerating Distributed Learning in Non-Dedicated Environments

Chen Chen, Qizhen Weng, Wei Wang, Baochun Li, and Bo Li

IEEE Transactions on Cloud Computing, 2023

Abs PDF

Machine learning (ML) models are increasingly trained with distributed workers possessing heterogeneous resources. In such scenarios, model training efficiency may be negatively affected by stragglers, i.e., workers that run much slower than others. Efficient model training requires eliminating such stragglers, yet for modern ML workloads, existing load balancing strategies are inefficient and even infeasible. In this paper, we propose a novel strategy, called semi-dynamic load balancing, to eliminate stragglers of distributed ML workloads. The key insight is that ML workers shall be load-balanced at iteration boundaries, being non-intrusive to intra-iteration execution. Based on it we further develop LB-BSP, an integrated worker coordination mechanism that adapts workers’ load to their instantaneous processing capabilities by right-sizing the sample batches at the synchronization barriers. We have designed distinct load tuning algorithms for ML in CPU clusters, in GPU clusters as well as in federated learning setups, based on their respective characteristics. LB-BSP has been implemented as a Python module for ML frameworks like TensorFlow and PyTorch. Our EC2 deployment confirms that LB-BSP is practical, effective and light-weight, and is able to accelerate distributed training by up to 54%.
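The batch right-sizing idea in the abstract above can be illustrated with a minimal sketch. It assumes each worker's processing capability is summarized by a measured throughput (samples per second) and that the global batch size must stay fixed; the function name `rebalance_batches` and the simple proportional split are illustrative assumptions, not the load tuning algorithms from the paper.

```python
def rebalance_batches(throughputs, global_batch):
    """Split a fixed global batch across workers in proportion to throughput.

    Hypothetical helper: a plain proportional split with largest-remainder
    rounding, meant to be called at a synchronization barrier.
    """
    total = sum(throughputs)
    shares = [global_batch * t / total for t in throughputs]
    sizes = [int(s) for s in shares]  # round every share down first
    leftover = global_batch - sum(sizes)
    # Hand the remaining samples to the workers with the largest fractional parts.
    order = sorted(range(len(shares)), key=lambda i: shares[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


# A worker that is twice as fast receives twice as many samples.
sizes = rebalance_batches([100.0, 50.0, 50.0], 128)
print(sizes)
```

With batches sized this way, every worker would finish an iteration at roughly the same time, which is the straggler-free behavior the abstract describes.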
ICPP

ASFL: Adaptive Semi-asynchronous Federated Learning for Balancing Model Accuracy and Total Latency in Mobile Edge Networks

Jieling Yu, Ruiting Zhou, Chen Chen, Bo Li, and Fang Dong

In Proceedings of the ACM International Conference on Parallel Processing, 2023

PDF
TPDS

Synchronize Only the Immature Parameters: Communication-Efficient Federated Learning By Freezing Parameters Adaptively

Chen Chen, Hong Xu, Wei Wang, Baochun Li, Bo Li, Li Chen, and Gong Zhang

IEEE Transactions on Parallel and Distributed Systems, 2023

Abs PDF

Federated learning allows edge devices to collaboratively train a global model without sharing their local private data. Yet, with limited network bandwidth at the edge, communication often becomes a severe bottleneck. In this paper, we find that it is unnecessary to always synchronize the full model in the entire training process, because many parameters already become mature (i.e., stable) prior to model convergence, and can thus be excluded from later synchronizations. This allows us to reduce the communication overhead without compromising the model accuracy. However, challenges are that the local parameters excluded from global synchronization may diverge on different clients, and meanwhile some parameters may stabilize only temporarily. To address these challenges, we propose a novel scheme called Adaptive Parameter Freezing (APF), which fixes (freezes) the non-synchronized stable parameters in intermittent periods. Specifically, the freezing periods are tentatively adjusted in an additively-increase and multiplicatively-decrease manner, depending on whether the previously-frozen parameters remain stable in subsequent iterations. We also extend APF into APF# and APF++, which freeze parameters in a more aggressive manner to achieve larger performance benefit for large complex models. We implemented APF and its variants as Python modules with PyTorch, and extensive experiments show that APF can reduce data transfer amount by over 60%.
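The additive-increase, multiplicative-decrease adjustment of freezing periods mentioned in the abstract above can be sketched as follows. The step size, the decrease factor, the function name `next_freezing_period`, and the convention that a period of 0 means "synchronize again" are assumptions made for illustration; the abstract does not specify them.

```python
def next_freezing_period(period, still_stable, step=1, factor=2):
    """Return the next freezing period (in synchronization rounds) for one parameter.

    Hypothetical AIMD rule: grow the period additively while the previously
    frozen parameter stays stable, and shrink it multiplicatively otherwise.
    """
    if still_stable:
        return period + step
    return period // factor


# Trace one parameter that stays stable for three checks, drifts once, then recovers.
period = 1
history = []
for stable in [True, True, True, False, True]:
    period = next_freezing_period(period, stable)
    history.append(period)
print(history)
```

The cautious additive growth and quick multiplicative backoff match the abstract's concern that some parameters stabilize only temporarily.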
JSAC

GIFT: Towards Accurate and Efficient Federated Learning with Gradient-Instructed Frequency Tuning

Chen Chen, Hong Xu, Baochun Li, Bo Li, Li Chen, and Gong Zhang

IEEE Journal on Selected Areas in Communications (special issue on Communication-Efficient Distributed Learning over Networks), 2023

Abs PDF

Federated Learning (FL) enables distributed clients to collectively train a global model without revealing their private data, and for efficiency clients synchronize their gradients periodically. However, this can lead to inaccuracy in model convergence due to inconsistent data distributions among clients. In this work, we find that there is a strong correlation between FL accuracy loss and the synchronization frequency, and seek to fine-tune the synchronization frequency at training runtime to make FL accurate and also efficient. Specifically, aware that under the FL privacy requirement only gradients can be utilized for making frequency tuning decisions, we propose a novel metric called gradient consistency, which can effectively reflect the training status despite the instability of realistic FL scenarios. We further devise a feedback-driven algorithm called Gradient-Instructed Frequency Tuning (GIFT), which adaptively increases or decreases the synchronization frequency based on the gradient consistency metric. We have implemented GIFT in PyTorch, and large-scale evaluations show that it can improve FL accuracy by up to 10.7% with a time reduction of 58.1%.
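A feedback loop of the kind described in the abstract above might look like the sketch below. The abstract does not define gradient consistency, so the metric used here (norm of the averaged gradient divided by the average gradient norm) is a hypothetical stand-in, and the thresholds, bounds, and function names are likewise assumptions rather than the GIFT algorithm itself.

```python
import math


def gradient_consistency(grads):
    """Hypothetical stand-in metric in [0, 1]: ||mean(g)|| / mean(||g||).

    It is 1.0 when all client gradients are identical and 0.0 when they cancel out.
    """
    n = len(grads)
    dim = len(grads[0])
    mean = [sum(g[k] for g in grads) / n for k in range(dim)]
    mean_norm = math.sqrt(sum(x * x for x in mean))
    avg_norm = sum(math.sqrt(sum(x * x for x in g)) for g in grads) / n
    return mean_norm / avg_norm if avg_norm > 0 else 1.0


def tune_interval(interval, consistency, low=0.3, high=0.7, min_i=1, max_i=64):
    """Adjust the number of local steps between synchronizations (assumed rule)."""
    if consistency < low:
        return max(min_i, interval // 2)  # clients disagree: synchronize more often
    if consistency > high:
        return min(max_i, interval * 2)  # clients agree: synchronize less often
    return interval


aligned = [[1.0, 0.0], [1.0, 0.0]]
opposed = [[1.0, 0.0], [-1.0, 0.0]]
print(tune_interval(8, gradient_consistency(aligned)))
print(tune_interval(8, gradient_consistency(opposed)))
```

Only gradients enter the decision, in line with the privacy constraint the abstract states.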

2022

SoCC

Characterizing and orchestrating VM reservation in geo-distributed clouds to improve the resource efficiency

Jiuchen Shi, Kaihua Fu, Quan Chen, Changpeng Yang, Pengfei Huang, Mosong Zhou, Jieru Zhao, Chen Chen, and Minyi Guo

In Proceedings of the ACM Symposium on Cloud Computing, 2022

Abs PDF

Cloud providers often build a geo-distributed cloud from multiple datacenters in different geographic regions, to serve tenants at different locations. The tenants that run large-scale applications often reserve resources based on their peak loads in the region close to the end users to handle the ever-changing application load, wasting a large amount of resources. We therefore characterize the VM request patterns of the top tenants in our production public geo-distributed cloud, and open-source the VM request traces collected over four months from the top 20 tenants of our cloud. The characterization shows that the resource usage of large tenants has various temporal and spatial patterns on the dimensions of time series, regions, and VM types, and has the potential of peak shaving between different tenants to further reduce the resource reservation cost. Based on the findings, we propose a resource reservation and VM request scheduling scheme named ROS to minimize the resource reservation cost while satisfying the VM allocation requests. Our experiments show that ROS reduces the overall deployment cost by 75.4% and the reserved resources by 60.1%, compared to the tenant-specified reservation strategy.

2021

ICDCS

Communication-Efficient Federated Learning with Adaptive Parameter Freezing

Chen Chen, Hong Xu, Wei Wang, Baochun Li, Bo Li, Li Chen, and Gong Zhang

In Proceedings of the IEEE International Conference on Distributed Computing Systems, 2021

Abs PDF Video

Federated learning allows edge devices to collaboratively train a global model by synchronizing their local updates without sharing private data. Yet, with limited network bandwidth at the edge, communication often becomes a severe bottleneck. In this paper, we find that it is unnecessary to always synchronize the full model in the entire training process, because many parameters gradually stabilize prior to the ultimate model convergence, and can thus be excluded from being synchronized at an early stage. This allows us to reduce the communication overhead without compromising the model accuracy. However, challenges are that the local parameters excluded from global synchronization may diverge on different clients, and meanwhile some parameters may stabilize only temporarily. To address these challenges, we propose a novel scheme called Adaptive Parameter Freezing (APF), which fixes (freezes) the non-synchronized stable parameters in intermittent periods. Specifically, the freezing periods are tentatively adjusted in an additively-increase and multiplicatively-decrease manner, depending on whether the previously-frozen parameters remain stable in subsequent iterations. We implemented APF as a Python module in PyTorch. Our extensive experiments show that APF can reduce data transfer by over 60%.
IJCNN

Two-dimensional learning rate decay: Towards accurate federated learning with non-iid data

Kaiwei Mo, Chen Chen, Jiamin Li, Hong Xu, and Chun Jason Xue

In Proceedings of the International Joint Conference on Neural Networks, 2021

PDF

2020

SoCC

Semi-Dynamic Load Balancing: Efficient Distributed Learning in Non-Dedicated Environments

Chen Chen, Qizhen Weng, Wei Wang, Baochun Li, and Bo Li

In Proceedings of the ACM Symposium on Cloud Computing, 2020

Abs PDF Video

Machine learning (ML) models are increasingly trained in clusters with non-dedicated workers possessing heterogeneous resources. In such scenarios, model training efficiency can be negatively affected by stragglers, i.e., workers that run much slower than others. Efficient model training requires eliminating such stragglers, yet for modern ML workloads, existing load balancing strategies are inefficient and even infeasible. In this paper, we propose a novel strategy called semi-dynamic load balancing to eliminate stragglers of distributed ML workloads. The key insight is that ML workers shall be load-balanced at iteration boundaries, being non-intrusive to intra-iteration execution. We develop LB-BSP based on such an insight, which is an integrated worker coordination mechanism that adapts workers’ load to their instantaneous processing capabilities by right-sizing the sample batches at the synchronization barriers. We have custom-designed the batch sizing algorithm respectively for CPU and GPU clusters based on their own characteristics. LB-BSP has been implemented as a Python module for ML frameworks like TensorFlow and PyTorch. Our EC2 deployment confirms that LB-BSP is practical, effective and light-weight, and is able to accelerate distributed training by up to 54%.
SC

Metis: Learning to schedule long-running applications in shared container clusters at scale

Luping Wang, Qizhen Weng, Wei Wang, Chen Chen, and Bo Li

In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

PDF

2019

INFOCOM

Round-robin synchronization: Mitigating communication bottlenecks in parameter servers

Chen Chen, Wei Wang, and Bo Li

In Proceedings of the IEEE Conference on Computer Communications, 2019

Best In-Session Presentation Award PDF

The best presentation for the session CLOUD COMPUTING 2

2018

SoCC Poster

Fast distributed deep learning via worker-adaptive batch sizing

Chen Chen, Qizhen Weng, Wei Wang, Baochun Li, and Bo Li

In Proceedings of the ACM Symposium on Cloud Computing, 2018

PDF
INFOCOM

Performance-Aware Fair Scheduling: Exploiting Demand Elasticity of Data Analytics Jobs

Chen Chen, Wei Wang, and Bo Li

In Proceedings of the IEEE Conference on Computer Communications, 2018

Abs PDF

Efficient resource management is of paramount importance in today’s production clusters. In this paper, we identify the demand elasticity of data-parallel jobs. Demand elasticity allows jobs to run with significantly fewer resources than they ideally need, at the expense of only a modest performance penalty. Our EC2 experiment using popular Spark benchmark suites confirms that running a job using 50% of demanded slots is sufficient to achieve at least 75% of the ideal performance. We show that such elasticity is an intrinsic property of data-parallel jobs and can be exploited to speed up average job completion. In this regard, we propose the Performance-Aware Fair (PAF) scheduler to identify the demand elasticity and use it to improve the average job performance, while still attaining a near-optimal isolation guarantee close to fair sharing. PAF starts with a fair allocation and iteratively adjusts it by transferring resources from one job to another, improving the performance of the resource-taker without penalizing the resource-giver by a noticeable amount. We implemented PAF in Spark and evaluated its effectiveness through both EC2 experiments and large-scale simulations. Evaluation results show that compared with fair allocation, PAF improves the average job performance by 13%, while penalizing resource-givers by no more than 1%.
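The "start fair, then transfer" loop described in the abstract above can be sketched in a few lines. Everything specific here is an assumption for illustration: the per-job performance functions, the one-slot transfer granularity, the 1% penalty bound used as a default, and the name `paf_like_adjust`. It is not the PAF scheduler as implemented in Spark.

```python
def paf_like_adjust(perf_fns, total_slots, max_penalty=0.01):
    """Greedy sketch: move one slot at a time from a giver to a taker.

    perf_fns[i](slots) is a hypothetical model of job i's performance in [0, 1].
    A transfer is allowed only if the giver stays within max_penalty of its
    fair-share performance and overall performance strictly improves.
    """
    n = len(perf_fns)
    fair = total_slots // n  # any remainder slots are left unassigned in this sketch
    alloc = [fair] * n
    baseline = [f(fair) for f in perf_fns]
    while True:
        best = None  # (net gain, giver index, taker index)
        for g in range(n):
            if alloc[g] == 0:
                continue
            after_give = perf_fns[g](alloc[g] - 1)
            if after_give < (1 - max_penalty) * baseline[g]:
                continue  # the giver would be penalized noticeably
            loss = perf_fns[g](alloc[g]) - after_give
            for t in range(n):
                if t == g:
                    continue
                gain = perf_fns[t](alloc[t] + 1) - perf_fns[t](alloc[t])
                if gain - loss > 1e-12 and (best is None or gain - loss > best[0]):
                    best = (gain - loss, g, t)
        if best is None:
            return alloc
        _, g, t = best
        alloc[g] -= 1
        alloc[t] += 1


# Job A saturates at 4 slots, job B at 12; a fair split of 16 slots gives 8 each.
job_a = lambda s: min(1.0, s / 4)
job_b = lambda s: min(1.0, s / 12)
alloc = paf_like_adjust([job_a, job_b], 16)
print(alloc)
```

In this toy setting job A gives up the slots it cannot use, so job B speeds up while job A keeps its fair-share performance.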

2017

ICDCS

Speculative Slot Reservation: Enforcing Service Isolation for Dependent Data-Parallel Computations

Chen Chen, Wei Wang, and Bo Li

In Proceedings of the IEEE International Conference on Distributed Computing Systems, 2017

Abs PDF

Priority scheduling is a fundamental tool to provide service isolation for different jobs in shared clusters. Ideally, the performance of a high-priority job should not be dragged down by another with a lower priority. However, we show in this paper that simply assigning a high priority provides no isolation for jobs with dependent computations. A job, even receiving the highest priority, may give up compute slots to another before proceeding to the downstream computation. This is because of the barrier: the downstream computation cannot start until all the upstream tasks have completed. Such an interruption of execution inevitably results in a significant delay.
INFOCOM

Cluster fair queueing: Speeding up data-parallel jobs with delay guarantees

Chen Chen, Wei Wang, Shengkai Zhang, and Bo Li

In Proceedings of the IEEE Conference on Computer Communications, 2017

Best In-Session Presentation Award PDF

The best presentation for the session CLOUD COMPUTING 1

2016

ICC

Software-defined inter-domain routing revisited

Chen Chen, Bo Li, Dong Lin, and Baochun Li

In Proceedings of the IEEE International Conference on Communications, 2016

Abs PDF

With software-defined networking (SDN), the control plane is fully decoupled from the data plane, which has been shown to improve routing performance and reduce route convergence time in the context of intra-domain routing. The applicability of software-defined networking to inter-domain routing, however, has not been fully explored. In this work, we first propose a mathematical model that attempts to quantify the BGP convergence time in an inter-domain routing environment, by simplifying the complex BGP convergence process. Based on our model and some practical observations, we then investigate how software-defined networking may help speed up inter-domain routing, and present a greedy algorithm that selects Autonomous Systems (ASes) for incremental SDN deployment to minimize the BGP convergence time. Our simulation results based on a real-world Internet topology have demonstrated the effectiveness of our proposed algorithm.