Publications
2025
- [arXiv] Efficient Unified Caching for Accelerating Heterogeneous AI Workloads. arXiv preprint arXiv:2506.12370, 2025
- [arXiv] Efficient Serving of LLM Applications with Probabilistic Demand Modeling. arXiv preprint arXiv:2506.14851, 2025
- [TACO] EDAS: Enabling Fast Data Loading for GPU Serverless Computing. ACM Transactions on Architecture and Code Optimization, 2025
- [ICDCS] LLMSched: Uncertainty-Aware Workload Scheduling for Compound LLM Applications. In IEEE International Conference on Distributed Computing Systems, 2025
- [ICDCS] FedSU: Communication-efficient Federated Learning with Speculative Updating. In IEEE International Conference on Distributed Computing Systems, 2025
- [ISCA] Lumina: Real-Time Neural Rendering by Exploiting Computational Redundancy. In International Symposium on Computer Architecture, 2025
- [TACO] Taming Flexible Job Packing in Deep Learning Training Clusters. ACM Transactions on Architecture and Code Optimization, 2025
- [IPDPS] Reducing the End-to-End Latency of DNN-based Recommendation Systems Deployed in GPU Pools. In 39th IEEE International Parallel and Distributed Processing Symposium, 2025
2024
- [ENLSP] RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. In the 4th NeurIPS Workshop on Efficient Natural Language and Speech Processing (Spotlight), 2024
- [ICPP] FedCA: Efficient Federated Learning with Client Autonomy. In Proceedings of the 53rd ACM International Conference on Parallel Processing, 2024
- [IWQoS] PAS: Towards Accurate and Efficient Federated Learning with Parameter-Adaptive Synchronization. In Proceedings of the IEEE/ACM International Symposium on Quality of Service, 2024
- [IWQoS Poster] Towards Efficient Compound Large Language Model System Serving in the Wild. In Proceedings of the IEEE/ACM International Symposium on Quality of Service, 2024
- [INFOCOM] DPBalance: Efficient and Fair Privacy Budget Scheduling for Federated Learning as a Service. In Proceedings of the IEEE Conference on Computer Communications, 2024
- [HPCA] An Optimizing Framework on MLIR for Efficient FPGA-based Accelerator Generation. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2024
2023
- [ICCD] STAG: Enabling Low Latency and Low Staleness of GNN-based Services with Dynamic Graphs. In Proceedings of the IEEE International Conference on Computer Design, 2023
- [ICPP] Asfl: Adaptive Semi-asynchronous Federated Learning for Balancing Model Accuracy and Total Latency in Mobile Edge Networks. In Proceedings of the ACM International Conference on Parallel Processing, 2023
2022
2021
- [IJCNN] Two-dimensional learning rate decay: Towards accurate federated learning with non-iid data. In Proceedings of the International Joint Conference on Neural Networks, 2021
2020
- [SC] Metis: Learning to schedule long-running applications in shared container clusters at scale. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2020
2019
- [INFOCOM] Round-robin synchronization: Mitigating communication bottlenecks in parameter servers. In Proceedings of the IEEE Conference on Computer Communications, 2019
2018
- [SoCC Poster] Fast distributed deep learning via worker-adaptive batch sizing. In Proceedings of the ACM Symposium on Cloud Computing, 2018
2017
- [INFOCOM] Cluster fair queueing: Speeding up data-parallel jobs with delay guarantees. In Proceedings of the IEEE Conference on Computer Communications, 2017