Some papers about ML, distributed systems, networking, virtualization, etc.
ML framework
- TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems -- Note: the founding TensorFlow paper
- Fast Distributed Deep Learning over RDMA -- Note: uses RDMA to accelerate machine learning
- Horovod: fast and easy distributed deep learning in TensorFlow
RDMA
- Congestion Control for Large-Scale RDMA Deployments -- Note: describes RDMA congestion control mechanisms
GPU
- Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect -- Note: describes combining GDR with SR-IOV
Resource management
- Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms -- Note: uses machine learning to optimize resource management
Network
- The Design and Implementation of Open vSwitch -- Note: the paper describing the Open vSwitch implementation
- Implementing Open vSwitch datapath using TC -- Note: implements the OVS datapath with TC