m3: Accurate Flow-Level Performance Estimation using Machine Learning

Abstract

The goal is to estimate data center network tail latency for a given workload, topology, and network configuration. First, the authors show that path-level simulation produces almost the same aggregate statistics as full network-wide packet-level simulation. They use flow-level fluid simulation to capture the essential elements of the path workload, including the effect of cross-traffic on the flows on that path. Next, they feed this coarse simulation into an ML model that predicts path-level behavior, and run the model on a sample of paths to produce accurate network-wide estimates. The model generalizes over congestion control (CC) protocols, CC protocol parameters, and routing.
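To make "flow-level fluid simulation" concrete, here is a minimal sketch of the idea, not m3's actual simulator. It assumes a single bottleneck link whose capacity is split equally among the active flows (an assumption made for illustration), and the name `fluid_fct` is hypothetical. The point is that the simulator only advances at flow arrivals and completions, never per packet.

```python
def fluid_fct(flows, capacity):
    """Fluid model of one link shared equally by all active flows.

    flows: list of (arrival_time_s, size_bytes); capacity: bytes per second.
    Returns flow completion times (finish - arrival) in input order.
    Equal sharing on a single link is an illustrative assumption.
    """
    order = sorted(range(len(flows)), key=lambda i: flows[i][0])
    remaining = {}                      # flow index -> bytes still to send
    fct = [0.0] * len(flows)
    now, nxt = 0.0, 0
    while nxt < len(order) or remaining:
        next_arrival = flows[order[nxt]][0] if nxt < len(order) else float("inf")
        if not remaining:
            # Link is idle: jump straight to the next flow arrival.
            now = next_arrival
            remaining[order[nxt]] = float(flows[order[nxt]][1])
            nxt += 1
            continue
        rate = capacity / len(remaining)            # per-flow fair share
        first = min(remaining, key=remaining.get)   # flow that finishes first
        finish = now + remaining[first] / rate
        step_to = min(next_arrival, finish)
        for i in remaining:                         # drain all active flows
            remaining[i] -= rate * (step_to - now)
        now = step_to
        if next_arrival < finish:
            # A new flow arrives before anyone finishes; shares shrink.
            remaining[order[nxt]] = float(flows[order[nxt]][1])
            nxt += 1
        else:
            del remaining[first]
            fct[first] = now - flows[first][0]
    return fct
```

For example, two 100-byte flows arriving together on a 100 B/s link each get 50 B/s and both finish after 2 seconds. The number of loop iterations grows with the number of flows, not the number of packets.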

Introduction

Prominent simulators are packet-level discrete-event simulators
- ns-3, OPNET, OMNeT++, htsim
- take every event at each network component, serialize them in a single event queue, and process them one by one
- inherently slow and cannot keep up with the size and speed of modern networks
Recent work accelerates packet-level simulators and improves their scalability
- ML techniques
  - MimicNet, DeepQueueNet
- Parallelization strategies
  - Parsimon, DONS
- these still operate at the packet level, so they remain slow as network size grows
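The "single event queue, processed one by one" structure described above can be sketched as follows. This is a toy model of one FIFO link, not code from ns-3 or any other simulator; the function name `simulate_link` and the two event kinds are assumptions made for illustration.

```python
import heapq
import itertools


def simulate_link(packets, rate_bps):
    """Toy packet-level discrete-event simulation of one FIFO link.

    packets: list of (arrival_time_s, size_bits).
    Returns (departure_times, events_processed).
    """
    tie = itertools.count()             # keeps heap ordering stable on equal times
    queue = []                          # the single global event queue
    for pid, (t, _size) in enumerate(packets):
        heapq.heappush(queue, (t, next(tie), "arrive", pid))

    link_free_at = 0.0
    departures = [0.0] * len(packets)
    processed = 0
    while queue:
        t, _, kind, pid = heapq.heappop(queue)      # strictly one event at a time
        processed += 1
        if kind == "arrive":
            # Packet waits until the link is free, then is transmitted.
            start = max(t, link_free_at)
            link_free_at = start + packets[pid][1] / rate_bps
            heapq.heappush(queue, (link_free_at, next(tie), "depart", pid))
        else:
            departures[pid] = t
    return departures, processed
```

Even in this toy, every packet costs two events on a single link. A real simulator processes events at every hop and component, which is why the cost grows with both traffic volume and network size.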

Goal: design a performance model that overcomes the limitations of packet-level simulation without sacrificing fidelity.

Assumptions: Most network simulations are not used to inspect the behavior of individual packets or even individual flows. In many cases, a network designer is interested in certain performance metrics (e.g., network throughput, tail latency, flow completion time) and how they are affected by changes in network conditions (e.g., workload characteristics) and by design choices (e.g., congestion control parameters, routing policies, job placement). Hence, rather than simulate every packet interaction, the authors want to learn a model that predicts these performance metrics using a higher-level abstraction.

Contributions: m3, a system that uses machine learning to predict the flow-level performance of a data center network. It is trained on ground-truth data generated by packet-level simulation in ns-3. Given a network topology, a workload (a sequence of flows and their network paths), and optionally a set of design parameters (e.g., CC knobs), m3 predicts the flow completion time (FCT) distribution for a class of traffic (e.g., flows in a certain size range, flows sent from certain endpoints, or flows traversing certain paths).
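As a rough illustration of what "FCT distribution for a class of traffic" means as an output, the sketch below groups flows into size ranges and reports a tail percentile of FCT per range. The bucket boundaries, the nearest-rank percentile method, and the function names are all assumptions made here, not details taken from m3.

```python
import bisect
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def fct_tail_by_size(flows, size_edges, p=99):
    """Tail FCT per flow-size bucket.

    flows: iterable of (size_bytes, fct_seconds).
    size_edges: ascending inclusive upper bounds; sizes above the last
    edge fall into one extra bucket. Returns {bucket_index: percentile}.
    """
    buckets = {}
    for size, fct in flows:
        idx = bisect.bisect_left(size_edges, size)  # bucket holding this size
        buckets.setdefault(idx, []).append(fct)
    return {idx: percentile(vals, p) for idx, vals in sorted(buckets.items())}


# Hypothetical measurements: (size in bytes, FCT in seconds).
sample = [(500, 1.0), (800, 3.0), (5000, 2.0), (50000, 10.0)]
tails = fct_tail_by_size(sample, size_edges=[1000, 10000], p=99)
```

The same summary can be computed from packet-level ground truth and from a model's predictions, which makes it a natural place to compare the two.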

Q: <10% error might still be too high!

Q: No packet loss and other metrics? Only FCT.

Q: Too many assumptions -> Limited use cases?

Q: No application-level fidelity. Is a new training dataset still needed for each different workload?

Q: Cannot handle dynamic routing, multi-path traffic, etc. Hard to generalize to other cases.

Insights

No existing work addresses all of the following simultaneously:

- various topologies (MimicNet), dynamic routing (m3), and realistic traffic (DeepQueueNet)
- on-demand deployment without extensive modification and design effort (DONS)
- no reliance on assumptions (Parsimon, m3)
- no need to retrain everything after small changes in network conditions
- fast at large scale with high fidelity and accuracy, and able to present metrics across multiple dimensions/layers
