MFNetSim: A Multi-Fidelity Network Simulation Framework for Multi-Traffic Modeling of Dragonfly Systems

June 23rd, 2025

Categories: Applications, Networking, Supercomputing, Machine Learning, Data Science, High Performance Computing

An illustration of the workload replay module.

Authors

Wang, X., Brown, K. A., Ross, R. B., Carothers, C. D., Lan, Z.

About

In high-performance computing (HPC), modern supercomputers typically give each user application exclusive compute resources. The interconnect network, however, is shared among co-running workloads for both inter-node communication and cross-node I/O access, which makes network interference inevitable. In this study, we develop MFNetSim, a multi-fidelity modeling framework that simulates multiple traffic types simultaneously over the interconnect network, including inter-process communication and I/O traffic. By combining different levels of abstraction, MFNetSim can efficiently co-model the communication and I/O traffic on HPC systems equipped with flash-based storage. We conduct simulation studies of hybrid workloads composed of traditional HPC applications and emerging ML applications on a 1,056-node Dragonfly system under various configurations. Our analysis yields a number of observations on how network interference affects communication and I/O traffic.

https://doi.org/10.1145/3729424

Resources

PDF

URL

Citation

Wang, X., Brown, K. A., Ross, R. B., Carothers, C. D., Lan, Z., MFNetSim: A Multi-Fidelity Network Simulation Framework for Multi-Traffic Modeling of Dragonfly Systems, Proceedings of the 39th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, June 23rd, 2025. https://dl.acm.org/doi/10.1145/3729424