June 23rd, 2025
Categories: Applications, Networking, Supercomputing, Machine Learning, Data Science, High Performance Computing
In high-performance computing (HPC), modern supercomputers typically allocate compute resources exclusively to user applications. The interconnect network, however, is shared among co-running workloads for both inter-node communication and cross-node I/O access, which makes network interference inevitable. In this study, we develop MFNetSim, a multi-fidelity modeling framework that simulates multiple types of traffic simultaneously over the interconnect network, including inter-process communication and I/O traffic. By combining different levels of abstraction, MFNetSim can efficiently co-model the communication and I/O traffic occurring on HPC systems equipped with flash-based storage. We conduct simulation studies of hybrid workloads composed of traditional HPC applications and emerging ML applications on a 1,056-node Dragonfly system with various configurations. Our analysis yields several observations on how network interference affects communication and I/O traffic.
https://doi.org/10.1145/3729424
Wang, X., Brown, K. A., Ross, R. B., Carothers, C. D., Lan, Z., MFNetSim: A Multi-Fidelity Network Simulation Framework for Multi-Traffic Modeling of Dragonfly Systems, Proceedings of the 39th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, June 23rd, 2025. https://dl.acm.org/doi/10.1145/3729424