AllReduce is a critical collective in both HPC and large-scale AI workloads. However, scaling it to Exascale systems presents key challenges due to inter-node communication bottlenecks and underutilization of intra-node resources like shared memory and NVLink. This work analyzes state-of-the-art AllReduce algorithms to identify inefficiencies and opportunities for hybrid strategies that explicitly separate intra- and inter-node communication.

We introduce a preliminary algorithmic design that leverages tunable intra-node communication patterns and discuss key performance criteria, including message count and data volume. Our early results provide insight into communication trade-offs and guide the development of adaptive AllReduce implementations optimized for Exascale systems.
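
To make the message-count criterion concrete, the following is a minimal sketch of a generic two-level (hierarchical) AllReduce, simulated in plain Python with no real communication. It is not the EMMA design, which the abstract does not specify. The function names, the choice of local rank 0 as node leader, and the linear reduce-then-broadcast pattern in every phase are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple


def hierarchical_allreduce(
    values: List[List[float]],
) -> Tuple[List[List[float]], Dict[str, int]]:
    """Simulate a two-level sum-AllReduce.

    values[node][local_rank] is the scalar contributed by one process.
    Returns the per-process results and a count of simulated messages.
    Assumption: linear reduce + broadcast in each phase (hypothetical choice).
    """
    msgs = {"intra": 0, "inter": 0}

    # Phase 1: each node reduces its local values onto its leader (local rank 0).
    leader_partial = []
    for node_vals in values:
        leader_partial.append(sum(node_vals))
        msgs["intra"] += len(node_vals) - 1  # one message per non-leader rank

    # Phase 2: leaders reduce onto leader 0, which then broadcasts the total back.
    total = sum(leader_partial)
    msgs["inter"] += 2 * (len(values) - 1)

    # Phase 3: each leader broadcasts the total to its local ranks.
    result = []
    for node_vals in values:
        result.append([total] * len(node_vals))
        msgs["intra"] += len(node_vals) - 1

    return result, msgs


def flat_inter_node_messages(num_nodes: int, ranks_per_node: int) -> int:
    """Inter-node messages for a flat linear reduce + broadcast rooted at global rank 0.

    Every rank on a remote node sends one message to the root and receives one back.
    """
    remote_ranks = (num_nodes - 1) * ranks_per_node
    return 2 * remote_ranks


if __name__ == "__main__":
    data = [[1.0] * 8 for _ in range(4)]  # 4 nodes, 8 ranks per node
    result, msgs = hierarchical_allreduce(data)
    print(result[0][0], msgs, flat_inter_node_messages(4, 8))
```

Under these assumptions both schemes send the same total number of messages, but the hierarchical variant moves most of them onto intra-node links, which is the kind of trade-off the performance criteria above are meant to capture. Data volume per message is not modeled here.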

EMMA: Efficient Multi-node Memory-aware AllReduce Algorithms (poster)
