May 8th, 2025
Categories: Applications, Supercomputing, Deep Learning, Data Science, High Performance Computing
High-performance computing (HPC) job scheduling has seen promising advances with Deep Reinforcement Learning (DRL). However, challenges such as low interpretability, instability, and high computational cost hinder DRL’s practical adoption. We explore a model-based alternative using Monte Carlo Tree Search (MCTS) to overcome these limitations. By leveraging existing HPC simulators as models for MCTS and focusing on transparent decision-making, we aim to develop a scalable and interpretable scheduling solution fit for real-world deployment.
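To make the idea of using a simulator as the model for MCTS concrete, here is a minimal sketch in Python. It is not the authors' implementation. The toy simulator (`legal_actions`, `advance`, `step`), the four-node cluster, the total-wait-time objective, and the exploration constant `c` are all hypothetical stand-ins chosen for illustration; a real deployment would call an existing HPC simulator where `step` appears.

```python
import math
import random

TOTAL_NODES = 4  # hypothetical cluster size

# State: (time, running, waiting)
#   running: tuple of (end_time, nodes); waiting: tuple of (nodes, runtime)
# Assumes all jobs arrive at t=0, have positive runtimes, and need at most
# TOTAL_NODES nodes.

def legal_actions(state):
    """Indices of waiting jobs that fit on the currently free nodes."""
    _, running, waiting = state
    free = TOTAL_NODES - sum(n for _, n in running)
    return [i for i, (n, _) in enumerate(waiting) if n <= free]

def advance(state):
    """Move time to the next job completion until some waiting job fits."""
    time, running, waiting = state
    while waiting and not legal_actions((time, running, waiting)):
        time = min(end for end, _ in running)
        running = tuple(r for r in running if r[0] > time)
    return (time, running, waiting)

def step(state, action):
    """Toy simulator model: start job `action`; return (next_state, its wait)."""
    time, running, waiting = state
    nodes, runtime = waiting[action]
    running = running + ((time + runtime, nodes),)
    waiting = waiting[:action] + waiting[action + 1:]
    return advance((time, running, waiting)), time

def initial_state(jobs):
    return advance((0, (), tuple(jobs)))

class Node:
    def __init__(self, state, parent=None, cost=0):
        self.state, self.parent, self.cost = state, parent, cost
        self.children = {}                   # action -> Node
        self.untried = legal_actions(state)  # actions not yet expanded
        self.visits, self.total = 0, 0.0     # visit count, sum of returns

def rollout(state, cost, rng):
    """Finish the schedule with random choices; return negative total wait."""
    while state[2]:
        state, wait = step(state, rng.choice(legal_actions(state)))
        cost += wait
    return -cost

def mcts(root_state, iterations=300, c=5.0, seed=0):
    """Return (best_action, per-action statistics) for the root state."""
    rng = random.Random(seed)
    root = Node(root_state)
    if not root.untried:                     # nothing left to schedule
        return None, {}
    for _ in range(iterations):
        node = root
        # Selection: follow the UCT rule while the node is fully expanded.
        while not node.untried and node.children:
            node = max(node.children.values(),
                       key=lambda ch: ch.total / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # Expansion: add one child for an untried action.
        if node.untried:
            action = node.untried.pop()
            nxt, wait = step(node.state, action)
            child = Node(nxt, parent=node, cost=node.cost + wait)
            node.children[action] = child
            node = child
        # Simulation, then backpropagation of the return up to the root.
        ret = rollout(node.state, node.cost, rng)
        while node is not None:
            node.visits += 1
            node.total += ret
            node = node.parent
    stats = {a: {"visits": ch.visits, "mean_return": ch.total / ch.visits}
             for a, ch in root.children.items()}
    return max(stats, key=lambda a: stats[a]["visits"]), stats

# One wide, long job and two small, short ones, all waiting at t=0.
jobs = [(4, 10), (1, 1), (1, 1)]
best, stats = mcts(initial_state(jobs))
print(best, stats)
```

In this sketch the per-action visit counts and mean returns are returned alongside the chosen action, so the basis for a scheduling decision can be inspected directly. That is one simple way to read the "transparent decision-making" goal, under the assumptions stated above.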
Kurkure, Y., Zhang, Y., Papka, M. E., Lan, Z., Overcoming RL Limitations in HPC Scheduling: A Model-Based MCTS Approach for Practical Deployment (poster), The 12th Greater Chicago Area Systems Research Workshop (GCASR), Chicago, IL, May 8th, 2025. https://gcasr.org/2025/posters