In ever-growing datacenters, millions of servers operate together to meet the rising demand for cloud computing, big data storage, and AI computing tasks. With numerous tasks including resource-intensive AI computations running on these servers, hardware failures significantly impact the servers’ Reliability, Availability and Serviceability (RAS). In this tutorial, we explore cutting-edge methodologies for predicting hardware failures in cloud environments to ensure uninterrupted service continuity and optimal performance.
The tutorial targets researchers and engineers working on system reliability and machine learning. Participants are expected to have a basic understanding of computer architecture, hardware errors, and machine learning; for those who do not, we provide intuitive explanations during the tutorial. Anyone excited about ML for systems is welcome. No special equipment is needed, but bringing a laptop is strongly recommended for taking part in the hands-on competition and for taking notes.
– Overview of hardware failures in datacenters
– Motivation

– Background of memory failure
– Hierarchical memory failure prediction
– Conclusion and future work

– Introduce analysis of HBM errors in the field
– Introduce HBM failure prediction framework
– Introduce some techniques for reliable storage

– Overview of the competition
– Open discussion and wrap-up
Our competition provides participants with a dataset comprising memory system configurations, memory error logs, and failure tags. Using this dataset, participants devise solutions that predict whether individual DRAM modules will fail within a subsequent observation period. The competition comprises two stages. The initial stage features an AB List setup, which includes training data for two different memory models. In the second stage, a fresh dataset covering a mix of more than two memory models is introduced, which encourages solutions capable of few-shot learning and knowledge transfer. Overall, the competition’s appeal lies in its practical relevance, the accessible entry point of the initial stage, and the fresh challenges presented in both stages. More details of the competition can be found at Memory Failure Prediction Competition.
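To make the prediction task concrete, the following minimal Python sketch shows one common way to turn error logs and failure tags into labeled samples: count each module's errors in a lookback window before a cutoff time, then label the module positive if it fails within the following observation period. This is an illustration under stated assumptions, not the competition's actual data format or baseline. The `ErrorEvent` record, the `build_samples` function, the module identifiers, and the window lengths are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical error-log record; the real dataset schema may differ."""
    module_id: str
    timestamp: int  # assumed to be an integer time in arbitrary units


def build_samples(events, failure_times, cutoff, lookback, horizon):
    """Build {module_id: (error_count, label)} for a single cutoff time.

    error_count: number of errors logged in [cutoff - lookback, cutoff)
    label: 1 if the module fails in [cutoff, cutoff + horizon), else 0
    """
    counts = defaultdict(int)
    for ev in events:
        # Only errors observed before the cutoff may be used as features.
        if cutoff - lookback <= ev.timestamp < cutoff:
            counts[ev.module_id] += 1

    samples = {}
    for module_id, count in counts.items():
        fail_at = failure_times.get(module_id)
        failed_in_horizon = fail_at is not None and cutoff <= fail_at < cutoff + horizon
        samples[module_id] = (count, int(failed_in_horizon))
    return samples


# Toy data, invented purely for illustration.
events = [
    ErrorEvent("dimm-a", 10),
    ErrorEvent("dimm-a", 50),
    ErrorEvent("dimm-b", 40),
    ErrorEvent("dimm-a", 120),  # after the cutoff, so not counted as a feature
]
failure_times = {"dimm-a": 150}  # dimm-b never fails in this toy example

samples = build_samples(events, failure_times, cutoff=100, lookback=100, horizon=100)
print(samples)
```

Keeping the feature window strictly before the cutoff and the label window strictly after it avoids leaking future information into the features, which matters for any time-based failure prediction setup.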