HPCA Tutorial

Hardware Failure Prediction for Cloud Service Reliability

Tutorial Overview

In ever-growing datacenters, millions of servers operate together to meet the rising demand for cloud computing, big data storage, and AI computing tasks. With numerous tasks, including resource-intensive AI computations, running on these servers, hardware failures significantly impact the servers’ Reliability, Availability and Serviceability (RAS). In this tutorial, we explore cutting-edge methodologies for predicting hardware failures in cloud environments to ensure service continuity and optimal performance.

Target

The tutorial targets researchers and engineers working on system reliability and machine learning. Participants are expected to have a basic understanding of computer architecture, hardware errors, and machine learning; for those who do not, we provide intuitive explanations during the tutorial. Anyone excited about ML for systems is welcome. No special equipment is needed, but we strongly recommend bringing a laptop for taking notes and participating in the hands-on competition.

1. Introduction (Min Zhou, 13:10 – 13:30)

– Overview of hardware failures in datacenters
– Motivation

2. Memory Failure Prediction (Qiao Yu, 13:30 – 14:00)

– Background of memory failure
– Hierarchical memory failure prediction
– Conclusion and future work

3. HBM Failure Prediction and Reliable Storage System (Zhirong Shen, 14:00 – 14:30)

– Analysis of HBM errors in the field
– HBM failure prediction framework
– Techniques for reliable storage

4. Competition Introduction (Min Zhou, 14:30 – 15:00)

– Overview of the competition
– Open discussion and wrap-up

5. Coffee Break (15:00 – 15:30)

6. Hands-on Online Competition (15:30 – 17:00)

Organizing Committee

  • Qiao Yu (Technical University of Berlin, Germany)
  • Min Zhou (Huawei Technologies Co., Ltd, China)
  • Zhirong Shen (Xiamen University, Xiamen, China)

Competition Committee

  • Min Zhou (Huawei Technologies Co., Ltd, China)
  • Qiao Yu (Technical University of Berlin, Germany)
  • Hongyi Xie (Huawei Technologies Co., Ltd, China)
  • Jialiang Yu (Huawei Technologies Co., Ltd, China)
  • Wengui Zhang (Huawei Technologies Co., Ltd, China)
  • Stefano Mauceri (Huawei Technologies Co., Ltd, Ireland)
  • Ping Liu (Huawei Technologies Co., Ltd, China)
  • Zhenli Sheng (Huawei Technologies Co., Ltd, China)

The competition provides participants with a dataset comprising memory system configurations, memory error logs, and failure tags. With this dataset, participants can devise solutions for predicting potential failures of individual DRAM modules within a subsequent observation period. The competition comprises two stages. The first stage features an AB List setup, which includes training data tailored for two different memory models. In the second stage, a fresh dataset covering mixed models (more than two) will be introduced, which encourages solutions with few-shot learning and knowledge-transfer capabilities. Overall, the competition’s appeal lies in its practical relevance, the accessible entry point of the first stage, and the fresh challenges presented in both stages. More details can be found on the Memory Failure Prediction Competition page.
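To make the prediction task concrete, the following minimal Python sketch shows one way to turn error logs and failure tags into a training sample for a single DRAM module. It is an illustration under stated assumptions, not the competition's actual data format or baseline: the record layout, the module names, the "CE" (correctable error) type, and the `build_sample` helper with its `lookback` and `horizon` parameters are all hypothetical.

```python
from collections import defaultdict

# Hypothetical error log records: (module_id, timestamp_in_hours, error_type).
# The real competition schema may differ.
ERROR_LOG = [
    ("dimm_a", 1, "CE"),
    ("dimm_a", 5, "CE"),
    ("dimm_a", 9, "CE"),
    ("dimm_b", 2, "CE"),
]

# Hypothetical failure tags: module_id -> failure time in hours.
FAILURE_TAGS = {"dimm_a": 20}


def build_sample(module_id, now, lookback, horizon, error_log, failure_tags):
    """Return (features, label) for one module at prediction time `now`."""
    counts = defaultdict(int)
    for mod, ts, err_type in error_log:
        # Count only errors observed inside the lookback window (now - lookback, now].
        if mod == module_id and now - lookback < ts <= now:
            counts[err_type] += 1
    fail_time = failure_tags.get(module_id)
    # Label is 1 if the module fails inside the window (now, now + horizon].
    label = int(fail_time is not None and now < fail_time <= now + horizon)
    return dict(counts), label


features_a, label_a = build_sample("dimm_a", 10, 8, 12, ERROR_LOG, FAILURE_TAGS)
features_b, label_b = build_sample("dimm_b", 10, 8, 12, ERROR_LOG, FAILURE_TAGS)
print(features_a, label_a)
print(features_b, label_b)
```

Samples built this way could be fed to any standard classifier. Features are computed only from events at or before the prediction time, so the label never leaks into the input.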