Dataset

The dataset has four files:

train.zip / test.zip contain the log files, each named by the serial_number of a DIMM. The logs are collected via mcelog and have 23 columns, defined as follows:

Field                Type      Description
cpuid                integer   The CPU ID; a server has multiple CPUs
channelid            integer   The channel ID; a CPU has multiple channels
dimmid               integer   The DIMM ID; a channel has multiple DIMMs
rankid               integer   The rank ID, 0 or 1; each DIMM has 1 or 2 ranks
deviceid             integer   The device ID, ranging from 0 to 17; each DIMM has multiple devices
bankgroupid          integer   The bank group ID of the DRAM
bankid               integer   The bank ID of the DRAM
rowid                integer   The row ID of the DRAM
columnid             integer   The column ID of the DRAM
retryrderrlogparity  integer   The parity information for retried reads, in decimal format
retryrderrlog        integer   Log information for retried reads, in decimal format, used to validate error types and parity.
                               Hint: if retryrderrlog & 0x0001 equals 1, RETRY_RD_ERR_LOG_PARITY is valid.
burst_info           integer   The decoded parity error bits in memory DQ and beat
error_type           integer   The error type, including read errors and scrubbing errors
log_time             string    The timestamp at which the error was detected
manufacturter        category  The server manufacturer, in anonymized format
model                category  The CPU model, in anonymized format
PN                   category  The part number of the DIMM, in anonymized format
Capacity             integer   The capacity of the DIMM
FrequencyMHz         integer   Base frequency of the CPU resource, in MHz
MaxSpeedMHz          integer   Maximum frequency of the CPU resource, in MHz
McaBank              category  Machine Check Architecture bank code of the CPU
memory_type          string    The type of DIMM, e.g., DDR4
region               category  The region where the server is located, in anonymized format
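
The hint for retryrderrlog can be sketched in Python as follows. This is a minimal illustration rather than the organizers' reference code: the function name parity_is_valid is hypothetical, and the input is assumed to be the decimal integer stored in the retryrderrlog column.

```python
def parity_is_valid(retryrderrlog: int) -> bool:
    """Return True if the lowest bit of retryrderrlog is set.

    Per the hint above, a set lowest bit indicates that
    RETRY_RD_ERR_LOG_PARITY is valid.
    """
    return (retryrderrlog & 0x0001) == 1


# Made-up register values, for illustration only.
for value in (0, 1, 2, 3):
    print(value, parity_is_valid(value))
```

Only the lowest bit is tested here; the meaning of the remaining bits is not described in this section, so the sketch leaves them alone.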

failure_ticket.csv lists the failures of each DIMM in this dataset in 3 columns, defined as follows:

Field               Type    Description
serial_number       string  The DIMM ID
failure_time        string  The timestamp at which the DIMM failure occurred
serial_number_type  string  The server type associated with the failure, either A or B
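
A minimal sketch of reading the failure labels with Python's standard csv module is shown below. It assumes the file is comma-separated with a header row matching the three field names above; the helper name load_failure_tickets and the sample rows are hypothetical. Because the exact format of failure_time is not specified here, the sketch keeps it as a string.

```python
import csv
import io
from collections import defaultdict


def load_failure_tickets(handle):
    """Map each serial_number to a list of (failure_time, serial_number_type)."""
    tickets = defaultdict(list)
    for row in csv.DictReader(handle):
        tickets[row["serial_number"]].append(
            (row["failure_time"], row["serial_number_type"])
        )
    return dict(tickets)


# Made-up rows standing in for failure_ticket.csv; real values will differ.
sample = io.StringIO(
    "serial_number,failure_time,serial_number_type\n"
    "sn_1xx,1708987825,A\n"
    "sn_2xx,1716206512,B\n"
)
tickets = load_failure_tickets(sample)
print(tickets["sn_1xx"])
```

For the real file, the same function could be called on open("failure_ticket.csv", newline=""). A list per DIMM is used so that more than one ticket for the same serial_number is not silently overwritten.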

Note that we have anonymized the DIMM manufacturer, part number, etc. to prevent sensitive information from being inferred.

submission.csv includes the serial_number (sn_name), serial_number_type, and prediction_timestamp, where prediction_timestamp is the time at which a DIMM failure is predicted. The submission format allows multiple timestamp predictions per DIMM, as shown below:

sn_name  prediction_timestamp  serial_number_type
sn_1xx   1708987825            A
sn_1xx   1708987853            A
sn_2xx   1716206512            B
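
The rows above could be written out with a sketch like the following. It assumes a comma-separated file with a header row in the column order shown and integer timestamps; the helper name write_submission is hypothetical, and sample_submission.csv remains the authoritative reference for the exact layout.

```python
import csv
import io

FIELDS = ["sn_name", "prediction_timestamp", "serial_number_type"]


def write_submission(handle, predictions):
    """Write (sn_name, prediction_timestamp, serial_number_type) rows as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(FIELDS)
    for sn_name, timestamp, sn_type in predictions:
        writer.writerow([sn_name, int(timestamp), sn_type])


# The example rows from the table above; a DIMM may appear more than once.
predictions = [
    ("sn_1xx", 1708987825, "A"),
    ("sn_1xx", 1708987853, "A"),
    ("sn_2xx", 1716206512, "B"),
]
buffer = io.StringIO()
write_submission(buffer, predictions)
print(buffer.getvalue())
```

To produce a file instead of an in-memory buffer, pass open("submission.csv", "w", newline="") as the handle.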

Files

  • train.zip - the training set
  • test.zip - the test set
  • failure_ticket.csv - the failure labels
  • sample.zip - the sample data
  • sample_submission.csv - a sample submission file in the correct format
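
Since the log files in train.zip and test.zip are named by serial_number, one way to enumerate them without extracting the archive is sketched below. The internal folder layout and file extension are not described in this section, so the member names in the demo are made up, and the helper name serial_numbers_in_archive is hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath


def serial_numbers_in_archive(archive):
    """Return {serial_number: member_name} for each file in a zip archive.

    The serial number is taken to be the file name without its extension.
    """
    with zipfile.ZipFile(archive) as zf:
        return {
            PurePosixPath(info.filename).stem: info.filename
            for info in zf.infolist()
            if not info.is_dir()
        }


# Build a tiny in-memory archive with made-up names to exercise the function.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("type_A/sn_1xx.csv", "")
    zf.writestr("type_B/sn_2xx.csv", "")
demo.seek(0)
members = serial_numbers_in_archive(demo)
print(sorted(members))
```

The function also accepts a path such as "train.zip". If the same serial number appears under more than one folder, the later member overwrites the earlier one in the returned dictionary, so a real pipeline may want to key on the full member name instead.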