APEI error types

Resources

Introduction

This page outlines the APEI error types and how the Linux kernel handles each of them. It focuses on the AArch64 platform but also takes x86-specific features into consideration. Because the APEI subsystem for ARM64 is still being worked on, the following chapters assume the x86 equivalent for features that are not yet available on AArch64, e.g. the NMI notification type.


Error types

Generic Hardware Error Source (GHES) is the error type, not specific to x86, that a platform may report to the kernel. Its architecture-agnostic nature makes GHES the most heavily used error type on AArch64.

GHES section types supported by the current Linux kernel:

  • Memory error
  • PCI Express error
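As a rough illustration of how a GHES section can be told apart, the sketch below maps the section type GUID carried in each error record section to one of the two section types listed above. The GUID values are the ones the UEFI specification assigns to the platform memory and PCI Express error sections; the names `SECTION_NAMES` and `section_kind` are made up for this example and are not kernel identifiers.

```python
import uuid

# Section type GUIDs defined for CPER error record sections in the UEFI spec.
PLATFORM_MEMORY = uuid.UUID("a5bc1114-6f64-4ede-b863-3e83ed7c83b1")
PCIE = uuid.UUID("d995e954-bbc1-430f-ad91-b44dcb3c6f35")

# Hypothetical lookup table: section type GUID -> name used on this page.
SECTION_NAMES = {
    PLATFORM_MEMORY: "Memory error",
    PCIE: "PCI Express error",
}


def section_kind(section_type):
    """Return the section name for a GUID (string or UUID), or 'unsupported'."""
    guid = uuid.UUID(str(section_type))  # normalises case and input type
    return SECTION_NAMES.get(guid, "unsupported")
```

Any section whose GUID is not in the table is reported as unsupported in this model; what a real kernel does with such sections is outside the scope of the sketch.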

Memory error

On a platform that supports APEI, the first step is to discover as many HEST entry types as possible and register them with the corresponding drivers. GHES error sources are registered with the GHES driver. In that case the kernel needs to allocate the resources associated with the notification type and register the error source with the EDAC framework (if supported). From then on, an occurrence of the corresponding error can be diagnosed by the platform firmware and then analysed on the kernel side. Error inspection and the OS recovery process depend on the error severity. There are three severity types:

  • Corrected - errors that have been noted and need to be captured, but the system can continue to operate because the underlying cause has been repaired,
  • Uncorrected - errors that the hardware could not recover from. They can be broken down further:
    • Non-fatal - where the system can continue to operate, perhaps in a degraded state,
    • Fatal - where the safest thing to do is halt and/or restart the system.
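The three severities above can be related to the numeric error severity field of a CPER record. The sketch below assumes the encoding from the UEFI specification (0 recoverable, 1 fatal, 2 corrected, 3 informational); the informational value falls outside the three types described on this page and is included only for completeness. The names `CperSeverity` and `classify` are illustrative.

```python
from enum import IntEnum


class CperSeverity(IntEnum):
    """Error severity values as encoded in CPER records (UEFI spec)."""
    RECOVERABLE = 0
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


def classify(severity):
    """Map a raw CPER severity code to the categories used on this page."""
    sev = CperSeverity(severity)  # raises ValueError for unknown codes
    if sev == CperSeverity.CORRECTED:
        return "corrected"
    if sev == CperSeverity.RECOVERABLE:
        return "uncorrected, non-fatal"
    if sev == CperSeverity.FATAL:
        return "uncorrected, fatal"
    return "informational"
```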

For fatal memory errors conveyed to the kernel via NMI, the kernel tries to print the error record for the user and then panics (control flow ends at the "Memory error" stage, see picture below). Non-fatal and corrected memory errors undergo deeper inspection. They are passed to the corresponding APEI EDAC driver which, in turn, records a trace point for the RAS daemon and takes immediate rescue actions. The main role of the GHES driver is to recognise the error severity and forward the error to the appropriate entity. Rather than repairing the error itself, it provides useful information to consumers such as EDAC, x86-specific drivers, or the RAS daemon. GHES memory error handling cooperates very tightly with the EDAC driver. The RAS daemon can also take advantage of EDAC features during further memory error investigation.
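The routing described in this paragraph can be modelled in a few lines. This is a simplified sketch of the control flow only, not the GHES driver's code: the function name `route_memory_error`, the severity strings, and the step labels are all invented for illustration.

```python
def route_memory_error(severity):
    """Return the ordered handling steps for a memory error.

    `severity` is one of "corrected", "non-fatal" or "fatal"
    (hypothetical labels for the three severity types).
    """
    if severity == "fatal":
        # Fatal errors are only reported, then the system panics.
        return ["print error record", "panic"]
    if severity in ("corrected", "non-fatal"):
        # Everything else is handed on for deeper inspection.
        return [
            "forward to EDAC driver",
            "record trace point for RAS daemon",
            "take immediate rescue actions",
        ]
    raise ValueError(f"unknown severity: {severity!r}")
```

The point of the model is that the GHES side only classifies and forwards; any actual recovery work happens in the entities it forwards to.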

[Figure: Memory error]

Memory error - kernel requirements

For proper memory error handling via the ACPI path, the kernel needs the APEI subsystem and the EDAC driver, which must be provided with DMI information. Kernel config: ACPI_APEI, ACPI_APEI_GHES, EDAC_MM_EDAC, EDAC_GHES, DMI. This is the basic setup; it can be extended with additional drivers, as was done for x86.
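One way to check a kernel build against this list is to scan its configuration file for the options named above. The helper below is a minimal sketch that assumes the usual `.config` syntax (`CONFIG_<NAME>=y` or `=m` for enabled options, `# CONFIG_<NAME> is not set` for disabled ones); `REQUIRED` and `missing_options` are names chosen for this example.

```python
# Options listed on this page, without the CONFIG_ prefix.
REQUIRED = ("ACPI_APEI", "ACPI_APEI_GHES", "EDAC_MM_EDAC", "EDAC_GHES", "DMI")


def missing_options(config_text):
    """Return the required options that are not enabled in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_DMI is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.startswith("CONFIG_") and value in ("y", "m"):
            enabled.add(name[len("CONFIG_"):])
    return [opt for opt in REQUIRED if opt not in enabled]
```

For example, passing the contents of a build's `.config` file to `missing_options` returns an empty list when all five options are enabled.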

Note: Currently there is no equivalent of the NMI notification type for ARM. NMI is essential for low-latency error propagation and is under investigation.

LEG/ServerArchitecture/RAS/APEIErrorTypes (last modified 2017-08-17 12:12:49)