Dynamic Page Retirement on DGX Systems

Tesla V100 GPUs use two technologies to handle memory errors and defects: Error Correcting Code (ECC) and Dynamic Page Retirement. As explained below, these technologies increase the reliability and durability of the system.

Error Correcting Code technology detects and corrects Single Bit Errors (SBEs), which can occur for a variety of reasons that are beyond the scope of this document. When an SBE is detected, ECC restores the memory location to its correct value “on the fly”, so customer applications continue to run unaffected. Each occurrence is recorded in the InfoROM, a device in the GPU that lets it keep track of the locations where SBEs have occurred.
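
To make the idea of single-bit correction concrete, here is a minimal sketch of a textbook SECDED code (Hamming(7,4) plus one overall parity bit). It is an illustration only: this document does not describe the ECC scheme the GPU actually uses, and the function names `encode` and `decode` are hypothetical. The sketch corrects any single flipped bit and flags two flipped bits as uncorrectable.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                              # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d      # data bits sit at positions 3, 5, 6, 7
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                 # makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of the positions holding a 1
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        code[syndrome] ^= 1                     # single flip: syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"            # two flips: detected, not fixable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                                    # simulate a single bit error
print(decode(word))                             # the original value is recovered
word[6] ^= 1                                    # a second flip in the same word
print(decode(word))                             # reported as uncorrectable
```

In this toy code a single flip is repaired transparently, which mirrors how an application keeps running after an SBE.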

Dynamic Page Retirement is a technology that disables the page of GPU memory containing a memory location that has presented errors. It is intended to handle two cases: Single Bit Errors that occur at the same location more than once, and Double Bit Errors (DBEs), which ECC can detect but cannot correct. Page Retirement reduces the total addressable memory by a negligible amount, considering that these GPUs have 16 or 32 GB of memory.
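
A quick calculation shows why the lost memory is negligible. The page size and the number of retired pages below are assumptions chosen for illustration (64 KiB pages, 64 pages retired); this document does not state either value, so check them against NVIDIA's documentation for your GPU. The capacities are treated as GiB for simplicity.

```python
KIB = 1024
GIB = 1024 ** 3

ASSUMED_PAGE_SIZE = 64 * KIB   # assumption, not stated in this document
ASSUMED_RETIRED_PAGES = 64     # assumption, not stated in this document


def retired_fraction(pages, page_size_bytes, total_memory_bytes):
    """Fraction of GPU memory made unavailable by retired pages."""
    return pages * page_size_bytes / total_memory_bytes


for total_gib in (16, 32):
    fraction = retired_fraction(
        ASSUMED_RETIRED_PAGES, ASSUMED_PAGE_SIZE, total_gib * GIB
    )
    print(f"{total_gib} GiB GPU: {fraction:.4%} of memory retired")
```

Under these assumptions the retired memory stays well below a tenth of a percent of either capacity.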

If two Single Bit Errors occur at the same memory location, the application continues to run, but the page containing that location is retired and will not be available the next time the system comes up. In this case, the error log (/var/log/kern.log) shows XID error 63. In the case of a Double Bit Error, the log shows XID errors 48 and 63, the application is killed to prevent data corruption, and the page containing that location is recorded in the GPU’s InfoROM for retirement (blacklisting).
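
The rules in this paragraph can be summarized as a small toy model. This is not driver code: the class name `RetirementModel`, its fields, and the 64 KiB page size are hypothetical and exist only to restate the behavior described above.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 64 * 1024  # assumed page size in bytes; not stated in this document


@dataclass
class RetirementModel:
    """Toy model of the retirement rules described above."""
    sbe_counts: dict = field(default_factory=dict)   # address -> SBEs seen so far
    pending_pages: set = field(default_factory=set)  # pages excluded after next boot

    def record(self, address, kind):
        """Record an 'SBE' or 'DBE' at address and return the modeled outcome."""
        page = address // PAGE_SIZE
        if kind == "DBE":
            # Uncorrectable: XIDs 48 and 63, application killed, page retired.
            self.pending_pages.add(page)
            return {"xids": [48, 63], "app_killed": True, "retired_page": page}
        if kind != "SBE":
            raise ValueError(f"unknown error kind: {kind!r}")
        self.sbe_counts[address] = self.sbe_counts.get(address, 0) + 1
        if self.sbe_counts[address] == 2:
            # Second SBE at the same location: XID 63, application keeps running.
            self.pending_pages.add(page)
            return {"xids": [63], "app_killed": False, "retired_page": page}
        # First SBE (or one on a page already marked): corrected by ECC only.
        return {"xids": [], "app_killed": False, "retired_page": None}


model = RetirementModel()
print(model.record(0x12340, "SBE"))   # first SBE: nothing retired
print(model.record(0x12340, "SBE"))   # second SBE at the same location
print(model.record(0x9ABCD0, "DBE"))  # double bit error
```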

After a Double Bit Error triggers a Page Retirement event, rebooting the system is recommended so that the driver can update its memory mapping and avoid using the newly retired pages. If pages are retired because of Single Bit Errors at the same memory location, an immediate reboot is not necessary: ECC continues to correct any SBEs that appear, so the reboot can wait for a maintenance window.
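
As a rough aid for applying this guidance, the sketch below scans kernel log lines for XID numbers and suggests how urgent a reboot is for each GPU. The regular expression assumes XID messages look like `NVRM: Xid (<device>): <number>, ...`; the exact message format can vary by driver version, the sample lines are illustrative rather than copied from a real log, and `reboot_advice` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed message shape: "NVRM: Xid (<device>): <number>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<xid>\d+)")


def reboot_advice(log_lines):
    """Map each device that logged XID 63 to a reboot recommendation."""
    xids_by_device = defaultdict(set)
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            xids_by_device[match["device"]].add(int(match["xid"]))

    advice = {}
    for device, xids in xids_by_device.items():
        if 63 not in xids:
            continue  # no page retirement recorded for this device
        if 48 in xids:
            advice[device] = "reboot recommended"  # retirement caused by a DBE
        else:
            advice[device] = "reboot at next maintenance window"  # repeated SBEs
    return advice


# Illustrative lines only; real messages carry additional text after the number.
sample = [
    "kernel: NVRM: Xid (PCI:0000:06:00): 63, (illustrative message text)",
    "kernel: NVRM: Xid (PCI:0000:07:00): 48, (illustrative message text)",
    "kernel: NVRM: Xid (PCI:0000:07:00): 63, (illustrative message text)",
]
print(reboot_advice(sample))
```

To run it against a real system, pass the open log file instead of the sample list, for example `reboot_advice(open("/var/log/kern.log"))`, and confirm the pattern matches the messages your driver writes.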

