The ever-growing scale of high-performance computing systems, particularly with the transition to exascale computing, has underscored the critical need for robust fault tolerance. As these systems ...