NetScratch filesystem corruption

Wednesday, January 24, 2024 at 06:30

Storage

Resolved after 457h 30m of downtime. Monday, February 12, 2024 at 08:00

Solved (2024-02-12, 08:00) - The initialization of the underlying RAID volume has been completed.

Update (2024-01-27, 19:00) - Ordering for accounts of type staff, guest, stud and ueb has been re-enabled via the self-service portal.

Update (2024-01-26, 14:30) - While the system is accessible again, there may be further performance penalties until the RAID volume is fully initialized. When using the service, please keep its terms of use in mind.

Update (2024-01-26, 05:40) - The structure needed to access the D-ITET NetScratch has been created, so users can access it again. Links in itet-stor are available again.

Update (2024-01-25, 13:35) - The system is almost back in a usable state for the D-ITET NetScratch. The underlying RAID volume needs to be initialized, which will take several days. To give this initialization enough priority to advance, we will only release access to the NetScratch on 2024-01-26.

Update (2024-01-25, 11:25) - The volume sets of the RAID are initializing, which is a slow process. Folders on data-scratch-02 have been created.

Update (2024-01-25, 10:45) - Configuration verified and RAID volumes created. Filesystems are being recreated.

Update (2024-01-25, 09:45) - The defective RAID controller has been replaced. We are verifying the setup and will then proceed with recreating the service.

Update (2024-01-24, 14:45) - Data on data-scratch-01 (D-ITET NetScratch) and data-scratch-02 are not recoverable. The filesystems will be re-created after hardware replacement.

Update (2024-01-24, 13:15) - Because the Areca RAID controller is suspected of malfunctioning, it will be replaced. The replacement should be available tomorrow, 2024-01-25. After configuration, the filesystem will be re-bootstrapped and made available to users again.

Update (2024-01-24, 09:40) - A hiccup of the hardware RAID controller, combined with the high I/O load from the compute clusters, has corrupted the filesystem journaling metadata. At this point we have to assume that no data can be recovered. The system will be rebuilt after further investigation.

Problem (2024-01-24, 07:15) - The NetScratch server has been taken offline due to filesystem corruption on the scratch data. The system is currently not available. At this time, data loss cannot be ruled out.

Last updated: Monday, February 12, 2024 at 08:59