Episode Details

204: WWF - Wayland, Weston, and FreeBSD

Published 8 years, 8 months ago

Description

In this episode, we clear up the myth about scrub of death, look at Wayland and Weston on FreeBSD, Intel QuickAssist is here, and we check out OpenSMTP on OpenBSD.

This episode was brought to you by

Headlines

Matt Ahrens answers questions about the “Scrub of Death”

In working on the breakdown of that ZFS article last week, Matt Ahrens contacted me and provided some answers he has given to questions in the past, allowing me to answer them using HIS exact words.
“ZFS has an operation, called SCRUB, that is used to check all data in the pool and recover any data that is incorrect. However, if a bug which make errors on the pool persist (for example, a system with bad non-ecc RAM) then SCRUB can cause damage to a pool instead of recover it. I heard it called the “SCRUB of death” somewhere. Therefore, as far as I understand, using SCRUB without ECC memory is dangerous.” > I don't believe that is accurate. What is the proposed mechanism by which scrub can corrupt a lot of data, with non-ECC memory? > ZFS repairs bad data by writing known good data to the bad location on disk. The checksum of the data has to verify correctly for it to be considered "good". An undetected memory error could change the in-memory checksum or data, causing ZFS to incorrectly think that the data on disk doesn’t match the checksum. In that case, ZFS would attempt to repair the data by first re-reading the same offset on disk, and then reading from any other available copies of the data (e.g. mirrors, ditto blocks, or RAIDZ reconstruction). If any of these attempts results in data that matches the checksum, then the data will be written on top of the (supposed) bad data. If the data was actually good, then overwriting it with the same good data doesn’t hurt anything. > Let's look at what will happen with 3 types of errors with non-ECC memory: > 1. Rare, random errors (e.g. particle strikes - say, less than one error per GB per second). If ZFS finds data that matches the checksum, then we know that we have the correct data (at least at that point in time, with probability 1-1/2^256). If there are a lot of memory errors happening at a high rate, or if the in-memory checksum was corrupt, then ZFS won’t be able to find a good copy of the data , so it won’t do a repair write. It’s possible that the correctly-checksummed data is later corrupted in memory, before the repair write. However, the window of vulnerability is very very small - on the order of milliseconds between when the checksum is verified, and when the write to disk completes. It is implausible that this tiny window of memory vulnerability would be hit repeatedly. > 2. Memory that pretty much never does the right thing. (e.g. huge rate of particle strikes, all memory always reads 0, etc). In this case, critical parts of kernel memory (e.g. instructions) will be immediately corrupted, causing the system to panic and not be able to boot again. > 3. One or a few memory locations have "stuck bits", which always read 0 (or always read 1). This is the scenario discussed in the message which (I believe) originally started the "Scrub of Death" myth: https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/ This assumes that we read in some data from disk to a memory location with a stuck bit, "correct" that same bad memory location by overwriting the mem

Episode Details

204: WWF - Wayland, Weston, and FreeBSD

Description

This episode was brought to you by

Headlines

Matt Ahrens answers questions about the “Scrub of Death”

Listen Now

Love PodBriefly?