A journal for MD/RAID5
While hardware-RAID solutions suffer from the lack of transparency and flexibility that so often come with closed devices, they have two particular advantages. First, a separate computer brings dedicated processing power and I/O-bus capacity which takes some load off the main system, freeing it for other work. At the very least, the system CPU will never have to perform the XOR calculations required to generate the parity block, and the system I/O bus will never have to carry that block from memory to a storage device. As commodity hardware has increased in capability and speed over the years, though, this advantage has been significantly eroded.
The second advantage is non-volatile memory (NVRAM). While traditional commodity hardware has not offered much NVRAM because it would hardly ever be used, dedicated RAID controllers nearly always have NVRAM as it brings real benefits in both performance and reliability. Utilizing NVRAM provides more than just the incremental benefits brought by extra processing components. It allows changes in data management that can yield better performance from existing devices.
With recent developments, non-volatile memory is becoming a reality on commodity hardware, at least on server-class machines, and it is becoming increasingly easy to attach a small solid-state storage device (SSD) to any system that manages a RAID array. So the time is ripe for MD/RAID5 to benefit from the ability to manage data in the ways that NVRAM allows. Some engineers from Facebook, particularly Shaohua Li and Song Liu, have been working toward this end; Linux 4.4 will be the first mainline release to see the fruits of that labor.
Linux 4.4 — closing the RAID5 write hole
RAID5 (and related levels such as RAID4 and RAID6) suffer from a potential problem known as the "write hole". Each "stripe" on such an array — meaning a set of related blocks, one stored on each active device — will contain data blocks and parity blocks; these must always be kept consistent. The parity must always be exactly what would be computed from the data. If this is not the case then reconstructing the data that was on a device that has failed will produce incorrect results.
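To make the arithmetic concrete: for RAID5 the parity block is the byte-wise XOR of the data blocks in the stripe, and a block from a failed device is rebuilt by XORing the parity with the surviving data blocks. A minimal sketch of that calculation (plain C, just the arithmetic, not the MD driver's actual code) looks like this:

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096   /* MD currently journals 4KB blocks */

    /* Parity for a RAID5 stripe: the XOR of all of its data blocks. */
    static void compute_parity(uint8_t *parity, uint8_t *const data[],
                               size_t ndata)
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            uint8_t p = 0;

            for (size_t d = 0; d < ndata; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }

    /*
     * Rebuild the block that was on a failed device by XORing the parity
     * with the surviving data blocks.  The result is only correct if the
     * parity is consistent with the data -- exactly the property that the
     * write hole can violate.
     */
    static void reconstruct_missing(uint8_t *missing, const uint8_t *parity,
                                    uint8_t *const surviving[], size_t nsurv)
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            uint8_t v = parity[i];

            for (size_t d = 0; d < nsurv; d++)
                v ^= surviving[d][i];
            missing[i] = v;
        }
    }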
In reality, stripes are often inconsistent, though only for very short intervals of time. As the drives in an array are independent (that is the "I" of RAID) they cannot all be updated atomically. When any change is made to a stripe, this independence will almost certainly result in a moment when data and parity are inconsistent. Naturally the MD driver understands this and would never try to access data during that moment of inconsistency ... unless....
Problems occur if a machine crash or power failure causes an unplanned shutdown. It is fairly easy to argue that the likelihood that an unclean shutdown would interrupt some writes but not others is extremely small. It's not easy to argue that such a circumstance could never happen, though. So when restarting from an unclean shutdown, the MD driver must assume that the failure may have happened during a moment of inconsistency and, thus, the parity blocks cannot be trusted.

If the array is still optimal (no failed devices) it will recalculate the parity on any stripe that could have been in the middle of an update. If, however, the array is degraded, the parity cannot be recalculated. If some blocks in a stripe were updated and others weren't, then the block that was on the failed device will be reconstructed based on inconsistent information, leading to data corruption. To handle this case, MD will refuse to assemble the array without the "--force" flag, which effectively acknowledges that data might be corrupted.
An obvious way to address this issue is to use the same approach that has worked so well with filesystems: write all updates to a journal before writing them to the main array. When the array is restarted, any data and parity blocks still in the journal are simply written to the array again. This ensures the array will be consistent whether it is degraded or not. This could be done with a journal on a rotating-media drive but the performance would be very poor indeed. The advent of large NVRAM and SSDs makes this a much more credible proposition.
The new journal feature
The functionality developed at Facebook does exactly this. It allows a journal device (sometimes referred to as a "cache" or "log" device) to be configured with an MD/RAID5 (or RAID4 or RAID6) array. This can be any block device and could even be a mirrored pair of SSDs (because you wouldn't want the journal device to become a single point of failure).
To try this out you would need Linux 4.4-rc1 or later, and the current mdadm from git://neil.brown.name/mdadm. Then you can create a new array with a journal using a command like:

    mdadm --create /dev/md/test --level=5 --raid-disks=4 \
          --write-journal=/dev/loop9 /dev/loop[0-3]
It is not currently possible to add a journal to an existing array, but that functionality is easy enough to add later.
With the journal in place, RAID5 handling will progress much as it normally does, gathering write requests into stripes and calculating the parity blocks. Then, instead of being written to the array, the stripe is intercepted by the journaling subsystem and queued for the journal instead. When write traffic is sufficiently heavy, multiple stripes will be grouped together into a single transaction and written to the journal with a single metadata block listing the addresses of the data and parity. Once this transaction has been written and, if necessary, flushed to stable storage, the core RAID5 engine is told to process the stripe again, and this time the write-out is not intercepted.
When the write to the main array completes, the journaling subsystem will be told; it will occasionally update its record of where the journal starts so that data that is safe on the array effectively disappears from the journal. When the array is shut down cleanly, this start-of-journal pointer is set to an empty transaction with nothing following. When the array is started, the journal is inspected and if any transactions are found (with both data and parity) they are written to the array.
The journal metadata block uses 16 bytes per data block and so can describe well over 200 blocks. Along with each block's location and size (currently always 4KB), the journal metadata records a checksum for each data block. This, together with a checksum on the metadata block itself, allows very reliable determination of which blocks were successfully written to the journal and so should be copied to the array on restart.
In general, the journal consists of an arbitrarily large sequence of metadata blocks and associated data and parity blocks. Each metadata block records how much space in the journal is used by the data and parity and so indicates where the next metadata block will be, if it has been written. The address of the first metadata block to be considered on restart is stored in the standard MD/RAID superblock.
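To make that layout a little more concrete, here is a simplified sketch of what one transaction's metadata might look like. The structure and field names are illustrative assumptions for this article, not the kernel's actual on-disk format:

    #include <stdint.h>

    /*
     * Illustrative sketch only -- not the real on-disk format used by the
     * MD journal code.  One 16-byte entry describes each data or parity
     * block stored in the transaction.
     */
    struct journal_entry {
        uint64_t array_sector;   /* where the block belongs on the array */
        uint32_t length;         /* currently always 4KB */
        uint32_t checksum;       /* checksum of the block's contents */
    };

    /*
     * A transaction: one metadata block followed by the data and parity
     * blocks it describes.  The space consumed by those blocks tells the
     * reader where the next metadata block, if any, begins.  On restart,
     * replay starts at the position recorded in the MD superblock and
     * stops at the first metadata block (or block entry) whose checksum
     * does not verify, since that transaction was never fully written.
     */
    struct journal_meta {
        uint32_t magic;          /* identifies a metadata block */
        uint32_t checksum;       /* covers this metadata block itself */
        uint64_t seq;            /* transaction sequence number */
        uint32_t nr_entries;
        struct journal_entry entries[];
    };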
The net result of this is that, while writes to the array might be slightly slower (depending on how fast the journal device is), a system crash never results in a full resync — only a short journal recovery — and there is no chance of data corruption due to the write hole.
Given that the write-intent bitmap already allows resynchronization after a crash to be fairly quick, and that write-hole corruption is, in practice, very rare, you may wonder if this is all worth the cost. Undoubtedly different people will assess this tradeoff differently; now at least the option is available once that assessment is made. But this is not the full story. The journal can provide benefits beyond closing the write hole. That was a natural place to start as it is conceptually relatively simple and provides a context for creating the infrastructure for managing a journal. The more interesting step comes next.
The future: writeback caching and more full-stripe writes
While RAID5 and RAID6 provide a reasonably economical way to combine multiple devices to provide large storage capacity with a reduced chance of data loss, they do come at a cost. When the host system writes a full stripe's worth of data to the array, the parity can be calculated from that data and all writes can be scheduled almost immediately, leading to very good throughput. When writing less than a full stripe, though, throughput drops dramatically.
In that case, some data or parity blocks need to be read from the array before the new parity can be calculated. This read-before-write introduces significant latency to each request, so throughput suffers. The MD driver tries to delay partial-stripe writes a little bit in the hope that the rest of the stripe might be written soon. When this works, it helps a lot. When it doesn't, it just increases latency further.
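For a single-block update, the usual shortcut (often called read-modify-write) is to read the old copy of the data block and the old parity and fold the difference directly into the parity. A small sketch of just that arithmetic, again not MD's internal code:

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096

    /*
     * Read-modify-write update of one data block in a RAID5 stripe:
     * new parity = old parity ^ old data ^ new data.  Updating a single
     * block therefore costs two reads (old data, old parity) and two
     * writes, whereas a full-stripe write needs no reads at all -- which
     * is why MD tries to gather full stripes before writing.
     */
    static void update_parity_rmw(uint8_t *parity,        /* updated in place */
                                  const uint8_t *old_data,
                                  const uint8_t *new_data)
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }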
It is possible for a filesystem to help to some extent, and to align data with stripes to increase the chance of a full-stripe write, but that is far from a complete solution. A journal can make a real difference here by being managed as a writeback cache. Data can be written to the journal and the application can be told that the data is safe before the RAID5 engine even starts considering whether some pre-reading might be needed to be able to update parity blocks.
This allows the application to see very short latencies no matter what data-block pattern is being written. It also allows the RAID5 core to delay writes even longer, hoping to gather full stripes, without inconveniencing the application. This is something that dedicated RAID controllers have (presumably) been doing for years, and hopefully something that MD will provide in the not-too-distant future.
There are plenty of interesting questions here, such as whether to keep all in-flight data in main memory, or to discard it after writing to the journal and to read it back when it is time to write to the RAID. There is also the question of when to give up waiting for a full stripe and to perform the necessary pre-reading. Together with all this, a great deal of care will be needed to ensure we actually get the performance improvements that theory suggests are possible.
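One way to picture that trade-off is as a small destaging policy: acknowledge a write once it is safe in the journal, then flush a stripe to the array either when it has become complete or when it has waited long enough. The sketch below is purely conceptual, with invented names; it is not a proposed or actual MD interface:

    #include <stdbool.h>
    #include <stdint.h>

    /* Conceptual sketch only; the structure and names are invented. */
    struct pending_stripe {
        uint64_t first_sector;       /* where the stripe starts on the array */
        unsigned blocks_cached;      /* data blocks already in the journal   */
        unsigned blocks_per_stripe;  /* data blocks in a full stripe         */
        uint64_t oldest_write;       /* time the first cached block arrived  */
    };

    /* Decide whether a journalled stripe should be destaged to the array. */
    static bool should_destage(const struct pending_stripe *s,
                               uint64_t now, uint64_t max_wait)
    {
        if (s->blocks_cached == s->blocks_per_stripe)
            return true;   /* full stripe: parity needs no pre-reading */
        if (now - s->oldest_write > max_wait)
            return true;   /* stop waiting; accept the read-modify-write cost */
        return false;
    }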
This is just engineering though. There is interest in this from both potential users of the technology and vendors of NVRAM hardware, and there is little doubt that we will see the journal enhanced to provide very visible performance improvements to complement the nearly invisible reliability improvements already achieved.
Index entries for this article:
    Kernel: Block layer/RAID
    Kernel: RAID
    GuestArticles: Brown, Neil
Posted Nov 24, 2015 22:50 UTC (Tue)
by nix (subscriber, #2304)
[Link] (25 responses)
I'll admit I'm surprised that SSD write speeds and lifetimes are good enough for something like this: I was betting they would never get as fast as rotating rust at writes, but clearly they've got there. (Hardware RAID controllers in my experience generally use battery-backed DRAM, not NVRAM -- so rather than having a problem with the NVRAM degrading through write load, we have a problem with the battery dying!) I do wonder about sustained writes, though -- a lot of SSDs can only keep up with high write rates in brief bursts, artificially throttling afterwards. If I'm doing, say, vapoursynth work, I can easily shuffle a couple of terabytes of writes of huffyuv-compressed video around the array in the intermediate pipeline stages: anything that slows that down can easily add hours to the processing pipeline, which is slow enough as it is. (Obviously this data is inherently transient, so preserving any of it against power losses is a total waste of time -- so probably you'd relegate it to a non-journalled array in any case. But one can envisage workloads with very high write loads that are not inherently transient -- you'd want write hole protection for those even more, since if the array is writing for hours non-stop the chance of a write hole is climbing to non-insignificant levels!)
If you only have one SSD, can you split it between this and bcache somehow? Maybe dm-linear, since cutting block devices in two is more or less what it's meant for. They both seem worthwhile things to have in a RAID-enabled box.
Probably, though, bcache and good backups would be the first priority, unless the system needed very high uptime: as you say, the write hole is a rare occurrence under normal conditions. Personally I always assemble with --force in my initramfs, specifically to ensure that I come up after a crash even if unattended, then rely on fsck to clean up the worst of any write-hole damage and should it be needed then do a giant find | cmp (roughly) against the most recent (FUSE-mounted bup) backups to identify any corruption. (I've never had scattershot corruption due to the write hole, but when I've had it in other circumstances this has found all corruption in a few hours, bounded purely by disk I/O time. You *do* need very frequent backups to get away with this, though, which people only generally do after they get burned. I started doing that sort of thing after the unfortunate ext4 journal metadata corruption incident of a few years back. Remember, this one: <https://lwn.net/Articles/521803/>.)
Posted Nov 24, 2015 23:18 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (19 responses)
I always understood the term "NVRAM" to include battery-backed RAM (both D and S), flash, memristors (are they a thing yet?), bubble memory, and anything else which is memory and isn't volatile.
I agree that most of the devices that this journalling gets used on will probably be of the "flash" family, but even there I understand that technology is moving quickly. It would certainly be valuable to hear reports from people who try this out on different devices.
> If you only have one SSD, can you split it between this and bcache somehow?
I would recommend partitions with cfdisk. Certainly lvm could do it too - and with greater flexibility. You just need a block device.
Posted Nov 24, 2015 23:44 UTC (Tue)
by nix (subscriber, #2304)
[Link] (18 responses)
I never thought of partitioning an SSD, but I suppose if it looks like a disk, you can partition it! The question is whether the kernel can identify partitions on all sorts of block devices, or whether this is restricted to only a subset of them. (I have no idea, I've never checked the code and have no real idea how that machinery works. It clearly doesn't run on *all* block devices or you wouldn't have had to do anything special to make partitioned md work... but md has always been rather special with its semi-dynamic major numbers etc, so maybe this was something related to that specialness.)
Posted Nov 25, 2015 1:32 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (4 responses)
The block device driver needs to opt-in to the kernel's generic partition support.
Two drivers I know of which don't support either static or dynamic allocations of partition devices are "loop" and "dm".
Many SSDs register under the "sd" driver and so get full partition support.
Posted Nov 25, 2015 4:08 UTC (Wed)
by ABCD (subscriber, #53650)
[Link] (3 responses)
Posted Nov 25, 2015 4:15 UTC (Wed)
by ABCD (subscriber, #53650)
[Link]
Posted Nov 25, 2015 22:51 UTC (Wed)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Nov 26, 2015 2:03 UTC (Thu)
by ABCD (subscriber, #53650)
[Link]
Posted Dec 7, 2015 19:24 UTC (Mon)
by nix (subscriber, #2304)
[Link] (12 responses)
Actually you can stick enough RAM in it that it's questionable if you need an SSD at all, even for bcache: just partition this thing and use some of it for the RAID write hole avoidance and some of it for bcache. It can even dump its contents onto CF and restore back from it if the battery runs out.
I think my next machine will have one of these.
Posted Dec 10, 2015 15:20 UTC (Thu)
by itvirta (guest, #49997)
[Link] (11 responses)
Posted Dec 10, 2015 15:44 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link] (10 responses)
Posted Dec 11, 2015 22:24 UTC (Fri)
by nix (subscriber, #2304)
[Link] (9 responses)
Posted Dec 11, 2015 22:59 UTC (Fri)
by zlynx (guest, #2285)
[Link] (6 responses)
The Tech Report ran six drives until they died: http://techreport.com/review/27909/the-ssd-endurance-expe...
First failures were at around 200 TB of writes. That is a lot. The next one was at 700 TB. Two of the drives survived more than 2 PB of writes.
I don't believe you should worry about a couple hundred gigabytes unless you do it every day for a couple of years.
Posted Dec 13, 2015 21:24 UTC (Sun)
by nix (subscriber, #2304)
[Link] (3 responses)
In light of the 200TiB figure, it's safe to e.g. not care about doing builds and the like on SSDs, even very big builds of monsters like LibreOffice with debugging enabled (so it writes out 20GiB per build, that's nothing, a ten-thousandth of the worst observed failure level and a hundred thousandth of some of them). But things like huffyuv-compressed video being repeatedly rewritten as things mux and unmux it... that's more substantial. One of my processing flows writes a huffyuv-not-very-compressed data mountain out *eight times* as the mux/unmux/chew/mux/remuxes fly past, and only then gets to deal with tools that can handle something that's been compressed to a useful level. Ideally that'd all sit on a ramdisk, but who the hell has that much RAM? Not me, that's for sure. So I have to let the machine read and write on the order of a terabyte, each time... thankfully, this being Linux, the system is snappy and responsive while all this is going on, so I can more or less ignore the thing as a background job -- but if it ages my drives before their time I wouldn't be able to ignore it!
Posted Dec 14, 2015 19:34 UTC (Mon)
by bronson (subscriber, #4806)
[Link] (2 responses)
An exotic DRAM-based drive might be more reliable than just swapping out your devices every n events. Or it might not, I've never used one.
Posted Dec 14, 2015 19:53 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
The upcoming spinning rust drives that have their heads contacting the storage medium -- now *those* would get aged by this, and indeed by any load at all. But as far as I can tell those suck for any purpose other than write-once-access-never archival storage...
Posted Dec 14, 2015 23:02 UTC (Mon)
by smckay (guest, #103253)
[Link]
Posted Dec 15, 2015 20:05 UTC (Tue)
by hummassa (guest, #307)
[Link]
Posted Mar 1, 2017 15:00 UTC (Wed)
by nix (subscriber, #2304)
[Link]
I... don't think it's worth worrying about this much. Not even if you're, say, compiling Chromium over and over again, on the RAID: at ~90GiB of writes a time, that *still* comes to less than one complete device write per day because compiling Chromium is not a fast thing.
(However, I'm still splitting off a non-bcached, non-journalled, md0 array for transient stuff I don't care about and won't ever read, or won't read more than once, simply because it's *inefficient* to burn stuff into SSD that I'll never reference.)
Posted Dec 15, 2015 9:56 UTC (Tue)
by paulj (subscriber, #341)
[Link] (1 responses)
Posted Dec 16, 2015 18:14 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Nov 25, 2015 1:38 UTC (Wed)
by fandingo (guest, #67019)
[Link]
To clarify, this would be due to garbage collection, caused by a lack of empty blocks and the inherent read-erase-modify pattern for partial block writes. It's not artificial. There is a large variance between SSD controllers in how aggressively they clear pages. Some, like the 3-bit NAND in the Samsung 840, aren't aggressive due to lower lifetime writes for those NAND cells. Others, like the Corsair Neutron, are very aggressive and accept the extra NAND writes in exchange for better sustained throughput. (Those models are somewhat older, but I happened to remember them from then-contemporary SSD reviews.) Spare area is also extremely helpful in avoiding GC "crunches." Users can mitigate this as well by not using the full capacity, creating an unofficial spare area, although high volatility and the GC algorithm on the controller can still undermine the effectiveness.
Posted Nov 25, 2015 8:03 UTC (Wed)
by niner (subscriber, #26151)
[Link]
Piece of hard earned advice: if you use SSDs for anything where (especially write-) performance may really matter, throw money at it. Buy larger SSDs than you'll actually need and buy professional or datacenter versions. You will get much better sustained write performance. That's where the biggest difference between consumer and professional versions really is nowadays. Luckily they are not that expensive anymore.
Posted Nov 26, 2015 10:31 UTC (Thu)
by paulj (subscriber, #341)
[Link]
You still have a "hole", but now it's the probability that two independent events occur at the same time - server power dying AND battery suddenly failing - where the additional event is fairly rare of itself. So, the hole becomes a whole lot more rare. ;)
Posted Nov 30, 2015 17:05 UTC (Mon)
by wazoox (subscriber, #69624)
[Link] (1 responses)
All current controllers use RAM, supercapacitors and flash. The supercapacitor provides just enough power to allow writing the cache to flash.
> If you only have one SSD, can you split it between this and bcache somehow?
I suppose that by finely tuning bcache write-back mode to only send full-stripe writes directly to the disk, you could render this feature mostly redundant. Mostly.
Posted Dec 1, 2015 12:06 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Or you could do that. Again this requires specialist hardware support and is almost surely unavailable to md :(
Posted Nov 25, 2015 0:05 UTC (Wed)
by trentbuck (guest, #66356)
[Link] (11 responses)
Is the journal useful at all on RAID1 & RAID10?
Is it useful to use journal *and* WI bitmap on the same array?
It sounds like the answers are yes (for now), no, & no;
Posted Nov 25, 2015 0:50 UTC (Wed)
by neilbrown (subscriber, #359)
[Link]
Exactly correct. I'm fairly sure the code won't let you create or use an array with both a journal and a bitmap.
Actually, depending on workload, the WI-bitmap can cause a measurable performance hit, in which case trading it for a journal would cause a different performance hit, quite possibly less.
Posted Nov 25, 2015 17:04 UTC (Wed)
by gwolf (subscriber, #14632)
[Link] (9 responses)
Posted Nov 25, 2015 20:51 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (8 responses)
This is a widely held opinion that I do not agree with.
Between the moment when a write request is submitted and the moment when that request reports completion, both the "old" and "new" data are equally valid - at a sector granularity (so a mix of old and new sectors must be considered valid). Any application or filesystem that doesn't accept this is already broken even without RAID1.
After an unclean restart it is important to return consistent data for each read, but it doesn't matter if it is consistently "old" data or consistently "new" data. MD/RAID1 handles this by always reading from the "first" device until resync has completed.
This is not quite a perfect solution. If that "first" device fails during resync, it will start reading from the "second" device instead, and this might give results different to previous reads.
Posted Nov 25, 2015 21:54 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Nov 26, 2015 10:39 UTC (Thu)
by paulj (subscriber, #341)
[Link] (6 responses)
Seems simple, so I must be missing something. :)
Posted Nov 26, 2015 19:42 UTC (Thu)
by raven667 (subscriber, #5198)
[Link] (5 responses)
Posted Nov 26, 2015 20:14 UTC (Thu)
by paulj (subscriber, #341)
[Link] (4 responses)
I'm in networking, this is how we solve problems like this.
Posted Nov 28, 2015 15:05 UTC (Sat)
by ttonino (guest, #4073)
[Link]
OTOH handling the RAID as part of the file layout (btrfs/zfs) might also solve this kind of problem: the damage is then limited to the file of which the writing was interrupted. And that file was truncated anyway.
I wonder if block-based anything still makes sense. I mean, drives themselves are not directly block-addressable any more, but instead are a file system exposed as zillions of 512-byte files with fixed names.
Posted Nov 28, 2015 15:58 UTC (Sat)
by raven667 (subscriber, #5198)
[Link] (1 responses)
Posted Nov 30, 2015 9:46 UTC (Mon)
by paulj (subscriber, #341)
[Link]
Posted Dec 4, 2015 15:17 UTC (Fri)
by plugwash (subscriber, #29694)
[Link]
The only real fix for this is to change the model, rather than providing redundancy as a shim layer between the storage system and the filesystem provide it as part of the filesystem.
Posted Nov 25, 2015 5:54 UTC (Wed)
by thwutype (subscriber, #22891)
[Link] (5 responses)
I will guess that unless I can disable the whole stack of downstream member-disk caches (the caches of the HDDs/SSDs/RAID cards/HBAs, etc.), there will still be some chance of seeing corruption if any downstream member's cache is on.
Posted Nov 25, 2015 6:38 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (4 responses)
So if your devices handle flush-cache requests correctly, there should be no room for corruption.
Does that ease your concerns?
Posted Nov 25, 2015 7:17 UTC (Wed)
by thwutype (subscriber, #22891)
[Link] (2 responses)
So, making sure that every layer under the HBA (the LUNs used as MD's member disks)
Posted Nov 25, 2015 7:29 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (1 responses)
Posted Nov 25, 2015 20:08 UTC (Wed)
by smckay (guest, #103253)
[Link]
Posted Nov 25, 2015 12:09 UTC (Wed)
by HIGHGuY (subscriber, #62277)
[Link]
Be sure to check this. There are drives out there (currently still available) that completely ignore flush cache requests. Instead they require "IDLE immediate" to flush their caches.
Posted Nov 25, 2015 9:32 UTC (Wed)
by ms (subscriber, #41272)
[Link]
Posted Nov 25, 2015 10:02 UTC (Wed)
by malor (guest, #2973)
[Link] (4 responses)
Without the NVRAM failsafe, disabling write barriers is playing with fire. You don't want to use it unless you've got that safety net. But if you do, the speedup could be quite noticeable; most filesystems spend a lot of time flushing caches and ensuring consistency, not trusting lying consumer-quality drives. Keeping the journal closer to the metal, as it were, could obviate a great deal of work at higher layers.
Posted Nov 25, 2015 10:28 UTC (Wed)
by malor (guest, #2973)
[Link] (1 responses)
From a quick grep of the kernel docs, it appears that other filesystems with barrier support are inconsistent about how to turn it off. Some use both keywords, but some use only one, and which one of the two they picked is about evenly split.
Posted Dec 6, 2015 13:06 UTC (Sun)
by hmh (subscriber, #3838)
[Link]
However, it is also of very limited value (in fact, it could make things worse) because it will not be supported on older kernels, unless this kind of change is accepted in the -stable trees and also backported by the distros.
Posted Dec 12, 2015 9:05 UTC (Sat)
by joib (subscriber, #8541)
[Link]
Disclaimer: This is from reading various comments from people more knowledgeable than me on the matter around the time this was merged, and from a very high-level understanding of how the code works, rather than from actual benchmarks.
Posted Mar 1, 2017 15:52 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Nov 25, 2015 11:05 UTC (Wed)
by ayers (subscriber, #53541)
[Link] (3 responses)
Posted Nov 25, 2015 12:56 UTC (Wed)
by hthoma (subscriber, #4743)
[Link] (2 responses)
> This can be any block device and could even be a mirrored pair of SSDs (because you wouldn't want the journal device to become a single point of failure).
So, yes, it is most probably reasonable to have some redundancy in the journal. But it is up to the admin to set it up that way.
Posted Nov 26, 2015 2:51 UTC (Thu)
by thwutype (subscriber, #22891)
[Link] (1 responses)
e.g. mdadm ... --write-journal=/dev/md-journal ...
Posted Dec 24, 2021 18:15 UTC (Fri)
by snnn (guest, #155862)
[Link]
Posted Nov 25, 2015 17:10 UTC (Wed)
by gwolf (subscriber, #14632)
[Link] (9 responses)
Posted Nov 25, 2015 20:14 UTC (Wed)
by smckay (guest, #103253)
[Link] (2 responses)
Posted Dec 3, 2015 14:52 UTC (Thu)
by shane (subscriber, #3335)
[Link] (1 responses)
Posted Dec 11, 2015 2:50 UTC (Fri)
by Pc5Y9sbv (guest, #41328)
[Link]
Lately, I am combining these with LV cache. I use MD RAID to create redundant SSD and bulk HDD arrays as different PVs and can choose to place some LVs only on SSD, some only on HDD, and some as an SSD-cached HDD volume.
Posted Nov 25, 2015 21:12 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (2 responses)
I hear this sort of comment from time to time and must confess that I don't really understand it.
So LVM2 is good and RAID is good and they are different (though with some commonality). Many people use both.
On the question of BTRFS/ZFS - I think there is room for multiple approaches.
On the other hand, there are strong reasons relating to flexibility and maintainability to define clear layering and clean interfaces between the layers and let code at each layer "do one thing well". This allows us to easily combine different filesystems with different RAID and volume management and cryptography and network access and..and..and..
Both approaches will continue to have value and Linux is easily big enough for both. There is plenty of room for people such as yourself to experiment with doing things differently.
Posted Nov 25, 2015 22:41 UTC (Wed)
by gwolf (subscriber, #14632)
[Link]
lxc_mail baktun mwi-aom--- 65.00g lxc_mail_mlog 100.00
So, yes, it consists of two "mimage" volumes plus one "mlog" — Which is quite in line with what is discussed in the article. I have had hard drives die on me, and recovering often translates to just slipping in a new HD, pvcreate + vgextend + pvmove + vgreduce, and carry on with my work.
Posted Nov 25, 2015 23:34 UTC (Wed)
by jtaylor (subscriber, #91739)
[Link]
Posted Nov 25, 2015 22:52 UTC (Wed)
by Sesse (subscriber, #53779)
[Link] (2 responses)
Posted Nov 26, 2015 16:43 UTC (Thu)
by flussence (guest, #85566)
[Link] (1 responses)
Posted Nov 26, 2015 16:53 UTC (Thu)
by Sesse (subscriber, #53779)
[Link]
ext4 on the exact same array has no problems. btrfs with snapshots is unbearably slow. Eventually I just gave it up.
Posted Nov 25, 2015 19:03 UTC (Wed)
by NightMonkey (subscriber, #23051)
[Link] (5 responses)
Am I nuts to think this is the case? (Of *course* I love having Linux to manage arrays, don't get me wrong. I'm merely trying to determine operating traits and caveats.)
Cheers!
Posted Nov 25, 2015 20:06 UTC (Wed)
by raven667 (subscriber, #5198)
[Link]
Posted Nov 26, 2015 4:56 UTC (Thu)
by malor (guest, #2973)
[Link] (3 responses)
That may not be what you want.
Rather, in most cases, you probably want the last drip of *known correct* data to get to the disk. The three-stage design of the new mdraid should mean that only complete and correct transactions are permanently recorded. Hardware controllers aren't as married to the kernel, and won't have the same kind of insight into the structure of incoming data, so they'll be at least a little more likely to write out a partial set of blocks given to them by a dying system.
It's not terribly likely in either case, mind, but the staged-commit-with-checksum approach at least LOOKS like it would be more robust. I imagine it will take time to shake out, but a few years from now, hardware controllers may have a harder time competing.
Posted Nov 26, 2015 7:03 UTC (Thu)
by kleptog (subscriber, #1183)
[Link] (2 responses)
Unfortunately, the kernel also has no insight into what data is important to user space programs because there is no interface for user space programs to provide this to the kernel. So all this journaling is good for preserving the file system and RAID array consistency but does nothing to ensure data consistency for users.
It is a necessary step in the right direction though. One day I hope there will be two new syscalls begin_fs_transaction() and commit_fs_transaction() so programs can indicate the data dependencies from their point of view, and we can throw away the millions of lines of user space code dedicated to providing reliable transactional data storage on top of unreliable file systems and disks.
Posted Nov 26, 2015 19:39 UTC (Thu)
by raven667 (subscriber, #5198)
[Link] (1 responses)
Posted Nov 26, 2015 21:20 UTC (Thu)
by kleptog (subscriber, #1183)
[Link]
Actually, AFAICT there's not so much complicated code involving flushes, barriers, etc to ensure integrity. Instead they're using the standard trick every database uses: a journal. This allows you to simulate atomic transactions on a unreliable sub-layer. You don't need huge guarantees to make a journal work, just being able to specify ordering constraints is enough.
But now we have the database writing a journal, the filesystem writing a journal and the RAID array writing a journal. All to achieve the same result, namely providing transaction semantics on top of an unreliable base. If each layer exposed a begin/commit interface to the level above the entire system could run with *one* journal, allowing the maximum scope to parallelisability.
I can guess though why it hasn't happened. One of the possibilities of transaction semantics is that things can fail and require a rollback to a previous state. And unless the entire machine is under transaction semantics you end up having to write code to deal with that. With explicit locking you can be careful to arrange things so that you never need to rollback, which saves a chunk of work. It's hard to prove it correct though. Filesystems probably are careful to arrange their transactions in such a way that they never need to rollback, but if you exposed the capability to user-space you'd have to deal with the possibility that two programs could be trying to do conflicting things.
Posted Nov 30, 2015 23:07 UTC (Mon)
by ldo (guest, #40946)
[Link] (16 responses)
I think Linux software RAID is wonderful. I have had several clients running it for many years, and I am impressed with how well it copes with disk failures. Why it’s better than hardware RAID:
Performance? I’ve never noticed an issue. Modern systems have CPU to burn, and RAID processing essentially disappears in the idle rounding error.
Posted Dec 1, 2015 7:08 UTC (Tue)
by sbakker (subscriber, #58443)
[Link] (1 responses)
The possibility that MD gives you to take the disks out of one machine and just plug them into another greatly helps with recovery from server failures.
With MD RAID, I just monitor the S.M.A.R.T. stats on the physical disks, and as soon as I see reallocated sectors, they get replaced.
I also tend to favour RAID10 over RAID5 or RAID6 (faster rebuilds, better write performance), but then, my storage needs are not that large, so I can afford it.
Posted Dec 1, 2015 7:48 UTC (Tue)
by ldo (guest, #40946)
[Link]
> I just monitor the S.M.A.R.T. stats on the physical disks, and as soon as
> I see reallocated sectors, they get replaced.

In that case, you are probably replacing disks a lot more often than you need to, adding to your costs without any significant improvement in data reliability.
Posted Dec 1, 2015 12:05 UTC (Tue)
by nix (subscriber, #2304)
[Link] (7 responses)
Your first and last points are, of course, compelling (and they're why I'm probably going mdadm on the next machine -- well, that and the incomparable Neil Brown advantage, you will never find a hardware RAID vendor as clued-up or helpful), but this one in particular seems like saying 'md is better than hardware RAID because you can badly implement, by hand, half of something hardware RAID does as a matter of course'.
The right way to scrub with mdadm is echo check > /sys/block/md*/md/sync_action (or 'repair' if you want automatic rewriting of bad blocks). If you're using badblocks by hand I'd say you're doing something wrong.
Posted Dec 1, 2015 19:51 UTC (Tue)
by ldo (guest, #40946)
[Link]
Yes, I can do something with the output of badblocks; if a disk has bad sectors on it, I replace it. I’ve found more than one bad disk this way. Also, badblocks scans work whether the disk is RAIDed or not.
Here is a pair of scripts I wrote to ease the job of running badblocks scans.
Posted Dec 3, 2015 16:48 UTC (Thu)
by hmh (subscriber, #3838)
[Link] (5 responses)
> Array components remain directly accessible. Thus, it is easy to run periodic badblocks scans on all your disks, regardless of whether they’re part of an array or not. I guess this gives you half of a poor-man's scrubbing, but any real hardware RAID will provide automatic scrubbing in any case.

Well, md's "repair" sync_action will give you poor-man's scrubbing (which only rewrites when the underlying storage reports an erasure/read error, or when the parity data sets are not consistent with the data -- ideal for SSDs, but not really what you want for modern "slightly forgetful" HDDs, where you'd actually want to trigger a hard-scrub that rewrites all stripes).
Posted Dec 4, 2015 9:13 UTC (Fri)
by neilbrown (subscriber, #359)
[Link] (4 responses)
Is that really something that people would want?
I guess I imagine that the drive itself would notice if there was any weakness in the current recording (e.g. correctable errors) and would re-write the block proactively. So all that should be necessary is to read every block. But maybe I give too much credit to the drive firmware.
FWIW this would be quite straightforward to implement if anyone thought it would actually be used and wanted a journeyman project to work on.
Posted Dec 6, 2015 14:00 UTC (Sun)
by hmh (subscriber, #3838)
[Link]
> > you'd actually want to trigger a hard-scrub that rewrites all stripes
>
> Is that really something that people would want? I guess I imagine that the drive itself would notice if there was any weakness in the current recording (e.g. correctable errors) and would re-write the block proactively. So all that should be necessary is to read every block. But maybe I give too much credit to the drive firmware.

I used to think the HDD firmware would handle that sanely, as well. Well, let's just say you cannot assume consumer HDDs 1TB and above will do that properly (or will be successful at it while the sector is still weak but ECC-correctable). Forcing a scrub has saved my data several times, already. Once it starts happening, I get a new group of unreadable sectors detected by SMART or by an array read attempt every couple of weeks (each requiring a md repair cycle to ensure none are left behind), until I either get pissed off enough to find a way to force-hard-scrub that entire component device (typically by using mdadm --replace with the help of a hot-spare device).
Posted Dec 10, 2015 15:29 UTC (Thu)
by itvirta (guest, #49997)
[Link] (2 responses)
Can we have one for Christmas?
Posted Dec 10, 2015 21:41 UTC (Thu)
by hmh (subscriber, #3838)
[Link] (1 responses)
I realise this is not what you asked for, since it actually repairs the data, but hey, that could be even more useful depending on what you want to do ;-)
Thread starts at:
Posted Dec 13, 2015 16:09 UTC (Sun)
by itvirta (guest, #49997)
[Link]
Posted Dec 7, 2015 21:56 UTC (Mon)
by Yenya (subscriber, #52846)
[Link] (5 responses)
Posted Dec 7, 2015 22:53 UTC (Mon)
by pizza (subscriber, #46)
[Link] (4 responses)
Posted Dec 8, 2015 7:36 UTC (Tue)
by Yenya (subscriber, #52846)
[Link] (3 responses)
Posted Dec 8, 2015 13:27 UTC (Tue)
by pizza (subscriber, #46)
[Link] (2 responses)
In fifteen years of using 3Ware RAID cards, for example, I've never had a single controller-induced failure, or data loss that wasn't the result of blatant operator error (or multiple drive failure..) My experience with the DAC960/1100 series was similar (though I did once have a controller fail; no data loss once it was swapped). I've even performed your described failure scenario multiple times. Even in the day of PATA/PSCSI, hot-swapping (and hot spares) just worked with those things.
(3Ware cards, the DAC family, and a couple of the Dell PERC adapters were the only ones I had good experiences with; the rest were varying degrees of WTF-to-outright-horror. Granted, my experience is now about five years out of date..)
Meanwhile, The supermicro-based server next to me actually *locked up* two days ago when I attempted to swap a failing eSATA-attached drive used for backups.
But my specific comment about robustness is that you can easily end up with an unbootable system if the wrong drive fails on an MDRAID array that contains /boot. And if you don't put /boot on an array, you end up in the same position. (to work around this, I traditionally put /boot on a PATA CF or USB stick, which I regularly imaged and backed up so I could immediately swap in a replacement)
FWIW I retired the last of those MDRAID systems about six months ago.
Posted Dec 8, 2015 21:34 UTC (Tue)
by Yenya (subscriber, #52846)
[Link] (1 responses)
Posted Dec 9, 2015 17:04 UTC (Wed)
by pizza (subscriber, #46)
[Link]
But I don't use 3Ware cards for RAID5 write performance, I use them for reliability/robustness for bulk storage that is nearly always read loads. (If write performance mattered that much, I'd use sufficient disks for RAID10; RAID5/6 is awful)
Posted Dec 24, 2021 18:16 UTC (Fri)
by snnn (guest, #155862)
[Link]
While the write-back mode can increase performance, it reduces reliability, because the journal device can't be an MD RAID array.
And I think the code isn't stable yet. We saw kernel hangs when the RAID array was doing a sync while it had a write journal and was under heavy read/write load during the sync. Such problems were also reported to the Linux RAID mailing list by other users.
The driver can specify a number of minor numbers to use for partition block devices (arg to "alloc_disk()"), and can set a flag (GENHD_FL_EXT_DEVT) to dynamically support extra partitions.
In both cases a similar effect can be achieved using the "kpartx" tool which reads the partition table and creates dm-linear mappings.
That is no longer true (as of 3.1-rc2) for loop. The ioctl that sets up a loop device takes a flag (LO_FLAGS_PARTSCAN) that tells the driver to set up partitions with names like loopXpY, where loopX is the base loop device and Y is the partition number on that device. I believe there were ways to do that prior that involved setting the max_part module parameter and ways to do it dynamically after the loop device is created (so that fdisk and friends can Do The Right Thing).
You can get losetup to pass the flag by using the -P option.
especially given that an SSD can easily be 4x the size of that 64 GB RAM thingy. Putting all that RAM on the motherboard might be different, though.
> write-once-access-never archival storage...
Sounds like an excellent application for the Signetics 25000 Series 9C46XN. An underrated chip that never got near enough use.
is the *only* gain the closed RAID5 write hole?
but after future work it will also improve RAID5/6 write latency.
If the two copies of a block on a RAID1 differ, then both are equally current.
The appropriate fix here would *not* be a journal, but would be to read all blocks in parallel when reading from a region that is not known to be in-sync, and then writing out an arbitrary candidate block to all devices which contained a different value.
> The appropriate fix here would *not* be a journal, but would be to read all blocks in parallel when reading from a region that is not known to be in-sync, and then writing out an arbitrary candidate block to all devices which contained a different value.
Not quite: your comments re consistency still apply. It should attempt to write out a candidate block from a consistently-chosen device (e.g., the first) to the other one. It could also note that any blocks it read in this way which were beyond the in-sync region were now considered in-sync so don't need to be resynced or 'multiread' again, if that's not too expensive -- it's probably impractical on devices without a write-intent bitmap.
and/or make sure all the member paths are configured 100% write-through, then this quote will be true, right?
and all of its downstream RAID/JBOD boxes (the HBA's member disks) will correctly handle flush commands from MD/RAID5 seems to be a necessary step before using the MD journal to achieve zero corruption, right?
Is there any redundancy while the data resides in the journal?
md-journal is an N-mirror RAID1 whose members are a hybrid of NVMe and ramdisk devices
MD/RAID5 vs. more "intelligent" media aggregation schemes
My impression is that having more layers understand the true geometry of the issue will help me get better performance (and keep the same reliability). I won't talk much about systems that go all the way up to the filesystem, such as BTRFS and ZFS, as I'm quite new at them, but LVM2 provides more or less the same basic functionality with quite enough flexibility and is IMO better suited than RAID for most tasks.
Now, I am more than ready to admit there is a flaw in my reasoning, and would love to understand where it lies. Any takers?
LVM2 provides some very useful functionality, but it doesn't really provide security in the face of device failure.
There is a "dm-raid1" module which can provide basic RAID1 functionality, but I get the impression that it is primarily focused at implementing "pvmove" - a very important function but not really about data reliabilty.
There is also the "dm-raid" module which is a wrapper around the MD/RAID code to provide a consistent interface for the lvm2 tools to use.
But LVM2 does lots of other useful things (thin provisioning, volume resizing, crypto, etc etc) which are quite orthogonal to RAID.
There are strong reasons relating to performance and functionality to encourage vertical integration. Having the user-facing filesystem access all storage devices directly without LVM or RAID in between can bring real benefits. A filesystem which knows about the RAID5 layout, uses copy-on-write to only ever write to areas of devices that were not in use, and always writes full stripes (zero-padding if needed) would not get any benefit at all from the RAID5 journal.
[lxc_mail_mimage_0] baktun iwi-aom--- 65.00g
[lxc_mail_mimage_1] baktun iwi-aom--- 65.00g
[lxc_mail_mlog] baktun lwi-aom--- 4.00m
Or is there still a difference between a raid type lvm module (not the mirrored type) and mdadm raid + lvm on top?
> Array components remain directly accessible. Thus, it is easy to run periodic badblocks scans on all your disks, regardless of whether they're part of an array or not. I guess this gives you half of a poor-man's scrubbing, but any real hardware RAID will provide automatic scrubbing in any case.

You can't do anything with the output of badblocks: even if it finds some, because md doesn't know there's a bad block there it's not going to do any of the things it would routinely do when a bad block is found (starting by rewriting it from the others to force the drive to spare it out, IIRC). All you can do is start a real scrub -- in which case why didn't you just run one in the first place? Set the min and max speeds right and it'll be a lot less disruptive to you disk-load-wise too.
(Or any block device, but being able to read from a mirror or parity if the data is corrupted would be nice.)
https://lkml.org/lkml/2015/11/4/772
Block device checksumming
But everything I can find tells me that it's a read-only target, which isn't really what one wants for general use.
First, the write-through mode is for increasing data safety, not performance. The problem it tries to fix, the write hole, isn't common. Thus you don't need this feature. RAID isn't a backup; it doesn't need to provide 100% data safety. It is meant to reduce downtime in the most common scenarios. So the extra gain from adding a RAID journal is small.