A journal for MD/RAID5
While hardware-RAID solutions suffer from the lack of transparency and flexibility that so often come with closed devices, they have two particular advantages. First, a separate computer brings dedicated processing power and I/O-bus capacity which takes some load off the main system, freeing it for other work. At the very least, the system CPU will never have to perform the XOR calculations required to generate the parity block, and the system I/O bus will never have to carry that block from memory to a storage device. As commodity hardware has increased in capability and speed over the years, though, this advantage has been significantly eroded.
The second advantage is non-volatile memory (NVRAM). While traditional commodity hardware has not offered much NVRAM because it would hardly ever be used, dedicated RAID controllers nearly always have NVRAM as it brings real benefits in both performance and reliability. Utilizing NVRAM provides more than just the incremental benefits brought by extra processing components. It allows changes in data management that can yield better performance from existing devices.
With recent developments, non-volatile memory is becoming a reality on commodity hardware, at least on server-class machines, and it is becoming increasingly easy to attach a small solid-state storage device (SSD) to any system that manages a RAID array. So the time is ripe for MD/RAID5 to benefit from the ability to manage data in the ways that NVRAM allows. Some engineers from Facebook, particularly Shaohua Li and Song Liu, have been working toward this end; Linux 4.4 will be the first mainline release to see the fruits of that labor.
Linux 4.4 — closing the RAID5 write hole
RAID5 (and related levels such as RAID4 and RAID6) suffer from a potential problem known as the "write hole". Each "stripe" on such an array — meaning a set of related blocks, one stored on each active device — will contain data blocks and parity blocks; these must always be kept consistent. The parity must always be exactly what would be computed from the data. If this is not the case then reconstructing the data that was on a device that has failed will produce incorrect results.
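To make the arithmetic concrete: for RAID5 the parity block is the byte-wise XOR of the data blocks in the stripe, and a block from a failed device is rebuilt by XORing the parity with the surviving data blocks. A minimal sketch of that calculation (plain C, just the arithmetic, not the MD driver's actual code) looks like this:

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096   /* MD currently journals 4KB blocks */

    /* Parity for a RAID5 stripe: the XOR of all of its data blocks. */
    static void compute_parity(uint8_t *parity, uint8_t *const data[],
                               size_t ndata)
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            uint8_t p = 0;

            for (size_t d = 0; d < ndata; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }

    /*
     * Rebuild the block that was on a failed device by XORing the parity
     * with the surviving data blocks.  The result is only correct if the
     * parity is consistent with the data -- exactly the property that the
     * write hole can violate.
     */
    static void reconstruct_missing(uint8_t *missing, const uint8_t *parity,
                                    uint8_t *const surviving[], size_t nsurv)
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            uint8_t v = parity[i];

            for (size_t d = 0; d < nsurv; d++)
                v ^= surviving[d][i];
            missing[i] = v;
        }
    }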
In reality, stripes are often inconsistent, though only for very short intervals of time. As the drives in an array are independent (that is the "I" of RAID) they cannot all be updated atomically. When any change is made to a stripe, this independence will almost certainly result in a moment when data and parity are inconsistent. Naturally the MD driver understands this and would never try to access data during that moment of inconsistency ... unless....
Problems occur if a machine crash or power failure causes an unplanned shutdown. It is fairly easy to argue that the likelihood that an unclean shutdown would interrupt some writes but not others is extremely small. It's not easy to argue that such a circumstance could never happen, though. So when restarting from an unclean shutdown, the MD driver must assume that the failure may have happened during a moment of inconsistency and, thus, the parity blocks cannot be trusted.

If the array is still optimal (no failed devices) it will recalculate the parity on any stripe that could have been in the middle of an update. If, however, the array is degraded, the parity cannot be recalculated. If some blocks in a stripe were updated and others weren't, then the block that was on the failed device will be reconstructed based on inconsistent information, leading to data corruption. To handle this case, MD will refuse to assemble the array without the "--force" flag, which effectively acknowledges that data might be corrupted.
An obvious way to address this issue is to use the same approach that has worked so well with filesystems: write all updates to a journal before writing them to the main array. When the array is restarted, any data and parity blocks still in the journal are simply written to the array again. This ensures the array will be consistent whether it is degraded or not. This could be done with a journal on a rotating-media drive but the performance would be very poor indeed. The advent of large NVRAM and SSDs makes this a much more credible proposition.
The new journal feature
The functionality developed at Facebook does exactly this. It allows a journal device (sometimes referred to as a "cache" or "log" device) to be configured with an MD/RAID5 (or RAID4 or RAID6) array. This can be any block device and could even be a mirrored pair of SSDs (because you wouldn't want the journal device to become a single point of failure).
To try this out you would need Linux 4.4-rc1 or later, and the current mdadm from git://neil.brown.name/mdadm. Then you can create a new array with a journal using a command like:

    mdadm --create /dev/md/test --level=5 --raid-disks=4 \
          --write-journal=/dev/loop9 /dev/loop[0-3]
It is not currently possible to add a journal to an existing array, but that functionality is easy enough to add later.
With the journal in place, RAID5 handling will progress much as it normally does, gathering write requests into stripes and calculating the parity blocks. Then, instead of being written to the array, the stripe is intercepted by the journaling subsystem and queued for the journal instead. When write traffic is sufficiently heavy, multiple stripes will be grouped together into a single transaction and written to the journal with a single metadata block listing the addresses of the data and parity. Once this transaction has been written and, if necessary, flushed to stable storage, the core RAID5 engine is told to process the stripe again, and this time the write-out is not intercepted.
When the write to the main array completes, the journaling subsystem will be told; it will occasionally update its record of where the journal starts so that data that is safe on the array effectively disappears from the journal. When the array is shut down cleanly, this start-of-journal pointer is set to an empty transaction with nothing following. When the array is started, the journal is inspected and if any transactions are found (with both data and parity) they are written to the array.
The journal metadata block uses 16 bytes per data block and so can describe well over 200 blocks. Along with each block's location and size (currently always 4KB), the journal metadata records a checksum for each data block. This, together with a checksum on the metadata block itself, allows very reliable determination of which blocks were successfully written to the journal and so should be copied to the array on restart.
In general, the journal consists of an arbitrarily large sequence of metadata blocks and associated data and parity blocks. Each metadata block records how much space in the journal is used by the data and parity and so indicates where the next metadata block will be, if it has been written. The address of the first metadata block to be considered on restart is stored in the standard MD/RAID superblock.
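To make that layout a little more concrete, here is a simplified sketch of what one transaction's metadata might look like. The structure and field names are illustrative assumptions for this article, not the kernel's actual on-disk format:

    #include <stdint.h>

    /*
     * Illustrative sketch only -- not the real on-disk format used by the
     * MD journal code.  One 16-byte entry describes each data or parity
     * block stored in the transaction.
     */
    struct journal_entry {
        uint64_t array_sector;   /* where the block belongs on the array */
        uint32_t length;         /* currently always 4KB */
        uint32_t checksum;       /* checksum of the block's contents */
    };

    /*
     * A transaction: one metadata block followed by the data and parity
     * blocks it describes.  The space consumed by those blocks tells the
     * reader where the next metadata block, if any, begins.  On restart,
     * replay starts at the position recorded in the MD superblock and
     * stops at the first metadata block (or block entry) whose checksum
     * does not verify, since that transaction was never fully written.
     */
    struct journal_meta {
        uint32_t magic;          /* identifies a metadata block */
        uint32_t checksum;       /* covers this metadata block itself */
        uint64_t seq;            /* transaction sequence number */
        uint32_t nr_entries;
        struct journal_entry entries[];
    };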
The net result of this is that, while writes to the array might be slightly slower (depending on how fast the journal device is), a system crash never results in a full resync — only a short journal recovery — and there is no chance of data corruption due to the write hole.
Given that the write-intent bitmap already allows resynchronization after a crash to be fairly quick, and that write-hole corruption is, in practice, very rare, you may wonder if this is all worth the cost. Undoubtedly different people will assess this tradeoff differently; now at least the option is available once that assessment is made. But this is not the full story. The journal can provide benefits beyond closing the write hole. That was a natural place to start as it is conceptually relatively simple and provides a context for creating the infrastructure for managing a journal. The more interesting step comes next.
The future: writeback caching and more full-stripe writes
While RAID5 and RAID6 provide a reasonably economical way to combine multiple devices to provide large storage capacity with a reduced chance of data loss, they do come at a cost. When the host system writes a full stripe's worth of data to the array, the parity can be calculated from that data and all writes can be scheduled almost immediately, leading to very good throughput. When writing less than a full stripe, though, throughput drops dramatically.
In that case, some data or parity blocks need to be read from the array before the new parity can be calculated. This read-before-write introduces significant latency to each request, so throughput suffers. The MD driver tries to delay partial-stripe writes a little bit in the hope that the rest of the stripe might be written soon. When this works, it helps a lot. When it doesn't, it just increases latency further.
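For a single-block update, the usual shortcut (often called read-modify-write) is to read the old copy of the data block and the old parity and fold the difference directly into the parity. A small sketch of just that arithmetic, again not MD's internal code:

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096

    /*
     * Read-modify-write update of one data block in a RAID5 stripe:
     * new parity = old parity ^ old data ^ new data.  Updating a single
     * block therefore costs two reads (old data, old parity) and two
     * writes, whereas a full-stripe write needs no reads at all -- which
     * is why MD tries to gather full stripes before writing.
     */
    static void update_parity_rmw(uint8_t *parity,        /* updated in place */
                                  const uint8_t *old_data,
                                  const uint8_t *new_data)
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }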
It is possible for a filesystem to help to some extent, and to align data with stripes to increase the chance of a full-stripe write, but that is far from a complete solution. A journal can make a real difference here by being managed as a writeback cache. Data can be written to the journal and the application can be told that the data is safe before the RAID5 engine even starts considering whether some pre-reading might be needed to be able to update parity blocks.
This allows the application to see very short latencies no matter what data-block pattern is being written. It also allows the RAID5 core to delay writes even longer, hoping to gather full stripes, without inconveniencing the application. This is something that dedicated RAID controllers have (presumably) been doing for years, and hopefully something that MD will provide in the not-too-distant future.
There are plenty of interesting questions here, such as whether to keep all in-flight data in main memory, or to discard it after writing to the journal and to read it back when it is time to write to the RAID. There is also the question of when to give up waiting for a full stripe and to perform the necessary pre-reading. Together with all this, a great deal of care will be needed to ensure we actually get the performance improvements that theory suggests are possible.
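One way to picture that trade-off is as a small destaging policy: acknowledge a write once it is safe in the journal, then flush a stripe to the array either when it has become complete or when it has waited long enough. The sketch below is purely conceptual, with invented names; it is not a proposed or actual MD interface:

    #include <stdbool.h>
    #include <stdint.h>

    /* Conceptual sketch only; the structure and names are invented. */
    struct pending_stripe {
        uint64_t first_sector;       /* where the stripe starts on the array */
        unsigned blocks_cached;      /* data blocks already in the journal   */
        unsigned blocks_per_stripe;  /* data blocks in a full stripe         */
        uint64_t oldest_write;       /* time the first cached block arrived  */
    };

    /* Decide whether a journalled stripe should be destaged to the array. */
    static bool should_destage(const struct pending_stripe *s,
                               uint64_t now, uint64_t max_wait)
    {
        if (s->blocks_cached == s->blocks_per_stripe)
            return true;   /* full stripe: parity needs no pre-reading */
        if (now - s->oldest_write > max_wait)
            return true;   /* stop waiting; accept the read-modify-write cost */
        return false;
    }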
This is just engineering though. There is interest in this from both potential users of the technology and vendors of NVRAM hardware, and there is little doubt that we will see the journal enhanced to provide very visible performance improvements to complement the nearly invisible reliability improvements already achieved.
Index entries for this article:
    Kernel: Block layer/RAID
    Kernel: RAID
    GuestArticles: Brown, Neil
Posted Nov 24, 2015 22:50 UTC (Tue)
by nix (subscriber, #2304)
[Link] (25 responses)
I'll admit I'm surprised that SSD write speeds and lifetimes are good enough for something like this: I was betting they would never get as fast as rotating rust at writes, but clearly they've got there. (Hardware RAID controllers in my experience generally use battery-backed DRAM, not NVRAM -- so rather than having a problem with the NVRAM degrading through write load, we have a problem with the battery dying!) I do wonder about sustained writes, though -- a lot of SSDs can only keep up with high write rates in brief bursts, artificially throttling afterwards. If I'm doing, say, vapoursynth work, I can easily shuffle a couple of terabytes of writes of huffyuv-compressed video around the array in the intermediate pipeline stages: anything that slows that down can easily add hours to the processing pipeline, which is slow enough as it is. (Obviously this data is inherently transient, so preserving any of it against power losses is a total waste of time -- so probably you'd relegate it to a non-journalled array in any case. But one can envisage workloads with very high write loads that are not inherently transient -- you'd want write hole protection for those even more, since if the array is writing for hours non-stop the chance of a write hole is climbing to non-insignificant levels!)
If you only have one SSD, can you split it between this and bcache somehow? Maybe dm-linear, since cutting block devices in two is more or less what it's meant for. They both seem worthwhile things to have in a RAID-enabled box.
Probably, though, bcache and good backups would be the first priority, unless the system needed very high uptime: as you say, the write hole is a rare occurrence under normal conditions. Personally I always assemble with --force in my initramfs, specifically to ensure that I come up after a crash even if unattended, then rely on fsck to clean up the worst of any write-hole damage and should it be needed then do a giant find | cmp (roughly) against the most recent (FUSE-mounted bup) backups to identify any corruption. (I've never had scattershot corruption due to the write hole, but when I've had it in other circumstances this has found all corruption in a few hours, bounded purely by disk I/O time. You *do* need very frequent backups to get away with this, though, which people only generally do after they get burned. I started doing that sort of thing after the unfortunate ext4 journal metadata corruption incident of a few years back. Remember, this one: <https://lwn.net/Articles/521803/>.)
Posted Nov 24, 2015 23:18 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (19 responses)
I always understood the term "NVRAM" to include battery-backed RAM (both D and S), flash, memristors (are they a thing yet?), bubble memory, and anything else which is memory and isn't volatile.
I agree that most of the devices that this journalling gets used on will probably be of the "flash" family, but even there I understand that technology is moving quickly. It would certainly be valuable to hear reports from people who try this out on different devices.
> If you only have one SSD, can you split it between this and bcache somehow?
I would recommend partitions with cfdisk. Certainly lvm could do it too - and with greater flexibility. You just need a block device.
Posted Nov 24, 2015 23:44 UTC (Tue)
by nix (subscriber, #2304)
[Link] (18 responses)
I never thought of partitioning an SSD, but I suppose if it looks like a disk, you can partition it! The question is whether the kernel can identify partitions on all sorts of block devices, or whether this is restricted to only a subset of them. (I have no idea, I've never checked the code and have no real idea how that machinery works. It clearly doesn't run on *all* block devices or you wouldn't have had to do anything special to make partitioned md work... but md has always been rather special with its semi-dynamic major numbers etc, so maybe this was something related to that specialness.)
Posted Nov 25, 2015 1:32 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (4 responses)
The block device driver needs to opt-in to the kernel's generic partition support.
Two drivers I know of which don't support either static or dynamic allocations of partition devices are "loop" and "dm".
Many SSDs register under the "sd" driver and so get full partition support.
Posted Nov 25, 2015 4:08 UTC (Wed)
by ABCD (subscriber, #53650)
[Link] (3 responses)
Posted Nov 25, 2015 4:15 UTC (Wed)
by ABCD (subscriber, #53650)
[Link]
Posted Nov 25, 2015 22:51 UTC (Wed)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Nov 26, 2015 2:03 UTC (Thu)
by ABCD (subscriber, #53650)
[Link]
Posted Dec 7, 2015 19:24 UTC (Mon)
by nix (subscriber, #2304)
[Link] (12 responses)
Actually you can stick enough RAM in it that it's questionable if you need an SSD at all, even for bcache: just partition this thing and use some of it for the RAID write hole avoidance and some of it for bcache. It can even dump its contents onto CF and restore back from it if the battery runs out.
I think my next machine will have one of these.
Posted Dec 10, 2015 15:20 UTC (Thu)
by itvirta (guest, #49997)
[Link] (11 responses)
Posted Dec 10, 2015 15:44 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link] (10 responses)
Posted Dec 11, 2015 22:24 UTC (Fri)
by nix (subscriber, #2304)
[Link] (9 responses)
Posted Dec 11, 2015 22:59 UTC (Fri)
by zlynx (guest, #2285)
[Link] (6 responses)
The Tech Report ran six drives until they died: http://techreport.com/review/27909/the-ssd-endurance-expe...
First failures were at around 200 TB of writes. That is a lot. The next one was at 700 TB. Two of the drives survived more than 2 PB of writes.
I don't believe you should worry about a couple hundred gigabytes unless you do it every day for a couple of years.
Posted Dec 13, 2015 21:24 UTC (Sun)
by nix (subscriber, #2304)
[Link] (3 responses)
In light of the 200TiB figure, it's safe to e.g. not care about doing builds and the like on SSDs, even very big builds of monsters like LibreOffice with debugging enabled (so it writes out 20GiB per build, that's nothing, a ten-thousandth of the worst observed failure level and a hundred thousandth of some of them). But things like huffyuv-compressed video being repeatedly rewritten as things mux and unmux it... that's more substantial. One of my processing flows writes a huffyuv-not-very-compressed data mountain out *eight times* as the mux/unmux/chew/mux/remuxes fly past, and only then gets to deal with tools that can handle something that's been compressed to a useful level. Ideally that'd all sit on a ramdisk, but who the hell has that much RAM? Not me, that's for sure. So I have to let the machine read and write on the order of a terabyte, each time... thankfully, this being Linux, the system is snappy and responsive while all this is going on, so I can more or less ignore the thing as a background job -- but if it ages my drives before their time I wouldn't be able to ignore it!
Posted Dec 14, 2015 19:34 UTC (Mon)
by bronson (subscriber, #4806)
[Link] (2 responses)
An exotic DRAM-based drive might be more reliable than just swapping out your devices every n events. Or it might not, I've never used one.
Posted Dec 14, 2015 19:53 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
The upcoming spinning rust drives that have their heads contacting the storage medium -- now *those* would get aged by this, and indeed by any load at all. But as far as I can tell those suck for any purpose other than write-once-access-never archival storage...
Posted Dec 14, 2015 23:02 UTC (Mon)
by smckay (guest, #103253)
[Link]
Posted Dec 15, 2015 20:05 UTC (Tue)
by hummassa (guest, #307)
[Link]
Posted Mar 1, 2017 15:00 UTC (Wed)
by nix (subscriber, #2304)
[Link]
I... don't think it's worth worrying about this much. Not even if you're, say, compiling Chromium over and over again, on the RAID: at ~90GiB of writes a time, that *still* comes to less than one complete device write per day because compiling Chromium is not a fast thing.
(However, I'm still splitting off a non-bcached, non-journalled, md0 array for transient stuff I don't care about and won't ever read, or won't read more than once, simply because it's *inefficient* to burn stuff into SSD that I'll never reference.)
Posted Dec 15, 2015 9:56 UTC (Tue)
by paulj (subscriber, #341)
[Link] (1 responses)
Posted Dec 16, 2015 18:14 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Nov 25, 2015 1:38 UTC (Wed)
by fandingo (guest, #67019)
[Link]
To clarify, this would be due to garbage collection, caused by a lack of empty blocks and the inherent read-erase-modify pattern for partial block writes. It's not artificial. There is a large variance between SSD controllers in how aggressively they clear pages. Some, like the 3-bit NAND in the Samsung 840, aren't aggressive due to lower lifetime writes for those NAND cells. Others, like the Corsair Neutron, are very aggressive and accept the extra NAND writes in exchange for better sustained throughput. (Those models are somewhat older, but I happened to remember them from then-contemporary SSD reviews.) Spare area is also extremely helpful in avoiding GC "crunches." Users can mitigate this as well by not using the full capacity, creating an unofficial spare area, although high volatility and the GC algorithm on the controller can still undermine the effectiveness.
Posted Nov 25, 2015 8:03 UTC (Wed)
by niner (subscriber, #26151)
[Link]
Piece of hard earned advice: if you use SSDs for anything where (especially write-) performance may really matter, throw money at it. Buy larger SSDs than you'll actually need and buy professional or datacenter versions. You will get much better sustained write performance. That's where the biggest difference between consumer and professional versions really is nowadays. Luckily they are not that expensive anymore.
Posted Nov 26, 2015 10:31 UTC (Thu)
by paulj (subscriber, #341)
[Link]
You still have a "hole", but now it's the probability that two independent events occur at the same time - server power dying AND battery suddenly failing - where the additional event is fairly rare of itself. So, the hole becomes a whole lot more rare. ;)
Posted Nov 30, 2015 17:05 UTC (Mon)
by wazoox (subscriber, #69624)
[Link] (1 responses)
All current controllers use RAM, supercapacitors and flash. The supercapacitor provides just enough power to allow writing the cache to flash.
> If you only have one SSD, can you split it between this and bcache somehow?
I suppose that by finely tuning bcache write-back mode to only send full-stripe writes directly to the disk, you could render this feature mostly redundant. Mostly.
Posted Dec 1, 2015 12:06 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Or you could do that. Again this requires specialist hardware support and is almost surely unavailable to md :(
Posted Nov 25, 2015 0:05 UTC (Wed)
by trentbuck (guest, #66356)
[Link] (11 responses)
Is the journal useful at all on RAID1 & RAID10?
Is it useful to use journal *and* WI bitmap on the same array?
It sounds like the answers are yes (for now), no, & no;
Posted Nov 25, 2015 0:50 UTC (Wed)
by neilbrown (subscriber, #359)
[Link]
Exactly correct. I'm fairly sure the code won't let you create or use an array with both a journal and a bitmap.
Actually, depending on workload, the WI-bitmap can cause a measurable performance hit, in which case trading it for a journal would cause a different performance hit, quite possibly less.
Posted Nov 25, 2015 17:04 UTC (Wed)
by gwolf (subscriber, #14632)
[Link] (9 responses)
Posted Nov 25, 2015 20:51 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (8 responses)
This is a widely held opinion that I do not agree with.
Between the moment when a write request is submitted and the moment when that request reports completion, both the "old" and "new" data are equally valid - at a sector granularity (so a mix of old and new sectors must be considered valid). Any application or filesystem that doesn't accept this is already broken even without RAID1.
After an unclean restart it is important to return consistent data for each read, but it doesn't matter if it is consistently "old" data or consistently "new" data. MD/RAID1 handles this by always reading from the "first" device until resync has completed.
This is not quite a perfect solution. If that "first" device fails during resync, it will start reading from the "second" device instead, and this might give results different to previous reads.
Posted Nov 25, 2015 21:54 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Nov 26, 2015 10:39 UTC (Thu)
by paulj (subscriber, #341)
[Link] (6 responses)
Seems simple, so I must be missing something. :)
Posted Nov 26, 2015 19:42 UTC (Thu)
by raven667 (subscriber, #5198)
[Link] (5 responses)
Posted Nov 26, 2015 20:14 UTC (Thu)
by paulj (subscriber, #341)
[Link] (4 responses)
I'm in networking, this is how we solve problems like this.
Posted Nov 28, 2015 15:05 UTC (Sat)
by ttonino (guest, #4073)
[Link]
OTOH handling the RAID as part of the file layout (btrfs/zfs) might also solve this kind of problem: the damage is then limited to the file of which the writing was interrupted. And that file was truncated anyway.
I wonder if block-based anything still makes sense. I mean, drives themselves are not directly block-addressable any more, but instead are a file system exposed as zillions of 512-byte files with fixed names.
Posted Nov 28, 2015 15:58 UTC (Sat)
by raven667 (subscriber, #5198)
[Link] (1 responses)
Posted Nov 30, 2015 9:46 UTC (Mon)
by paulj (subscriber, #341)
[Link]
Posted Dec 4, 2015 15:17 UTC (Fri)
by plugwash (subscriber, #29694)
[Link]
The only real fix for this is to change the model, rather than providing redundancy as a shim layer between the storage system and the filesystem provide it as part of the filesystem.
Posted Nov 25, 2015 5:54 UTC (Wed)
by thwutype (subscriber, #22891)
[Link] (5 responses)
I will guess that unless I can disable the whole stack of downstream member-disk caches (the caches of the HDDs/SSDs/RAID cards/HBAs, etc.), there will still be some chance of seeing corruption if any downstream member's cache is on.
Posted Nov 25, 2015 6:38 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (4 responses)
So if your devices handle flush-cache requests correctly, there should be no room for corruption.
Does that ease your concerns?
Posted Nov 25, 2015 7:17 UTC (Wed)
by thwutype (subscriber, #22891)
[Link] (2 responses)
So, making sure that every layer under the HBA (the LUNs used as MD's member disks)
Posted Nov 25, 2015 7:29 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (1 responses)
Posted Nov 25, 2015 20:08 UTC (Wed)
by smckay (guest, #103253)
[Link]
Posted Nov 25, 2015 12:09 UTC (Wed)
by HIGHGuY (subscriber, #62277)
[Link]
Be sure to check this. There are drives out there (currently still available) that completely ignore flush cache requests. Instead they require "IDLE immediate" to flush their caches.
Posted Nov 25, 2015 9:32 UTC (Wed)
by ms (subscriber, #41272)
[Link]
Posted Nov 25, 2015 10:02 UTC (Wed)
by malor (guest, #2973)
[Link] (4 responses)
Without the NVRAM failsafe, disabling write barriers is playing with fire. You don't want to use it unless you've got that safety net. But if you do, the speedup could be quite noticeable; most filesystems spend a lot of time flushing caches and ensuring consistency, not trusting lying consumer-quality drives. Keeping the journal closer to the metal, as it were, could obviate a great deal of work at higher layers.
Posted Nov 25, 2015 10:28 UTC (Wed)
by malor (guest, #2973)
[Link] (1 responses)
From a quick grep of the kernel docs, it appears that other filesystems with barrier support are inconsistent about how to turn it off. Some use both keywords, but some use only one, and which one of the two they picked is about evenly split.
Posted Dec 6, 2015 13:06 UTC (Sun)
by hmh (subscriber, #3838)
[Link]
However, it is also of very limited value (in fact, it could make things worse) because it will not be supported on older kernels, unless this kind of change is accepted in the -stable trees and also backported by the distros.
Posted Dec 12, 2015 9:05 UTC (Sat)
by joib (subscriber, #8541)
[Link]
Disclaimer: This is from reading various comments from people more knowledgeable than me on the matter around the time this was merged, and from a very high-level understanding of how the code works, rather than from actual benchmarks.
Posted Mar 1, 2017 15:52 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Nov 25, 2015 11:05 UTC (Wed)
by ayers (subscriber, #53541)
[Link] (3 responses)
Posted Nov 25, 2015 12:56 UTC (Wed)
by hthoma (subscriber, #4743)
[Link] (2 responses)
> This can be any block device and could even be a mirrored pair of SSDs (because you wouldn't want the journal device to become a single point of failure).
So, yes, it is most probably reasonable to have some redundancy in the journal. But it is up to the admin to set it up that way.
Posted Nov 26, 2015 2:51 UTC (Thu)
by thwutype (subscriber, #22891)
[Link] (1 responses)
e.g. mdadm ... --write-journal=/dev/md-journal ...
Posted Dec 24, 2021 18:15 UTC (Fri)
by snnn (guest, #155862)
[Link]
Posted Nov 25, 2015 17:10 UTC (Wed)
by gwolf (subscriber, #14632)
[Link] (9 responses)
Posted Nov 25, 2015 20:14 UTC (Wed)
by smckay (guest, #103253)
[Link] (2 responses)
Posted Dec 3, 2015 14:52 UTC (Thu)
by shane (subscriber, #3335)
[Link] (1 responses)
Posted Dec 11, 2015 2:50 UTC (Fri)
by Pc5Y9sbv (guest, #41328)
[Link]
Lately, I am combining these with LV cache. I use MD RAID to create redundant SSD and bulk HDD arrays as different PVs and can choose to place some LVs only on SSD, some only on HDD, and some as an SSD-cached HDD volume.
Posted Nov 25, 2015 21:12 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (2 responses)
I hear this sort of comment from time to time and must confess that I don't really understand it.
So LVM2 is good and RAID is good and they are different (though with some commonality). Many people use both.
On the question of BTRFS/ZFS - I think there is room for multiple approaches.
On the other hand, there are strong reasons relating to flexibility and maintainability to define clear layering and clean interfaces between the layers and let code at each layer "do one thing well". This allows us to easily combine different filesystems with different RAID and volume management and cryptography and network access and..and..and..
Both approaches will continue to have value and Linux is easily big enough for both. There is plenty of room for people such as yourself to experiment with doing things differently.
Posted Nov 25, 2015 22:41 UTC (Wed)
by gwolf (subscriber, #14632)
[Link]
lxc_mail baktun mwi-aom--- 65.00g lxc_mail_mlog 100.00
So, yes, it consists of two "mimage" volumes plus one "mlog" — Which is quite in line with what is discussed in the article. I have had hard drives die on me, and recovering often translates to just slipping in a new HD, pvcreate + vgextend + pvmove + vgreduce, and carry on with my work.
Posted Nov 25, 2015 23:34 UTC (Wed)
by jtaylor (subscriber, #91739)
[Link]
Posted Nov 25, 2015 22:52 UTC (Wed)
by Sesse (subscriber, #53779)
[Link] (2 responses)
Posted Nov 26, 2015 16:43 UTC (Thu)
by flussence (guest, #85566)
[Link] (1 responses)
Posted Nov 26, 2015 16:53 UTC (Thu)
by Sesse (subscriber, #53779)
[Link]
ext4 on the exact same array has no problems. btrfs with snapshots is unbearably slow. Eventually I just gave it up.
Posted Nov 25, 2015 19:03 UTC (Wed)
by NightMonkey (subscriber, #23051)
[Link] (5 responses)
Am I nuts to think this is the case? (Of *course* I love having Linux to manage arrays, don't get me wrong. I'm merely trying to determine operating traits and caveats.)
Cheers!
Posted Nov 25, 2015 20:06 UTC (Wed)
by raven667 (subscriber, #5198)
[Link]
Posted Nov 26, 2015 4:56 UTC (Thu)
by malor (guest, #2973)
[Link] (3 responses)
That may not be what you want.
Rather, in most cases, you probably want the last drip of *known correct* data to get to the disk. The three-stage design of the new mdraid should mean that only complete and correct transactions are permanently recorded. Hardware controllers aren't as married to the kernel, and won't have the same kind of insight into the structure of incoming data, so they'll be at least a little more likely to write out a partial set of blocks given to them by a dying system.
It's not terribly likely in either case, mind, but the staged-commit-with-checksum approach at least LOOKS like it would be more robust. I imagine it will take time to shake out, but a few years from now, hardware controllers may have a harder time competing.
Posted Nov 26, 2015 7:03 UTC (Thu)
by kleptog (subscriber, #1183)
[Link] (2 responses)
Unfortunately, the kernel also has no insight into what data is important to user space programs because there is no interface for user space programs to provide this to the kernel. So all this journaling is good for preserving the file system and RAID array consistency but does nothing to ensure data consistency for users.
It is a necessary step in the right direction though. One day I hope there will be two new syscalls begin_fs_transaction() and commit_fs_transaction() so programs can indicate the data dependencies from their point of view, and we can throw away the millions of lines of user space code dedicated to providing reliable transactional data storage on top of unreliable file systems and disks.
Posted Nov 26, 2015 19:39 UTC (Thu)
by raven667 (subscriber, #5198)
[Link] (1 responses)
Posted Nov 26, 2015 21:20 UTC (Thu)
by kleptog (subscriber, #1183)
[Link]
Actually, AFAICT there's not so much complicated code involving flushes, barriers, etc to ensure integrity. Instead they're using the standard trick every database uses: a journal. This allows you to simulate atomic transactions on a unreliable sub-layer. You don't need huge guarantees to make a journal work, just being able to specify ordering constraints is enough.
But now we have the database writing a journal, the filesystem writing a journal and the RAID array writing a journal. All to achieve the same result, namely providing transaction semantics on top of an unreliable base. If each layer exposed a begin/commit interface to the level above the entire system could run with *one* journal, allowing the maximum scope to parallelisability.
I can guess though why it hasn't happened. One of the possibilities of transaction semantics is that things can fail and require a rollback to a previous state. And unless the entire machine is under transaction semantics you end up having to write code to deal with that. With explicit locking you can be careful to arrange things so that you never need to rollback, which saves a chunk of work. It's hard to prove it correct though. Filesystems probably are careful to arrange their transactions in such a way that they never need to rollback, but if you exposed the capability to user-space you'd have to deal with the possibility that two programs could be trying to do conflicting things.
Posted Nov 30, 2015 23:07 UTC (Mon)
by ldo (guest, #40946)
[Link] (16 responses)
I think Linux software RAID is wonderful. I have had several clients running it for many years, and I am impressed with how well it copes with disk failures. Why it’s better than hardware RAID:
Performance? I’ve never noticed an issue. Modern systems have CPU to burn, and RAID processing essentially disappears in the idle rounding error.
Posted Dec 1, 2015 7:08 UTC (Tue)
by sbakker (subscriber, #58443)
[Link] (1 responses)
The possibility that MD gives you to take the disks out of one machine and just plug them into another greatly helps with recovery from server failures.
With MD RAID, I just monitor the S.M.A.R.T. stats on the physical disks, and as soon as I see reallocated sectors, they get replaced.
I also tend to favour RAID10 over RAID5 or RAID6 (faster rebuilds, better write performance), but then, my storage needs are not that large, so I can afford it.
Posted Dec 1, 2015 7:48 UTC (Tue)
by ldo (guest, #40946)
[Link]
> I just monitor the S.M.A.R.T. stats on the physical disks, and as soon as
> I see reallocated sectors, they get replaced.

In that case, you are probably replacing disks a lot more often than you need to, adding to your costs without any significant improvement in data reliability.
Posted Dec 1, 2015 12:05 UTC (Tue)
by nix (subscriber, #2304)
[Link] (7 responses)
Your first and last points are, of course, compelling (and they're why I'm probably going mdadm on the next machine -- well, that and the incomparable Neil Brown advantage, you will never find a hardware RAID vendor as clued-up or helpful), but this one in particular seems like saying 'md is better than hardware RAID because you can badly implement, by hand, half of something hardware RAID does as a matter of course'.
The right way to scrub with mdadm is echo check > /sys/block/md*/md/sync_action (or 'repair' if you want automatic rewriting of bad blocks). If you're using badblocks by hand I'd say you're doing something wrong.
Posted Dec 1, 2015 19:51 UTC (Tue)
by ldo (guest, #40946)
[Link]
Yes, I can do something with the output of badblocks; if a disk has bad sectors on it, I replace it. I’ve found more than one bad disk this way. Also, badblocks scans work whether the disk is RAIDed or not.
Here is a pair of scripts I wrote to ease the job of running badblocks scans.
Posted Dec 3, 2015 16:48 UTC (Thu)
by hmh (subscriber, #3838)
[Link] (5 responses)
> Array components remain directly accessible. Thus, it is easy to run periodic badblocks scans on all your disks, regardless of whether they’re part of an array or not. I guess this gives you half of a poor-man's scrubbing, but any real hardware RAID will provide automatic scrubbing in any case.

Well, md's "repair" sync_action will give you poor-man's scrubbing (which only rewrites when the underlying storage reports an erasure/read error, or when the parity data sets are not consistent with the data -- ideal for SSDs, but not really what you want for modern "slightly forgetful" HDDs, where you'd actually want to trigger a hard-scrub that rewrites all stripes).
Posted Dec 4, 2015 9:13 UTC (Fri)
by neilbrown (subscriber, #359)
[Link] (4 responses)
Is that really something that people would want?
I guess I imagine that the drive itself would notice if there was any weakness in the current recording (e.g. correctable errors) and would re-write the block proactively. So all that should be necessary is to read every block. But maybe I give too much credit to the drive firmware.
FWIW this would be quite straightforward to implement if anyone thought it would actually be used and wanted a journeyman project to work on.
Posted Dec 6, 2015 14:00 UTC (Sun)
by hmh (subscriber, #3838)
[Link]
> > you'd actually want to trigger a hard-scrub that rewrites all stripes
>
> Is that really something that people would want? I guess I imagine that the drive itself would notice if there was any weakness in the current recording (e.g. correctable errors) and would re-write the block proactively. So all that should be necessary is to read every block. But maybe I give too much credit to the drive firmware.

I used to think the HDD firmware would handle that sanely, as well. Well, let's just say you cannot assume consumer HDDs 1TB and above will do that properly (or will be successful at it while the sector is still weak but ECC-correctable). Forcing a scrub has saved my data several times, already. Once it starts happening, I get a new group of unreadable sectors detected by SMART or by an array read attempt every couple of weeks (each requiring a md repair cycle to ensure none are left behind), until I either get pissed off enough to find a way to force-hard-scrub that entire component device (typically by using mdadm --replace with the help of a hot-spare device).
Posted Dec 10, 2015 15:29 UTC (Thu)
by itvirta (guest, #49997)
[Link] (2 responses)
Can we have one for Christmas?
Posted Dec 10, 2015 21:41 UTC (Thu)
by hmh (subscriber, #3838)
[Link] (1 responses)
I realise this is not what you asked for, since it actually repairs the data, but hey, that could be even more useful depending on what you want to do ;-)
Thread starts at:
Posted Dec 13, 2015 16:09 UTC (Sun)
by itvirta (guest, #49997)
[Link]
Posted Dec 7, 2015 21:56 UTC (Mon)
by Yenya (subscriber, #52846)
[Link] (5 responses)
Posted Dec 7, 2015 22:53 UTC (Mon)
by pizza (subscriber, #46)
[Link] (4 responses)
Posted Dec 8, 2015 7:36 UTC (Tue)
by Yenya (subscriber, #52846)
[Link] (3 responses)
Posted Dec 8, 2015 13:27 UTC (Tue)
by pizza (subscriber, #46)
[Link] (2 responses)
In fifteen years of using 3Ware RAID cards, for example, I've never had a single controller-induced failure, or data loss that wasn't the result of blatant operator error (or multiple drive failure..) My experience with the DAC960/1100 series was similar (though I did once have a controller fail; no data loss once it was swapped). I've even performed your described failure scenario multiple times. Even in the day of PATA/PSCSI, hot-swapping (and hot spares) just worked with those things.
(3Ware cards, the DAC family, and a couple of the Dell PERC adapters were the only ones I had good experiences with; the rest were varying degrees of WTF-to-outright-horror. Granted, my experience is now about five years out of date..)
Meanwhile, The supermicro-based server next to me actually *locked up* two days ago when I attempted to swap a failing eSATA-attached drive used for backups.
But my specific comment about robustness is that you can easily end up with an unbootable system if the wrong drive fails on an MDRAID array that contains /boot. And if you don't put /boot on an array, you end up in the same position. (to work around this, I traditionally put /boot on a PATA CF or USB stick, which I regularly imaged and backed up so I could immediately swap in a replacement)
FWIW I retired the last of those MDRAID systems about six months ago.
Posted Dec 8, 2015 21:34 UTC (Tue)
by Yenya (subscriber, #52846)
[Link] (1 responses)
Posted Dec 9, 2015 17:04 UTC (Wed)
by pizza (subscriber, #46)
[Link]
But I don't use 3Ware cards for RAID5 write performance, I use them for reliability/robustness for bulk storage that is nearly always read loads. (If write performance mattered that much, I'd use sufficient disks for RAID10; RAID5/6 is awful)
Posted Dec 24, 2021 18:16 UTC (Fri)
by snnn (guest, #155862)
[Link]
While the write-back mode can increase performance, it reduces reliability, because the journal device can't be an MD RAID array.
And I think the code isn't stable yet. We saw kernel hangs when the RAID array was doing a sync while it had a write journal and was under heavy read/write load during the sync. Such problems were also reported to the Linux RAID mailing list by other users.
The driver can specify a number of minor numbers to use for partition block devices (arg to "alloc_disk()"), and can set a flag (GENHD_FL_EXT_DEVT) to dynamically support extra partitions.
In both cases a similar effect can be achieved using the "kpartx" tool which reads the partition table and creates dm-linear mappings.
That is no longer true (as of 3.1-rc2) for loop. The ioctl that sets up a loop device takes a flag (LO_FLAGS_PARTSCAN) that tells the driver to set up partitions with names like loopXpY, where loopX is the base loop device and Y is the partition number on that device. I believe there were ways to do that prior that involved setting the max_part module parameter and ways to do it dynamically after the loop device is created (so that fdisk and friends can Do The Right Thing).
You can get losetup to pass the flag by using the -P option.
especially given that an SSD can easily be 4x the size of that 64 GB RAM thingy. Putting all that RAM on the motherboard might be different, though.
> write-once-access-never archival storage...
Sounds like an excellent application for the Signetics 25000 Series 9C46XN. An underrated chip that never got near enough use.
is the *only* gain the closed RAID5 write hole?
but after future work it will also improve RAID5/6 write latency.
If the two copies of a block on a RAID1 differ, then both are equally current.
The appropriate fix here would *not* be a journal, but would be to read all blocks in parallel when reading from a region that is not known to be in-sync, and then writing out an arbitrary candidate block to all devices which contained a different value.
> The appropriate fix here would *not* be a journal, but would be to read all blocks in parallel when reading from a region that is not known to be in-sync, and then writing out an arbitrary candidate block to all devices which contained a different value.
Not quite: your comments re consistency still apply. It should attempt to write out a candidate block from a consistently-chosen device (e.g., the first) to the other one. It could also note that any blocks it read in this way which were beyond the in-sync region were now considered in-sync so don't need to be resynced or 'multiread' again, if that's not too expensive -- it's probably impractical on devices without a write-intent bitmap.
and/or make sure all the member paths are configured 100% write-through, then this quote will be true, right?
and all of its downstream RAID/JBOD boxes (the HBA's member disks) will correctly handle flush commands from MD/RAID5 seems to be a necessary step before using the MD journal to achieve zero corruption, right?
Is there any redundancy while the data resides in the journal?
md-journal is an N-mirror RAID1 whose members are a hybrid of NVMe and ramdisk devices
MD/RAID5 vs. more "intelligent" media aggregation schemes
My impression is that having more layers understand the true geometry of the issue will help me get better performance (and keep the same reliability). I won't talk much about systems that go all the way up to the filesystem, such as BTRFS and ZFS, as I'm quite new at them, but LVM2 provides more or less the same basic functionality with quite enough flexibility and is IMO better suited than RAID for most tasks.
Now, I am more than ready to admit there is a flaw in my reasoning, and would love to understand where it lies. Any takers?
LVM2 provides some very useful functionality, but it doesn't really provide security in the face of device failure.
There is a "dm-raid1" module which can provide basic RAID1 functionality, but I get the impression that it is primarily focused at implementing "pvmove" - a very important function but not really about data reliabilty.
There is also the "dm-raid" module which is a wrapper around the MD/RAID code to provide a consistent interface for the lvm2 tools to use.
But LVM2 does lots of other useful things (thin provisioning, volume resizing, crypto, etc etc) which are quite orthogonal to RAID.
There are strong reasons relating to performance and functionality to encourage vertical integration. Having the user-facing filesystem access all storage devices directly without LVM or RAID in between can bring real benefits. A filesystem which knows about the RAID5 layout, uses copy-on-write to only ever write to areas of devices that were not in use, and always writes full stripes (zero-padding if needed) would not get any benefit at all from the RAID5 journal.
[lxc_mail_mimage_0] baktun iwi-aom--- 65.00g
[lxc_mail_mimage_1] baktun iwi-aom--- 65.00g
[lxc_mail_mlog] baktun lwi-aom--- 4.00m
Or is there still a difference between a raid type lvm module (not the mirrored type) and mdadm raid + lvm on top?
> Array components remain directly accessible. Thus, it is easy to run periodic badblocks scans on all your disks, regardless of whether they're part of an array or not. I guess this gives you half of a poor-man's scrubbing, but any real hardware RAID will provide automatic scrubbing in any case.

You can't do anything with the output of badblocks: even if it finds some, because md doesn't know there's a bad block there it's not going to do any of the things it would routinely do when a bad block is found (starting by rewriting it from the others to force the drive to spare it out, IIRC). All you can do is start a real scrub -- in which case why didn't you just run one in the first place? Set the min and max speeds right and it'll be a lot less disruptive to you disk-load-wise too.
(Or any block device, but being able to read from a mirror or parity if the data is corrupted would be nice.)
https://lkml.org/lkml/2015/11/4/772
Block device checksumming
But everything I can find tells me that it's a read-only target, which isn't really what one wants for general use.
First, the write-through mode is for increasing data safety, not performance. The problem it tries to fix, the write hole, isn't common. Thus you don't need this feature. RAID isn't a backup; it doesn't need to provide 100% data safety. It is meant to reduce downtime in the most common scenarios. So the extra gain from adding a RAID journal is small.