drives have been assigned one chunk. This collection of chunks forms a
.BR stripe .
Further chunks are gathered into stripes in the same way, and are
assigned to the remaining space in the drives.

If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts
collecting chunks into smaller stripes that only span the drives which
still have remaining space.
A bug was introduced in Linux 3.14 which changed the layout of blocks in
a RAID0 beyond the region that is striped over all devices. This bug
does not affect an array with all devices the same size, but can affect
other RAID0 arrays.

Linux 5.4 (and some stable kernels to which the change was backported)
will not normally assemble such an array as it cannot know which layout
to use. There is a module parameter "raid0.default_layout" which can be
set to "1" to force the kernel to use the pre-3.14 layout or to "2" to
force it to use the 3.14-and-later layout. When creating a new RAID0
array,
.I mdadm
will record the chosen layout in the metadata in a way that allows newer
kernels to assemble the array without needing a module parameter.

To assemble an old array on a new kernel without using the module parameter,
use either the
.B "--update=layout-original"
option or the
.B "--update=layout-alternate"
option.

Once you have updated the layout you will not be able to mount the array
on an older kernel. If you need to revert to an older kernel, the
layout information can be erased with the
.B "--update=layout-unspecified"
option. If you use this option to
.B --assemble
while running a newer kernel, the array will NOT assemble, but the
metadata will be updated so that it can be assembled on an older kernel.

Note that setting the layout to "unspecified" removes protections against
this bug, and you must be sure that the kernel you use matches the
layout of the array.
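.P
For example, assuming an affected two-device array assembled from
/dev/sda1 and /dev/sdb1 as /dev/md0 (device names are illustrative only),
either of the following approaches could be used:
.P
.nf
    # on the kernel command line, force the 3.14-and-later layout:
    raid0.default_layout=2

    # or record the layout in the metadata once, via mdadm:
    mdadm --assemble /dev/md0 --update=layout-alternate /dev/sda1 /dev/sdb1
.fi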
.SS RAID1
A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data. Changes are written to all devices in parallel. Data is
read from any one device. The driver attempts to distribute read
requests across all devices to maximise performance.
that succeeds, the address will be removed from the list.

This allows an array to fail more gracefully - a few blocks on
different devices can be faulty without taking the whole array out of
action.

The list is particularly useful when recovering to a spare. If a few blocks
cannot be read from the other devices, the bulk of the recovery can
complete and those few bad blocks will be recorded in the bad block list.
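.P
As an illustration, the bad block list of an individual member device can
be inspected through that device's directory under the array's
.I sysfs
tree (the array and member names below are placeholders):
.P
.nf
    cat /sys/block/md0/md/dev-sda1/bad_blocks
.fi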
.SS RAID WRITE HOLE
Due to the non-atomic nature of RAID write operations, an interruption of
write operations (system crash, etc.) to a RAID456 array can lead to
inconsistent parity and data loss (the so-called RAID-5 write hole).

To plug the write hole, md supports two mechanisms, described below.

.TP
DIRTY STRIPE JOURNAL
From Linux 4.4, md supports a write-ahead journal for RAID456.
When the array is created, an additional journal device can be added to
the array through the
.I write-journal
option. The RAID write journal works similarly to file system journals.
Before writing to the data disks, md persists data AND parity of the
stripe to the journal device. After crashes, md searches the journal
device for incomplete write operations, and replays them to the data
disks.

When the journal device fails, the RAID array is forced to run in
read-only mode.
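.IP
For example, a journaled RAID5 array might be created with a command such
as the following (device names are illustrative only):
.IP
.nf
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \e
          --write-journal=/dev/sdd1 /dev/sda1 /dev/sdb1 /dev/sdc1
.fi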
.TP
PARTIAL PARITY LOG
From Linux 4.12, md supports a Partial Parity Log (PPL) for RAID5 arrays only.
Partial parity for a write operation is the XOR of the stripe data chunks not
modified by the write. PPL is stored in the metadata region of the RAID member
drives, so no additional journal drive is needed.

After crashes, if one of the unmodified data disks of
the stripe is missing, this updated parity can be used to recover its
data.

This mechanism is documented more fully in the file
Documentation/md/raid5-ppl.rst
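.IP
As an illustration, recent versions of
.I mdadm
can enable PPL at creation time through the consistency-policy option
(device names below are placeholders):
.IP
.nf
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \e
          --consistency-policy=ppl /dev/sda1 /dev/sdb1 /dev/sdc1
.fi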
.SS WRITE-BEHIND
From Linux 2.6.14,
.I md
supports WRITE-BEHIND on RAID1 arrays.

This allows certain devices in the array to be flagged as
.IR write-mostly .
MD will only read from such devices if there is no
other option.
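.P
For example, a mirror whose second member is a slow device might be created
with that member flagged as write-mostly (device names are illustrative
only):
.P
.nf
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \e
          /dev/sda1 --write-mostly /dev/sdb1
.fi
.P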
Each block device appears as a directory in
.I sysfs
(which is usually mounted at
.BR /sys ).
For MD devices, this directory will contain a subdirectory called
.B md
which contains various files for providing access to information about
the array.
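.P
For example, for an array named md0 the attributes described below would
normally be found under the following directory:
.P
.nf
    /sys/block/md0/md
.fi
.P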
This interface is documented more fully in the file
.B Documentation/admin-guide/md.rst
which is distributed with the kernel sources. That file should be
consulted for full documentation. The following are just a selection
of attribute files that are available.
.TP
.B md/sync_speed_min
This value, if set, overrides the system-wide setting in
.B /proc/sys/dev/raid/speed_limit_min
for this array only.
Writing the value
.B "system"
to this file will cause the system-wide setting to have effect.
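.IP
For example, assuming the array is md0 (the value takes the same units as
the system-wide setting):
.IP
.nf
    echo 50000 > /sys/block/md0/md/sync_speed_min
.fi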
.TP
.B md/preread_bypass_threshold
This is only available on RAID5 and RAID6. This variable sets the
number of times MD will service a full-stripe-write before servicing a
stripe that requires some "prereading". For fairness this defaults to
1. Valid values are 0 to stripe_cache_size. Setting this to 0
maximizes sequential-write throughput at the cost of fairness to threads
doing small or random writes.
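.IP
For example, to maximize sequential-write throughput on an array assumed to
be md0:
.IP
.nf
    echo 0 > /sys/block/md0/md/preread_bypass_threshold
.fi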
.TP | ||||
.B md/bitmap/backlog | ||||
The value stored in this file only has an effect on RAID1 when write-mostly
devices are active and write requests to those devices are processed in the
background.

This variable sets a limit on the number of concurrent background writes.
The valid values are 0 to 16383; 0 means that write-behind is not allowed,
while any other number means it can happen. If there are more write requests
than this number, new writes will be synchronous.
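.IP
For example, to allow up to 256 concurrent background writes on an array
assumed to be md0:
.IP
.nf
    echo 256 > /sys/block/md0/md/bitmap/backlog
.fi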
.TP | ||||
.B md/bitmap/can_clear | ||||
This is for externally managed bitmaps, where the kernel writes the bitmap | ||||
itself, but metadata describing the bitmap is managed by mdmon or similar. | ||||
When the array is degraded, bits mustn't be cleared. When the array becomes | ||||
optimal again, bits can be cleared, but first the metadata needs to record
the current event count. So md sets this to 'false' and notifies mdmon, | ||||
then mdmon updates the metadata and writes 'true'. | ||||
There is no code in mdmon to actually do this, so maybe it doesn't even | ||||
work. | ||||
.TP | ||||
.B md/bitmap/chunksize | ||||
The bitmap chunksize can only be changed when no bitmap is active, and | ||||
the value should be a power of 2 and at least 512.
.TP | ||||
.B md/bitmap/location | ||||
This indicates where the write-intent bitmap for the array is stored. | ||||
It can be "none" or "file" or a signed offset from the array metadata | ||||
- measured in sectors. You cannot set a file by writing here - that can | ||||
only be done with the SET_BITMAP_FILE ioctl. | ||||
Writing 'none' to 'bitmap/location' will clear the bitmap, and the previous
location value must be written back to it to restore the bitmap.
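.IP
For example, on an array assumed to be md0, the current location can be
noted before the bitmap is cleared and written back later to restore it:
.IP
.nf
    cat /sys/block/md0/md/bitmap/location
    echo none > /sys/block/md0/md/bitmap/location
.fi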
.TP | ||||
.B md/bitmap/max_backlog_used | ||||
This keeps track of the maximum number of concurrent write-behind requests
for an md array; writing any value to this file will clear it.
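.IP
For example, to read and then reset the counter on an array assumed to be
md0:
.IP
.nf
    cat /sys/block/md0/md/bitmap/max_backlog_used
    echo 0 > /sys/block/md0/md/bitmap/max_backlog_used
.fi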
.TP | ||||
.B md/bitmap/metadata | ||||
This can be 'internal' or 'clustered' or 'external'. 'internal' is set
by default, which means the bitmap metadata is stored in the first 256
bytes of the bitmap space. 'clustered' means separate bitmap metadata is
used for each cluster node. 'external' means that the bitmap metadata is
managed externally to the kernel.
.TP | ||||
.B md/bitmap/space | ||||
This shows the space (in sectors) which is available at md/bitmap/location,
and allows the kernel to know when it is safe to resize the bitmap to match
a resized array. It should be big enough to contain the total bytes in the
bitmap.

For 1.0 metadata, it is assumed that the space up to the superblock can be
used if the bitmap is stored before the superblock, otherwise up to 4K
beyond the superblock. For other metadata versions, it is assumed that no
change is possible.
.TP | ||||
.B md/bitmap/time_base | ||||
This shows the time (in seconds) between disk flushes, and is used when
looking for bits in the bitmap that can be cleared.

The default value is 5 seconds, and it should be an unsigned long value.
.SS KERNEL PARAMETERS
The md driver recognises several different kernel parameters.
.TP
.B raid=noautodetect
This will disable the normal detection of md arrays that happens at
boot time. If a drive is partitioned with MS-DOS style partitions,
then if any of the 4 main partitions has a partition type of 0xFD,
then that partition will normally be inspected to see if it is part of
an MD array, and if any full arrays are found, they are started. This