"Fossies" - the Fresh Open Source Software Archive

Member "aoe-aoe6-86/EtherDrive-2.6-HOWTO.sgml" (4 Jul 2015, 60421 Bytes) of archive /linux/misc/aoe-aoe6-86.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) XML source code syntax highlighting (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

    1 <!doctype linuxdoc system>
    2 <!-- This document is the SGML "linuxdoc" flavor described in the 
    3      "Howtos-with-LinuxDoc-mini-HOWTO", found at the following URL.
    4 
    5      http://www.tldp.org/HOWTO/Howtos-with-LinuxDoc.html
    6 
    7      This HOWTO was originally written by Sam Hopkins and is currently
    8      maintained by Ed L. Cashin.
    9 -->
   10 
   11 <article>
   12 <title>EtherDrive&reg; storage and Linux 2.6
   13 <!-- a technical "How To Guide" -->
   14 <author>Sam Hopkins and Ed L. Cashin <tt/{sah,ecashin}@coraid.com/
   15 <date>April 2008
   16 
   17 <abstract>
   18 
   19 Using network data storage with <url url="http://www.coraid.com/documents/AoEr10.txt"
   20 name="ATA over Ethernet"> is easy after understanding a few
   21 simple concepts.
   22 This document explains how to use AoE targets from a Linux-based
   23 Operating System, but the basic principles are applicable to other
   24 systems 
   25 that use AoE devices.  Below we begin by explaining the
   26 key components of the network
   27 communication method, ATA over Ethernet (AoE).  Next, we discuss the
way a Linux host uses AoE devices, providing several examples.
A list of frequently asked questions follows, and the document ends
with supplementary information.
   32 
   33 </abstract>
   34 
   35 <toc>
   36 
   37 <sect>The EtherDrive System
   38 
   39 <p>
   40 The ATA over Ethernet network protocol allows any type of data
   41 storage to be used over a local ethernet network.  An "AoE target"
   42 receives ATA read and write commands, executes them, and returns
   43 responses to the "AoE initiator" that is using the storage.
   44 
   45 These AoE commands and responses appear on the network as ethernet
   46 frames with type 0x88a2, the IANA registered Ethernet type for <url
   47 url="http://www.coraid.com/documents/AoEr10.txt" name="ATA over
   48 Ethernet (AoE)">.  An AoE target is identified by a pair of numbers:
   49 the shelf address, and the slot address.
   50 
   51 For example, the Coraid SR appliance can perform RAID internally on
   52 its SATA disks, making the resulting storage capacity available on the
   53 ethernet network as one or more AoE targets.  All of the targets will
   54 have the same shelf address because they are all exported by the same
   55 SR.  They will have different AoE slot addresses, so that each AoE
   56 target is individually addressable.  The SR documentation calls each
   57 target a "LUN".  Each LUN behaves like a network disk.
   58 
   59 Using EtherDrive technology like the SR appliance is as simple as
   60 sending and receiving AoE packets.
   61 
   62 To a Linux-based system running the "aoe" driver, it doesn't matter
   63 what the remote AoE device really is.  All that matters is that the
   64 AoE protocol can be used to communicate with a device identified by a
   65 certain shelf and slot address.
   66 
   67 <sect>How Linux Uses The EtherDrive System
   68 <p>
   69 For security and performance reasons, many people use a second,
   70 dedicated network
   71 interface card (NIC) for ATA over
   72 Ethernet traffic.  
   73 
   74 A NIC must be up before it can perform any networking, including AoE.
   75 On examining the output of the <tt>ifconfig</tt> command, you should
   76 see your AoE NIC listed as "UP" before attempting to use an AoE device
   77 reachable via that NIC.
   78 
   79 You can <bf>activate the NIC</bf> with a simple <tt>ifconfig eth1
   80 up</tt>, using the appropriate device name instead of "eth1".  Note
   81 that assigning an IP address is not necessary if the NIC is being used
   82 only for AoE traffic, but having an IP address on a NIC used for AoE
   83 will not interfere with AoE.
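
For example, a minimal sketch, assuming the dedicated AoE interface is
eth1:

<tscreen><verb>
ifconfig eth1 up           # activate the NIC; no IP address is needed
ifconfig eth1 | grep UP    # the interface should now report UP
</verb></tscreen>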
   84 
On a Linux system, block devices are used via special files called
device nodes.  A familiar example is <tt>/dev/hda</tt>.  When a block
device node is opened and used, the kernel translates operations on
the file into operations on the corresponding hardware.
   89 
   90 Each accessible AoE target on your network is represented by a disk
   91 device node in the <tt>/dev/etherd/</tt> directory and can be used
   92 just like any other direct attached disk.  The "aoe" device driver is
an open-source loadable kernel module authored by Coraid.  It
translates system reads and writes on a device into AoE request frames
for the associated remote EtherDrive storage device.  When the AoE
responses from the device are received, the corresponding system
read or write call is acknowledged as complete.  The aoe device driver
handles retransmissions in the event of packet loss or network
congestion.
   99 
  100 The association of AoE targets on your network to device nodes in
  101 <tt>/dev/etherd/</tt> follows a simple naming scheme.  Each device
  102 node is named eX.Y, where X represents a shelf address and Y
  103 represents a slot address.  Both X and Y are decimal integers.  As an
  104 example, the following command displays the first 4 KiB of data from
  105 the AoE target with shelf address 0 and slot address 1.
  106 
  107 <tscreen><verb>
  108 dd if=/dev/etherd/e0.1 bs=1024 count=4 | hexdump -C
  109 </verb></tscreen>
  110 
  111 Creating an ext3 filesystem on the same AoE target is as simple
  112 as ...
  113 
  114 <tscreen><verb>
  115 mkfs.ext3 /dev/etherd/e0.1
  116 </verb></tscreen>
  117 
  118 Notice that the filesystem goes directly on the block device.  There's
  119 no need for any intermediate "format" or partitioning step.
  120 
  121 Although partitions are not usually needed, they may be created using
  122 a tool like fdisk or GNU parted.
  123 Please see the <ref id="dospart" name="FAQ entry about partition
  124 tables"> for important caveats.
  125 
  126 Partitions are used by adding "p" and the partition number to
  127 the device name.  For example, <tt>/dev/etherd/e0.3p1</tt> is the
  128 first partition on the AoE target with shelf address zero and slot
  129 address three.
  130 
  131 After creating a filesystem, it can be mounted in the normal way.  It
  132 is important to remember to unmount the filesystem before shutting
  133 down your network devices.  Without networking, there is no way to
  134 unmount a filesystem that resides on a disk across the network.
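
As a sketch, using the filesystem created above and a hypothetical
mount point:

<tscreen><verb>
mkdir /mnt/aoe
mount /dev/etherd/e0.1 /mnt/aoe
# ... use the filesystem ...
umount /mnt/aoe    # do this before shutting down networking
</verb></tscreen>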
  135 
It is best to update your init scripts so that filesystems on
EtherDrive storage are unmounted early in the system-shutdown
procedure, before network interfaces are shut down.
  139 <ref
  140 id="aoeinit" name="An example"> is found below in the <ref id="faq"
  141 name="list of Frequently Asked Questions">.
  142 
  143 The device nodes in <tt>/dev/etherd/</tt> are usually created in one
  144 of three ways:
  145 
  146 <enum>
  147 <item>Most distributions today use udev to dynamically create device nodes
  148 as needed.  You can configure udev to create the device nodes for your
  149 AoE disks.  (For an example of udev
  150 configuration rules, see <ref id="udev" name="Why do my device nodes
  151 disappear after a reboot?"> in the <ref id="faq" name="FAQ section"> below.)
  152 
  153 <item>If you are using the standalone aoe driver, as opposed to the
  154 one distributed with the Linux kernel, and you are not using udev, the
  155 Makefile will create device
  156 nodes for you when you do a "make install".
  157 
<item>If you are not using udev, you can use static device nodes.  Use
the <tt>aoe_dyndevs=0</tt> module load option for the aoe driver.
(You do not need this option if your aoe driver is older than version
aoe6-50.)  Then the <tt>aoe-mkdevs</tt> and <tt>aoe-mkshelf</tt>
scripts in the <url url="http://aoetools.sourceforge.net/"
name="aoetools"> package can be used to create the static device nodes
manually; a sketch follows this list.  It is very important to avoid
using these static device nodes with an aoe driver that has the
aoe_dyndevs module parameter set to 1, because you could accidentally
use the wrong device.
  169 
  170 </enum>
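
A minimal sketch of the static-node approach, assuming shelf 0 is the
only shelf in use:

<tscreen><verb>
modprobe aoe aoe_dyndevs=0    # request static minor device numbers
aoe-mkshelf /dev/etherd 0     # create the e0.* device nodes
</verb></tscreen>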
  171 
  172 <sect>The ATA over Ethernet Tools
  173 <p>
  174 The aoe kernel driver allows Linux to do ATA over Ethernet.  In
  175 addition to the aoe driver, there is a collection of helpful programs
  176 that operate outside of the kernel, in "user space".  This collection
  177 of tools and documentation is called the aoetools, and may be found at
  178 <url 
  179 url="http://aoetools.sourceforge.net/"
  180 name="http://aoetools.sourceforge.net/">.
  181 
  182 Current aoe drivers from the Coraid website are bundled with a
  183 compatible version of the aoetools.  This HOWTO may make reference to
  184 commands from the aoetools, like the aoe-stat command.
  185 
  186 <sect1>Limiting AoE traffic to certain network interfaces
  187 <p>
  188 By default, the aoe driver will use any local network interface
  189 available to reach an AoE target.  Most of the time, though, the
  190 administrator expects legitimate AoE targets to appear only on certain
  191 ethernet interfaces, e.g., "eth1" and "eth2".
  192 
  193 Using the <tt>aoe-interfaces</tt> command from the aoetools package
  194 allows the administrator to limit AoE activity to a set list of
  195 ethernet interfaces.
  196 
  197 This configuration is especially important when some ethernet
  198 interfaces are on networks where an unexpected AoE target with the
  199 same shelf and slot address as a production AoE target might appear.
  200 
  201 Please see the <tt>aoe-interfaces</tt> manpage
  202 for more information.
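
For example, a sketch restricting AoE traffic to two interfaces:

<tscreen><verb>
aoe-interfaces eth1 eth2
</verb></tscreen>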
  203 
  204 At module load time the list of allowable interfaces may be set with
  205 the "aoe_iflist" module parameter.
  206 
  207 <tscreen><verb>
  208 modprobe aoe 'aoe_iflist=eth2 eth3'
  209 </verb></tscreen>
  210 
  211 <sect>EtherDrive storage and Linux Software RAID
  212 <p>
  213 Some AoE devices are internally redundant.  A Coraid SR1521, for example,
  214 might be exporting a 14-disk RAID 5 as a single 9.75 terabyte LUN.
  215 In that case, the AoE target itself is performing RAID, enhancing
  216 performance and reliability.
  217 
  218 You can also perform RAID on the AoE initiator.  Linux Software RAID
  219 can increase performance by striping over multiple AoE targets and
  220 reliability by using data redundancy.  Reading the <url
  221 url="http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html" name="Linux
  222 Software RAID HOWTO"> before you start to work with RAID will likely
save time in the long run.  The Linux
kernel has an "md" driver that performs the Software RAID, and there
are several toolsets that allow you to use this kernel feature.
  228 
  229 The main software package for using the md driver is <url
  230 url="http://www.cse.unsw.edu.au/~neilb/source/mdadm/" name="mdadm">.
  231 Less popular alternatives include the older raidtools package <ref
  232 id="archives" name="(discussed in the Archives below)">, and <url
  233 url="http://evms.sourceforge.net/" name="EVMS">.
  234 
  235 <sect1>Example: RAID 5 with mdadm
  236 <p>
  237 In this example we have five AoE targets in shelves 0-4, with each
  238 shelf exporting a single LUN 0.  The following mdadm command uses these five
  239 AoE devices as RAID components, creating a level-5 RAID array.  The md
  240 configuration information is stored on the components themselves in
  241 "md superblocks", which can be examined with another mdadm command.
  242 
  243 <tscreen><verb>
  244 # mdadm -C -n 5 --level=raid5 --auto=md /dev/md0 /dev/etherd/e[0-4].0
  245 mdadm: array /dev/md0 started.
  246 # mdadm --examine /dev/etherd/e0.0
  247 /dev/etherd/e0.0:
  248           Magic : a92b4efc
  249         Version : 00.90.00
  250            UUID : 46079e2f:a285bc60:743438c8:144532aa (local to host ellijay)
  251 ...
  252 </verb></tscreen>
  253 
  254 <p>
The <tt>/proc/mdstat</tt> file contains summary information about the
  256 RAID as reported by the kernel itself.
  257 
  258 <tscreen><verb>
  259 # cat /proc/mdstat 
  260 Personalities : [raid5] [raid4] 
  261 md0 : active raid5 etherd/e4.0[5] etherd/e3.0[3] etherd/e2.0[2] etherd/e1.0[1] etherd/e0.0[0]
  262       5860638208 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
  263       [>....................]  recovery =  0.0% (150272/1465159552) finish=23605.3min speed=1032K/sec
  264       
  265 unused devices: <none>
  266 </verb></tscreen>
  267 
  268 Until md finishes initializing the parity of the RAID, performance is
  269 sub-optimal, and the RAID will not be usable if one of the components
  270 fails during initialization.  After initialization is complete, the md
  271 device can continue 
  272 to be used even if one component fails.
  273 
  274 Later the array can be stopped in order to shut it down cleanly in
  275 preparation for a system reboot or halt.
  276 
  277 <tscreen><verb>
  278 # mdadm -S /dev/md0
  279 </verb></tscreen>
  280 
  281 In a system init script (see <ref id="aoeinit" name="the aoe-init
  282 example in the FAQ">) an mdadm command can assemble the RAID
  283 components using the configuration information that was stored on them
  284 when the RAID was created.
  285 
  286 <tscreen><verb>
  287 # mdadm -A /dev/md0 /dev/etherd/e[0-4].0
  288 mdadm: /dev/md0 has been started with 5 drives.
  289 </verb></tscreen>
  290 
  291 To make an xfs filesystem on the RAID array and mount it, the
  292 following commands can be issued:
  293 
  294 <tscreen><verb>
  295 # mkfs -t xfs /dev/md0
  296 # mkdir /mnt/raid
  297 # mount /dev/md0 /mnt/raid
  298 </verb></tscreen>
  299 
  300 Once md has finished initializing the RAID, the storage is
  301 single-fault tolerant: Any of the components can fail without making
  302 the storage unavailable.  Once a single component has failed, the md
  303 device is said to be in a "degraded" state.  Using a degraded array is
  304 fine, but a degraded array cannot remain usable if another component
  305 fails.
  306 
  307 Adding hot spares makes the array even more robust.  Having hot spares
  308 allows md to bring a new component into the RAID as soon as one of its
  309 components has failed so that the normal state may be achieved as
  310 quickly as possible.  You can check <tt>/proc/mdstat</tt> for
  311 information on the initialization's progress.
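
For example, a hypothetical sixth target, e5.0, could be added to the
array created above as a hot spare:

<tscreen><verb>
mdadm --add /dev/md0 /dev/etherd/e5.0    # becomes a spare on a healthy array
</verb></tscreen>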
  312 
  313 The new write-intent bitmap feature can dramatically reduce the time
  314 needed for re-initialization after a component fails and is later
  315 added back to the array.  Reducing the time the RAID spends in
  316 degraded mode makes a double fault less likely.  Please see the mdadm
  317 manpages for details.
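
For example, an internal write-intent bitmap can be added to an
existing array (a sketch; check your mdadm version's manpage):

<tscreen><verb>
mdadm --grow --bitmap=internal /dev/md0
</verb></tscreen>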
  318 
  319 <sect1>Important notes
  320 <p>
  321 
  322 <enum>
  323 
  324 <item>Some Linux distributions come with an mdmonitor service running
  325 by default.  Unless you configure the mdmonitor to do what you want,
  326 consider turning off this service with <tt>chkconfig mdmonitor
  327 off</tt> and <tt>/etc/init.d/mdmonitor stop</tt> or your system's
  328 equivalent commands.  If mdadm is running in its "monitor" mode
  329 without being properly configured, it may interfere with failover to
  330 hot spares, the stopping of the RAID, and other actions.
  331 
  332 <item>There is a problem with the way some 2.6 kernels determine
  333 whether an I/O device is idle.  On these kernels, RAID initialization
  334 is about five times slower than it needs to be.
  335 
  336 On these kernels you can do the following to work around the problem:
  337 
  338 <tscreen><verb>
  339 echo 100000 > /proc/sys/dev/raid/speed_limit_max
  340 echo 100000 > /proc/sys/dev/raid/speed_limit_min
  341 </verb></tscreen>
  342 
  343 </enum>
  344 
  345 <sect>FAQ (contains important info)<label id="faq">
  346 <p>
  347 
  348 <sect1>Q: How does the system know about the AoE targets on the network?
  349 <p>
  350 A: When an AoE target comes online, it emits a broadcast
  351     frame indicating its presence.  In addition to this mechanism, 
  352     the AoE initiator may send out a query frame to discover
  353     any new AoE targets.
  354 
  355     The Linux aoe driver, for example, sends an
  356     AoE query once per minute.  The discovery can be triggered
  357     manually with the "aoe-discover" tool, one of the
  358     <url url="http://aoetools.sourceforge.net/" name="aoetools">.
  359 
  360 <sect1>Q: How do I see what AoE devices the system knows about?
  361 <p>
  362 A: The /usr/sbin/aoe-stat program (from the <url
  363     url="http://aoetools.sourceforge.net/" name="aoetools">) lists the devices
  364     the system considers valid.  It also displays the
  365     status of the device (up or down).  For example:
  366  
  367 <tscreen><verb>
  368 root@makki root# aoe-stat
  369       e0.0     10995.116GB   eth0 up            
  370       e0.1     10995.116GB   eth0 up            
  371       e0.2     10995.116GB   eth0 up            
  372       e1.0      1152.874GB   eth0 up            
  373       e7.0       370.566GB   eth0 up
  374 </verb></tscreen>
  375 
  376 <sect1>Q: What is the "closewait" state?
  377 <p>
  378 A: The "down,closewait" status means that the device went down but at
  379 least one process still has it open.  After all processes close the
  380 device, it will become "up" again if it the remote AoE device is
  381 available and ready.
  382 
  383 The user can also use the "aoe-revalidate" command to manually cause
  384 the aoe driver to query the AoE device.  If the AoE device is
  385 available and ready, the device state on the Linux host will change
  386 from "down,closewait" to "up".
  387 
  388 <sect1>Q: How does the system know an AoE device has failed?
  389 <p>
  390 A: When an AoE target cannot complete a requested command it will
  391     indicate so in the response to the failed request.
  392     The Linux aoe driver will mark the AoE device as failed upon
  393     reception of such a response.  In addition, if an AoE target
  394     has not responded to a prior request within a default
  395     timeout (currently three minutes) the aoe driver will fail
  396     the device.
  397 
  398 <sect1>Q: How do I take an AoE device out of the failed state?
  399 <p>
  400 A: If the aoe driver shows the device state to be "down", first
  401 check the EtherDrive storage itself and the AoE network.  Once any
  402 problem has been rectified, you can use the "aoe-revalidate" command
  403 from the <url
  404     url="http://aoetools.sourceforge.net/" name="aoetools"> to ask
  405     the aoe driver to change the state back to "up".
  406 
  407 <p>
If the Linux Software RAID driver has marked the
device as "failed" (so that an "F" shows up in the output of "cat
/proc/mdstat"), then you first need to remove the device from the RAID
using mdadm.  Next you add the device back to the array with mdadm.
  414 
  415 <p>
  416 An example follows, showing how (after manually failing e10.0) the
  417 device is removed from the array and then added back.  After adding
  418 it back to the RAID, the md driver begins rebuilding the redundancy of
  419 the array.
  420 
  421 <tscreen><verb>
  422 root@kokone ~# cat /proc/mdstat
  423 Personalities : [raid1] [raid5] 
  424 md0 : active raid1 etherd/e10.1[1] etherd/e10.0[0]
  425       524224 blocks [2/2] [UU]
  426       
  427 unused devices: <none>
  428 root@kokone ~# mdadm --fail /dev/md0 /dev/etherd/e10.0
  429 mdadm: set /dev/etherd/e10.0 faulty in /dev/md0
  430 root@kokone ~# cat /proc/mdstat
  431 Personalities : [raid1] [raid5] 
  432 md0 : active raid1 etherd/e10.1[1] etherd/e10.0[2](F)
  433       524224 blocks [2/1] [_U]
  434       
  435 unused devices: <none>
  436 root@kokone ~# mdadm --remove /dev/md0 /dev/etherd/e10.0
  437 mdadm: hot removed /dev/etherd/e10.0
  438 root@kokone ~# mdadm --add /dev/md0 /dev/etherd/e10.0
  439 mdadm: hot added /dev/etherd/e10.0
  440 root@kokone ~# cat /proc/mdstat
  441 Personalities : [raid1] [raid5] 
  442 md0 : active raid1 etherd/e10.0[2] etherd/e10.1[1]
  443       524224 blocks [2/1] [_U]
  444       [=>...................]  recovery =  5.0% (26944/524224) finish=0.6min speed=13472K/sec
  445 unused devices: <none>
  446 root@kokone ~# 
  447 </verb></tscreen>
  448 
  449 <sect1>Q: How can I use LVM with my EtherDrive storage?
  450 <p>
  451 A: With older <url url="http://sources.redhat.com/lvm2/"
  452 name="LVM2"> releases, you may need to edit
  453 lvm.conf, but the current version of LVM2 supports AoE
  454 devices "out of the box".
  455 
  456 You can also create md devices from your aoe devices and tell LVM to
  457 use the md devices.
  458 
It's necessary to understand LVM itself in order to use AoE devices
with LVM.  Besides the manpages for the LVM commands, the <url
url="http://tldp.org/HOWTO/LVM-HOWTO/" name="LVM HOWTO"> is a big help
if you are starting out with LVM.
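
As a sketch, creating a volume group and logical volume on one AoE
device (the names "aoevg" and "vol0" are hypothetical):

<tscreen><verb>
pvcreate /dev/etherd/e0.1
vgcreate aoevg /dev/etherd/e0.1
lvcreate -L 100G -n vol0 aoevg
mkfs.ext3 /dev/aoevg/vol0
</verb></tscreen>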
  463 
  464 If you have an old LVM2 that does not already detect and work with AoE
  465 devices, you can add this line to the "devices" block of your
  466 lvm.conf.
  467 
  468 <tscreen><verb>
  469 types = [ "aoe", 16 ]
  470 </verb></tscreen>
  471 
  472 If you are creating physical volumes out of RAIDs over EtherDrive
  473 storage, make sure to turn on md component detection so that LVM2
  474 doesn't go snooping around on the underlying EtherDrive disks.
  475 
  476 <tscreen><verb>
  477 md_component_detection = 1
  478 </verb></tscreen>
  479 
  480 The snapshots feature in LVM2 did not work in early 2.6 kernels.
  481 Lately, Coraid customers have reported success using snapshots on
  482 AoE-backed logical volumes when using a recent kernel and aoe driver.
  483 Older aoe drivers, like version 22, may need <url
  484 url="https://bugzilla.redhat.com/attachment.cgi?id=311070" name="a
  485 fix"> to work correctly with snapshots.
  486 
  487 Customers have reported data corruption and kernel panics when using
  488 striped logical volumes (created with the "-i" option to lvcreate)
  489 when using aoe driver versions prior to aoe6-48.  No such problems
  490 occur with normal logical volumes or with Software RAID's striping
  491 (RAID 0).
  492 
Most systems have boot scripts that try to detect LVM physical volumes
early in the boot process, before AoE devices are available.  When
working with LVM, you may need to help LVM recognize AoE devices
that are physical volumes by running vgscan after loading the aoe
module.
  498 
  499 There have been reports that partitions can interfere with LVM's
  500 ability to use an AoE device as a physical volume.  For example, with
  501 partitions e0.1p1 and e0.1p2 residing on e0.1, <tt>pvcreate /dev/etherd/e0.1</tt> might
  502 complain,
  503 
  504 <tscreen><verb>
  505 Device /dev/etherd/e0.1 not found.
  506 </verb></tscreen>
  507 
  508 Removing the partitions allows LVM to create a physical volume from
  509 e0.1.
  510 
  511 <sect1>Q: I get an "invalid module format" error on modprobe.  Why?
  512 <p>
A: The aoe module and the kernel must be built to match one another.
On module load, the kernel version, SMP support (yes or no), the
compiler version, and the target processor must be the same for the
module as they were when the kernel was built.
  517 
  518 <sect1>Q: Can I allow multiple Linux hosts to use a filesystem that is on my EtherDrive storage?
  519 <p>
  520 A: Yes, but you're now taking advantage of the flexibility of
  521 EtherDrive storage, using it like a SAN.  Your software
  522 must be "cluster aware", like <url
  523 url="http://sources.redhat.com/cluster/gfs/" name="GFS">.  Otherwise,
  524 each host will assume 
  525 it is the sole user of the filesystem and data corruption will
  526 result. 
  527 
  528 <sect1>Q: Can you give me an overview of GFS and related software?
  529 <p>
  530 A: Yes, here's a brief overview.
  531 
  532 <sect2>Background
  533 <p>
  534   GFS is a scalable, journaled filesystem designed to be used by more
  535   than one computer at a time.  There is a separate journal for each
  536   host using the filesystem.  All the hosts working together are
  537   called a cluster, and each member of the cluster is called a cluster
  538   node.
  539 <p>
  To achieve acceptable performance, each cluster node remembers what
  was on the block device the last time it looked.  This is caching,
  where copies of data in RAM are used temporarily instead of data
  read directly from the block device.
  544 <p>
  To avoid chaos, the data in the RAM cache of every cluster node has
  to match what's on the block device.  The cluster nodes communicate
  over TCP/IP to agree on who is in the cluster and who has the right
  to use a particular part of the shared block device.
  550 
  551 <sect2>Hardware
  552 <p>
  553   To allow the cluster nodes to control membership in the cluster and
  554   to control access to the shared block storage, "fencing" hardware
  555   can be used.
  556 <p>
  557   Some network switches can be dynamically configured to turn single
  558   ports on and off, effectively fencing a node off from the rest of
  559   the network.
  560 <p>
  561   Remote power switches can be told to turn an outlet off, powering a
  562   cluster node down, so that it is certainly not accessing the shared
  563   storage.
  564 
  565 <sect2>Software
  566 <p>
  567   The RedHat Cluster Suite developers have created several pieces of
  568   software besides the GFS filesystem itself to allow the cluster
  569   nodes to coordinate cluster membership and to control access to the
  570   shared block device.
  571 <p>
  572   These parts are listed here, on the GFS Project Page.
  573 <p>
  574    <url url="http://sources.redhat.com/cluster/gfs/" name=" http://sources.redhat.com/cluster/gfs/">
  575 <p>
  576   GFS and its related software are undergoing continuous heavy
  577   development and are maturing slowly but steadily.
  578 <p>
  As might be expected, the developers working for RedHat target
  580   RedHat Enterprise Linux as the ultimate platform for GFS and its
  581   related software.  They also use Fedora Core as a platform for
  582   testing and innovation.
  583 <p>
  584   That means that when choosing a distribution for running GFS, recent
  585   versions of Fedora Core, RedHat Enterprise Linux (RHEL), and RHEL
  586   clones like CentOS should be considered.  On these platforms, RPMs
  587   are available that have a good chance of working "out of the box."
  588 <p>
  589   With a RedHat-based distro like Fedora Core, using GFS means seeking
  590   out the appropriate documentation, installing the necessary RPMs,
  591   and creating a few text files for configuring the software.
  592 <p>
  593   Here is a good overview of what the process is generally like.  Note
  594   that if you're using RPMs, then building and installing the software
  595   will not be necessary.
  596 <p>
  597     <url url="http://sources.redhat.com/cluster/doc/usage.txt" name="http://sources.redhat.com/cluster/doc/usage.txt">
  598 
  599 <sect2>Use
  600 <p>
  601   Once you have things ready, using the GFS is like using any other
  602   filesystem.
  603 <p>
  604   Performance will be greatest when the filesystem operations of the
  605   different nodes do not interfere with one another.  For instance, if
  606   all the nodes try to write to the same place in a directory or file,
  607   much time will be spent in coordinating access (locking).
  608 <p>
  609   An easy way to eliminate a large amount of locking is to use the
  610   "noatime" (no access time update) mount option.  Even in traditional
  611   filesystems the use of 
  612   this option often results in a dramatic performance benefit, because
  613   it eliminates the need to write to the block storage just to record
  614   the time that the file was last accessed.
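
For example, a sketch with a hypothetical device and mount point (for
GFS, the filesystem type and cluster configuration must already be in
place):

<tscreen><verb>
mount -o noatime /dev/etherd/e0.1 /mnt/shared
</verb></tscreen>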
  615 
  616 <sect2>Fencing
  617 <p>
  618   There are several ways to keep a cluster node from accessing shared
  619   storage when that node might have outdated assumptions about the
  620   state of the cluster or the storage.  Preventing the node from
  621   accessing the storage is called "fencing", and it can be
  622   accomplished in several ways.
  623 <p>
  624   One popular way is to simply kill the power to the fenced node by
  625   using a remote power switch.  Another is to use a network switch
  626   that has ports that can be turned on and off remotely.
  627 <p>
  628   When the shared storage resource is a LUN on an SR, it is
  629   possible to manipulate the LUN's mask list in order to accomplish
  630   fencing.  You can read about this technique in the <url
  631   url="/support/linux/contrib/" name="Contributions area">.
  632 
  633 <sect1>Q: How can I make a RAID of more than 27 components?
  634 <p>
  635 A: For Linux Software RAID, the kernel limits the number of disks in
  636 one RAID to 27.  However, you can easily overcome this limitation by
  637 creating another level of RAID.
  638 <p>
  639 For example, to create a RAID 0 of thirty block devices,
  640 you may create three ten-disk RAIDs (md1, md2, and md3) and then
  641 stripe across them (md0 is a stripe over md1, md2, and md3).
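
With mdadm, a sketch of the same layout, assuming each of shelves 5-7
exports LUNs 0 through 9:

<tscreen><verb>
mdadm -C -n 10 --level=0 --auto=md /dev/md1 /dev/etherd/e5.[0-9]
mdadm -C -n 10 --level=0 --auto=md /dev/md2 /dev/etherd/e6.[0-9]
mdadm -C -n 10 --level=0 --auto=md /dev/md3 /dev/etherd/e7.[0-9]
mdadm -C -n 3 --level=0 --auto=md /dev/md0 /dev/md[1-3]
</verb></tscreen>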
  642 <p>
  643 Here is an example raidtools configuration file that implements the
  644 above scenario for shelves 5, 6, and 7: <url
  645 url="raid0-30component.conf" name="multi-level RAID 0 configuration
  646 file">.  Non-trivial raidtab configuration files are easier to
  647 generate from a script than to create by hand.
  648 <p>
  649 EtherDrive storage gives you a lot of freedom, so be creative.
  650 
  651 <sect1>Q: Why do my device nodes disappear after a reboot?<label id="udev">
  652 <p>
  653 A: Some Linux distributions create device nodes dynamically.  The
  654 upcoming method of choice is called "udev".  The aoe driver and udev
  655 work together when the following rules are installed.
  656 <p>
  657 These rules go into a file with a name like <tt>60-aoe.rules</tt>.
  658 Look in your <tt>udev.conf</tt> file (usually
  659 <tt>/etc/udev/udev.conf</tt>) for the line starting with <tt>udev_rules=</tt> to find out where rules go (usually <tt>/etc/udev/rules.d</tt>).
  660 
  661 <tscreen><verb>
  662 # These rules tell udev what device nodes to create for aoe support.
  663 # They may be installed along the following lines.  Check the section
  664 # 8 udev manpage to see whether your udev supports SUBSYSTEM, and 
  665 # whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
  666 
  667 # aoe char devices
  668 SUBSYSTEM=="aoe", KERNEL=="discover",   NAME="etherd/%k", GROUP="disk", MODE="0220"
  669 SUBSYSTEM=="aoe", KERNEL=="err",    NAME="etherd/%k", GROUP="disk", MODE="0440"
  670 SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
  671 SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
  672 SUBSYSTEM=="aoe", KERNEL=="flush",  NAME="etherd/%k", GROUP="disk", MODE="0220"
  673 
  674 # aoe block devices     
  675 KERNEL=="etherd*",       NAME="%k", GROUP="disk"
  676 </verb></tscreen>
  677 
  678 <p>
  679 Unfortunately the syntax for the udev rules file has changed several
  680 times as new versions of udev appear.  You will probably have to
  681 modify the example above for your system, but the existing rules and
  682 the udev documentation should help you.
  683 
  684 <p>
  685 There is an example script in the aoe driver,
  686 <tt>linux/Documentation/aoe/udev-install.sh</tt>, that can install the
  687 rules on most systems.
  688 
  689 <p>
  690 The udev system can only work with the aoe driver if the aoe driver is
  691 loaded.  To avoid confusion, make sure that you load the aoe driver at
  692 boot time.
  693 
  694 <sect1>Q: Why does RAID initialization seem slow?
  695 <p>
  696 A: The 2.6 Linux kernel has a problem with its RAID initialization
  697 rate limiting feature.  You can override this feature and speed up
  698 RAID initialization by using the following commands.  Note that these
  699 commands change kernel memory, so the commands must be re-run after a
  700 reboot.
  701 
  702 <tscreen><verb>
  703 echo 100000 > /proc/sys/dev/raid/speed_limit_max
  704 echo 100000 > /proc/sys/dev/raid/speed_limit_min
  705 </verb></tscreen>
  706 
  707 
  708 <sect1>Q: I can only use shelf zero!  Why won't e1.9 work?
  709 <p>
  710 A: Every block device has a device file, usually in /dev, that has a
  711 major and minor number.  You can see these numbers using ls.  Note the
high minor numbers (1744, 2400, and 2401) in the example below.
  713 
  714 <tscreen><verb>
  715 ecashin@makki ~$ ls -l /dev/etherd/
  716 total 0
  717 brw-------  1 root disk 152, 1744 Mar  1 14:35 e10.9
  718 brw-------  1 root disk 152, 2400 Feb 28 12:21 e15.0
  719 brw-------  1 root disk 152, 2401 Feb 28 12:21 e15.0p1
  720 </verb></tscreen>
  721 
  722 The 2.6 Linux kernel allows high minor device numbers like this, but
  723 until recently, 255 was the highest minor number one could use.  Some
  724 distributions contain userland software that cannot understand the
  725 high minor numbers that 2.6 makes possible.  
  726 
  727 Here's a crude but reliable test that can determine whether your
  728 system is ready to use devices with high minor numbers.  In the
  729 example below, we tried to create a device node with a minor number of
  730 1744, but ls shows it as 208.
  731 
  732 <tscreen><verb>
  733 root@kokone ~# mknod e10.9 b 152 1744
  734 root@kokone ~# ls -l e10.9
  735 brw-r--r--  1 root root 158, 208 Mar  2 15:13 e10.9
  736 </verb></tscreen>
  737 
On systems like this, you can still use the aoe driver to access up to
256 disks if you're willing to live without support for partitions.
  740 Just make sure that the device nodes and the aoe driver are both
  741 created with one partition per device.
  742 
The commands below show how to build a driver without partition
support and then create compatible device nodes for shelf 10.
  745 
  746 <tscreen><verb>
  747 make install AOE_PARTITIONS=1
  748 rm -rf /dev/etherd
  749 env n_partitions=1 aoe-mkshelf /dev/etherd 10
  750 </verb></tscreen>
  751 
  752 As of version 1.9.0, the mdadm command supports large minor device
  753 numbers.  The mdadm versions before 1.9.0 do not.  If you would like
  754 to use versions of mdadm older than 1.9.0, you can configure your
driver and device nodes as outlined above.  Be aware that it's easy to
confuse yourself by creating a driver that doesn't match the device
nodes.
  758 
  759 <sect1>Q: How can I start my AoE storage on boot and shut it down when the system shuts down?<label id="aoeinit">
  760 <p>
  761 A: That is really a question about your own system, so it's a question
  762 you, as the system administrator, are in the best position to answer.
  763 
  764 <p>
  765 In general, though, many Linux distributions follow the same patterns
  766 when it comes to system "init scripts".  Most use a System V style.
  767 
  768 <p>
  769 The example below should help get you started if you have never
  770 created and installed an init script.  Start by reading the comments
  771 at the top.  Make sure you understand how your system works and what
  772 the script does, because every system is different.
  773 
Here is an overview of what happens when the aoe module is loaded
and begins AoE device discovery.  It should help you to
  776 understand the example script below.  Starting up the aoe module on
  777 boot can be tricky if necessary parts of the system are not ready when
  778 you want to use AoE.
  779 
  780 To discover an AoE device, the aoe driver must receive a Query Config
response packet that indicates the device is available.  A Coraid SR
  782 broadcasts this response unsolicited when you run the <tt>online</tt>
  783 SR command, but it is usually sent in response to an AoE initiator
  784 broadcasting a Query Config command to discover devices on the
  785 network.  Once an AoE device has been discovered, the aoe driver sends
  786 an ATA Device Identify command to get information about the disk
  787 drive.  When the disk size is known, the aoe driver will install the
  788 new block device in the system.
  789 
  790 The aoe driver will broadcast this AoE discovery command when loaded,
  791 and then once a minute thereafter.
  792 
  793 The AoE discovery that takes place on loading the aoe driver does not
  794 take long, but it does take some time.  That's why you'll see "sleep"
  795 commands in the example aoe-init script below.  If AoE discovery is
  796 failing, try unloading the aoe module and tuning your init script by
  797 invoking it at the command line.
  798 
  799 You will often find that a delay is necessary after loading your
  800 network drivers (and before loading the aoe driver).  This delay
  801 allows the network interface to initialize and to become usable.  An
  802 additional delay is necessary after loading the aoe driver, so that
  803 AoE discovery has time to take place before any AoE storage is used.
  804 
  805 Without such a delay, the initial AoE Config Query broadcast packet
  806 might never go out onto the AoE network, and then the AoE initiator
  807 will not know about any AoE targets until the next periodic Config
  808 Query broadcast occurs, usually one minute later.
  809 
  810 <tscreen><verb>
  811 #! /bin/sh
  812 # aoe-init - example init script for ATA over Ethernet storage
  813 # 
  814 #   Edit this script for your purposes.  (Changing "eth1" to the
  815 #   appropriate interface name, adding commands, etc.)  You might
  816 #   need to tune the sleep times.
  817 #
  818 #   Install this script in /etc/init.d with the other init scripts.
  819 #
  820 #   Make it executable:
  821 #     chmod 755 /etc/init.d/aoe-init
  822 #
  823 #   Install symlinks for boot time:
  824 #     cd /etc/rc3.d && ln -s ../init.d/aoe-init S99aoe-init
  825 #     cd /etc/rc5.d && ln -s ../init.d/aoe-init S99aoe-init
  826 #
  827 #   Install symlinks for shutdown time:
  828 #     cd /etc/rc0.d && ln -s ../init.d/aoe-init K01aoe-init
  829 #     cd /etc/rc1.d && ln -s ../init.d/aoe-init K01aoe-init
  830 #     cd /etc/rc2.d && ln -s ../init.d/aoe-init K01aoe-init
  831 #     cd /etc/rc6.d && ln -s ../init.d/aoe-init K01aoe-init
  832 #
  833 
  834 case "$1" in
  835     "start")
  836         # load any needed network drivers here
  837 
  838         # replace "eth1" with your aoe network interface
  839             ifconfig eth1 up
  840 
  841         # time for network interface to come up
  842                 sleep 4
  843 
  844                 modprobe aoe
  845 
  846         # time for AoE discovery and udev
  847                 sleep 7
  848 
  849                 # add your raid assemble commands here
  850         # add any LVM commands if needed (e.g. vgchange)
  851                 # add your filesystem mount commands here
  852 
  853         test -d /var/lock/subsys && touch /var/lock/subsys/aoe-init
  854                 ;;
  855         "stop")
  856                 # add your filesystem umount commands here
  857         # deactivate LVM volume groups if needed
  858                 # add your raid stop commands here
  859         rmmod aoe
  860         rm -f /var/lock/subsys/aoe-init
  861         ;;
  862     *)
  863             echo "usage: `basename $0` {start|stop}" 1>&2
  864         ;;
  865 esac
  866 </verb></tscreen>
  867 
  868 <sect1>Q: Why do I get "permission denied" when I'm root?
  869 <p>
  870 A: Some newer systems come with SELinux (Security-Enhanced Linux),
  871 which can limit what the root user can do.
  872 
  873 <p>
  874 SELinux is usually good about creating entries in the system logs when
  875 it prevents root from doing something, so examine your logs for such
  876 messages.
  877 
  878 <p>
  879 Check the SELinux documentation for information on how to configure
  880 or disable SELinux according to your needs.
  881 
  882 <sect1>Q: Why does fdisk ask me for the number of cylinders?<label id="dospart">
  883 
  884 <p>
  885 A: Your fdisk is probably asking the kernel for the size of the disk
  886 with a BLKGETSIZE block device ioctl, which returns the sector
count of the disk as a 32-bit number.  If the size of the disk is too
large to be represented in 32 bits (2 TB is the limit),
the ioctl returns ETOOBIG as an error.  This error indicates that the
  890 program should try the 64-bit ioctl (BLKGETSIZE64), but when fdisk
  891 doesn't do that, it just asks the user to supply the number of
  892 cylinders.
  893 
  894 You can
  895 tell fdisk the number of cylinders yourself.  The number to use
  896 (sectors / (255 * 63)) is printed by the following commands.  Use the
  897 appropriate device instead of "e0.0".
  898 
  899 <tscreen><verb>
  900 sectors=`cat /sys/block/etherd\!e0.0/size`
  901 echo $sectors 255 63 '*' / p | dc
  902 </verb></tscreen>
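
Equivalently, with shell arithmetic (again substituting the
appropriate device name):

<tscreen><verb>
echo $(( $(cat /sys/block/etherd\!e0.0/size) / (255 * 63) ))
</verb></tscreen>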
  903 
  904 But no MSDOS partition table can ever work with more than 2TB.  The
  905 reason is that the numbers in the partition table itself are only 32
  906 bits in size.  That means you can't have a partition larger than 2TB
  907 in size or starting further than 2TB from the beginning of the device.
  908 
  909 Some options for multi-terabyte volumes are:
  910 
  911 <enum>
  912 <item>By doing without partitions, the filesystem can be created
  913 directly on the AoE device itself (e.g., <tt>/dev/etherd/e1.0</tt>),
  914 <item>LVM2, the Logical Volume Manager, is a sophisticated way of
  915 allocating storage to create logical volumes of desired sizes, and
  916 <item>GPT partition tables.
  917 </enum>
  918 
  919 The last item in the list above is a new kind of partition table that
  920 overcomes the limitations of the older MSDOS-style partition table.
  921 Andrew Chernow has related his successful experiences using GPT
  922 partition tables on large AoE devices in <url
  923 url="/support/linux/contrib/chernow/gpt.html"
  924 name="this contributed document">.
  925 
  926 Please note that some versions of the GNU parted tool, such as version
  927 1.8.6, have a bug.  This bug allows the user to create an MSDOS-style
  928 partition table with partitions larger than two terabytes even though
  929 these partitions are too large for an MSDOS partition table.  The
  930 result is that the filesystems on these partitions will only be usable
  931 until the next reboot.
  932 
  933 <sect1>Q: Can I use AoE equipment with Oracle software?
  934 
  935 <p>
  936 A: Oracle used to have a <url
  937 url="http://www.oracle.com/technology/deploy/availability/htdocs/oscp.html"
  938 name="Oracle Storage Compatibility Program">, but simple block-level
  939 storage technologies do not require Oracle validation.  ATA over
  940 Ethernet provides simple, block-level storage.
  941 
Oracle used to have a list of frequently asked questions about
running Oracle on Linux, but they have replaced it with <url
url="http://www.oracle.com/technology/tech/linux/htdocs/oracleonlinux_faq.html"
name="documentation about their own Linux distribution">.  A third-party
  947 site continues to maintain a <url
  948 url="http://www.orafaq.com/faqlinux.htm"
  949 name="FAQ about running Oracle on Linux">.
  950 
  951 <sect1>Q: Why do I have intermittent problems?
  952 
  953 <p>
  954 A: Make sure your network is in good shape.  Having good patch cables,
  955 reliable network switches with good flow control, and good network
  956 cards will keep your network storage happy.
  957 
  958 <sect1>Q: How can I avoid running out of memory when copying large files?
  959 
  960 <p>
  961 A: You can tell the Linux kernel not to wait so long before writing
  962 data out to backing storage.
  963 
  964 <tscreen><verb>
  965 echo 3 > /proc/sys/vm/dirty_ratio 
  966 echo 4 > /proc/sys/vm/dirty_background_ratio 
  967 echo 32768 > /proc/sys/vm/min_free_kbytes
  968 </verb></tscreen>
  969 
When a large MTU, like 9000, is being used on the AoE-side network
  971 interfaces, a larger min_free_kbytes setting could be helpful.  The more
  972 RAM you have, the larger the number you might have to use.
  973 
  974 There are also alternative settings to the above "ratio" settings, available as of kernel version 2.6.29.  They are <tt>dirty_bytes</tt> and <tt>dirty_background_bytes</tt>, and they provide finer control for systems with large amounts of RAM.
  975 
  976 If you find the /proc settings to be helpful, you can make them
  977 permanent by editing /etc/sysctl.conf or by creating an init script
  978 that performs the settings at boot time.
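
For example, lines like these in /etc/sysctl.conf would apply the
settings shown above at each boot:

<tscreen><verb>
vm.dirty_ratio = 3
vm.dirty_background_ratio = 4
vm.min_free_kbytes = 32768
</verb></tscreen>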
  979 
The Documentation/sysctl/vm.txt file has details on the settings
  981 available for your particular kernel, but some guiding principles are...
  982 
  983 <itemize>
  984 <item>Linux will use free RAM to cache the data that is on AoE targets, which is helpful.
  985 <item>Writes to the AoE target go first to RAM, updating the cache.  Those updated parts of the cached data are "dirty" until the changes are written out to the AoE target.  Then they're "clean".
  986 <item>If the system needs RAM for something else, clean parts of the cache can be repurposed immediately.
  987 <item>The RAM that is holding dirty cache data cannot be reclaimed immediately, because it reflects updates to the AoE target that have not yet made it to the AoE target.
  988 <item>Systems with much RAM and doing many writes will accumulate dirty data quickly.
  989 <item>If the processes creating the write workload are forced by the Linux kernel to wait for the dirty data to be flushed out to the backing store (AoE targets), then I/O goes fast but the producers are naturally throttled, and the system stays responsive and stable.
  990 <item>If the dirty data is flushed in "the background", though, then when there's too much dirty data to flush out, the system becomes unresponsive.</item>
  991 <item>Telling Linux to maintain a certain amount of truly free RAM, not used for caching, allows the system to have plenty of RAM for doing the work of flushing out the dirty data.
  992 <item>Telling Linux to push dirty data out sooner keeps the backing store more consistent while it is being used (with regard to the danger of power failures, network failures, and the like).  It also allows the system to quickly reclaim memory used for caching when needed, since the data is clean.
  993 </itemize>
  994 
  995 <sect1>Q: Why doesn't the aoe driver notice that an AoE device has disappeared or changed size?
  996 
  997 <p>
  998 A: Prior to the aoe6-15 driver, aoe drivers only learned an AoE device's
  999 characteristics once, and the only way to use an AoE device that had
 1000 grown or to get rid of "phantom" AoE devices that were no longer
 1001 present was to re-load the aoe module completely.
 1002 
 1003 <tscreen><verb>
 1004 rmmod aoe
 1005 modprobe aoe
 1006 </verb></tscreen>
 1007 
 1008 Since aoe6-15, aoe drivers have supported the aoe-revalidate command.
 1009 See the aoe-revalidate manpage for more information.
 1010 
 1011 <sect1>Q: My NFS client hangs when I export a filesystem on an AoE device.
 1012 
 1013 <p>
 1014 A: If you are exporting a filesystem over NFS, then that filesystem
 1015 resides on a block device.  Every block device has a major and minor
 1016 device number that you can see by running "ls -l".
 1017 
 1018 If the block device has a "high" minor number, over 255, and you're
 1019 trying to export a filesystem on that device, then NFS will have
 1020 trouble using the minor number to identify the filesystem.  You can
 1021 tell the NFS server to use a different number by using the "fsid"
 1022 option in your /etc/exports file.  
 1023 
 1024 The fsid option is documented in the "exports" manpage.  Here's an
 1025 example of how its use might look in /etc/exports.
 1026 
 1027 <tscreen><verb>
 1028 /mnt/alpha 205.185.197.207(rw,sync,no_root_squash,fsid=20)
 1029 </verb></tscreen>
 1030 
 1031 As the manpage says, each filesystem needs its own unique fsid.
 1032 
 1033 <sect1>Q: Why do I see "unknown partition table" errors in my logs?
 1034 
 1035 <p>
 1036 A: Those are probably not errors.  
 1037 Usually this message means that your disk doesn't have a partition
 1038 table.  With AoE devices, that's the common case.
 1039 
 1040 When a new block device is detected
 1041 by the kernel, the kernel tries to read the part of the block device
where a partition table is conventionally stored.
 1043 
 1044 The kernel checks to see whether the data there looks like any kind of
 1045 partition table that it knows about.  It can't tell the difference
 1046 between a disk with a kind of partition table it doesn't know about
 1047 and a disk with no partition table at all.
 1048 
 1049 <sect1>Q: Why do I get better throughput to a file on an AoE device than to the device itself?
 1050 
 1051 <p>
Most of the time a filesystem resides on a block device, so that the
filesystem can be mounted and the storage used by reading and
writing files and directories.
 1055 When you are not using a filesystem at all, you might see somewhat
 1056 degraded performance.  Sometimes this degradation comes as a surprise
 1057 to new AoE users when they first try out an AoE device with the dd
 1058 command, for example, before creating a filesystem on the device.
 1059 
 1060 If the AoE device has an odd
 1061 number of sectors, the block layer of the Linux kernel presents the
 1062 aoe driver with 512-byte I/O jobs.  Each AoE packet winds up with only
 1063 one sector of data, doubling the number of AoE packets when normal
 1064 ethernet frames are in use.
 1065 
 1066 The Linux kernel's block layer gives special treatment to filesystem
 1067 I/O, giving the aoe driver I/O jobs in the filesystem block size, so
 1068 there is no performance penalty to using a filesystem on an AoE device
 1069 that has an odd number of sectors.  Since there isn't a large demand for
 1070 non-filesystem I/O, the complexity associated with coalescing
 1071 multiple I/O jobs in the aoe driver is probably not worth the
 1072 potential driver instability it could introduce.
 1073 
One way to work around this issue is to use the O_DIRECT flag to the
"open" system call.  For recent versions of dd, you can use the
"oflag=direct" option to tell dd to use this O_DIRECT flag.  You should
combine this option with a large block size, such as "bs=4M", in order
to take advantage of the largest possible I/O batch size.
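
For example, a rough write-throughput test might look like the
following sketch.  The device name is illustrative, and the command
destroys any data on that device:

<tscreen><verb>
# WARNING: writes over the contents of e0.3
dd if=/dev/zero of=/dev/etherd/e0.3 oflag=direct bs=4M count=256
</verb></tscreen>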
 1079 
 1080 Another way to work around this issue is to
 1081 use a trivial md device as a wrapper.  (Almost everyone uses a
 1082 filesystem.  This technique is only interesting to those who are not
 1083 using a filesystem, so most people should ignore this idea.)  In the
 1084 example below, a single-disk RAID 0 is created for the AoE device
 1085 e0.3.  Although e0.3 has an odd number of sectors, the md1 device does
 1086 not, and tcpdump confirms that each AoE packet has 1 KiB of data as we
 1087 would like.
 1088 
 1089 <tscreen><verb>
 1090 makki:~# mdadm -C -l 0 -n 1 --auto=md  /dev/md1 /dev/etherd/e0.3
 1091 mdadm: '1' is an unusual number of drives for an array, so it is probably
 1092      a mistake.  If you really mean it you will need to specify --force before
 1093      setting the number of drives.
 1094 makki:~# mdadm -C -l 0 --force -n 1 --auto=md  /dev/md1 /dev/etherd/e0.3
 1095 mdadm: array /dev/md1 started.
 1096 makki:~# cat /sys/block/etherd\!e0.3/size
 1097 209715201
 1098 makki:~# cat /sys/block/md1/size
 1099 209715072
 1100 </verb></tscreen>
 1101 
 1102 <sect1>Q: How can I boot diskless systems from my Coraid EtherDrive devices?
 1103 
 1104 <p>
 1105 Booting from AoE devices is similar to other kinds of network
 1106 booting.  Customers have contributed examples of successful strategies
 1107 in the <url
 1108 url="/support/linux/contrib/"
 1109 name="Contributions Area">
 1110 of the Coraid website.
 1111 
 1112 <url
 1113 url="/support/linux/contrib/index.html#jvboot"
 1114 name="Jayson Vantuyl: Making A Flexible Initial Ramdisk">
 1115 
 1116 <url
 1117 url="/support/linux/contrib/index.html#jmboot"
 1118 name="Jason McMullan: Add root filesystem on AoE support to aoe driver">
 1119 
 1120 <p>
 1121 Keep in mind that if you intend to use AoE devices before udev is
 1122 running, you must use static minor numbers for the device nodes.  An
 1123 aoe6 driver version 50 or above can be instructed to use static minor
 1124 numbers by being loaded with the <tt>aoe_dyndevs=0</tt> module
 1125 parameter.  (Previous aoe drivers only used static minor device
 1126 numbers.)
 1127 
 1128 <sect1>Q: What filesystems do you recommend for very large block devices?
 1129 
 1130 <p>
 1131 The filesystem you choose will depend on how you want to use the
 1132 storage.  Here are some generalizations that may serve as a starting
 1133 point.
 1134 
 1135 There are two major classes of filesystems: cluster filesystems and
 1136 traditional filesystems.  Cluster filesystems are more complex and
 1137 support simultaneous access from multiple independent computers to a
 1138 single filesystem stored on a shared block device.
 1139 
 1140 Traditional filesystems are only mounted by one host at a time.  Some
 1141 traditional filesystems that scale to sizes larger than those
 1142 supported by ext3 include the following journalling filesystems.
 1143 
 1144 <url
 1145 url="http://oss.sgi.com/projects/xfs/"
 1146 name="XFS">, developed at SGI, specializes in high throughput to large files.
 1147 
 1148 <url
 1149 url="http://www.namesys.com/"
 1150 name="Reiserfs">, an often experimental filesystem can perform well
 1151 with many 
 1152 small files.
 1153 
 1154 <url
 1155 url="http://jfs.sourceforge.net/"
 1156 name="JFS">, developed at IBM, is a general purpose filesystem.
 1157 
 1158 <sect1>Q: Why does umount say, "device is busy"?
 1159 
 1160 <p>
 1161 A: That just means you're still using the filesystem on that device.
 1162 
 1163 Unless something has gone very wrong, you should be able to unmount
 1164 after you stop using the filesystem.  Here are a few ways you might be
 1165 using the filesystem without knowing it:
 1166 
 1167 <itemize>
<item> NFS might be exporting it.  Stopping the NFS service will release
    the filesystem.

<item> A process might be holding open a file on the filesystem.  Killing
    the process will release the filesystem.
 1173 
 1174 <item> A process might have some directory on the filesystem as its
 1175     current working directory.  In that case, you can kill the process
 1176     or (if it's a shell) cd to some other directory that's not on the
 1177     fs you're trying to unmount.
 1178 </itemize>
 1179 
 1180 The <tt/lsof/
 1181 command can be helpful in finding processes that are using files.
 1182 
 1183 <sect1>Q: How do I use the multiple network path support in driver versions 33 and up?
 1184 
 1185 <p>
 1186 A: You don't have to do anything to benefit from the aoe driver's
 1187 ability to use multiple network paths to the same AoE target.
 1188 
 1189 The aoe driver will automatically use each end-to-end path in an
 1190 essentially round-robin fashion.  If one network path becomes
 1191 unusable, the aoe driver will attempt to use the remaining network
 1192 paths to reach the AoE target, even retransmitting any lost packets
 1193 through one of the remaining paths.
 1194 
 1195 <sect1>Q: Why does "xfs_check" say "out of memory"?
 1196 
 1197 <p>
 1198 A: The xfstools use a huge amount of virtual memory when operating on
 1199 large filesystems.  The CLN HOWTO has some helpful information about
using temporary swap space when necessary for accommodating the
 1201 xfstools' virtual memory requirements.
 1202 
 1203 <url
 1204 url="/support/cln/CLN-HOWTO/ar01s05.html#id2515012"
 1205 name="CLN HOWTO: Repairing a Filesystem">
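
As a minimal sketch, temporary swap space could be added as shown
below.  (The file path and size are only examples.)

<tscreen><verb>
dd if=/dev/zero of=/tmp/xfsswap bs=1M count=2048
mkswap /tmp/xfsswap
swapon /tmp/xfsswap
# ... run xfs_check or xfs_repair here ...
swapoff /tmp/xfsswap
rm /tmp/xfsswap
</verb></tscreen>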
 1206 
 1207 The 32-bit xfstools are limited in the size of the filesystem they can
 1208 operate on, but 64-bit systems overcome this limitation.  This limit
 1209 is likely to be encountered with 32-bit xfstools for filesystems over
 1210 2 TiB in size.
 1211 
 1212 <sect1>Q: Can virtual machines running on VMware ESX use AoE over jumbo frames?
 1213 
 1214 <p>
 1215 A: It is somewhat difficult to find public information about the
 1216 ESX configuration necessary to use jumbo frames, but there is
 1217 information in the public forum at the URL below.
 1218 
 1219 <url
 1220 url="http://communities.vmware.com/thread/135691"
 1221 name="How to setup TCP/IP Jumbo packet support in VMware ESX 3.5 on W2K3 VMs">
 1222 
 1223 <sect1>Q: Can I use SMART with my AoE devices?
 1224 
 1225 <p>
 1226 A: The early Coraid products like the EtherDrive PATA blades simply
 1227 passed ATA commands through to the attached PATA disk,
including SMART commands.  While there was no way to ask the aoe driver
to send SMART commands, one could ask aoeping to send them.
The aoeping manpage has more information.
 1231 
 1232 <p>
The Coraid SR and VS storage appliances present AoE targets that are
LUNs, not corresponding to any single physical disk.  The SR supports
SMART internally, via its command line, but the AoE LUNs do not
support SMART.
 1236 
 1237 <sect>Jumbo Frames
 1238 
 1239 <p>
Data is transmitted over the ethernet in frames, usually carrying at
most 1500 octets of data per frame.  Receiving or transmitting a frame
of data takes time, and by increasing the amount of data per frame,
data can often be transmitted more efficiently over an ethernet
network.

Frames carrying more than 1500 octets of data are called "jumbo
frames."  There is plenty of information about jumbo frames out there,
so in this section we're going to focus on how jumbo frames relate to
the use of AoE.
 1248 
 1249 When you change the MTU on your Linux host's network interface, the
 1250 interface must essentially reboot itself.  Once this has completed and
 1251 the interface is back up, you should run the <tt>aoe-discover</tt>
 1252 command to
 1253 trigger the reevaluation of the aoe device's jumbo frame capability.
 1254 You should see lines in your log (or in the output of the
 1255 <tt>dmesg</tt> 
command) indicating that the data frame
size has changed.  The example text below appears after setting the
 1258 MTU on eth1 to 4200, enough for 4 KiB of data, plus headers.
 1259 
 1260 <tscreen><verb>
 1261 aoe: e7.0: setting 4096 byte data frames on eth1:003048865ed2
 1262 </verb></tscreen>
 1263 
If you do not see this output, try running <tt>aoe-revalidate</tt> on
the device in question.  If you have a switch in between your SR and
your Linux client that does not have jumbo frames enabled, the aoe
driver will fall back to 1 KiB of data per packet until a forced
revalidation occurs.
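
For example, to force revalidation of the device from the log line
above:

<tscreen><verb>
aoe-revalidate e7.0
</verb></tscreen>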
 1269 
 1270 For larger frames to be used, the whole network path must support
 1271 them.  For example, consider a scenario where you are using ...
 1272 
 1273 <enum>
 1274 <item>a LUN
 1275 from a Coraid SR1521 as your AoE target, 
 1276 <item>a Linux host with an Intel
 1277 gigabit NIC as your AoE initiator, and
 1278 <item>a gigabit switch between the
 1279 target and initiator.
 1280 </enum>
 1281 
 1282 In that case, all three points on the network must be configured to
 1283 handle large frames in order for AoE data to be transmitted in jumbo
 1284 frames.
 1285 
 1286 <sect1>Linux NIC MTU
 1287 <p>
 1288 Check the documentation for your network card's driver to find out how
 1289 to change its maximum transmission unit (MTU).  For example, if you
 1290 have a gigabit Intel NIC, you can read the
 1291 <tt>Documentation/networking/e1000.txt</tt> file in the kernel
 1292 sources to find out that the following command increases the MTU to
 1293 4200.
 1294 
 1295 <tscreen><verb>
 1296 ifconfig ethx mtu 4200 up
 1297 </verb></tscreen>
 1298 
 1299 The real name of your interface (e.g., "eth1") should be used instead
 1300 of "ethx".
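
On systems with the iproute2 tools installed, the command below
should be equivalent (again using "eth1" as an example):

<tscreen><verb>
ip link set eth1 mtu 4200
</verb></tscreen>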
 1301 
 1302 <sect1>Network Switch MTU
 1303 <p>
Usually you have to turn on jumbo frames in a switch that supports
them.  Supporting jumbo frames requires a different buffer allocation
in the switch, one that is not usually sensible for ethernet traffic
with standard frame sizes.  Check the documentation for your switch
for details.
 1308 
 1309 <sect1>SR MTU
 1310 <p>
 1311 No special configuration steps need to be taken on the Coraid
 1312 SATA+RAID unit for it to use jumbo frames if the firmware release is
 1313 20060316 or newer.  
 1314 
 1315 You can see what firmware release your SR is running by issuing the
 1316 "release" command at its command line.
 1317 
 1318 <sect>Appendix A: Archives<label id="archives">
 1319 
 1320 <p>
 1321 This section contains material that is no longer relevant to a
 1322 majority of readers.  It has been placed in this appendix with minimal
 1323 editing.
 1324 
 1325 <sect1>Example: RAID 5 with the raidtools
 1326 <p>
 1327 Let us assume we have five AoE targets that are virtual LUNs numbered
 1328 0 through 4, exported from a Coraid VS appliance that has been
 1329 assigned shelf address 0.  Let us further assume we want to use these
 1330 five LUNs to create a level-5 RAID array.  Using a text editor, we
 1331 create 
 1332 a Software RAID configuration file named "/etc/rt".  The transcript
 1333 below shows its contents.
 1334 
 1335 <tscreen><verb>
 1336 $ cat /etc/rt
 1337 raiddev /dev/md0
 1338         raid-level      5
 1339         nr-raid-disks   5
 1340         chunk-size      32
 1341         persistent-superblock 1
 1342         device          /dev/etherd/e0.0
 1343         raid-disk       0
 1344         device          /dev/etherd/e0.1
 1345         raid-disk       1
 1346         device          /dev/etherd/e0.2
 1347         raid-disk       2
 1348         device          /dev/etherd/e0.3
 1349         raid-disk       3
 1350         device          /dev/etherd/e0.4
 1351         raid-disk       4
 1352 </verb></tscreen>
 1353 
 1354 Here is an example for setting up and using the RAID array described
 1355 by the above configuration file, <tt>/etc/rt</tt>.
 1356 
 1357 <tscreen><verb>
 1358 $ mkraid -c /etc/rt /dev/md0
 1359 DESTROYING the contents of /dev/md0 in 5 seconds, Ctrl-C if unsure!
 1360 handling MD device /dev/md0
 1361 analyzing super-block
 1362 disk 0: /dev/etherd/00:00, 19535040kB, raid superblock at 19534976kB
 1363 disk 1: /dev/etherd/00:01, 19535040kB, raid superblock at 19534976kB
 1364 disk 2: /dev/etherd/00:02, 19535040kB, raid superblock at 19534976kB
 1365 disk 3: /dev/etherd/00:03, 19535040kB, raid superblock at 19534976kB
 1366 disk 4: /dev/etherd/00:04, 19535040kB, raid superblock at 19534976kB
 1367 $
 1368 </verb></tscreen>
 1369 
 1370 To make an ext3 filesystem on the RAID array and mount it, the
 1371 following commands can be issued:
 1372 
 1373 <tscreen><verb>
 1374 $ mkfs.ext3 /dev/md0
 1375 ... (mkfs output)
 1376 $ mount /dev/md0 /mnt/raid
 1377 $
 1378 </verb></tscreen>
 1379 
 1380 The resulting storage is single-fault tolerant.  Add hot spares to
 1381 make the array even more robust (see the Software RAID documentation
for more information).  Remember that it takes the md driver some time
 1383 to initialize a new RAID 5 array.  During that time, you can use the
 1384 device, but performance is sub-optimal until md finishes.  Check
 1385 <tt>/proc/mdstat</tt> for information on the initialization's
 1386 progress.
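
For example, the following command prints the status of all md
arrays, including resynchronization progress:

<tscreen><verb>
cat /proc/mdstat
</verb></tscreen>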
 1387 
 1388 <sect1>Example: RAID 10 with mdadm
 1389 
 1390 <p>
 1391 Today, the Linux kernel supports a raid10 personality, and you can
 1392 create a RAID 10 with one mdadm command.  Things used to be more
 1393 complicated.  The section below shows the steps that used to be
 1394 necessary to create a RAID 10 by first creating several RAID 1 mirrors
 1395 that could serve as components for the larger RAID 0.
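
As a sketch of the modern approach, a single command can create an
equivalent RAID 10.  (The device names and the number of disks are
only examples, and the kernel must provide the raid10 personality.)

<tscreen><verb>
mdadm -C /dev/md0 -l 10 -n 4 /dev/etherd/e1.[0-3]
</verb></tscreen>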
 1396 
 1397 <p>
 1398 RAID 10 is striping over mirrors.  That is, a RAID 0 is created to
 1399 stripe data over several RAID 1 devices.  Each RAID 1 is a mirrored
 1400 pair of disks.  For a given (even) number of disks, a RAID 10 has less
 1401 capacity and throughput than a RAID 5.  Nevertheless, storage experts
often prefer RAID 10 for its superior resiliency to failure,
 1403 its low re-initialization time, and its low computational overhead.
 1404 
 1405 The first example shows how to create a RAID 10 and a hot spare from
nine AoE targets that share shelf address 1.  After checking the
 1407 mdadm manpage, it should be easy for you to create startup and
 1408 shutdown scripts.
 1409 
 1410 <tscreen><verb>
 1411 # make-raid10.sh
 1412 # create a RAID 10 from shelf 1 to be used with mdadm-aoe.conf
 1413 
set -xe     # shell flags: be verbose, exit on errors
 1415 shelf=1
 1416 
 1417 # create the mirrors
 1418 mdadm -C /dev/md1 -l 1 -n 2 /dev/etherd/e$shelf.0 /dev/etherd/e$shelf.1
 1419 mdadm -C /dev/md2 -l 1 -n 2 /dev/etherd/e$shelf.2 /dev/etherd/e$shelf.3
 1420 mdadm -C /dev/md3 -l 1 -n 2 /dev/etherd/e$shelf.4 /dev/etherd/e$shelf.5
mdadm -C /dev/md4 -l 1 -n 2 -x 1 /dev/etherd/e$shelf.6 /dev/etherd/e$shelf.7 \
 1422     /dev/etherd/e$shelf.8
 1423 sleep 1
 1424 # create the stripe over the mirrors
 1425 mdadm -C /dev/md0 -l 0 -n 4 /dev/md1 /dev/md2 /dev/md3 /dev/md4
 1426 </verb></tscreen>
 1427 
 1428 Notice that the <tt>make-raid10.sh</tt> script above sets up
 1429 <tt>md4</tt> with the hot spare drive.  What if one of the drives in
 1430 <tt>md1</tt> fails?  The "spare group" mdadm feature allows an mdadm
 1431 process running in monitor mode to dynamically allocate hot spares as
 1432 needed, so that the single hot spare can replace a faulty disk in any
 1433 RAID 1 of the four.
 1434 
 1435 The configuration file below tells the mdadm monitor process that it
 1436 can use the hot spare to replace any drive in the RAID 10.
 1437 
 1438 <tscreen><verb>
 1439 # mdadm-aoe.conf
 1440 # see mdadm.conf manpage for syntax and info
 1441 #
 1442 # There's a "spare group" called e1, after the shelf
 1443 # with address 1, so that mdadm can use hot spares for
 1444 # any RAID 1 in the RAID 10 on shelf 1.
 1445 # 
 1446 
 1447 DEVICE /dev/etherd/e1.[0-9]
 1448 
 1449 ARRAY /dev/md1
 1450   devices=/dev/etherd/e1.0,/dev/etherd/e1.1
 1451   spare-group=e1
 1452 ARRAY /dev/md2
 1453   devices=/dev/etherd/e1.2,/dev/etherd/e1.3
 1454   spare-group=e1
 1455 ARRAY /dev/md3
 1456   devices=/dev/etherd/e1.4,/dev/etherd/e1.5
 1457   spare-group=e1
 1458 ARRAY /dev/md4
 1459   devices=/dev/etherd/e1.6,/dev/etherd/e1.7,/dev/etherd/e1.8
 1460   spare-group=e1
 1461 
 1462 ARRAY /dev/md0
 1463   devices=/dev/md1,/dev/md2,/dev/md3,/dev/md4
 1464 
 1465 MAILADDR root
 1466 
 1467 # This is normally a program that handles events instead
 1468 # of just /bin/echo.  If you run the mdadm monitor in the
# foreground, though, using echo allows you to see what events
 1470 # are occurring.
 1471 #
 1472 PROGRAM /bin/echo
 1473 </verb></tscreen>
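
To use this configuration, the monitor might be started as shown
below.  (The configuration file path is an example; the
<tt>--daemonise</tt> flag makes mdadm run in the background.)

<tscreen><verb>
mdadm --monitor -c /etc/mdadm-aoe.conf --daemonise
</verb></tscreen>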
 1474 
 1475 <sect1>Important notes
 1476 <p>
 1477 
 1478 <enum>
<item>You may note above that the example creates the RAID device
configuration file as <tt>/etc/rt</tt> rather than the conventional
<tt>/etc/raidtab</tt>.  The kernel uses the existence of
<tt>/etc/raidtab</tt> to trigger starting the RAID device on boot
before any other initializations are performed.  This is done to
permit users to put their root filesystem on a Software RAID device.
Unfortunately, because the kernel has not yet initialized the network,
it is unable to access the EtherDrive storage at this point, and the
kernel hangs.  The workaround is to place EtherDrive-based RAID
configurations in another file such as <tt>/etc/rt</tt> and to add
commands like the following to an rc.local file for startup on boot:
 1491 
 1492 <tscreen><verb>
 1493 raidstart -c /etc/rt /dev/md0
 1494 mount /dev/md0 /mnt/raid
 1495 </verb></tscreen>
 1496 
 1497 </enum>
 1498 
 1499 <sect1>Old FAQ List
 1500 
 1501 <p>
 1502 These questions are no longer frequently asked, probably because they
 1503 relate to software that is no longer widely used.
 1504 
 1505 <sect2>Q: When I "modprobe aoe", it takes a long time. The system seems to hang.  What could be the problem?
<p>
A: When the hotplug service was first making its way into Linux
distributions, it could slow things down and cause problems when the
aoe module loaded.  On such systems, it may be easiest to disable the
hotplug service.  Usually the right commands look like this:
 1513 
 1514 <tscreen><verb>
 1515 chkconfig hotplug off
 1516 /etc/init.d/hotplug stop
 1517 </verb></tscreen>
 1518 
 1519 More recent distributions may need hotplug working in conjunction with
 1520 udev.  See the udev question in this FAQ for more information.
 1521 
 1522 
 1523 </article>