"Fossies" - the Fresh Open Source Software Archive

Member "aoe-aoe6-86/EtherDrive-2.6-HOWTO.txt" (4 Jul 2015, 67457 Bytes) of archive /linux/misc/aoe-aoe6-86.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) plain text source code syntax highlighting (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

    1   EtherDrive(R) storage and Linux 2.6
    2   Sam Hopkins and Ed L. Cashin {sah,ecashin}@coraid.com
    3   April 2008
    4 
    5   Using network data storage with ATA over Ethernet
    6   <http://www.coraid.com/documents/AoEr10.txt> is easy after understand-
    7   ing a few simple concepts.  This document explains how to use AoE tar-
    8   gets from a Linux-based Operating System, but the basic principles are
    9   applicable to other systems that use AoE devices. Below we begin by
   10   explaining the key components of the network communication method, ATA
   11   over Ethernet (AoE). Next, we discuss the way a Linux host uses AoE
   12   devices, providing several examples.  A list of frequently asked
   13   questions follows, and the document ends with supplementary informa-
   14   tion.
   15   ______________________________________________________________________
   16 
   17   Table of Contents
   18 
   19 
   20 
   21   1. The EtherDrive System
   22   2. How Linux Uses The EtherDrive System
   23   3. The ATA over Ethernet Tools
   24      3.1 Limiting AoE traffic to certain network interfaces
   25 
   26   4. EtherDrive storage and Linux Software RAID
   27      4.1 Example: RAID 5 with mdadm
   28      4.2 Important notes
   29 
   30   5. FAQ (contains important info)
   31      5.1 Q: How does the system know about the AoE targets on the
   32          network?
   33      5.2 Q: How do I see what AoE devices the system knows about?
   34      5.3 Q: What is the "closewait" state?
   35      5.4 Q: How does the system know an AoE device has failed?
   36      5.5 Q: How do I take an AoE device out of the failed state?
   37      5.6 Q: How can I use LVM with my EtherDrive storage?
   38      5.7 Q: I get an "invalid module format" error on modprobe.
   39          Why?
   40      5.8 Q: Can I allow multiple Linux hosts to use a filesystem that is
   41          on my EtherDrive storage?
   42      5.9 Q: Can you give me an overview of GFS and related software?
   43         5.9.1 Background
   44         5.9.2 Hardware
   45         5.9.3 Software
   46         5.9.4 Use
   47         5.9.5 Fencing
   48      5.10 Q: How can I make a RAID of more than 27 components?
   49      5.11 Q: Why do my device nodes disappear after a reboot?
   50      5.12 Q: Why does RAID initialization seem slow?
   51      5.13 Q: I can only use shelf zero! Why won't e1.9 work?
   52      5.14 Q: How can I start my AoE storage on boot and shut it down when
   53           the system shuts down?
   54      5.15 Q: Why do I get "permission denied" when I'm root?
   55      5.16 Q: Why does fdisk ask me for the number of cylinders?
   56      5.17 Q: Can I use AoE equipment with Oracle software?
   57      5.18 Q: Why do I have intermittent problems?
   58      5.19 Q: How can I avoid running out of memory when copying large
   59           files?
   60      5.20 Q: Why doesn't the aoe driver notice that an AoE device has
   61           disappeared or changed size?
   62      5.21 Q: My NFS client hangs when I export a filesystem on an AoE
   63           device.
   64      5.22 Q: Why do I see "unknown partition table" errors in my
   65           logs?
   66      5.23 Q: Why do I get better throughput to a file on an AoE device
   67           than to the device itself?
   68      5.24 Q: How can I boot diskless systems from my Coraid EtherDrive
   69           devices?
   70      5.25 Q: What filesystems do you recommend for very large block
   71           devices?
   72      5.26 Q: Why does umount say, "device is busy"?
   73      5.27 Q: How do I use the multiple network path support in driver
   74           versions 33 and up?
   75      5.28 Q: Why does "xfs_check" say "out of memory"?
   76      5.29 Q: Can virtual machines running on VMware ESX use AoE over
   77           jumbo frames?
   78      5.30 Q: Can I use SMART with my AoE devices?
   79 
   80   6. Jumbo Frames
   81      6.1 Linux NIC MTU
   82      6.2 Network Switch MTU
   83      6.3 SR MTU
   84 
   85   7. Appendix A: Archives
   86      7.1 Example: RAID 5 with the raidtools
   87      7.2 Example: RAID 10 with mdadm
   88      7.3 Important notes
   89      7.4 Old FAQ List
   90         7.4.1 Q: When I "modprobe aoe", it takes a long time. The
   91               system seems to hang. What could be the problem?
   92 
   93 
   94   ______________________________________________________________________
   95 
   96   1.  The EtherDrive System
   97 
   98   The ATA over Ethernet network protocol allows any type of data storage
   99   to be used over a local ethernet network. An "AoE target" receives ATA
  100   read and write commands, executes them, and returns responses to the
  101   "AoE initiator" that is using the storage.
  102 
  103   These AoE commands and responses appear on the network as ethernet
  104   frames with type 0x88a2, the IANA registered Ethernet type for ATA
  105   over Ethernet (AoE) <http://www.coraid.com/documents/AoEr10.txt>. An
  106   AoE target is identified by a pair of numbers: the shelf address, and
  107   the slot address.
  108 
  109   For example, the Coraid SR appliance can perform RAID internally on
  110   its SATA disks, making the resulting storage capacity available on the
  111   ethernet network as one or more AoE targets. All of the targets will
  112   have the same shelf address because they are all exported by the same
  113   SR. They will have different AoE slot addresses, so that each AoE
  114   target is individually addressable. The SR documentation calls each
  115   target a "LUN". Each LUN behaves like a network disk.
  116 
  117   Using EtherDrive technology like the SR appliance is as simple as
  118   sending and receiving AoE packets.
  119 
  120   To a Linux-based system running the "aoe" driver, it doesn't matter
  121   what the remote AoE device really is. All that matters is that the AoE
  122   protocol can be used to communicate with a device identified by a
  123   certain shelf and slot address.
  124 
  125   2.  How Linux Uses The EtherDrive System
  126 
  127   For security and performance reasons, many people use a second,
  128   dedicated network interface card (NIC) for ATA over Ethernet traffic.
  129 
  130   A NIC must be up before it can perform any networking, including AoE.
  131   On examining the output of the ifconfig command, you should see your
  132   AoE NIC listed as "UP" before attempting to use an AoE device
  133   reachable via that NIC.
  134 
  135   You can activate the NIC with a simple ifconfig eth1 up, using the
  136   appropriate device name instead of "eth1". Note that assigning an IP
  137   address is not necessary if the NIC is being used only for AoE
  138   traffic, but having an IP address on a NIC used for AoE will not
  139   interfere with AoE.
  140 
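        As a quick check, you can bring the interface up and confirm that
        it is flagged UP (a sketch; substitute your own interface name for
        eth1):

             ifconfig eth1 up
             ifconfig eth1 | grep UP
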
  141   On a Linux system, block devices are used via special files called
  142   device nodes. A familiar example is /dev/hda. When a block device node
  143   is opened and used, the kernel translates operations on the file into
  144   operations on the corresponding EtherDrive hardware.
  145 
  146   Each accessible AoE target on your network is represented by a disk
  147   device node in the /dev/etherd/ directory and can be used just like
  148   any other direct attached disk. The "aoe" device driver is an open-
  149   source loadable kernel module authored by Coraid. It translates system
  150   reads/writes on a device into AoE request frames for the associated
  151   remote EtherDrive storage device, retransmitting requests if needed.
  152   When the AoE responses from the device are received, the appropriate
  153   system read/write call is acknowledged as complete. The aoe device
  154   driver handles retransmissions in the event of network congestion.
  155 
  156   The association of AoE targets on your network to device nodes in
  157   /dev/etherd/ follows a simple naming scheme. Each device node is named
  158   eX.Y, where X represents a shelf address and Y represents a slot
  159   address. Both X and Y are decimal integers. As an example, the
  160   following command displays the first 4 KiB of data from the AoE target
  161   with shelf address 0 and slot address 1.
  162 
  163 
  164 
  165        dd if=/dev/etherd/e0.1 bs=1024 count=4 | hexdump -C
  166 
  167 
  168 
  169   Creating an ext3 filesystem on the same AoE target is as simple as ...
  170 
  171 
  172 
  173        mkfs.ext3 /dev/etherd/e0.1
  174 
  175 
  176 
  177   Notice that the filesystem goes directly on the block device. There's
  178   no need for any intermediate "format" or partitioning step.
  179 
  180   Although partitions are not usually needed, they may be created using
  181   a tool like fdisk or GNU parted.  Please see the ``FAQ entry about
  182   partition tables'' for important caveats.
  183 
  184   Partitions are used by adding "p" and the partition number to the
  185   device name. For example, /dev/etherd/e0.3p1 is the first partition on
  186   the AoE target with shelf address zero and slot address three.
  187 
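        For instance, a partition might be created and used along these
        lines (a sketch; the device name and filesystem type are only for
        illustration):

             fdisk /dev/etherd/e0.3          # create partition 1 interactively
             mkfs.ext3 /dev/etherd/e0.3p1    # put a filesystem on partition 1
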
  188   After creating a filesystem, it can be mounted in the normal way. It
  189   is important to remember to unmount the filesystem before shutting
  190   down your network devices. Without networking, there is no way to
  191   unmount a filesystem that resides on a disk across the network.
  192 
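        A typical mount and unmount cycle might look like the sketch below,
        with /mnt/aoe standing in for a mount point of your choice:

             mkdir /mnt/aoe
             mount /dev/etherd/e0.1 /mnt/aoe
             # ... use the filesystem, and before stopping the network:
             umount /mnt/aoe
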
  193   It is best to update your init scripts so that filesystems on
  194   EtherDrive storage are unmounted early in the system-shutdown
  195   procedure, before network interfaces are shut down.  ``An example'' is
  196   found below in the ``list of Frequently Asked Questions''.
  197 
  198   The device nodes in /dev/etherd/ are usually created in one of three
  199   ways:
  200 
  201 
  202   1. Most distributions today use udev to dynamically create device
  203      nodes as needed. You can configure udev to create the device nodes
  204      for your AoE disks. (For an example of udev configuration rules,
  205      see ``Why do my device nodes disappear after a reboot?'' in the
  206      ``FAQ section'' below.)
  207 
  208   2. If you are using the standalone aoe driver, as opposed to the one
  209      distributed with the Linux kernel, and you are not using udev, the
  210      Makefile will create device nodes for you when you do a "make
  211      install".
  212 
  213   3. If you are not using udev you can use static device nodes. Use the
  214      aoe_dyndevs=0 module load option for the aoe driver.  (You do not
  215      need this option if your aoe driver is older than version aoe6-50.)
  216      Then the aoe-mkdevs and aoe-mkshelf scripts in the aoetools
  217      <http://aoetools.sourceforge.net/> package can be used to create
  218      the static device nodes manually. It is very important to avoid
  219      using these static device nodes with an aoe driver that has the
  220      aoe_dyndevs module parameter set to 1, because you could
  221      accidentally use the wrong device.  (See the sketch below.)
  222 
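        Here is a minimal sketch of the static-node approach described in
        item 3 above (shelf address 7 is only an example):

             modprobe aoe aoe_dyndevs=0
             aoe-mkshelf /dev/etherd 7
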
  223   3.  The ATA over Ethernet Tools
  224 
  225   The aoe kernel driver allows Linux to do ATA over Ethernet. In
  226   addition to the aoe driver, there is a collection of helpful programs
  227   that operate outside of the kernel, in "user space". This collection
  228   of tools and documentation is called the aoetools, and may be found at
  229   http://aoetools.sourceforge.net/ <http://aoetools.sourceforge.net/>.
  230 
  231   Current aoe drivers from the Coraid website are bundled with a
  232   compatible version of the aoetools. This HOWTO may make reference to
  233   commands from the aoetools, like the aoe-stat command.
  234 
  235   3.1.  Limiting AoE traffic to certain network interfaces
  236 
  237   By default, the aoe driver will use any local network interface
  238   available to reach an AoE target. Most of the time, though, the
  239   administrator expects legitimate AoE targets to appear only on certain
  240   ethernet interfaces, e.g., "eth1" and "eth2".
  241 
  242   Using the aoe-interfaces command from the aoetools package allows the
  243   administrator to limit AoE activity to a set list of ethernet
  244   interfaces.
  245 
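        For example, to restrict AoE activity to two interfaces (a sketch;
        use your own interface names):

             aoe-interfaces eth2 eth3
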
  246   This configuration is especially important when some ethernet
  247   interfaces are on networks where an unexpected AoE target with the
  248   same shelf and slot address as a production AoE target might appear.
  249 
  250   Please see the aoe-interfaces manpage for more information.
  251 
  252   At module load time the list of allowable interfaces may be set with
  253   the "aoe_iflist" module parameter.
  254 
  255 
  256 
  257        modprobe aoe 'aoe_iflist=eth2 eth3'
  258 
  259 
  260 
  261   4.  EtherDrive storage and Linux Software RAID
  262 
  263   Some AoE devices are internally redundant. A Coraid SR1521, for
  264   example, might be exporting a 14-disk RAID 5 as a single 9.75 terabyte
  265   LUN.  In that case, the AoE target itself is performing RAID,
  266   enhancing performance and reliability.
  267 
  268   You can also perform RAID on the AoE initiator. Linux Software RAID
  269   can increase performance by striping over multiple AoE targets and
  270   reliability by using data redundancy. Reading the Linux Software RAID
  271   HOWTO <http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html> before you
  272   start to work with RAID will likely save time in the long run. The
  273   Linux kernel has an "md" driver that performs the Software RAID, and
  274   there are several tool sets that allow you to use this kernel feature.
  275 
  276   The main software package for using the md driver is mdadm
  277   <http://www.cse.unsw.edu.au/~neilb/source/mdadm/>.  Less popular
  278   alternatives include the older raidtools package ``(discussed in the
  279   Archives below)'', and EVMS <http://evms.sourceforge.net/>.
  280 
  281 
  282   4.1.  Example: RAID 5 with mdadm
  283 
  284   In this example we have five AoE targets in shelves 0-4, with each
  285   shelf exporting a single LUN 0. The following mdadm command uses these
  286   five AoE devices as RAID components, creating a level-5 RAID array.
  287   The md configuration information is stored on the components
  288   themselves in "md superblocks", which can be examined with another
  289   mdadm command.
  290 
  291 
  292 
  293        # mdadm -C -n 5 --level=raid5 --auto=md /dev/md0 /dev/etherd/e[0-4].0
  294        mdadm: array /dev/md0 started.
  295        # mdadm --examine /dev/etherd/e0.0
  296        /dev/etherd/e0.0:
  297                  Magic : a92b4efc
  298                Version : 00.90.00
  299                   UUID : 46079e2f:a285bc60:743438c8:144532aa (local to host ellijay)
  300        ...
  301 
  302 
  303 
  304   The /proc/mdstat file contains summary information about the RAID as
  305   reported by the kernel itself.
  306 
  307 
  308 
  309        # cat /proc/mdstat
  310        Personalities : [raid5] [raid4]
  311        md0 : active raid5 etherd/e4.0[5] etherd/e3.0[3] etherd/e2.0[2] etherd/e1.0[1] etherd/e0.0[0]
  312              5860638208 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
  313              [>....................]  recovery =  0.0% (150272/1465159552) finish=23605.3min speed=1032K/sec
  314 
  315        unused devices: <none>
  316 
  317 
  318 
  319   Until md finishes initializing the parity of the RAID, performance is
  320   sub-optimal, and the RAID will not be usable if one of the components
  321   fails during initialization. After initialization is complete, the md
  322   device can continue to be used even if one component fails.
  323 
  324   Later the array can be stopped in order to shut it down cleanly in
  325   preparation for a system reboot or halt.
  326 
  327 
  328 
  329        # mdadm -S /dev/md0
  330 
  331 
  332 
  333   In a system init script (see ``the aoe-init example in the FAQ'') an
  334   mdadm command can assemble the RAID components using the configuration
  335   information that was stored on them when the RAID was created.
  336 
  337 
  338 
  339        # mdadm -A /dev/md0 /dev/etherd/e[0-4].0
  340        mdadm: /dev/md0 has been started with 5 drives.
  341 
  342 
  343 
  344   To make an xfs filesystem on the RAID array and mount it, the
  345   following commands can be issued:
  346 
  347 
  348 
  349        # mkfs -t xfs /dev/md0
  350        # mkdir /mnt/raid
  351        # mount /dev/md0 /mnt/raid
  352 
  353 
  354 
  355   Once md has finished initializing the RAID, the storage is single-
  356   fault tolerant: Any of the components can fail without making the
  357   storage unavailable. Once a single component has failed, the md device
  358   is said to be in a "degraded" state. Using a degraded array is fine,
  359   but a degraded array cannot remain usable if another component fails.
  360 
  361   Adding hot spares makes the array even more robust. Having hot spares
  362   allows md to bring a new component into the RAID as soon as one of its
  363   components has failed so that the normal state may be achieved as
  364   quickly as possible. You can check /proc/mdstat for information on the
  365   initialization's progress.
  366 
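        For example, a hot spare might be added to the running array like
        this (e5.0 is a hypothetical spare device):

             mdadm --add /dev/md0 /dev/etherd/e5.0
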
  367   The new write-intent bitmap feature can dramatically reduce the time
  368   needed for re-initialization after a component fails and is later
  369   added back to the array. Reducing the time the RAID spends in degraded
  370   mode makes a double fault less likely. Please see the mdadm manpages
  371   for details.
  372 
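        As a sketch, a reasonably recent mdadm can add an internal
        write-intent bitmap to an existing array:

             mdadm --grow --bitmap=internal /dev/md0
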
  373   4.2.  Important notes
  374 
  375 
  376   1. Some Linux distributions come with an mdmonitor service running by
  377      default. Unless you configure the mdmonitor to do what you want,
  378      consider turning off this service with chkconfig mdmonitor off and
  379      /etc/init.d/mdmonitor stop or your system's equivalent commands. If
  380      mdadm is running in its "monitor" mode without being properly
  381      configured, it may interfere with failover to hot spares, the
  382      stopping of the RAID, and other actions.
  383 
  384   2. There is a problem with the way some 2.6 kernels determine whether
  385      an I/O device is idle. On these kernels, RAID initialization is
  386      about five times slower than it needs to be.
  387 
  388      On these kernels you can do the following to work around the
  389      problem:
  390 
  391 
  392 
  393        echo 100000 > /proc/sys/dev/raid/speed_limit_max
  394        echo 100000 > /proc/sys/dev/raid/speed_limit_min
  395 
  396 
  397 
  398   5.  FAQ (contains important info)
  399 
  400   5.1.  Q: How does the system know about the AoE targets on the
  401   network?
  402 
  403   A: When an AoE target comes online, it emits a broadcast frame
  404   indicating its presence. In addition to this mechanism, the AoE
  405   initiator may send out a query frame to discover any new AoE targets.
  406 
  407   The Linux aoe driver, for example, sends an AoE query once per minute.
  408   The discovery can be triggered manually with the "aoe-discover" tool,
  409   one of the aoetools <http://aoetools.sourceforge.net/>.
  410 
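        A minimal sketch of triggering discovery manually and then listing
        the devices that were found:

             aoe-discover
             aoe-stat
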
  411   5.2.  Q: How do I see what AoE devices the system knows about?
  412 
  413   A: The /usr/sbin/aoe-stat program (from the aoetools
  414   <http://aoetools.sourceforge.net/>) lists the devices the system
  415   considers valid. It also displays the status of the device (up or
  416   down). For example:
  417 
  418 
  419 
  420        root@makki root# aoe-stat
  421              e0.0     10995.116GB   eth0 up
  422              e0.1     10995.116GB   eth0 up
  423              e0.2     10995.116GB   eth0 up
  424              e1.0      1152.874GB   eth0 up
  425              e7.0       370.566GB   eth0 up
  426 
  427 
  428 
  429   5.3.  Q: What is the "closewait" state?
  430 
  431   A: The "down,closewait" status means that the device went down but at
  432   least one process still has it open. After all processes close the
  433   device, it will become "up" again if the remote AoE device is
  434   available and ready.
  435 
  436   The user can also use the "aoe-revalidate" command to manually cause
  437   the aoe driver to query the AoE device. If the AoE device is available
  438   and ready, the device state on the Linux host will change from
  439   "down,closewait" to "up".
  440 
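        For example (a sketch; e0.1 stands in for your device):

             aoe-revalidate e0.1
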
  441   5.4.  Q: How does the system know an AoE device has failed?
  442 
  443   A: When an AoE target cannot complete a requested command it will
  444   indicate so in the response to the failed request.  The Linux aoe
  445   driver will mark the AoE device as failed upon reception of such a
  446   response. In addition, if an AoE target has not responded to a prior
  447   request within a default timeout (currently three minutes) the aoe
  448   driver will fail the device.
  449 
  450   5.5.  Q: How do I take an AoE device out of the failed state?
  451 
  452   A: If the aoe driver shows the device state to be "down", first check
  453   the EtherDrive storage itself and the AoE network. Once any problem
  454   has been rectified, you can use the "aoe-revalidate" command from the
  455   aoetools <http://aoetools.sourceforge.net/> to ask the aoe driver to
  456   change the state back to "up".
  457 
  458   If the Linux Software RAID driver has marked the device as "failed"
  459   (so that an "F" shows up in the output of "cat /proc/mdstat"), then
  460   you first need to remove the device from the RAID using mdadm. Next
  461   you add the device back to the array with mdadm.
  462 
  463   An example follows, showing how (after manually failing e10.0) the
  464   device is removed from the array and then added back. After adding it
  465   back to the RAID, the md driver begins rebuilding the redundancy of
  466   the array.
  467 
  468 
  469 
  470   root@kokone ~# cat /proc/mdstat
  471   Personalities : [raid1] [raid5]
  472   md0 : active raid1 etherd/e10.1[1] etherd/e10.0[0]
  473         524224 blocks [2/2] [UU]
  474 
  475   unused devices: <none>
  476   root@kokone ~# mdadm --fail /dev/md0 /dev/etherd/e10.0
  477   mdadm: set /dev/etherd/e10.0 faulty in /dev/md0
  478   root@kokone ~# cat /proc/mdstat
  479   Personalities : [raid1] [raid5]
  480   md0 : active raid1 etherd/e10.1[1] etherd/e10.0[2](F)
  481         524224 blocks [2/1] [_U]
  482 
  483   unused devices: <none>
  484   root@kokone ~# mdadm --remove /dev/md0 /dev/etherd/e10.0
  485   mdadm: hot removed /dev/etherd/e10.0
  486   root@kokone ~# mdadm --add /dev/md0 /dev/etherd/e10.0
  487   mdadm: hot added /dev/etherd/e10.0
  488   root@kokone ~# cat /proc/mdstat
  489   Personalities : [raid1] [raid5]
  490   md0 : active raid1 etherd/e10.0[2] etherd/e10.1[1]
  491         524224 blocks [2/1] [_U]
  492         [=>...................]  recovery =  5.0% (26944/524224) finish=0.6min speed=13472K/sec
  493   unused devices: <none>
  494   root@kokone ~#
  495 
  496 
  497 
  498   5.6.  Q: How can I use LVM with my EtherDrive storage?
  499 
  500   A: With older LVM2 <http://sources.redhat.com/lvm2/> releases, you may
  501   need to edit lvm.conf, but the current version of LVM2 supports AoE
  502   devices "out of the box".
  503 
  504   You can also create md devices from your aoe devices and tell LVM to
  505   use the md devices.
  506 
  507   It's necessary to understand LVM itself in order to use AoE devices
  508   with LVM. Besides the manpages for the LVM commands, the LVM HOWTO
  509   <http://tldp.org/HOWTO/LVM-HOWTO/> is a big help if you are just
  510   starting out with LVM.
  511 
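        As a starting point, here is a sketch of putting one AoE device
        under LVM (the names vg0 and lv0 and the size are hypothetical):

             pvcreate /dev/etherd/e0.1
             vgcreate vg0 /dev/etherd/e0.1
             lvcreate -L 100G -n lv0 vg0
             mkfs -t xfs /dev/vg0/lv0
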
  512   If you have an old LVM2 that does not already detect and work with AoE
  513   devices, you can add this line to the "devices" block of your
  514   lvm.conf.
  515 
  516 
  517 
  518        types = [ "aoe", 16 ]
  519 
  520 
  521 
  522   If you are creating physical volumes out of RAIDs over EtherDrive
  523   storage, make sure to turn on md component detection so that LVM2
  524   doesn't go snooping around on the underlying EtherDrive disks.
  525 
  526 
  527 
  528        md_component_detection = 1
  529 
  530 
  531 
  532   The snapshots feature in LVM2 did not work in early 2.6 kernels.
  533   Lately, Coraid customers have reported success using snapshots on AoE-
  534   backed logical volumes when using a recent kernel and aoe driver.
  535   Older aoe drivers, like version 22, may need a fix
  536   <https://bugzilla.redhat.com/attachment.cgi?id=311070> to work
  537   correctly with snapshots.
  538 
  539   Customers have reported data corruption and kernel panics when using
  540   striped logical volumes (created with the "-i" option to lvcreate)
  541   when using aoe driver versions prior to aoe6-48. No such problems
  542   occur with normal logical volumes or with Software RAID's striping
  543   (RAID 0).
  544 
  545   Most systems have boot scripts that try to detect LVM physical volumes
  546   early in the boot process, before AoE devices are available. In
  547   playing with LVM, you may need to help LVM to recognize AoE devices
  548   that are physical volumes by running vgscan after loading the aoe
  549   module.
  550 
  551   There have been reports that partitions can interfere with LVM's
  552   ability to use an AoE device as a physical volume. For example, with
  553   partitions e0.1p1 and e0.1p2 residing on e0.1, pvcreate
  554   /dev/etherd/e0.1 might complain,
  555 
  556 
  557 
  558        Device /dev/etherd/e0.1 not found.
  559 
  560 
  561 
  562   Removing the partitions allows LVM to create a physical volume from
  563   e0.1.
  564 
  565   5.7.  Q: I get an "invalid module format" error on modprobe. Why?
  566 
  567   A: The aoe module and the kernel must be built to match one another.
  568   On module load, the kernel version, SMP support (yes or no), the
  569   compiler version, and the target processor must be the same for the
  570   module as they were when the kernel was built.
  571 
  572   5.8.  Q: Can I allow multiple Linux hosts to use a filesystem that is
  573   on my EtherDrive storage?
  574 
  575   A: Yes, but you're now taking advantage of the flexibility of
  576   EtherDrive storage, using it like a SAN. Your software must be
  577   "cluster aware", like GFS <http://sources.redhat.com/cluster/gfs/>.
  578   Otherwise, each host will assume it is the sole user of the filesystem
  579   and data corruption will result.
  580 
  581   5.9.  Q: Can you give me an overview of GFS and related software?
  582 
  583   A: Yes, here's a brief overview.
  584 
  585   5.9.1.  Background
  586 
  587   GFS is a scalable, journaled filesystem designed to be used by more
  588   than one computer at a time. There is a separate journal for each host
  589   using the filesystem. All the hosts working together are called a
  590   cluster, and each member of the cluster is called a cluster node.
  591 
  592   To achieve acceptable performance, each cluster node remembers what
  593   was on the block device the last time it looked. This is caching,
  594   where copies of the data in RAM are used temporarily instead of data
  595   directly from the block device.
  596 
  597   To avoid chaos, the data in the RAM cache of every cluster node has to
  598   match what's on the block device. The cluster nodes communicate over
  599   TCP/IP to agree on who is in the cluster and who has the right to
  600   use a particular part of the shared block device.
  602 
  603   5.9.2.  Hardware
  604 
  605   To allow the cluster nodes to control membership in the cluster and to
  606   control access to the shared block storage, "fencing" hardware can be
  607   used.
  608 
  609   Some network switches can be dynamically configured to turn single
  610   ports on and off, effectively fencing a node off from the rest of the
  611   network.
  612 
  613   Remote power switches can be told to turn an outlet off, powering a
  614   cluster node down, so that it is certainly not accessing the shared
  615   storage.
  616 
  617   5.9.3.  Software
  618 
  619   The RedHat Cluster Suite developers have created several pieces of
  620   software besides the GFS filesystem itself to allow the cluster nodes
  621   to coordinate cluster membership and to control access to the shared
  622   block device.
  623 
  624   These parts are listed here, on the GFS Project Page.
  625 
  626   http://sources.redhat.com/cluster/gfs/
  627   <http://sources.redhat.com/cluster/gfs/>
  628 
  629   GFS and its related software are undergoing continuous heavy
  630   development and are maturing slowly but steadily.
  631 
  632   As might be expected, the developers working for RedHat target RedHat
  633   Enterprise Linux as the ultimate platform for GFS and its related
  634   software. They also use Fedora Core as a platform for testing and
  635   innovation.
  636 
  637   That means that when choosing a distribution for running GFS, recent
  638   versions of Fedora Core, RedHat Enterprise Linux (RHEL), and RHEL
  639   clones like CentOS should be considered. On these platforms, RPMs are
  640   available that have a good chance of working "out of the box."
  641 
  642   With a RedHat-based distro like Fedora Core, using GFS means seeking
  643   out the appropriate documentation, installing the necessary RPMs, and
  644   creating a few text files for configuring the software.
  645 
  646   Here is a good overview of what the process is generally like. Note
  647   that if you're using RPMs, then building and installing the software
  648   will not be necessary.
  649 
  650   http://sources.redhat.com/cluster/doc/usage.txt
  651   <http://sources.redhat.com/cluster/doc/usage.txt>
  652 
  653   5.9.4.  Use
  654 
  655   Once you have things ready, using the GFS is like using any other
  656   filesystem.
  657 
  658   Performance will be greatest when the filesystem operations of the
  659   different nodes do not interfere with one another. For instance, if
  660   all the nodes try to write to the same place in a directory or file,
  661   much time will be spent in coordinating access (locking).
  662 
  663   An easy way to eliminate a large amount of locking is to use the
  664   "noatime" (no access time update) mount option. Even in traditional
  665   filesystems the use of this option often results in a dramatic
  666   performance benefit, because it eliminates the need to write to the
  667   block storage just to record the time that the file was last accessed.
  668 
  669   5.9.5.  Fencing
  670 
  671   There are several ways to keep a cluster node from accessing shared
  672   storage when that node might have outdated assumptions about the state
  673   of the cluster or the storage. Preventing the node from accessing the
  674   storage is called "fencing", and it can be accomplished in several
  675   ways.
  676 
  677   One popular way is to simply kill the power to the fenced node by
  678   using a remote power switch. Another is to use a network switch that
  679   has ports that can be turned on and off remotely.
  680 
  681   When the shared storage resource is a LUN on an SR, it is possible to
  682   manipulate the LUN's mask list in order to accomplish fencing. You can
  683   read about this technique in the Contributions area
  684   </support/linux/contrib/>.
  685 
  686   5.10.  Q: How can I make a RAID of more than 27 components?
  687 
  688   A: For Linux Software RAID, the kernel limits the number of disks in
  689   one RAID to 27. However, you can easily overcome this limitation by
  690   creating another level of RAID.
  691 
  692   For example, to create a RAID 0 of thirty block devices, you may
  693   create three ten-disk RAIDs (md1, md2, and md3) and then stripe across
  694   them (md0 is a stripe over md1, md2, and md3).
  695 
  696   Here is an example raidtools configuration file that implements the
  697   above scenario for shelves 5, 6, and 7: multi-level RAID 0
  698   configuration file <raid0-30component.conf>. Non-trivial raidtab
  699   configuration files are easier to generate from a script than to
  700   create by hand.
  701 
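        If you prefer mdadm, the same layout might be sketched as below,
        assuming each of shelves 5, 6, and 7 exports LUNs in slots 0-9:

             mdadm -C -n 10 --level=raid0 --auto=md /dev/md1 /dev/etherd/e5.[0-9]
             mdadm -C -n 10 --level=raid0 --auto=md /dev/md2 /dev/etherd/e6.[0-9]
             mdadm -C -n 10 --level=raid0 --auto=md /dev/md3 /dev/etherd/e7.[0-9]
             mdadm -C -n 3 --level=raid0 --auto=md /dev/md0 /dev/md[1-3]
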
  702   EtherDrive storage gives you a lot of freedom, so be creative.
  703 
  704   5.11.  Q: Why do my device nodes disappear after a reboot?
  705 
  706   A: Some Linux distributions create device nodes dynamically. The
  707   upcoming method of choice is called "udev". The aoe driver and udev
  708   work together when the following rules are installed.
  709 
  710   These rules go into a file with a name like 60-aoe.rules.  Look in
  711   your udev.conf file (usually /etc/udev/udev.conf) for the line
  712   starting with udev_rules= to find out where rules go (usually
  713   /etc/udev/rules.d).
  714 
  715 
  716 
  717   # These rules tell udev what device nodes to create for aoe support.
  718   # They may be installed along the following lines.  Check the section
  719   # 8 udev manpage to see whether your udev supports SUBSYSTEM, and
  720   # whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
  721 
  722   # aoe char devices
  723   SUBSYSTEM=="aoe", KERNEL=="discover",   NAME="etherd/%k", GROUP="disk", MODE="0220"
  724   SUBSYSTEM=="aoe", KERNEL=="err",        NAME="etherd/%k", GROUP="disk", MODE="0440"
  725   SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
  726   SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
  727   SUBSYSTEM=="aoe", KERNEL=="flush",      NAME="etherd/%k", GROUP="disk", MODE="0220"
  728 
  729   # aoe block devices
  730   KERNEL=="etherd*",       NAME="%k", GROUP="disk"
  731 
  732 
  733 
  734   Unfortunately the syntax for the udev rules file has changed several
  735   times as new versions of udev appear. You will probably have to modify
  736   the example above for your system, but the existing rules and the udev
  737   documentation should help you.
  738 
  739   There is an example script in the aoe driver,
  740   linux/Documentation/aoe/udev-install.sh, that can install the rules on
  741   most systems.
  742 
  743   The udev system can only work with the aoe driver if the aoe driver is
  744   loaded. To avoid confusion, make sure that you load the aoe driver at
  745   boot time.
  746 
  747   5.12.  Q: Why does RAID initialization seem slow?
  748 
  749   A: The 2.6 Linux kernel has a problem with its RAID initialization
  750   rate limiting feature. You can override this feature and speed up RAID
  751   initialization by using the following commands. Note that these
  752   commands change kernel memory, so the commands must be re-run after a
  753   reboot.
  754 
  755 
  756 
  757        echo 100000 > /proc/sys/dev/raid/speed_limit_max
  758        echo 100000 > /proc/sys/dev/raid/speed_limit_min
  759 
  760 
  761 
  762   5.13.  Q: I can only use shelf zero! Why won't e1.9 work?
  763 
  764   A: Every block device has a device file, usually in /dev, that has a
  765   major and minor number. You can see these numbers using ls. Note the
  766   high minor numbers (1744, 2400, and 2401) in the example below.
  767 
  768 
  769 
  770        ecashin@makki ~$ ls -l /dev/etherd/
  771        total 0
  772        brw-------  1 root disk 152, 1744 Mar  1 14:35 e10.9
  773        brw-------  1 root disk 152, 2400 Feb 28 12:21 e15.0
  774        brw-------  1 root disk 152, 2401 Feb 28 12:21 e15.0p1
  775 
  776 
  777 
  778   The 2.6 Linux kernel allows high minor device numbers like this, but
  779   until recently, 255 was the highest minor number one could use. Some
  780   distributions contain userland software that cannot understand the
  781   high minor numbers that 2.6 makes possible.
  782 
  783   Here's a crude but reliable test that can determine whether your
  784   system is ready to use devices with high minor numbers. In the example
  785   below, we tried to create a device node with a minor number of 1744,
  786   but ls shows it as 208.
  787 
  788 
  789 
  790        root@kokone ~# mknod e10.9 b 152 1744
  791        root@kokone ~# ls -l e10.9
  792        brw-r--r--  1 root root 158, 208 Mar  2 15:13 e10.9
  793 
  794 
  795 
  796   On systems like this, you can still use the aoe driver to access up to
  797   256 disks if you're willing to live without support for partitions.
  798   Just make sure that the device nodes and the aoe driver are both
  799   created with one partition per device.
  800 
  801   The commands below show how to create a driver without partition
  802   support and then to create compatible device nodes for shelf 10.
  803 
  804 
  805 
  806        make install AOE_PARTITIONS=1
  807        rm -rf /dev/etherd
  808        env n_partitions=1 aoe-mkshelf /dev/etherd 10
  809 
  810 
  811 
  812   As of version 1.9.0, the mdadm command supports large minor device
  813   numbers. The mdadm versions before 1.9.0 do not. If you would like to
  814   use versions of mdadm older than 1.9.0, you can configure your driver
  815   and device nodes as outlined above. Be aware that it's easy to confuse
  816   yourself by creating a driver that doesn't match the device nodes.
  817 
  818   5.14.  Q: How can I start my AoE storage on boot and shut it down when
  819   the system shuts down?
  820 
  821   A: That is really a question about your own system, so it's a question
  822   you, as the system administrator, are in the best position to answer.
  823 
  824   In general, though, many Linux distributions follow the same patterns
  825   when it comes to system "init scripts". Most use a System V style.
  826 
  827   The example below should help get you started if you have never
  828   created and installed an init script. Start by reading the comments at
  829   the top. Make sure you understand how your system works and what the
  830   script does, because every system is different.
  831 
  832   Here is an overview of what happens when the aoe module is loaded and
  833   begins AoE device discovery. It should help you to
  834   understand the example script below. Starting up the aoe module on
  835   boot can be tricky if necessary parts of the system are not ready when
  836   you want to use AoE.
  837 
  838   To discover an AoE device, the aoe driver must receive a Query Config
  839   response packet that indicates the device is available. A Coraid SR
  840   broadcasts this response unsolicited when you run the online SR
  841   command, but it is usually sent in response to an AoE initiator
  842   broadcasting a Query Config command to discover devices on the
  843   network. Once an AoE device has been discovered, the aoe driver sends
  844   an ATA Device Identify command to get information about the disk
  845   drive. When the disk size is known, the aoe driver will install the
  846   new block device in the system.
  847 
  848   The aoe driver will broadcast this AoE discovery command when loaded,
  849   and then once a minute thereafter.
  850 
  851   The AoE discovery that takes place on loading the aoe driver does not
  852   take long, but it does take some time. That's why you'll see "sleep"
  853   commands in the example aoe-init script below. If AoE discovery is
  854   failing, try unloading the aoe module and tuning your init script by
  855   invoking it at the command line.
  856 
  857   You will often find that a delay is necessary after loading your
  858   network drivers (and before loading the aoe driver). This delay allows
  859   the network interface to initialize and to become usable. An
  860   additional delay is necessary after loading the aoe driver, so that
  861   AoE discovery has time to take place before any AoE storage is used.
  862 
  863   Without such a delay, the initial AoE Config Query broadcast packet
  864   might never go out onto the AoE network, and then the AoE initiator
  865   will not know about any AoE targets until the next periodic Config
  866   Query broadcast occurs, usually one minute later.
  867 
  868 
  869 
  870   #! /bin/sh
  871   # aoe-init - example init script for ATA over Ethernet storage
  872   #
  873   #   Edit this script for your purposes.  (Changing "eth1" to the
  874   #   appropriate interface name, adding commands, etc.)  You might
  875   #   need to tune the sleep times.
  876   #
  877   #   Install this script in /etc/init.d with the other init scripts.
  878   #
  879   #   Make it executable:
  880   #     chmod 755 /etc/init.d/aoe-init
  881   #
  882   #   Install symlinks for boot time:
  883   #     cd /etc/rc3.d && ln -s ../init.d/aoe-init S99aoe-init
  884   #     cd /etc/rc5.d && ln -s ../init.d/aoe-init S99aoe-init
  885   #
  886   #   Install symlinks for shutdown time:
  887   #     cd /etc/rc0.d && ln -s ../init.d/aoe-init K01aoe-init
  888   #     cd /etc/rc1.d && ln -s ../init.d/aoe-init K01aoe-init
  889   #     cd /etc/rc2.d && ln -s ../init.d/aoe-init K01aoe-init
  890   #     cd /etc/rc6.d && ln -s ../init.d/aoe-init K01aoe-init
  891   #
  892 
  893   case "$1" in
  894           "start")
  895                   # load any needed network drivers here
  896 
  897                   # replace "eth1" with your aoe network interface
  898                   ifconfig eth1 up
  899 
  900                   # time for network interface to come up
  901                   sleep 4
  902 
  903                   modprobe aoe
  904 
  905                   # time for AoE discovery and udev
  906                   sleep 7
  907 
  908                   # add your raid assemble commands here
  909                   # add any LVM commands if needed (e.g. vgchange)
  910                   # add your filesystem mount commands here
  911 
  912                   test -d /var/lock/subsys && touch /var/lock/subsys/aoe-init
  913                   ;;
  914           "stop")
  915                   # add your filesystem umount commands here
  916                   # deactivate LVM volume groups if needed
  917                   # add your raid stop commands here
  918                   rmmod aoe
  919                   rm -f /var/lock/subsys/aoe-init
  920                   ;;
  921           *)
  922                   echo "usage: `basename $0` {start|stop}" 1>&2
  923                   ;;
  924   esac
  925 
  926 
  927 
  928   5.15.  Q: Why do I get "permission denied" when I'm root?
  929 
  930   A: Some newer systems come with SELinux (Security-Enhanced Linux),
  931   which can limit what the root user can do.
  932 
  933   SELinux is usually good about creating entries in the system logs when
  934   it prevents root from doing something, so examine your logs for such
  935   messages.
  936 
  937   Check the SELinux documentation for information on how to configure or
  938   disable SELinux according to your needs.
  939 
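        On many SELinux systems you can check the current mode and, for a
        quick test, switch to permissive mode (a sketch; consult your
        distribution's documentation before relaxing enforcement):

             getenforce
             setenforce 0
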
  940   5.16.  Q: Why does fdisk ask me for the number of cylinders?
  941 
  942   A: Your fdisk is probably asking the kernel for the size of the disk
  943   with a BLKGETSIZE block device ioctl, which returns the sector count
  944   of the disk in a 32-bit number. If the size of the disk exceeds the
  945   ability to be stored in this 32-bit number (2 TB is the limit), the
  946   ioctl returns ETOOBIG as an error. This error indicates that the
  947   program should try the 64-bit ioctl (BLKGETSIZE64), but when fdisk
  948   doesn't do that, it just asks the user to supply the number of
  949   cylinders.
  950 
  951   You can tell fdisk the number of cylinders yourself. The number to use
  952   (sectors / (255 * 63)) is printed by the following commands. Use the
  953   appropriate device instead of "e0.0".
  954 
  955 
  956 
  957        sectors=`cat /sys/block/etherd\!e0.0/size`
  958        echo $sectors 255 63 '*' / p | dc
  959 
  960 
  961 
  962   But no MSDOS partition table can ever work with more than 2TB. The
  963   reason is that the numbers in the partition table itself are only 32
  964   bits in size. That means you can't have a partition larger than 2TB in
  965   size or starting further than 2TB from the beginning of the device.
  966 
  967   Some options for multi-terabyte volumes are:
  968 
  969 
  970   1. By doing without partitions, the filesystem can be created directly
  971      on the AoE device itself (e.g., /dev/etherd/e1.0),
  972 
  973   2. LVM2, the Logical Volume Manager, is a sophisticated way of
  974      allocating storage to create logical volumes of desired sizes, and
  975 
  976   3. GPT partition tables.
  977 
  978   The last item in the list above is a new kind of partition table that
  979   overcomes the limitations of the older MSDOS-style partition table.
  980   Andrew Chernow has related his successful experiences using GPT
  981   partition tables on large AoE devices in this contributed document
  982   </support/linux/contrib/chernow/gpt.html>.
  983 
  984   Please note that some versions of the GNU parted tool, such as version
  985   1.8.6, have a bug. This bug allows the user to create an MSDOS-style
  986   partition table with partitions larger than two terabytes even though
  987   these partitions are too large for an MSDOS partition table. The
  988   result is that the filesystems on these partitions will only be usable
  989   until the next reboot.
  990 
  991   5.17.  Q: Can I use AoE equipment with Oracle software?
  992 
  993   A: Oracle used to have an Oracle Storage Compatibility Program
  994   <http://www.oracle.com/technology/deploy/availability/htdocs/oscp.html>,
  995   but simple block-level storage technologies do not require Oracle
  996   validation. ATA over Ethernet provides simple, block-level storage.
  997 
  998   Oracle used to have a list of frequently asked questions about
  999   running Oracle on Linux, but they have replaced it with documentation
 1000   covering their own Linux distribution
 1001   <http://www.oracle.com/technology/tech/linux/htdocs/oracleonlinux_faq.html>.
 1002   A third party site continues to maintain a FAQ about running Oracle on
 1003   Linux <http://www.orafaq.com/faqlinux.htm>.
 1004 
 1005   5.18.  Q: Why do I have intermittent problems?
 1006 
 1007   A: Make sure your network is in good shape. Having good patch cables,
 1008   reliable network switches with good flow control, and good network
 1009   cards will keep your network storage happy.
 1010 
 1011   5.19.  Q: How can I avoid running out of memory when copying large
 1012   files?
 1013 
 1014   A: You can tell the Linux kernel not to wait so long before writing
 1015   data out to backing storage.
 1016 
 1017 
 1018 
 1019        echo 3 > /proc/sys/vm/dirty_ratio
 1020        echo 4 > /proc/sys/vm/dirty_background_ratio
 1021        echo 32768 > /proc/sys/vm/min_free_kbytes
 1022 
 1023 
 1024 
 1025   When a large MTU, like 9000, is being used on the AoE-side network
 1026   interfaces, a larger min_free_kbytes setting could be helpful. The
 1027   more RAM you have, the larger the number you might have to use.
 1028 
 1029   There are also alternative settings to the above "ratio" settings,
 1030   available as of kernel version 2.6.29. They are dirty_bytes and
 1031   dirty_background_bytes, and they provide finer control for systems
 1032   with large amounts of RAM.
 1033 
 1034   If you find the /proc settings to be helpful, you can make them
 1035   permanent by editing /etc/sysctl.conf or by creating an init script
 1036   that performs the settings at boot time.
 1037 
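        The equivalent /etc/sysctl.conf entries for the settings shown
        above would look like this sketch:

             vm.dirty_ratio = 3
             vm.dirty_background_ratio = 4
             vm.min_free_kbytes = 32768
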
 1038   The Documentation/sysctl/vm.txt file for your kernel has details on
 1039   the settings available for your particular kernel, but some guiding
 1040   principles are...
 1041 
 1042 
 1043   +o  Linux will use free RAM to cache the data that is on AoE targets,
 1044      which is helpful.
 1045 
 1046   +o  Writes to the AoE target go first to RAM, updating the cache. Those
 1047      updated parts of the cached data are "dirty" until the changes are
 1048      written out to the AoE target. Then they're "clean".
 1049 
 1050   +o  If the system needs RAM for something else, clean parts of the
 1051      cache can be repurposed immediately.
 1052 
 1053   +o  The RAM that is holding dirty cache data cannot be reclaimed
 1054      immediately, because it reflects updates to the AoE target that
 1055      have not yet made it to the AoE target.
 1056 
 1057   +o  Systems with much RAM and doing many writes will accumulate dirty
 1058      data quickly.
 1059 
 1060   +o  If the processes creating the write workload are forced by the
 1061      Linux kernel to wait for the dirty data to be flushed out to the
 1062      backing store (AoE targets), then I/O goes fast but the producers
 1063      are naturally throttled, and the system stays responsive and
 1064      stable.

 1065   +o  If the dirty data is flushed in "the background", though, then when
 1066      there's too much dirty data to flush out, the system becomes
 1067      unresponsive.
 1068 
 1069   +o  Telling Linux to maintain a certain amount of truly free RAM, not
 1070      used for caching, allows the system to have plenty of RAM for doing
 1071      the work of flushing out the dirty data.
 1072 
 1073   +o  Telling Linux to push dirty data out sooner keeps the backing store
 1074      more consistent while it is being used (with regard to the danger
 1075      of power failures, network failures, and the like). It also allows
 1076      the system to quickly reclaim memory used for caching when needed,
 1077      since the data is clean.
 1078 
 1079   5.20.  Q: Why doesn't the aoe driver notice that an AoE device has
 1080   disappeared or changed size?
 1081 
 1082   A: Prior to the aoe6-15 driver, aoe drivers only learned an AoE
 1083   device's characteristics once, and the only way to use an AoE device
 1084   that had grown or to get rid of "phantom" AoE devices that were no
 1085   longer present was to re-load the aoe module completely.
 1086 
 1087 
 1088 
 1089        rmmod aoe
 1090        modprobe aoe
 1091 
 1092 
 1093 
 1094   Since aoe6-15, aoe drivers have supported the aoe-revalidate command.
 1095   See the aoe-revalidate manpage for more information.
 1096 
 1097   5.21.  Q: My NFS client hangs when I export a filesystem on an AoE
 1098   device.
 1099 
 1100   A: If you are exporting a filesystem over NFS, then that filesystem
 1101   resides on a block device. Every block device has a major and minor
 1102   device number that you can see by running "ls -l".
 1103 
 1104   If the block device has a "high" minor number, over 255, and you're
 1105   trying to export a filesystem on that device, then NFS will have
 1106   trouble using the minor number to identify the filesystem. You can
 1107   tell the NFS server to use a different number by using the "fsid"
 1108   option in your /etc/exports file.
 1109 
 1110   The fsid option is documented in the "exports" manpage. Here's an
 1111   example of how its use might look in /etc/exports.
 1112 
 1113 
 1114 
 1115        /mnt/alpha 205.185.197.207(rw,sync,no_root_squash,fsid=20)
 1116 
 1117 
 1118 
 1119   As the manpage says, each filesystem needs its own unique fsid.
 1120 
 1121   5.22.  Q: Why do I see "unknown partition table" errors in my logs?
 1122 
 1123   A: Those are probably not errors.  Usually this message means that
 1124   your disk doesn't have a partition table. With AoE devices, that's the
 1125   common case.
 1126 
 1127   When a new block device is detected by the kernel, the kernel tries to
 1128   read the part of the block device where a partition table is
 1129   conventionally stored.
 1130 
 1131   The kernel checks to see whether the data there looks like any kind of
 1132   partition table that it knows about. It can't tell the difference
 1133   between a disk with a kind of partition table it doesn't know about
 1134   and a disk with no partition table at all.
 1135 
 1136   5.23.  Q: Why do I get better throughput to a file on an AoE device
 1137   than to the device itself?
 1138 
 1139   Most of the time a filesystem resides on a block device, so that the
 1140   filesystem can be mounted and the storage used by reading and
 1141   writing files and directories.  When you are not using a filesystem at
 1142   all, you might see somewhat degraded performance. Sometimes this
 1143   degradation comes as a surprise to new AoE users when they first try
 1144   out an AoE device with the dd command, for example, before creating a
 1145   filesystem on the device.
 1146 
 1147   If the AoE device has an odd number of sectors, the block layer of the
 1148   Linux kernel presents the aoe driver with 512-byte I/O jobs. Each AoE
 1149   packet winds up with only one sector of data, doubling the number of
 1150   AoE packets when normal ethernet frames are in use.
 1151 
 1152   The Linux kernel's block layer gives special treatment to filesystem
 1153   I/O, giving the aoe driver I/O jobs in the filesystem block size, so
 1154   there is no performance penalty to using a filesystem on an AoE device
 1155   that has an odd number of sectors. Since there isn't a large demand
 1156   for non-filesystem I/O, the complexity associated with coalescing
 1157   multiple I/O jobs in the aoe driver is probably not worth the
 1158   potential driver instability it could introduce.
 1159 
 1160   One way to work around this issue is to use the O_DIRECT flag to the
 1161   "open" system call. For recent versions of dd, you can use the option,
 1162   "oflag=direct" to tell dd to use this O_DIRECT flag. You should
 1163   combine this option with a large block size, such as "bs=4M", in
 1164   order to take advantage of a larger I/O batch size.
 1165 
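        For example, a raw write test might look like the following sketch
        (careful: it overwrites any data on e0.3):

             dd if=/dev/zero of=/dev/etherd/e0.3 bs=4M count=256 oflag=direct
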
  Another way to work around this issue is to use a trivial md device
  as a wrapper. (Almost everyone uses a filesystem. This technique is
  only interesting to those who are not using a filesystem, so most
  people should ignore this idea.) In the example below, a
  single-disk RAID 0 is created for the AoE device e0.3. Although
  e0.3 has an odd number of sectors, the md1 device does not, and
  tcpdump confirms that each AoE packet carries 1 KiB of data, as we
  would like.



       makki:~# mdadm -C -l 0 -n 1 --auto=md  /dev/md1 /dev/etherd/e0.3
       mdadm: '1' is an unusual number of drives for an array, so it is probably
            a mistake.  If you really mean it you will need to specify --force before
            setting the number of drives.
       makki:~# mdadm -C -l 0 --force -n 1 --auto=md  /dev/md1 /dev/etherd/e0.3
       mdadm: array /dev/md1 started.
       makki:~# cat /sys/block/etherd\!e0.3/size
       209715201
       makki:~# cat /sys/block/md1/size
       209715072


  5.24.  Q: How can I boot diskless systems from my Coraid EtherDrive
  devices?

  A: Booting from AoE devices is similar to other kinds of network
  booting.  Customers have contributed examples of successful
  strategies in the Contributions Area </support/linux/contrib/> of
  the Coraid website.

  Jayson Vantuyl: Making A Flexible Initial Ramdisk
  </support/linux/contrib/index.html#jvboot>

  Jason McMullan: Add root filesystem on AoE support to aoe driver
  </support/linux/contrib/index.html#jmboot>

  Keep in mind that if you intend to use AoE devices before udev is
  running, you must use static minor numbers for the device nodes. An
  aoe6 driver of version 50 or above can be instructed to use static
  minor numbers by loading it with the aoe_dyndevs=0 module
  parameter.  (Earlier aoe drivers only used static minor device
  numbers.)

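  For example, to load the driver with dynamic device numbers
  disabled:



       modprobe aoe aoe_dyndevs=0
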
  5.25.  Q: What filesystems do you recommend for very large block
  devices?

  A: The filesystem you choose will depend on how you want to use the
  storage. Here are some generalizations that may serve as a starting
  point.

  There are two major classes of filesystems: cluster filesystems and
  traditional filesystems. Cluster filesystems are more complex and
  support simultaneous access from multiple independent computers to
  a single filesystem stored on a shared block device.

  Traditional filesystems are only mounted by one host at a time.
  Some traditional filesystems that scale to sizes larger than those
  supported by ext3 include the following journalling filesystems.

  XFS <http://oss.sgi.com/projects/xfs/>, developed at SGI,
  specializes in high throughput to large files.

  Reiserfs <http://www.namesys.com/>, an often experimental
  filesystem, can perform well with many small files.

  JFS <http://jfs.sourceforge.net/>, developed at IBM, is a general
  purpose filesystem.

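  As a minimal sketch, creating and mounting one of these filesystems
  on an AoE device looks like the following. (The choice of XFS, the
  device name e0.0, and the mount point are illustrative.)



       mkfs.xfs /dev/etherd/e0.0
       mkdir -p /mnt/big
       mount /dev/etherd/e0.0 /mnt/big
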
  5.26.  Q: Why does umount say, "device is busy"?

  A: That just means you're still using the filesystem on that
  device.

  Unless something has gone very wrong, you should be able to unmount
  after you stop using the filesystem. Here are a few ways you might
  be using the filesystem without knowing it:


  o  NFS might be exporting it. Stopping the NFS service will release
     the filesystem.

  o  A process might be holding open a file on the filesystem.
     Killing the process will release the filesystem.

  o  A process might have some directory on the filesystem as its
     current working directory. In that case, you can kill the
     process or (if it's a shell) cd to some other directory that's
     not on the filesystem you're trying to unmount.

  The lsof command can be helpful in finding processes that are using
  files, as shown below.

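  For example, assuming the filesystem is mounted at /mnt/raid,
  either of the commands below lists the processes keeping it busy.
  (lsof treats a mount point argument specially and reports all open
  files on that filesystem.)



       lsof /mnt/raid
       fuser -vm /mnt/raid
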
  5.27.  Q: How do I use the multiple network path support in driver
  versions 33 and up?

  A: You don't have to do anything to benefit from the aoe driver's
  ability to use multiple network paths to the same AoE target.

  The aoe driver will automatically use each end-to-end path in an
  essentially round-robin fashion. If one network path becomes
  unusable, the aoe driver will attempt to use the remaining network
  paths to reach the AoE target, even retransmitting any lost packets
  through one of the remaining paths.

  5.28.  Q: Why does "xfs_check" say "out of memory"?

  A: The xfstools use a huge amount of virtual memory when operating
  on large filesystems. The CLN HOWTO has some helpful information
  about using temporary swap space when necessary for accommodating
  the xfstools' virtual memory requirements.

  CLN HOWTO: Repairing a Filesystem </support/cln/CLN-
  HOWTO/ar01s05.html#id2515012>

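  A temporary swap file can supply the needed virtual memory. A
  minimal sketch (the 2 GiB size and the /tmp path are illustrative;
  see the CLN HOWTO above for details):



       dd if=/dev/zero of=/tmp/swapfile bs=1M count=2048
       mkswap /tmp/swapfile
       swapon /tmp/swapfile
       # ... run xfs_check, then undo the temporary swap:
       swapoff /tmp/swapfile
       rm /tmp/swapfile
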
  The 32-bit xfstools are limited in the size of the filesystem they
  can operate on; 64-bit systems overcome this limitation. The limit
  is likely to be encountered with 32-bit xfstools on filesystems
  over 2 TiB in size.

  5.29.  Q: Can virtual machines running on VMware ESX use AoE over
  jumbo frames?

  A: It is somewhat difficult to find public information about the
  ESX configuration necessary to use jumbo frames, but there is
  information in the public forum at the URL below.

  How to setup TCP/IP Jumbo packet support in VMware ESX 3.5 on W2K3
  VMs <http://communities.vmware.com/thread/135691>

  5.30.  Q: Can I use SMART with my AoE devices?

  A: The early Coraid products like the EtherDrive PATA blades simply
  passed ATA commands through to the attached PATA disk, including
  SMART commands. While there was no way to ask the aoe driver to
  send SMART commands, one could ask aoeping to send them.  The
  aoeping manpage has more information.

  The Coraid SR and VS storage appliances present AoE targets that
  are LUNs, not corresponding to any specific disk. The SR supports
  SMART internally, on its own command line, but its AoE LUNs do not
  support SMART.

  6.  Jumbo Frames

  Data is transmitted over the ethernet in frames, usually with a
  maximum frame size of 1500 octets. Receiving or transmitting a
  frame of data takes time, and by increasing the amount of data per
  frame, data can often be transmitted more efficiently over an
  ethernet network.

  Frames larger than 1500 octets are called "jumbo frames." There is
  plenty of information about jumbo frames out there, so in this
  section we're going to focus on how jumbo frames relate to the use
  of AoE.

  When you change the MTU on your Linux host's network interface, the
  interface must essentially reboot itself. Once this has completed
  and the interface is back up, you should run the aoe-discover
  command to trigger the reevaluation of the AoE device's jumbo frame
  capability.  You should see lines in your log (or in the output of
  the dmesg command) indicating that the frame size has changed. The
  example text below appears after setting the MTU on eth1 to 4200,
  enough for 4 KiB of data plus headers.



       aoe: e7.0: setting 4096 byte data frames on eth1:003048865ed2



  If you do not see this output, try running aoe-revalidate on the
  device in question. If there is a switch between your SR and your
  Linux client that does not have jumbo frames enabled, the aoe
  driver will fall back to 1 KiB of data per packet until a forced
  revalidation occurs.

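  Putting these steps together, the sequence below raises the MTU,
  rediscovers targets, and forces a revalidation. (A sketch: the
  interface name eth1 and the device name e7.0 are illustrative.)



       ifconfig eth1 mtu 4200 up
       aoe-discover
       aoe-revalidate e7.0
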
  For larger frames to be used, the whole network path must support
  them. For example, consider a scenario where you are using ...


  1. a LUN from a Coraid SR1521 as your AoE target,

  2. a Linux host with an Intel gigabit NIC as your AoE initiator,
     and

  3. a gigabit switch between the target and initiator.

  In that case, all three points on the network must be configured to
  handle large frames in order for AoE data to be transmitted in
  jumbo frames.

  6.1.  Linux NIC MTU

  Check the documentation for your network card's driver to find out
  how to change its maximum transmission unit (MTU). For example, if
  you have a gigabit Intel NIC, you can read the
  Documentation/networking/e1000.txt file in the kernel sources to
  find out that the following command increases the MTU to 4200.



       ifconfig ethx mtu 4200 up



  The real name of your interface (e.g., "eth1") should be used
  instead of "ethx".

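  On systems that use the iproute2 tools rather than ifconfig, the
  equivalent command would be the following (again substituting your
  real interface name):



       ip link set dev eth1 mtu 4200 up
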
  6.2.  Network Switch MTU

  Usually you have to turn on jumbo frames in a switch that supports
  them. Supporting jumbo frames requires a different buffer
  allocation in the switch, one that is not usually sensible for
  standard ethernet frames. Check the documentation for your switch
  for details.

  6.3.  SR MTU

  No special configuration steps need to be taken on the Coraid
  SATA+RAID unit for it to use jumbo frames if the firmware release
  is 20060316 or newer.

  You can see what firmware release your SR is running by issuing the
  "release" command at its command line.

  7.  Appendix A: Archives

  This section contains material that is no longer relevant to a
  majority of readers. It has been placed in this appendix with
  minimal editing.


  7.1.  Example: RAID 5 with the raidtools

  Let us assume we have five AoE targets that are virtual LUNs
  numbered 0 through 4, exported from a Coraid VS appliance that has
  been assigned shelf address 0. Let us further assume we want to use
  these five LUNs to create a level-5 RAID array. Using a text
  editor, we create a Software RAID configuration file named
  "/etc/rt". The transcript below shows its contents.



       $ cat /etc/rt
       raiddev /dev/md0
               raid-level      5
               nr-raid-disks   5
               chunk-size      32
               persistent-superblock 1
               device          /dev/etherd/e0.0
               raid-disk       0
               device          /dev/etherd/e0.1
               raid-disk       1
               device          /dev/etherd/e0.2
               raid-disk       2
               device          /dev/etherd/e0.3
               raid-disk       3
               device          /dev/etherd/e0.4
               raid-disk       4


  Here is an example of setting up and using the RAID array described
  by the above configuration file, /etc/rt.



       $ mkraid -c /etc/rt /dev/md0
       DESTROYING the contents of /dev/md0 in 5 seconds, Ctrl-C if unsure!
       handling MD device /dev/md0
       analyzing super-block
       disk 0: /dev/etherd/00:00, 19535040kB, raid superblock at 19534976kB
       disk 1: /dev/etherd/00:01, 19535040kB, raid superblock at 19534976kB
       disk 2: /dev/etherd/00:02, 19535040kB, raid superblock at 19534976kB
       disk 3: /dev/etherd/00:03, 19535040kB, raid superblock at 19534976kB
       disk 4: /dev/etherd/00:04, 19535040kB, raid superblock at 19534976kB
       $



  To make an ext3 filesystem on the RAID array and mount it, the
  following commands can be issued:



       $ mkfs.ext3 /dev/md0
       ... (mkfs output)
       $ mount /dev/md0 /mnt/raid
       $



  The resulting storage is single-fault tolerant. Add hot spares to
  make the array even more robust (see the Software RAID
  documentation for more information). Remember that it takes the md
  driver some time to initialize a new RAID 5 array. During that
  time, you can use the device, but performance is sub-optimal until
  md finishes. Check /proc/mdstat for information on the
  initialization's progress.

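  For example, the command below redisplays the initialization
  progress every two seconds until interrupted:



       watch cat /proc/mdstat
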
  7.2.  Example: RAID 10 with mdadm

  Today, the Linux kernel supports a raid10 personality, and you can
  create a RAID 10 with one mdadm command, as sketched below. Things
  used to be more complicated. The rest of this section shows the
  steps that used to be necessary to create a RAID 10 by first
  creating several RAID 1 mirrors that could serve as components for
  the larger RAID 0.

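  A minimal sketch of the modern single-command approach, creating a
  RAID 10 from eight AoE devices plus one hot spare (the device names
  are illustrative):



       mdadm -C /dev/md0 -l 10 -n 8 -x 1 /dev/etherd/e1.[0-8]
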
  RAID 10 is striping over mirrors. That is, a RAID 0 is created to
  stripe data over several RAID 1 devices. Each RAID 1 is a mirrored
  pair of disks. For a given (even) number of disks, a RAID 10 has
  less capacity and throughput than a RAID 5. Nevertheless, storage
  experts often prefer RAID 10 for its superior resiliency to
  failure, its low re-initialization time, and its low computational
  overhead.

  The first example shows how to create a RAID 10 from eight AoE
  targets, plus a hot spare on a ninth target, all sharing shelf
  address 1. After checking the mdadm manpage, it should be easy for
  you to create startup and shutdown scripts.


       # make-raid10.sh
       # create a RAID 10 from shelf 1 to be used with mdadm-aoe.conf

       set -xe         # shell flags: be verbose, exit on any error
       shelf=1

       # create the mirrors
       mdadm -C /dev/md1 -l 1 -n 2 /dev/etherd/e$shelf.0 /dev/etherd/e$shelf.1
       mdadm -C /dev/md2 -l 1 -n 2 /dev/etherd/e$shelf.2 /dev/etherd/e$shelf.3
       mdadm -C /dev/md3 -l 1 -n 2 /dev/etherd/e$shelf.4 /dev/etherd/e$shelf.5
       # md4 has two mirror members plus one hot spare (-x 1)
       mdadm -C /dev/md4 -l 1 -n 2 -x 1 /dev/etherd/e$shelf.6 /dev/etherd/e$shelf.7 \
               /dev/etherd/e$shelf.8
       sleep 1
       # create the stripe over the mirrors
       mdadm -C /dev/md0 -l 0 -n 4 /dev/md1 /dev/md2 /dev/md3 /dev/md4


  Notice that the make-raid10.sh script above sets up md4 with the
  hot spare drive. What if one of the drives in md1 fails? The "spare
  group" mdadm feature allows an mdadm process running in monitor
  mode to dynamically allocate hot spares as needed, so that the
  single hot spare can replace a faulty disk in any of the four RAID
  1 arrays.

  The configuration file below tells the mdadm monitor process that
  it can use the hot spare to replace any drive in the RAID 10.


       # mdadm-aoe.conf
       # see mdadm.conf manpage for syntax and info
       #
       # There's a "spare group" called e1, after the shelf
       # with address 1, so that mdadm can use hot spares for
       # any RAID 1 in the RAID 10 on shelf 1.
       #

       DEVICE /dev/etherd/e1.[0-9]

       ARRAY /dev/md1
         devices=/dev/etherd/e1.0,/dev/etherd/e1.1
         spare-group=e1
       ARRAY /dev/md2
         devices=/dev/etherd/e1.2,/dev/etherd/e1.3
         spare-group=e1
       ARRAY /dev/md3
         devices=/dev/etherd/e1.4,/dev/etherd/e1.5
         spare-group=e1
       ARRAY /dev/md4
         devices=/dev/etherd/e1.6,/dev/etherd/e1.7,/dev/etherd/e1.8
         spare-group=e1

       ARRAY /dev/md0
         devices=/dev/md1,/dev/md2,/dev/md3,/dev/md4

       MAILADDR root

       # This is normally a program that handles events instead
       # of just /bin/echo.  If you run the mdadm monitor in the
       # foreground, though, using echo allows you to see what
       # events are occurring.
       #
       PROGRAM /bin/echo


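  To put the configuration to work, run mdadm in monitor mode against
  it. A sketch, assuming the file above is saved as
  /etc/mdadm-aoe.conf (without the -f flag, the monitor stays in the
  foreground, so the /bin/echo events are visible):



       mdadm --monitor --scan -c /etc/mdadm-aoe.conf
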
  7.3.  Important notes


  1. You may note above that the example creates the RAID device
     configuration file as /etc/rt rather than the conventional
     /etc/raidtab. The kernel uses the existence of /etc/raidtab to
     trigger starting the RAID device on boot, before any other
     initializations are performed. This is done to permit users to
     put their root filesystem on a Software RAID device.
     Unfortunately, because the kernel has not yet initialized the
     network, it is unable to access the EtherDrive storage at this
     point, and the kernel hangs. The workaround is to place
     EtherDrive-based RAID configurations in another file, such as
     /etc/rt, and to add commands similar to the following to an
     rc.local file for startup on boot:



       raidstart -c /etc/rt /dev/md0
       mount /dev/md0 /mnt/raid


  7.4.  Old FAQ List

  These questions are no longer frequently asked, probably because
  they relate to software that is no longer widely used.

  7.4.1.  Q: When I "modprobe aoe", it takes a long time. The system
  seems to hang. What could be the problem?

  A: When the hotplug service was first making its way into Linux
  distributions, it could slow things down and cause problems when
  the aoe module loaded. For some systems, it may be easiest to
  disable the service. Usually the right commands look like this:



       chkconfig hotplug off
       /etc/init.d/hotplug stop



  More recent distributions may need hotplug working in conjunction
  with udev. See the udev question in this FAQ for more information.
