Member "fasm/fasm.txt" (9 Feb 2020, 273907 Bytes) of package /linux/misc/fasm-1.73.22.tgz:


    1 
    2                            ,'''
    3                          ,,;,, ,,,,    ,,,,, ,,, ,,
    4                            ;       ;  ;      ;  ;  ;
    5                            ;  ,'''';   '''', ;  ;  ;
    6                            ;  ',,,,;, ,,,,,' ;  ;  ;
    7 
    8                               flat assembler 1.73
    9                               Programmer's Manual
   10 
   11 
   12 Table of contents
   13 -----------------
   14 
   15 Chapter 1  Introduction
   16 
   17         1.1  Compiler overview
   18         1.1.1  System requirements
   19         1.1.2  Executing compiler from command line
   20         1.1.3  Compiler messages
   21         1.1.4  Output formats
   22 
   23         1.2  Assembly syntax
   24         1.2.1  Instruction syntax
   25         1.2.2  Data definitions
   26         1.2.3  Constants and labels
   27         1.2.4  Numerical expressions
   28         1.2.5  Jumps and calls
   29         1.2.6  Size settings
   30 
   31 Chapter 2  Instruction set
   32 
   33         2.1  The x86 architecture instructions
   34         2.1.1  Data movement instructions
   35         2.1.2  Type conversion instructions
   36         2.1.3  Binary arithmetic instructions
   37         2.1.4  Decimal arithmetic instructions
   38         2.1.5  Logical instructions
   39         2.1.6  Control transfer instructions
   40         2.1.7  I/O instructions
   41         2.1.8  Strings operations
   42         2.1.9  Flag control instructions
   43         2.1.10  Conditional operations
   44         2.1.11  Miscellaneous instructions
   45         2.1.12  System instructions
   46         2.1.13  FPU instructions
   47         2.1.14  MMX instructions
   48         2.1.15  SSE instructions
   49         2.1.16  SSE2 instructions
   50         2.1.17  SSE3 instructions
   51         2.1.18  AMD 3DNow! instructions
   52         2.1.19  The x86-64 long mode instructions
   53         2.1.20  SSE4 instructions
   54         2.1.21  AVX instructions
   55         2.1.22  AVX2 instructions
   56         2.1.23  Auxiliary sets of computational instructions
   57         2.1.24  AVX-512 instructions
   58         2.1.25  Other extensions of instruction set
   59 
   60         2.2  Control directives
   61         2.2.1  Numerical constants
   62         2.2.2  Conditional assembly
   63         2.2.3  Repeating blocks of instructions
   64         2.2.4  Addressing spaces
   65         2.2.5  Other directives
   66         2.2.6  Multiple passes
   67 
   68         2.3  Preprocessor directives
   69         2.3.1  Including source files
   70         2.3.2  Symbolic constants
   71         2.3.3  Macroinstructions
   72         2.3.4  Structures
   73         2.3.5  Repeating macroinstructions
   74         2.3.6  Conditional preprocessing
   75         2.3.7  Order of processing
   76 
   77         2.4  Formatter directives
   78         2.4.1  MZ executable
   79         2.4.2  Portable Executable
   80         2.4.3  Common Object File Format
   81         2.4.4  Executable and Linkable Format
   82 
   83 
   84 
   85 Chapter 1  Introduction
   86 -----------------------
   87 
   88 This chapter contains all the most important information you need to begin
   89 using the flat assembler. If you are an experienced assembly language programmer,
   90 you should read at least this chapter before using this compiler.
   91 
   92 
   93 1.1  Compiler overview
   94 
   95 Flat assembler is a fast assembly language compiler for the x86 architecture
   96 processors, which does multiple passes to optimize the size of generated
   97 machine code. It is self-compilable and versions for different operating
   98 systems are provided. All the versions are designed to be used from the system
   99 command line and they should not differ in behavior.
  100 
  101 
  102 1.1.1  System requirements
  103 
  104 All versions require an x86 architecture 32-bit processor (at least 80386),
  105 although they can produce programs for the x86 architecture 16-bit processors,
  106 too. The DOS version requires an OS compatible with MS DOS 2.0 and either a true
  107 real mode environment or DPMI. The Windows version requires a Win32 console
  108 compatible with the 3.1 version.
  109 
  110 
  111 1.1.2  Executing compiler from command line
  112 
  113 To execute flat assembler from the command line you need to provide two
  114 parameters - first should be name of source file, second should be name of
  115 destination file. If no second parameter is given, the name for output
  116 file will be guessed automatically. After displaying short information about
  117 the program name and version, compiler will read the data from source file and
  118 compile it. When the compilation is successful, compiler will write the
  119 generated code to the destination file and display the summary of compilation
  120 process; otherwise it will display the information about error that occurred.
  121   The source file should be a text file, and can be created in any text
  122 editor. Line breaks are accepted in both DOS and Unix standards, tabulators
  123 are treated as spaces.
  124   In the command line you can also include "-m" option followed by a number,
  125 which specifies how many kilobytes of memory flat assembler should maximally
  126 use. In case of DOS version this option limits only the usage of extended
  127 memory. The "-p" option followed by a number can be used to specify the limit
  128 for number of passes the assembler performs. If code cannot be generated
  129 within specified amount of passes, the assembly will be terminated with an
  130 error message. The maximum value of this setting is 65536, while the default
  131 limit, used when no such option is included in command line, is 100.
  135   There are no command line options that would affect the output of compiler;
  136 flat assembler requires only the source code to include the information it
  137 really needs. For example, to select the output format you use the "format"
  138 directive at the beginning of source.
  139 
  140 
  141 1.1.3  Compiler messages
  142 
  143 As stated above, after a successful compilation the compiler displays
  144 the compilation summary. It includes the information of how many passes were
  145 done, how much time it took, and how many bytes were written into the
  146 destination file.
  147 The following is an example of the compilation summary:
  148 
  149 flat assembler  version 1.72 (16384 kilobytes memory)
  150 38 passes, 5.3 seconds, 77824 bytes.
  151 
  152 In case of error during the compilation process, the program will display an
  153 error message. For example, when compiler can't find the input file, it will
  154 display the following message:
  155 
  156 flat assembler  version 1.72 (16384 kilobytes memory)
  157 error: source file not found.
  158 
  159 If the error is connected with a specific part of source code, the source line
  160 that caused the error will also be displayed. The placement of this line in
  161 the source is also given to help you find the error, for example:
  162 
  163 flat assembler  version 1.72 (16384 kilobytes memory)
  164 example.asm [3]:
  165         mob     ax,1
  166 error: illegal instruction.
  167 
  168 It means that in the third line of the "example.asm" file compiler has
  169 encountered an unrecognized instruction. When the line that caused error
  170 contains a macroinstruction, also the line in macroinstruction definition
  171 that generated the erroneous instruction is displayed:
  172 
  173 flat assembler  version 1.72 (16384 kilobytes memory)
  174 example.asm [6]:
  175         stoschar 7
  176 example.asm [3] stoschar [1]:
  177         mob     al,char
  178 error: illegal instruction.
  179 
  180 It means that the macroinstruction in the sixth line of the "example.asm" file
  181 generated an unrecognized instruction with the first line of its definition.
  182 
  183 
  184 1.1.4  Output formats
  185 
  186 By default, when there is no "format" directive in source file, flat
  187 assembler simply puts generated instruction codes into output, creating
  188 a flat binary file this way. By default it generates 16-bit code, but you can
  189 always switch between the 16-bit and 32-bit modes by using the "use16" or
  190 "use32" directive. Some of the output formats switch into 32-bit mode when
  191 selected - more information about the available formats can be found in 2.4.
  192   All output code is always in the order in which it was entered into the
  193 source file.
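
        As a minimal sketch of the mode switch described above (the instructions
      themselves are arbitrary and chosen only for illustration), a source with no
      "format" directive could look like this:

          use16                   ; explicit, same as the default mode
          mov     ax,1            ; assembled as 16-bit code
          use32                   ; switch to 32-bit code
          mov     eax,1           ; assembled as 32-bit code

        Both instructions end up in the flat binary output in the order in which
      they appear in the source.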
  194 
  195 
  196 1.2  Assembly syntax
  197 
  198 The information provided below is intended mainly for the assembly language
  199 programmers that have been using some other assembly compilers before.
  200 If you are a beginner, you should look for the assembly programming tutorials.
  201   Flat assembler by default uses the Intel syntax for the assembly
  202 instructions, although you can customize it using the preprocessor
  203 capabilities (macroinstructions and symbolic constants). It also has its own
  204 set of the directives - the instructions for compiler.
  205   All symbols defined inside the sources are case-sensitive.
  206 
  207 
  208 1.2.1  Instruction syntax
  209 
  210 Instructions in assembly language are separated by line breaks, and one
  211 instruction is expected to fill one line of text. If a line contains
  212 a semicolon, except for the semicolons inside the quoted strings, the rest of
  213 this line is a comment and the compiler ignores it. If a line ends with "\"
  214 character (optionally followed by a semicolon and a comment), the next line
  215 is attached at this point.
  216   Each line in source is the sequence of items, which may be one of the three
  217 types. One type are the symbol characters, which are the special characters
  218 that are individual items even when they are not spaced from the other ones.
  219 Any of the "+-*/=<>()[]{}:,|&~#`" is the symbol character. The sequence of
  220 other characters, separated from other items with either blank spaces or
  221 symbol characters, is a symbol. If the first character of symbol is either a
  222 single or double quote, it integrates any sequence of characters following it,
  223 even the special ones, into a quoted string, which should end with the same
  224 character, with which it began (the single or double quote) - however if there
  225 are two such characters in a row (without any other character between them),
  226 they are integrated into quoted string as just one of them and the quoted
  227 string continues then. The symbols other than symbol characters and quoted
  228 strings can be used as names, so are also called the name symbols.
  229   Every instruction consists of the mnemonic and a variable number of
  230 operands, separated with commas. The operand can be a register, an immediate
  231 value or data addressed in memory; it can also be preceded by a size operator
  232 to define or override its size (table 1.1). The names of available registers
  233 are listed in table 1.2; their sizes cannot be overridden. An immediate value
  234 can be specified by any numerical expression.
  235   When operand is a data in memory, the address of that data (also any
  236 numerical expression, but it may contain registers) should be enclosed in
  237 square brackets or preceded by "ptr" operator. For example instruction
  238 "mov eax,3" will put the immediate value 3 into the EAX register, instruction
  239 "mov eax,[7]" will put the 32-bit value from the address 7 into EAX and the
  240 instruction "mov byte [7],3" will put the immediate value 3 into the byte at
  241 address 7, it can also be written as "mov byte ptr 7,3". To specify which
  242 segment register should be used for addressing, segment register name followed
  243 by a colon should be put just before the address value (inside the square
  244 brackets or after the "ptr" operator).
  245 
  246    Table 1.1  Size operators
  247   /-------------------------\
  248   | Operator | Bits | Bytes |
  249   |==========|======|=======|
  250   | byte     | 8    | 1     |
  251   | word     | 16   | 2     |
  252   | dword    | 32   | 4     |
  253   | fword    | 48   | 6     |
  254   | pword    | 48   | 6     |
  255   | qword    | 64   | 8     |
  256   | tbyte    | 80   | 10    |
  257   | tword    | 80   | 10    |
  258   | dqword   | 128  | 16    |
  259   | xword    | 128  | 16    |
  260   | qqword   | 256  | 32    |
  261   | yword    | 256  | 32    |
  262   | dqqword  | 512  | 64    |
  263   | zword    | 512  | 64    |
  264   \-------------------------/
  265 
  266    Table 1.2  Registers
  267   /-----------------------------------------------------------------\
  268   | Type    | Bits |                                                |
  269   |=========|======|================================================|
  270   |         | 8    | al    cl    dl    bl    ah    ch    dh    bh   |
  271   | General | 16   | ax    cx    dx    bx    sp    bp    si    di   |
  272   |         | 32   | eax   ecx   edx   ebx   esp   ebp   esi   edi  |
  273   |---------|------|------------------------------------------------|
  274   | Segment | 16   | es    cs    ss    ds    fs    gs               |
  275   |---------|------|------------------------------------------------|
  276   | Control | 32   | cr0         cr2   cr3   cr4                    |
  277   |---------|------|------------------------------------------------|
  278   | Debug   | 32   | dr0   dr1   dr2   dr3               dr6   dr7  |
  279   |---------|------|------------------------------------------------|
  280   | FPU     | 80   | st0   st1   st2   st3   st4   st5   st6   st7  |
  281   |---------|------|------------------------------------------------|
  282   | MMX     | 64   | mm0   mm1   mm2   mm3   mm4   mm5   mm6   mm7  |
  283   |---------|------|------------------------------------------------|
  284   | SSE     | 128  | xmm0  xmm1  xmm2  xmm3  xmm4  xmm5  xmm6  xmm7 |
  285   |---------|------|------------------------------------------------|
  286   | AVX     | 256  | ymm0  ymm1  ymm2  ymm3  ymm4  ymm5  ymm6  ymm7 |
  287   |---------|------|------------------------------------------------|
  288   | AVX-512 | 512  | zmm0  zmm1  zmm2  zmm3  zmm4  zmm5  zmm6  zmm7 |
  289   |---------|------|------------------------------------------------|
  290   | Opmask  | 64   | k0    k1    k2    k3    k4    k5    k6    k7   |
  291   |---------|------|------------------------------------------------|
  292   | Bounds  | 128  | bnd0  bnd1  bnd2  bnd3                         |
  293   \-----------------------------------------------------------------/
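
        The operand forms described in this section can be collected in a short
      sketch. The addresses and values are the ones used in the text above; the
      registers in the segment override line and the quoted string are only
      illustrative choices.

          mov     eax,3           ; immediate value 3 into EAX
          mov     eax,[7]         ; 32-bit value from address 7 into EAX
          mov     byte [7],3      ; immediate value 3 into the byte at address 7
          mov     byte ptr 7,3    ; the same instruction written with "ptr"
          mov     al,[es:bx]      ; segment register put before the address
          db      'it''s',\       ; doubled quote gives one quote character
                  0               ; this line is attached to the previous one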
  294 
  295 
  296 1.2.2  Data definitions
  297 
  298 To define data or reserve a space for it, use one of the directives listed in
  299 table 1.3. The data definition directive should be followed by one or more of
  300 numerical expressions, separated with commas. These expressions define the
  301 values for data cells of size depending on which directive is used. For
  302 example "db 1,2,3" will define the three bytes of values 1, 2 and 3
  303 respectively.
  304   The "db" and "du" directives also accept the quoted string values of any
  305 length, which will be converted into chain of bytes when "db" is used and into
  306 chain of words with zeroed high byte when "du" is used. For example "db 'abc'"
  307 will define the three bytes of values 61h, 62h and 63h.
  308   The "dp" directive and its synonym "df" accept the values consisting of two
  309 numerical expressions separated with colon, the first value will become the
  310 high word and the second value will become the low double word of the far
  311 pointer value. Also "dd" accepts such pointers consisting of two word values
  312 separated with colon, and "dt" accepts the word and quad word value separated
  313 with colon, the quad word is stored first. The "dt" directive with single
  314 expression as parameter accepts only floating point values and creates data in
  315 FPU double extended precision format.
  316   Any of the above directives allows the usage of special "dup" operator to
  317 make multiple copies of given values. The count of duplicates should precede
  318 this operator and the value to duplicate should follow - it can even be the
  319 chain of values separated with commas, but such set of values needs to be
  320 enclosed in parentheses, like "db 5 dup (1,2)", which defines five copies
  321 of the given two byte sequence.
  322   The "file" is a special directive and its syntax is different. This
  323 directive includes a chain of bytes from file and it should be followed by the
  324 quoted file name, then optionally numerical expression specifying offset in
  325 file preceded by the colon, and - also optionally - comma and numerical
  326 expression specifying count of bytes to include (if no count is specified, all
  327 data up to the end of file is included). For example "file 'data.bin'" will
  328 include the whole file as binary data and "file 'data.bin':10h,4" will include
  329 only four bytes starting at offset 10h.
  330   The data reservation directive should be followed by only one numerical
  331 expression, and this value defines how many cells of the specified size should
  332 be reserved. All data definition directives also accept the "?" value, which
  333 means that this cell should not be initialized to any value and the effect is
  334 the same as by using the data reservation directive. The uninitialized data
  335 may not be included in the output file, so its values should be always
  336 considered unknown.
  337 
  338    Table 1.3  Data directives
  339   /----------------------------\
  340   | Size    | Define | Reserve |
  341   | (bytes) | data   | data    |
  342   |=========|========|=========|
  343   | 1       | db     | rb      |
  344   |         | file   |         |
  345   |---------|--------|---------|
  346   | 2       | dw     | rw      |
  347   |         | du     |         |
  348   |---------|--------|---------|
  349   | 4       | dd     | rd      |
  350   |---------|--------|---------|
  351   | 6       | dp     | rp      |
  352   |         | df     | rf      |
  353   |---------|--------|---------|
  354   | 8       | dq     | rq      |
  355   |---------|--------|---------|
  356   | 10      | dt     | rt      |
  357   \----------------------------/
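
        The directives from table 1.3 can be sketched as follows. All label names
      are hypothetical and serve only to tell the lines apart; the "file"
      directive is left out because it depends on an external file.

          three   db      1,2,3               ; three bytes: 1, 2 and 3
          abc8    db      'abc'               ; bytes 61h, 62h and 63h
          abc16   du      'abc'               ; words 0061h, 0062h and 0063h
          farptr  dp      1234h:56789ABCh     ; high word 1234h, low double word 56789ABCh
          pairs   db      5 dup (1,2)         ; 1,2 repeated five times, ten bytes
          unset   dw      ?                   ; one uninitialized word
          buffer  rb      64                  ; 64 reserved bytes

        The uninitialized data is placed at the end here, where it may be left
      out of the output file.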
  358 
  359 
  360 1.2.3  Constants and labels
  361 
  362 In the numerical expressions you can also use constants or labels instead of
  363 numbers. To define the constant or label you should use the specific
  364 directives. Each label can be defined only once and it is accessible from
  365 any place of source (even before it was defined). Constant can be redefined
  366 many times, but in this case it is accessible only after it was defined, and
  367 is always equal to the value from last definition before the place where it's
  368 used. When a constant is defined only once in source, it is - like the label -
  369 accessible from anywhere.
  370   The definition of constant consists of name of the constant followed by the
  371 "=" character and numerical expression, which after calculation will become
  372 the value of constant. This value is always calculated at the time the
  373 constant is defined. For example you can define "count" constant by using the
  374 directive "count = 17", and then use it in the assembly instructions, like
  375 "mov cx,count" - which will become "mov cx,17" during the compilation process.
  376   There are different ways to define labels. The simplest is to follow the
  377 name of label by the colon, this directive can even be followed by the other
  378 instruction in the same line. It defines the label whose value is equal to
  379 offset of the point where it's defined. This method is usually used to label
  380 the places in code. The other way is to follow the name of label (without a
  381 colon) by some data directive. It defines the label with value equal to
  382 offset of the beginning of defined data, and remembered as a label for data
  383 with cell size as specified for that data directive in table 1.3.
  384   The label can be treated as constant of value equal to offset of labeled
  385 code or data. For example when you define data using the labeled directive
  386 "char db 224", to put the offset of this data into BX register you should use
  387 "mov bx,char" instruction, and to put the value of byte addressed by "char"
  388 label to DL register, you should use "mov dl,[char]" (or "mov dl,ptr char").
  389 But when you try to assemble "mov ax,[char]", it will cause an error, because
  390 fasm compares the sizes of operands, which should be equal. You can force
  391 assembling that instruction by using size override: "mov ax,word [char]", but
  392 remember that this instruction will read the two bytes beginning at "char"
  393 address, while it was defined as a one byte.
  394   The last and the most flexible way to define labels is to use "label"
  395 directive. This directive should be followed by the name of label, then
  396 optionally size operator (it can be preceded by a colon) and then - also
  397 optionally - "at" operator and the numerical expression defining the address at
  398 which this label should be defined. For example "label wchar word at char"
  399 will define a new label for the 16-bit data at the address of "char". Now the
  400 instruction "mov ax,[wchar]" will be after compilation the same as
  401 "mov ax,word [char]". If no address is specified, "label" directive defines
  402 the label at current offset. Thus "mov [wchar],57568" will copy two bytes
  403 while "mov [char],224" will copy one byte to the same address.
  404   The label whose name begins with dot is treated as local label, and its name
  405 is attached to the name of last global label (with name beginning with
  406 anything but dot) to make the full name of this label. So you can use the
  407 short name (beginning with dot) of this label anywhere before the next global
  408 label is defined, and in the other places you have to use the full name. Labels
  409 beginning with two dots are the exception - they are like global, but they
  410 don't become the new prefix for local labels.
  411   The "@@" name means anonymous label, you can have defined many of them in
  412 the source. Symbol "@b" (or equivalent "@r") references the nearest preceding
  413 anonymous label, symbol "@f" references the nearest following anonymous label.
  414 These special symbols are case-insensitive.
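
        The kinds of constants and labels described in this section can be put
      together in one sketch. The names "count", "char" and "wchar" come from the
      text above; "start", ".next" and "..spare" are hypothetical names chosen
      for illustration.

          count = 17                  ; numerical constant
          char    db      224         ; data label with byte-sized cells
          label wchar word at char    ; 16-bit label at the same address

          start:                      ; global label
                  mov     cx,count    ; same as "mov cx,17"
                  mov     bx,char     ; offset of the data
                  mov     dl,[char]   ; byte stored at "char"
                  mov     ax,[wchar]  ; same as "mov ax,word [char]"
            .next:                    ; local label, full name "start.next"
                  dec     cx
                  jnz     .next
          @@:                         ; anonymous label
                  dec     dl
                  jnz     @b          ; nearest preceding "@@"
                  jmp     @f          ; nearest following "@@"
          ..spare:                    ; like global, but no prefix for local labels
                  nop
          @@:
                  ret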
  415 
  416 
  417 1.2.4  Numerical expressions
  418 
  419 In the above examples all the numerical expressions were the simple numbers,
  420 constants or labels. But they can be more complex, by using the arithmetical
  421 or logical operators for calculations at compile time. All these operators
  422 with their priority values are listed in table 1.4. The operations with higher
  423 priority value will be calculated first, you can of course change this
  424 behavior by putting some parts of expression into parentheses. The "+", "-",
  425 "*" and "/" are standard arithmetical operations, "mod" calculates the
  426 remainder from division. The "and", "or", "xor", "shl", "shr", "bsf", "bsr"
  427 and "not" perform the same bit-logical operations as assembly instructions of 
  428 those names. The "rva" and "plt" are special unary operators that perform 
  429 conversions between different kinds of addresses, they can be used only with 
  430 few of the output formats and their meaning may vary (see 2.4).
  431   The arithmetical and bit-logical calculations are processed as if they
  432 operated on infinite precision 2-adic numbers, and assembler signals an
  433 overflow error if because of its limitations it is not able to perform the
  434 required calculation, or if the result is too large a number to fit in either
  435 signed or unsigned range for the destination unit size.
  436   The numbers in the expression are by default treated as a decimal, binary
  437 numbers should have the "b" letter attached at the end, octal number should
  438 end with "o" letter, hexadecimal numbers should begin with "0x" characters
  439 (like in C language) or with the "$" character (like in Pascal language) or
  440 they should end with "h" letter. Also quoted string, when encountered in
  441 expression, will be converted into number - the first character will become
  442 the least significant byte of number.
  443   The numerical expression used as an address value can also contain any of
  444 general registers used for addressing, they can be added and multiplied by
  445 appropriate values, as it is allowed for the x86 architecture instructions.
  446 The numerical calculations inside address definition by default operate with
  447 target size assumed to be the same as the current bitness of code, even if
  448 generated instruction encoding will use a different address size.
  449   There are also some special symbols that can be used inside the numerical
  450 expression. First is "$", which is always equal to the value of current
  451 offset, while "$$" is equal to base address of current addressing space. The
  452 other one is "%", which is the number of current repeat in parts of code that
  453 are repeated using some special directives (see 2.2) and zero anywhere else.
  454 There's also "%t" symbol, which is always equal to the current time stamp.
  455   Any numerical expression can also consist of single floating point value
  456 (flat assembler does not allow any floating point operations at compilation
  457 time) in the scientific notation, they can end with the "f" letter to be
  458 recognized, otherwise they should contain at least one of the "." or "E"
  459 characters. So "1.0", "1E0" and "1f" define the same floating point value,
  460 while simple "1" defines an integer value.
  461 
  462    Table 1.4  Arithmetical and bit-logical operators by priority
  463   /-------------------------\
  464   | Priority | Operators    |
  465   |==========|==============|
  466   | 0        | +  -         |
  467   |----------|--------------|
  468   | 1        | *  /         |
  469   |----------|--------------|
  470   | 2        | mod          |
  471   |----------|--------------|
  472   | 3        | and  or  xor |
  473   |----------|--------------|
  474   | 4        | shl  shr     |
  475   |----------|--------------|
  476   | 5        | not          |
  477   |----------|--------------|
  478   | 6        | bsf  bsr     |
  479   |----------|--------------|
  480   | 7        | rva  plt     |
  481   \-------------------------/
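
        The priorities from table 1.4 and the number formats described above can
      be sketched as follows. All names are hypothetical, and the values in the
      comments are derived from the table, not taken from compiler output.

          prio_mul   = 1 + 2 * 3      ; "*" before "+": 1 + 6 = 7
          prio_mod   = 2 * 7 mod 4    ; "mod" before "*": 2 * 3 = 6
          prio_shl   = 1 or 2 shl 3   ; "shl" before "or": 1 or 16 = 17
          prio_paren = (1 + 2) * 3    ; parentheses first: 3 * 3 = 9

          ten     db      10, 1010b, 12o, 0Ah, 0xA, $0A   ; six bytes, each equal to 10
          pair    dw      'ab'            ; 6261h, "a" is the least significant byte
          one     dd      1.0, 1E0, 1f    ; the same floating point value three times
          span    = $ - $$                ; current offset minus base of addressing space

        Note that "mod", the bit-logical operators and the shifts all bind
      tighter than "*" and "+", which differs from the rules of languages like C.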
  482 
  483 
  484 1.2.5  Jumps and calls
  485 
  486 The operand of any jump or call instruction can be preceded not only by the
  487 size operator, but also by one of the operators specifying type of the jump:
  488 "short", "near" or "far". For example, when assembler is in 16-bit mode,
  489 instruction "jmp dword [0]" will become the far jump and when assembler is
  490 in 32-bit mode, it will become the near jump. To force this instruction to be
  491 treated differently, use the "jmp near dword [0]" or "jmp far dword [0]" form.
  492   When operand of near jump is the immediate value, assembler will generate
  493 the shortest variant of this jump instruction if possible (but will not create
  494 32-bit instruction in 16-bit mode nor 16-bit instruction in 32-bit mode,
  495 unless there is a size operator stating it). By specifying the jump type
  496 you can force it to always generate long variant (for example "jmp near 0")
  497 or to always generate short variant and terminate with an error when it's
  498 impossible (for example "jmp short 0").
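
        The jump types described above can be sketched as follows. The memory
      operands are the ones used in the text; the "target" label is a
      hypothetical name added for illustration.

          use16
                  jmp     dword [0]           ; far jump in 16-bit mode
                  jmp     near dword [0]      ; forced to be a near jump
          use32
                  jmp     dword [0]           ; near jump in 32-bit mode
                  jmp     far dword [0]       ; forced to be a far jump
          target:
                  jmp     target              ; shortest variant that reaches "target"
                  jmp     near target         ; always the long variant
                  jmp     short target        ; short variant, error if out of range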
  499 
  500 
  501 1.2.6  Size settings
  502 
  503 When instruction uses some memory addressing, by default the smallest form of
  504 instruction is generated by using the short displacement whenever the address
  505 value fits in the range. This can be overridden using the "word" or "dword"
  506 operator before the address inside the square brackets (or after the "ptr"
  507 operator), which forces the long displacement of appropriate size to be made.
  508 In case when address is not relative to any registers, those operators also
  509 allow choosing the appropriate mode of absolute addressing.
  510   Instructions "adc", "add", "and", "cmp", "or", "sbb", "sub" and "xor" with
  511 first operand being 16-bit or 32-bit are by default generated in shortened
  512 8-bit form when the second operand is immediate value fitting in the range
  513 for signed 8-bit values. It also can be overridden by putting the "word" or
  514 "dword" operator before the immediate value. Similar rules apply to the
  515 "imul" instruction with the last operand being immediate value.
  516   Immediate value as an operand for "push" instruction without a size operator
  517 is by default treated as a word value if assembler is in 16-bit mode and as a
  518 double word value if assembler is in 32-bit mode. The shorter 8-bit form of
  519 this instruction is used if possible; the "word" or "dword" size operator forces
  520 the "push" instruction to be generated in longer form for specified size. "pushw"
  521 and "pushd" mnemonics force assembler to generate 16-bit or 32-bit code
  522 without forcing it to use the longer form of instruction.
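
        The size settings described in this section can be sketched as follows
      for 32-bit code. The registers and values are arbitrary choices made for
      illustration.

          use32
                  mov     eax,[ebx+4]         ; short displacement by default
                  mov     eax,[dword ebx+4]   ; long 32-bit displacement forced
                  add     eax,4               ; shortened form with 8-bit immediate
                  add     eax,dword 4         ; full 32-bit immediate forced
                  imul    eax,ebx,dword 4     ; the same rule for "imul"
                  push    4                   ; double word push, short form
                  push    dword 4             ; double word push, longer form
                  pushw   4                   ; 16-bit push, short form still allowed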
  523 
  524 
  525 Chapter 2  Instruction set
  526 --------------------------
  527 
  528 This chapter provides the detailed information about the instructions and
  529 directives supported by flat assembler. Directives for defining labels were
  530 already discussed in 1.2.3, all other directives will be described later in
  531 this chapter.
  532 
  533 
  534 2.1  The x86 architecture instructions
  535 
  536 In this section you can find both the information about the syntax and
  537 purpose of the assembly language instructions. If you need more technical
  538 information, look for the Intel Architecture Software Developer's Manual.
  539   Assembly instructions consist of the mnemonic (instruction's name) and from
  540 zero to three operands. If there are two or more operands, usually first is
  541 the destination operand and second is the source operand. Each operand can be
  542 register, memory or immediate value (see 1.2 for details about syntax of
  543 operands). After the description of each instruction there are examples
  544 of different combinations of operands, if the instruction has any.
  545   Some instructions act as prefixes and can be followed by other instruction
  546 in the same line, and there can be more than one prefix in a line. Each name
  547 of the segment register is also a mnemonic of instruction prefix, although it
  548 is recommended to use segment overrides inside the square brackets instead of
  549 these prefixes.
  550 
  551 
  552 2.1.1  Data movement instructions
  553 
  554 "mov" transfers a byte, word or double word from the source operand to the
  555 destination operand. It can transfer data between general registers, from
  556 the general register to memory, or from memory to general register, but it
  557 cannot move from memory to memory. It can also transfer an immediate value to
  558 general register or memory, segment register to general register or memory,
  559 general register or memory to segment register, control or debug register to
  560 general register and general register to control or debug register. The "mov"
  561 can be assembled only if the size of source operand and size of destination
  562 operand are the same. Below are the examples for each of the allowed
  563 combinations:
  564 
  565     mov bx,ax       ; general register to general register
  566     mov [char],al   ; general register to memory
  567     mov bl,[char]   ; memory to general register
  568     mov dl,32       ; immediate value to general register
  569     mov [char],32   ; immediate value to memory
  570     mov ax,ds       ; segment register to general register
  571     mov [bx],ds     ; segment register to memory
  572     mov ds,ax       ; general register to segment register
  573     mov ds,[bx]     ; memory to segment register
  574     mov eax,cr0     ; control register to general register
  575     mov cr3,ebx     ; general register to control register
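
        The size rule also applies when neither operand implies a size by
      itself. A minimal sketch, assuming BX holds the address of a byte
      variable, in which a size operator supplies the missing size:

          mov byte [bx],32 ; size operator states the size of memory operand
          mov [bx],al      ; no operator needed, size is taken from al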
  576 
  577   "xchg" swaps the contents of two operands. It can swap two byte operands,
  578 two word operands or two double word operands. Order of operands is not
  579 important. The operands may be two general registers, or general register
  580 with memory. For example:
  581 
  582     xchg ax,bx      ; swap two general registers
  583     xchg al,[char]  ; swap register with memory
  584 
  585   "push" decrements the stack pointer (ESP register), then transfers
  586 the operand to the top of stack indicated by ESP. The operand can be memory,
  587 general register, segment register or immediate value of word or double word
  588 size. If operand is an immediate value and no size is specified, it is by
  589 default treated as a word value if assembler is in 16-bit mode and as a double
  590 word value if assembler is in 32-bit mode. "pushw" and "pushd" mnemonics are
  591 variants of this instruction that store the values of word or double word size
  592 respectively. If more operands follow in the same line (separated only with
  593 spaces, not commas), compiler will assemble chain of the "push" instructions
  594 with these operands. The examples are with single operands:
  595 
  596     push ax         ; store general register
  597     push es         ; store segment register
  598     pushw [bx]      ; store memory
  599     push 1000h      ; store immediate value
  600 
  601   "pusha" saves the contents of the eight general registers on the stack.
  602 This instruction has no operands. There are two versions of this instruction,
  603 one 16-bit and one 32-bit, assembler automatically generates the appropriate
  604 version for current mode, but it can be overridden by using "pushaw" or
  605 "pushad" mnemonic to always get the 16-bit or 32-bit version. The 16-bit
  606 version of this instruction pushes general registers on the stack in the
  607 following order: AX, CX, DX, BX, the initial value of SP before AX was pushed,
  608 BP, SI and DI. The 32-bit version pushes equivalent 32-bit general registers
  609 in the same order.
  610   "pop" transfers the word or double word at the current top of stack to the
  611 destination operand, and then increments ESP to point to the new top of stack.
  612 The operand can be memory, general register or segment register. "popw" and
  613 "popd" mnemonics are variants of this instruction for restoring the values of
  614 word or double word size respectively. If more operands separated with spaces
  615 follow in the same line, compiler will assemble chain of the "pop"
  616 instructions with these operands.
  617 
  618     pop bx          ; restore general register
  619     pop ds          ; restore segment register
  620     popw [si]       ; restore memory
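
        As a minimal sketch of the chained form described above (the choice of
      registers is arbitrary), a single line can save several registers and
      another can restore them. Because the stack is last-in first-out, the
      operands of "pop" are listed in reverse order:

          push ax bx cx   ; assembled as push ax, push bx, push cx
          pop cx bx ax    ; assembled as pop cx, pop bx, pop ax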
  621 
  622   "popa" restores the registers saved on the stack by "pusha" instruction,
  623 except for the saved value of SP (or ESP), which is ignored. This instruction
  624 has no operands. To force assembling 16-bit or 32-bit version of this
  625 instruction use "popaw" or "popad" mnemonic.
  626 
  627 
  628 2.1.2  Type conversion instructions
  629 
  630 The type conversion instructions convert bytes into words, words into double
  631 words, and double words into quad words. These conversions can be done using
  632 the sign extension or zero extension. The sign extension fills the extra bits
  633 of the larger item with the value of the sign bit of the smaller item, the
  634 zero extension simply fills them with zeros.
  635   "cwd" and "cdq" double the size of the value in AX or EAX respectively
  636 and store the extra bits into the DX or EDX register. The conversion is done
  637 using the sign extension. These instructions have no operands.
  638   "cbw" extends the sign of the byte in AL throughout AX, and "cwde" extends
  639 the sign of the word in AX throughout EAX. These instructions also have no
  640 operands.
  641   "movsx" converts a byte to word or double word and a word to double word
  642 using the sign extension. "movzx" does the same, but it uses the zero
  643 extension. The source operand can be general register or memory, while the
  644 destination operand must be a general register. For example:
  645 
  646     movsx ax,al         ; byte register to word register
  647     movsx edx,dl        ; byte register to double word register
  648     movsx eax,ax        ; word register to double word register
  649     movsx ax,byte [bx]  ; byte memory to word register
  650     movsx edx,byte [bx] ; byte memory to double word register
  651     movsx eax,word [bx] ; word memory to double word register
  652 
  653 
  654 2.1.3  Binary arithmetic instructions
  655 
  656 "add" replaces the destination operand with the sum of the source and
  657 destination operands and sets CF if a carry (unsigned overflow) has occurred.
  658 The operands may be bytes, words or double words. The destination operand can
  659 be general register or memory, the source operand can be general register or
  660 immediate value, it can also be memory if the destination operand is register.
  661 
  662     add ax,bx       ; add register to register
  663     add ax,[si]     ; add memory to register
  664     add [di],al     ; add register to memory
  665     add al,48       ; add immediate value to register
  666     add [char],48   ; add immediate value to memory
  667 
  668   "adc" sums the operands, adds one if CF is set, and replaces the destination
  669 operand with the result. Rules for the operands are the same as for the "add"
  670 instruction. An "add" followed by multiple "adc" instructions can be used to
  671 add numbers longer than 32 bits.
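        For instance, a 64-bit addition can be sketched as follows, assuming
      (for illustration only) that one number is held in EDX:EAX and the
      other in ECX:EBX:

          add eax,ebx     ; add low double words, CF receives the carry
          adc edx,ecx     ; add high double words plus the carry

        The sum is left in EDX:EAX.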
  672   "inc" adds one to the operand, it does not affect CF. The operand can be a
  673 general register or memory, and the size of the operand can be byte, word or
  674 double word.
  675 
  676     inc ax          ; increment register by one
  677     inc byte [bx]   ; increment memory by one
  678 
  679   "sub" subtracts the source operand from the destination operand and replaces
  680 the destination operand with the result. If a borrow is required, the CF is
  681 set. Rules for the operands are the same as for the "add" instruction.
  682   "sbb" subtracts the source operand from the destination operand, subtracts
  683 one if CF is set, and stores the result to the destination operand. Rules for
  684 the operands are the same as for the "add" instruction. A "sub" followed by
  685 multiple "sbb" instructions may be used to subtract numbers longer than 32
  686 bits.
  687   "dec" subtracts one from the operand, it does not affect CF. Rules for the
  688 operand are the same as for the "inc" instruction.
  689   "cmp" subtracts the source operand from the destination operand. It updates
  690 the flags as the "sub" instruction, but does not alter the source and
  691 destination operands. Rules for the operands are the same as for the "sub"
  692 instruction.
  693   "neg" subtracts a signed integer operand from zero. The effect of this
  694 instruction is to reverse the sign of the operand from positive to negative or
  695 from negative to positive. Rules for the operand are the same as for the "inc"
  696 instruction.
  697   "xadd" exchanges the destination operand with the source operand, then loads
  698 the sum of the two values into the destination operand. The destination operand
  699 may be a general register or memory, the source operand must be a general
  700 register.
  701   All the above binary arithmetic instructions update SF, ZF, PF and OF flags.
  702 SF is always set to the same value as the result's sign bit, ZF is set when
  703 all the bits of result are zero, PF is set when low order eight bits of result
  704 contain an even number of set bits, OF is set if result is too large for a
  705 positive number or too small for a negative number (excluding sign bit) to fit
  706 in destination operand.
  707   "mul" performs an unsigned multiplication of the operand and the
  708 accumulator. If the operand is a byte, the processor multiplies it by the
  709 contents of AL and returns the 16-bit result to AH and AL. If the operand is a
  710 word, the processor multiplies it by the contents of AX and returns the 32-bit
  711 result to DX and AX. If the operand is a double word, the processor multiplies
  712 it by the contents of EAX and returns the 64-bit result in EDX and EAX. "mul"
  713 sets CF and OF when the upper half of the result is nonzero, otherwise they
  714 are cleared. Rules for the operand are the same as for the "inc" instruction.
  715   "imul" performs a signed multiplication operation. This instruction has
  716 three variations. First has one operand and behaves in the same way as the
  717 "mul" instruction. Second has two operands, in this case destination operand
  718 is multiplied by the source operand and the result replaces the destination
  719 operand. Destination operand must be a general register, it can be word or
  720 double word, source operand can be general register, memory or immediate
  721 value. Third form has three operands, the destination operand must be a
  722 general register, word or double word in size, source operand can be general
  723 register or memory, and third operand must be an immediate value. The source
  724 operand is multiplied by the immediate value and the result is stored in the
  725 destination register. All the three forms calculate the product to twice the
  726 size of operands and set CF and OF when the upper half is not the sign
  727 extension of the lower half, but second and third forms truncate the product
  728 to the size of operands. So second and third forms can be also used for
  729 unsigned operands because, whether the operands are signed or unsigned, the
  730 lower half of the product is the same. Below are examples for all three forms:
  731 
  732     imul bl         ; accumulator by register
  733     imul word [si]  ; accumulator by memory
  734     imul bx,cx      ; register by register
  735     imul bx,[si]    ; register by memory
  736     imul bx,10      ; register by immediate value
  737     imul ax,bx,10   ; register by immediate value to register
  738     imul ax,[si],10 ; memory by immediate value to register
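
        The truncation performed by the second and third forms can be sketched
      with values chosen only for illustration:

          mov ax,300
          mov bx,300
          imul ax,bx      ; product 90000 does not fit, AX = 5F90h, CF = OF = 1
          mov ax,300
          mul bx          ; full product in DX:AX = 0001h:5F90h, CF = OF = 1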
  739 
  740   "div" performs an unsigned division of the accumulator by the operand.
  741 The dividend (the accumulator) is twice the size of the divisor (the operand),
  742 the quotient and remainder have the same size as the divisor. If divisor is
  743 byte, the dividend is taken from AX register, the quotient is stored in AL and
  744 the remainder is stored in AH. If divisor is word, the upper half of dividend
  745 is taken from DX, the lower half of dividend is taken from AX, the quotient is
  746 stored in AX and the remainder is stored in DX. If divisor is double word,
  747 the upper half of dividend is taken from EDX, the lower half of dividend is
  748 taken from EAX, the quotient is stored in EAX and the remainder is stored in
  749 EDX. Rules for the operand are the same as for the "mul" instruction.
  750   "idiv" performs a signed division of the accumulator by the operand.
  751 It uses the same registers as the "div" instruction, and the rules for
  752 the operand are the same.
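        The dividend has to fill the whole register pair before the division.
      A minimal sketch for 16-bit divisors, assuming the value to divide is in
      AX and the divisor in BX (the two fragments are independent
      alternatives, not a sequence):

          xor dx,dx       ; unsigned: zero-extend AX into DX:AX
          div bx          ; AX = quotient, DX = remainder

          cwd             ; signed: sign-extend AX into DX:AX
          idiv bx         ; AX = quotient, DX = remainder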
  753 
  754 
  755 2.1.4  Decimal arithmetic instructions
  756 
  757 Decimal arithmetic is performed by combining the binary arithmetic
  758 instructions (already described in the prior section) with the decimal
  759 arithmetic instructions. The decimal arithmetic instructions are used to
  760 adjust the results of a previous binary arithmetic operation to produce a
  761 valid packed or unpacked decimal result, or to adjust the inputs to a
  762 subsequent binary arithmetic operation so the operation will produce a valid
  763 packed or unpacked decimal result.
  764   "daa" adjusts the result of adding two valid packed decimal operands in
  765 AL. "daa" must always follow the addition of two pairs of packed decimal
  766 numbers (one digit in each half-byte) to obtain a pair of valid packed
  767 decimal digits as results. The carry flag is set if carry was needed.
  768 This instruction has no operands.
  769   "das" adjusts the result of subtracting two valid packed decimal operands
  770 in AL. "das" must always follow the subtraction of one pair of packed decimal
  771 numbers (one digit in each half-byte) from another to obtain a pair of valid
  772 packed decimal digits as results. The carry flag is set if a borrow was
  773 needed. This instruction has no operands.
  774   "aaa" changes the contents of register AL to a valid unpacked decimal
  775 number, and zeroes the top four bits. "aaa" must always follow the addition
  776 of two unpacked decimal operands in AL. The carry flag is set and AH is
  777 incremented if a carry is necessary. This instruction has no operands.
  778   "aas" changes the contents of register AL to a valid unpacked decimal
  779 number, and zeroes the top four bits. "aas" must always follow the
  780 subtraction of one unpacked decimal operand from another in AL. The carry flag
  781 is set and AH decremented if a borrow is necessary. This instruction has no
  782 operands.
  783   "aam" corrects the result of a multiplication of two valid unpacked decimal
  784 numbers. "aam" must always follow the multiplication of two decimal numbers
  785 to produce a valid decimal result. The high order digit is left in AH, the
  786 low order digit in AL. The generalized version of this instruction allows
  787 adjustment of the contents of the AX to create two unpacked digits of any
  788 number base. The standard version of this instruction has no operands, the
  789 generalized version has one operand - an immediate value specifying the
  790 number base for the created digits.
  791   "aad" modifies the numerator in AH and AL to prepare for the division of two
  792 valid unpacked decimal operands so that the quotient produced by the division
  793 will be a valid unpacked decimal number. AH should contain the high order
  794 digit and AL the low order digit. This instruction adjusts the value and
  795 places the result in AL, while AH will contain zero. The generalized version
  796 of this instruction allows adjustment of two unpacked digits of any number
  797 base. Rules for the operand are the same as for the "aam" instruction.
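        Some of these adjustments can be sketched with concrete values (the
      numbers are chosen only for illustration):

          mov al,38h      ; packed decimal 38
          add al,45h      ; binary sum is 7Dh
          daa             ; AL = 83h, packed decimal 83

          mov al,7        ; unpacked decimal digit 7
          mov bl,9        ; unpacked decimal digit 9
          mul bl          ; AX = 63 (3Fh)
          aam             ; AH = 6, AL = 3
          aad             ; AL = 63 (3Fh), AH = 0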
  798 
  799 
  800 2.1.5  Logical instructions
  801 
  802 "not" inverts the bits in the specified operand to form a one's complement
  803 of the operand. It has no effect on the flags. Rules for the operand are the
  804 same as for the "inc" instruction.
  805   "and", "or" and "xor" instructions perform the standard logical operations.
  806 They update the SF, ZF and PF flags and clear CF and OF. Rules for the operands
  807 are the same as for the "add" instruction.
  808   "bt", "bts", "btr" and "btc" instructions operate on a single bit which can
  809 be in memory or in a general register. The location of the bit is specified
  810 as an offset from the low order end of the operand. The value of the offset
  811 is taken from the second operand, which may be either an immediate byte or
  812 a general register. These instructions first assign the value of the selected
  813 bit to CF. "bt" instruction does nothing more, "bts" sets the selected bit to
  814 1, "btr" resets the selected bit to 0, "btc" changes the bit to its
  815 complement. The first operand can be word or double word.
  816 
  817     bt  ax,15        ; test bit in register
  818     bts word [bx],15 ; test and set bit in memory
  819     btr ax,cx        ; test and reset bit in register
  820     btc word [bx],cx ; test and complement bit in memory
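
        The effect on CF and on the operand can be sketched with a value chosen
      only for illustration:

          mov ax,8         ; only bit 3 is set
          bt  ax,3         ; CF = 1, AX unchanged
          btr ax,3         ; CF = 1, AX = 0
          btc ax,0         ; CF = 0, AX = 1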
  821 
  822   "bsf" and "bsr" instructions scan a word or double word for first set bit
  823 and store the index of this bit into destination operand, which must be
  824 general register. The bit string being scanned is specified by source operand,
  825 it may be either general register or memory. The ZF flag is set if the entire
  826 string is zero (no set bits are found); otherwise it is cleared. If no set bit
  827 is found, the value of the destination register is undefined. "bsf" scans from
  828 low order to high order (starting from bit index zero). "bsr" scans from high
  829 order to low order (starting from bit index 15 of a word or index 31 of a
  830 double word).
  831 
  832     bsf ax,bx        ; scan register forward
  833     bsr ax,[si]      ; scan memory reverse
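
        The two scanning directions can be sketched with a value chosen only
      for illustration:

          mov ax,28h       ; 0000000000101000b, bits 3 and 5 are set
          bsf cx,ax        ; CX = 3, ZF = 0
          bsr dx,ax        ; DX = 5, ZF = 0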
  834 
  835   "shl" shifts the destination operand left by the number of bits specified
  836 in the second operand. The destination operand can be byte, word, or double
  837 word general register or memory. The second operand can be an immediate value
  838 or the CL register. The processor shifts zeros in from the right (low order)
  839 side of the operand as bits exit from the left side. The last bit that exited
  840 is stored in CF. "sal" is a synonym for "shl".
  841 
  842     shl al,1         ; shift register left by one bit
  843     shl byte [bx],1  ; shift memory left by one bit
  844     shl ax,cl        ; shift register left by count from cl
  845     shl word [bx],cl ; shift memory left by count from cl
  846 
  847   "shr" and "sar" shift the destination operand right by the number of bits
  848 specified in the second operand. Rules for operands are the same as for the
  849 "shl" instruction. "shr" shifts zeros in from the left side of the operand as
  850 bits exit from the right side. The last bit that exited is stored in CF.
  851 "sar" preserves the sign of the operand by shifting in zeros on the left side
  852 if the value is positive or by shifting in ones if the value is negative.
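        The difference between the two right shifts can be sketched with a
      value chosen only for illustration:

          mov al,0F8h     ; -8 as signed byte, 248 as unsigned byte
          sar al,1        ; AL = 0FCh (-4), sign preserved
          mov al,0F8h
          shr al,1        ; AL = 7Ch (124), zero shifted in

        In both cases the bit shifted out (here zero) is stored in CF.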
  853   "shld" shifts bits of the destination operand to the left by the number
  854 of bits specified in third operand, while shifting high order bits from the
  855 source operand into the destination operand on the right. The source operand
  856 remains unmodified. The destination operand can be a word or double word
  857 general register or memory, the source operand must be a general register,
  858 third operand can be an immediate value or the CL register.
  859 
  860     shld ax,bx,1     ; shift register left by one bit
  861     shld [di],bx,1   ; shift memory left by one bit
  862     shld ax,bx,cl    ; shift register left by count from cl
  863     shld [di],bx,cl  ; shift memory left by count from cl
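
        One common use, sketched here under the assumption that a 64-bit value
      is held in EDX:EAX, is shifting a number longer than a single register.
      The "shld" has to come first, while EAX still holds its original bits:

          shld edx,eax,4   ; high part receives the four top bits of EAX
          shl eax,4        ; low part is then shifted separately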
  864 
  865   "shrd" shifts bits of the destination operand to the right, while shifting
  866 low order bits from the source operand into the destination operand on the
  867 left. The source operand remains unmodified. Rules for operands are the same
  868 as for the "shld" instruction.
  869   "rol" and "rcl" rotate the byte, word or double word destination operand
  870 left by the number of bits specified in the second operand. For each rotation
  871 specified, the high order bit that exits from the left of the operand returns
  872 at the right to become the new low order bit. "rcl" additionally puts in CF
  873 each high order bit that exits from the left side of the operand before it
  874 returns to the operand as the low order bit on the next rotation cycle. Rules
  875 for operands are the same as for the "shl" instruction.
  876   "ror" and "rcr" rotate the byte, word or double word destination operand
  877 right by the number of bits specified in the second operand. For each rotation
  878 specified, the low order bit that exits from the right of the operand returns
  879 at the left to become the new high order bit. "rcr" additionally puts in CF
  880 each low order bit that exits from the right side of the operand before it
  881 returns to the operand as the high order bit on the next rotation cycle.
  882 Rules for operands are the same as for the "shl" instruction.
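        Because "rcl" and "rcr" pass bits through CF, they can continue a shift
      from one register into another. A minimal sketch, assuming a 32-bit
      value held in DX:AX:

          shr dx,1        ; bit 0 of DX goes to CF
          rcr ax,1        ; CF enters AX as bit 15

        The pair shifts the whole value right by one bit.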
  883   "test" performs the same action as the "and" instruction, but it does not
  884 alter the destination operand, only updates flags. Rules for the operands are
  885 the same as for the "and" instruction.
  886   "bswap" reverses the byte order of a 32-bit general register: bits 0 through
  887 7 are swapped with bits 24 through 31, and bits 8 through 15 are swapped with
  888 bits 16 through 23. This instruction is provided for converting little-endian
  889 values to big-endian format and vice versa.
  890 
  891     bswap edx        ; swap bytes in register
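
        For example, with a value chosen only for illustration:

          mov edx,12345678h
          bswap edx        ; EDX = 78563412h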
  892 
  893 
  894 2.1.6  Control transfer instructions
  895 
  896 "jmp" unconditionally transfers control to the target location. The
  897 destination address can be specified directly within the instruction or
  898 indirectly through a register or memory, the acceptable size of this address
  899 depends on whether the jump is near or far (it can be specified by preceding
  900 the operand with "near" or "far" operator) and whether the instruction is
  901 16-bit or 32-bit. Operand for near jump should be "word" size for 16-bit
  902 instruction or the "dword" size for 32-bit instruction. Operand for far jump
  903 should be "dword" size for 16-bit instruction or "pword" size for 32-bit
  904 instruction. A direct "jmp" instruction includes the destination address as
  905 part of the instruction (and can be preceded by "short", "near" or "far"
  906 operator), the operand specifying address should be the numerical expression
  907 for near or short jump, or two numerical expressions separated with colon for
  908 far jump, the first specifies selector of segment, the second is the offset
  909 within segment. The "pword" operator can be used to force the 32-bit far jump,
  910 and "dword" to force the 16-bit far jump. An indirect "jmp" instruction
  911 obtains the destination address indirectly through a register or a pointer
  912 variable, the operand should be general register or memory. See also 1.2.5 for
  913 some more details.
  914 
  915     jmp 100h         ; direct near jump
  916     jmp 0FFFFh:0     ; direct far jump
  917     jmp ax           ; indirect near jump
  918     jmp pword [ebx]  ; indirect far jump
  919 
  920   "call" transfers control to the procedure, saving on the stack the address
  921 of the instruction following the "call" for later use by a "ret" (return)
  922 instruction. Rules for the operands are the same as for the "jmp" instruction,
  923 but the "call" has no short variant of direct instruction and thus it is not
  924 optimized.
  925   "ret", "retn" and "retf" instructions terminate the execution of a procedure
  926 and transfer control back to the program that originally invoked the
  927 procedure using the address that was stored on the stack by the "call"
  928 instruction. "ret" is the equivalent for "retn", which returns from the
  929 procedure that was executed using the near call, while "retf" returns from
  930 the procedure that was executed using the far call. These instructions default
  931 to the size of address appropriate for the current code setting, but the size
  932 of address can be forced to 16-bit by using the "retw", "retnw" and "retfw"
  933 mnemonics, and to 32-bit by using the "retd", "retnd" and "retfd" mnemonics.
  934 All these instructions may optionally specify an immediate operand, by adding
  935 this constant to the stack pointer, they effectively remove any arguments that
  936 the calling program pushed on the stack before the execution of the "call"
  937 instruction.
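        A minimal sketch of a near procedure in 16-bit code that removes its own
      arguments on return ("sum" is a hypothetical procedure name, and the
      argument values are arbitrary):

          push 2          ; second argument
          push 3          ; first argument
          call sum        ; stores the return address, AX = 5 on return
          ; the rest of the calling code would follow here

        sum:
          push bp
          mov bp,sp
          mov ax,[bp+4]   ; first argument, above saved BP and return address
          add ax,[bp+6]   ; second argument
          pop bp
          ret 4           ; return and release the two word arguments

        Without the immediate operand, the caller would have to adjust the
      stack pointer itself after the call.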
  938   "iret" returns control to an interrupted procedure. It differs from "ret" in
  939 that it also pops the flags from the stack into the flags register. The flags
  940 are stored on the stack by the interrupt mechanism. It defaults to the size of
  941 return address appropriate for the current code setting, but it can be forced
  942 to use 16-bit or 32-bit address by using the "iretw" or "iretd" mnemonic.
  943   The conditional transfer instructions are jumps that may or may not transfer
  944 control, depending on the state of the CPU flags when the instruction
  945 executes. The mnemonics for conditional jumps may be obtained by attaching
  946 the condition mnemonic (see table 2.1) to the "j" mnemonic,
  947 for example "jc" instruction will transfer the control when the CF flag is
  948 set. The conditional jumps can be short or near, and direct only, and can be
  949 optimized (see 1.2.5), the operand should be an immediate value specifying
  950 target address.
  951 
  952    Table 2.1  Conditions
  953   /-----------------------------------------------------------\
  954   | Mnemonic | Condition tested      | Description            |
  955   |==========|=======================|========================|
  956   | o        | OF = 1                | overflow               |
  957   |----------|-----------------------|------------------------|
  958   | no       | OF = 0                | not overflow           |
  959   |----------|-----------------------|------------------------|
  960   | c        |                       | carry                  |
  961   | b        | CF = 1                | below                  |
  962   | nae      |                       | not above nor equal    |
  963   |----------|-----------------------|------------------------|
  964   | nc       |                       | not carry              |
  965   | ae       | CF = 0                | above or equal         |
  966   | nb       |                       | not below              |
  967   |----------|-----------------------|------------------------|
  968   | e        | ZF = 1                | equal                  |
  969   | z        |                       | zero                   |
  970   |----------|-----------------------|------------------------|
  971   | ne       | ZF = 0                | not equal              |
  972   | nz       |                       | not zero               |
  973   |----------|-----------------------|------------------------|
  974   | be       | CF or ZF = 1          | below or equal         |
  975   | na       |                       | not above              |
  976   |----------|-----------------------|------------------------|
  977   | a        | CF or ZF = 0          | above                  |
  978   | nbe      |                       | not below nor equal    |
  979   |----------|-----------------------|------------------------|
  980   | s        | SF = 1                | sign                   |
  981   |----------|-----------------------|------------------------|
  982   | ns       | SF = 0                | not sign               |
  983   |----------|-----------------------|------------------------|
  984   | p        | PF = 1                | parity                 |
  985   | pe       |                       | parity even            |
  986   |----------|-----------------------|------------------------|
  987   | np       | PF = 0                | not parity             |
  988   | po       |                       | parity odd             |
  989   |----------|-----------------------|------------------------|
  990   | l        | SF xor OF = 1         | less                   |
  991   | nge      |                       | not greater nor equal  |
  992   |----------|-----------------------|------------------------|
  993   | ge       | SF xor OF = 0         | greater or equal       |
  994   | nl       |                       | not less               |
  995   |----------|-----------------------|------------------------|
  996   | le       | (SF xor OF) or ZF = 1 | less or equal          |
  997   | ng       |                       | not greater            |
  998   |----------|-----------------------|------------------------|
  999   | g        | (SF xor OF) or ZF = 0 | greater                |
 1000   | nle      |                       | not less nor equal     |
 1001   \-----------------------------------------------------------/
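
        The same comparison can give different outcomes depending on whether
      the unsigned ("a", "b") or the signed ("g", "l") conditions are used. A
      minimal sketch, in which "signed_greater" and "unsigned_above" are
      hypothetical labels defined elsewhere:

          mov al,0FFh         ; 255 as unsigned, -1 as signed
          cmp al,1
          jg signed_greater   ; not taken: -1 is not greater than 1
          ja unsigned_above   ; taken: 255 is above 1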
 1002 
 1003   The "loop" instructions are conditional jumps that use a value placed in
 1004 CX (or ECX) to specify the number of repetitions of a software loop. All
 1005 "loop" instructions automatically decrement CX (or ECX) and terminate the
 1006 loop (don't transfer the control) when CX (or ECX) is zero. It uses CX or ECX
 1007 depending on whether the current code setting is 16-bit or 32-bit, but it can
 1008 be forced to use CX with the "loopw" mnemonic or ECX with the "loopd" mnemonic.
 1009 "loope" and "loopz" are the synonyms for the same instruction, which acts as
 1010 the standard "loop", but also terminates the loop when ZF flag is not set.
 1011 "loopew" and "loopzw" mnemonics force them to use CX register while "looped"
 1012 and "loopzd" force them to use ECX register. "loopne" and "loopnz" are the
 1013 synonyms for the same instruction, which acts as the standard "loop", but
 1014 also terminates the loop when ZF flag is set. "loopnew" and "loopnzw"
 1015 mnemonics force them to use CX register while "loopned" and "loopnzd" force
 1016 them to use ECX register. Every "loop" instruction needs an operand being an
 1017 immediate value specifying target address, it can be only short jump (in the
 1018 range of 128 bytes back and 127 bytes forward from the address of instruction
 1019 following the "loop" instruction).
 1020   "jcxz" branches to the label specified in the instruction if it finds a
 1021 value of zero in CX, "jecxz" does the same, but checks the value of ECX
 1022 instead of CX. Rules for the operands are the same as for the "loop"
 1023 instruction.
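        A minimal sketch of a counted loop in 16-bit code that sums the numbers
      from 10 down to 1 ("next" and "done" are hypothetical labels). The
      "jcxz" guard is there because "loop" decrements before testing, so
      entering the loop with zero in CX would repeat it 65536 times:

          mov cx,10
          xor ax,ax
          jcxz done       ; skip the loop if the count is zero
        next:
          add ax,cx
          loop next       ; decrement CX, jump while it is not zero
        done:

        On exit AX holds 55 and CX is zero.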
 1024   "int" activates the interrupt service routine that corresponds to the
 1025 number specified as an operand to the instruction, the number should be in
 1026 range from 0 to 255. The interrupt service routine terminates with an "iret"
 1027 instruction that returns control to the instruction that follows "int".
 1028 "int3" mnemonic codes the short (one byte) trap that invokes the interrupt 3.
 1029 "into" instruction invokes the interrupt 4 if the OF flag is set.
 1030   "bound" verifies that the signed value contained in the specified register
 1031 lies within specified limits. An interrupt 5 occurs if the value contained in
 1032 the register is less than the lower bound or greater than the upper bound. It
 1033 needs two operands, the first operand specifies the register being tested,
 1034 the second operand should be memory address for the two signed limit values.
 1035 The operands can be "word" or "dword" in size.
 1036 
 1037     bound ax,[bx]    ; check word for bounds
 1038     bound eax,[esi]  ; check double word for bounds
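
        The two limits are stored one after another in memory, the lower one
      first. A minimal sketch, in which "limits" is a hypothetical label placed
      in a data area and the range 0 to 99 is an arbitrary choice:

          bound ax,[limits] ; interrupt 5 unless AX is in range 0 to 99

          limits dw 0,99    ; lower limit followed by upper limit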
 1039 
 1040 
 1041 2.1.7  I/O instructions
 1042 
 1043   "in" transfers a byte, word, or double word from an input port to AL, AX,
 1044 or EAX. I/O ports can be addressed either directly, with the immediate byte
 1045 value coded in instruction, or indirectly via the DX register. The destination
 1046 operand should be AL, AX, or EAX register. The source operand should be an
 1047 immediate value in range from 0 to 255, or DX register.
 1048 
 1049     in al,20h        ; input byte from port 20h
 1050     in ax,dx         ; input word from port addressed by dx
 1051 
 1052   "out" transfers a byte, word, or double word to an output port from AL, AX,
 1053 or EAX. The program can specify the number of the port using the same methods
 1054 as the "in" instruction. The destination operand should be an immediate value
 1055 in range from 0 to 255, or DX register. The source operand should be AL, AX,
 1056 or EAX register.
 1057 
 1058     out 20h,ax       ; output word to port 20h
 1059     out dx,al        ; output byte to port addressed by dx
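
        A minimal sketch of both addressing methods follows, the port numbers
      are arbitrary values chosen for illustration.

          mov dx,3F8h      ; port numbers above 255 have to be placed in dx
          in al,dx         ; input byte from port 3F8h
          out 80h,al       ; port numbers up to 255 can be coded directly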
 1060 
 1061 
 1062 2.1.8  Strings operations
 1063 
 1064 The string operations operate on one element of a string. A string element
 1065 may be a byte, a word, or a double word. The string elements are addressed by
 1066 SI and DI (or ESI and EDI) registers. After every string operation SI and/or
 1067 DI (or ESI and/or EDI) are automatically updated to point to the next element
 1068 of the string. If DF (direction flag) is zero, the index registers are
 1069 incremented, if DF is one, they are decremented. The amount of the increment
 1070 or decrement is 1, 2, or 4 depending on the size of the string element. Every
 1071 string operation instruction has short forms which have no operands and use
 1072 SI and/or DI when the code type is 16-bit, and ESI and/or EDI when the code
 1073 type is 32-bit. SI and ESI by default address data in the segment selected
 1074 by DS, DI and EDI always address data in the segment selected by ES. Short
 1075 form is obtained by attaching to the mnemonic of string operation letter
 1076 specifying the size of string element, it should be "b" for byte element,
 1077 "w" for word element, and "d" for double word element. Full form of string
 1078 operation needs operands providing the size operator and the memory addresses,
 1079 which can be SI or ESI with any segment prefix, DI or EDI always with ES
 1080 segment prefix.
 1081   "movs" transfers the string element pointed to by SI (or ESI) to the
 1082 location pointed to by DI (or EDI). Size of operands can be byte, word, or
 1083 double word. The destination operand should be memory addressed by DI or EDI,
 1084 the source operand should be memory addressed by SI or ESI with any segment
 1085 prefix.
 1086 
 1087     movs byte [di],[si]        ; transfer byte
 1088     movs word [es:di],[ss:si]  ; transfer word
 1089     movsd                      ; transfer double word
 1090 
 1091   "cmps" subtracts the destination string element from the source string
 1092 element and updates the flags AF, SF, PF, CF and OF, but it does not change
 1093 any of the compared elements. If the string elements are equal, ZF is set,
 1094 otherwise it is cleared. The first operand for this instruction should be the
 1095 source string element addressed by SI or ESI with any segment prefix, the
 1096 second operand should be the destination string element addressed by DI or
 1097 EDI.
 1098 
 1099     cmpsb                      ; compare bytes
 1100     cmps word [ds:si],[es:di]  ; compare words
 1101     cmps dword [fs:esi],[edi]  ; compare double words
 1102 
 1103   "scas" subtracts the destination string element from AL, AX, or EAX
 1104 (depending on the size of string element) and updates the flags AF, SF, ZF,
 1105 PF, CF and OF. If the values are equal, ZF is set, otherwise it is cleared.
 1106 The operand should be the destination string element addressed by DI or EDI.
 1107 
 1108     scas byte [es:di]          ; scan byte
 1109     scasw                      ; scan word
 1110     scas dword [es:edi]        ; scan double word
 1111 
 1112   "stos" places the value of AL, AX, or EAX into the destination string
 1113 element. Rules for the operand are the same as for the "scas" instruction.
 1114   "lods" places the source string element into AL, AX, or EAX. The operand
 1115 should be the source string element addressed by SI or ESI with any segment
 1116 prefix.
 1117 
 1118     lods byte [ds:si]          ; load byte
 1119     lods word [cs:si]          ; load word
 1120     lodsd                      ; load double word
 1121 
 1122   "ins" transfers a byte, word, or double word from an input port addressed
 1123 by DX register to the destination string element. The destination operand
 1124 should be memory addressed by DI or EDI, the source operand should be the DX
 1125 register.
 1126 
 1127     insb                       ; input byte
 1128     ins word [es:di],dx        ; input word
 1129     ins dword [edi],dx         ; input double word
 1130 
 1131   "outs" transfers the source string element to an output port addressed by
 1132 DX register. The destination operand should be the DX register and the source
 1133 operand should be memory addressed by SI or ESI with any segment prefix.
 1134 
 1135     outs dx,byte [si]          ; output byte
 1136     outsw                      ; output word
 1137     outs dx,dword [gs:esi]     ; output double word
 1138 
 1139   The repeat prefixes "rep", "repe"/"repz", and "repne"/"repnz" specify
 1140 repeated string operation. When a string operation instruction has a repeat
 1141 prefix, the operation is executed repeatedly, each time using a different
 1142 element of the string. The repetition terminates when one of the conditions
 1143 specified by the prefix is satisfied. All three prefixes automatically
 1144 decrease CX or ECX register (depending on whether the string operation
 1145 uses 16-bit or 32-bit addressing) after each operation and repeat the
 1146 associated operation until CX or ECX is zero. "repe"/"repz" and
 1147 "repne"/"repnz" are used exclusively with the "scas" and "cmps" instructions
 1148 (described above). When these prefixes are used, repetition of the next
 1149 instruction depends on the zero flag (ZF) also, "repe" and "repz" terminate
 1150 the execution when the ZF is zero, "repne" and "repnz" terminate the execution
 1151 when the ZF is set.
 1152 
 1153     rep  movsd       ; transfer multiple double words
 1154     repe cmpsb       ; compare bytes until not equal
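
        The following sketch shows two common uses of the repeat prefixes,
      assuming 32-bit code in which DS and ES select the same memory. The
      "source", "target" and "text" labels are hypothetical.

          cld              ; clear DF so that esi and edi are incremented
          mov esi,source
          mov edi,target
          mov ecx,16       ; number of double words to transfer
          rep movsd        ; copy 64 bytes from source to target

          mov edi,text     ; zero-terminated string to measure
          xor al,al        ; value to scan for
          mov ecx,-1       ; no practical limit on the count
          repne scasb      ; scan until the zero byte is found
          not ecx
          dec ecx          ; ecx = length of the string without terminator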
 1155 
 1156 
 1157 2.1.9  Flag control instructions
 1158 
 1159 The flag control instructions provide a method for directly changing the
 1160 state of bits in the flag register. All instructions described in this
 1161 section have no operands.
 1162   "stc" sets the CF (carry flag) to 1, "clc" zeroes the CF, "cmc" changes the
 1163 CF to its complement. "std" sets the DF (direction flag) to 1, "cld" zeroes
 1164 the DF, "sti" sets the IF (interrupt flag) to 1 and therefore enables the
 1165 interrupts, "cli" zeroes the IF and therefore disables the interrupts.
 1166   "lahf" copies SF, ZF, AF, PF, and CF to bits 7, 6, 4, 2, and 0 of the
 1167 AH register. The contents of the remaining bits are undefined. The flags
 1168 remain unaffected.
 1169   "sahf" transfers bits 7, 6, 4, 2, and 0 from the AH register into SF, ZF,
 1170 AF, PF, and CF.
 1171   "pushf" decrements "esp" by two or four and stores the low word or
 1172 double word of flags register at the top of stack, size of stored data
 1173 depends on the current code setting. "pushfw" variant forces storing the
 1174 word and "pushfd" forces storing the double word.
 1175   "popf" transfers specific bits from the word or double word at the top
 1176 of stack, then increments "esp" by two or four, this value depends on
 1177 the current code setting. "popfw" variant forces restoring from the word
 1178 and "popfd" forces restoring from the double word.
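
        A typical use of these instructions, sketched here under the assumption
      that the code runs with enough privilege to change the IF, is preserving
      the interrupt flag around a section that must not be interrupted.

          pushf            ; save the flags, including the current IF
          cli              ; disable the interrupts
          ; ... instructions that must not be interrupted ...
          popf             ; restore the previous state of IF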
 1179 
 1180 
 1181 2.1.10  Conditional operations
 1182 
 1183   The instructions obtained by attaching the condition mnemonic (see table
 1184 2.1) to the "set" mnemonic set a byte to one if the condition is true and set
 1185 the byte to zero otherwise. The operand should be an 8-bit general register
 1186 or the byte in memory.
 1187 
 1188     setne al         ; set al if zero flag cleared
 1189     seto byte [bx]   ; set byte if overflow
 1190 
 1191   "salc" instruction sets all bits of the AL register when the carry flag is
 1192 set and zeroes the AL register otherwise. This instruction has no arguments.
 1193   The instructions obtained by attaching the condition mnemonic to "cmov"
 1194 mnemonic transfer the word or double word from the general register or memory
 1195 to the general register only when the condition is true. The destination
 1196 operand should be general register, the source operand can be general register
 1197 or memory.
 1198 
 1199     cmove ax,bx      ; move when zero flag set
 1200     cmovnc eax,[ebx] ; move when carry flag cleared
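
        Because neither the "set" nor the "cmov" instructions modify the flags,
      one comparison can feed several of them. The sketch below assumes signed
      values in EAX and EBX and uses the "g" and "l" conditions.

          cmp eax,ebx      ; compare two signed values
          setg cl          ; cl = 1 when eax is greater than ebx, 0 otherwise
          cmovl eax,ebx    ; eax = the greater of the two values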
 1201 
 1202   "cmpxchg" compares the value in the AL, AX, or EAX register with the
 1203 destination operand. If the two values are equal, the source operand is
 1204 loaded into the destination operand. Otherwise, the destination operand is
 1205 loaded into the AL, AX, or EAX register. The destination operand may be a
 1206 general register or memory, the source operand must be a general register.
 1207 
 1208     cmpxchg dl,bl    ; compare and exchange with register
 1209     cmpxchg [bx],dx  ; compare and exchange with memory
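
        As a sketch of the usual compare-and-exchange loop, the code below
      increments a shared double word atomically. It relies on "cmpxchg"
      setting ZF when the exchange succeeded, uses the "lock" prefix described
      in 2.1.12, and assumes a hypothetical double word variable "counter".

          mov eax,[counter]           ; eax = value expected in memory
          retry:
          lea edx,[eax+1]             ; edx = new value to store
          lock cmpxchg [counter],edx  ; store edx if memory still equals eax
          jnz retry                   ; otherwise eax holds the fresh value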
 1210 
 1211   "cmpxchg8b" compares the 64-bit value in EDX and EAX registers with the
 1212 destination operand. If the values are equal, the 64-bit value in ECX and EBX
 1213 registers is stored in the destination operand. Otherwise, the value in the
 1214 destination operand is loaded into EDX and EAX registers. The destination
 1215 operand should be a quad word in memory.
 1216 
 1217     cmpxchg8b [bx]   ; compare and exchange 8 bytes
 1218 
 1219 
 1220 2.1.11  Miscellaneous instructions
 1221 
 1222 "nop" instruction occupies one byte but affects nothing but the instruction
 1223 pointer. This instruction has no operands and doesn't perform any operation.
 1224   "ud2" instruction generates an invalid opcode exception. This instruction
 1225 is provided for software testing to explicitly generate an invalid opcode.
 1226 This instruction has no operands.
 1227   "xlat" replaces a byte in the AL register with a byte indexed by its value
 1228 in a translation table addressed by BX or EBX. The operand should be a byte
 1229 memory addressed by BX or EBX with any segment prefix. This instruction has
 1230 also a short form "xlatb" which has no operands and uses the BX or EBX address
 1231 in the segment selected by DS depending on the current code setting.
 1232   "lds" transfers a pointer variable from the source operand to DS and the
 1233 destination register. The source operand must be a memory operand, and the
 1234 destination operand must be a general register. The DS register receives the
 1235 segment selector of the pointer while the destination register receives the
 1236 offset part of the pointer. "les", "lfs", "lgs" and "lss" operate identically
 1237 to "lds" except that rather than DS register the ES, FS, GS and SS are used
 1238 respectively.
 1239 
 1240     lds bx,[si]      ; load pointer to ds:bx
 1241 
 1242   "lea" transfers the offset of the source operand (rather than its value)
 1243 to the destination operand. The source operand must be a memory operand, and
 1244 the destination operand must be a general register.
 1245 
 1246     lea dx,[bx+si+1] ; load effective address to dx
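
        Since "lea" only computes the address and neither accesses memory nor
      changes the flags, it is often used for plain arithmetic. A short sketch
      for 32-bit code with arbitrarily chosen registers:

          lea eax,[ebx+ebx*4]     ; eax = ebx*5
          lea ecx,[eax+edx*8+16]  ; ecx = eax+edx*8+16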
 1247 
 1248   "cpuid" returns processor identification and feature information in the
 1249 EAX, EBX, ECX, and EDX registers. The information returned is selected by
 1250 entering a value in the EAX register before the instruction is executed.
 1251 This instruction has no operands.
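
        For example, entering zero in EAX returns the 12-byte vendor
      identification string in EBX, EDX and ECX (in this order). The sketch
      below stores it at the hypothetical "vendor" label, which is assumed to
      reserve at least 12 bytes.

          xor eax,eax               ; select function 0
          cpuid
          mov dword [vendor],ebx    ; first four characters
          mov dword [vendor+4],edx  ; next four characters
          mov dword [vendor+8],ecx  ; last four characters
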
 1252   "pause" instruction delays the execution of the next instruction by an
 1253 implementation-specific amount of time. It can be used to improve the
 1254 performance of spin wait loops. This instruction has no operands.
 1255   "enter" creates a stack frame that may be used to implement the scope rules
 1256 of block-structured high-level languages. A "leave" instruction at the end of
 1257 a procedure complements an "enter" at the beginning of the procedure to
 1258 simplify stack management and to control access to variables for nested
 1259 procedures. The "enter" instruction includes two parameters. The first
 1260 parameter specifies the number of bytes of dynamic storage to be allocated on
 1261 the stack for the routine being entered. The second parameter corresponds to
 1262 the lexical nesting level of the routine, it can be in range from 0 to 31.
 1263 The specified lexical level determines how many sets of stack frame pointers
 1264 the CPU copies into the new stack frame from the preceding frame. This list
 1265 of stack frame pointers is sometimes called the display. The first word (or
 1266 double word when code is 32-bit) of the display is a pointer to the last stack
 1267 frame. This pointer enables a "leave" instruction to reverse the action of the
 1268 previous "enter" instruction by effectively discarding the last stack frame.
 1269 After "enter" creates the new display for a procedure, it allocates the
 1270 dynamic storage space for that procedure by decrementing ESP by the number of
 1271 bytes specified in the first parameter. To enable a procedure to address its
 1272 display, "enter" leaves BP (or EBP) pointing to the beginning of the new stack
 1273 frame. If the lexical level is zero, "enter" pushes BP (or EBP), copies SP to
 1274 BP (or ESP to EBP) and then subtracts the first operand from ESP. For nesting
 1275 levels greater than zero, the processor pushes additional frame pointers on
 1276 the stack before adjusting the stack pointer.
 1277 
 1278     enter 2048,0     ; enter and allocate 2048 bytes on stack
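
        With the nesting level of zero the pair of "enter" and "leave" can be
      read as a short form of the classic prologue and epilogue. The sketch
      below assumes 32-bit code, the "my_proc" label is a hypothetical name.

          my_proc:
          enter 8,0            ; push ebp, mov ebp,esp, sub esp,8
          mov dword [ebp-4],0  ; first local double word
          mov dword [ebp-8],0  ; second local double word
          leave                ; mov esp,ebp, pop ebp
          ret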
 1279 
 1280 
 1281 2.1.12  System instructions
 1282 
 1283 "lmsw" loads the operand into the machine status word (bits 0 through 15 of
 1284 CR0 register), while "smsw" stores the machine status word into the
 1285 destination operand. The operand for both those instructions can be 16-bit
 1286 general register or memory, for "smsw" it can also be 32-bit general
 1287 register.
 1288 
 1289     lmsw ax          ; load machine status from register
 1290     smsw [bx]        ; store machine status to memory
 1291 
 1292   "lgdt" and "lidt" instructions load the values in operand into the global
 1293 descriptor table register or the interrupt descriptor table register
 1294 respectively. "sgdt" and "sidt" store the contents of the global descriptor
 1295 table register or the interrupt descriptor table register in the destination
 1296 operand. The operand should be 6 bytes in memory.
 1297 
 1298     lgdt [ebx]       ; load global descriptor table
 1299 
 1300   "lldt" loads the operand into the segment selector field of the local
 1301 descriptor table register and "sldt" stores the segment selector from the
 1302 local descriptor table register in the operand. "ltr" loads the operand into
 1303 the segment selector field of the task register and "str" stores the segment
 1304 selector from the task register in the operand. Rules for operand are the same
 1305 as for the "lmsw" and "smsw" instructions.
 1306   "lar" loads the access rights from the segment descriptor specified by
 1307 the selector in source operand into the destination operand and sets the ZF
 1308 flag. The destination operand can be a 16-bit or 32-bit general register.
 1309 The source operand should be a 16-bit general register or memory.
 1310 
 1311     lar ax,[bx]      ; load access rights into word
 1312     lar eax,dx       ; load access rights into double word
 1313 
 1314   "lsl" loads the segment limit from the segment descriptor specified by the
 1315 selector in source operand into the destination operand and sets the ZF flag.
 1316 Rules for operand are the same as for the "lar" instruction.
 1317   "verr" and "verw" verify whether the code or data segment specified with
 1318 the operand is readable or writable from the current privilege level. The
 1319 operand should be a word, it can be general register or memory. If the segment
 1320 is accessible and readable (for "verr") or writable (for "verw") the ZF flag
 1321 is set, otherwise it's cleared. Rules for operand are the same as for the
 1322 "lldt" instruction.
 1323   "arpl" compares the RPL (requestor's privilege level) fields of two segment
 1324 selectors. The first operand contains one segment selector and the second
 1325 operand contains the other. If the RPL field of the destination operand is
 1326 less than the RPL field of the source operand, the ZF flag is set and the RPL
 1327 field of the destination operand is increased to match that of the source
 1328 operand. Otherwise, the ZF flag is cleared and no change is made to the
 1329 destination operand. The destination operand can be a word general register
 1330 or memory, the source operand must be a general register.
 1331 
 1332     arpl bx,ax       ; adjust RPL of selector in register
 1333     arpl [bx],ax     ; adjust RPL of selector in memory
 1334 
 1335   "clts" clears the TS (task switched) flag in the CR0 register. This
 1336 instruction has no operands.
 1337   "lock" prefix causes the processor's bus-lock signal to be asserted during
 1338 execution of the accompanying instruction. In a multiprocessor environment,
 1339 the bus-lock signal ensures that the processor has exclusive use of any shared
 1340 memory while the signal is asserted. The "lock" prefix can be prepended only
 1341 to the following instructions and only to those forms of the instructions
 1342 where the destination operand is a memory operand: "add", "adc", "and", "btc",
 1343 "btr", "bts", "cmpxchg", "cmpxchg8b", "dec", "inc", "neg", "not", "or", "sbb",
 1344 "sub", "xor", "xadd" and "xchg". If the "lock" prefix is used with one of
 1345 these instructions and the source operand is a memory operand, an undefined
 1346 opcode exception may be generated. An undefined opcode exception will also be
 1347 generated if the "lock" prefix is used with any instruction not in the above
 1348 list. The "xchg" instruction always asserts the bus-lock signal regardless of
 1349 the presence or absence of the "lock" prefix.
 1350   "hlt" stops instruction execution and places the processor in a halted
 1351 state. An enabled interrupt, a debug exception, the BINIT, INIT or the RESET
 1352 signal will resume execution. This instruction has no operands.
 1353   "invlpg" invalidates (flushes) the TLB (translation lookaside buffer) entry
 1354 specified with the operand, which should be a memory address. The processor
 1355 determines the page containing that address and flushes its TLB entry.
 1356   "rdmsr" loads the contents of a 64-bit MSR (model specific register) of the
 1357 address specified in the ECX register into registers EDX and EAX. "wrmsr"
 1358 writes the contents of registers EDX and EAX into the 64-bit MSR of the
 1359 address specified in the ECX register. "rdtsc" loads the current value of the
 1360 processor's time stamp counter from the 64-bit MSR into the EDX and EAX
 1361 registers. The processor increments the time stamp counter MSR every clock
 1362 cycle and resets it to 0 whenever the processor is reset. "rdpmc" loads the
 1363 contents of the 40-bit performance monitoring counter specified in the ECX
 1364 register into registers EDX and EAX. These instructions have no operands.
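
        A rough sketch of measuring elapsed clock cycles with "rdtsc" is shown
      below. It assumes that the measured code preserves ESI and EDI, and the
      result is only approximate, because the processor may execute
      instructions out of order.

          rdtsc            ; edx:eax = time stamp counter
          mov esi,eax
          mov edi,edx      ; edi:esi = starting value
          ; ... measured code goes here ...
          rdtsc
          sub eax,esi
          sbb edx,edi      ; edx:eax = number of elapsed clock cycles
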
 1365   "wbinvd" writes back all modified cache lines in the processor's internal
 1366 cache to main memory and invalidates (flushes) the internal caches. The
 1367 instruction then issues a special function bus cycle that directs external
 1368 caches to also write back modified data and another bus cycle to indicate that
 1369 the external caches should be invalidated. This instruction has no operands.
 1370   "rsm" returns program control from system management mode to the program
 1371 that was interrupted when the processor received an SMM interrupt. This
 1372 instruction has no operands.
 1373   "sysenter" executes a fast call to a level 0 system procedure, "sysexit"
 1374 executes a fast return to level 3 user code. The addresses used by these
 1375 instructions are stored in MSRs. These instructions have no operands.
 1376 
 1377 
 1378 2.1.13  FPU instructions
 1379 
 1380 The FPU (Floating-Point Unit) instructions operate on the floating-point
 1381 values in three formats: single precision (32-bit), double precision (64-bit)
 1382 and double extended precision (80-bit). The FPU registers form the stack and
 1383 each of them holds the double extended precision floating-point value. When
 1384 some values are pushed onto the stack or are removed from the top, the FPU
 1385 registers are shifted, so ST0 is always the value on the top of FPU stack, ST1
 1386 is the first value below the top, etc. The ST0 name has also the synonym ST.
 1387   "fld" pushes the floating-point value onto the FPU register stack. The
 1388 operand can be 32-bit, 64-bit or 80-bit memory location or the FPU register,
 1389 its value is then loaded onto the top of FPU register stack (the ST0
 1390 register) and is automatically converted into the double extended precision
 1391 format.
 1392 
 1393     fld dword [bx]   ; load single precision value from memory
 1394     fld st2          ; push value of st2 onto register stack
 1395 
 1396   "fld1", "fldz", "fldl2t", "fldl2e", "fldpi", "fldlg2" and "fldln2" load the
 1397 commonly used constants onto the FPU register stack. The loaded constants are
 1398 +1.0, +0.0, lb 10, lb e, pi, lg 2 and ln 2 respectively. These instructions
 1399 have no operands.
 1400   "fild" converts the signed integer source operand into double extended
 1401 precision floating-point format and pushes the result onto the FPU register
 1402 stack. The source operand can be a 16-bit, 32-bit or 64-bit memory location.
 1403 
 1404     fild qword [bx]  ; load 64-bit integer from memory
 1405 
 1406   "fst" copies the value of ST0 register to the destination operand, which
 1407 can be 32-bit or 64-bit memory location or another FPU register. "fstp"
 1408 performs the same operation as "fst" and then pops the register stack,
 1409 getting rid of ST0. "fstp" accepts the same operands as the "fst" instruction
 1410 and can also store value in the 80-bit memory.
 1411 
 1412     fst st3          ; copy value of st0 into st3 register
 1413     fstp tword [bx]  ; store value in memory and pop stack
 1414 
 1415   "fist" converts the value in ST0 to a signed integer and stores the result
 1416 in the destination operand. The operand can be 16-bit or 32-bit memory
 1417 location. "fistp" performs the same operation and then pops the register
 1418 stack, it accepts the same operands as the "fist" instruction and can also
 1419 store integer value in the 64-bit memory, so it has the same rules for
 1420 operands as "fild" instruction.
 1421   "fbld" converts the packed BCD integer into double extended precision
 1422 floating-point format and pushes this value onto the FPU stack. "fbstp"
 1423 converts the value in ST0 to an 18-digit packed BCD integer, stores the result
 1424 in the destination operand, and pops the register stack. The operand should be
 1425 an 80-bit memory location.
 1426   "fadd" adds the destination and source operand and stores the sum in the
 1427 destination location. The destination operand is always an FPU register, if
 1428 the source is a memory location, the destination is ST0 register and only
 1429 source operand should be specified. If both operands are FPU registers, at
 1430 least one of them should be ST0 register. An operand in memory can be a
 1431 32-bit or 64-bit value.
 1432 
 1433     fadd qword [bx]  ; add double precision value to st0
 1434     fadd st2,st0     ; add st0 to st2
 1435 
 1436   "faddp" adds the destination and source operand, stores the sum in the
 1437 destination location and then pops the register stack. The destination operand
 1438 must be an FPU register and the source operand must be the ST0. When no
 1439 operands are specified, ST1 is used as a destination operand.
 1440 
 1441     faddp            ; add st0 to st1 and pop the stack
 1442     faddp st2,st0    ; add st0 to st2 and pop the stack
 1443 
 1444   "fiadd" instruction converts an integer source operand into double extended
 1445 precision floating-point value and adds it to the destination operand. The
 1446 operand should be a 16-bit or 32-bit memory location.
 1447 
 1448     fiadd word [bx]  ; add word integer to st0
 1449 
 1450   "fsub", "fsubr", "fmul", "fdiv", "fdivr" instructions are similar to "fadd",
 1451 have the same rules for operands and differ only in the performed computation.
 1452 "fsub" subtracts the source operand from the destination operand, "fsubr"
 1453 subtracts the destination operand from the source operand, "fmul" multiplies
 1454 the destination and source operands, "fdiv" divides the destination operand by
 1455 the source operand and "fdivr" divides the source operand by the destination
 1456 operand. "fsubp", "fsubrp", "fmulp", "fdivp", "fdivrp" perform the same
 1457 operations and pop the register stack, the rules for operand are the same as
 1458 for the "faddp" instruction. "fisub", "fisubr", "fimul", "fidiv", "fidivr"
 1459 perform these operations after converting the integer source operand into
 1460 floating-point value, they have the same rules for operands as "fiadd"
 1461 instruction.
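
        The sketch below strings several of these instructions together to
      compute (a+b)*c and 1/x. The "a", "b", "c", "x", "result" and "inverse"
      labels are hypothetical single precision variables.

          fld dword [a]         ; st0 = a
          fadd dword [b]        ; st0 = a+b
          fmul dword [c]        ; st0 = (a+b)*c
          fstp dword [result]   ; store the product and pop the stack

          fld1                  ; st0 = 1.0
          fld dword [x]         ; st0 = x, st1 = 1.0
          fdivp                 ; st1 = st1/st0, then pop, so st0 = 1.0/x
          fstp dword [inverse]  ; store the quotient and pop the stack
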
 1462   "fsqrt" computes the square root of the value in ST0 register, "fsin"
 1463 computes the sine of that value, "fcos" computes the cosine of that value,
 1464 "fchs" complements its sign bit, "fabs" clears its sign to create the absolute
 1465 value, "frndint" rounds it to the nearest integral value, depending on the
 1466 current rounding mode. "f2xm1" computes the exponential value of 2 to the
 1467 power of ST0 and subtracts the 1.0 from it, the value of ST0 must lie in the
 1468 range -1.0 to +1.0. All these instructions store the result in ST0 and have no
 1469 operands.
 1470   "fsincos" computes both the sine and the cosine of the value in ST0
 1471 register, stores the sine in ST0 and pushes the cosine on the top of FPU
 1472 register stack. "fptan" computes the tangent of the value in ST0, stores the
 1473 result in ST0 and pushes a 1.0 onto the FPU register stack. "fpatan" computes
 1474 the arctangent of the value in ST1 divided by the value in ST0, stores the
 1475 result in ST1 and pops the FPU register stack. "fyl2x" computes the binary
 1476 logarithm of ST0, multiplies it by ST1, stores the result in ST1 and pops the
 1477 FPU register stack; "fyl2xp1" performs the same operation but it adds 1.0 to
 1478 ST0 before computing the logarithm. "fprem" computes the remainder obtained
 1479 from dividing the value in ST0 by the value in ST1, and stores the result
 1480 in ST0. "fprem1" performs the same operation as "fprem", but it computes the
 1481 remainder in the way specified by IEEE Standard 754. "fscale" truncates the
 1482 value in ST1 and increases the exponent of ST0 by this value. "fxtract"
 1483 separates the value in ST0 into its exponent and significand, stores the
 1484 exponent in ST0 and pushes the significand onto the register stack. "fnop"
 1485 performs no operation. These instructions have no operands.
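
        For instance, because "fyl2x" multiplies the binary logarithm by ST1,
      loading ln 2 first yields the natural logarithm. This sketch assumes a
      hypothetical double precision variable "x" holding a positive value.

          fldln2           ; st0 = ln 2
          fld qword [x]    ; st0 = x, st1 = ln 2
          fyl2x            ; st0 = ln 2 * lb x = ln x
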
 1486   "fxch" exchanges the contents of ST0 and another FPU register. The operand
 1487 should be an FPU register, if no operand is specified, the contents of ST0 and
 1488 ST1 are exchanged.
 1489   "fcom" and "fcomp" compare the contents of ST0 and the source operand and
 1490 set flags in the FPU status word according to the results. "fcomp"
 1491 additionally pops the register stack after performing the comparison. The
 1492 operand can be a single or double precision value in memory or the FPU
 1493 register. When no operand is specified, ST1 is used as a source operand.
 1494 
 1495     fcom             ; compare st0 with st1
 1496     fcomp st2        ; compare st0 with st2 and pop stack
 1497 
 1498   "fcompp" compares the contents of ST0 and ST1, sets flags in the FPU status
 1499 word according to the results and pops the register stack twice. This
 1500 instruction has no operands.
 1501   "fucom", "fucomp" and "fucompp" perform an unordered comparison of two FPU
 1502 registers. Rules for operands are the same as for the "fcom", "fcomp" and
 1503 "fcompp", but the source operand must be an FPU register.
 1504   "ficom" and "ficomp" compare the value in ST0 with an integer source operand
 1505 and set the flags in the FPU status word according to the results. "ficomp"
 1506 additionally pops the register stack after performing the comparison. The
 1507 integer value is converted to double extended precision floating-point format
 1508 before the comparison is made. The operand should be a 16-bit or 32-bit
 1509 memory location.
 1510 
 1511     ficom word [bx]  ; compare st0 with 16-bit integer
 1512 
 1513   "fcomi", "fcomip", "fucomi", "fucomip" perform the comparison of ST0 with
 1514 another FPU register and set the ZF, PF and CF flags according to the results.
 1515 "fcomip" and "fucomip" additionally pop the register stack after performing the
 1516 comparison. The instructions obtained by attaching the FPU condition mnemonic
 1517 (see table 2.2) to the "fcmov" mnemonic transfer the specified FPU register
 1518 into ST0 register if the given test condition is true. These instructions
 1519 allow two different syntaxes, one with single operand specifying the source
 1520 FPU register, and one with two operands, in that case destination operand
 1521 should be ST0 register and the second operand specifies the source FPU
 1522 register.
 1523 
 1524     fcomi st2        ; compare st0 with st2 and set flags
 1525     fcmovb st0,st2   ; transfer st2 to st0 if below
 1526 
 1527    Table 2.2  FPU conditions
 1528   /------------------------------------------------------\
 1529   | Mnemonic | Condition tested | Description            |
 1530   |==========|==================|========================|
 1531   | b        | CF = 1           | below                  |
 1532   | e        | ZF = 1           | equal                  |
 1533   | be       | CF or ZF = 1     | below or equal         |
 1534   | u        | PF = 1           | unordered              |
 1535   | nb       | CF = 0           | not below              |
 1536   | ne       | ZF = 0           | not equal              |
 1537   | nbe      | CF or ZF = 0     | not below nor equal    |
 1538   | nu       | PF = 0           | not unordered          |
 1539   \------------------------------------------------------/
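
        A sketch of a conditional branch based on "fcomip" follows. Since the
      comparison sets ZF, PF and CF in the same way as an unsigned integer
      comparison does, the ordinary conditional jumps can be used directly.
      The "first" and "second" double precision variables and both target
      labels are hypothetical.

          fld qword [second]
          fld qword [first]   ; st0 = first, st1 = second
          fcomip st1          ; compare st0 with st1, set flags and pop
          fstp st0            ; discard the remaining value, flags stay intact
          jp unordered        ; PF = 1 when any of the values is a NaN
          jb first_is_lower   ; CF = 1 when first is below second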
 1540 
 1541   "ftst" compares the value in ST0 with 0.0 and sets the flags in the FPU
 1542 status word according to the results. "fxam" examines the contents of the ST0
 1543 and sets the flags in FPU status word to indicate the class of value in the
 1544 register. These instructions have no operands.
 1545   "fstsw" and "fnstsw" store the current value of the FPU status word in the
 1546 destination location. The destination operand can be either a 16-bit memory or
 1547 the AX register. "fstsw" checks for pending unmasked FPU exceptions before
 1548 storing the status word, "fnstsw" does not.
 1549   "fstcw" and "fnstcw" store the current value of the FPU control word at the
 1550 specified destination in memory. "fstcw" checks for pending unmasked FPU
 1551 exceptions before storing the control word, "fnstcw" does not. "fldcw" loads
 1552 the operand into the FPU control word. The operand should be a 16-bit memory
 1553 location.
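
        As an illustration, the sketch below temporarily switches the rounding
      control field (bits 10 and 11 of the control word) to truncation before
      converting ST0 to an integer. The "old_cw" and "new_cw" word variables
      and the "result" double word variable are hypothetical.

          fnstcw [old_cw]       ; save the current control word
          mov ax,[old_cw]
          or ax,0C00h           ; rounding control = 11b, round toward zero
          mov [new_cw],ax
          fldcw [new_cw]        ; activate the modified control word
          fistp dword [result]  ; store the truncated integer and pop
          fldcw [old_cw]        ; restore the previous rounding mode
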
 1554   "fstenv" and "fnstenv" store the current FPU operating environment at the
 1555 memory location specified with the destination operand, and then mask all FPU
 1556 exceptions. "fstenv" checks for pending unmasked FPU exceptions before
 1557 proceeding, "fnstenv" does not. "fldenv" loads the complete operating
 1558 environment from memory into the FPU. "fsave" and "fnsave" store the current
 1559 FPU state (operating environment and register stack) at the specified
 1560 destination in memory and reinitialize the FPU. "fsave" checks for pending
 1561 unmasked FPU exceptions before proceeding, "fnsave" does not. "frstor"
 1562 loads the FPU state from the specified memory location. All these instructions
 1563 need an operand that is a memory location. For each of these instructions there
 1564 exist two additional mnemonics that precisely select the type of the
 1565 operation. The "fstenvw", "fnstenvw", "fldenvw", "fsavew", "fnsavew" and
 1566 "frstorw" mnemonics force the instruction to perform operation as in the 16-bit
 1567 mode, while "fstenvd", "fnstenvd", "fldenvd", "fsaved", "fnsaved" and "frstord"
 1568 force the operation as in 32-bit mode.
 1569   "finit" and "fninit" set the FPU operating environment into its default
 1570 state. "finit" checks for pending unmasked FPU exceptions before proceeding,
 1571 "fninit" does not. "fclex" and "fnclex" clear the FPU exception flags in the
 1572 FPU status word. "fclex" checks for pending unmasked FPU exceptions before
 1573 proceeding, "fnclex" does not. "wait" and "fwait" are synonyms for the same
 1574 instruction, which causes the processor to check for pending unmasked FPU
 1575 exceptions and handle them before proceeding. These instructions have no
 1576 operands.
 1577   "ffree" sets the tag associated with specified FPU register to empty. The
 1578 operand should be an FPU register.
 1579   "fincstp" and "fdecstp" rotate the FPU stack by one by adding one to or
 1580 subtracting one from the top of stack pointer. These instructions have no
 1581 operands.
 1582 
 1583 
 1584 2.1.14  MMX instructions
 1585 
 1586 The MMX instructions operate on the packed integer types and use the MMX
 1587 registers, which are the low 64-bit parts of the 80-bit FPU registers. Because
 1588 of this, MMX instructions cannot be used at the same time as FPU instructions.
 1589 They can operate on packed bytes (eight 8-bit integers), packed words (four
 1590 16-bit integers) or packed double words (two 32-bit integers). The packed
 1591 formats allow one instruction to operate on multiple data elements at once.
 1592   "movq" copies a quad word from the source operand to the destination
 1593 operand. At least one of the operands must be an MMX register, the other one
 1594 can also be an MMX register or a 64-bit memory location.
 1595 
 1596     movq mm0,mm1     ; move quad word from register to register
 1597     movq mm2,[ebx]   ; move quad word from memory to register
 1598 
 1599   "movd" copies a double word from the source operand to the destination
 1600 operand. One of the operands must be an MMX register, the other one can be a
 1601 general register or 32-bit memory location. Only the low double word of the
 1602 MMX register is used.
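        A short sketch of both directions of transfer (the choice of registers is
      arbitrary):

          movd mm0,eax     ; move double word from general register to mm0
          movd [ebx],mm1   ; move low double word of mm1 to memory
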
 1603   All general MMX operations have two operands, the destination operand should
 1604 be a MMX register, the source operand can be a MMX register or 64-bit memory
 1605 location. Operation is performed on the corresponding data elements of the
 1606 source and destination operand and stored in the data elements of the
 1607 destination operand. "paddb", "paddw" and "paddd" perform the addition of
 1608 packed bytes, packed words, or packed double words.  "psubb", "psubw" and
 1609 "psubd" perform the subtraction of appropriate types. "paddsb", "paddsw",
 1610 "psubsb" and "psubsw" perform the addition or subtraction of packed bytes
 1611 or packed words with the signed saturation. "paddusb", "paddusw", "psubusb",
 1612 "psubusw" are analogous, but with unsigned saturation. "pmulhw" and "pmullw"
 1613 perform a signed multiplication of the packed words and store the high or low
 1614 words of the results in the destination operand. "pmaddwd" performs a multiply
 1615 of the packed words and adds the four intermediate double word products in
 1616 pairs to produce the result as packed double words. "pand", "por" and "pxor"
 1617 perform the logical operations on the quad words, "pandn" also performs a
 1618 logical negation of the destination operand before performing the "and"
 1619 operation. "pcmpeqb", "pcmpeqw" and "pcmpeqd" compare for equality of packed
 1620 bytes, packed words or packed double words. If a pair of data elements is
 1621 equal, the corresponding data element in the destination operand is filled with
 1622 bits of value 1, otherwise it's set to 0. "pcmpgtb", "pcmpgtw" and "pcmpgtd"
 1623 perform the similar operation, but they check whether the data elements in the
 1624 destination operand are greater than the corresponding data elements in the
 1625 source operand. "packsswb" converts packed signed words into packed signed
 1626 bytes, "packssdw" converts packed signed double words into packed signed
 1627 words, using saturation to handle overflow conditions. "packuswb" converts
 1628 packed signed words into packed unsigned bytes. Converted data elements from
 1629 the source operand are stored in the high part of the destination operand,
 1630 while converted data elements from the destination operand are stored in the
 1631 low part. "punpckhbw", "punpckhwd" and "punpckhdq" interleave the data
 1632 elements from the high parts of the source and destination operands and
 1633 store the result in the destination operand. "punpcklbw", "punpcklwd" and
 1634 "punpckldq" perform the same operation, but the low parts of the source and
 1635 destination operands are used.
 1636 
 1637     paddsb mm0,[esi] ; add packed bytes with signed saturation
 1638     pcmpeqw mm3,mm7  ; compare packed words for equality
 1639 
 1640   "psllw", "pslld" and "psllq" perform logical shift left of the packed words,
 1641 packed double words or a single quad word in the destination operand by the
 1642 amount specified in the source operand. "psrlw", "psrld" and "psrlq" perform
 1643 logical shift right of the packed words, packed double words or a single quad
 1644 word. "psraw" and "psrad" perform arithmetic shift right of the packed words or
 1645 double words. The destination operand should be an MMX register, while the
 1646 source operand can be an MMX register, 64-bit memory location, or 8-bit
 1647 immediate value.
 1648 
 1649     psllw mm2,mm4    ; shift words left logically
 1650     psrad mm4,[ebx]  ; shift double words right arithmetically
 1651 
 1652   "emms" makes the FPU registers usable for the FPU instructions. It must be
 1653 executed before any FPU instructions if any MMX instructions were used.
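        A typical MMX routine therefore ends with "emms". The following sketch
      assumes that ESI and EDI each point to eight bytes of data and that EBX
      points to a single precision value; these register roles are chosen only
      for illustration.

          movq mm0,[esi]     ; load eight packed bytes
          paddusb mm0,[edi]  ; add eight bytes with unsigned saturation
          movq [esi],mm0     ; store the result
          emms               ; MMX code is done, release the FPU registers
          fld dword [ebx]    ; FPU instructions can be used again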
 1654 
 1655 
 1656 2.1.15  SSE instructions
 1657 
 1658 The SSE extension adds more MMX instructions and also introduces the
 1659 operations on packed single precision floating point values. The 128-bit
 1660 packed single precision format consists of four single precision floating
 1661 point values. The 128-bit SSE registers are designed for the purpose of
 1662 operations on this data type.
 1663   "movaps" and "movups" transfer a double quad word operand containing packed
 1664 single precision values from source operand to destination operand. At least
 1665 one of the operands has to be an SSE register, the other one can also be an
 1666 SSE register or 128-bit memory location. Memory operands for the "movaps"
 1667 instruction must be aligned on a 16-byte boundary, operands for the "movups"
 1668 instruction don't have to be aligned.
 1669 
 1670     movups xmm0,[ebx]  ; move unaligned double quad word
 1671 
 1672   "movlps" moves two packed single precision values between the memory and the
 1673 low quad word of an SSE register. "movhps" moves two packed single precision
 1674 values between the memory and the high quad word of an SSE register. One of the
 1675 operands must be an SSE register, and the other operand must be a 64-bit memory
 1676 location.
 1677 
 1678     movlps xmm0,[ebx]  ; move memory to low quad word of xmm0
 1679     movhps [esi],xmm7  ; move high quad word of xmm7 to memory
 1680 
 1681   "movlhps" moves two packed single precision values from the low quad word
 1682 of source register to the high quad word of destination register. "movhlps"
 1683 moves two packed single precision values from the high quad word of source
 1684 register to the low quad word of destination register. Both operands have to
 1685 be SSE registers.
 1686   "movmskps" transfers the most significant bit of each of the four single
 1687 precision values in the SSE register into low four bits of a general register.
 1688 The source operand must be a SSE register, the destination operand must be a
 1689 general register.
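        For example, the mask can be tested to find out whether any of the four
      values has its sign bit set. This is only a sketch, the label "has_negative"
      is hypothetical:

          movmskps eax,xmm0  ; eax = sign bits of the four values in xmm0
          test eax,eax       ; is any of the four bits set?
          jnz has_negative   ; yes, at least one value has the sign bit set
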
 1690   "movss" transfers a single precision value between source and destination
 1691 operand (only the low double word is transferred). At least one of the operands
 1692 has to be an SSE register, the other one can also be an SSE register or 32-bit
 1693 memory location.
 1694 
 1695     movss [edi],xmm3   ; move low double word of xmm3 to memory
 1696 
 1697   Each of the SSE arithmetic operations has two variants. When the mnemonic
 1698 ends with "ps", the source operand can be a 128-bit memory location or a SSE
 1699 register, the destination operand must be a SSE register and the operation is
 1700 performed on packed four single precision values, for each pair of the
 1701 corresponding data elements separately, the result is stored in the
 1702 destination register. When the mnemonic ends with "ss", the source operand
 1703 can be a 32-bit memory location or a SSE register, the destination operand
 1704 must be a SSE register and the operation is performed on single precision
 1705 values, only low double words of SSE registers are used in this case, the
 1706 result is stored in the low double word of destination register. "addps" and
 1707 "addss" add the values, "subps" and "subss" subtract the source value from
 1708 destination value, "mulps" and "mulss" multiply the values, "divps" and
 1709 "divss" divide the destination value by the source value, "rcpps" and "rcpss"
 1710 compute the approximate reciprocal of the source value, "sqrtps" and "sqrtss"
 1711 compute the square root of the source value, "rsqrtps" and "rsqrtss" compute
 1712 the approximate reciprocal of square root of the source value, "maxps" and
 1713 "maxss" compare the source and destination values and return the greater one,
 1714 "minps" and "minss" compare the source and destination values and return the
 1715 lesser one.
 1716 
 1717     mulss xmm0,[ebx]   ; multiply single precision values
 1718     addps xmm3,xmm7    ; add packed single precision values
 1719 
 1720   "andps", "andnps", "orps" and "xorps" perform the logical operations on
 1721 packed single precision values. The source operand can be a 128-bit memory
 1722 location or a SSE register, the destination operand must be a SSE register.
 1723   "cmpps" compares packed single precision values and returns a mask result
 1724 into the destination operand, which must be a SSE register. The source operand
 1725 can be a 128-bit memory location or SSE register, the third operand must be an
 1726 immediate operand selecting code of one of the eight compare conditions
 1727 (table 2.3). "cmpss" performs the same operation on single precision values,
 1728 only low double word of destination register is affected, in this case source
 1729 operand can be a 32-bit memory location or SSE register. These two
 1730 instructions have also variants with only two operands and the condition
 1731 encoded within mnemonic. Their mnemonics are obtained by attaching the
 1732 mnemonic from table 2.3 to the "cmp" mnemonic and then attaching the "ps" or
 1733 "ss" at the end.
 1734 
 1735     cmpps xmm2,xmm4,0  ; compare packed single precision values
 1736     cmpltss xmm0,[ebx] ; compare single precision values
 1737 
 1738    Table 2.3  SSE conditions
 1739   /-------------------------------------------\
 1740   | Code | Mnemonic | Description             |
 1741   |======|==========|=========================|
 1742   | 0    | eq       | equal                   |
 1743   | 1    | lt       | less than               |
 1744   | 2    | le       | less than or equal      |
 1745   | 3    | unord    | unordered               |
 1746   | 4    | neq      | not equal               |
 1747   | 5    | nlt      | not less than           |
 1748   | 6    | nle      | not less than nor equal |
 1749   | 7    | ord      | ordered                 |
 1750   \-------------------------------------------/
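
        The mask returned by these comparisons is usually combined with the logical
      operations. A minimal sketch, which replaces with zero every value in xmm0
      that is not greater than zero (xmm1 is used as a scratch register):

          xorps xmm1,xmm1    ; xmm1 = four values of 0.0
          cmpltps xmm1,xmm0  ; set all bits where 0.0 is less than value in xmm0
          andps xmm0,xmm1    ; keep these values, clear all the other ones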
 1751 
 1752   "comiss" and "ucomiss" compare the single precision values and set the ZF,
 1753 PF and CF flags to show the result. The destination operand must be a SSE
 1754 register, the source operand can be a 32-bit memory location or SSE register.
 1755   "shufps" moves any two of the four single precision values from the
 1756 destination operand into the low quad word of the destination operand, and any
 1757 two of the four values from the source operand into the high quad word of the
 1758 destination operand. The destination operand must be a SSE register, the
 1759 source operand can be a 128-bit memory location or SSE register, the third
 1760 operand must be an 8-bit immediate value selecting which values will be moved
 1761 into the destination operand. Bits 0 and 1 select the value to be moved from
 1762 destination operand to the low double word of the result, bits 2 and 3 select
 1763 the value to be moved from the destination operand to the second double word,
 1764 bits 4 and 5 select the value to be moved from the source operand to the third
 1765 double word, and bits 6 and 7 select the value to be moved from the source
 1766 operand to the high double word of the result.
 1767 
 1768     shufps xmm0,xmm0,10010011b ; shuffle double words
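
        Two more selectors, given as a sketch of how the bit fields combine when the
      same register is used as both operands:

          shufps xmm0,xmm0,0         ; copy the low value into all four positions
          shufps xmm1,xmm1,00011011b ; reverse the order of the four values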
 1769 
 1770   "unpckhps" performs an interleaved unpack of the values from the high parts
 1771 of the source and destination operands and stores the result in the
 1772 destination operand, which must be a SSE register. The source operand can be
 1773 a 128-bit memory location or a SSE register. "unpcklps" performs an
 1774 interleaved unpack of the values from the low parts of the source and
 1775 destination operand and stores the result in the destination operand,
 1776 the rules for operands are the same.
 1777   "cvtpi2ps" converts two packed double word integers into two packed
 1778 single precision floating point values and stores the result in the low quad
 1779 word of the destination operand, which should be a SSE register. The source
 1780 operand can be a 64-bit memory location or MMX register.
 1781 
 1782     cvtpi2ps xmm0,mm0  ; convert integers to single precision values
 1783 
 1784   "cvtsi2ss" converts a double word integer into a single precision floating
 1785 point value and stores the result in the low double word of the destination
 1786 operand, which should be a SSE register. The source operand can be a 32-bit
 1787 memory location or 32-bit general register.
 1788 
 1789     cvtsi2ss xmm0,eax  ; convert integer to single precision value
 1790 
 1791   "cvtps2pi" converts packed two single precision floating point values into
 1792 packed two double word integers and stores the result in the destination
 1793 operand, which should be a MMX register. The source operand can be a 64-bit
 1794 memory location or SSE register, only low quad word of SSE register is used.
 1795 "cvttps2pi" performs a similar operation, except that truncation is used to
 1796 round the source values to integers, rules for the operands are the same.
 1797 
 1798     cvtps2pi mm0,xmm0  ; convert single precision values to integers
 1799 
 1800   "cvtss2si" converts a single precision floating point value into a double
 1801 word integer and stores the result in the destination operand, which should be
 1802 a 32-bit general register. The source operand can be a 32-bit memory location
 1803 or SSE register, only low double word of SSE register is used. "cvttss2si"
 1804 performs a similar operation, except that truncation is used to round the
 1805 source value to an integer, rules for the operands are the same.
 1806 
 1807     cvtss2si eax,xmm0  ; convert single precision value to integer
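
        The difference between the two conversions shows up on a value like 2.7. In
      the sketch below "sample" is a hypothetical variable assumed to be defined
      as "sample dd 2.7", and MXCSR is assumed to hold its default rounding mode
      (round to nearest):

          cvtss2si eax,[sample]   ; eax = 3, rounded according to MXCSR
          cvttss2si edx,[sample]  ; edx = 2, always truncated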
 1808 
 1809   "pextrw" copies the word in the source operand specified by the third
 1810 operand to the destination operand. The source operand must be an MMX register,
 1811 the destination operand must be a 32-bit general register (the high word of the
 1812 destination is cleared), the third operand must be an 8-bit immediate value.
 1813 
 1814     pextrw eax,mm0,1   ; extract word into eax
 1815 
 1816   "pinsrw" inserts a word from the source operand in the destination operand
 1817 at the location specified with the third operand, which must be an 8-bit
 1818 immediate value. The destination operand must be a MMX register, the source
 1819 operand can be a 16-bit memory location or 32-bit general register (only low
 1820 word of the register is used).
 1821 
 1822     pinsrw mm1,ebx,2   ; insert word from ebx
 1823 
 1824   "pavgb" and "pavgw" compute the average of packed bytes or words. "pmaxub"
 1825 returns the maximum values of packed unsigned bytes, "pminub" returns the
 1826 minimum values of packed unsigned bytes, "pmaxsw" returns the maximum values
 1827 of packed signed words, "pminsw" returns the minimum values of packed signed
 1828 words. "pmulhuw" performs an unsigned multiplication of the packed words and
 1829 stores the high words of the results in the destination operand. "psadbw"
 1830 computes the absolute differences of packed unsigned bytes, sums the
 1831 differences, and stores the sum in the low word of destination operand. All
 1832 these instructions follow the same rules for operands as the general MMX
 1833 operations described in previous section.
 1834   "pmovmskb" creates a mask made of the most significant bit of each byte in
 1835 the source operand and stores the result in the low byte of destination
 1836 operand. The source operand must be an MMX register, the destination operand
 1837 must be a 32-bit general register.
 1838   "pshufw" inserts words from the source operand in the destination operand
 1839 from the locations specified with the third operand. The destination operand
 1840 must be an MMX register, the source operand can be a 64-bit memory location or
 1841 MMX register, the third operand must be an 8-bit immediate value selecting
 1842 which values will be moved into the destination operand, in a similar way as
 1843 the third operand of the "shufps" instruction.
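        Unlike "shufps", all four words are selected from the source operand. A
      brief sketch:

          pshufw mm0,mm1,00011011b ; mm0 = words of mm1 in reversed order
          pshufw mm2,mm2,0         ; copy the low word of mm2 into all four words
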
 1844   "movntq" moves the quad word from the source operand to memory using a
 1845 non-temporal hint to minimize cache pollution. The source operand should be a
 1846 MMX register, the destination operand should be a 64-bit memory location.
 1847 "movntps" stores packed single precision values from the SSE register to
 1848 memory using a non-temporal hint. The source operand should be a SSE register,
 1849 the destination operand should be a 128-bit memory location. "maskmovq" stores
 1850 selected bytes from the first operand into a 64-bit memory location using a
 1851 non-temporal hint. Both operands should be MMX registers, the second operand
 1852 selects which bytes from the first operand are written to memory. The
 1853 memory location is pointed to by the DI (or EDI) register in the segment
 1854 selected by DS.
 1855   "prefetcht0", "prefetcht1", "prefetcht2" and "prefetchnta" fetch the line
 1856 of data from memory that contains the byte specified with the operand to a
 1857 specified location in the cache hierarchy. The operand should be an 8-bit
 1858 memory location.
 1859   "sfence" performs a serializing operation on all instructions storing to
 1860 memory that were issued prior to it. This instruction has no operands.
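        A sketch of the usual pairing of non-temporal stores with "sfence",
      assuming that EDI holds an address aligned on a 16-byte boundary:

          movntps [edi],xmm0     ; store 16 bytes with non-temporal hint
          movntps [edi+16],xmm1  ; store the next 16 bytes
          sfence                 ; order these stores before any later stores
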
 1861   "ldmxcsr" loads the 32-bit memory operand into the MXCSR register. "stmxcsr"
 1862 stores the contents of MXCSR into a 32-bit memory operand.
 1863   "fxsave" saves the current state of the FPU, MXCSR register, and all the FPU
 1864 and SSE registers to a 512-byte memory location specified in the destination
 1865 operand. "fxrstor" reloads data previously stored with "fxsave" instruction
 1866 from the specified 512-byte memory location. The memory operand for both of
 1867 these instructions must be aligned on a 16-byte boundary, and it should be
 1868 declared with no size specified.
 1869 
 1870 
 1871 2.1.16  SSE2 instructions
 1872 
 1873 The SSE2 extension introduces the operations on packed double precision
 1874 floating point values, extends the syntax of MMX instructions, and adds also
 1875 some new instructions.
 1876   "movapd" and "movupd" transfer a double quad word operand containing packed
 1877 double precision values from source operand to destination operand. These
 1878 instructions are analogous to "movaps" and "movups" and have the same rules
 1879 for operands.
 1880   "movlpd" moves a double precision value between the memory and the low quad
 1881 word of SSE register. "movhpd" moves a double precision value between memory
 1882 and the high quad word of SSE register. These instructions are analogous to
 1883 "movlps" and "movhps" and have the same rules for operands.
 1884   "movmskpd" transfers the most significant bit of each of the two double
 1885 precision values in the SSE register into low two bits of a general register.
 1886 This instruction is analogous to "movmskps" and has the same rules for
 1887 operands.
 1888   "movsd" transfers a double precision value between source and destination
 1889 operand (only the low quad word is transferred). At least one of the operands
 1890 has to be an SSE register, the other one can also be an SSE register or 64-bit
 1891 memory location.
 1892   Arithmetic operations on double precision values are: "addpd", "addsd",
 1893 "subpd", "subsd", "mulpd", "mulsd", "divpd", "divsd", "sqrtpd", "sqrtsd",
 1894 "maxpd", "maxsd", "minpd", "minsd", and they are analogous to arithmetic
 1895 operations on single precision values described in previous section. When the
 1896 mnemonic ends with "pd" instead of "ps", the operation is performed on packed
 1897 two double precision values, but rules for operands are the same. When the
 1898 mnemonic ends with "sd" instead of "ss", the source operand can be a 64-bit
 1899 memory location or a SSE register, the destination operand must be a SSE
 1900 register and the operation is performed on double precision values, only low
 1901 quad words of SSE registers are used in this case.
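        A few of these instructions in use, as a sketch in which EBX is assumed to
      point to a double precision value:

          addsd xmm0,[ebx]   ; add double precision value from memory
          mulpd xmm1,xmm2    ; multiply two pairs of double precision values
          sqrtsd xmm3,xmm3   ; compute square root of low double precision value
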
 1902   "andpd", "andnpd", "orpd" and "xorpd" perform the logical operations on
 1903 packed double precision values. They are analogous to SSE logical operations
 1904 on single precision values and have the same rules for operands.
 1905   "cmppd" compares packed double precision values and returns a
 1906 mask result into the destination operand. This instruction is analogous to
 1907 "cmpps" and has the same rules for operands. "cmpsd" performs the same
 1908 operation on double precision values, only low quad word of destination
 1909 register is affected, in this case source operand can be a 64-bit memory or
 1910 SSE register. Variants with only two operands are obtained by attaching the
 1911 condition mnemonic from table 2.3 to the "cmp" mnemonic and then attaching
 1912 the "pd" or "sd" at the end.
 1913   "comisd" and "ucomisd" compare the double precision values and set the ZF,
 1914 PF and CF flags to show the result. The destination operand must be a SSE
 1915 register, the source operand can be a 64-bit memory location or SSE register.
 1916   "shufpd" moves any of the two double precision values from the destination
 1917 operand into the low quad word of the destination operand, and any of the two
 1918 values from the source operand into the high quad word of the destination
 1919 operand. This instruction is analogous to "shufps" and has the same rules for
 1920 operands. Bit 0 of the third operand selects the value to be moved from the
 1921 destination operand, bit 1 selects the value to be moved from the source
 1922 operand, the rest of bits are reserved and must be zeroed.
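        For example, as a brief sketch with the same register given as both
      operands:

          shufpd xmm0,xmm0,1 ; exchange the low and high quad words of xmm0
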
 1923   "unpckhpd" performs an unpack of the high quad words from the source and
 1924 destination operands, "unpcklpd" performs an unpack of the low quad words from
 1925 the source and destination operands. They are analogous to "unpckhps" and
 1926 "unpcklps", and have the same rules for operands.
 1927   "cvtps2pd" converts the packed two single precision floating point values to
 1928 two packed double precision floating point values, the destination operand
 1929 must be a SSE register, the source operand can be a 64-bit memory location or
 1930 SSE register. "cvtpd2ps" converts the packed two double precision floating
 1931 point values to packed two single precision floating point values, the
 1932 destination operand must be a SSE register, the source operand can be a
 1933 128-bit memory location or SSE register. "cvtss2sd" converts the single
 1934 precision floating point value to double precision floating point value, the
 1935 destination operand must be a SSE register, the source operand can be a 32-bit
 1936 memory location or SSE register. "cvtsd2ss" converts the double precision
 1937 floating point value to single precision floating point value, the destination
 1938 operand must be a SSE register, the source operand can be 64-bit memory
 1939 location or SSE register.
 1940   "cvtpi2pd" converts two packed double word integers into two packed
 1941 double precision floating point values, the destination operand must be a SSE
 1942 register, the source operand can be a 64-bit memory location or MMX register.
 1943 "cvtsi2sd" converts a double word integer into a double precision floating
 1944 point value, the destination operand must be a SSE register, the source
 1945 operand can be a 32-bit memory location or 32-bit general register. "cvtpd2pi"
 1946 converts packed double precision floating point values into packed two double
 1947 word integers, the destination operand should be a MMX register, the source
 1948 operand can be a 128-bit memory location or SSE register. "cvttpd2pi" performs
 1949 a similar operation, except that truncation is used to round the source values
 1950 to integers, rules for operands are the same. "cvtsd2si" converts a double
 1951 precision floating point value into a double word integer, the destination
 1952 operand should be a 32-bit general register, the source operand can be a
 1953 64-bit memory location or SSE register. "cvttsd2si" performs the similar
 1954 operation, except that truncation is used to round a source value to integer,
 1955 rules for operands are the same.
 1956   "cvtps2dq" and "cvttps2dq" convert packed single precision floating point
 1957 values to packed four double word integers, storing them in the destination
 1958 operand. "cvtpd2dq" and "cvttpd2dq" convert packed double precision floating
 1959 point values to packed two double word integers, storing the result in the low
 1960 quad word of the destination operand. "cvtdq2ps" converts packed four
 1961 double word integers to packed single precision floating point values.
 1962 For all these instructions destination operand must be a SSE register, the
 1963 source operand can be a 128-bit memory location or SSE register.
 1964 "cvtdq2pd" converts packed two double word integers from the source operand to
 1965 packed double precision floating point values, the source can be a 64-bit 
 1966 memory location or SSE register, destination has to be SSE register.
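        A sketch of a round trip through floating point, scaling the four integers
      in xmm0 by the factors assumed to be held in xmm1:

          cvtdq2ps xmm0,xmm0   ; convert four integers to single precision
          mulps xmm0,xmm1      ; multiply by the factors
          cvttps2dq xmm0,xmm0  ; convert back to integers using truncation
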
 1967   "movdqa" and "movdqu" transfer a double quad word operand containing packed
 1968 integers from source operand to destination operand. At least one of the
 1969 operands have to be a SSE register, the second one can be also a SSE register
 1970 or 128-bit memory location. Memory operands for "movdqa" instruction must be
 1971 aligned on boundary of 16 bytes, operands for "movdqu" instruction don't have
 1972 to be aligned.
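        A sketch of copying 16 bytes, where ESI may hold any address, while EDI is
      assumed to be aligned on a 16-byte boundary:

          movdqu xmm0,[esi]  ; load from possibly unaligned address
          movdqa [edi],xmm0  ; store to aligned address
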
 1973   "movq2dq" moves the contents of the MMX source register to the low quad word
 1974 of destination SSE register. "movdq2q" moves the low quad word from the source
 1975 SSE register to the destination MMX register.
 1976 
 1977     movq2dq xmm0,mm1   ; move from MMX register to SSE register
 1978     movdq2q mm0,xmm1   ; move from SSE register to MMX register
 1979 
 1980   All MMX instructions operating on the 64-bit packed integers (those with
 1981 mnemonics starting with "p") are extended to operate on 128-bit packed
 1982 integers located in SSE registers. Additional syntax for these instructions
 1983 needs an SSE register where MMX register was needed, and the 128-bit memory
 1984 location or SSE register where 64-bit memory location or MMX register were
 1985 needed. The exception is "pshufw" instruction, which doesn't allow extended
 1986 syntax, but has two new variants: "pshufhw" and "pshuflw", which allow only
 1987 the extended syntax, and perform the same operation as "pshufw" on the high
 1988 or low quad words of operands respectively. Also the new instruction "pshufd"
 1989 is introduced, which performs the same operation as "pshufw", but on the
 1990 double words instead of words, it allows only the extended syntax.
 1991 
 1992     psubb xmm0,[esi]   ; subtract 16 packed bytes
 1993     pextrw eax,xmm0,7  ; extract highest word into eax
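
        The new shuffle instructions in a short sketch (registers are chosen
      arbitrarily):

          pshufd xmm0,xmm1,0          ; copy low double word of xmm1 into all four
          pshuflw xmm2,xmm2,00011011b ; reverse the four low words of xmm2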
 1994 
 1995   "paddq" performs the addition of packed quad words, "psubq" performs the
 1996 subtraction of packed quad words, "pmuludq" performs an unsigned
 1997 multiplication of low double words from each corresponding quad words and
 1998 returns the results in packed quad words. These instructions follow the same
 1999 rules for operands as the general MMX operations described in 2.1.14.
 2000   "pslldq" and "psrldq" perform logical shift left or right of the double
 2001 quad word in the destination operand by the amount of bytes specified in the
 2002 source operand. The destination operand should be a SSE register, source
 2003 operand should be an 8-bit immediate value.
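        Note that the amount counts bytes, not bits. A brief sketch:

          pslldq xmm0,4  ; shift xmm0 left by 4 bytes (one double word)
          psrldq xmm1,8  ; move high quad word of xmm1 to low one, clear the high
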
 2004   "punpckhqdq" interleaves the high quad word of the source operand and the
 2005 high quad word of the destination operand and writes them to the destination
 2006 SSE register. "punpcklqdq" interleaves the low quad word of the source operand
 2007 and the low quad word of the destination operand and writes them to the
 2008 destination SSE register. The source operand can be a 128-bit memory location
 2009 or SSE register.
 2010   "movntdq" stores packed integer data from the SSE register to memory using
 2011 non-temporal hint. The source operand should be a SSE register, the
 2012 destination operand should be a 128-bit memory location. "movntpd" stores
 2013 packed double precision values from the SSE register to memory using a
 2014 non-temporal hint. Rules for operands are the same. "movnti" stores an integer
 2015 from a general register to memory using a non-temporal hint. The source
 2016 operand should be a 32-bit general register, the destination operand should
 2017 be a 32-bit memory location. "maskmovdqu" stores selected bytes from the first
 2018 operand into a 128-bit memory location using a non-temporal hint. Both
 2019 operands should be SSE registers, the second operand selects which bytes from
 2020 the first operand are written to memory. The memory location is pointed to by
 2021 the DI (or EDI) register in the segment selected by DS and does not need to be
 2022 aligned.
 2023   "clflush" writes and invalidates the cache line associated with the address
 2024 of byte specified with the operand, which should be an 8-bit memory location.
 2025   "lfence" performs a serializing operation on all instruction loading from
 2026 memory that were issued prior to it. "mfence" performs a serializing operation
 2027 on all instruction accesing memory that were issued prior to it, and so it
 2028 combines the functions of "sfence" (described in previous section) and
 2029 "lfence" instructions. These instructions have no operands.
 2030 
 2031 
 2032 2.1.17  SSE3 instructions
 2033 
 2034 Prescott technology introduced some new instructions to improve the performance
 2035 of SSE and SSE2 - this extension is called SSE3.
 2036   "fisttp" behaves like the "fistp" instruction and accepts the same operands,
 2037 the only difference is that it always uses truncation, irrespective of the
 2038 rounding mode.
 2039   "movshdup" loads into destination operand the 128-bit value obtained from
 2040 the source value of the same size by filling each quad word with the two
 2041 duplicates of the value in its high double word. "movsldup" performs the same
 2042 action, except it duplicates the values of low double words. The destination
 2043 operand should be SSE register, the source operand can be SSE register or
 2044 128-bit memory location.
 2045   "movddup" loads the 64-bit source value and duplicates it into high and low
 2046 quad word of the destination operand. The destination operand should be SSE
 2047 register, the source operand can be SSE register or 64-bit memory location.
 2048   "lddqu" is functionally equivalent to "movdqu" with memory as source 
 2049 operand, but it may improve performance when the source operand crosses a 
 2050 cacheline boundary. The destination operand has to be SSE register, the source
 2051 operand must be 128-bit memory location.
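        For example, with the four single precision values in XMM1 denoted
      (from the highest to the lowest) as d, c, b and a, the instructions above
      give the following results (a sketch, ESI is an arbitrary choice of
      address register):

          movshdup xmm0,xmm1       ; xmm0 = d, d, b, b
          movsldup xmm0,xmm1       ; xmm0 = c, c, a, a
          movddup xmm2,qword [esi] ; both quad words of xmm2 = 64 bits at [esi]
          lddqu xmm3,[esi+1]       ; load 128 bits from unaligned address
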
 2052   "addsubps" performs single precision addition of second and fourth pairs and
 2053 single precision subtraction of the first and third pairs of floating point
 2054 values in the operands. "addsubpd" performs double precision addition of the
 2055 second pair and double precision subtraction of the first pair of floating
 2056 point values in the operands. "haddps" performs the addition of two single
 2057 precision values within each quad word of source and destination operands,
 2058 and stores the results of such horizontal addition of values from destination
 2059 operand into low quad word of destination operand, and the results from the
 2060 source operand into high quad word of destination operand. "haddpd" performs
 2061 the addition of two double precision values within each operand, and stores
 2062 the result from destination operand into low quad word of destination operand,
 2063 and the result from source operand into high quad word of destination operand.
 2064 All these instructions need the destination operand to be SSE register, source
 2065 operand can be SSE register or 128-bit memory location.
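        To illustrate, let the destination XMM0 hold the single precision values
      a3, a2, a1, a0 and the source XMM1 hold b3, b2, b1, b0 (listed from the
      highest to the lowest). Each of the following instructions, executed on
      such initial values, would give (a sketch, the registers are arbitrary):

          addsubps xmm0,xmm1    ; xmm0 = a3+b3, a2-b2, a1+b1, a0-b0
          haddps xmm0,xmm1      ; xmm0 = b3+b2, b1+b0, a3+a2, a1+a0
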
 2066   "monitor" sets up an address range for monitoring of write-back stores. It
 2067 needs its three operands to be the EAX, ECX and EDX registers, in that order.
 2068 "mwait" waits for a write-back store to the address range set up by the
 2069 "monitor" instruction. It uses two operands with additional parameters, first
 2070 being the EAX and second the ECX register.
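        A minimal sketch of the simplest sequence, assuming that "flag" is a
      hypothetical label of the monitored variable and that no extensions or
      hints are requested (all the parameter registers are zeroed):

          mov eax,flag          ; address of the range to monitor
          xor ecx,ecx           ; no extensions
          xor edx,edx           ; no hints
          monitor eax,ecx,edx
          xor eax,eax           ; no hints
          mwait eax,ecx         ; wait for a store to the monitored range
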
 2071   The functionality of SSE3 is further extended by the set of Supplemental
 2072 SSE3 instructions (SSSE3). They generally follow the same rules for operands
 2073 as all the MMX operations extended by SSE.
 2074   "phaddw" and "phaddd" perform the horizontal addition of the pairs of
 2075 adjacent values from both the source and destination operand, and store the
 2076 sums into the destination (sums from the source operand go into higher part of
 2077 destination register). They operate on 16-bit or 32-bit chunks, respectively.
 2078 "phaddsw" performs the same operation on signed 16-bit packed values, but the
 2079 result of each addition is saturated. "phsubw" and "phsubd" analogously
 2080 perform the horizontal subtraction of 16-bit or 32-bit packed values, and
 2081 "phsubsw" performs the horizontal subtraction of signed 16-bit packed values
 2082 with saturation.
 2083   "pabsb", "pabsw" and "pabsd" calculate the absolute value of each packed
 2084 signed value in source operand and store them into the destination
 2085 register. They operate on 8-bit, 16-bit and 32-bit elements respectively.
 2086   "pmaddubsw" multiplies signed 8-bit values from the source operand with the
 2087 corresponding unsigned 8-bit values from the destination operand to produce
 2088 intermediate 16-bit values, and every adjacent pair of those intermediate
 2089 values is then added horizontally and those 16-bit sums are stored into the
 2090 destination operand.
 2091   "pmulhrsw" multiplies corresponding 16-bit integers from the source and
 2092 destination operand to produce intermediate 32-bit values, and the 16 bits
 2093 next to the highest bit of each of those values are then rounded and packed
 2094 into the destination operand.
 2095   "pshufb" shuffles the bytes in the destination operand according to the
 2096 mask provided by source operand - each byte of the mask holds the index of
 2097 the original destination byte that is copied into the corresponding position.
 2098   "psignb", "psignw" and "psignd" perform the operation on 8-bit, 16-bit or
 2099 32-bit integers in destination operand, depending on the signs of the values
 2100 in the source. If the value in source is negative, the corresponding value in
 2101 the destination register is negated, if the value in source is positive, no
 2102 operation is performed on the corresponding value, and if the value in
 2103 source is zero, the value in destination is zeroed, too.
 2104   "palignr" appends the source operand to the destination operand to form the
 2105 intermediate value of twice the size, and then extracts into the destination
 2106 register the 64 or 128 bits that are right-aligned to the byte offset
 2107 specified by the third operand, which should be an 8-bit immediate value. This
 2108 is the only SSSE3 instruction that takes three arguments.
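        Some samples of SSSE3 instructions, given only as a sketch (the "reverse"
      label and its data are hypothetical, defined just for this example):

          phaddw xmm0,xmm1      ; sums of xmm0 pairs in low half, xmm1 in high
          pabsd xmm2,xmm3       ; xmm2 = absolute values of double words in xmm3
          pshufb xmm4,[reverse] ; reverse the order of bytes in xmm4
          palignr xmm5,xmm6,4   ; low 12 bytes of xmm5 = bytes 4-15 of xmm6,
                                ; high 4 bytes of xmm5 = old bytes 0-3 of xmm5

          align 16              ; data, to be placed outside of executed code
          reverse db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0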
 2109 
 2110 
 2111 2.1.18  AMD 3DNow! instructions
 2112 
 2113 The 3DNow! extension adds new MMX instructions to those described in 2.1.14,
 2114 and introduces operations on the 64-bit packed floating point values, each
 2115 consisting of two single precision floating point values.
 2116   These instructions follow the same rules as the general MMX operations, the
 2117 destination operand should be a MMX register, the source operand can be a MMX
 2118 register or 64-bit memory location. "pavgusb" computes the rounded averages
 2119 of packed unsigned bytes. "pmulhrw" performs a signed multiplication of the
 2120 packed words, rounds the high word of each double word result and stores them
 2121 in the destination operand. "pi2fd" converts packed double word integers into
 2122 packed floating point values. "pf2id" converts packed floating point values
 2123 into packed double word integers using truncation. "pi2fw" converts packed
 2124 word integers into packed floating point values, only low words of each
 2125 double word in source operand are used. "pf2iw" converts packed floating
 2126 point values to packed word integers, results are extended to double words
 2127 using the sign extension. "pfadd" adds packed floating point values. "pfsub"
 2128 and "pfsubr" subtract packed floating point values, the first one subtracts
 2129 source values from destination values, the second one subtracts destination
 2130 values from the source values. "pfmul" multiplies packed floating point
 2131 values. "pfacc" adds the low and high floating point values of the destination
 2132 operand, storing the result in the low double word of destination, and adds
 2133 the low and high floating point values of the source operand, storing the
 2134 result in the high double word of destination. "pfnacc" subtracts the high
 2135 floating point value of the destination operand from the low, storing the
 2136 result in the low double word of destination, and subtracts the high floating
 2137 point value of the source operand from the low, storing the result in the high
 2138 double word of destination. "pfpnacc" subtracts the high floating point value
 2139 of the destination operand from the low, storing the result in the low double
 2140 word of destination, and adds the low and high floating point values of the
 2141 source operand, storing the result in the high double word of destination.
 2142 "pfmax" and "pfmin" compute the maximum and minimum of floating point values.
 2143 "pswapd" reverses the high and low double word of the source operand. "pfrcp"
 2144 returns an estimate of the reciprocal of the low floating point value from the
 2145 source operand, "pfrsqrt" returns an estimate of the reciprocal square root
 2146 of the low floating point value from the source operand, "pfrcpit1" performs
 2147 the first step in the Newton-Raphson iteration to refine the reciprocal
 2148 approximation produced by "pfrcp" instruction, "pfrsqit1" performs the first
 2149 step in the Newton-Raphson iteration to refine the reciprocal square root
 2150 approximation produced by "pfrsqrt" instruction, "pfrcpit2" performs the
 2151 second and final step in the Newton-Raphson iteration to refine the reciprocal
 2152 approximation or the reciprocal square root approximation. "pfcmpeq",
 2153 "pfcmpge" and "pfcmpgt" compare the packed floating point values and set
 2154 all bits or zero all bits of the corresponding data element in the
 2155 destination operand according to the result of comparison, first checks
 2156 whether values are equal, second checks whether destination value is greater
 2157 or equal to source value, third checks whether destination value is greater
 2158 than source value.
 2159   "prefetch" and "prefetchw" load the line of data from memory that contains
 2160 byte specified with the operand into the data cache, "prefetchw" instruction
 2161 should be used when the data in the cache line is expected to be modified,
 2162 otherwise the "prefetch" instruction should be used. The operand should be an
 2163 8-bit memory location.
 2164   "femms" performs a fast clear of MMX state. This instruction has no
 2165 operands.
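        The Newton-Raphson refinement of reciprocal described above is usually
      coded as a sequence like the one below; this is a sketch, in which "x" is
      a hypothetical label of a single precision value in memory:

          movd mm0,dword [x]    ; low double word of mm0 = x
          pfrcp mm1,mm0         ; mm1 = estimate of 1/x
          pfrcpit1 mm0,mm1      ; first step of refinement
          pfrcpit2 mm0,mm1      ; second step, mm0 = refined 1/x
          femms                 ; clear MMX state before any FPU code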
 2166 
 2167 
 2168 2.1.19  The x86-64 long mode instructions
 2169 
 2170 The AMD64 and EM64T architectures (we will use the common name x86-64 for them
 2171 both) extend the x86 instruction set for the 64-bit processing. While legacy
 2172 and compatibility modes use the same set of registers and instructions, the
 2173 new long mode extends the x86 operations to 64 bits and introduces several new
 2174 registers. You can turn on generating the code for this mode with the "use64"
 2175 directive.
 2176   Each of the general purpose registers is extended to 64 bits, and eight
 2177 entirely new general purpose registers and eight new SSE registers are added.
 2178 See table 2.4 for the summary of new registers (only the ones that were not
 2179 listed in table 1.2). The general purpose registers of smaller sizes are the
 2180 low order portions of the larger ones. You can still access the "ah", "bh",
 2181 "ch" and "dh" registers in long mode, but you cannot use them in the same
 2182 instruction with any of the new registers.
 2183 
 2184    Table 2.4  New registers in long mode
 2185   /--------------------------------------------------\
 2186   | Type |          General          |  SSE  |  AVX  |
 2187   |------|---------------------------|-------|-------|
 2188   | Bits |  8   |  16  |  32  |  64  |  128  |  256  |
 2189   |======|======|======|======|======|=======|=======|
 2190   |      |      |      |      | rax  |       |       |
 2191   |      |      |      |      | rcx  |       |       |
 2192   |      |      |      |      | rdx  |       |       |
 2193   |      |      |      |      | rbx  |       |       |
 2194   |      | spl  |      |      | rsp  |       |       |
 2195   |      | bpl  |      |      | rbp  |       |       |
 2196   |      | sil  |      |      | rsi  |       |       |
 2197   |      | dil  |      |      | rdi  |       |       |
 2198   |      | r8b  | r8w  | r8d  | r8   | xmm8  | ymm8  |
 2199   |      | r9b  | r9w  | r9d  | r9   | xmm9  | ymm9  |
 2200   |      | r10b | r10w | r10d | r10  | xmm10 | ymm10 |
 2201   |      | r11b | r11w | r11d | r11  | xmm11 | ymm11 |
 2202   |      | r12b | r12w | r12d | r12  | xmm12 | ymm12 |
 2203   |      | r13b | r13w | r13d | r13  | xmm13 | ymm13 |
 2204   |      | r14b | r14w | r14d | r14  | xmm14 | ymm14 |
 2205   |      | r15b | r15w | r15d | r15  | xmm15 | ymm15 |
 2206   \--------------------------------------------------/
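
        A few samples of instructions using the new registers, given as a sketch
      only (the last line is commented out, because it would not assemble):

          use64
          mov r8,rax            ; 64-bit registers
          mov r9d,eax           ; low 32 bits of r9
          mov r10b,al           ; low 8 bits of r10
          mov bl,ah             ; allowed, no new register is involved
          ; mov sil,ah          ; error, "ah" together with a new register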
 2207 
 2208   In general any instruction from x86 architecture, which allowed 16-bit or
 2209 32-bit operand sizes, in long mode allows also the 64-bit operands. The 64-bit
 2210 registers should be used for addressing in long mode, the 32-bit addressing
 2211 is also allowed, but it's not possible to use the addresses based on 16-bit
 2212 registers. Below are the samples of new operations possible in long mode on the
 2213 example of "mov" instruction:
 2214 
 2215     mov rax,r8   ; transfer 64-bit general register
 2216     mov al,[rbx] ; transfer memory addressed by 64-bit register
 2217 
 2218 The long mode uses also the instruction pointer based addresses, you can
 2219 specify it manually with the special RIP register symbol, but such addressing
 2220 is also automatically generated by flat assembler, since there is no 64-bit
 2221 absolute addressing in long mode. You can still force the assembler to use the
 2222 32-bit absolute addressing by putting the "dword" size override for address
 2223 inside the square brackets. There is also one exception, where the 64-bit
 2224 absolute addressing is possible, it's the "mov" instruction with one of the
 2225 operands being the accumulator register and the other being a memory operand.
 2226 To force the assembler to use the 64-bit absolute addressing there, use the
 2227 "qword" size operator for address inside the square brackets. When no size
 2228 operator is applied to address, assembler generates the optimal form
 2229 automatically.
 2230 
 2231     mov [qword 0],rax  ; absolute 64-bit addressing
 2232     mov [dword 0],r15d ; absolute 32-bit addressing
 2233     mov [0],rsi        ; automatic RIP-relative addressing
 2234     mov [rip+3],sil    ; manual RIP-relative addressing
 2235 
 2236   Also as the immediate operands for 64-bit operations only the signed 32-bit
 2237 values are possible, with the only exception being the "mov" instruction with
 2238 destination operand being 64-bit general purpose register. Trying to force the
 2239 64-bit immediate with any other instruction will cause an error.
 2240   If any operation is performed on the 32-bit general registers in long mode,
 2241 the upper 32 bits of the 64-bit registers containing them are filled with
 2242 zeros. This is unlike the operations on 16-bit or 8-bit portions of those
 2243 registers, which preserve the upper bits.
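        These rules can be illustrated with the following sketch, assuming that
      the code is assembled for long mode (the values are arbitrary):

          mov rax,0123456789ABCDEFh ; 64-bit immediate, possible only with "mov"
          add rax,-1                ; 32-bit immediate sign-extended to 64 bits
          mov rcx,rax               ; rcx = 0123456789ABCDEEh
          mov ecx,-1                ; rcx = 00000000FFFFFFFFh, upper bits zeroed
          mov cx,0                  ; rcx = 00000000FFFF0000h, upper bits kept
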
 2244   Three new type conversion instructions are available. The "cdqe" sign 
 2245 extends the double word in EAX into quad word and stores the result in RAX 
 2246 register. "cqo" sign extends the quad word in RAX into double quad word and 
 2247 stores the extra bits in the RDX register. These instructions have no 
 2248 operands. "movsxd" sign extends the double word source operand, being either
 2249 the 32-bit register or memory, into 64-bit destination operand, which has to
 2250 be register. No analogous instruction is needed for the zero extension, since
 2251 it is done automatically by any operations on 32-bit registers, as noted in
 2252 previous paragraph. And the "movzx" and "movsx" instructions, conforming to
 2253 the general rule, can be used with 64-bit destination operand, allowing
 2254 extension of byte or word values into quad words.
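        A sketch of these conversions (the registers are chosen arbitrarily):

          movsxd rax,ecx        ; rax = ecx sign-extended to 64 bits
          mov eax,ecx           ; rax = ecx zero-extended to 64 bits
          movzx rdx,byte [rbx]  ; rdx = byte from memory zero-extended
          movsx rdx,cx          ; rdx = cx sign-extended to 64 bits
          cdqe                  ; rax = eax sign-extended to 64 bits
          cqo                   ; rdx:rax = rax sign-extended to 128 bits
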
 2255   All the binary arithmetic and logical instructions have been promoted to
 2256 allow 64-bit operands in long mode. The use of decimal arithmetic instructions
 2257 in long mode is prohibited.
 2258   The stack operations, like "push" and "pop" in long mode default to 64-bit
 2259 operands and it's not possible to use 32-bit operands with them. The "pusha"
 2260 and "popa" are disallowed in long mode.
 2261   The indirect near jumps and calls in long mode default to 64-bit operands
 2262 and it's not possible to use the 32-bit operands with them. On the other hand,
 2263 the indirect far jumps and calls allow any operands that were allowed by the 
 2264 x86 architecture and also 80-bit memory operand is allowed (though only EM64T
 2265 seems to implement such variant), with the first eight bytes defining the 
 2266 offset and two last bytes specifying the selector. The direct far jumps and 
 2267 calls are not allowed in long mode.
 2268   The I/O instructions, "in", "out", "ins" and "outs" are the exceptional
 2269 instructions that are not extended to accept quad word operands in long mode.
 2270 But all other string operations are, and there are new short forms "movsq",
 2271 "cmpsq", "scasq", "lodsq" and "stosq" introduced for the variants of string
 2272 operations for 64-bit string elements. The RSI and RDI registers are used by
 2273 default to address the string elements.
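        For example, a block of quad words could be copied in long mode with a
      sequence like the one below; it is a sketch, in which "source" and
      "target" are hypothetical labels of two buffers:

          lea rsi,[source]      ; address of source elements
          lea rdi,[target]      ; address of destination elements
          mov ecx,16            ; count of elements (rcx = 16)
          cld                   ; process the strings upwards
          rep movsq             ; copy 16 quad words
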
 2274   The "lfs", "lgs" and "lss" instructions are extended to accept 80-bit source
 2275 memory operand with 64-bit destination register (though only EM64T seems to
 2276 implement such variant). The "lds" and "les" are disallowed in long mode.
 2277   The system instructions like "lgdt" which required the 48-bit memory operand,
 2278 in long mode require the 80-bit memory operand.
 2279   The "cmpxchg16b" is the 64-bit equivalent of "cmpxchg8b" instruction, it uses
 2280 the double quad word memory operand and 64-bit registers to perform the
 2281 analogous operation.
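        A sketch of its use, assuming that RDI points to a 16-byte aligned
      memory location:

          lock cmpxchg16b dqword [rdi] ; if [rdi] = rdx:rax then [rdi] = rcx:rbx
                                       ; (ZF set), else rdx:rax = [rdi] (ZF clear)
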
 2282   The "fxsave64" and "fxrstor64" are new variants of "fxsave" and "fxrstor"
 2283 instructions, available only in long mode, which use a different format of
 2284 storage area in order to store some pointers in full 64-bit size.  
 2285   "swapgs" is the new instruction, which swaps the contents of GS register and
 2286 the KernelGSbase model-specific register (MSR address 0C0000102h).
 2287   "syscall" and "sysret" are the pair of new instructions that provide the
 2288 functionality similar to "sysenter" and "sysexit" in long mode, where the
 2289 latter pair is disallowed. The "sysexitq" and "sysretq" mnemonics provide the
 2290 64-bit versions of "sysexit" and "sysret" instructions.
 2291   The "rdmsrq" and "wrmsrq" mnemonics are the 64-bit variants of the "rdmsr"
 2292 and "wrmsr" instructions.
 2293 
 2294 
 2295 2.1.20  SSE4 instructions
 2296 
 2297 There are actually three different sets of instructions under the name SSE4.
 2298 Intel designed two of them, SSE4.1 and SSE4.2, with latter extending the
 2299 former into the full Intel's SSE4 set. On the other hand, the implementation
 2300 by AMD includes only a few instructions from this set, but also contains
 2301 some additional instructions, that are called the SSE4a set.
 2302   The SSE4.1 instructions mostly follow the same rules for operands, as
 2303 the basic SSE operations, so they require destination operand to be SSE
 2304 register and source operand to be 128-bit memory location or SSE register,
 2305 and some operations require a third operand, the 8-bit immediate value.
 2306   "pmulld" performs a signed multiplication of the packed double words and
 2307 stores the low double words of the results in the destination operand.
 2308 "pmuldq" performs two signed multiplications of the corresponding double
 2309 words in the lower quad words of operands, and stores the results as
 2310 packed quad words into the destination register. "pminsb" and "pmaxsb"
 2311 return the minimum or maximum values of packed signed bytes, "pminuw" and
 2312 "pmaxuw" return the minimum and maximum values of packed unsigned words,
 2313 "pminud", "pmaxud", "pminsd" and "pmaxsd" return minimum or maximum values
 2314 of packed unsigned or signed double words. These instructions complement the
 2315 instructions computing packed minimum or maximum introduced by SSE.
 2316   "ptest" sets the ZF flag to one when the result of bitwise AND of both
 2317 operands is zero, and zeroes the ZF otherwise. It also sets the CF flag to
 2318 one when the result of bitwise AND of the source operand with the bitwise
 2319 NOT of the destination operand is zero, and zeroes the CF otherwise.
 2320 "pcmpeqq" compares packed quad words for equality, and fills the
 2321 corresponding elements of destination operand with either ones or zeros,
 2322 depending on the result of comparison.
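        For instance "ptest" with the same register given as both operands can
      be used to check whether the register is zero (a sketch, "all_zero" is a
      hypothetical label):

          ptest xmm0,xmm0       ; ZF = 1 only if all bits of xmm0 are zero
          jz all_zero
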
 2323   "packusdw" converts packed signed double words from both the source and
 2324 destination operand into the unsigned words using saturation, and stores
 2325 the eight resulting word values into the destination register.
 2326   "phminposuw" finds the minimum unsigned word value in source operand and
 2327 places it into the lowest word of destination operand, stores the index of
 2328 that word in the next three bits, and sets the remaining upper bits to zero.
 2329   "roundps", "roundss", "roundpd" and "roundsd" perform the rounding of packed
 2330 or individual floating point value of single or double precision, using the
 2331 rounding mode specified by the third operand.
 2332 
 2333     roundsd xmm0,xmm1,0011b ; round toward zero
 2334 
 2335   "dpps" calculates dot product of packed single precision floating point
 2336 values, that is it multiplies the corresponding pairs of values from source and
 2337 destination operand and then sums the products up. The high four bits of the
 2338 8-bit immediate third operand control which products are calculated and taken
 2339 to the sum, and the low four bits control, into which elements of destination
 2340 the resulting dot product is copied (the other elements are filled with zero).
 2341 "dppd" calculates dot product of packed double precision floating point values.
 2342 The bits 4 and 5 of third operand control, which products are calculated and
 2343 added, and bits 0 and 1 of this value control, which elements in destination
 2344 register should get filled with the result. "mpsadbw" calculates multiple sums
 2345 of absolute differences of unsigned bytes. The third operand controls, with
 2346 value in bits 0-1, which of the four-byte blocks in source operand is taken to
 2347 calculate the absolute differences, and with value in bit 2, at which of the
 2348 first two four-byte blocks in destination operand to start calculating multiple
 2349 sums. The sum is calculated from four absolute differences between the
 2350 corresponding unsigned bytes in the source and destination block, and each next
 2351 sum is calculated in the same way, but taking the four bytes from destination
 2352 at the position one byte after the position of previous block. The four bytes
 2353 from the source stay the same each time. This way eight sums of absolute
 2354 differences are calculated and stored as packed word values into the
 2355 destination operand. The instructions described in this paragraph follow the
 2356 same rules for operands, as "roundps" instruction.
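        Some samples of the third operand of these instructions (a sketch, the
      masks are chosen only for illustration):

          dpps xmm0,xmm1,01110001b ; sum products of three lowest pairs, store
                                   ; result in lowest element, zero the others
          dppd xmm0,xmm1,00110011b ; sum both products, store in both elements
          mpsadbw xmm0,xmm1,101b   ; use second block of source, start at
                                   ; second block of destination
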
 2357   "blendps", "blendvps", "blendpd" and "blendvpd" conditionally copy the
 2358 values from source operand into the destination operand, depending on the bits
 2359 of the mask provided by third operand. If a mask bit is set, the corresponding
 2360 element of source is copied into the same place in destination, otherwise this
 2361 position in destination is left unchanged. The rules for the first two operands
 2362 are the same, as for general SSE instructions. "blendps" and "blendpd" need
 2363 third operand to be 8-bit immediate, and they operate on single or double
 2364 precision values, respectively. "blendvps" and "blendvpd" require third operand
 2365 to be the XMM0 register.
 2366 
 2367     blendvps xmm3,xmm7,xmm0 ; blend according to mask
 2368 
 2369   "pblendw" conditionally copies word elements from the source operand into the
 2370 destination, depending on the bits of mask provided by third operand, which
 2371 needs to be 8-bit immediate value. "pblendvb" conditionally copies byte
 2372 elements from the source operand into destination, depending on mask defined
 2373 by the third operand, which has to be XMM0 register. These instructions follow
 2374 the same rules for operands as "blendps" and "blendvps" instructions,
 2375 respectively.
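        A sketch of these instructions (the mask value is arbitrary):

          pblendw xmm1,xmm2,00001111b ; copy the four low words from xmm2
          pblendvb xmm1,xmm2,xmm0     ; copy the bytes for which the highest bit
                                      ; of corresponding byte in xmm0 is set
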
 2376   "insertps" inserts a single precision floating point value taken from the
 2377 position in source operand specified by bits 6-7 of third operand into location
 2378 in destination register selected by bits 4-5 of third operand. Additionally,
 2379 the low four bits of third operand control, which elements in destination
 2380 register will be set to zero. The first two operands follow the same rules as
 2381 for the general SSE operation, the third operand should be 8-bit immediate.
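        For example (a sketch, the value of third operand is chosen only for
      illustration):

          insertps xmm0,xmm1,00011000b ; copy element 0 of xmm1 into element 1
                                       ; of xmm0 and zero element 3 of xmm0
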
 2382   "extractps" extracts a single precision floating point value taken from the
 2383 location in source operand specified by low two bits of third operand, and
 2384 stores it into the destination operand. The destination can be a 32-bit memory
 2385 value or general purpose register, the source operand must be SSE register,
 2386 and the third operand should be 8-bit immediate value.
 2387 
 2388     extractps edx,xmm3,3 ; extract the highest value
 2389 
 2390   "pinsrb", "pinsrd" and "pinsrq" copy a byte, double word or quad word from
 2391 the source operand into the location of destination operand determined by the
 2392 third operand. The destination operand has to be SSE register, the source
 2393 operand can be a memory location of appropriate size, or the 32-bit general
 2394 purpose register (but 64-bit general purpose register for "pinsrq", which is
 2395 only available in long mode), and the third operand has to be 8-bit immediate
 2396 value. These instructions complement the "pinsrw" instruction operating on SSE
 2397 register destination, which was introduced by SSE2.
 2398 
 2399     pinsrd xmm4,eax,1 ; insert double word into second position
 2400 
 2401   "pextrb", "pextrw", "pextrd" and "pextrq" copy a byte, word, double word or
 2402 quad word from the location in source operand specified by third operand, into
 2403 the destination. The source operand should be SSE register, the third operand
 2404 should be 8-bit immediate, and the destination operand can be memory location
 2405 of appropriate size, or the 32-bit general purpose register (but 64-bit general
 2406 purpose register for "pextrq", which is only available in long mode). The
 2407 "pextrw" instruction with SSE register as source was already introduced by
 2408 SSE2, but SSE4 extends it to allow memory operand as destination.
 2409 
 2410     pextrw [ebx],xmm3,7 ; extract highest word into memory
 2411 
 2412   "pmovsxbw" and "pmovzxbw" perform sign extension or zero extension of eight 
 2413 byte values from the source operand into packed word values in destination 
 2414 operand, which has to be SSE register. The source can be 64-bit memory or SSE 
 2415 register - when it is register, only its low portion is used. "pmovsxbd" and 
 2416 "pmovzxbd" perform sign extension or zero extension of the four byte values 
 2417 from the source operand into packed double word values in destination operand, 
 2418 the source can be 32-bit memory or SSE register. "pmovsxbq" and "pmovzxbq" 
 2419 perform sign extension or zero extension of the two byte values from the 
 2420 source operand into packed quad word values in destination operand, the source
 2421 can be 16-bit memory or SSE register. "pmovsxwd" and "pmovzxwd" perform sign
 2422 extension or zero extension of the four word values from the source operand 
 2423 into packed double words in destination operand, the source can be 64-bit 
 2424 memory or SSE register. "pmovsxwq" and "pmovzxwq" perform sign extension or 
 2425 zero extension of the two word values from the source operand into packed quad
 2426 words in destination operand, the source can be 32-bit memory or SSE register. 
 2427 "pmovsxdq" and "pmovzxdq" perform sign extension or zero extension of the two 
 2428 double word values from the source operand into packed quad words in 
 2429 destination operand, the source can be 64-bit memory or SSE register.
 2430 
 2431     pmovzxbq xmm0,word [si]  ; zero-extend bytes to quad words
 2432     pmovsxwq xmm0,xmm1       ; sign-extend words to quad words 
 2433 
 2434   "movntdqa" loads double quad word from the source operand to the destination
 2435 using a non-temporal hint. The destination operand should be SSE register,
 2436 and the source operand should be 128-bit memory location.
 2437   The SSE4.2, described below, adds not only some new operations on SSE
 2438 registers, but also introduces some completely new instructions operating on
 2439 general purpose registers only.
 2440   "pcmpistri" compares two zero-ended (implicit length) strings provided in
 2441 its source and destination operand and generates an index stored to ECX;
 2442 "pcmpistrm" performs the same comparison and generates a mask stored to XMM0.
 2443 "pcmpestri" compares two strings of explicit lengths, with length provided
 2444 in EAX for the destination operand and in EDX for the source operand, and
 2445 generates an index stored to ECX; "pcmpestrm" performs the same comparison
 2446 and generates a mask stored to XMM0. The source and destination operand follow
 2447 the same rules as for general SSE instructions, the third operand should be
 2448 8-bit immediate value determining the details of performed operation - refer to
 2449 Intel documentation for information on those details.
 2450   "pcmpgtq" compares packed quad words, and fills the corresponding elements of
 2451 destination operand with either ones or zeros, depending on whether the value
 2452 in destination is greater than the one in source, or not. This instruction
 2453 follows the same rules for operands as "pcmpeqq".
 2454   "crc32" accumulates a CRC32 value for the source operand starting with
 2455 initial value provided by destination operand, and stores the result in
 2456 destination. Unless in long mode, the destination operand should be a 32-bit
 2457 general purpose register, and the source operand can be a byte, word, or double
 2458 word register or memory location. In long mode the destination operand can
 2459 also be a 64-bit general purpose register, and the source operand in such case
 2460 can be a byte or quad word register or memory location.
 2461 
 2462     crc32 eax,dl          ; accumulate CRC32 on byte value
 2463     crc32 eax,word [ebx]  ; accumulate CRC32 on word value
 2464     crc32 rax,qword [rbx] ; accumulate CRC32 on quad word value
 2465 
 2466   "popcnt" calculates the number of bits set in the source operand, which can
 2467 be 16-bit, 32-bit, or 64-bit general purpose register or memory location,
 2468 and stores this count in the destination operand, which has to be register of
 2469 the same size as source operand. The 64-bit variant is available only in long
 2470 mode.
 2471 
 2472     popcnt ecx,eax        ; count bits set to 1
 2473 
 2474   The SSE4a extension, which also includes the "popcnt" instruction introduced
 2475 by SSE4.2, at the same time adds the "lzcnt" instruction, which follows the
 2476 same syntax, and calculates the count of leading zero bits in source operand
 2477 (if the source operand is all zero bits, the total number of bits in source
 2478 operand is stored in destination).
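        A sketch showing the results for some sample values:

          mov eax,1
          lzcnt ecx,eax         ; ecx = 31
          xor eax,eax
          lzcnt ecx,eax         ; ecx = 32, source was all zero bits
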
 2479   "extrq" extracts the sequence of bits from the low quad word of SSE register
 2480 provided as first operand and stores them at the low end of this register,
 2481 filling the remaining bits in the low quad word with zeros. The position of bit
 2482 string and its length can either be provided with two 8-bit immediate values
 2483 as second and third operand, or by SSE register as second operand (and there
 2484 is no third operand in such case), which should contain position value in bits
 2485 8-13 and length of bit string in bits 0-5.
 2486 
 2487     extrq xmm0,8,7        ; extract 8 bits from position 7
 2488     extrq xmm0,xmm5       ; extract bits defined by register
 2489 
 2490   "insertq" writes the sequence of bits from the low quad word of the source
 2491 operand into specified position in low quad word of the destination operand,
 2492 leaving the other bits in low quad word of destination intact. The position
 2493 where bits should be written and the length of bit string can either be
 2494 provided with two 8-bit immediate values as third and fourth operand, or by
 2495 the bit fields in source operand (and there are only two operands in such
 2496 case), which should contain position value in bits 72-77 and length of bit
 2497 string in bits 64-69.
 2498 
 2499     insertq xmm1,xmm0,4,2 ; insert 4 bits at position 2
 2500     insertq xmm1,xmm0     ; insert bits defined by register
 2501 
 2502   "movntss" and "movntsd" store single or double precision floating point
 2503 value from the source SSE register into 32-bit or 64-bit destination memory
 2504 location respectively, using non-temporal hint.
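
As an illustration only (the register and address are arbitrary assumptions),
the two non-temporal scalar stores could be written as:

    movntss dword [edi],xmm3  ; store low 32-bit float
    movntsd qword [edi],xmm3  ; store low 64-bit float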
 2505 
 2506 
 2507 2.1.21  AVX instructions
 2508 
 2509 The Advanced Vector Extensions introduce instructions that are new variants 
 2510 of SSE instructions, with new scheme of encoding that allows extended syntax 
 2511 having a destination operand separate from all the source operands. It also 
 2512 introduces 256-bit AVX registers, which extend up the old 128-bit SSE 
 2513 registers. Any AVX instruction that puts some result into SSE register, puts 
 2514 zero bits into high portion of the AVX register containing it.
 2515   The AVX version of SSE instruction has the mnemonic obtained by prepending
 2516 SSE instruction name with "v". For any SSE arithmetic instruction which had a
 2517 destination operand also being used as one of the source values, the AVX 
 2518 variant has a new syntax with three operands - the destination and two sources. 
 2519 The destination and first source can be SSE registers, and second source can be
 2520 SSE register or memory. If the operation is performed on single pair of values,
 2521 the remaining bits of first source SSE register are copied into the 
 2522 destination register.
 2523  
 2524     vsubss xmm0,xmm2,xmm3         ; subtract two 32-bit floats
 2525     vmulsd xmm0,xmm7,qword [esi]  ; multiply two 64-bit floats 
 2526 
 2527 In case of packed operations, each instruction can also operate on the 256-bit 
 2528 data size when the AVX registers are specified instead of SSE registers, and 
 2529 the size of memory operand is also doubled then.
 2530 
 2531     vaddps ymm1,ymm5,yword [esi]  ; eight sums of 32-bit float pairs 
 2532 
 2533 The instructions that operate on packed integer types (in particular the ones
 2534 that earlier had been promoted from MMX to SSE) also acquired the new syntax
 2535 with three operands, however they are only allowed to operate on 128-bit 
 2536 packed types and thus cannot use the whole AVX registers.
 2537 
 2538     vpavgw xmm3,xmm0,xmm2         ; average of 16-bit integers
 2539     vpslld xmm1,xmm0,1            ; shift double words left
 2540      
 2541 If the SSE version of instruction had a syntax with three operands, the third
 2542 one being an immediate value, the AVX version of such instruction takes four
 2543 operands, with immediate remaining the last one.
 2544 
 2545     vshufpd ymm0,ymm1,ymm2,10010011b ; shuffle 64-bit floats
 2546     vpalignr xmm0,xmm4,xmm2,3        ; extract byte aligned value
 2547      
 2548 The promotion to new syntax according to the rules described above has been 
 2549 applied to all the instructions from SSE extensions up to SSE4, with the 
 2550 exceptions described below.   
 2551   "vdppd" instruction has syntax extended to four operands, but it does not 
 2552 have a 256-bit version.
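
A possible use of the four-operand form is sketched here; the registers and
the immediate value are picked only as an example (this immediate sums both
products and places the result in the low element):

    vdppd xmm0,xmm1,xmm2,00110001b ; dot product of 64-bit floats
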
 2553   There are a few instructions, namely "vsqrtpd", "vsqrtps", "vrcpps" and
 2554 "vrsqrtps", which can operate on 256-bit data size, but retained the syntax 
 2555 with only two operands, because they use data from only one source:
 2556     
 2557     vsqrtpd ymm1,ymm0         ; put square roots into other register
 2558 
 2559 In a similar way "vroundpd" and "vroundps" retained the syntax with three 
 2560 operands, the last one being immediate value.   
 2561 
 2562     vroundps ymm0,ymm1,0011b  ; round toward zero
 2563                               
 2564   Also some of the operations on packed integers kept their two-operand or
 2565 three-operand syntax while being promoted to AVX version. In such case these
 2566 instructions follow exactly the same rules for operands as their SSE 
 2567 counterparts (since operations on packed integers do not have 256-bit variants
 2568 in AVX extension). These include "vpcmpestri", "vpcmpestrm", "vpcmpistri",
 2569 "vpcmpistrm", "vphminposuw", "vpshufd", "vpshufhw", "vpshuflw". And there are 
 2570 more instructions that in AVX versions keep exactly the same syntax for 
 2571 operands as the one from SSE, without any additional options: "vcomiss", 
 2572 "vcomisd", "vcvtss2si", "vcvtsd2si", "vcvttss2si", "vcvttsd2si", "vextractps", 
 2573 "vpextrb", "vpextrw", "vpextrd", "vpextrq", "vmovd", "vmovq", "vmovntdqa", 
 2574 "vmaskmovdqu", "vpmovmskb", "vpmovsxbw", "vpmovsxbd", "vpmovsxbq", "vpmovsxwd", 
 2575 "vpmovsxwq", "vpmovsxdq", "vpmovzxbw", "vpmovzxbd", "vpmovzxbq", "vpmovzxwd", 
 2576 "vpmovzxwq" and "vpmovzxdq".
 2577   The move and conversion instructions have mostly been promoted to allow
 2578 256-bit size operands in addition to the 128-bit variant with syntax identical
 2579 to that from SSE version of the same instruction. Each of the "vcvtdq2ps", 
 2580 "vcvtps2dq", "vcvttps2dq", "vmovaps", "vmovapd", "vmovups", "vmovupd",
 2581 "vmovdqa", "vmovdqu", "vlddqu", "vmovntps", "vmovntpd", "vmovntdq", 
 2582 "vmovsldup", "vmovshdup", "vmovmskps" and "vmovmskpd" inherits the 128-bit 
 2583 syntax from SSE without any changes, and also allows a new form with 256-bit 
 2584 operands in place of 128-bit ones.  
 2585 
 2586     vmovups [edi],ymm6        ; store unaligned 256-bit data
 2587     
 2588   "vmovddup" has the identical 128-bit syntax as its SSE version, and it also 
 2589 has a 256-bit version, which stores the duplicates of the lowest quad word 
 2590 from the source operand in the lower half of destination operand, and in the 
 2591 upper half of destination the duplicates of the low quad word from the upper 
 2592 half of source. Both source and destination operands need then to be 256-bit 
 2593 values.
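
The 256-bit form described above might look as follows (registers chosen 
arbitrarily for this sketch):

    vmovddup ymm0,ymm1        ; duplicate quad words 0 and 2 of ymm1
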
 2594   "vmovlhps" and "vmovhlps" have only 128-bit versions, and each takes three
 2595 operands, which all must be SSE registers. "vmovlhps" copies two single 
 2596 precision values from the low quad word of second source register to the high 
 2597 quad word of destination register, and copies the low quad word of first 
 2598 source register into the low quad word of destination register. "vmovhlps" 
 2599 copies two single precision values from the high quad word of second source 
 2600 register to the low quad word of destination register, and copies the high 
 2601 quad word of first source register into the high quad word of destination 
 2602 register. 
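
A minimal sketch of both instructions, assuming arbitrary registers; the
comments restate the rules given above:

    vmovlhps xmm0,xmm1,xmm2  ; low from xmm1, high = low of xmm2
    vmovhlps xmm0,xmm1,xmm2  ; low = high of xmm2, high from xmm1
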
 2603   "vmovlps", "vmovhps", "vmovlpd" and "vmovhpd" have only 128-bit versions and
 2604 their syntax varies depending on whether memory operand is a destination or
 2605 source. When memory is destination, the syntax is identical to the one of
 2606 equivalent SSE instruction, and when memory is source, the instruction requires
 2607 three operands, first two being SSE registers and the third one 64-bit memory.
 2608 The value put into destination is then the value copied from first source with
 2609 either low or high quad word replaced with value from second source (the
 2610 memory operand).
 2611 
 2612     vmovhps [esi],xmm7       ; store upper half to memory
 2613     vmovlps xmm0,xmm7,[ebx]  ; low from memory, rest from register  
 2614   
 2615   "vmovss" and "vmovsd" have syntax identical to their SSE equivalents as long
 2616 as one of the operands is memory, while the versions that operate purely on 
 2617 registers require three operands (each being SSE register). The value stored
 2618 in destination is then the value copied from first source with lowest data
 2619 element replaced with the lowest value from second source.
 2620 
 2621     vmovss xmm3,[edi]        ; low from memory, rest zeroed
 2622     vmovss xmm0,xmm1,xmm2    ; one value from xmm2, three from xmm1 
 2623   
 2624   "vcvtss2sd", "vcvtsd2ss", "vcvtsi2ss" and "vcvtsi2sd" use the three-operand
 2625 syntax, where destination and first source are always SSE registers, and the
 2626 second source follows the same rules as the source in syntax of equivalent
 2627 SSE instruction. The value stored in destination is then the value copied from
 2628 first source with lowest data element replaced with the result of conversion. 
 2629 
 2630     vcvtsi2sd xmm4,xmm4,ecx  ; 32-bit integer to 64-bit float
 2631     vcvtsi2ss xmm0,xmm0,rax  ; 64-bit integer to 32-bit float
 2632 
 2633   "vcvtdq2pd" and "vcvtps2pd" allow the same syntax as their SSE equivalents, 
 2634 plus the new variants with AVX register as destination and SSE register or 
 2635 128-bit memory as source. Analogously "vcvtpd2dq", "vcvttpd2dq" and 
 2636 "vcvtpd2ps", in addition to variant with syntax identical to SSE version, 
 2637 allow a variant with SSE register as destination and AVX register or 256-bit 
 2638 memory as source.          
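
The widening and narrowing variants could be used like this (a sketch with
arbitrarily chosen registers):

    vcvtps2pd ymm0,xmm1      ; four 32-bit floats to four 64-bit floats
    vcvtpd2dq xmm0,ymm1      ; four 64-bit floats to four 32-bit integers
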
 2639   "vinsertps", "vpinsrb", "vpinsrw", "vpinsrd", "vpinsrq" and "vpblendw" use 
 2640 a syntax with four operands, where destination and first source have to be SSE
 2641 registers, and the third and fourth operand follow the same rules as second 
 2642 and third operand in the syntax of equivalent SSE instruction. Value stored in 
 2643 destination is the value copied from first source with some data elements 
 2644 replaced with values extracted from the second source, analogously to the 
 2645 operation of corresponding SSE instruction.   
 2646   
 2647     vpinsrd xmm0,xmm0,eax,3  ; insert double word
 2648 
 2649   "vblendvps", "vblendvpd" and "vpblendvb" use a new syntax with four register
 2650 operands: destination, two sources and a mask, where second source can also be
 2651 a memory operand. "vblendvps" and "vblendvpd" have 256-bit variant, where 
 2652 operands are AVX registers or 256-bit memory, as well as 128-bit variant, 
 2653 which has operands being SSE registers or 128-bit memory. "vpblendvb" has only
 2654 a 128-bit variant. Value stored in destination is the value copied from the
 2655 first source with some data elements replaced, according to mask, by values 
 2656 from the second source.
 2657 
 2658     vblendvps ymm3,ymm1,ymm2,ymm7  ; blend according to mask     
 2659    
 2660   "vptest" allows the same syntax as its SSE version and also has a 256-bit
 2661 version, with both operands doubled in size. There are also two new 
 2662 instructions, "vtestps" and "vtestpd", which perform analogous tests, but only
 2663 of the sign bits of corresponding single precision or double precision values,
 2664 and set the ZF and CF accordingly. They follow the same syntax rules as 
 2665 "vptest".
 2666 
 2667     vptest ymm0,yword [ebx]  ; test 256-bit values
 2668     vtestpd xmm0,xmm1        ; test sign bits of 64-bit floats
 2669 
 2670   "vbroadcastss", "vbroadcastsd" and "vbroadcastf128" are new instructions, 
 2671 which broadcast the data element defined by source operand into all elements
 2672 of corresponding size in the destination register. "vbroadcastss" needs
 2673 source to be 32-bit memory and destination to be either SSE or AVX register. 
 2674 "vbroadcastsd" requires 64-bit memory as source, and AVX register as 
 2675 destination. "vbroadcastf128" requires 128-bit memory as source, and AVX
 2676 register as destination.
 2677 
 2678     vbroadcastss ymm0,dword [eax]  ; get eight copies of value          
 2679 
 2680   "vinsertf128" is the new instruction, which takes four operands. The
 2681 destination and first source have to be AVX registers, second source can be 
 2682 SSE register or 128-bit memory location, and fourth operand should be an 
 2683 immediate value. It stores in destination the value obtained by taking 
 2684 contents of first source and replacing one of its 128-bit units with value of
 2685 the second source. The lowest bit of fourth operand specifies at which 
 2686 position that replacement is done (either 0 or 1). 
 2687   "vextractf128" is the new instruction with three operands. The destination
 2688 needs to be SSE register or 128-bit memory location, the source must be AVX
 2689 register, and the third operand should be an immediate value. It extracts
 2690 into destination one of the 128-bit units from source. The lowest bit of third
 2691 operand specifies, which unit is extracted.  
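
For example, assuming arbitrary registers and a destination address in EDI,
the insertion and extraction of 128-bit units might be written as:

    vinsertf128 ymm0,ymm1,xmm2,1       ; replace upper 128 bits
    vextractf128 xword [edi],ymm0,0    ; store lower 128 bits
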
 2692   "vmaskmovps" and "vmaskmovpd" are the new instructions with three operands
 2693 that selectively store in destination the elements from second source 
 2694 depending on the sign bits of corresponding elements from first source. These
 2695 instructions can operate on either 128-bit data (SSE registers) or 256-bit 
 2696 data (AVX registers). Either destination or second source has to be a memory
 2697 location of appropriate size, the two other operands should be registers.   
 2698   
 2699     vmaskmovps [edi],xmm0,xmm5  ; conditionally store
 2700     vmaskmovpd ymm5,ymm0,[esi]  ; conditionally load   
 2701 
 2702   "vpermilpd" and "vpermilps" are the new instructions with three operands 
 2703 that permute the values from first source according to the control fields from 
 2704 second source and put the result into destination operand. They accept
 2705 either three SSE registers or three AVX registers as operands, and the second
 2706 source can also be a memory of size equal to the registers used. In alternative
 2707 form the second source can be immediate value and then the first source
 2708 can be a memory location of the size equal to destination register.
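
Both forms are sketched below with operands chosen only for illustration;
in the second line the immediate 01b is assumed to select the high element
for position 0 and the low element for position 1:

    vpermilps ymm0,ymm1,ymm2         ; control fields in ymm2
    vpermilpd xmm0,xword [esi],01b   ; swap two 64-bit floats
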
 2709   "vperm2f128" is the new instruction with four operands, which selects 
 2710 128-bit blocks of floating point data from first and second source according
 2711 to the bit fields from fourth operand, and stores them in destination.
 2712 Destination and first source need to be AVX registers, second source can be
 2713 AVX register or 256-bit memory area, and fourth operand should be an immediate
 2714 value.
 2715 
 2716     vperm2f128 ymm0,ymm6,ymm7,12h  ; permute 128-bit blocks
 2717 
 2718   "vzeroall" instruction sets all the AVX registers to zero. "vzeroupper" sets
 2719 the upper 128-bit portions of all AVX registers to zero, leaving the SSE 
 2720 registers intact. These new instructions take no operands.
 2721   "vldmxcsr" and "vstmxcsr" are the AVX versions of "ldmxcsr" and "stmxcsr"
 2722 instructions. The rules for their operands remain unchanged.  
 2723 
 2724   
 2725 2.1.22  AVX2 instructions
 2726 
 2727 The AVX2 extension allows all the AVX instructions operating on packed integers
 2728 to use 256-bit data types, and introduces some new instructions as well.
 2729   The AVX instructions that operate on packed integers and had only 128-bit
 2730 variants have been supplemented with 256-bit variants, and thus their syntax
 2731 rules became analogous to AVX instructions operating on packed floating point
 2732 types.
 2733 
 2734     vpsubb ymm0,ymm0,[esi]   ; subtract 32 packed bytes
 2735     vpavgw ymm3,ymm0,ymm2    ; average of 16-bit integers
 2736 
 2737 However there are some instructions that have not been equipped with the 
 2738 256-bit variants. "vpcmpestri", "vpcmpestrm", "vpcmpistri", "vpcmpistrm", 
 2739 "vpextrb", "vpextrw", "vpextrd", "vpextrq", "vpinsrb", "vpinsrw", "vpinsrd", 
 2740 "vpinsrq" and "vphminposuw" are not affected by AVX2 and allow only the 
 2741 128-bit operands.
 2742   The packed shift instructions, which allowed the third operand specifying
 2743 amount to be SSE register or 128-bit memory location, use the same rules
 2744 for the third operand in their 256-bit variant.
 2745 
 2746     vpsllw ymm2,ymm2,xmm4        ; shift words left
 2747     vpsrad ymm0,ymm3,xword [ebx] ; shift double words right
 2748 
 2749   There are also new packed shift instructions with standard three-operand AVX
 2750 syntax, which shift each element from first source by the amount specified in 
 2751 corresponding element of second source, and store the results in destination. 
 2752 "vpsllvd" shifts 32-bit elements left, "vpsllvq" shifts 64-bit elements left, 
 2753 "vpsrlvd" shifts 32-bit elements right logically, "vpsrlvq" shifts 64-bit 
 2754 elements right logically and "vpsravd" shifts 32-bit elements right 
 2755 arithmetically.
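
A brief sketch of the per-element shifts, with registers and the memory
operand chosen arbitrarily:

    vpsllvd ymm0,ymm1,ymm2         ; shift each double word by own count
    vpsravd xmm0,xmm1,xword [ebx]  ; arithmetic shift, counts from memory
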
 2756   The sign-extend and zero-extend instructions, which in AVX versions allowed
 2757 source operand to be SSE register or a memory of specific size, in the new
 2758 256-bit variant need memory of that size doubled or SSE register as source and
 2759 AVX register as destination.
 2760 
 2761     vpmovzxbq ymm0,dword [esi]   ; bytes to quad words
 2762     
 2763   Also "vmovntdqa" has been upgraded with 256-bit variant, which transfers 
 2764 256-bit value from memory to AVX register and needs the memory address 
 2765 to be aligned to 32 bytes.   
 2766   "vpmaskmovd" and "vpmaskmovq" are the new instructions with syntax identical
 2767 to "vmaskmovps" or "vmaskmovpd", and they perform analogous operation on
 2768 packed 32-bit or 64-bit values.    
 2769   "vinserti128", "vextracti128", "vbroadcasti128" and "vperm2i128" are the new 
 2770 instructions with syntax identical to "vinsertf128", "vextractf128",
 2771 "vbroadcastf128" and "vperm2f128" respectively, and they perform analogous 
 2772 operations on 128-bit blocks of integer data.
 2773   "vbroadcastss" and "vbroadcastsd" instructions have been extended to allow
 2774 SSE register as a source operand (which in AVX could only be a memory).
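
The instructions from the last three paragraphs might be used as follows;
this is only a sketch and all the operands are arbitrary assumptions:

    vpmaskmovd ymm0,ymm1,yword [esi]  ; conditionally load double words
    vinserti128 ymm0,ymm1,xmm2,1      ; replace upper 128 bits
    vbroadcastss ymm0,xmm1            ; eight copies of low float of xmm1
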
 2775   "vpbroadcastb", "vpbroadcastw", "vpbroadcastd" and "vpbroadcastq" are the 
 2776 new instructions which broadcast the byte, word, double word or quad word from
 2777 the source operand into all elements of corresponding size in the destination 
 2778 register. The destination operand can be either SSE or AVX register, and the
 2779 source operand can be SSE register or memory of size equal to the size of data
 2780 element.
 2781 
 2782     vpbroadcastb ymm0,byte [ebx]  ; get 32 identical bytes
 2783                  
 2784   "vpermd" and "vpermps" are new three-operand instructions, which use each 
 2785 32-bit element from first source as an index of element in second source which
 2786 is copied into destination at position corresponding to element containing
 2787 index. The destination and first source have to be AVX registers, and the
 2788 second source can be AVX register or 256-bit memory.
 2789   "vpermq" and "vpermpd" are new three-operand instructions, which use 2-bit
 2790 indexes from the immediate value specified as third operand to determine which
 2791 element from source to store at given position in destination. The destination
 2792 has to be AVX register, source can be AVX register or 256-bit memory, and the
 2793 third operand must be 8-bit immediate value.    
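
To illustrate both kinds of permutation (registers chosen arbitrarily; the
immediate in the second line holds the 2-bit indexes 3, 2, 1, 0 for 
positions 0 to 3):

    vpermd ymm0,ymm1,ymm2        ; indexes in ymm1, data from ymm2
    vpermq ymm0,ymm1,00011011b   ; reverse the order of quad words
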
 2794   The family of new instructions performing "gather" operation have special
 2795 syntax, as in their memory operand they use addressing mode that is unique to
 2796 them. The base of address can be a 32-bit or 64-bit general purpose register
 2797 (the latter only in long mode), and the index (possibly multiplied by scale
 2798 value, as in standard addressing) is specified by SSE or AVX register. It is
 2799 possible to use only index without base and any numerical displacement can be
 2800 added to the address. Each of those instructions takes three operands. First 
 2801 operand is the destination register, second operand is memory addressed with
 2802 a vector index, and third operand is register containing a mask. The most 
 2803 significant bit of each element of mask determines whether a value will be 
 2804 loaded from memory into corresponding element in destination. The address of
 2805 each element to load is determined by using the corresponding element from 
 2806 index register in memory operand to calculate final address with given base
 2807 and displacement. When the index register contains fewer elements than the 
 2808 destination and mask registers, the higher elements of destination are zeroed.
 2809 After the value is successfully loaded, the corresponding element in mask 
 2810 register is set to zero. The destination, index and mask should all be
 2811 distinct registers; it is not allowed to use the same register in two 
 2812 different roles.
 2813   "vgatherdps" loads single precision floating point values addressed by 
 2814 32-bit indexes. The destination, index and mask should all be registers of the
 2815 same type, either SSE or AVX. The data addressed by memory operand is 32-bit
 2816 in size. 
 2817 
 2818     vgatherdps xmm0,[eax+xmm1],xmm3    ; gather four floats
 2819     vgatherdps ymm0,[ebx+ymm7*4],ymm3  ; gather eight floats
 2820 
 2821   "vgatherqps" loads single precision floating point values addressed by
 2822 64-bit indexes. The destination and mask should always be SSE registers, while
 2823 index register can be either SSE or AVX register. The data addressed by memory
 2824 operand is 32-bit in size.
 2825   
 2826     vgatherqps xmm0,[xmm2],xmm3        ; gather two floats     
 2827     vgatherqps xmm0,[ymm2+64],xmm3     ; gather four floats  
 2828   
 2829   "vgatherdpd" loads double precision floating point values addressed by
 2830 32-bit indexes. The index register should always be SSE register, the 
 2831 destination and mask should be two registers of the same type, either SSE or
 2832 AVX. The data addressed by memory operand is 64-bit in size. 
 2833   
 2834     vgatherdpd xmm0,[ebp+xmm1],xmm3    ; gather two doubles
 2835     vgatherdpd ymm0,[xmm3*8],ymm5      ; gather four doubles
 2836 
 2837   "vgatherqpd" loads double precision floating point values addressed by
 2838 64-bit indexes. The destination, index and mask should all be registers of the
 2839 same type, either SSE or AVX. The data addressed by memory operand is 64-bit
 2840 in size.      
 2841   "vpgatherdd" and "vpgatherqd" load 32-bit values addressed by either 32-bit
 2842 or 64-bit indexes. They follow the same rules as "vgatherdps" and "vgatherqps"
 2843 respectively.  
 2844   "vpgatherdq" and "vpgatherqq" load 64-bit values addressed by either 32-bit
 2845 or 64-bit indexes. They follow the same rules as "vgatherdpd" and "vgatherqpd"
 2846 respectively.  
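
A sketch of the remaining gather variants, following the rules above; the
base, index and mask registers are arbitrary choices made for this example:

    vgatherqpd ymm0,[ymm1*8],ymm2      ; gather four doubles
    vpgatherdd xmm0,[eax+xmm1*4],xmm2  ; gather four double words
    vpgatherqq ymm0,[ymm1],ymm2        ; gather four quad words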
 2847   
 2848 
 2849 2.1.23  Auxiliary sets of computational instructions
 2850 
 2851   There is a number of additional instruction set extensions related to 
 2852 AVX. They introduce new vector instructions (and sometimes also their SSE 
 2853 equivalents that use classic instruction encoding), and even some new
 2854 instructions operating on general registers that use the AVX-like encoding
 2855 allowing the extended syntax with separate destination and source operands.
 2856 The CPU support for each of these instruction sets needs to be determined
 2857 separately.    
 2858   The AES extension provides a specialized set of instructions for the 
 2859 purpose of cryptographic computations defined by Advanced Encryption Standard.
 2860 Each of these instructions has two versions: the AVX one and the one with 
 2861 SSE-like syntax that uses classic encoding. Refer to the Intel manuals for the
 2862 details of operation of these instructions.
 2863   "aesenc" and "aesenclast" perform a single round of AES encryption on data
 2864 from first source with a round key from second source, and store result in
 2865 destination. The destination and first source are SSE registers, and the 
 2866 second source can be SSE register or 128-bit memory. The AVX versions of these
 2867 instructions, "vaesenc" and "vaesenclast", use the syntax with three operands,
 2868 while the SSE-like version has only two operands, with first operand being 
 2869 both the destination and first source.
 2870   "aesdec" and "aesdeclast" perform a single round of AES decryption on data
 2871 from first source with a round key from second source. The syntax rules for
 2872 them and their AVX versions are the same as for "aesenc".
 2873   "aesimc" performs the InvMixColumns transformation of source operand and
 2874 stores the result in destination. Both "aesimc" and "vaesimc" use only two
 2875 operands, destination being SSE register, and source being SSE register or
 2876 128-bit memory location.
 2877   "aeskeygenassist" is a helper instruction for generating the round key.
 2878 It needs three operands: destination being SSE register, source being SSE
 2879 register or 128-bit memory, and third operand being 8-bit immediate value.  
 2880 The AVX version of this instruction uses the same syntax.  
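
The operand rules of the AES instructions can be illustrated as below; the
registers, the address and the immediate are assumptions made only for the
sake of example:

    aesenc xmm0,xmm1                   ; one round, SSE-like syntax
    vaesenclast xmm0,xmm1,xword [esi]  ; last round, round key in memory
    aeskeygenassist xmm1,xmm0,1        ; assist in round key generation
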
 2881   The CLMUL extension introduces just one instruction, "pclmulqdq", and its
 2882 AVX version as well. This instruction performs a carryless multiplication of
 2883 two 64-bit values selected from first and second source according to the bit
 2884 fields in immediate value. The destination and first source are SSE registers,
 2885 second source is SSE register or 128-bit memory, and immediate value is 
 2886 provided as last operand. "vpclmulqdq" takes four operands, while "pclmulqdq"
 2887 takes only three operands, with the first one serving both the role of 
 2888 destination and first source.
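
In the following sketch the registers are arbitrary, and the immediate is
assumed to select the quad word of first source with bit 0 and the quad word
of second source with bit 4:

    pclmulqdq xmm0,xmm1,00h        ; multiply low quad words
    vpclmulqdq xmm0,xmm1,xmm2,11h  ; multiply high quad words
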
 2889   The FMA (Fused Multiply-Add) extension introduces additional AVX 
 2890 instructions which perform multiplication and summation as single operation. 
 2891 Each one takes three operands, first one serving both the role of destination 
 2892 and first source, and the following ones being the second and third source. 
 2893 The mnemonic of FMA instruction is obtained by appending to "vf" prefix: first 
 2894 either "m" or "nm" to select whether result of multiplication should be taken 
 2895 as-is or negated, then either "add" or "sub" to select whether third value 
 2896 will be added to the product or subtracted from the product, then either 
 2897 "132", "213" or "231" to select which source operands are multiplied and which 
 2898 one is added or subtracted, and finally the type of data on which the 
 2899 instruction operates, either "ps", "pd", "ss" or "sd". As it was with SSE 
 2900 instructions promoted to AVX, instructions operating on packed floating point 
 2901 values allow 128-bit or 256-bit syntax, in former all the operands are SSE 
 2902 registers, but the third one can also be a 128-bit memory, in latter the 
 2903 operands are AVX registers and the third one can also be a 256-bit memory. 
 2904 Instructions that compute just one floating point result need operands to be 
 2905 SSE registers, and the third operand can also be a memory, either 32-bit for 
 2906 single precision or 64-bit for double precision.
 2907 
 2908     vfmsub231ps ymm1,ymm2,ymm3     ; multiply and subtract
 2909     vfnmadd132sd xmm0,xmm5,[ebx]   ; multiply, negate and add        
 2910 
 2911 In addition to the instructions created by the rule described above, there are
 2912 families of instructions with mnemonics starting with either "vfmaddsub" or
 2913 "vfmsubadd", followed by either "132", "213" or "231" and then either "ps" or
 2914 "pd" (the operation must always be on packed values in this case). They add
 2915 to the result of multiplication or subtract from it depending on the position
 2916 of value in packed data - instructions from the "vfmaddsub" group add when the
 2917 position is odd and subtract when the position is even, instructions from the
 2918 "vfmsubadd" group add when the position is even and subtract when the 
 2919 position is odd. The rules for operands are the same as for other FMA 
 2920 instructions.
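
A minimal sketch of the alternating variants, with operands chosen 
arbitrarily:

    vfmaddsub213pd ymm0,ymm1,ymm2          ; add at odd, subtract at even
    vfmsubadd231ps xmm0,xmm1,xword [ebx]   ; subtract at odd, add at even
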
 2921   The FMA4 instructions are similar to FMA, but use syntax with four operands
 2922 and thus allow destination to be different than all the sources. Their 
 2923 mnemonics are identical to FMA instructions with the "132", "213" or "231" cut
 2924 out, as having separate destination operand makes such selection of operands
 2925 superfluous. The multiplication is always performed on values from the first 
 2926 and second source, and then the value from third source is added or 
 2927 subtracted. Either second or third source can be a memory operand, and the
 2928 rules for the sizes of operands are the same as for FMA instructions.
 2929 
 2930     vfmaddpd ymm0,ymm1,[esi],ymm2  ; multiply and add   
 2931     vfmsubss xmm0,xmm1,xmm2,[ebx]  ; multiply and subtract
 2932     
 2933   The F16C extension consists of two instructions, "vcvtps2ph" and 
 2934 "vcvtph2ps", which convert floating point values between single precision and
 2935 half precision (the 16-bit floating point format). "vcvtps2ph" takes three
 2936 operands: destination, source, and rounding controls. The third operand is
 2937 always an immediate, the source is either SSE or AVX register containing 
 2938 single precision values, and the destination is SSE register or memory, the
 2939 size of memory is 64 bits when the source is SSE register and 128 bits when
 2940 the source is AVX register. "vcvtph2ps" takes two operands, the destination
 2941 that can be SSE or AVX register, and the source that is SSE register or memory
 2942 with size of the half of destination operand's size.
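
The conversions might be written as follows; the operands are arbitrary, and
the zero immediate is assumed here to select rounding to nearest:

    vcvtps2ph xmm0,ymm1,0         ; eight floats to half precision
    vcvtps2ph qword [edi],xmm1,0  ; four floats to half precision
    vcvtph2ps ymm0,xword [esi]    ; eight half precision values to floats
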
 2943   The AMD XOP extension introduces a number of new vector instructions with 
 2944 encoding and syntax analogous to AVX instructions. "vfrczps", "vfrczss",
 2945 "vfrczpd" and "vfrczsd" extract fractional portions of single or double
 2946 precision values, they all take two operands. The packed operations allow
 2947 either SSE or AVX register as destination, for the other two it has to be SSE
 2948 register. Source can be register of the same type as destination, or memory
 2949 of appropriate size (256-bit for destination being AVX register, 128-bit for
 2950 packed operation with destination being SSE register, 64-bit for operation
 2951 on a solitary double precision value and 32-bit for operation on a solitary 
 2952 single precision value).
 2953 
 2954     vfrczps ymm0,[esi]           ; load fractional parts
 2955     
 2956   "vpcmov" copies bits from either first or second source into destination
 2957 depending on the values of corresponding bits in the fourth operand (the
 2958 selector). If the bit in selector is set, the corresponding bit from first
 2959 source is copied into the same position in destination, otherwise the bit from
 2960 second source is copied. Either second source or selector can be memory
 2961 location, 128-bit or 256-bit depending on whether SSE registers or AVX
 2962 registers are specified as the other operands.
 2963 
 2964     vpcmov xmm0,xmm1,xmm2,[ebx]  ; selector in memory
 2965     vpcmov ymm0,ymm5,[esi],ymm2  ; source in memory
 2966 
 2967 The family of packed comparison instructions takes four operands, the 
 2968 destination and first source being SSE register, second source being SSE
 2969 register or 128-bit memory and the fourth operand being immediate value
 2970 defining the type of comparison. The mnemonic of instruction is created
 2971 by appending to "vpcom" prefix either "b" or "ub" to compare signed or 
 2972 unsigned bytes, "w" or "uw" to compare signed or unsigned words, "d" or "ud"
 2973 to compare signed or unsigned double words, "q" or "uq" to compare signed or
 2974 unsigned quad words. The respective values from the first and second source 
 2975 are compared and the corresponding data element in destination is set to
 2976 either all ones or all zeros depending on the result of comparison. The fourth
 2977 operand has to specify one of the eight comparison types (table 2.5). All
 2978 these instructions have also variants with only three operands and the type
 2979 of comparison encoded within the instruction name by inserting the comparison 
 2980 mnemonic after "vpcom".
 2981 
 2982     vpcomb   xmm0,xmm1,xmm2,4    ; test for equal bytes
 2983     vpcomgew xmm0,xmm1,[ebx]     ; compare signed words
 2984 
 2985    Table 2.5  XOP comparisons
 2986   /-------------------------------------------\
 2987   | Code | Mnemonic | Description             |
 2988   |======|==========|=========================|
 2989   | 0    | lt       | less than               |
 2990   | 1    | le       | less than or equal      |
 2991   | 2    | gt       | greater than            |
 2992   | 3    | ge       | greater than or equal   |
 2993   | 4    | eq       | equal                   |
 2994   | 5    | neq      | not equal               |
 2995   | 6    | false    | false                   |
 2996   | 7    | true     | true                    |
 2997   \-------------------------------------------/
 2998 
 2999   "vpermil2ps" and "vpermil2pd" set the elements in destination register to
 3000 zero or to a value selected from first or second source depending on the 
 3001 corresponding bit fields from the fourth operand (the selector) and the 
 3002 immediate value provided in fifth operand. Refer to the AMD manuals for the
 3003 detailed explanation of the operation performed by these instructions. Each
 3004 of the first four operands can be a register, and either second source or
 3005 selector can be memory location, 128-bit or 256-bit depending on whether SSE 
 3006 registers or AVX registers are used for the other operands.
 3007 
 3008     vpermil2ps ymm0,ymm3,ymm7,ymm2,0  ; permute from two sources
 3009   
  "vphaddbw" adds pairs of adjacent signed bytes to form 16-bit values and
stores them at the same positions in destination. "vphaddubw" does the same
but treats the bytes as unsigned. "vphaddbd" and "vphaddubd" sum all bytes
(either signed or unsigned) in each four-byte block to 32-bit results,
"vphaddbq" and "vphaddubq" sum all bytes in each eight-byte block to
64-bit results, "vphaddwd" and "vphadduwd" add pairs of words to 32-bit
results, "vphaddwq" and "vphadduwq" sum all words in each four-word block to
64-bit results, "vphadddq" and "vphaddudq" add pairs of double words to 64-bit
results. "vphsubbw" subtracts in each two-byte block the byte at higher
position from the one at lower position, and stores the result as a signed
16-bit value at the corresponding position in destination, "vphsubwd"
subtracts in each two-word block the word at higher position from the one at
lower position and makes signed 32-bit results, "vphsubdq" subtracts in each
block of two double words the one at higher position from the one at lower
position and makes signed 64-bit results. Each of these instructions takes
two operands, the destination being SSE register, and the source being SSE
register or 128-bit memory.
 3027 
 3028     vphadduwq xmm0,xmm1          ; sum quadruplets of words 
 3029   
  "vpmacsww" and "vpmacssww" multiply the corresponding signed 16-bit values
from the first and second source and then add the products to the parallel
values from the third source, then "vpmacsww" takes the lowest 16 bits of the
result and "vpmacssww" saturates the result down to 16-bit value, and they
store the final 16-bit results in the destination. "vpmacsdd" and "vpmacssdd"
perform the analogous operation on 32-bit values. "vpmacswd" and "vpmacsswd" do
the same calculation only on the low 16-bit values from each 32-bit block and
form the 32-bit results. "vpmacsdql" and "vpmacssdql" perform such operation
on the low 32-bit values from each 64-bit block and form the 64-bit results,
while "vpmacsdqh" and "vpmacssdqh" do the same on the high 32-bit values from
each 64-bit block, also forming the 64-bit results. "vpmadcswd" and
"vpmadcsswd" multiply the corresponding signed 16-bit values from the first
and second source, then sum the two products in each pair of adjacent words
and add this sum to the corresponding 32-bit element from third source,
storing the truncated or saturated 32-bit result in destination. All these
instructions take four operands, the second source can be 128-bit memory or
SSE register, all the other operands have to be SSE registers.
 3047 
 3048     vpmacsdd xmm6,xmm1,[ebx],xmm6  ; accumulate product
 3049 
 3050   "vpperm" selects bytes from first and second source, optionally applies a
 3051 separate transformation to each of them, and stores them in the destination. 
 3052 The bit fields in fourth operand (the selector) specify for each position in 
 3053 destination what byte from which source is taken and what operation is applied 
 3054 to it before it is stored there. Refer to the AMD manuals for the detailed 
 3055 information about these bit fields. This instruction takes four operands, 
 3056 either second source or selector can be a 128-bit memory (or they can be SSE
 3057 registers both), all the other operands have to be SSE registers.
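  As a short illustration of the operand rules given above for "vpperm"
(the registers and addresses are chosen arbitrarily), either of the two
memory placements may be written like:

    vpperm xmm0,xmm1,xmm2,[ebx]  ; selector in memory
    vpperm xmm0,xmm1,[esi],xmm3  ; second source in memory
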
 3058   "vpshlb", "vpshlw", "vpshld" and "vpshlq" shift logically bytes, words, double
 3059 words or quad words respectively. The amount of bits to shift by is specified
 3060 for each element separately by the signed byte placed at the corresponding
 3061 position in the third operand. The source containing elements to shift is
 3062 provided as second operand. Either second or third operand can be 128-bit 
 3063 memory (or they can be SSE registers both) and the other operands have to be 
 3064 SSE registers.
 3065 
    vpshld xmm3,xmm1,[ebx]       ; shift double words from xmm1
 3067  
  "vpshab", "vpshaw", "vpshad" and "vpshaq" arithmetically shift bytes, words,
double words or quad words. These instructions follow the same rules as the
logical shifts described above. "vprotb", "vprotw", "vprotd" and "vprotq"
rotate bytes, words, double words or quad words. They follow the same rules as
shifts, but additionally allow third operand to be immediate value, in which
case the same amount of rotation is specified for all the elements in source.
 3074 
 3075     vprotb xmm0,[esi],3          ; rotate bytes to the left 
 3076 
  The MOVBE extension introduces just one new instruction, "movbe", which
swaps bytes in value from source before storing it in destination, so it can
be used to load and store big endian values. It takes two operands, either
the destination or source should be a 16-bit, 32-bit or 64-bit memory (the
last one being only allowed in long mode), and the other operand should be
a general register of the same size.
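
  A minimal sketch of both directions of "movbe", with arbitrarily chosen
registers; the size of memory operand is taken from the register:

    movbe eax,[esi]      ; load 32-bit big endian value
    movbe [edi],ax       ; store 16-bit value as big endian
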
 3083   The BMI extension, consisting of two subsets - BMI1 and BMI2, introduces 
 3084 new instructions operating on general registers, which use the same encoding
 3085 as AVX instructions and so allow the extended syntax. All these instructions
 3086 use 32-bit operands, and in long mode they also allow the forms with 64-bit
 3087 operands.
 3088   "andn" calculates the bitwise AND of second source with the inverted bits
 3089 of first source and stores the result in destination. The destination and 
 3090 the first source have to be general registers, the second source can be 
 3091 general register or memory.
 3092 
 3093     andn edx,eax,[ebx]   ; bit-multiply inverted eax with memory
 3094 
 3095   "bextr" extracts from the first source the sequence of bits using an index
 3096 and length specified by bit fields in the second source operand and stores
 3097 it into destination. The lowest 8 bits of second source specify the position 
 3098 of bit sequence to extract and the next 8 bits of second source specify the 
 3099 length of sequence. The first source can be a general register or memory,
 3100 the other two operands have to be general registers.
 3101 
 3102     bextr eax,[esi],ecx  ; extract bit field from memory
 3103     
 3104   "blsi" extracts the lowest set bit from the source, setting all the other 
 3105 bits in destination to zero. The destination must be a general register,
 3106 the source can be general register or memory.
 3107 
 3108     blsi rax,r11         ; isolate the lowest set bit       
 3109   
 3110   "blsmsk" sets all the bits in the destination up to the lowest set bit in 
 3111 the source, including this bit. "blsr" copies all the bits from the source to
 3112 destination except for the lowest set bit, which is replaced by zero. These
 3113 instructions follow the same rules for operands as "blsi".
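
  To compare these three instructions, the following sketch applies them to
the same sample value (the value and registers are chosen only for the purpose
of illustration):

    mov ecx,01011000b
    blsi eax,ecx         ; eax = 00001000b, the lowest set bit only
    blsmsk eax,ecx       ; eax = 00001111b, mask up to that bit
    blsr eax,ecx         ; eax = 01010000b, the lowest set bit cleared
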
 3114   "tzcnt" counts the number of trailing zero bits, that is the zero bits up to
 3115 the lowest set bit of source value. This instruction is analogous to "lzcnt"
 3116 and follows the same rules for operands, so it also has a 16-bit version, 
 3117 unlike the other BMI instructions.
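
  A short sketch with an arbitrary sample value; the second line only
demonstrates the 16-bit form with source in memory:

    mov ebx,01011000b
    tzcnt eax,ebx        ; eax = 3
    tzcnt cx,[esi]       ; 16-bit version with memory source
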
 3118   "bzhi" is BMI2 instruction, which copies the bits from first source to
 3119 destination, zeroing all the bits up from the position specified by second
 3120 source. It follows the same rules for operands as "bextr".
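
  For example, assuming that the low 12 bits of a value in memory are to be
kept (the registers are chosen arbitrarily), "bzhi" could be used like:

    mov ecx,12
    bzhi eax,[esi],ecx   ; copy bits 0-11, zero bits 12 and up
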
  "pext" uses a mask in second source operand to select bits from first
source and puts the selected bits as a continuous sequence into destination.
"pdep" performs the reverse operation - it takes sequence of bits from the
first source and puts them consecutively at the positions where the bits in
second source are set, setting all the other bits in destination to zero.
These BMI2 instructions follow the same rules for operands as "andn".
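
  The following sketch, with sample values chosen only for illustration,
extracts the second lowest byte of a value with "pext" and then places it back
at the same position with "pdep":

    mov ebx,12345678h
    mov ecx,0000FF00h    ; mask selecting bits 8-15
    pext eax,ebx,ecx     ; eax = 00000056h
    pdep edx,eax,ecx     ; edx = 00005600h
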
 3127   "mulx" is a BMI2 instruction which performs an unsigned multiplication of
 3128 value from EDX or RDX register (depending on the size of specified operands)
 3129 by the value from third operand, and stores the low half of result in the
 3130 second operand, and the high half of result in the first operand, and it does
 3131 it without affecting the flags. The third operand can be general register or 
 3132 memory, and both the destination operands have to be general registers.
 3133 
 3134     mulx edx,eax,ecx     ; multiply edx by ecx into edx:eax   
 3135 
  "shlx", "shrx" and "sarx" are BMI2 instructions, which perform logical or
arithmetical shifts of value from first source by the amount specified by
second source, and store the result in destination without affecting the
flags. They have the same rules for operands as "bzhi" instruction.
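
  A few sample forms of these shifts, with arbitrarily chosen registers, may
look like:

    shlx eax,[esi],ecx   ; shift value from memory left by ecx
    shrx edx,ebx,ecx     ; logical shift right
    sarx edx,ebx,ecx     ; arithmetical shift right
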
 3140   "rorx" is a BMI2 instruction which rotates right the value from source
 3141 operand by the constant amount specified in third operand and stores the
 3142 result in destination without affecting the flags. The destination operand
 3143 has to be general register, the source operand can be general register or
 3144 memory, and the third operand has to be an immediate value.
 3145 
 3146     rorx eax,edx,7       ; rotate without affecting flags
 3147                      
  The TBM is an extension designed by AMD to supplement the BMI set. The
"bextr" instruction is extended with a new form, in which second source is
a 32-bit immediate value. "blsic" is a new instruction which performs the
same operation as "blsi", but with the bits of result inverted. It uses the
same rules for operands as "blsi". "blsfill" is a new instruction, which takes
the value from source, sets all the bits below the lowest set bit and stores
the result in destination. It also uses the same rules for operands as "blsi".
 3155   "blci", "blcic", "blcs", "blcmsk" and "blcfill" are instructions analogous
 3156 to "blsi", "blsic", "blsr", "blsmsk" and "blsfill" respectively, but they
 3157 perform the bit-inverted versions of the same operations. They follow the
 3158 same rules for operands as the instructions they reflect.
 3159   "tzmsk" finds the lowest set bit in value from source operand, sets all bits
 3160 below it to 1 and all the rest of bits to zero, then writes the result to 
 3161 destination. "t1mskc" finds the least significant zero bit in the value from 
 3162 source  operand, sets the bits below it to zero and all the other bits to 1, 
 3163 and writes the result to destination. These instructions have the same rules
 3164 for operands as "blsi".
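
  As a sketch of some of the TBM instructions described above (the sample
value and registers are arbitrary, and the immediate for "bextr" is assumed to
use the same layout of bit fields as the register form):

    bextr eax,[esi],0804h  ; extract 8 bits starting at bit 4
    mov ebx,01011000b
    blsfill eax,ebx        ; eax = 01011111b
    tzmsk eax,ebx          ; eax = 00000111b
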
 3165       
 3166 
 3167 2.1.24  AVX-512 instructions
 3168 
 3169 The AVX-512 introduces 512-bit vector registers, which extend the 256-bit
 3170 registers used by AVX and AVX2. It also extends the set of vector registers
 3171 from 16 to 32, with the additional registers "zmm16" to "zmm31", their low 
 3172 256-bit portions "ymm16" to "ymm31" and their low 128-bit portions "xmm16"
 3173 to "xmm31". These additional registers can only be accessed in the long mode.
 3174 
 3175    Table 2.6  New registers available in long mode with AVX-512
 3176   /------------------------------------------------------------------\
 3177   | Size    | Registers                                              |
 3178   |---------|--------------------------------------------------------|
 3179   | 128-bit | xmm16  xmm17  xmm18  xmm19  xmm20  xmm21  xmm22  xmm23 |
 3180   |         | xmm24  xmm25  xmm26  xmm27  xmm28  xmm29  xmm30  xmm31 |
 3181   |---------|--------------------------------------------------------|
 3182   | 256-bit | ymm16  ymm17  ymm18  ymm19  ymm20  ymm21  ymm22  ymm23 |
 3183   |         | ymm24  ymm25  ymm26  ymm27  ymm28  ymm29  ymm30  ymm31 |
 3184   |---------|--------------------------------------------------------|
 3185   | 512-bit | zmm16  zmm17  zmm18  zmm19  zmm20  zmm21  zmm22  zmm23 |
 3186   |         | zmm24  zmm25  zmm26  zmm27  zmm28  zmm29  zmm30  zmm31 |
 3187   \------------------------------------------------------------------/
 3188 
 3189   In addition to new operand sizes and registers, the AVX-512 introduces
 3190 a number of supplementary settings that can be included in the operands
 3191 of AVX instructions.
  The destination operand of most AVX instructions can be followed
by the name of an opmask register enclosed in braces, this modifier
specifies a mask that decides which units of data in the destination
operand are going to be updated. The "k0" register cannot be used as a
destination mask. This setting can be further followed by "{z}" modifier
to choose that the data units not selected by mask should be zeroed
instead of leaving them unchanged.
 3199 
    vaddpd zmm1{k1},zmm5,zword [rsi]  ; update selected doubles
 3201     vaddps ymm6{k1}{z},ymm12,ymm24    ; update selected, zero other ones
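
  The mask itself has to be placed in the opmask register beforehand. A
minimal sketch, assuming that the "kmovw" instruction (not described in this
section) is used to load it from a general register:

    mov eax,00FFh
    kmovw k1,eax                  ; select the low eight units
    vaddps zmm1{k1}{z},zmm2,zmm3  ; add low eight floats, zero the rest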
 3202 
  When an instruction that operates on packed data has a source operand
loaded from memory, the memory location may be just a single unit of data
and the source used for the operation is created by broadcasting this
value into all the units within the required size. To specify that such
broadcasting method is used the memory operand should be followed by one
of the "{1to2}", "{1to4}", "{1to8}", "{1to16}", "{1to32}" and "{1to64}"
modifiers, selecting the appropriate multiple of a unit.
 3210 
 3211     vsubps zmm1,zmm2,dword [rsi] {1to16} ; subtract from all floats
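
  The multiple follows from dividing the size of vector by the size of unit.
A few more sample combinations (operands chosen arbitrarily) could look like:

    vaddpd zmm0,zmm1,qword [rsi] {1to8}  ; eight doubles in 512 bits
    vaddps ymm0,ymm1,dword [rsi] {1to8}  ; eight floats in 256 bits
    vaddpd xmm0,xmm1,qword [rsi] {1to2}  ; two doubles in 128 bits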
 3212 
  When an instruction does not use a memory operand, often an additional
operand may follow the source operands, containing the rounding mode
specifier. When an instruction has variants that operate on different
sizes of data, the rounding mode can be specified only when the
register operands are 512-bit.
 3218 
 3219     vdivps zmm2,zmm3,zmm5,{ru-sae}    ; round results up
 3220  
 3221    Table 2.7  AVX-512 rounding modes
 3222   /----------------------------------------------------------\
 3223   | Operand  | Description                                   |
 3224   |==========|===============================================|
 3225   | {rn-sae} | round to nearest and suppress all exceptions  |
 3226   | {rd-sae} | round down and suppress all exceptions        |
 3227   | {ru-sae} | round up and suppress all exceptions          |
 3228   | {rz-sae} | round toward zero and suppress all exceptions |
 3229   \----------------------------------------------------------/
 3230 
  Some of the instructions do not use a rounding mode but still allow
specifying the exception suppression option with "{sae}" modifier in the
additional operand.
 3234 
 3235     vmaxpd zmm0,zmm1,zmm2,{sae}       ; suppress all exceptions
 3236 
 3237   The family of "gather" instructions in their AVX-512 variants use a new
 3238 syntax with only two operands. The opmask register takes the role which
 3239 was played by the third operand in the AVX2 syntax and it is mandatory
 3240 in this case.
 3241 
 3242     vgatherdps xmm0{k1},[eax+xmm1]    ; gather four floats
 3243     vgatherdpd zmm0{k3},[ymm3*8]      ; gather eight doubles
 3244 
 3245   The new family of "scatter" instructions perform an operation reverse to
 3246 the one of "gather". They also take two operands, the destination is a 
 3247 memory with vector indexing and opmask modifier, and the source is a vector
 3248 register.
 3249 
 3250     vscatterdps [eax+xmm1]{k1},xmm0    ; scatter four floats
 3251     vscatterdpd [ymm3*8]{k3},zmm0      ; scatter eight doubles
 3252 
 3253   The AVX512_4VNNI extension introduces instructions with another unusual
 3254 syntax variant. The first source operand of "vp4dpwssd" or "vp4dpwssds"
 3255 instruction refers to an aligned block of four 512-bit registers, containing
 3256 the base register specified by the operand. This can be indicated by attaching
 3257 "+3" to the name of register, although it is optional.
 3258 
 3259     vp4dpwssd zmm1{k1}{z},zmm2+3,xword[rbx]
 3260 
 3261 
 3262 2.1.25  Other extensions of instruction set
 3263 
There is a number of additional instruction set extensions recognized by flat
assembler, and examples of syntax of the instructions introduced by those
extensions are provided here. For detailed information on the operations
performed by them, check out the manuals from Intel or AMD.
 3268   The Virtual-Machine Extensions (VMX) provide a set of instructions for the
 3269 management of virtual machines. The "vmxon" instruction, which enters the VMX
 3270 operation, requires a single 64-bit memory operand, which should be a physical
 3271 address of memory region, which the logical processor may use to support VMX
 3272 operation. The "vmxoff" instruction, which leaves the VMX operation, has no
 3273 operands. The "vmlaunch" and "vmresume", which launch or resume the virtual
 3274 machines, and "vmcall", which allows guest software to call the VM monitor, 
 3275 use no operands either.
  The "vmptrld" loads the physical address of current Virtual Machine Control
Structure (VMCS) from its memory operand, "vmptrst" stores the pointer to
current VMCS into address specified by its memory operand, and "vmclear" sets
the launch state of the VMCS referenced by its memory operand to clear. These
three instructions all require a single 64-bit memory operand.
  The "vmread" reads from VMCS a field specified by the source operand and
stores it into the destination operand. The source operand should be a
general purpose register, and the destination operand can be a register or
memory. The "vmwrite" writes into a VMCS field specified by the destination
operand the value provided by source operand. The source operand can be a
general purpose register or memory, and the destination operand must be a
register. The size of operands for those instructions should be 64-bit when
in long mode, and 32-bit otherwise.
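
  A sketch of these two instructions in long mode, with arbitrarily chosen
registers and assuming that the field to access has been selected in RBX:

    use64
    vmread rax,rbx       ; read the VMCS field selected by rbx
    vmread [rdi],rbx     ; the same, with destination in memory
    vmwrite rbx,rax      ; write rax into the same field
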
 3289   The "invept" and "invvpid" invalidate the translation lookaside buffers
 3290 (TLBs) and paging-structure caches, either derived from extended page tables
 3291 (EPT), or based on the virtual processor identifier (VPID). These instructions
 3292 require two operands, the first one being the general purpose register
 3293 specifying the type of invalidation, and the second one being a 128-bit
 3294 memory operand providing the invalidation descriptor. The first operand
 3295 should be a 64-bit register when in long mode, and 32-bit register otherwise.
  The Safer Mode Extensions (SMX) provide the functionalities available
through the "getsec" instruction. This instruction takes no operands, and
the function that is executed is determined by the contents of EAX register
upon executing this instruction.
 3300   The Secure Virtual Machine (SVM) is a variant of virtual machine extension
 3301 used by AMD. The "skinit" instruction securely reinitializes the processor
 3302 allowing the startup of trusted software, such as the virtual machine monitor
 3303 (VMM). This instruction takes a single operand, which must be EAX, and
 3304 provides a physical address of the secure loader block (SLB).
 3305   The "vmrun" instruction is used to start a guest virtual machine,
 3306 its only operand should be an accumulator register (AX, EAX or RAX, the
 3307 last one available only in long mode) providing the physical address of the
 3308 virtual machine control block (VMCB). The "vmsave" stores a subset of 
 3309 processor state into VMCB specified by its operand, and "vmload" loads the 
 3310 same subset of processor state from a specified VMCB. The same operand rules 
 3311 as for the "vmrun" apply to those two instructions.
 3312   "vmmcall" allows the guest software to call the VMM. This instruction takes
 3313 no operands.
  "stgi" sets the global interrupt flag to 1, and "clgi" zeroes it. These
instructions take no operands.
 3316   "invlpga" invalidates the TLB mapping for a virtual page specified by the
 3317 first operand (which has to be accumulator register) and address space
 3318 identifier specified by the second operand (which must be ECX register).
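
  The syntax of the SVM instructions described above can be summarized with
the following sample lines, written for 32-bit mode. They only show the forms
of operands and are not meant as a working sequence:

    vmload eax           ; VMCB address in accumulator
    vmrun eax
    vmsave eax
    invlpga eax,ecx      ; virtual page in eax, ASID in ecx
    clgi
    stgi
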
  The XSAVE set of instructions allows saving and restoring processor state
components. "xsave" and "xsaveopt" store the components of processor state
defined by bit mask in EDX and EAX registers into area defined by memory
operand. "xrstor" restores from the area specified by memory operand the
components of processor state defined by mask in EDX and EAX. The "xsave64",
"xsaveopt64" and "xrstor64" are 64-bit versions of these instructions, allowed
only in long mode.
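
  A minimal sketch of saving and restoring, assuming that EBX points to
a suitably large area aligned to 64 bytes and that the three lowest bits of
mask stand for x87, SSE and AVX state (as defined in the Intel manuals):

    mov eax,111b         ; x87, SSE and AVX state components
    xor edx,edx          ; no components from the high half of mask
    xsave [ebx]          ; store the selected components
    xrstor [ebx]         ; restore them with the same mask
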
  "xgetbv" reads the contents of 64-bit XCR (extended control register)
specified in ECX register into EDX and EAX registers. "xsetbv" writes the
contents of EDX and EAX into the 64-bit XCR specified by ECX register. These
instructions have no operands.
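
  For example, the register XCR0 could be read in the following way. The
meaning of the tested bit (AVX state enabled) is taken from the Intel manuals:

    xor ecx,ecx          ; select XCR0
    xgetbv               ; edx:eax = XCR0
    test eax,100b        ; check the bit of AVX state
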
 3330   The RDRAND extension introduces one new instruction, "rdrand", which loads
 3331 the hardware-generated random value into general register. It takes one
 3332 operand, which can be 16-bit, 32-bit or 64-bit register (with the last one 
 3333 being allowed only in long mode).
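
  According to the Intel manuals, "rdrand" sets the carry flag when a random
value was delivered and clears it otherwise, so a simple use might look like
the sketch below (the label name is arbitrary):

    try_again:
        rdrand eax       ; try to get a 32-bit random value
        jnc try_again    ; CF = 0 means no value was available
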
  The FSGSBASE extension adds long mode instructions that allow reading and
writing the segment base registers for FS and GS segments. "rdfsbase" and
"rdgsbase" read the corresponding segment base registers into operand, while
"wrfsbase" and "wrgsbase" write the value of operand into those registers.
All these instructions take one operand, which can be 32-bit or 64-bit general
register.
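
  Sample forms of these instructions, with arbitrarily chosen registers:

    use64
    rdfsbase rax         ; read the base of FS segment
    wrgsbase rbx         ; set the base of GS segment
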
  The INVPCID extension adds "invpcid" instruction, which invalidates mapping
in the TLBs and paging caches based on the invalidation type specified in
first operand and PCID invalidate descriptor specified in second operand.
The first operand should be 32-bit general register when not in long mode,
or 64-bit general register when in long mode. The second operand should be
128-bit memory location.
  The HLE and RTM extensions provide a set of instructions for the
transactional management. The "xacquire" and "xrelease" are new prefixes that
can be used with some of the instructions to start or end lock elision on the
memory address specified by prefixed instruction. The "xbegin" instruction
starts the transactional execution, its operand is the address of a fallback
routine that gets executed in case of transaction abort, specified like the
operand for near jump instruction. "xend" marks the end of transactional
execution region, it takes no operands. "xabort" forces the transaction abort,
it takes an 8-bit immediate value as its only operand, this value is passed in
the highest bits of EAX to the fallback routine. "xtest" checks whether there
is transactional execution in progress, this instruction takes no operands.
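
  The outline of a transactional region could be sketched as below. The
labels "fallback" and "done" and the variable "counter" are hypothetical
names, assumed to be defined by the surrounding program:

        xbegin fallback          ; start transaction
        inc dword [counter]      ; transactional update
        xend                     ; commit
        jmp done
    fallback:
        lock inc dword [counter] ; non-transactional path after abort
    done:
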
  The MPX extension adds instructions that operate on new bounds registers
and aid in checking the memory references. For some of these instructions
flat assembler allows a special syntax that gives a fine control over their
operation, where an address of a memory operand is separated into two parts
with a comma. With "bndmk" instruction the first part of such address specifies
the lower bound and the second one the upper bound. The lower bound can be
either zero or a register, the upper bound can be any address that uses no more
than one register (multiplied by 1, 2, 4, or 8). The addressing registers need
to be 64-bit when in long mode, and 32-bit otherwise.
 3366 
 3367     bndmk bnd0,[rbx,100000h] ; lower bound in register, upper directly
 3368     bndmk bnd1,[0,rbx]       ; lower bound zero, upper in register
 3369 
 3370 In case of "bndldx" and "bndstx", the first part of memory operand specifies an
 3371 address used to access a bound table entry, while the second part is either zero
 3372 or a register that plays a role of an additional operand for such instruction.
 3373 The address in the first part may use no more than one register and the register
 3374 cannot be multiplied by a number other than 1.
 3375 
 3376     bndstx [rcx,rsi],bnd3  ; store bnd3 and rsi at rcx in the bound table
 3377     bndldx bnd2,[rcx,rsi]  ; load from bound table if entry matches rsi
 3378 
 3379 
 3380 2.2  Control directives
 3381 
 3382 This section describes the directives that control the assembly process, they
 3383 are processed during the assembly and may cause some blocks of instructions
 3384 to be assembled differently or not assembled at all.
 3385 
 3386 
 3387 2.2.1  Numerical constants
 3388 
 3389 The "=" directive allows to define the numerical constant. It should be
 3390 preceded by the name for the constant and followed by the numerical expression
 3391 providing the value. The value of such constants can be a number or an address,
 3392 but - unlike labels - the numerical constants are not allowed to hold the
 3393 register-based addresses. Besides this difference, in their basic variant
 3394 numerical constants behave very much like labels and you can even
 3395 forward-reference them (access their values before they actually get defined).
  There is, however, a second variant of numerical constants, which is
recognized by assembler when you try to define a constant under a name which
was already used to define a numerical constant. In such case assembler
treats that constant as an assembly-time variable and allows it to be assigned
with new value, but forbids forward-referencing it (for obvious reasons). Let's
see both the variants of numerical constants in one example:
 3402 
 3403     dd sum
 3404     x = 1
 3405     x = x+2
 3406     sum = x
 3407 
 3408 Here the "x" is an assembly-time variable, and every time it is accessed, the
 3409 value that was assigned to it the most recently is used. Thus if we tried to
 3410 access the "x" before it gets defined the first time, like if we wrote "dd x"
 3411 in place of the "dd sum" instruction, it would cause an error. And when it is
 3412 re-defined with the "x = x+2" directive, the previous value of "x" is used to
 3413 calculate the new one. So when the "sum" constant gets defined, the "x" has
 3414 value of 3, and this value is assigned to the "sum". Since this one is defined
 3415 only once in source, it is the standard numerical constant, and can be
 3416 forward-referenced. So the "dd sum" is assembled as "dd 3". To read more about
 3417 how the assembler is able to resolve this, see section 2.2.6.
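
  As an additional sketch of an assembly-time variable (the name "index" is
arbitrary), each access below uses the value assigned most recently:

    index = 0
    db index             ; assembled as "db 0"
    index = index+1
    db index             ; assembled as "db 1"
    index = index+1
    db index             ; assembled as "db 2"
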
 3418   The value of numerical constant can be preceded by size operator, which can
 3419 ensure that the value will fit in the range for the specified size, and can
 3420 affect also how some of the calculations inside the numerical expression are
 3421 performed. This example:
 3422 
 3423     c8 = byte -1
 3424     c32 = dword -1
 3425 
 3426 defines two different constants, the first one fits in 8 bits, the second one
 3427 fits in 32 bits.
 3428   When you need to define constant with the value of address, which may be
 3429 register-based (and thus you cannot employ numerical constant for this
 3430 purpose), you can use the extended syntax of "label" directive (already
 3431 described in section 1.2.3), like:
 3432 
 3433     label myaddr at ebp+4
 3434 
 3435 which declares label placed at "ebp+4" address. However remember that labels,
 3436 unlike numerical constants, cannot become assembly-time variables.
 3437 
 3438 
 3439 2.2.2  Conditional assembly
 3440 
 3441 "if" directive causes some block of instructions to be assembled only under
 3442 certain condition. It should be followed by logical expression specifying the
 3443 condition, instructions in next lines will be assembled only when this
 3444 condition is met, otherwise they will be skipped. The optional "else if"
 3445 directive followed with logical expression specifying additional condition
 3446 begins the next block of instructions that will be assembled if previous
 3447 conditions were not met, and the additional condition is met. The optional
 3448 "else" directive begins the block of instructions that will be assembled if
 3449 all the conditions were not met. The "end if" directive ends the last block of
 3450 instructions.
 3451   You should note that "if" directive is processed at assembly stage and
 3452 therefore it doesn't affect any preprocessor directives, like the definitions
 3453 of symbolic constants and macroinstructions - when the assembler recognizes the
 3454 "if" directive, all the preprocessing has been already finished.
  The logical expression consists of logical values and logical operators. The
logical operators are "~" for logical negation, "&" for logical and, "|" for
logical or. The negation has the highest priority. Logical value can be a
numerical expression, it will be false if it is equal to zero, otherwise it
will be true. Two numerical expressions can be compared using one of the
following operators to make the logical value: "=" (equal), "<" (less),
">" (greater), "<=" (less or equal), ">=" (greater or equal),
"<>" (not equal).
 3463   The "used" operator followed by a symbol name, is the logical value that
 3464 checks whether the given symbol is used somewhere (it returns correct result
 3465 even if symbol is used only after this check). The "defined" operator can be
 3466 followed by any expression, usually just by a single symbol name; it checks
 3467 whether the given expression contains only symbols that are defined in the
 3468 source and accessible from the current position. The "definite" operator
 3469 does a similar check with restriction to symbols defined before current
 3470 position in source.
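        As a hedged illustration of the "used" operator, the sketch below
      assembles a small routine only when its label is referenced somewhere in
      the source. The name "clear_buffer" is an assumption made for this example:

          if used clear_buffer
          clear_buffer:
              xor al,al       ; value to fill with
              rep stosb       ; fill the destination buffer
              ret
          end if

      When nothing refers to "clear_buffer", the whole routine is left out of the
      output.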
 3471   With "relativeto" operator it is possible to check whether values of two
 3472 expressions differ only by constant amount. The valid syntax is a numerical
 3473 expression followed by "relativeto" and then another expression (possibly
 3474 register-based). Labels that have no simple numerical value can be tested
 3475 this way to determine what kind of operations may be possible with them.
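        A minimal sketch of such a test, assuming one only wants to know whether
      the current address is a plain number (that is, differs from zero only by
      a constant amount) or a relocatable value:

          if $ relativeto 0
              display 'addresses are absolute here',13,10
          else
              display 'addresses are relocatable here',13,10
          end if

      With the flat binary output, where addresses are absolute, the first branch
      would be expected to be taken.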
 3476   The following simple example uses the "count" constant that should be
 3477 defined somewhere in source:
 3478 
 3479     if count>0
 3480         mov cx,count
 3481         rep movsb
 3482     end if
 3483 
 3484 These two assembly instructions will be assembled only if the "count" constant
 3485 is greater than 0. The next sample shows more complex conditional structure:
 3486 
 3487     if count & ~ count mod 4
 3488         mov cx,count/4
 3489         rep movsd
 3490     else if count>4
 3491         mov cx,count/4
 3492         rep movsd
 3493         mov cx,count mod 4
 3494         rep movsb
 3495     else
 3496         mov cx,count
 3497         rep movsb
 3498     end if
 3499 
 3500 The first block of instructions gets assembled when the "count" is non zero and
 3501 divisible by four, if this condition is not met, the second logical expression,
 3502 which follows the "else if", is evaluated and if it's true, the second block
 3503 of instructions gets assembled, otherwise the last block of instructions, which
 3504 follows the line containing only "else", is assembled.
 3505   There are also operators that allow comparison of values being any chains of
 3506 symbols. The "eq" compares whether two such values are exactly the same.
 3507 The "in" operator checks whether given value is a member of the list of values
 3508 following this operator, the list should be enclosed between "<" and ">"
 3509 characters, its members should be separated with commas. The symbols are
 3510 considered the same when they have the same meaning for the assembler - for
 3511 example "pword" and "fword" for assembler are the same and thus are not
 3512 distinguished by the above operators. In the same way "16 eq 10h" is the true
 3513 condition, however "16 eq 10+4" is not.
 3514   The "eqtype" operator checks whether the two compared values have the same
 3515 structure, and whether the structural elements are of the same type. The
 3516 distinguished types include numerical expressions, individual quoted strings,
 3517 floating point numbers, address expressions (the expressions enclosed in square
 3518 brackets or preceded by "ptr" operator), instruction mnemonics, registers, size
 3519 operators, jump type and code type operators. And each of the special
 3520 characters that act as separators, like comma or colon, is a separate type
 3521 itself. For example, two values, each one consisting of register name followed
 3522 by comma and numerical expression, will be regarded as of the same type, no
 3523 matter what kind of register and how complicated numerical expression is used;
 3524 with exception for the quoted strings and floating point values, which are the
 3525 special kinds of numerical expressions and are treated as different types. Thus
 3526 "eax,16 eqtype fs,3+7" condition is true, but "eax,16 eqtype eax,1.6" is false.
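        The following sketch shows one way "eqtype" might be used inside a
      macroinstruction (macroinstructions are described in 2.3.3) to treat a
      quoted string differently from other values. The macroinstruction name
      "defdata" and its arguments are assumptions made for this example:

          macro defdata name,value
           {
            if value eqtype ''
              name db value,0     ; quoted string: store it with terminating zero
            else
              name dd value       ; anything else: store it as a double word
            end if
           }

      With this definition "defdata msg,'Hello'" defines a zero-terminated string,
      while "defdata num,1234" defines a double word.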
 3527 
 3528 
 3529 2.2.3  Repeating blocks of instructions
 3530 
 3531 "times" directive repeats one instruction specified number of times. It
 3532 should be followed by numerical expression specifying number of repeats and
 3533 the instruction to repeat (optionally colon can be used to separate number and
 3534 instruction). When special symbol "%" is used inside the instruction, it is
 3535 equal to the number of current repeat. For example "times 5 db %" will define
 3536 five bytes with values 1, 2, 3, 4, 5. Recursive use of "times" directive is
 3537 also allowed, so "times 3 times % db %" will define six bytes with values
 3538 1, 1, 2, 1, 2, 3.
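        A frequently seen use of "times" is padding the output to a fixed size.
      The sketch below assumes a flat binary output whose code starts at the
      beginning of the file and has to fill exactly 512 bytes ("$$", the base
      address of current addressing space, is described in 2.2.4):

          times 510-($-$$) db 0   ; zero bytes up to offset 510
          dw 0AA55h               ; two-byte signature at offsets 510 and 511

      If the preceding code is already longer than 510 bytes, the expression
      becomes negative and is not a valid number of repeats, so the assembly can
      be expected to stop with an error.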
 3539   "repeat" directive repeats the whole block of instructions. It should be
 3540 followed by numerical expression specifying number of repeats. Instructions
 3541 to repeat are expected in next lines, ended with the "end repeat" directive,
 3542 for example:
 3543 
 3544     repeat 8
 3545         mov byte [bx],%
 3546         inc bx
 3547     end repeat
 3548 
 3549 The generated code will store byte values from one to eight in the memory
 3550 addressed by BX register.
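        Since "%" can be used in any numerical expression inside the block,
      "repeat" is also handy for generating tables. A small sketch (the label
      "squares" is an assumption of this example):

          squares:
          repeat 16
              db (%-1)*(%-1)      ; squares of the numbers from 0 to 15
          end repeat

      The last value defined this way is 225, which still fits in a byte.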
 3551   Number of repeats can be zero, in that case the instructions are not
 3552 assembled at all.
 3553   The "break" directive allows you to stop repeating earlier and continue
 3554 assembly from the first line after the "end repeat". Combined with the "if"
 3555 directive it allows you to stop repeating under some special condition, like:
 3556 
 3557     s = x/2
 3558     repeat 100
 3559         if x/s = s
 3560             break
 3561         end if
 3562         s = (s+x/s)/2
 3563     end repeat
 3564 
 3565   The "while" directive repeats the block of instructions as long as the
 3566 condition specified by the logical expression following it is true. The block
 3567 of instructions to be repeated should end with the "end while" directive.
 3568 Before each repetition the logical expression is evaluated and when its value
 3569 is false, the assembly is continued starting from the first line after the
 3570 "end while". Also in this case the "%" symbol holds the number of current
 3571 repeat. The "break" directive can be used to stop this kind of loop in the same
 3572 way as with "repeat" directive. The previous sample can be rewritten to use the
 3573 "while" instead of "repeat" this way:
 3574 
 3575     s = x/2
 3576     while x/s <> s
 3577         s = (s+x/s)/2
 3578         if % = 100
 3579             break
 3580         end if
 3581     end while
 3582 
 3583   The blocks defined with "if", "repeat" and "while" can be nested in any
 3584 order, however they should be closed in the same order in which they were
 3585 started. The "break" directive always stops processing the block that was
 3586 started last with either the "repeat" or "while" directive.
 3587 
 3588 
 3589 2.2.4  Addressing spaces
 3590 
 3591 "org" directive sets address at which the following code is expected to
 3592 appear in memory. It should be followed by numerical expression specifying
 3593 the address. This directive begins the new addressing space, the following
 3594 code itself is not moved in any way, but all the labels defined within it
 3595 and the value of "$" symbol are affected as if it was put at the given
 3596 address. However it's the responsibility of programmer to put the code at
 3597 correct address at run-time.
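        For instance, code that is known to be loaded at address 7C00h (as is the
      case for PC boot sectors) could begin as in this sketch. The label names
      are assumptions of this example:

          org 7C00h
          start:
              mov si,message      ; address of message, counted from 7C00h
              jmp $               ; endless loop
          message db 'Boot',0

      The output file still begins with the "mov" instruction, only the values
      of labels are shifted by 7C00h.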
 3598   The "load" directive allows you to define a constant with a binary value
 3599 loaded from the already assembled code. This directive should be followed by
 3600 the name of the constant, then optionally size operator, then "from" operator
 3601 and a numerical expression specifying a valid address in current addressing
 3602 space. The size operator has unusual meaning in this case - it states how many
 3603 bytes (up to 8) have to be loaded to form the binary value of constant. If no
 3604 size operator is specified, one byte is loaded (thus value is in range from 0
 3605 to 255). The loaded data cannot exceed current offset.
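        A short sketch of "load", reading back a byte that has just been
      generated (the constant name "opcode" is an assumption of this example):

              nop
              load opcode byte from $-1   ; opcode = 90h, the encoding of "nop"

      The address "$-1" points at the last byte generated so far, so it lies
      within the already assembled code.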
 3606   The "store" directive can modify the already generated code by replacing
 3607 some of the previously generated data with the value defined by given
 3608 numerical expression, which follows. The expression can be preceded by the
 3609 optional size operator to specify how large value the expression defines, and
 3610 therefore how many bytes will be stored; if there is no size operator, the
 3611 size of one byte is assumed. Then the "at" operator and the numerical
 3612 expression defining the valid address in current addressing space, at which
 3613 the given value has to be stored, should follow. This is a directive for
 3614 advanced applications and should be used carefully.
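        As an illustration, the sketch below first reserves a placeholder byte
      and later overwrites it with the length of the text that follows it. The
      names "message" and "text_size" are assumptions of this example:

          message:
              db 0                        ; placeholder for the length
              db 'Hello'
          text_size = $ - message - 1     ; 5
              store byte text_size at message

      After this the byte at "message" holds the value 5 instead of 0.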
 3615   Both "load" and "store" directives in their basic variant (defined above)
 3616 are limited to operate on places in current addressing space. The "$$" symbol
 3617 is always equal to the base address of current addressing space, and the "$"
 3618 symbol is the address of current position in that addressing space, therefore
 3619 these two values define limits of the area, where "load" and "store" can
 3620 operate.
 3621   Combining the "load" and "store" directives allows you to do things like
 3622 encoding some of the already generated code. For example to encode the whole
 3623 code generated in current addressing space you can use this block:
 3624 
 3625     repeat $-$$
 3626         load a byte from $$+%-1
 3627         store byte a xor c at $$+%-1
 3628     end repeat
 3629 
 3630 and each byte of code will be xored with the value defined by "c" constant.
 3631   "virtual" defines virtual data at specified address. This data will not be
 3632 included in the output file, but labels defined there can be used in other
 3633 parts of source. This directive can be followed by "at" operator and the
 3634 numerical expression specifying the address for virtual data, otherwise it
 3635 uses current address, the same as "virtual at $". Instructions defining data
 3636 are expected in next lines, ended with "end virtual" directive. The block of
 3637 virtual instructions itself is an independent addressing space, after it's
 3638 ended, the context of previous addressing space is restored.
 3639   The "virtual" directive can be used to create union of some variables, for
 3640 example:
 3641 
 3642     GDTR dp ?
 3643     virtual at GDTR
 3644         GDT_limit dw ?
 3645         GDT_address dd ?
 3646     end virtual
 3647 
 3648 It defines two labels for parts of the 48-bit variable at "GDTR" address.
 3649   It can be also used to define labels for some structures addressed by a
 3650 register, for example:
 3651 
 3652     virtual at bx
 3653         LDT_limit dw ?
 3654         LDT_address dd ?
 3655     end virtual
 3656 
 3657 With such definition instruction "mov ax,[LDT_limit]" will be assembled
 3658 to the same instruction as "mov ax,[bx]".
 3659   Declaring defined data values or instructions inside the virtual block could
 3660 also be useful, because the "load" directive may be used to load the values
 3661 from the virtually generated code into constants. This directive in its basic
 3662 version should be used after the code it loads but before the virtual block
 3663 ends, because it can only load the values from the same addressing space. For
 3664 example:
 3665 
 3666     virtual at 0
 3667         xor eax,eax
 3668         and edx,eax
 3669         load zeroq dword from 0
 3670     end virtual
 3671 
 3672 The above piece of code will define the "zeroq" constant containing four bytes
 3673 of the machine code of the instructions defined inside the virtual block.
 3674 This method can be also used to load some binary value from external file.
 3675 For example this code:
 3676 
 3677     virtual at 0
 3678         file 'a.txt':10h,1
 3679         load char from 0
 3680     end virtual
 3681 
 3682 loads the single byte from offset 10h in file "a.txt" into the "char"
 3683 constant.
 3684   Instead of or in addition to an "at" argument, "virtual" can also be followed
 3685 by an "as" keyword and a string defining an extension of additional file where
 3686 the initialized content of the addressing space started by "virtual" is going
 3687 to be stored at the end of a successful assembly.
 3688 
 3689     virtual at 0 as 'asc'
 3690         times 256 db %-1
 3691     end virtual
 3692 
 3693   Any of the "section" directives described in 2.4 also begins a new
 3694 addressing space.
 3695   It is possible to declare a special kind of label that marks the current
 3696 addressing space, by appending a double colon instead of a single one after a
 3697 label name. This symbol cannot then be used in numerical expressions, the only
 3698 place where it is allowed to use it is the extended syntax of "load" and
 3699 "store" directives. It is possible to make these directives operate on a
 3700 different addressing space than the current one, by specifying address with
 3701 the two components: first the name of a special label that marks the
 3702 addressing space, followed by the colon character and a numerical expression
 3703 defining a valid address inside that addressing space. In the following
 3704 example this extended syntax is used to load the value from a block after it
 3705 has been closed:
 3706 
 3707     virtual at 0
 3708         hex_digits::
 3709         db '0123456789ABCDEF'
 3710     end virtual
 3711     load a byte from hex_digits:10
 3712 
 3713   This way it is possible to operate on values inside any code block,
 3714 including all the ones defined with "virtual". However it is not allowed to
 3715 specify addressing space that has not been assembled yet, just as it is not
 3716 allowed to specify an address in the current addressing space that exceeds
 3717 the current offset. The addresses in any other addressing space are also
 3718 limited by the boundaries of the block.
 3719   The "virtual" directive can have a previously defined addressing space
 3720 label as the only argument. This variant allows extending a previously defined
 3721 and closed block with additional data. Any definition of data within
 3722 an extending block is going to have the same effect as if that definition was
 3723 present in the original "virtual" block.
 3724 
 3725     virtual at 0 as 'log'
 3726         Log::
 3727     end virtual
 3728 
 3729     virtual Log
 3730         db 'Hello!',13,10
 3731     end virtual
 3732 
 3733 
 3734 2.2.5  Other directives
 3735 
 3736 "align" directive aligns code or data to the specified boundary. It should
 3737 be followed by a numerical expression specifying the number of bytes, to the
 3738 multiple of which the current address has to be aligned. The boundary value
 3739 has to be a power of two.
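        For example, to place a table of double words on a 4-byte boundary, one
      might write the following (the label "table" is an assumption of this
      sketch):

              db 1,2,3        ; three bytes of data
              align 4         ; skip bytes until the address is a multiple of 4
          table dd 10,20,30

      How many bytes get skipped depends on the address reached before "align".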
 3740   The "align" directive fills the bytes that had to be skipped to perform the
 3741 alignment with the "nop" instructions and at the same time marks this area as
 3742 uninitialized data, so if it is placed among other uninitialized data that
 3743 wouldn't take space in the output file, the alignment bytes will act the same
 3744 way. If you need to fill the alignment area with some other values, you can
 3745 combine "align" with "virtual" to get the size of alignment needed and then
 3746 create the alignment yourself, like:
 3747 
 3748     virtual
 3749         align 16
 3750         a = $ - $$
 3751     end virtual
 3752     db a dup 0
 3753 
 3754 The "a" constant is defined to be the difference between address after
 3755 alignment and address of the "virtual" block (see previous section), so it is
 3756 equal to the size of needed alignment space.
 3757   "display" directive displays the message at the assembly time. It should
 3758 be followed by the quoted strings or byte values, separated with commas. It
 3759 can be used to display values of some constants, for example:
 3760 
 3761     bits = 16
 3762     display 'Current offset is 0x'
 3763     repeat bits/4
 3764         d = '0' + $ shr (bits-%*4) and 0Fh
 3765         if d > '9'
 3766             d = d + 'A'-'9'-1
 3767         end if
 3768         display d
 3769     end repeat
 3770     display 13,10
 3771 
 3772 This block of directives calculates the four hexadecimal digits of 16-bit
 3773 value and converts them into characters for displaying. Note that this will 
 3774 not work if the addresses in current addressing space are relocatable (as it
 3775 might happen with PE or object output formats), since only absolute values can
 3776 be used this way. The absolute value may be obtained by calculating the 
 3777 relative address, like "$-$$", or "rva $" in case of PE format.
 3778   The "err" directive immediately terminates the assembly process when it is
 3779 encountered by assembler.
 3780   The "assert" directive tests whether the logical expression that follows it
 3781 is true, and if not, it signals an error.
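        A sketch of a typical check, assuming a constant named "BUFFER_SIZE"
      that is expected to be a multiple of 16:

          BUFFER_SIZE = 64
          assert BUFFER_SIZE mod 16 = 0

      Changing the constant to a value like 60 would make this assertion fail.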
 3782 
 3783 
 3784 2.2.6  Multiple passes
 3785 
 3786 Because the assembler allows referencing some of the labels or constants
 3787 before they get actually defined, it has to predict the values of such labels
 3788 and if there is even a suspicion that prediction failed in at least one case,
 3789 it does one more pass, assembling the whole source, this time doing better
 3790 prediction based on the values the labels got in the previous pass.
 3791   The changing values of labels can cause some instructions to have encodings
 3792 of different length, and this can cause the change in values of labels again.
 3793 And since the labels and constants can also be used inside the expressions that
 3794 affect the behavior of control directives, the whole block of source can be
 3795 processed completely differently during the new pass. Thus the assembler does
 3796 more and more passes, each time trying to do better predictions to approach
 3797 the final solution, when all the values get predicted correctly. It uses
 3798 various methods for predicting the values, which have been chosen to allow
 3799 finding in a few passes the solution of possibly smallest length for most
 3800 programs.
 3801   Some of the errors, like the values not fitting in required boundaries, are
 3802 not signaled during those intermediate passes, since it may happen that when
 3803 some of the values are predicted better, these errors will disappear. However
 3804 if assembler meets some illegal syntax construction or unknown instruction, it
 3805 always stops immediately. Also defining some label more than once causes such
 3806 error, because it makes the predictions groundless.
 3807   Only the messages created with the "display" directive during the last
 3808 performed pass get actually displayed. In case when the assembly has been
 3809 stopped due to an error, these messages may reflect the predicted values that
 3810 are not yet resolved correctly.
 3811   The solution may sometimes not exist and in such cases the assembler will
 3812 never manage to make correct predictions - for this reason there is a limit for
 3813 a number of passes, and when assembler reaches this limit, it stops and
 3814 displays the message that it is not able to generate the correct output.
 3815 Consider the following example:
 3816 
 3817     if ~ defined alpha
 3818         alpha:
 3819     end if
 3820 
 3821 The "defined" operator gives the true value when the expression following it
 3822 could be calculated in this place, which in this case means that the "alpha"
 3823 label is defined somewhere. But the above block causes this label to be defined
 3824 only when the value given by "defined" operator is false, which leads to an
 3825 antinomy and makes it impossible to resolve such code. When processing the "if"
 3826 directive assembler has to predict whether the "alpha" label will be defined
 3827 somewhere (it wouldn't have to predict only if the label was already defined
 3828 earlier in this pass), and whatever the prediction is, the opposite always
 3829 happens. Thus the assembly will fail, unless the "alpha" label is defined
 3830 somewhere in source preceding the above block of instructions - in such case,
 3831 as it was already noted, the prediction is not needed and the block will just
 3832 get skipped.
 3833   The above sample might have been written as an attempt to define the label
 3834 only when it was not yet defined. It fails, because the "defined" operator does
 3835 check whether the label is defined anywhere, and this includes the definition
 3836 inside this conditionally processed block. It could be easily corrected by
 3837 using "definite" operator instead of "defined". But there is also another
 3838 modification that could get it resolved:
 3839 
 3840     if ~ defined alpha | defined @f
 3841         alpha:
 3842         @@:
 3843     end if
 3844 
 3845 The "@f" is always the same label as the nearest "@@" symbol in the source
 3846 following it, so the above sample would mean the same if any unique name was
 3847 used instead of the anonymous label. When "alpha" is not defined in any other
 3848 place in source, the only possible solution is when this block gets defined,
 3849 and this time this doesn't lead to the antinomy, because of the anonymous
 3850 label which makes this block self-establishing. To better understand this,
 3851 look at the block that has nothing more than this self-establishing:
 3852 
 3853     if defined @f
 3854         @@:
 3855     end if
 3856 
 3857 This is an example of source that may have more than one solution, as both
 3858 cases when this block gets processed or not are equally correct. Which one of
 3859 those two solutions we get depends on the algorithm of the assembler, in case
 3860 of flat assembler - on the algorithm of predictions. Back to the previous
 3861 sample, when "alpha" is not defined anywhere else, the condition for "if" block
 3862 cannot be false, so we are left with only one possible solution, and we can
 3863 hope the assembler will arrive at it. On the other hand, when "alpha" is
 3864 defined in some other place, we've got two possible solutions again, but one of
 3865 them causes "alpha" to be defined twice, and such an error causes assembler to
 3866 abort the assembly immediately, as this is the kind of error that deeply
 3867 disturbs the process of resolving. So we can get such source either correctly
 3868 resolved or causing an error, and what we get may depend on the internal
 3869 choices made by the assembler.
 3870   However there are some facts about such choices that are certain. When
 3871 assembler has to check whether the given symbol is defined and it was already
 3872 defined in the current pass, no prediction is needed - it was already noted
 3873 above. And when the given symbol has been defined never before, including all
 3874 the already finished passes, the assembler predicts it to be not defined.
 3875 Knowing this, we can expect that the simple self-establishing block shown
 3876 above will not be assembled at all and that the previous sample will resolve
 3877 correctly when "alpha" is defined somewhere before our conditional block,
 3878 while it will itself define "alpha" when it's not already defined earlier, thus
 3879 potentially causing the error because of double definition if the "alpha" is
 3880 also defined somewhere later.
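        The correction with the "definite" operator mentioned earlier, which
      takes into account only the definitions preceding the current position,
      can be sketched as:

          if ~ definite alpha
              alpha:
          end if

      The label defined inside this block lies after the check, so "definite"
      should not see it and the antinomy does not arise.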
 3881   The "used" operator may be expected to behave in a similar manner in
 3882 analogous cases, however any other kinds of predictions may not be so simple and
 3883 you should never rely on them this way.
 3884   The "err" directive, usually used to stop the assembly when some condition is
 3885 met, stops the assembly immediately, regardless of whether the current pass
 3886 is final or intermediate. So even when the condition that caused this directive
 3887 to be interpreted is mispredicted and temporary, and would eventually disappear 
 3888 in the later passes, the assembly is stopped anyway.
 3889   The "assert" directive signals the error only if its expression is false
 3890 after all the symbols have been resolved. You can use "assert 0" in place of
 3891 "err" when you do not want to have assembly stopped during the intermediate
 3892 passes.
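        A sketch of this replacement, assuming a piece of code that must not
      grow beyond 512 bytes:

          if $-$$ > 512
              assert 0        ; unlike "err", does not stop intermediate passes
          end if

      The same check could also be written directly as "assert $-$$ <= 512".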
 3893 
 3894 
 3895 2.3  Preprocessor directives
 3896 
 3897 All preprocessor directives are processed before the main assembly process,
 3898 and therefore are not affected by the control directives. At this time also
 3899 all comments are stripped out.
 3900 
 3901 
 3902 2.3.1  Including source files
 3903 
 3904 "include" directive includes the specified source file at the position where
 3905 it is used. It should be followed by the quoted name of file that should be
 3906 included, for example:
 3907 
 3908     include 'macros.inc'
 3909 
 3910 The whole included file is preprocessed before preprocessing the lines next
 3911 to the line containing the "include" directive. There are no limits to the
 3912 number of included files as long as they fit in memory.
 3913   The quoted path can contain environment variables enclosed within "%"
 3914 characters, they will be replaced with their values inside the path, both the
 3915 "\" and "/" characters are allowed as a path separators. The file is first 
 3916 searched for in the directory containing file which included it and when it is
 3917 not found there, the search is continued in the directories specified in the 
 3918 environment variable called INCLUDE (the multiple paths separated with 
 3919 semicolons can be defined there, they will be searched in the same order as 
 3920 specified). If file was not found in any of these places, preprocessor looks
 3921 for it in the directory containing the main source file (the one specified in 
 3922 command line). These rules concern also paths given with the "file" directive.
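        For example, assuming an environment variable named "FASMINC" that holds
      the path to a directory with include files, and a file "win32a.inc" placed
      there (both names are assumptions of this sketch):

          include '%FASMINC%/win32a.inc'

      Before the file is opened, "%FASMINC%" is replaced with the value of that
      variable.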
 3923 
 3924 
 3925 2.3.2  Symbolic constants
 3926 
 3927 The symbolic constants are different from the numerical constants, before the
 3928 assembly process they are replaced with their values everywhere in source
 3929 lines after their definitions, and anything can become their values.
 3930   The definition of symbolic constant consists of name of the constant
 3931 followed by the "equ" directive. Everything that follows this directive will
 3932 become the value of constant. If the value of symbolic constant contains
 3933 other symbolic constants, they are replaced with their values before assigning
 3934 this value to the new constant. For example:
 3935 
 3936     d equ dword
 3937     NULL equ d 0
 3938     d equ edx
 3939 
 3940 After these three definitions the value of "NULL" constant is "dword 0" and
 3941 the value of "d" is "edx". So, for example, "push NULL" will be assembled as
 3942 "push dword 0" and "push d" will be assembled as "push edx". And if then the
 3943 following line was put:
 3944 
 3945     d equ d,eax
 3946 
 3947 the "d" constant would get the new value of "edx,eax". This way the growing
 3948 lists of symbols can be defined.
 3949   "restore" directive allows you to get back the previous value of a redefined
 3950 symbolic constant. It should be followed by one or more names of symbolic
 3951 constants, separated with commas. So "restore d" after the above definitions
 3952 will give "d" constant back the value "edx", the second one will restore it to
 3953 value "dword", and one more will revert "d" to original meaning as if no such
 3954 constant was defined. If there was no constant defined of given name,
 3955 "restore" will not cause an error, it will be just ignored.
 3956   Symbolic constant can be used to adjust the syntax of assembler to personal
 3957 preferences. For example the following set of definitions provides the handy
 3958 shortcuts for all the size operators:
 3959 
 3960     b equ byte
 3961     w equ word
 3962     d equ dword
 3963     p equ pword
 3964     f equ fword
 3965     q equ qword
 3966     t equ tword
 3967     x equ dqword
 3968     y equ qqword
 3969 
 3970   Because symbolic constant may also have an empty value, it can be used to
 3971 allow the syntax with "offset" word before any address value:
 3972 
 3973     offset equ
 3974 
 3975 After this definition "mov ax,offset char" will be valid construction for
 3976 copying the offset of "char" variable into "ax" register, because "offset" is
 3977 replaced with an empty value, and therefore ignored.
 3978   The "define" directive followed by the name of constant and then the value,
 3979 is the alternative way of defining symbolic constant. The only difference
 3980 between "define" and "equ" is that "define" assigns the value as it is, it does
 3981 not replace the symbolic constants with their values inside it.
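        The difference can be seen in the following sketch, in which all the
      names are assumptions made for this example:

          num equ 1
          v1 equ num          ; "num" is replaced first, the value of "v1" is "1"
          define v2 num       ; the value of "v2" is the symbol "num" itself

      The value of "v1" was fixed when it was defined, while "v2" keeps the name
      "num" as its value.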
 3982   Symbolic constants can also be defined with the "fix" directive, which has
 3983 the same syntax as "equ", but defines constants of high priority - they are
 3984 replaced with their symbolic values even before processing the preprocessor
 3985 directives and macroinstructions, the only exception is "fix" directive
 3986 itself, which has the highest possible priority, so it allows redefinition of
 3987 constants defined this way.
 3988   The "fix" directive can be used for syntax adjustments related to directives
 3989 of preprocessor, which cannot be done with "equ" directive. For example:
 3990 
 3991     incl fix include
 3992 
 3993 defines a short name for "include" directive, while the similar definition done
 3994 with "equ" directive wouldn't give such result, as standard symbolic constants
 3995 are replaced with their values after searching the line for preprocessor
 3996 directives.
 3997 
 3998 
 3999 2.3.3  Macroinstructions
 4000 
 4001 "macro" directive allows you to define your own complex instructions, called
 4002 macroinstructions, the use of which can greatly simplify the process of
 4003 programming. In its simplest form it's similar to symbolic constant
 4004 definition. For example the following definition defines a shortcut for the
 4005 "test al,0xFF" instruction:
 4006 
 4007     macro tst {test al,0xFF}
 4008 
 4009 After the "macro" directive there is a name of macroinstruction and then its
 4010 contents enclosed between the "{" and "}" characters. You can use "tst"
 4011 instruction anywhere after this definition and it will be assembled as
 4012 "test al,0xFF". Defining symbolic constant "tst" of that value would give the
 4013 similar result, but the difference is that the name of macroinstruction is
 4014 recognized only as an instruction mnemonic. Also, macroinstructions are
 4015 replaced with corresponding code even before the symbolic constants are
 4016 replaced with their values. So if you define macroinstruction and symbolic
 4017 constant of the same name, and use this name as an instruction mnemonic, it
 4018 will be replaced with the contents of macroinstruction, but it will be
 4019 replaced with the value of symbolic constant if used somewhere inside the
 4020 operands.
 4021   The definition of macroinstruction can consist of many lines, because
 4022 "{" and "}" characters don't have to be in the same line as "macro" directive.
 4023 For example:
 4024 
 4025     macro stos0
 4026      {
 4027         xor al,al
 4028         stosb
 4029      }
 4030 
 4031 The macroinstruction "stos0" will be replaced with these two assembly
 4032 instructions anywhere it's used.
 4033   Like instructions which need some number of operands, the macroinstruction
 4034 can be defined to need some number of arguments separated with commas. The
 4035 names of needed arguments should follow the name of macroinstruction in the
 4036 line of "macro" directive and should be separated with commas if there is more
 4037 than one. Anywhere one of these names occurs in the contents of
 4038 macroinstruction, it will be replaced with corresponding value, provided when
 4039 the macroinstruction is used. Here is an example of a macroinstruction that
 4040 will do data alignment for binary output format:
 4041 
 4042     macro align value { rb (value-1)-($+value-1) mod value }
 4043 
 4044 When the "align 4" instruction is found after this macroinstruction is
 4045 defined, it will be replaced with contents of this macroinstruction, and the
 4046 "value" will there become 4, so the result will be "rb (4-1)-($+4-1) mod 4".
 4047   If a macroinstruction is defined that uses an instruction with the same name
 4048 inside its definition, the previous meaning of this name is used. Useful
 4049 redefinition of macroinstructions can be done in that way, for example:
 4050 
 4051     macro mov op1,op2
 4052      {
 4053       if op1 in <ds,es,fs,gs,ss> & op2 in <cs,ds,es,fs,gs,ss>
 4054         push  op2
 4055         pop   op1
 4056       else
 4057         mov   op1,op2
 4058       end if
 4059      }
 4060 
 4061 This macroinstruction extends the syntax of "mov" instruction, allowing both
 4062 operands to be segment registers. For example "mov ds,es" will be assembled as
 4063 "push es" and "pop ds". In all other cases the standard "mov" instruction will
 4064 be used. The syntax of this "mov" can be extended further by defining next
 4065 macroinstruction of that name, which will use the previous macroinstruction:
 4066 
 4067     macro mov op1,op2,op3
 4068      {
 4069       if op3 eq
 4070         mov   op1,op2
 4071       else
 4072         mov   op1,op2
 4073         mov   op2,op3
 4074       end if
 4075      }
 4076 
 4077 It allows "mov" instruction to have three operands, but it can still have two
 4078 operands only, because when macroinstruction is given fewer arguments than it
 4079 needs, the rest of arguments will have empty values. When three operands are
 4080 given, this macroinstruction will become two macroinstructions of the previous
 4081 definition, so "mov es,ds,dx" will be assembled as "push ds", "pop es" and
 4082 "mov ds,dx".
 4083   By placing the "*" after the name of argument you can mark the argument as
 4084 required - preprocessor will not allow it to have an empty value. For example 
 4085 the above macroinstruction could be declared as "macro mov op1*,op2*,op3" to 
 4086 make sure that first two arguments will always have to be given some non empty
 4087 values.
 4088   Alternatively, you can provide the default value for argument, by placing
 4089 the "=" followed by value after the name of argument. Then if the argument
 4090 has an empty value provided, the default value will be used instead.
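        As a minimal sketch of the default value (the "fill" macroinstruction
      and its argument names are hypothetical, chosen only for illustration):

          macro fill count,value=0
           {
              times count db value
           }

          fill 4         ; "value" is empty, so "times 4 db 0" is generated
          fill 4,0FFh    ; generates "times 4 db 0FFh"

      Only an empty or omitted argument triggers the default; any non-empty
      value given in the invocation takes precedence.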
 4091   When it's needed to provide macroinstruction with argument that contains
 4092 some commas, such argument should be enclosed between "<" and ">" characters.
 4093 If it contains more than one "<" character, the same number of ">" should be
 4094 used to tell that the value of argument ends.
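        A short sketch of passing an argument that contains commas (the
      "deftable" macroinstruction and the "primes" label are hypothetical):

          macro deftable name,values
           {
              name db values
           }

          deftable primes,<2,3,5,7>    ; generates "primes db 2,3,5,7"

      Without the "<" and ">" characters only "2" would become the value of
      "values", and the remaining values would be treated as additional
      arguments.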
 4095   When the name of the last argument of macroinstruction is followed by "&"
 4096 character, this argument consumes everything up to the end of line, including
 4097 commas.
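        For illustration, assuming a hypothetical "emit" macroinstruction, the
      "&" modifier lets the last argument take the rest of line as it is:

          macro emit data&
           {
              db data
           }

          emit 1,2,3    ; "data" becomes "1,2,3", generating "db 1,2,3"
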
 4098   "purge" directive allows removing the last definition of specified
 4099 macroinstruction. It should be followed by one or more names of
 4100 macroinstructions, separated with commas. If such macroinstruction has not
 4101 been defined, you will not get any error. For example after having the syntax
 4102 of "mov" extended with the macroinstructions defined above, you can disable
 4103 the syntax with three operands again by using "purge mov" directive. Next
 4104 "purge mov" will disable also syntax for two operands being segment registers,
 4105 and all the next such directives will do nothing.
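        The sequence described above can be sketched as follows, assuming that
      both "mov" macroinstructions from the earlier examples are defined:

          mov es,ds,dx    ; three operands are accepted
          purge mov       ; removes the three-operand definition
          mov es,ds       ; still assembled as "push ds" and "pop es"
          purge mov       ; removes the segment register extension
          mov ax,bx       ; the plain "mov" instruction
          purge mov       ; nothing left to remove, no error
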
 4106   If after the "macro" directive you enclose a group of argument declarations
 4107 in square brackets, it will allow giving more values for this group of arguments
 4108 when using that macroinstruction. Any additional argument following the last
 4109 argument of such group will start the new group and will become the first
 4110 argument of it. For this reason after the closing square bracket no more
 4111 argument names can follow. The contents of macroinstruction will be processed for
 4112 each such group of arguments separately. The simplest example is to enclose one
 4113 argument name in square brackets:
 4114 
 4115     macro stoschar [char]
 4116      {
 4117         mov al,char
 4118         stosb
 4119      }
 4120 
 4121 This macroinstruction accepts unlimited number of arguments, and each one
 4122 will be processed into these two instructions separately. For example
 4123 "stoschar 1,2,3" will be assembled as the following instructions:
 4124 
 4125     mov al,1
 4126     stosb
 4127     mov al,2
 4128     stosb
 4129     mov al,3
 4130     stosb
 4131 
 4132   There are some special directives available only inside the definitions of
 4133 macroinstructions. "local" directive defines local names, which will be
 4134 replaced with unique values each time the macroinstruction is used. It should
 4135 be followed by names separated with commas. If the name given as parameter to
 4136 "local" directive begins with a dot or two dots, the unique labels generated
 4137 by each evaluation of macroinstruction will have the same properties.
 4138 This directive is usually needed for the constants or labels that
 4139 macroinstruction defines and uses internally. For example:
 4140 
 4141     macro movstr
 4142      {
 4143         local move
 4144       move:
 4145         lodsb
 4146         stosb
 4147         test al,al
 4148         jnz move
 4149      }
 4150 
 4151 Each time this macroinstruction is used, "move" will become another unique name
 4152 in its instructions, so you will not get an error you normally get when some
 4153 label is defined more than once.
 4154   "forward", "reverse" and "common" directives divide macroinstruction into
 4155 blocks, each one processed after the processing of previous is finished. They
 4156 differ in behavior only if macroinstruction allows multiple groups of
 4157 arguments. Block of instructions that follows "forward" directive is processed
 4158 for each group of arguments, from first to last - exactly like the default
 4159 block (not preceded by any of these directives). Block that follows "reverse"
 4160 directive is processed for each group of arguments in reverse order - from last
 4161 to first. Block that follows "common" directive is processed only once,
 4162 commonly for all groups of arguments. Local name defined in one of the blocks
 4163 is available in all the following blocks when processing the same group of
 4164 arguments as when it was defined, and when it is defined in common block it is
 4165 available in all the following blocks not depending on which group of
 4166 arguments is processed.
 4167   Here is an example of macroinstruction that will create the table of
 4168 addresses to strings followed by these strings:
 4169 
 4170     macro strtbl name,[string]
 4171      {
 4172       common
 4173         label name dword
 4174       forward
 4175         local label
 4176         dd label
 4177       forward
 4178         label db string,0
 4179      }
 4180 
 4181 First argument given to this macroinstruction will become the label for table
 4182 of addresses, next arguments should be the strings. First block is processed
 4183 only once and defines the label, second block for each string declares its
 4184 local name and defines the table entry holding the address to that string.
 4185 Third block defines the data of each string with the corresponding label.
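        As an illustration, an invocation with two strings (the "strings"
      label and the texts are arbitrary) such as:

          strtbl strings,'First','Second'

      is processed roughly into the lines below, where "s1" and "s2" are only
      placeholders standing for the unique names that preprocessor generates
      for the local "label" name:

          label strings dword
          dd s1
          dd s2
          s1 db 'First',0
          s2 db 'Second',0
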
 4186   The directive starting the block in macroinstruction can be followed by the
 4187 first instruction of this block in the same line, like in the following
 4188 example:
 4189 
 4190     macro stdcall proc,[arg]
 4191      {
 4192       reverse push arg
 4193       common call proc
 4194      }
 4195 
 4196 This macroinstruction can be used for calling the procedures using STDCALL
 4197 convention, which has all the arguments pushed on stack in the reverse order. 
 4198 For example "stdcall foo,1,2,3" will be assembled as:
 4199 
 4200     push 3
 4201     push 2
 4202     push 1
 4203     call foo
 4204 
 4205   If some name inside macroinstruction has multiple values (it is either one
 4206 of the arguments enclosed in square brackets or local name defined in the
 4207 block following "forward" or "reverse" directive) and is used in block
 4208 following the "common" directive, it will be replaced with all of its values,
 4209 separated with commas. For example the following macroinstruction will pass
 4210 all of the additional arguments to the previously defined "stdcall"
 4211 macroinstruction:
 4212 
 4213     macro invoke proc,[arg]
 4214      { common stdcall [proc],arg }
 4215 
 4216 It can be used to call indirectly (by the pointer stored in memory) the
 4217 procedure using STDCALL convention.
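        For example, assuming "handler" is a hypothetical variable holding the
      address of a procedure, "invoke handler,1,2" first becomes
      "stdcall [handler],1,2" and is then assembled as:

          push 2
          push 1
          call [handler]
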
 4218   Inside macroinstruction also special operator "#" can be used. This
 4219 operator causes two names to be concatenated into one name. It can be useful,
 4220 because it's done after the arguments and local names are replaced with their
 4221 values. The following macroinstruction will generate the conditional jump
 4222 according to the "cond" argument:
 4223 
 4224     macro jif op1,cond,op2,label
 4225      {
 4226         cmp op1,op2
 4227         j#cond label
 4228      }
 4229 
 4230 For example "jif ax,ae,10h,exit" will be assembled as "cmp ax,10h" and
 4231 "jae exit" instructions.
 4232   The "#" operator can be also used to concatenate two quoted strings into one.
 4233 Also conversion of name into a quoted string is possible, with the "`" operator,
 4234 which likewise can be used inside the macroinstruction. It converts the name
 4235 that follows it into a quoted string - but note, that when it is followed by
 4236 a macro argument which is being replaced with value containing more than one
 4237 symbol, only the first of them will be converted, as the "`" operator converts
 4238 only one symbol that immediately follows it. Here's an example of utilizing
 4239 those two features:
 4240 
 4241     macro label name
 4242      {
 4243         label name
 4244         if ~ used name
 4245           display `name # " is defined but not used.",13,10
 4246         end if
 4247      }
 4248 
 4249 When label defined with such macro is not used in the source, macro will warn
 4250 you with the message, informing to which label it applies.
 4251   To make macroinstruction behave differently when some of the arguments are
 4252 of some special type, for example a quoted string, you can use "eqtype"
 4253 comparison operator. Here's an example of utilizing it to distinguish a
 4254 quoted string from another argument:
 4255 
 4256     macro message arg
 4257      {
 4258       if arg eqtype ""
 4259         local str
 4260         jmp   @f
 4261         str   db arg,0Dh,0Ah,24h
 4262         @@:
 4263         mov   dx,str
 4264       else
 4265         mov   dx,arg
 4266       end if
 4267         mov   ah,9
 4268         int   21h
 4269      }
 4270 
 4271 The above macro is designed for displaying messages in DOS programs. When the
 4272 argument of this macro is some number, label, or variable, the string from
 4273 that address is displayed, but when the argument is a quoted string, the
 4274 created code will display that string followed by the carriage return and
 4275 line feed.
 4276   It is also possible to put a declaration of macroinstruction inside another
 4277 macroinstruction, so one macro can define another, but there is a problem
 4278 with such definitions caused by the fact, that "}" character cannot occur
 4279 inside the macroinstruction, as it always means the end of definition. To
 4280 overcome this problem, the escaping of symbols inside macroinstruction can be
 4281 used. This is done by placing one or more backslashes in front of any other
 4282 symbol (even the special character). Preprocessor sees such sequence as a
 4283 single symbol, but each time it meets such symbol during the macroinstruction
 4284 processing, it cuts the backslash character from the front of it. For example
 4285 "\{" is treated as single symbol, but during processing of the macroinstruction
 4286 it becomes the "{" symbol. This allows to put one definition of
 4287 macroinstruction inside another:
 4288 
 4289     macro ext instr
 4290      {
 4291       macro instr op1,op2,op3
 4292        \{
 4293         if op3 eq
 4294           instr op1,op2
 4295         else
 4296           instr op1,op2
 4297           instr op2,op3
 4298         end if
 4299        \}
 4300      }
 4301 
 4302     ext add
 4303     ext sub
 4304 
 4305 The macro "ext" is defined correctly, but when it is used, the "\{" and "\}"
 4306 become the "{" and "}" symbols. So when the "ext add" is processed, the
 4307 contents of macro becomes valid definition of a macroinstruction and this way
 4308 the "add" macro becomes defined. In the same way "ext sub" defines the "sub"
 4309 macro. The use of "\{" symbol wasn't really necessary here, but is done this
 4310 way to make the definition more clear.
 4311   If some directives specific to macroinstructions, like "local" or "common"
 4312 are needed inside some macro embedded this way, they can be escaped in the same
 4313 way. Escaping the symbol with more than one backslash is also allowed, which
 4314 allows multiple levels of nesting the macroinstruction definitions.
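        As a sketch of escaping a directive specific to macroinstructions (the
      names "defwait", "waitready" and "spin" are hypothetical), the "local"
      directive below is escaped so that it belongs to the inner definition:

          macro defwait name
           {
             macro name
              \{
                \local spin
              spin:
                in    al,dx
                test  al,1
                jz    spin
              \}
           }

          defwait waitready
          waitready
          waitready

      Because "\local" becomes "local" only when "defwait" is processed, each
      use of "waitready" gets its own unique label. Without the backslash the
      "local" directive would be interpreted by "defwait" itself, so all uses
      of "waitready" would share one label and the second use would define it
      twice.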
 4315   Another technique for defining one macroinstruction by another is to
 4316 use the "fix" directive, which becomes useful when some macroinstruction only
 4317 begins the definition of another one, without closing it. For example:
 4318 
 4319     macro tmacro [params]
 4320      {
 4321       common macro params {
 4322      }
 4323 
 4324     MACRO fix tmacro
 4325     ENDM fix }
 4326 
 4327 defines an alternative syntax for defining macroinstructions, which looks like:
 4328 
 4329     MACRO stoschar char
 4330         mov al,char
 4331         stosb
 4332     ENDM
 4333 
 4334 Note that symbol that has such customized definition must be defined with "fix"
 4335 directive, because only the prioritized symbolic constants are processed before
 4336 the preprocessor looks for the "}" character while defining the macro. This
 4337 might be a problem if one needed to perform some additional tasks at the end
 4338 of such definition, but there is one more feature which helps in such cases.
 4339 Namely it is possible to put any directive, instruction or macroinstruction
 4340 just after the "}" character that ends the macroinstruction and it will be
 4341 processed in the same way as if it was put in the next line.
 4342   The "postpone" directive can be used to define a special type of
 4343 macroinstruction that has no name or arguments and will get automatically
 4344 called when the preprocessor reaches the end of source:
 4345 
 4346     postpone
 4347      {
 4348       code_size = $
 4349      }
 4350 
 4351 It is a very simplified kind of macroinstruction and it simply delegates a
 4352 block of instructions to be put at the end.
 4353 
 4354 
 4355 2.3.4  Structures
 4356 
 4357 "struc" directive is a special variant of "macro" directive that is used to
 4358 define data structures. Macroinstruction defined using the "struc" directive
 4359 must be preceded by a label (like the data definition directive) when it's
 4360 used. This label will be also attached at the beginning of every name starting
 4361 with dot in the contents of macroinstruction. The macroinstruction defined
 4362 using the "struc" directive can have the same name as some other
 4363 macroinstruction defined using the "macro" directive, structure
 4364 macroinstruction will not prevent the standard macroinstruction from being 
 4365 processed when there is no label before it and vice versa. All the rules and 
 4366 features concerning standard macroinstructions apply to structure 
 4367 macroinstructions.
 4368   Here is the sample of structure macroinstruction:
 4369 
 4370     struc point x,y
 4371      {
 4372         .x dw x
 4373         .y dw y
 4374      }
 4375 
 4376 For example "my point 7,11" will define structure labeled "my", consisting of
 4377 two variables: "my.x" with value 7 and "my.y" with value 11.
 4378   If somewhere inside the definition of structure the name consisting of a
 4379 single dot is found, it is replaced by the name of the label for the given
 4380 instance of structure and this label will not be defined automatically in
 4381 such case, allowing to completely customize the definition. The following
 4382 example utilizes this feature to extend the data definition directive "db"
 4383 with ability to calculate the size of defined data:
 4384 
 4385     struc db [data]
 4386      {
 4387        common
 4388         . db data
 4389         .size = $ - .
 4390      }
 4391 
 4392 With such definition "msg db 'Hello!',13,10" will define also "msg.size"
 4393 constant, equal to the size of defined data in bytes.
 4394   Defining data structures addressed by registers or absolute values should be
 4395 done using the "virtual" directive with structure macroinstruction
 4396 (see 2.2.4).
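        A minimal sketch of this approach, assuming the "point" structure
      defined earlier in this section and a hypothetical instance name "pt":

          virtual at ebx
            pt point ?,?
          end virtual

          mov ax,[pt.x]    ; the same as "mov ax,[ebx]"
          mov dx,[pt.y]    ; the same as "mov dx,[ebx+2]"

      No data is generated by the block itself; it only defines the labels
      relative to the address kept in the register.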
 4397   "restruc" directive removes the last definition of the structure, just like
 4398 "purge" does with macroinstructions and "restore" with symbolic constants.
 4399 It also has the same syntax - should be followed by one or more names of
 4400 structure macroinstructions, separated with commas.
 4401 
 4402 
 4403 2.3.5  Repeating macroinstructions
 4404 
 4405 The "rept" directive is a special kind of macroinstruction, which makes given
 4406 amount of duplicates of the block enclosed with braces. The basic syntax is
 4407 "rept" directive followed by number and then block of source enclosed between
 4408 the "{" and "}" characters. The simplest example:
 4409 
 4410     rept 5 { in al,dx }
 4411 
 4412 will make five duplicates of the "in al,dx" line. The block of instructions
 4413 is defined in the same way as for the standard macroinstruction and any
 4414 special operators and directives which can be used only inside
 4415 macroinstructions are also allowed here. When the given count is zero, the
 4416 block is simply skipped, as if you defined macroinstruction but never used
 4417 it. The number of repetitions can be followed by the name of counter symbol,
 4418 which will get replaced symbolically with the number of duplicate currently
 4419 generated. So this:
 4420 
 4421     rept 3 counter
 4422      {
 4423         byte#counter db counter
 4424      }
 4425 
 4426 will generate lines:
 4427 
 4428     byte1 db 1
 4429     byte2 db 2
 4430     byte3 db 3
 4431 
 4432 The repetition mechanism applied to "rept" blocks is the same as the one used
 4433 to process multiple groups of arguments for macroinstructions, so directives
 4434 like "forward", "common" and "reverse" can be used in their usual meaning.
 4435 Thus such macroinstruction:
 4436 
 4437     rept 7 num { reverse display `num }
 4438 
 4439 will display digits from 7 to 1 as text. The "local" directive behaves in the
 4440 same way as inside macroinstruction with multiple groups of arguments, so:
 4441 
 4442     rept 21
 4443      {
 4444        local label
 4445        label: loop label
 4446      }
 4447 
 4448 will generate unique label for each duplicate.
 4449   The counter symbol by default counts from 1, but you can declare different
 4450 base value by placing the number preceded by colon immediately after the name
 4451 of counter. For example:
 4452 
 4453     rept 8 n:0 { pxor xmm#n,xmm#n }
 4454 
 4455 will generate code which will clear the contents of eight SSE registers.
 4456 You can define multiple counters separated with commas, and each one can have
 4457 different base.
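        For illustration, a sketch with two counters (the names "i" and "j"
      are arbitrary):

          rept 3 i:0,j:10 { db i,j }

      will generate lines:

          db 0,10
          db 1,11
          db 2,12
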
 4458   The number of repetitions and the base values for counters can be specified
 4459 using the numerical expressions with operator rules identical as in the case
 4460 of assembler. However each value used in such expression must either be a
 4461 directly specified number, or a symbolic constant with value also being an
 4462 expression that can be calculated by preprocessor (in such case the value
 4463 of expression associated with symbolic constant is calculated first, and then
 4464 substituted into the outer expression in place of that constant). If you need
 4465 repetitions based on values that can only be calculated at assembly time, use
 4466 one of the code repeating directives that are processed by assembler, see
 4467 section 2.2.3.
 4468   The "irp" directive iterates the single argument through the given list of
 4469 parameters. The syntax is "irp" followed by the argument name, then the comma
 4470 and then the list of parameters. The parameters are specified in the same
 4471 way like in the invocation of standard macroinstruction, so they have to be
 4472 separated with commas and each one can be enclosed with the "<" and ">"
 4473 characters. Also the name of argument may be followed by "*" to mark that it
 4474 cannot get an empty value. Such block:
 4475 
 4476    irp value, 2,3,5
 4477     { db value }
 4478 
 4479 will generate lines:
 4480 
 4481    db 2
 4482    db 3
 4483    db 5
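
      Parameters containing commas can be enclosed with "<" and ">", as in
      this sketch (the argument name and the values are arbitrary):

         irp pair, <1,2>,<3,4>
          { dw pair }

      which will generate lines:

         dw 1,2
         dw 3,4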
 4484 
 4485 The "irps" directive iterates through the given list of symbols, it should
 4486 be followed by the argument name, then the comma and then the sequence of any
 4487 symbols. Each symbol in this sequence, no matter whether it is the name
 4488 symbol, symbol character or quoted string, becomes an argument value for one
 4489 iteration. If there are no symbols following the comma, no iteration is done
 4490 at all. This example:
 4491 
 4492    irps reg, al bx ecx
 4493     { xor reg,reg }
 4494 
 4495 will generate lines:
 4496 
 4497    xor al,al
 4498    xor bx,bx
 4499    xor ecx,ecx
 4500 
 4501 The "irpv" directive iterates through all of the values that were assigned to
 4502 the given symbolic variable. It should be followed by the argument name and
 4503 the name of symbolic variable, separated with comma. When the symbolic
 4504 variable is treated with "restore" directive to remove its latest value, that
 4505 value is removed from the list of values accessed by "irpv". But any
 4506 modifications made to that list during the iterations performed by "irpv" (by
 4507 either defining a new value for symbolic variable, or destroying the value
 4508 with "restore" directive) do not affect the operation performed by this
 4509 directive - the list that gets iterated reflects the state of symbolic
 4510 variable at the time when "irpv" directive was encountered. For example this
 4511 snippet restores a symbolic variable called "d" to its initial state, before
 4512 any values were assigned to it:
 4513 
 4514    irpv value, d
 4515     { restore d }
 4516 
 4517 It simply generates as many copies of "restore" directive as there are
 4518 values to remove.
 4519   The blocks defined by the "irp", "irps" and "irpv" directives are also
 4520 processed in the same way as any macroinstructions, so operators and
 4521 directives specific to macroinstructions may be freely used also in this case.
 4522 
 4523 
 4524 2.3.6  Conditional preprocessing
 4525 
 4526 "match" directive causes some block of source to be preprocessed and passed
 4527 to assembler only when the given sequence of symbols matches the specified
 4528 pattern. The pattern comes first, ended with comma, then the symbols that have
 4529 to be matched with the pattern, and finally the block of source, enclosed
 4530 within braces as macroinstruction.
 4531   There are a few rules for building the expression for matching, first is
 4532 that any of symbol characters and any quoted string should be matched exactly
 4533 as is. In this example:
 4534 
 4535     match +,+ { include 'first.inc' }
 4536     match +,- { include 'second.inc' }
 4537 
 4538 the first file will get included, since "+" after comma matches the "+" in
 4539 pattern, and the second file will not be included, since there is no match.
 4540   To match any other symbol literally, it has to be preceded by "=" character
 4541 in the pattern. Also to match the "=" character itself, or the comma, the
 4542 "==" and "=," constructions have to be used. For example the "=a==" pattern
 4543 will match the "a=" sequence.
 4544   If some name symbol is placed in the pattern, it matches any sequence
 4545 consisting of at least one symbol and then this name is replaced with the
 4546 matched sequence everywhere inside the following block, analogously to the
 4547 parameters of macroinstruction. For instance:
 4548 
 4549     match a-b, 0-7
 4550      { dw a,b-a }
 4551 
 4552 will generate the "dw 0,7-0" instruction. Each name is always matched with
 4553 as few symbols as possible, leaving the rest for the following ones, so in
 4554 this case:
 4555 
 4556     match a b, 1+2+3 { db a }
 4557 
 4558 the "a" name will match the "1" symbol, leaving the "+2+3" sequence to be
 4559 matched with "b". But in this case:
 4560 
 4561     match a b, 1 { db a }
 4562 
 4563 there will be nothing left for "b" to match, so the block will not get 
 4564 processed at all.
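        A small sketch combining name symbols with a literal "=" character
      (the "count" symbol is a hypothetical name, assumed not to be defined as
      a symbolic constant beforehand):

          match name==value, count=5 { name equ value }

      Here "name" matches "count", the "==" matches the "=" character and
      "value" matches "5", so the block generates the "count equ 5" line.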
 4565   The block of source defined by match is processed in the same way as any
 4566 macroinstruction, so any operators specific to macroinstructions can be used
 4567 also in this case.
 4568   What makes "match" directive more useful is the fact, that it replaces the
 4569 symbolic constants with their values in the matched sequence of symbols (that
 4570 is everywhere after comma up to the beginning of the source block) before
 4571 performing the match. Thanks to this it can be used for example to process
 4572 some block of source under the condition that some symbolic constant has the
 4573 given value, like:
 4574 
 4575     match =TRUE, DEBUG { include 'debug.inc' }
 4576 
 4577 which will include the file only when the symbolic constant "DEBUG" was
 4578 defined with value "TRUE".
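        As a sketch of such a switch (the displayed texts are arbitrary, and
      "TRUE" and "FALSE" are assumed not to be symbolic constants themselves):

          DEBUG equ TRUE

          match =TRUE, DEBUG { display 'debug build',13,10 }
          match =FALSE, DEBUG { display 'release build',13,10 }

      Only the first block is processed here. If "DEBUG" was not defined at
      all, the "DEBUG" symbol itself would be compared with the patterns and
      neither block would be processed.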
 4579 
 4580 
 4581 2.3.7  Order of processing
 4582 
 4583 When combining various features of the preprocessor, it's important to know
 4584 the order in which they are processed. As it was already noted, the highest
 4585 priority has the "fix" directive and the replacements defined with it. This
 4586 is done completely before doing any other preprocessing, therefore this
 4587 piece of source:
 4588 
 4589     V fix {
 4590       macro empty
 4591        V
 4592     V fix }
 4593        V
 4594 
 4595 becomes a valid definition of an empty macroinstruction. It can be interpreted
 4596 that the "fix" directive and prioritized symbolic constants are processed in
 4597 a separate stage, and all other preprocessing is done after on the resulting
 4598 source.
 4599   The standard preprocessing that comes after, on each line begins with
 4600 recognition of the first symbol. It starts with checking for the preprocessor
 4601 directives, and when none of them is detected, preprocessor checks whether the
 4602 first symbol is macroinstruction. If no macroinstruction is found, it moves
 4603 to the second symbol of line, and again begins with checking for directives,
 4604 which in this case is only the "equ" directive, as this is the only one that
 4605 occurs as the second symbol in line. If there is no directive, the second
 4606 symbol is checked for the case of structure macroinstruction and when none
 4607 of those checks gives the positive result, the symbolic constants are replaced
 4608 with their values and such line is passed to the assembler.
 4609   To see it in an example, assume that a macroinstruction called "foo" and
 4610 a structure macroinstruction called "bar" are defined. Those lines:
 4611 
 4612     foo equ
 4613     foo bar
 4614 
 4615 would be then both interpreted as invocations of macroinstruction "foo", since
 4616 the meaning of the first symbol overrides the meaning of second one.
 4617   When the macroinstruction generates the new lines from its definition block,
 4618 in every line it first scans for macroinstruction directives, and interprets
 4619 them accordingly. All the other content in the definition block is used to
 4620 brew the new lines, replacing the macroinstruction parameters with their values
 4621 and then processing the symbol escaping and "#" and "`" operators. The
 4622 conversion operator has the higher priority than concatenation and if any of
 4623 them operates on the escaped symbol, the escaping is cancelled before finishing
 4624 the operation. After this is completed, the newly generated line goes through
 4625 the standard preprocessing, as described above.
 4626   Though the symbolic constants are usually only replaced in the lines, where
 4627 no preprocessor directives nor macroinstructions have been found, there are some
 4628 special cases where those replacements are performed in the parts of lines
 4629 containing directives. First one is the definition of symbolic constant, where
 4630 the replacements are done everywhere after the "equ" keyword and the resulting
 4631 value is then assigned to the new constant (see 2.3.2). The second such case
 4632 is the "match" directive, where the replacements are done in the symbols
 4633 following comma before matching them with pattern. These features can be used
 4634 for example to maintain the lists, like this set of definitions:
 4635 
 4636     list equ
 4637 
 4638     macro append item
 4639      {
 4640        match any, list \{ list equ list,item \}
 4641        match , list \{ list equ item \}
 4642      }
 4643 
 4644 The "list" constant is here initialized with empty value, and the "append"
 4645 macroinstruction can be used to add the new items into this list, separating
 4646 them with commas. The first match in this macroinstruction occurs only when
 4647 the value of list is not empty (see 2.3.6), in such case the new value for the
 4648 list is the previous one with the comma and the new item appended at the end.
 4649 The second match happens only when the list is still empty, and in such case
 4650 the list is defined to contain just the new item. So starting with the empty
 4651 list, the "append 1" would define "list equ 1" and the "append 2" following it
 4652 would define "list equ 1,2". One might then need to use this list as the
 4653 parameters to some macroinstruction. But it cannot be done directly - if "foo"
 4654 is the macroinstruction, then "foo list" would just pass the "list" symbol
 4655 as a parameter to macro, since symbolic constants are not unrolled at this
 4656 stage. For this purpose again "match" directive comes in handy:
 4657 
 4658     match params, list { foo params }
 4659 
 4660 The value of "list", if it's not empty, matches the "params" keyword, which is
 4661 then replaced with matched value when generating the new lines defined by the
 4662 block enclosed with braces. So if the "list" had value "1,2", the above line
 4663 would generate the line containing "foo 1,2", which would then go through the
 4664 standard preprocessing.
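        Put together, and assuming the "list" constant and the "append"
      macroinstruction defined above, the whole technique can be sketched as
      (the appended items and the use of "db" are arbitrary):

          append 1
          append 2
          append 3

          match params, list { db params }    ; generates "db 1,2,3"
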
 4665   The other special case is in the parameters of "rept" directive. The amount
 4666 of repetitions and the base value for counter can be specified using
 4667 numerical expressions, and if there is a symbolic constant with non-numerical
 4668 name used in such an expression, preprocessor tries to evaluate its value as 
 4669 a numerical expression and if succeeds, it replaces the symbolic constant with 
 4670 the result of that calculation and continues to evaluate the primary 
 4671 expression. If the expression inside that symbolic constants also contains 
 4672 some symbolic constants, preprocessor will try to calculate all the needed 
 4673 values recursively. 
  This makes it possible to perform some calculations at the time of
preprocessing, as long as all the values used are the numbers known at the
preprocessing stage. A single repetition with "rept" can be used for the sole
purpose of calculating some value, like in this example:

    define a b+4
    define b 3
    rept 1 result:a*b+2 { define c result }

To compute the base value for "result" counter, preprocessor replaces the "b"
with its value and recursively calculates the value of "a", obtaining 7 as
the result, then it calculates the main expression with the result being 23.
The "c" then gets defined with the first value of counter (because the block
is processed just one time), which is the result of the computation, so the
value of "c" is simply the "23" symbol. Note that if "b" is later redefined
with some other numerical value, the next time an expression containing "a" is
calculated, the value of "a" will reflect the new value of "b", because the
symbolic constant contains just the text of the expression.
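  The effect of such a redefinition can be sketched by extending the example
above (the name "d" and the new value of "b" are chosen here only for
illustration):

    define a b+4
    define b 3
    rept 1 result:a*b+2 { define c result }    ; a is 7, c becomes 23
    define b 5
    rept 1 result:a*b+2 { define d result }    ; a is 9, d becomes 47

The value of "c" stays 23, because it was defined with the already calculated
number and not with the text of the expression.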
  There is one more special case - when preprocessor goes to checking the
second symbol in the line and it happens to be the colon character (which is
then interpreted by assembler as definition of a label), it stops in this
place and finishes the preprocessing of the first symbol (so if it's the
symbolic constant it gets unrolled) and if it still appears to be the label,
it performs the standard preprocessing starting from the place after the
label. This allows placing preprocessor directives and macroinstructions
after the labels, analogously to the instructions and directives processed
by assembler, like:

    start: include 'start.inc'

However, if the label becomes broken during preprocessing (for example when
it is the symbolic constant with empty value), only replacing of the symbolic
constants is continued for the rest of line.
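  The same applies to macroinstructions. For instance, assuming a simple
hypothetical macroinstruction named "clear", it can be invoked right after
a label:

    macro clear reg { xor reg,reg }

    start: clear eax    ; assembled like: start: xor eax,eax

The label itself is left for the assembler to define, and only the rest of
the line is processed as an invocation of the macroinstruction.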
  It should be remembered that the jobs performed by preprocessor are the
preliminary operations on the text symbols, that are done in a simple
single pass before the main process of assembly. The text that is the
result of preprocessing is passed to assembler, and it then does its
multiple passes on it. Thus the control directives, which are recognized and
processed only by the assembler - as they are dependent on the numerical
values that may even vary between passes - are not recognized in any way by
the preprocessor and have no effect on the preprocessing. Consider this
example source:

    if 0
    a = 1
    b equ 2
    end if
    dd b

When it is preprocessed, the only directive that is recognized by the
preprocessor is the "equ", which defines symbolic constant "b", so later
in the source the "b" symbol is replaced with the value "2". Except for this
replacement, the other lines are passed unchanged to the assembler. So
after preprocessing the above source becomes:

    if 0
    a = 1
    end if
    dd 2

Now when assembler processes it, the condition for the "if" is false, and
the "a" constant doesn't get defined. However symbolic constant "b" was
processed normally, even though its definition was put just next to the one
of "a". So because of the possible confusion you should be very careful
every time when mixing the features of preprocessor and assembler - in such
cases it is important to realize what the source will become after the
preprocessing, and thus what the assembler will see and do its multiple passes
on.


2.4  Formatter directives

These directives are actually also a kind of control directives, with the
purpose of controlling the format of generated code.
  "format" directive followed by the format identifier selects the output
format. This directive should be put at the beginning of the source. It can
always be followed in the same line by the "as" keyword and a quoted string
specifying the default file extension for the output file. Unless the output
file name was specified from the command line, assembler will use this
extension when generating the output file.
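  A minimal sketch of such a declaration (the 'img' extension is an arbitrary
choice made for this example):

    format binary as 'img'

Assembling a source file named "boot.asm" containing this line, with no output
name given in the command line, should then produce the file "boot.img".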
  "use16" and "use32" directives force the assembler to generate 16-bit or
32-bit code, overriding the default setting for selected output format.
"use64" enables generating the code for the long mode of x86-64 processors.
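  These settings can be changed more than once within the same source, for
example when a flat binary contains both real mode and protected mode code.
The fragment below is only an illustrative sketch:

    use16
        mov ax,bx       ; 16-bit code, no operand size prefix is needed
    use32
        mov ax,bx       ; 32-bit code, the 66h operand size prefix is added

The same instruction thus produces different machine code depending on the
setting that is currently in effect.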
  Default output format is a flat binary file, which can also be selected by
using "format binary" directive. When this format is chosen, special symbol
"$%" can be used to get an offset within the output and "$%%" can be used to
get the actual offset in the output file, omitting any undefined data that
would be discarded if the output was ended at this point. Additionally, for
this format "load" and "store" directives allow access to any data within the
already generated output by following "from" or "at" keyword with ":"
character and then an expression specifying the offset within the output.
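  A short sketch of these features, written under the assumption that the
source contains nothing else (the names "first" and "size" are chosen only
for this example):

    format binary

    db 'abcd'
    load first byte from :0     ; reads the byte at offset 0 of the output
    store byte 'A' at :0        ; the output now begins with 'Abcd'
    size = $%                   ; offset within the output, here 4

The colon distinguishes an offset within the output from an ordinary address
in the current addressing space.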
  The different output formats are described below, together with the
directives specific to each of them.


2.4.1  MZ executable

To select the MZ output format, use "format MZ" directive. The default code
setting for this format is 16-bit.
  "segment" directive defines a new segment. It should be followed by a
label, whose value will be the number of defined segment, and optionally the
"use16" or "use32" word can follow to specify whether code in this segment
should be 16-bit or 32-bit. The origin of segment is aligned to paragraph
(16 bytes). All the labels defined then will have values relative to the
beginning of this segment.
  "entry" directive sets the entry point for MZ executable, it should be
followed by the far address (name of segment, colon and the offset inside
segment) of desired entry point.
  "stack" directive sets up the stack for MZ executable. It can be followed by
numerical expression specifying the size of stack to be created automatically
or by the far address of initial stack frame when you want to set up the stack
manually. When no stack is defined, the stack of default size 4096 bytes will
be created.
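  The directives described so far are enough for a complete small program. The
sketch below follows the usual DOS conventions (function 9 of interrupt 21h
prints a string ended with the "$" character, function 4Ch ends the program),
and the segment and label names are chosen only for this example:

    format MZ
    entry main:start            ; far address of the entry point
    stack 100h                  ; stack of 256 bytes created automatically

    segment main
    start:
        mov ax,text             ; "text" is the number of the segment
        mov ds,ax
        mov dx,hello
        mov ah,9
        int 21h
        mov ax,4C00h
        int 21h

    segment text
    hello db 'Hello world!',24h

The "hello" label has the value 0 here, since it is relative to the beginning
of the "text" segment.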
  "heap" directive should be followed by a 16-bit value defining maximum size
of additional heap in paragraphs (this is heap in addition to stack and
undefined data). Use "heap 0" to always allocate only the memory that the
program really needs. Default size of heap is 65535.


2.4.2  Portable Executable

To select the Portable Executable output format, use "format PE" directive. It
can be followed by additional format settings: first the target subsystem
setting, which can be "console" or "GUI" for Windows applications, "native"
for Windows drivers, "EFI", "EFIboot" or "EFIruntime" for the UEFI; it may be
followed by the minimum version of system that the executable is targeted to
(specified in form of floating-point value). Optional "DLL" and "WDM" keywords
mark the output file as a dynamic link library and WDM driver respectively,
the "large" keyword marks the executable as able to handle addresses larger
than 2 GB and the "NX" keyword signals that the executable conforms to the
restriction of not executing code residing in non-executable sections.
  After those settings can follow the "at" operator and a numerical expression
specifying the base of PE image, and then optionally the "on" operator
followed by a quoted string containing the name of a file to be used as
a custom MZ stub for PE program (when specified file is not a MZ executable,
it is treated as a flat binary executable file and converted into MZ format).
The default code setting for this format is 32-bit. An example of fully
featured PE format declaration:

    format PE GUI 4.0 DLL at 7000000h on 'stub.exe'

  To create PE file for the x86-64 architecture, use "PE64" keyword instead of
"PE" in the format declaration, in such case the long mode code is generated
by default.
  "section" directive defines a new section, it should be followed by quoted
string defining the name of section, then one or more section flags can
follow. Available flags are: "code", "data", "readable", "writeable",
"executable", "shareable", "discardable", "notpageable". The origin of section
is aligned to page (4096 bytes). Example declaration of PE section:

    section '.text' code readable executable

Along with flags also one of the special PE data identifiers can be specified
to mark the whole section as a special data, possible identifiers are
"export", "import", "resource" and "fixups". If the section is marked to
contain fixups, they are generated automatically and no more data needs to be
defined in this section. Also resource data can be generated automatically
from the resource file, it can be achieved by writing the "from" operator and
quoted file name after the "resource" identifier. Below are the examples of
sections containing some special PE data:

    section '.reloc' data readable discardable fixups
    section '.rsrc' data readable resource from 'my.res'

  "entry" directive sets the entry point for Portable Executable, the value of
entry point should follow.
  "stack" directive sets up the size of stack for Portable Executable, value
of stack reserve size should follow, optionally value of stack commit
separated with comma can follow. When stack is not defined, it's set by
default to size of 4096 bytes.
  "heap" directive chooses the size of heap for Portable Executable, value of
heap reserve size should follow, optionally value of heap commit separated
with comma can follow. When no heap is defined, it is set by default to size
of 65536 bytes, when size of heap commit is unspecified, it is by default set
to zero.
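  Put together, the beginning of a source for a console program might look
like the sketch below (all the sizes are arbitrary values chosen for this
example, and "start" is assumed to be a label defined later in the source):

    format PE console
    entry start
    stack 100000h,4000h     ; stack reserve, stack commit
    heap 200000h            ; heap reserve, heap commit defaults to zero

These directives only set up values stored in the headers of the executable,
they do not generate any code or data by themselves.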
  "data" directive begins the definition of special PE data, it should be
followed by one of the data identifiers ("export", "import", "resource" or
"fixups") or by the number of data entry in PE header. The data should be
defined in next lines, ended with "end data" directive. When fixups data
definition is chosen, they are generated automatically and no more data needs
to be defined there. The same applies to the resource data when the "resource"
identifier is followed by "from" operator and quoted file name - in such case
data is taken from the given resource file.
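  As a sketch, an import table with a single function can be defined by hand
inside an ordinary data section. The layout of the table follows the PE format
itself and is not specific to flat assembler, the label names are chosen only
for this example, and the "rva" operator used here gives the offsets relative
to the base of PE image:

    section '.data' data readable writeable

      data import
        dd 0,0,0,rva kernel_name,rva kernel_table
        dd 0,0,0,0,0
      kernel_table:
        ExitProcess dd rva _ExitProcess
        dd 0
      kernel_name db 'KERNEL32.DLL',0
      _ExitProcess dw 0
        db 'ExitProcess',0
      end data

The program can then call the imported function with "call [ExitProcess]".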
  The "rva" operator can be used inside the numerical expressions to obtain
the RVA of the item addressed by the value it is applied to, that is the
offset relative to the base of PE image.


2.4.3  Common Object File Format

To select Common Object File Format, use "format COFF" or "format MS COFF"
directive, depending on whether you want to create classic (DJGPP) or
Microsoft's variant of COFF file. The default code setting for this format is
32-bit. To create the file in Microsoft's COFF format for the x86-64
architecture, use "format MS64 COFF" setting, in such case long mode code is
generated by default.
  "section" directive defines a new section, it should be followed by quoted
string defining the name of section, then one or more section flags can
follow. Section flags available for both COFF variants are "code" and "data",
while flags "readable", "writeable", "executable", "shareable", "discardable",
"notpageable", "linkremove" and "linkinfo" are available only with Microsoft's
COFF variant.
  By default section is aligned to double word (four bytes), in case of
Microsoft COFF variant other alignment can be specified by providing the
"align" operator followed by alignment value (any power of two up to 8192)
among the section flags.
  "extrn" directive defines the external symbol, it should be followed by the
name of symbol and optionally the size operator specifying the size of data
labeled by this symbol. The name of symbol can be also preceded by quoted
string containing name of the external symbol and the "as" operator.
Some example declarations of external symbols:

    extrn exit
    extrn '__imp__MessageBoxA@16' as MessageBox:dword

  "public" directive declares the existing symbol as public, it should be
followed by the name of symbol, optionally it can be followed by the "as"
operator and the quoted string containing name under which symbol should be
available as public. Some examples of public symbols declarations:

    public main
    public start as '_start'

Additionally, with COFF format it's possible to specify exported symbol as
static, it's done by preceding the name of symbol with the "static" keyword.
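  A sketch of an object file using these directives, intended to be linked
with a C runtime library for 32-bit Windows (the contents of the sections and
all the names other than those of the directives are chosen only for this
example, and the decorated names assume the usual C naming convention):

    format MS COFF

    extrn '_puts' as puts
    public main as '_main'

    section '.text' code readable executable align 16

    main:
        push message
        call puts
        add esp,4
        xor eax,eax
        ret

    section '.data' data readable writeable

    message db 'Hello world!',0

Inside the source the symbols are referred to as "puts" and "main", while the
linker sees them under the names "_puts" and "_main".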
  When using the Microsoft's COFF format, the "rva" operator can be used
inside the numerical expressions to obtain the RVA of the item addressed by
the value it is applied to.


2.4.4  Executable and Linkable Format

To select ELF output format, use "format ELF" directive. The default code
setting for this format is 32-bit. To create ELF file for the x86-64
architecture, use "format ELF64" directive, in such case the long mode code is
generated by default.
  "section" directive defines a new section, it should be followed by quoted
string defining the name of section, then can follow one or both of the
"executable" and "writeable" flags, and optionally also the "align" operator
followed by the number specifying the alignment of section (it has to be
a power of two). If no alignment is specified, the default value is used,
which is 4 or 8, depending on which format variant has been chosen.
  "extrn" and "public" directives have the same meaning and syntax as when the
COFF output format is selected (described in previous section).
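  For example, a 32-bit object file to be linked with the C library might be
sketched as follows (the names and the alignment value are chosen only for
this example):

    format ELF

    extrn puts
    public main

    section '.text' executable align 16

    main:
        push message
        call puts
        add esp,4
        xor eax,eax
        ret

    section '.data' writeable

    message db 'Hello world!',0

Such a file is not executable by itself, it has to be processed by a linker
first.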
  The "rva" operator can be used also in the case of this format (however not
when target architecture is x86-64), it converts the address into the offset
relative to the GOT table, so it may be useful to create position-independent
code. There's also a special "plt" operator, which allows calling the external
functions through the Procedure Linkage Table. You can even create an alias
for external function that will make it always be called through PLT, with
the code like:

    extrn 'printf' as _printf
    printf = PLT _printf

  To create executable file, follow the format choice directive with the
"executable" or "dynamic" keyword and optionally the number specifying
the brand of the target operating system (for example value 3 would mark
the executable for Linux system). With this format selected it is allowed
to use "entry" directive followed by the value to set as entry point of
program. On the other hand it makes "extrn" and "public" directives
unavailable, and instead of "section" there should be the "segment" directive
used, followed by one or more segment permission flags and optionally a marker
of special ELF executable segment, which can be "interpreter", "dynamic",
"note", "gnuehframe", "gnustack" or "gnurelro". Available permission flags
are: "readable", "writeable" and "executable". The origin of a non-special
segment is aligned to page (4096 bytes).
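  A complete executable for 32-bit Linux can be sketched as below. The system
call numbers and the "int 80h" interface belong to Linux and not to flat
assembler, and the label names are chosen only for this example:

    format ELF executable 3
    entry start

    segment readable executable

    start:
        mov eax,4               ; sys_write
        mov ebx,1               ; standard output
        mov ecx,msg
        mov edx,msg_size
        int 80h

        mov eax,1               ; sys_exit
        xor ebx,ebx
        int 80h

    segment readable writeable

    msg db 'Hello world!',0Ah
    msg_size = $-msg

No linker is needed in this case, the file produced by assembler is ready to
be run.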