"Fossies" - the Fresh Open Source Software Archive

Member "gmp-6.2.1/mpn/alpha/README" (14 Nov 2020, 7922 Bytes) of package /linux/misc/gmp-6.2.1.tar.xz:


As a special service "Fossies" has tried to format the requested text file into HTML format (style: standard) with prefixed line numbers.

    1 Copyright 1996, 1997, 1999-2005 Free Software Foundation, Inc.
    2 
    3 This file is part of the GNU MP Library.
    4 
    5 The GNU MP Library is free software; you can redistribute it and/or modify
    6 it under the terms of either:
    7 
    8   * the GNU Lesser General Public License as published by the Free
    9     Software Foundation; either version 3 of the License, or (at your
   10     option) any later version.
   11 
   12 or
   13 
   14   * the GNU General Public License as published by the Free Software
   15     Foundation; either version 2 of the License, or (at your option) any
   16     later version.
   17 
   18 or both in parallel, as here.
   19 
   20 The GNU MP Library is distributed in the hope that it will be useful, but
   21 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
   22 or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
   23 for more details.
   24 
   25 You should have received copies of the GNU General Public License and the
   26 GNU Lesser General Public License along with the GNU MP Library.  If not,
   27 see https://www.gnu.org/licenses/.
   28 
   29 
   30 
   31 
   32 
   33 This directory contains mpn functions optimized for DEC Alpha processors.
   34 
   35 ALPHA ASSEMBLY RULES AND REGULATIONS
   36 
   37 The `.prologue N' pseudo op marks the end of the instructions that need special
   38 handling by unwinding.  It also says whether $27 is really needed for computing
   39 the gp.  The `.mask M' pseudo op says which registers are saved on the stack,
   40 and at what offset in the frame.
   41 
   42 Cray T3 code is very very different...
   43 
   44 "$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6"
   45 / "f6" is required.  We use the "r6" / "f6" forms, and have m4 defines expand
   46 them to "$6" or "$f6" where necessary.
   47 
   48 "0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is
   49 required.  The X() macro accommodates this difference.
   50 
   51 "cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will
   52 accept either.  We use cvttqc and have an m4 define expand to cvttq/c where
   53 necessary.
   54 
   55 "not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not
   56 the Unicos assembler.  The full "ornot" must be used.
   57 
   58 "unop" is not available in Unicos.  We make an m4 define to the usual "ldq_u
   59 r31,0(r30)", and in fact use that define on all systems since it comes out the
   60 same.
   61 
   62 "!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not
   63 available in older alpha assemblers (including gas prior to 2.12), according to
   64 the GCC manual, so the assembler macro forms must be used (eg. ldgp).
   65 
   66 
   67 
   68 RELEVANT OPTIMIZATION ISSUES
   69 
   70 EV4
   71 
   72 1. This chip has very limited store bandwidth.  The on-chip L1 cache is
   73    write-through, and a cache line is transferred from the store buffer to the
   74    off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
   75    mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
   76 
   77 2. Pairing is possible between memory instructions and integer arithmetic
   78    instructions.
   79 
   80 3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
   81    cycles are pipelined.  Thus, multiply instructions can be issued at a rate
   82    of one each 21st cycle.
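
As an illustration of why multiply throughput dominates on EV4, here is a
minimal C sketch of a mul_1-style loop.  The names mul_lo, mul_hi and
sketch_mul_1 are hypothetical and this is not GMP's code; mul_lo and mul_hi
only stand in for what mulq and umulh compute.  Each limb needs both halves of
the product, so if both instructions are limited to one issue per 21 cycles as
stated above, a loop of this shape could not run faster than about 42 cycles
per limb on EV4.

```c
#include <stddef.h>
#include <stdint.h>

/* Low 64 bits of a*b, i.e. the result mulq produces. */
static uint64_t mul_lo(uint64_t a, uint64_t b)
{
    return a * b;
}

/* High 64 bits of a*b, i.e. the result umulh produces.
   Computed portably from four 32x32->64 partial products. */
static uint64_t mul_hi(uint64_t a, uint64_t b)
{
    uint64_t a0 = a & 0xffffffffu, a1 = a >> 32;
    uint64_t b0 = b & 0xffffffffu, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* Sum of the three pieces that land in bits 32..63; fits in 64 bits. */
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* rp[0..n-1] = up[0..n-1] * v, returning the most significant limb.
   Hypothetical helper: two multiplies (low and high half) per limb. */
uint64_t sketch_mul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = mul_lo(up[i], v);
        uint64_t hi = mul_hi(up[i], v);
        lo += cy;
        hi += lo < cy;      /* cannot overflow, since hi <= 2^64 - 2 */
        rp[i] = lo;
        cy = hi;
    }
    return cy;
}
```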
   83 
   84 EV5
   85 
   86 1. The memory bandwidth of this chip is good, both for loads and stores.  The
   87    L1 cache can handle two loads or one store per cycle, but two cycles after a
   88    store, no ld can issue.
   89 
   90 2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   91    umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   92    (Note that published documentation gets these numbers slightly wrong.)
   93 
   94 3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
   95    are memory operations.  This will take at least
   96 	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
   97    We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   98    cache cycles, which should be completely hidden in the 19 issue cycles.
   99    The computation is inherently serial, with these dependencies:
  100 
  101 	       ldq  ldq
  102 		 \  /\
  103 	  (or)   addq |
  104 	   |\   /   \ |
  105 	   | addq  cmpult
  106 	    \  |     |
  107 	     cmpult  |
  108 		 \  /
  109 		  or
  110 
  111    I.e., 3 operations are needed between carry-in and carry-out, making 12
  112    cycles the absolute minimum for the 4 limbs.  We could replace the `or' with
  113    a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
  114    might waste a cycle on EV4.  The total depth remains unaffected, since cmov
  115    has a latency of 2 cycles.
  116 
  117      addq
  118      /   \
  119    addq  cmpult
  120      |      \
  121    cmpult -> cmovne
  122 
  123    Montgomery has a slightly different way of computing carry that requires one
  124    fewer instruction, but has depth 4 (instead of the current 3).  Since the
  125    code is currently instruction issue bound, Montgomery's idea should save us
  126    1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
  127    cycles/limb.  Unfortunately, this method will not be good for the EV6.
  128 
  129 4. addmul_1 and friends: We previously had a scheme for splitting the
  130    single-limb operand into 21-bit chunks and the multi-limb operand into
  131    32-bit chunks, and then using FP operations for every other multiply and
  132    integer operations for the remaining multiplies.
  133 
  134    But it seems much better to split the single-limb operand into 16-bit chunks,
  135    since we save many integer shifts and adds that way.  See powerpc64/README
  136    for some more details.
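
The chunk widths in item 4 can be related to the 53-bit significand of an IEEE
double: a 21-bit chunk times a 32-bit chunk needs up to 53 bits, while a 16-bit
chunk times a 32-bit chunk needs at most 48.  The following C sketch only
illustrates that arithmetic.  It assumes the multi-limb operand is still taken
in 32-bit chunks, the name chunked_mul_32x64 is hypothetical, and the integer
recombination shown here is not how the assembly code is organized.

```c
#include <stdint.h>

/* Multiply a 32-bit chunk u by a full 64-bit limb v, taking v in four
   16-bit chunks.  Every partial product is below 2^48, so computing it
   in double precision is exact.  The 96-bit result is returned in
   *lo (low 64 bits) and *hi (remaining high bits). */
static void chunked_mul_32x64(uint32_t u, uint64_t v,
                              uint64_t *lo, uint64_t *hi)
{
    uint64_t acc_lo = 0, acc_hi = 0;
    for (int k = 0; k < 4; k++) {
        unsigned s = 16u * (unsigned)k;
        double vk = (double)((v >> s) & 0xffff);     /* 16-bit chunk of v */
        uint64_t p = (uint64_t)((double)u * vk);     /* exact, p < 2^48 */
        uint64_t add_lo = p << s;                    /* p * 2^s, low part */
        uint64_t add_hi = s ? p >> (64 - s) : 0;     /* p * 2^s, high part */
        acc_lo += add_lo;
        acc_hi += add_hi + (acc_lo < add_lo);        /* propagate carry */
    }
    *lo = acc_lo;
    *hi = acc_hi;
}
```

With 16-bit chunks the shift amounts are all multiples of 16, which is one
plausible reading of the remark about saving integer shifts and adds.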
  137 
  138 EV6
  139 
  140 Here we have a really parallel pipeline, capable of issuing up to 4 integer
  141 instructions per cycle.  In actual practice, it is never possible to sustain
  142 more than 3.5 integer insns/cycle due to rename register shortage.  One integer
  143 multiply instruction can issue each cycle.  To get optimal speed, we need to
  144 pretend we are vectorizing the code, i.e., minimize the depth of recurrences.
  145 
  146 There are two dependencies to watch out for.  1) Address arithmetic
  147 dependencies, and 2) carry propagation dependencies.
  148 
  149 We can avoid serializing due to address arithmetic by unrolling loops, so that
  150 addresses don't depend heavily on an index variable.  Avoiding serializing
  151 because of carry propagation is trickier; the ultimate performance of the code
  152 will be determined by the number of latency cycles it takes from accepting
  153 carry-in to a vector point until we can generate carry-out.
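
Both dependencies can be seen in a small C sketch of an add_n-style loop.  The
names add_limb and sketch_add_n are hypothetical and this is not the GMP
implementation.  The loop is unrolled 4-fold so that each group of limbs is
addressed at fixed offsets from pointers that advance once per iteration, and
the per-limb step is written so that only three operations (addq, cmpult, or)
lie between carry-in and carry-out, matching the dependency graph in the EV5
section.

```c
#include <stddef.h>
#include <stdint.h>

/* One limb of the carry recurrence.  The first two operations do not
   depend on the incoming carry, so they can run ahead of it. */
static inline uint64_t add_limb(uint64_t *r, uint64_t u, uint64_t v,
                                uint64_t cy)
{
    uint64_t s  = u + v;      /* addq:   independent of cy */
    uint64_t c1 = s < u;      /* cmpult: independent of cy */
    uint64_t t  = s + cy;     /* addq:   1st op on the carry path */
    uint64_t c2 = t < s;      /* cmpult: 2nd op on the carry path */
    *r = t;
    return c1 | c2;           /* or:     3rd op on the carry path */
}

/* rp[] = up[] + vp[] over n limbs, returning the carry-out (0 or 1). */
uint64_t sketch_add_n(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                      size_t n)
{
    uint64_t cy = 0;
    for (; n >= 4; n -= 4) {
        /* Fixed offsets: no address depends on the previous limb. */
        cy = add_limb(rp + 0, up[0], vp[0], cy);
        cy = add_limb(rp + 1, up[1], vp[1], cy);
        cy = add_limb(rp + 2, up[2], vp[2], cy);
        cy = add_limb(rp + 3, up[3], vp[3], cy);
        rp += 4; up += 4; vp += 4;
    }
    for (; n > 0; n--)        /* leftover limbs */
        cy = add_limb(rp++, *up++, *vp++, cy);
    return cy;
}
```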
  154 
  155 Most integer instructions can execute in either the L0, U0, L1, or U1
  156 pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.
  157 
  158 CMOV instructions split into two internal instructions, CMOV1 and CMOV2.  CMOV
  159 splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV
  160 should always be placed as the last instruction of an aligned 4 instruction
  161 block, or perhaps simply avoided.
  162 
  163 Perhaps the most important issue is the latency between the L0/U0 and L1/U1
  164 clusters; a result obtained on either cluster has an extra cycle of latency for
  165 consumers in the opposite cluster.  Because of the dynamic nature of the
  166 implementation, it is hard to predict where an instruction will execute.
  167 
  168 
  169 
  170 REFERENCES
  171 
  172 "Alpha Architecture Handbook", version 4, Compaq, October 1998, order number
  173 EC-QD2KC-TE.
  174 
  175 "Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,
  176 order number EC-QP99C-TE.
  177 
  178 "Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,
  179 Compaq, September 2000, order number DS-0028B-TE.
  180 
  181 "Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number
  182 EC-RJ66A-TE.
  183 
  184 All of the above are available online from
  185 
  186   http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
  187   ftp://ftp.compaq.com/pub/products/alphaCPUdocs
  188 
  189 "Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part
  190 number AA-PS31D-TE.
  191 
  192 "Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,
  193 March 1996, part number AA-PY8AC-TE.
  194 
  195 The above are available online,
  196 
  197   http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM
  198 
  199 (Dunno what h30097 means in this URL, but if it moves try searching for "tru64
  200 online documentation" from the main www.hp.com page.)
  201 
  202 
  203 
  204 ----------------
  205 Local variables:
  206 mode: text
  207 fill-column: 79
  208 End: