Member "gmp-6.2.1/mpn/s390_64/README" (14 Nov 2020, 2772 Bytes) of package /linux/misc/gmp-6.2.1.tar.xz:


Copyright 2011 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



There are 5 generations of 64-bit s390 processors: z900, z990, z9,
z10, and z196.  The current GMP code was optimised for the two oldest,
z900 and z990.


mpn_copyi

This code uses a loop around MVC and almost surely runs very close to
optimally.  A small improvement would be to use a single MVC for a
size of exactly 256 bytes; we currently use two (an extra MVC is
executed whenever the size is a multiple of 256 bytes).
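
The suggested chunking can be modelled in C roughly as follows.  This
is a hypothetical sketch, not the s390 assembly: memcpy stands in for
MVC (which moves 1 to 256 bytes per execution), and copyi_model is a
name invented for this illustration.  It shows the improved behaviour,
where a size of exactly 256 bytes takes one MVC instead of two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical C model: each memcpy call stands in for one MVC.
   Returns the number of modelled MVC executions.  */
size_t
copyi_model (unsigned char *dst, const unsigned char *src, size_t nbytes)
{
  size_t mvc_count = 0;

  assert (nbytes % 8 == 0);     /* whole 64-bit limbs only */

  while (nbytes > 256)          /* full blocks, leaving 1..256 bytes */
    {
      memcpy (dst, src, 256);
      dst += 256;
      src += 256;
      nbytes -= 256;
      mvc_count++;
    }
  if (nbytes != 0)              /* tail of 1..256 bytes, a single MVC */
    {
      memcpy (dst, src, nbytes);
      mvc_count++;
    }
  return mvc_count;
}
```

The loop condition is "> 256" rather than ">= 256", so a multiple of
256 bytes ends in a full-size tail move and no extra MVC is needed.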


mpn_copyd

We have tried several feed-in variants here: branch tree, jump table,
and computed goto.  The fastest (on z990) turned out to be computed
goto.

One approach not yet tried is to EX (execute) LMG and STMG, modifying
the register set on the fly.  Using that trick, we could completely
avoid using separate feed-in paths.
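
To illustrate what "feed-in" means here, the following is a minimal C
sketch of a decrementing copy with a 4-way unrolled loop, entered at a
point chosen by n mod 4.  It uses a standard C switch (in the style of
Duff's device), which a compiler may lower to a jump table or a branch
tree; a true computed goto would need the GCC labels-as-values
extension.  The name copyd_model and the unroll factor are assumptions
made for this illustration, not taken from the assembly code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model: copy n limbs from up to rp, highest limb
   first, so overlapping operands with rp > up are handled.  */
void
copyd_model (uint64_t *rp, const uint64_t *up, size_t n)
{
  size_t i = n;

  if (n == 0)
    return;
  assert (rp != NULL && up != NULL);

  switch (n % 4)                /* feed-in: jump into the unrolled loop */
    {
    case 0: do { i--; rp[i] = up[i];
    case 3:      i--; rp[i] = up[i];
    case 2:      i--; rp[i] = up[i];
    case 1:      i--; rp[i] = up[i];
            } while (i != 0);
    }
}
```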


mpn_lshift, mpn_rshift

The current code runs at pipeline decode bandwidth on z990.


mpn_add_n, mpn_sub_n

The current code is 4-way unrolled.  It should be unrolled further, to
at least 8-way, in order to reach 2.5 c/l (cycles per limb).
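
As a reference for the structure being discussed, here is a minimal C
sketch of a 4-way unrolled add with carry propagation.  The helper
add_limb models roughly what an add-logical-with-carry instruction
such as ALCGR provides in one step.  The names add_limb and add_n_model
are invented for this illustration, and the sketch says nothing about
the actual instruction scheduling in the assembly code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* *r = a + b + cy; returns the carry-out (0 or 1).  */
uint64_t
add_limb (uint64_t *r, uint64_t a, uint64_t b, uint64_t cy)
{
  uint64_t s = a + b;
  uint64_t c1 = s < a;          /* carry from a + b */
  uint64_t t = s + cy;
  uint64_t c2 = t < s;          /* carry from adding cy */
  *r = t;
  return c1 | c2;               /* at most one of c1, c2 is set */
}

/* Hypothetical C model: rp[] = up[] + vp[] over n limbs, returns carry.  */
uint64_t
add_n_model (uint64_t *rp, const uint64_t *up, const uint64_t *vp, size_t n)
{
  uint64_t cy = 0;
  size_t i = 0;

  assert (n > 0);               /* assumed here: at least one limb */

  for (; i + 4 <= n; i += 4)    /* 4-way unrolled main loop */
    {
      cy = add_limb (&rp[i],     up[i],     vp[i],     cy);
      cy = add_limb (&rp[i + 1], up[i + 1], vp[i + 1], cy);
      cy = add_limb (&rp[i + 2], up[i + 2], vp[i + 2], cy);
      cy = add_limb (&rp[i + 3], up[i + 3], vp[i + 3], cy);
    }
  for (; i < n; i++)            /* remaining 0..3 limbs */
    cy = add_limb (&rp[i], up[i], vp[i], cy);

  return cy;
}
```

Unrolling 8-way would mean eight add_limb steps per iteration, halving
the loop overhead per limb.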


mpn_mul_1, mpn_addmul_1, mpn_submul_1

The current code is very naive, but due to the non-pipelined nature of
MLGR on z900 and z990, more sophisticated code would not gain much.

On z10 one would need to cluster at least 4 MLGR together, in order to
reduce stalling.

On z196, one surely wants to use unrolling and pipelining, to perhaps
reach around 12 c/l.  A major issue here and on z10 is ALCGR's 3-cycle
stall.
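
For reference, the operations themselves can be sketched in C as one
64x64->128 multiply per limb, with the high half carried into the next
step.  This is a hypothetical model, not the assembly code: unsigned
__int128 (a GCC/Clang extension) stands in for the MLGR product, and
the names mul_1_model and addmul_1_model are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model: rp[] = up[] * v over n limbs, returns the
   high limb.  */
uint64_t
mul_1_model (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
  uint64_t cy = 0;

  assert (n > 0);               /* assumed here: at least one limb */
  for (size_t i = 0; i < n; i++)
    {
      /* up[i] * v + cy always fits in 128 bits.  */
      unsigned __int128 p = (unsigned __int128) up[i] * v + cy;
      rp[i] = (uint64_t) p;
      cy = (uint64_t) (p >> 64);
    }
  return cy;
}

/* Hypothetical C model: rp[] += up[] * v over n limbs, returns the
   high limb.  */
uint64_t
addmul_1_model (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
  uint64_t cy = 0;

  assert (n > 0);
  for (size_t i = 0; i < n; i++)
    {
      /* up[i] * v + rp[i] + cy still fits in 128 bits.  */
      unsigned __int128 p = (unsigned __int128) up[i] * v + rp[i] + cy;
      rp[i] = (uint64_t) p;
      cy = (uint64_t) (p >> 64);
    }
  return cy;
}
```

Each iteration depends on the carry from the previous one, which is
why carry handling matters once the multiplies themselves are
pipelined.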


mpn_mul_2, mpn_addmul_2

At least for older machines (z900, z990) with very slow MLGR, we
should use Karatsuba's algorithm on 2-limb units, making mul_2 and
addmul_2 the main multiplication primitives.  The newer machines might
benefit less from this approach, perhaps in particular z10, where MLGR
clustering is more important.

With Karatsuba, one could hope for around 16 cycles per accumulated
128-bit cross product on z990.
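
The idea can be sketched in C for a single 2-limb by 2-limb product,
which needs three 64x64->128 multiplications instead of four.  This is
only a hypothetical model of the arithmetic, not a proposal for the
actual mul_2 or addmul_2 code: it uses the subtractive Karatsuba
variant, unsigned __int128 (a GCC/Clang extension) stands in for the
MLGR product, and the name mul_2x2_karatsuba is invented here.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128; /* GCC/Clang extension, models MLGR */

/* Hypothetical C model: rp[0..3] = (ap[1]:ap[0]) * (bp[1]:bp[0]),
   least significant limb first.  */
void
mul_2x2_karatsuba (uint64_t rp[4], const uint64_t ap[2],
                   const uint64_t bp[2])
{
  uint64_t a0 = ap[0], a1 = ap[1], b0 = bp[0], b1 = bp[1];

  u128 ll = (u128) a0 * b0;     /* multiplication 1 */
  u128 hh = (u128) a1 * b1;     /* multiplication 2 */

  /* |a1 - a0| and |b1 - b0| fit in one limb; signs tracked apart.  */
  int a_neg = a1 < a0, b_neg = b1 < b0;
  uint64_t da = a_neg ? a0 - a1 : a1 - a0;
  uint64_t db = b_neg ? b0 - b1 : b1 - b0;
  u128 dd = (u128) da * db;     /* multiplication 3 */

  /* mid = a1*b0 + a0*b1 = hh + ll - (a1-a0)*(b1-b0).  It can need
     129 bits, so keep it as a 128-bit value plus one top bit.  */
  u128 mid = hh + ll;
  uint64_t top = mid < hh;
  if (a_neg != b_neg)
    {                           /* (a1-a0)*(b1-b0) < 0: add |dd| */
      mid += dd;
      top += mid < dd;
    }
  else
    {                           /* otherwise subtract dd */
      top -= mid < dd;
      mid -= dd;
    }
  assert (top <= 1);

  /* Recombine: ll + mid * 2^64 + hh * 2^128.  */
  u128 t;
  rp[0] = (uint64_t) ll;
  t = (ll >> 64) + (uint64_t) mid;
  rp[1] = (uint64_t) t;
  t = (t >> 64) + (mid >> 64) + (uint64_t) hh;
  rp[2] = (uint64_t) t;
  rp[3] = (uint64_t) ((t >> 64) + (hh >> 64) + top);
}
```

The saving of one multiplication is paid for with extra additions,
subtractions and sign handling, which is why the trade-off favours
machines where MLGR is very slow.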