"Fossies" - the Fresh Open Source Software Archive

Member "gmp-6.2.1/mpn/sparc64/README" (14 Nov 2020, 5284 Bytes) of package /linux/misc/gmp-6.2.1.tar.xz:


Copyright 1997, 1999-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift.  If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
  should not be in slot 4, since that makes the delay insn come from a
  separate bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these sets the condition codes, that instruction must be the second
  one.

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1    iany     iany     mem      iany     iany     mem      iany     iany     mem
insn2    iaddlog  mem      iany     mem      iaddlog  iany     mem      iaddlog  iany
insn3    mem      iaddlog  iaddlog  fa       fa       fa       fm       fm       fm
insn4    fa/fm    fa/fm    fa/fm    fm       fm       fm       fa       fa       fa
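
The grouping rules above can be sketched as a small checker.  This is only an
illustration in C, not part of GMP: the names (enum kind, struct insn,
bundle_ok) are hypothetical, IANY is modelled as "shift or IADDLOG", and the
advice about taken branches in slot 4 is treated as a performance hint and is
not checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction classes, following the Notation section.
   An IANY instruction is modelled as either K_SHIFT or K_IADDLOG. */
enum kind { K_SHIFT, K_IADDLOG, K_MEM, K_FA, K_FM, K_BRANCH };

struct insn {
    enum kind kind;
    bool sets_cc;   /* true if the instruction writes the condition codes */
};

/* Return true if b[0..n-1] satisfies the hard grouping rules listed above. */
bool bundle_ok(const struct insn *b, int n)
{
    int ints = 0, mem = 0, fa = 0, fm = 0, br = 0;
    bool first_int_sets_cc = false;

    assert(b != NULL);
    if (n < 1 || n > 4)
        return false;

    for (int i = 0; i < n; i++) {
        bool is_int = (b[i].kind == K_SHIFT || b[i].kind == K_IADDLOG);

        /* Integer and memory instructions may not occupy slot 4. */
        if ((is_int || b[i].kind == K_MEM) && i == 3)
            return false;

        if (is_int) {
            /* A shift must come before any other integer instruction,
               which also rules out two shifts in one bundle. */
            if (b[i].kind == K_SHIFT && ints > 0)
                return false;
            /* With two integer instructions, only the second may set cc. */
            if (ints == 1 && first_int_sets_cc)
                return false;
            if (ints == 0)
                first_int_sets_cc = b[i].sets_cc;
            ints++;
        } else {
            switch (b[i].kind) {
            case K_MEM:    mem++; break;
            case K_FA:     fa++;  break;
            case K_FM:     fm++;  break;
            case K_BRANCH: br++;  break;
            default:       break;
            }
        }
    }
    return ints <= 2 && mem <= 1 && fa <= 1 && fm <= 1 && br <= 1;
}
```

With this model, the first column of the table (shift, iaddlog, mem, fa) is
accepted, while the same bundle with the shift in second position is rejected.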

The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We stay away from that
instruction, and instead use floating-point operations.
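
As a rough illustration of how floating-point multiplies can stand in for
mulx, the following C sketch forms a 64x64->128 bit product from partial
products computed in double precision.  The decomposition used here (u split
into two 32-bit halves, v into four 16-bit pieces) is an assumption chosen so
that every partial product is below 2^48 and therefore exact in a double's
53-bit significand; this file does not spell out the decomposition the
assembly code uses.  The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Add x * 2^shift into the 128-bit accumulator (hi:lo).
   Requires x < 2^48 and shift <= 80, so x * 2^shift fits in 128 bits. */
static void acc_add_shifted(uint64_t *hi, uint64_t *lo, uint64_t x,
                            unsigned shift)
{
    uint64_t lo_part, hi_part;

    assert(x < ((uint64_t)1 << 48) && shift <= 80);
    if (shift == 0) {
        lo_part = x;
        hi_part = 0;
    } else if (shift < 64) {
        lo_part = x << shift;
        hi_part = x >> (64 - shift);
    } else {
        lo_part = 0;
        hi_part = x << (shift - 64);
    }
    *lo += lo_part;
    *hi += hi_part + (*lo < lo_part);   /* propagate the carry out of lo */
}

/* Compute the full 128-bit product u * v using only double-precision
   multiplies for the partial products. */
void fp_umul_64x64(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    double up[2], vp[4];

    up[0] = (double)(u & 0xffffffffu);        /* low 32 bits of u  */
    up[1] = (double)(u >> 32);                /* high 32 bits of u */
    for (int j = 0; j < 4; j++)
        vp[j] = (double)((v >> (16 * j)) & 0xffff);   /* 16-bit pieces */

    *hi = 0;
    *lo = 0;
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 4; j++) {
            double p = up[i] * vp[j];         /* < 2^48, hence exact */
            acc_add_shifted(hi, lo, (uint64_t)p, 32u * i + 16u * j);
        }
    }
}
```

The eight multiplies are independent of each other, so on a fully pipelined
multiplier they can be issued back to back; the integer work is limited to
recombining the pieces.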

Floating-point add and multiply units are fully pipelined.  The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (This
might have been fixed for UltraSPARC-3.)

The UltraSPARC-3 pipeline is very similar to that of UltraSPARC-1/2, but is
somewhat slower.  Branches execute slower, and there may be other new stalls.
On the other hand, integer multiply doesn't stall the entire CPU and also has
a much lower latency.  It is still not pipelined, however, and thus useless
for our needs.

STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 on UltraSPARC-3.  For UltraSPARC-1/2, the IEU0
  functional unit is saturated with shifts.
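
To see why shifts are the bottleneck, here is a portable C reference for the
left-shift operation.  It is a sketch of the semantics only, not the sparc64
assembly; ref_lshift is a hypothetical name and 64-bit limbs are assumed.
Each result limb needs one left shift and one right shift, so with at most
one shift issued per cycle the loop cannot go below 2 cycles/limb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift {up, n} left by cnt bits (1 <= cnt <= 63), store the n low limbs
   of the result at rp, and return the bits shifted out of the top limb.
   Limbs are processed from most to least significant. */
uint64_t ref_lshift(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    unsigned tnc;
    uint64_t high, low, retval;

    assert(n >= 1 && cnt >= 1 && cnt < 64);
    tnc = 64 - cnt;
    high = up[n - 1];
    retval = high >> tnc;                       /* bits shifted out */
    for (size_t i = n - 1; i > 0; i--) {
        low = up[i - 1];
        rp[i] = (high << cnt) | (low >> tnc);   /* two shifts per limb */
        high = low;
    }
    rp[0] = high << cnt;
    return retval;
}
```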

* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3.  The 4-instruction
  recurrence is the speed limiter.
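
The recurrence in question is the carry that each limb passes to the next.
The portable C reference below shows that dependency chain; it is a sketch
with a hypothetical name (ref_add_n) and assumed 64-bit limbs, and it does
not claim to mirror the exact instruction sequence of the assembly code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add {up, n} and {vp, n}, store the n-limb sum at rp and return the
   carry out (0 or 1). */
uint64_t ref_add_n(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                   size_t n)
{
    uint64_t cy = 0;

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i];
        uint64_t s = u + vp[i];     /* independent of the previous limb */
        uint64_t c1 = s < u;        /* carry from u + v                 */
        uint64_t r = s + cy;        /* depends on the incoming carry    */
        uint64_t c2 = r < s;        /* carry from adding cy             */
        rp[i] = r;
        cy = c1 | c2;               /* at most one of c1, c2 is set     */
    }
    return cy;
}
```

Only the operations that consume or produce cy sit on the critical path; the
loads, the store and the first addition can be overlapped with other limbs.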

* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3.  On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle.  It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The loads of the u operand are not scheduled far enough ahead of their uses
  for good L2 cache performance.  The UltraSPARC-1/2 L1 cache is direct
  mapped, and since we use temporary stack slots that will conflict with the
  u and r operands, we miss to L2 very often.  The std/ldx pairs that go via
  the stack are perhaps scheduled too far apart.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were scheduled less aggressively.  (2) The ldx of the r
  operand could be split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations were rescheduled for this processor's 4-cycle latency.
  113 * mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  114   code.  It would be possible to shave one or two cycles from it, with some
  115   labour.
  116 
  117 * mpn_submul_1: Simpleminded code just calling mpn_mul_1 + mpn_sub_n.  This
  118   means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on
  119   UltraSPARC-3.  It would be possible to either match the mpn_addmul_1
  120   performance, or in the worst case use one more instruction group.
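
The "mul_1 followed by sub_n" composition can be sketched in portable C as
follows.  All names here (ref_mul_1, ref_sub_n, ref_submul_1, umul_hi_lo) are
hypothetical reference routines with assumed 64-bit limbs, and the caller
supplied scratch area tp is an assumption of this sketch; the real code's
temporary handling is not described in this file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64x64->128 multiply built from 32-bit halves; returns the high limb. */
static uint64_t umul_hi_lo(uint64_t a, uint64_t b, uint64_t *lo)
{
    uint64_t a0 = a & 0xffffffffu, a1 = a >> 32;
    uint64_t b0 = b & 0xffffffffu, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);

    *lo = (mid << 32) | (p00 & 0xffffffffu);
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* rp[] = up[] * v; returns the most significant limb of the product. */
static uint64_t ref_mul_1(uint64_t *rp, const uint64_t *up, size_t n,
                          uint64_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t lo, hi = umul_hi_lo(up[i], v, &lo);
        lo += cy;
        hi += lo < cy;
        rp[i] = lo;
        cy = hi;
    }
    return cy;
}

/* rp[] = up[] - vp[]; returns the borrow out (0 or 1). */
static uint64_t ref_sub_n(uint64_t *rp, const uint64_t *up,
                          const uint64_t *vp, size_t n)
{
    uint64_t bw = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i], v = vp[i];
        uint64_t d = u - v;
        uint64_t b1 = u < v;
        uint64_t b2 = d < bw;
        rp[i] = d - bw;
        bw = b1 | b2;
    }
    return bw;
}

/* rp[] -= up[] * v, using n limbs of scratch at tp.  Returns the high limb
   of the product plus the borrow from the subtraction. */
uint64_t ref_submul_1(uint64_t *rp, const uint64_t *up, size_t n,
                      uint64_t v, uint64_t *tp)
{
    uint64_t cy;

    assert(n >= 1);
    cy = ref_mul_1(tp, up, n, v);
    cy += ref_sub_n(rp, rp, tp, n);
    return cy;
}
```

Because the two passes run one after the other, the cost per limb is the sum
of both loops, which is why a fused loop along the lines of mpn_addmul_1 would
be faster.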

* US1/US2 cache conflict resolution.  The direct mapped L1 data cache of
  US1/US2 is a problem for mul_1, addmul_1 (and a prospective submul_1).  We
  should allocate a larger cache area, and put the stack temp area in a place
  that doesn't cause cache conflicts.
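
A direct-mapped cache maps each address to exactly one line, so two buffers
whose addresses differ by a multiple of the cache size evict each other.  The
C sketch below makes that test explicit.  The cache geometry (16 KB, 32-byte
lines) is an assumption used for illustration and should be checked against
the processor manual; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed L1 data cache geometry, for illustration only. */
#define L1D_BYTES 16384u
#define L1D_LINE  32u

/* Index of the cache line that addr maps to in a direct-mapped cache. */
unsigned l1d_index(uintptr_t addr)
{
    assert(L1D_BYTES % L1D_LINE == 0);
    return (unsigned)((addr % L1D_BYTES) / L1D_LINE);
}

/* True if a and b lie in different memory lines that compete for the
   same cache line. */
bool l1d_conflict(uintptr_t a, uintptr_t b)
{
    return l1d_index(a) == l1d_index(b) && a / L1D_LINE != b / L1D_LINE;
}
```

A check of this kind could be used when choosing where to place the stack
temp area relative to the u and r operands.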