"Fossies" - the Fresh Open Source Software Archive

Member "gmp-6.2.1/tune/README" (14 Nov 2020, 19572 Bytes) of package /linux/misc/gmp-6.2.1.tar.xz:


As a special service "Fossies" has tried to format the requested text file into HTML format (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

    1 Copyright 2000-2002, 2004 Free Software Foundation, Inc.
    2 
    3 This file is part of the GNU MP Library.
    4 
    5 The GNU MP Library is free software; you can redistribute it and/or modify
    6 it under the terms of either:
    7 
    8   * the GNU Lesser General Public License as published by the Free
    9     Software Foundation; either version 3 of the License, or (at your
   10     option) any later version.
   11 
   12 or
   13 
   14   * the GNU General Public License as published by the Free Software
   15     Foundation; either version 2 of the License, or (at your option) any
   16     later version.
   17 
   18 or both in parallel, as here.
   19 
   20 The GNU MP Library is distributed in the hope that it will be useful, but
   21 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
   22 or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
   23 for more details.
   24 
   25 You should have received copies of the GNU General Public License and the
   26 GNU Lesser General Public License along with the GNU MP Library.  If not,
   27 see https://www.gnu.org/licenses/.
   28 
   29 
   30 
   31 
   32 
   33                GMP SPEED MEASURING AND PARAMETER TUNING
   34 
   35 
   36 The programs in this directory are for knowledgeable users who want to
   37 measure GMP routines on their machine, and perhaps tweak some settings or
   38 identify things that can be improved.
   39 
   40 The programs here are tools, not ready to run solutions.  Nothing is built
   41 in a normal "make all", but various Makefile targets described below exist.
   42 
   43 Relatively few systems and CPUs have been tested, so be sure to verify that
   44 results are sensible before relying on them.
   45 
   46 
   47 
   48 
   49 MISCELLANEOUS NOTES
   50 
   51 --enable-assert
   52 
   53     Don't configure with --enable-assert, since the extra code added by
   54     assertion checking may influence measurements.
   55 
   56 Direct mapped caches
   57 
   58     Some effort has been made to accommodate CPUs with direct mapped caches,
   59     by putting data blocks more or less contiguously on the stack.  But this
   60     will depend on TMP_ALLOC using alloca, and even then it may or may not
   61     be enough.
   62 
   63 FreeBSD 4.2 i486 getrusage
   64 
   65     This getrusage seems to be a bit doubtful, it looks like it's
   66     microsecond accurate, but sometimes ru_utime remains unchanged after a
   67     time of many microseconds has elapsed.  It'd be good to detect this in
   68     the time.c initializations, but for now the suggestion is to pretend it
   69     doesn't exist.
   70 
   71         ./configure ac_cv_func_getrusage=no
   72 
   73 NetBSD 1.4.1 m68k macintosh time base
   74 
   75     On this system it's been found getrusage often goes backwards, making it
   76     unusable (time.c getrusage_backwards_p detects this).  gettimeofday
   77     sometimes doesn't update atomically when it crosses a 1 second boundary.
   78     Not sure what to do about this.  Expect possible intermittent failures.
   79 
   80 SCO OpenUNIX 8 /etc/hw
   81 
   82     /etc/hw takes about a second to return the cpu frequency, which suggests
   83     perhaps it's measuring each time it runs.  If this is annoying when
   84     running the speed program repeatedly then set a GMP_CPU_FREQUENCY
   85     environment variable (see TIME BASE section below).
   86 
   87 Timing on GNU/Linux
   88 
   89     On Linux, timing currently uses the cycle counter. This is unreliable,
   90     since the counter is not saved and restored at context switches (unlike
   91     FreeBSD and Solaris where the cycle counter is "virtualized").
   92 
   93     Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or
   94     CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime
   95     with glibc, one has to link with -lrt (which also drags in the pthreads
   96     threading library). configure.in must be hacked to detect this and
   97     arrange proper linking. Something like
   98 
   99       old_LIBS="$LIBS"
  100       AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)])
  101       TUNE_LIBS="$LIBS"
  102       LIBS="$old_LIBS"
  103 
  104       AC_SUBST(TUNE_LIBS)
  105 
  106     might work.
  107 
  108 Low resolution timebase
  109 
  110     Parameter tuning can be very time consuming if the only timebase
  111     available is a 10 millisecond clock tick, to the point of being
  112     unusable.  This is currently the case on VAX and ARM systems.
  113 
  114 
  115 
  116 
  117 PARAMETER TUNING
  118 
  119 The "tuneup" program runs some tests designed to find the best settings for
  120 various thresholds, like MUL_TOOM22_THRESHOLD.  Its output can be put
  121 into gmp-mparam.h.  The program is built and run with
  122 
  123         make tune
  124 
  125 If the thresholds indicated are grossly different from the values in the
  126 selected gmp-mparam.h then there may be a performance boost in applicable
  127 size ranges by changing gmp-mparam.h accordingly.
  128 
  129 Be sure to do a full reconfigure and rebuild to get any newly set thresholds
  130 to take effect.  A partial rebuild is enough sometimes, but a fresh
  131 configure and make is certain to be correct.
  132 
  133 If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
  134 the mpn subdirectories then the values from "make tune" should be similar.
  135 But check that the configured CPU is right and there are no machine specific
  136 effects causing a difference.
  137 
  138 It's hoped the compiler and options used won't have too much effect on
  139 thresholds, since for most CPUs they ultimately come down to comparisons
  140 between assembler subroutines.  Missing out on the longlong.h macros by not
  141 using gcc will probably have an effect.
  142 
  143 Some thresholds produced by the tune program are merely single values chosen
  144 from what's a range of sizes where two algorithms are pretty much the same
  145 speed.  When this happens the program is likely to give somewhat different
  146 values on successive runs.  This is noticeable on the toom3 thresholds for
  147 instance.
  148 
  149 
  150 
  151 
  152 SPEED PROGRAM
  153 
  154 The "speed" program can be used for measuring and comparing various
  155 routines, and producing tables of data or gnuplot graphs.  Compile it with
  156 
  157 	make speed
  158 
  159 (Or on DOS systems "make speed.exe".)
  160 
  161 Here are some examples of how to use it.  Check the code for all the
  162 options.
  163 
  164 Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
  165 (whichever is greater).
  166 
  167         ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
  168 	gnuplot foo.gnuplot
  169 
  170 Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
  171 showing under mpn_lshift the difference between it and mpn_add_n.
  172 
  173 	./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
  174 
  175 Using option -c for times in cycles is interesting but normally only
  176 necessary when looking carefully at assembler subroutines.  You might think
  177 it would always give an integer value, but this doesn't happen in practice,
  178 probably due to overheads in the time measurements.
  179 
  180 In the free-form output the "#" symbol against a measurement means the
  181 corresponding routine is fastest at that size.  This is a convenient visual
  182 cue when comparing different routines.  The graph data files <name>.data
  183 don't get this since it would upset gnuplot or other data viewers.
  184 
  185 
  186 
  187 
  188 TIME BASE
  189 
  190 The time measuring method is determined in time.c, based on what the
  191 configured host has available.  A cycle counter is preferred, possibly
  192 supplemented by another method if the counter has a limited range.  A
  193 microsecond accurate getrusage() or gettimeofday() will work quite well too.
  194 
  195 The cycle counters (except possibly on alpha) and gettimeofday() will depend
  196 on the machine being otherwise idle, or rather on other jobs not stealing
  197 CPU time from the measuring program.  Short routines (those that complete
  198 within a timeslice) should work even on a busy machine.
  199 
  200 Some trouble is taken by speed_measure() in common.c to avoid ill effects
  201 from sporadic interrupts, or other intermittent things (like cron waking up
  202 every minute).  But generally an idle machine will be necessary to be
  203 certain of consistent results.
  204 
  205 The CPU frequency is needed to convert between cycles and seconds, or for
  206 when a cycle counter is supplemented by getrusage() etc.  The speed program
  207 will convert as necessary according to the output format requested.  The
  208 tune program will work with either cycles or seconds.
  209 
  210 freq.c knows how to get the frequency on some systems, or can measure a
  211 cycle counter against gettimeofday() or getrusage(), but when that fails, or
  212 needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
  213 used (in Hertz).  For example in "bash" on a 650 MHz machine,
  214 
  215 	export GMP_CPU_FREQUENCY=650e6
  216 
  217 A high precision time base makes it possible to get accurate measurements in
  218 a shorter time.
  219 
  220 
  221 
  222 
  223 EXAMPLE COMPARISONS - VARIOUS
  224 
  225 Here are some ideas for things that can be done with the speed program.
  226 
  227 There's always going to be a certain amount of overhead in the time
  228 measurements, due to reading the time base, and in the loop that runs a
  229 routine enough times to get a reading of the desired precision.  Noop
  230 functions taking various arguments are available to measure this.  The
  231 "overhead" printed by the speed program each time in its intro is the "noop"
  232 routine, but note that this is just for information, it isn't deducted from
  233 the times printed or anything.
  234 
  235 	./speed -s 1 noop noop_wxs noop_wxys
  236 
  237 To see how many cycles per limb a routine is taking, look at the time
  238 increase when the size increments, using option -D.  This avoids fixed
  239 overheads in the measuring.  Also, remember many of the assembler routines
  240 have unrolled loops, so it might be necessary to compare times at, say, 16,
  241 32, 48, 64 etc to see what the unrolled part is taking, as opposed to any
  242 finishing off.
  243 
  244         ./speed -s 16-64 -t 16 -C -D mpn_add_n
  245 
  246 The -C option on its own gives cycles per limb, but is really only useful at
  247 big sizes where fixed overheads are small compared to the code doing the
  248 real work.  Remember of course memory caching and/or page swapping will
  249 affect results at large sizes.
  250 
  251         ./speed -s 500000 -C mpn_add_n
  252 
  253 Once a calculation stops fitting in the CPU data cache, it's going to start
  254 taking longer.  Exactly where this happens depends on the cache priming in
  255 the measuring routines, and on what sort of "least recently used" the
  256 hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
  257 32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
  258 limbs.
  259 
  260         ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
  261 	gnuplot foo.gnuplot
  262 
  263 When a routine has an unrolled loop for, say, multiples of 8 limbs and then
  264 an ordinary loop for the remainder, it can happen that it's actually faster
  265 to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
  266 draws a graph of mpn_sub_n, to see whether times smoothly increase with
  267 size.
  268 
  269         ./speed -s 1-100 -c -P foo mpn_sub_n
  270 	gnuplot foo.gnuplot
  271 
  272 If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
  273 ought to be faster (or at least not slower) than shifting by, say, 2 bits.
  274 
  275         ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
  276 
  277 An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
  278 if the lshift isn't faster there's an obvious improvement that's possible.
  279 
  280         ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
  281 
  282 On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
  283 destination is one of the sources is faster than a separate destination.
  284 Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
  285 mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.
  286 
  287         ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
  288 
  289 The gmp manual points out that divisions by powers of two should be done
  290 using a right shift because it'll be significantly faster than an actual
  291 division.  The following shows by what factor mpn_rshift is faster than
  292 mpn_divrem_1, using division by 32 as an example.
  293 
  294         ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
  295 
  296 
  297 
  298 
  299 EXAMPLE COMPARISONS - MULTIPLICATION
  300 
  301 mul_basecase takes a ".<r>" parameter. If positive, it gives the second
  302 (smaller) operand size.  For example to show speeds for 3x3 up to 20x3 in
  303 cycles,
  304 
  305         ./speed -s 3-20 -c mpn_mul_basecase.3
  306 
  307 A negative ".<-r>" parameter fixes the size of the product to the absolute
  308 value r.  For example to show speeds for 10x10 up to 19x1 in cycles,
  309 
  310         ./speed -s 10-19 -c mpn_mul_basecase.-20
  311 
  312 mul_basecase with no parameter does an NxN multiply, so for example to show
  313 speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
  314 
  315         ./speed -s 1-20 -c mpn_mul_basecase
  316 
  317 sqr_basecase is implemented by a "triangular" method on most CPUs, making it
  318 up to twice as fast as mul_basecase.  In practice loop overheads and the
  319 products on the diagonal mean it falls short of this.  Here's an example
  320 running the two and showing by what factor an NxN mul_basecase is slower
  321 than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
  322 below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.)
  323 
  324         ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
  325 
  326 The technique described above with -CD for showing the time difference in
  327 cycles per limb between two size operations can be done on an NxN
  328 mul_basecase using -E to change the basis for the size increment to N*N.
  329 For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
  330 doing 256 limbs.  The following therefore shows the per crossproduct speed
  331 of mul_basecase and sqr_basecase at around 20x20 limbs.
  332 
  333         ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase
  334 
  335 Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
  336 interesting to compare it to mul_basecase as if it was.  For sqr_basecase
  337 the -F option can be used to base the deltas on N*(N+1)/2 operations, which
  338 is the triangular products sqr_basecase does.  For example,
  339 
  340         ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase
  341 
  342 Both -E and -F are preliminary and might change.  A consistent approach to
  343 using them when claiming certain per crossproduct or per triangularproduct
  344 speeds hasn't really been established, but the increment between speeds in
  345 the range karatsuba will call seems sensible, that being k to k/2.  For
  346 instance, if the karatsuba threshold was 20 for the multiply and 30 for the
  347 square,
  348 
  349         ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
  350         ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
  351 
  352 
  353 
  354 EXAMPLE COMPARISONS - MALLOC
  355 
  356 The gmp manual recommends application programs avoid excessive initializing
  357 and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
  358 variable will at a minimum go through an init, a realloc for its first
  359 store, and finally a clear.  Quite how long that takes depends on the C
  360 library.  The following compares an mpz_init/realloc/clear to a 10 limb
  361 mpz_add.  Don't be surprised if the mallocing is quite slow.
  362 
  363         ./speed -s 10 -c mpz_init_realloc_clear mpz_add
  364 
  365 On some systems malloc and free are much slower when dynamic linked.  The
  366 speed-dynamic program can be used to see this.  For example the following
  367 measures malloc/free, first static then dynamic.
  368 
  369         ./speed -s 10 -c malloc_free
  370         ./speed-dynamic -s 10 -c malloc_free
  371 
  372 Of course a real world program has big problems if it's doing so many
  373 mallocs and frees that it gets slowed down by a dynamic linked malloc.
  374 
  375 
  376 
  377 
  378 
  379 EXAMPLE COMPARISONS - STRING CONVERSIONS
  380 
  381 mpn_get_str does a binary to string conversion.  The base is specified with
  382 a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
  383 than general bases.  The following compares decimal and hex for instance.
  384 
  385         ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
  386 
  387 Smaller bases need more divisions to split a given size number, and so are
  388 slower.  The following compares base 3 and base 9.  On small operands 9 will
  389 be nearly twice as fast, though at bigger sizes this reduces since in the
  390 current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
  391 limbs) and those divisions come to dominate.
  392 
  393         ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
  394 
  395 mpn_set_str does a string to binary conversion.  The base is specified with
  396 a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
  397 general bases on large conversions.
  398 
  399 	./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
  400 
  401 mpn_set_str also has some special case code for decimal which is a bit
  402 faster than the general case, basically by giving the compiler a chance to
  403 optimize some multiplications by 10.
  404 
  405 	./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
  406 
  407 
  408 
  409 
  410 EXAMPLE COMPARISONS - GCDs
  411 
  412 mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
  413 x and y are single limbs.  This isn't tuned currently, but a value can be
  414 established by a measurement like
  415 
  416 	./speed -s 10-32 mpn_gcd_1.10
  417 
  418 This runs src[0] from 10 to 32 bits, and y fixed at 10 bits.  If the div
  419 threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
  420 is done by nibbling away at the 32-bit operands bit-by-bit.  When the
  421 threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
  422 10x10 bit operation.
  423 
  424 The threshold in mpn/generic/gcd_1.c or the various assembler
  425 implementations can be tweaked up or down until there's no more speedups on
  426 interesting combinations of sizes.  Note that this affects only a 1x1 limb
  427 operation and so isn't very important.  (An Nx1 limb operation always does
  428 an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
  429 
  430 
  431 
  432 
  433 SPEED PROGRAM EXTENSIONS
  434 
  435 Potentially lots of things could be made available in the program, but it's
  436 been left at only the things that have actually been wanted and are likely
  437 to be reasonably useful in the future.
  438 
  439 Extensions should be fairly easy to make though.  speed-ext.c is an example,
  440 in a style that should suit one-off tests, or new code fragments under
  441 development.
  442 
  443 many.pl is a script for generating a new speed program supplemented with
  444 alternate versions of the standard routines.  It can be used for measuring
  445 experimental code, or for comparing different implementations that exist
  446 within a CPU family.
  447 
  448 
  449 
  450 
  451 THRESHOLD EXAMINING
  452 
  453 The speed program can be used to examine the speeds of different algorithms
  454 to check the tune program has done the right thing.  For example to examine
  455 the karatsuba multiply threshold,
  456 
  457 	./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n
  458 
  459 When examining the toom3 threshold, remember it depends on the karatsuba
  460 threshold, so the right karatsuba threshold needs to be compiled into the
  461 library first.  The tune program uses specially recompiled versions of
  462 mpn/mul_n.c etc for this reason, but the speed program simply uses the
  463 normal libgmp.la.
  464 
  465 Note further that the various routines may recurse into themselves on sizes
  466 far enough above applicable thresholds.  For example, mpn_kara_mul_n will
  467 recurse into itself on sizes greater than twice the compiled-in
  468 MUL_TOOM22_THRESHOLD.
  469 
  470 When doing the above comparison between mul_basecase and kara_mul_n what's
  471 probably of interest is mul_basecase versus a kara_mul_n that does one level
  472 of Karatsuba then calls to mul_basecase, but this only happens on sizes less
  473 than twice the compiled MUL_TOOM22_THRESHOLD.  A larger value for that
  474 setting can be compiled-in to avoid the problem if necessary.  The same
  475 applies to toom3 and DC, though in a trickier fashion.
  476 
  477 There are some upper limits on some of the thresholds, arising from arrays
  478 dimensioned according to a threshold (mpn_mul_n), or asm code with certain
  479 sized displacements (some x86 versions of sqr_basecase).  So putting huge
  480 values for the thresholds, even just for testing, may fail.
  481 
  482 
  483 
  484 
  485 FUTURE
  486 
  487 Make a program to check the time base is working properly, for small and
  488 large measurements.  Make it able to test each available method, including
  489 perhaps the apparent resolution of each.
  490 
  491 Make a general mechanism for specifying operand overlap, and a syntax like
  492 maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
  493 sort of thing with the "r" parameter currently.
  494 
  495 
  496 
  497 ----------------
  498 Local variables:
  499 mode: text
  500 fill-column: 76
  501 End: