"Fossies" - the Fresh Open Source Software Archive  

Source code changes of the file "repeats/repeats.1" between
littleutils-1.2.4.tar.lz and littleutils-1.2.5.tar.lz

About: littleutils is a collection of small and simple utilities (for renaming files, searching for duplicate files, ...).

repeats.1 (littleutils-1.2.4.tar.lz) vs. repeats.1 (littleutils-1.2.5.tar.lz)

skipping to change at line 60
              times, which can substantially increase execution time when duplicates of large files are
              present.  [This is stage 5 and is only available in repeats.]
       -r ramp_factor
              In repeats.pl, stage 3 is run repeatedly in place of stage 4, with the number of bytes read in
              each round being multiplied by the "ramp rate" value.  The default value is 4.
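The ramping behavior described for -r can be sketched roughly as follows. This is a minimal illustration, not the actual repeats.pl code; the 4096-byte starting size and the helper names are assumptions:

```python
import hashlib

def ramp_rounds(file_size, start_bytes=4096, ramp_factor=4):
    """Yield the read sizes for the repeated stage-3 rounds: each round
    reads ramp_factor times as many bytes as the previous one, until a
    final round covers the whole file."""
    n = start_bytes
    while n < file_size:
        yield n
        n *= ramp_factor
    yield file_size  # final round reads the entire file

def prefix_hash(path, num_bytes):
    """Hash only the first num_bytes of a file (one elimination round);
    files whose prefix hashes differ cannot be duplicates."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(num_bytes))
    return h.hexdigest()

# Round sizes for a 1 MiB file with the default ramp factor of 4:
print(list(ramp_rounds(1048576)))  # [4096, 16384, 65536, 262144, 1048576]
```

Each round eliminates files whose prefixes already differ, so the full-file read (and its disc I/O) is only paid for files that still look identical in the final round.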
       -v     Verbose output.  Write some statistics concerning number of potential duplicates found at each
              stage to stderr.
-      -z     Include even zero-length files in the search.  If there is more than one zero-length file, all
-             of those files will be considered duplicates.
+      -z     Include zero-length files in the search.  If there is more than one zero-length file, all of
+             those files will be considered duplicates.
NOTES
       If no directory is specified, the current directory is assumed.
       In terms of program history, the repeats utility was written first (in 2004).  The repeats.pl utility
       was written later (in 2020) to explore new algorithms, and it currently implements a multi-step stage 3
       algorithm that requires less disc I/O than repeats.  It still runs slightly slower than repeats on
-      Linux for most data sets but is actually faster on Cygwin.
+      Linux for most data sets but is significantly faster on Cygwin.
BUGS
       It must be noted that it is theoretically possible (though freakishly improbable) for two different
       files to be listed as duplicates if the -p option is not used.  If they have the same size and the same
       file hash, they will be listed as duplicates.  The odds of two different files (of the same size) being
       listed as duplicates is approximately 1.16e77 to 1 for the SHA256 hash.  Using arguments similar to the
       classic "birthday paradox" (i.e., the probability of two people sharing the same birthday in a room of
       only 23 people is greater than 50%), it can be shown that it would take approximately 4.01e38 different
       files (of exactly the same size) to achieve similar odds.  In other words, it'll probably never happen.
       Ever.  However, it's not inconceivable.  You have been warned.
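The size-then-hash test described above can be sketched like this. It is a simplified illustration, not the actual repeats code, and the function names are invented:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Group files by size, then hash only the same-sized groups;
    files sharing both size and hash are reported as duplicates."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file of unique size can never have a duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            with open(p, "rb") as f:
                by_hash[hashlib.blake2b(f.read()).hexdigest()].append(p)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

The -p option mentioned above would correspond to one extra step: a byte-for-byte comparison (e.g. with filecmp.cmp) within each hash-matched group, which removes the hash-collision risk entirely.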
skipping to change at line 91
       For the various hashes, the number of same-sized files required for the probability of a false positive
       to reach 50% is as follows:
          MD5:          2.17e19 files
          SHA1:         1.42e24 files
          SHA224:       6.11e33 files
          SHA256:       4.01e38 files (default prior to version 1.2.0)
          SHA384:       7.39e57 files
          SHA512:       1.36e77 files
          BLAKE2B-256:  4.01e38 files
-         BLAKE2B-512:  1.36e77 files (default as of version 1.2.0)
+         BLAKE2B-512:  1.36e77 files (default for versions 1.2.0 and later)
       See https://en.wikipedia.org/wiki/Birthday_problem and https://en.wikipedia.org/wiki/Birthday_attack
       for more information.  If this extremely remote risk is too much to bear, use the -p option.
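The figures in the table follow from the usual birthday-bound approximation n ≈ sqrt(2 · ln 2 · 2^bits) for a 50% collision probability. A quick sketch that reproduces them:

```python
import math

def files_for_50pct_collision(hash_bits):
    """Approximate number of equal-sized files needed before the odds of
    two of them sharing a hash value reach 50% (birthday bound)."""
    return math.sqrt(2 * math.log(2) * 2**hash_bits)

for name, bits in [("MD5", 128), ("SHA1", 160), ("SHA224", 224),
                   ("SHA256", 256), ("SHA384", 384), ("SHA512", 512)]:
    print(f"{name}: {files_for_50pct_collision(bits):.2e} files")
```

The printed values match the table above (e.g. 4.01e38 for a 256-bit hash); BLAKE2B-256 and BLAKE2B-512 give the same bounds as SHA256 and SHA512, since only the digest length matters here.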
       Also, repeats and repeats.pl currently lack logic to mark hardlinks (under the -l option) as duplicates
       without actually reading the entire file multiple times.  This will be addressed in a future version of
       littleutils.
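One way such hardlink logic could work (a hypothetical sketch, not littleutils code) is to group paths by device and inode number before any hashing, since hardlinked paths refer to the same underlying data:

```python
import os
from collections import defaultdict

def group_hardlinks(paths):
    """Paths sharing (st_dev, st_ino) are hardlinks to one another and
    can be marked as duplicates without reading any file contents."""
    by_inode = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        by_inode[(st.st_dev, st.st_ino)].append(p)
    return [group for group in by_inode.values() if len(group) > 1]
```

Only one path per inode group would then need to be read and hashed, which is exactly the repeated-read cost the paragraph above describes.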
       And finally, repeats will malfunction if asked to examine files that have one or more "tab" (0x09)
       characters in the filename, as tab characters are used as delimiters in the temporary working files
       that repeats creates.
-      If scanning a data set with embedded "tabs" in the filenames, use repeats.pl instead, as it maintains
-      file lists in memory.
+      If scanning a data set with embedded tabs in the filenames, use repeats.pl instead, as it maintains
+      file lists in memory.
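The failure mode is the classic delimiter-collision problem; a minimal illustration (the record layout shown is an assumption, not repeats' actual temporary-file format):

```python
# A tab-delimited record of (size, hash, filename), as a temporary
# working file might store it:
record = "1024\tdeadbeef\t" + "bad\tname.txt"  # filename contains a tab

fields = record.split("\t")
# The parser expects 3 fields but gets 4, so the filename is mangled:
print(fields)       # ['1024', 'deadbeef', 'bad', 'name.txt']
print(len(fields))  # 4, not 3
```

Keeping the file list in memory, as repeats.pl does, sidesteps this because no serialization to a delimited text file ever happens.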
SEE ALSO
-      filehash(1), filenode(1), filesize(1), perl, CryptX(3pm), rep_hash(1), rep_node(1), rep_size(1),
-      duff(1), dupd(1), fdupes(1), jdupes(1), rdfind(1)
+      filehash(1), filenode(1), filesize(1), perl(1), CryptX(3pm), rep_hash(1), rep_node(1), rep_size(1),
+      duff(1), dupd(1), fdupes(1), jdupes(1), rdfind(1)
COPYRIGHT
       Copyright (C) 2004-2021 by Brian Lindholm.  This program is free software; you can use it, redistribute
       it, and/or modify it under the terms of the GNU General Public License as published by the Free
       Software Foundation; either version 3, or (at your option) any later version.
       This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
       the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
       Public License for more details.
End of changes.  5 change blocks.  11 lines changed or deleted, 11 lines changed or added.
