Member "littleutils-1.2.4/repeats/repeats.1" (28 Mar 2021, 7101 Bytes) of package /linux/privat/littleutils-1.2.4.tar.lz:

.TH REPEATS 1 "2021 Jan 22" littleutils
.SH NAME
repeats and repeats.pl \- search for duplicate files
.SH SYNOPSIS
\fBrepeats\fR
[\fB\-a\fR\~\fIhash_algorithm\fR]
[\fB\-h(elp)\fR]
[\fB\-l(inks_hard)\fR]
[\fB\-m\fR\~\fIbytes_for_partial\fR]
[\fB\-p(aranoid)\fR]
[\fB\-v(erbose)\fR]
[\fB\-z(ero_include)\fR]
[\fIdirectory\|.\|.\|.\fR]

\fBrepeats.pl\fR
[\fB\-1(_line_output)\fR]
[\fB\-a\fR\~\fIhash_algorithm\fR]
[\fB\-h(elp)\fR]
[\fB\-l(inks_hard)\fR]
[\fB\-m\fR\~\fIbytes_for_partial\fR]
[\fB\-r\fR\~\fIramp_factor\fR]
[\fB\-v(erbose)\fR]
[\fB\-z(ero_include)\fR]
[\fIdirectory\|.\|.\|.\fR]
.SH DESCRIPTION
\fBrepeats\fR (written in \fIC\fR and \fBsh\fR) and \fBrepeats.pl\fR (written
in \fBperl\fR and utilizing routines from the \fBCryptX\fR module) both search
for duplicate files in one or more specified directories, using a three-,
four-, or five-stage process.  This process works as follows:

Initially, all files in the specified directories (and all of their
subdirectories) are listed as potential duplicates.  In the first stage, all
files with a unique filesize are declared unique and are removed from the list.
In the optional second stage, any file that is actually a hardlink to another
file is removed, since it takes up no additional disk space.  In the third
stage, all files for which the first 65536 (for \fBrepeats\fR) or 4096 (for
\fBrepeats.pl\fR) bytes (both adjustable with the \fB\-m\fR option) have a
unique filehash are declared unique and are removed from the list.  In the
fourth stage, all files which have a unique filehash (for the entire file) are
declared unique and are removed from the list.  And in the optional fifth
stage, all files with matching filehashes are compared using \fBcmp\fR and are
printed to stdout if they match.

This process is MUCH less disk and CPU intensive than creating full hashes for
all files.  It is implemented using a combination of the \fBfilehash\fR,
\fBfilenode\fR, \fBfilesize\fR, \fBrep_hash\fR, \fBrep_node\fR, \fBrep_size\fR,
and \fBtempname\fR utilities.  The \fBduff\fR, \fBdupd\fR, \fBfdupes\fR,
\fBjdupes\fR, and \fBrdfind\fR commands utilize similar strategies.
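.PP
The staged narrowing described above can be illustrated with a short Python
sketch.  This is not the code used by \fBrepeats\fR or \fBrepeats.pl\fR: the
function names (\fIfind_repeats\fR, \fIgroup_by\fR, \fIhash_prefix\fR) are
hypothetical, the 65536-byte partial read and the BLAKE2B-512 hash simply
mirror the documented defaults, and zero-length files are skipped as they
would be without \fB\-z\fR.  The optional \fBcmp\fR stage is left out.
.PP
.nf
```python
import hashlib
import os
from collections import defaultdict


def hash_prefix(path, limit=None):
    """BLAKE2B-512 digest of the first `limit` bytes (whole file if None)."""
    h = hashlib.blake2b()  # 64-byte digest by default
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = 65536 if remaining is None else min(65536, remaining)
            chunk = f.read(want)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def group_by(paths, key):
    """Group paths by key(path) and drop groups with a single member."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_repeats(root, partial_bytes=65536):
    """Return lists of paths under root whose contents hash identically."""
    # Candidate list: every regular, non-empty file below root.
    paths = []
    for d, _, names in os.walk(root):
        for n in names:
            p = os.path.join(d, n)
            if os.path.islink(p) or not os.path.isfile(p):
                continue
            if os.path.getsize(p) > 0:
                paths.append(p)
    # Stage 1: a file with a unique size cannot have a duplicate.
    groups = group_by(paths, os.path.getsize)
    # Stage 2: keep only the first path seen for each (device, inode) pair.
    unlinked = []
    for g in groups:
        first = {}
        for p in g:
            st = os.stat(p)
            first.setdefault((st.st_dev, st.st_ino), p)
        unlinked.append(list(first.values()))
    groups = unlinked
    # Stage 3 (partial hash), then stage 4 (full hash).
    for limit in (partial_bytes, None):
        groups = [s for g in groups
                  for s in group_by(g, lambda p: hash_prefix(p, limit))]
    return [sorted(g) for g in groups]
```
.fi
.PP
Each stage only touches the files that survived the previous one, which is
why most files are never read in full.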
.SH OPTIONS
.TP
\fB\-1\fR
Print each set of duplicate files on a single line.  This option is available
only in \fBrepeats.pl\fR.
.TP
\fB\-a\fR\~\fIhash_algorithm\fR
Specify which hash algorithm should be used.  Choices are \fI1\fR\~(MD5),
\fI2\fR\~(SHA1), \fI3\fR\~(SHA224), \fI4\fR\~(SHA256), \fI5\fR\~(SHA384),
\fI6\fR\~(SHA512), \fI7\fR\~(BLAKE2B-256), and \fI8\fR\~(BLAKE2B-512).  The
default is\~\fI8\fR, for BLAKE2B-512 hashes.
.TP
\fB\-h\fR
Print help and quit.
.TP
\fB\-l\fR
List files that are actually hardlinks as duplicates.  Normally, only the first
hardlink sharing an i-node number is included as a possible repeat.  [This
skips stage\~2.]
.TP
\fB\-m\fR\~\fIbytes_for_partial\fR
Specify the number of bytes read per file in stage\~3.
.TP
\fB\-p\fR
Perform a final \fBcmp\fR-based "paranoia" check to absolutely ensure that
listed duplicates are truly duplicates.  Using this option can result in each
duplicate being read completely two or three times, which can substantially
increase execution time when duplicates of large files are present.  [This is
stage\~5 and is only available in \fBrepeats\fR.]
.TP
\fB\-r\fR\~\fIramp_factor\fR
In \fBrepeats.pl\fR, stage\~3 is run repeatedly in place of stage\~4, with the
number of bytes read in each round being multiplied by the "ramp factor" value.
The default value is\~\fI4\fR.
.TP
\fB\-v\fR
Verbose output.  Write some statistics concerning number of potential
duplicates found at each stage to \fIstderr\fR.
.TP
\fB\-z\fR
Include even zero-length files in the search.  If there is more than one
zero-length file, all of those files will be considered duplicates.
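.PP
The interaction of \fB\-m\fR and \fB\-r\fR in \fBrepeats.pl\fR can be pictured
with the following Python sketch.  It is an assumption about the general shape
of the ramp, not a transcription of \fBrepeats.pl\fR: the name
\fIramp_sizes\fR is hypothetical, the defaults of 4096 bytes and a factor of 4
are taken from this page, and capping the last round at the file size is a
guess at the natural stopping point.
.PP
.nf
```python
def ramp_sizes(file_size, start=4096, factor=4):
    """Yield the number of leading bytes hashed in each successive round."""
    if start < 1 or factor < 2:
        raise ValueError("start must be >= 1 and factor must be >= 2")
    n = start
    while n < file_size:
        yield n
        n *= factor  # each round reads `factor` times as many bytes
    yield file_size  # final round covers the entire file


if __name__ == "__main__":
    print(list(ramp_sizes(100000)))
```
.fi
.PP
Files that differ early drop out after a small read, so only files that keep
matching are ever read to the end.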
.SH NOTES
If no directory is specified, the current directory is assumed.

In terms of program history, the \fBrepeats\fR utility was written first (in
2004).  The \fBrepeats.pl\fR utility was written later (in 2020) to explore new
algorithms, and it currently implements a multi-step stage\~3 algorithm that
requires less disk I/O than \fBrepeats\fR.  It still runs slightly slower than
\fBrepeats\fR on \fILinux\fR for most data sets but is actually faster on
\fICygwin\fR.
.SH BUGS
It is theoretically possible (though freakishly improbable) for two different
files to be listed as duplicates if the \fB\-p\fR option is not used.  If they
have the same size and the same file hash, they will be listed as duplicates.
The odds against two different files (of the same size) being listed as
duplicates are approximately 1.16e77 to 1 for the SHA256 hash.  Using
arguments similar to the classic "birthday paradox" (i.e., the probability of
two people sharing the same birthday in a room of only 23 people is greater
than 50%), it can be shown that it would take approximately 4.01e38 different
files (of exactly the same size) for the probability of a false positive to
reach 50%.  In other words, it'll probably never happen.  Ever.  However, it's
not inconceivable.  You have been warned.

For the various hashes, the number of same-sized files required for the
probability of a false positive to reach 50% is as follows:

MD5:          2.17e19 files
.br
SHA1:         1.42e24 files
.br
SHA224:       6.11e33 files
.br
SHA256:       4.01e38 files  (default prior to version 1.2.0)
.br
SHA384:       7.39e57 files
.br
SHA512:       1.36e77 files
.br
BLAKE2B-256:  4.01e38 files
.br
BLAKE2B-512:  1.36e77 files  (default as of version 1.2.0)

See \fIhttps://en.wikipedia.org/wiki/Birthday_problem\fR and
\fIhttps://en.wikipedia.org/wiki/Birthday_attack\fR for more information.  If
this extremely remote risk is too much to bear, use the \fB\-p\fR option.
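.PP
The figures in the table above are consistent with the usual birthday-bound
approximation, n = sqrt(2 * ln(2) * 2^b), where b is the digest width in bits.
The Python sketch below recomputes them under that assumption; the names
\fIDIGEST_BITS\fR and \fIfiles_for_half_chance\fR are hypothetical, and the
sketch treats every hash as an ideal uniform function.
.PP
.nf
```python
import math

# Digest widths in bits for the algorithms selectable with -a.
DIGEST_BITS = {
    "MD5": 128, "SHA1": 160, "SHA224": 224, "SHA256": 256,
    "SHA384": 384, "SHA512": 512, "BLAKE2B-256": 256, "BLAKE2B-512": 512,
}


def files_for_half_chance(bits):
    """Approximate file count at which P(some collision) reaches 50%."""
    return math.sqrt(2.0 * math.log(2.0) * 2.0 ** bits)


if __name__ == "__main__":
    for name, bits in DIGEST_BITS.items():
        print(f"{name + ':':13s} {files_for_half_chance(bits):.2e} files")
```
.fi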
.PP
Also, \fBrepeats\fR and \fBrepeats.pl\fR currently lack logic to mark hardlinks
(under the \fB\-l\fR option) as duplicates without actually reading the entire
file multiple times.  This will be addressed in a future version of
\fIlittleutils\fR.

And finally, \fBrepeats\fR will malfunction if asked to examine files that have
one or more "tab" (0x09) characters in the filename, as tab characters are used
as delimiters in the temporary working files that \fBrepeats\fR creates.  If
scanning a data set with embedded "tabs" in the filenames, use \fBrepeats.pl\fR
instead, as it maintains file lists in memory.
.SH "SEE ALSO"
\fBfilehash\fR(1), \fBfilenode\fR(1), \fBfilesize\fR(1), \fBperl\fR(1),
\fBCryptX\fR(3pm), \fBrep_hash\fR(1), \fBrep_node\fR(1), \fBrep_size\fR(1),
\fBduff\fR(1), \fBdupd\fR(1), \fBfdupes\fR(1), \fBjdupes\fR(1), \fBrdfind\fR(1)
.SH COPYRIGHT
Copyright (C) 2004-2021 by Brian Lindholm.  This program is free software; you
can use it, redistribute it, and/or modify it under the terms of the GNU
General Public License as published by the Free Software Foundation; either
version 3, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU General Public License for more details.