.\" Source: littleutils-1.2.4/repeats/repeats.1 (28 Mar 2021)
.TH REPEATS 1 "2021 Jan 22" littleutils
.SH NAME
repeats, repeats.pl \- search for duplicate files
.SH SYNOPSIS
.SH DESCRIPTION
\fBrepeats\fR (written in \fBC\fR and \fBsh\fR) and \fBrepeats.pl\fR (written
in \fBperl\fR and utilizing routines from the \fBCryptX\fR module) both search
for duplicate files in one or more specified directories, using a three-,
four-, or five-stage process. This process works as follows:
.PP
Initially, all files in the specified directories (and all of their
subdirectories) are listed as potential duplicates. In the first stage, all
files with a unique filesize are declared unique and are removed from the list.
In the optional second stage, any files that are hardlinks to another file on
the list are removed, since they take up no additional disk space. In the
third stage, all files for which the first 65536 (for \fBrepeats\fR) or 4096
(for \fBrepeats.pl\fR) bytes (both adjustable with the \fB\-m\fR option) yield
a unique filehash are declared unique and are removed from the list. In the
fourth stage, all files with a unique filehash (for the entire file) are
declared unique and are removed from the list. And in the optional fifth
stage, all files with matching filehashes are compared using \fBcmp\fR and are
printed to stdout if they match.
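The staged pruning described above can be sketched in Python (an illustrative
model only, with hypothetical function names; the real tools are built from
the littleutils helpers and stream their lists through temporary files, and
stages 2 and 5 are omitted here for brevity):

```python
import hashlib
import os

def find_repeats(paths, partial_bytes=65536):
    """Sketch of the staged duplicate search: files are pruned as soon
    as any stage proves them unique."""
    # Stage 1: group by file size; a unique size cannot be a duplicate.
    by_size = {}
    for p in paths:
        by_size.setdefault(os.path.getsize(p), []).append(p)
    survivors = [g for g in by_size.values() if len(g) > 1]

    def hash_of(path, limit=None):
        h = hashlib.blake2b()  # BLAKE2B-512, the default as of 1.2.0
        with open(path, 'rb') as f:
            h.update(f.read(limit))
        return h.hexdigest()

    groups = []
    for group in survivors:
        # Stage 3: hash only the first partial_bytes of each file.
        part = {}
        for p in group:
            part.setdefault(hash_of(p, partial_bytes), []).append(p)
        for g in part.values():
            if len(g) < 2:
                continue
            # Stage 4: a full-file hash settles the remaining candidates.
            full = {}
            for p in g:
                full.setdefault(hash_of(p), []).append(p)
            groups.extend(gg for gg in full.values() if len(gg) > 1)
    return groups
```

Each stage only ever hashes files that survived the cheaper stages before it,
which is why this is far less disk and CPU intensive than hashing everything.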
.PP
This process is MUCH less disk and CPU intensive than creating full hashes for
all files. It is implemented using a combination of the \fBfilehash\fR,
\fBfilenode\fR, \fBfilesize\fR, \fBrep_hash\fR, \fBrep_node\fR, \fBrep_size\fR,
and \fBtempname\fR utilities. The \fBduff\fR, \fBdupd\fR, \fBfdupes\fR,
\fBjdupes\fR, and \fBrdfind\fR commands utilize similar strategies.
.SH OPTIONS
.PP
Print each set of duplicate files on a single line. This option is available
only in \fBrepeats.pl\fR.
.PP
Specify which hash algorithm should be used. Choices are \fI1\fR\~(MD5),
\fI2\fR\~(SHA1), \fI3\fR\~(SHA224), \fI4\fR\~(SHA256), \fI5\fR\~(SHA384),
\fI6\fR\~(SHA512), \fI7\fR\~(BLAKE2B-256), and \fI8\fR\~(BLAKE2B-512). The
default is\~\fI8\fR, for BLAKE2B-512 hashes.
.PP
Print help and quit.
.PP
List files that are actually hardlinks as duplicates. Normally, only the first
hardlink sharing an i-node number is included as a possible repeat. [This
skips stage\~2.]
.PP
Specify the number of bytes read per file in stage\~3.
.PP
Perform a final \fBcmp\fR-based "paranoia" check to absolutely ensure that
listed duplicates are truly duplicates. Using this option can result in each
duplicate being read completely two or three times, which can substantially
increase execution time when duplicates of large files are present. [This is
stage\~5 and is only available in \fBrepeats\fR.]
.PP
In \fBrepeats.pl\fR, stage\~3 is run repeatedly in place of stage\~4, with the
number of bytes read in each round being multiplied by the "ramp rate" value.
The default value is\~\fI4\fR.
.PP
Verbose output. Write some statistics concerning the number of potential
duplicates found at each stage to \fIstderr\fR.
.PP
Include even zero-length files in the search. If there is more than one
zero-length file, all of those files will be considered duplicates.
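The stage\~2 hardlink pruning that the hardlink-listing option disables
amounts to keeping one path per (device, i-node) pair; a minimal sketch
(illustrative only, with a hypothetical function name):

```python
import os

def drop_extra_hardlinks(paths):
    """Keep only the first path seen for each (device, inode) pair;
    additional hardlinks occupy no extra disk space, so they are not
    counted as possible repeats."""
    seen = set()
    kept = []
    for p in paths:
        st = os.stat(p)
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept
```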
.SH NOTES
If no directory is specified, the current directory is assumed.
.PP
In terms of program history, the \fBrepeats\fR utility was written first (in
2004). The \fBrepeats.pl\fR utility was written later (in 2020) to explore new
algorithms, and it currently implements a multi-step stage\~3 algorithm that
requires less disk I/O than \fBrepeats\fR. It still runs slightly slower than
\fBrepeats\fR on \fILinux\fR for most data sets but is actually faster on
.SH BUGS
It must be noted that it is theoretically possible (though freakishly
improbable) for two different files to be listed as duplicates if the \fB\-p\fR
option is not used. If they have the same size and the same filehash, they
will be listed as duplicates. The odds against two different files (of the
same size) being listed as duplicates are approximately 1.16e77 to 1 for the
SHA256 hash. Using arguments similar to the classic "birthday paradox" (i.e.,
the probability of two people sharing the same birthday in a room of only 23
people is greater than 50%), it can be shown that it would take approximately
4.01e38 different files (of exactly the same size) before the probability of a
collision reaches 50%. In other words, it'll probably never happen. Ever.
However, it's not inconceivable. You have been warned.
.PP
For the various hashes, the number of same-sized files required for the
probability of a false positive to reach 50% is as follows:
.PP
MD5: 2.17e19 files
.br
SHA1: 1.42e24 files
.br
SHA224: 6.11e33 files
.br
SHA256: 4.01e38 files (default prior to version 1.2.0)
.br
SHA384: 7.39e57 files
.br
SHA512: 1.36e77 files
.br
BLAKE2B-256: 4.01e38 files
.br
BLAKE2B-512: 1.36e77 files (default as of version 1.2.0)
.PP
See \fIhttps://en.wikipedia.org/wiki/Birthday_problem\fR and
\fIhttps://en.wikipedia.org/wiki/Birthday_attack\fR for more information. If
this extremely remote risk is too much to bear, use the \fB\-p\fR option.
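The figures above follow from the birthday bound: roughly sqrt(2 ln 2)
multiplied by 2^(bits/2) files are needed before a collision becomes a coin
flip. A quick illustrative check (not part of the tools):

```python
import math

def files_for_50pct_collision(hash_bits):
    """Approximate number of equal-sized random files needed before the
    probability of at least one hash collision reaches 50%, via the
    birthday bound: n ~ sqrt(2 * ln 2) * 2**(bits / 2)."""
    return math.sqrt(2 * math.log(2)) * 2 ** (hash_bits / 2)

# MD5 (128 bits)  -> about 2.17e19 files
# SHA256 (256 bits) -> about 4.01e38 files
# BLAKE2B-512 (512 bits) -> about 1.36e77 files, matching the list above
```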
.PP
Also, \fBrepeats\fR and \fBrepeats.pl\fR currently lack logic to mark hardlinks
(under the \fB\-l\fR option) as duplicates without actually reading the entire
file multiple times. This will be addressed in a future version of
\fBlittleutils\fR.
.PP
And finally, \fBrepeats\fR will malfunction if asked to examine files that have
one or more "tab" (0x09) characters in the filename, as tab characters are used
as delimiters in the temporary working files that \fBrepeats\fR creates. If
scanning a data set with embedded "tabs" in the filenames, use \fBrepeats.pl\fR
instead, as it maintains file lists in memory.
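The tab restriction arises because a tab-delimited record of the form
size<TAB>filename splits incorrectly when the filename itself contains a tab;
a toy illustration of the failure mode:

```python
# A tab inside the filename shifts every later field in a
# tab-delimited record, so the filename can no longer be recovered.
record = "25000\t" + "odd\tname.txt"   # size, then a tab-bearing filename
fields = record.split("\t")
# fields == ['25000', 'odd', 'name.txt'] -- three fields where two
# were expected, so the filename comes back mangled.
```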
.SH "SEE ALSO"
\fBfilehash\fR(1), \fBfilenode\fR(1), \fBfilesize\fR(1), \fBperl\fR(1),
\fBCryptX\fR(3pm), \fBrep_hash\fR(1), \fBrep_node\fR(1), \fBrep_size\fR(1),
\fBduff\fR(1), \fBdupd\fR(1), \fBfdupes\fR(1), \fBjdupes\fR(1), \fBrdfind\fR(1)
.SH COPYRIGHT
Copyright (C) 2004-2021 by Brian Lindholm. This program is free software; you
can use it, redistribute it, and/or modify it under the terms of the GNU
General Public License as published by the Free Software Foundation; either
version 3, or (at your option) any later version.
.PP
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.