"Fossies" - the Fresh Open Source Software Archive

Member "littleutils-1.2.4/repeats/repeats.1" (28 Mar 2021, 7101 Bytes) of package /linux/privat/littleutils-1.2.4.tar.lz:


Caution: As a special service "Fossies" has tried to format the requested manual source page into HTML format but links to other man pages may be missing or even erroneous. Alternatively you can here view or download the uninterpreted manual source code. A member file download can also be achieved by clicking within a package contents listing on the according byte size field. See also the latest Fossies "Diffs" side-by-side code changes report for "repeats.1": 1.2.3_vs_1.2.4.

REPEATS

NAME

repeats and repeats.pl − search for duplicate files

SYNOPSIS

repeats [−a hash_algorithm] [−h(elp)] [−l(inks_hard)] [−m bytes_for_partial] [−p(aranoid)] [−v(erbose)] [−z(ero_include)] [directory...]

repeats.pl [−1(_line_output)] [−a hash_algorithm] [−h(elp)] [−l(inks_hard)] [−m bytes_for_partial] [−r ramp_factor] [−v(erbose)] [−z(ero_include)] [directory...]

DESCRIPTION

repeats (written in C and sh) and repeats.pl (written in Perl and utilizing routines from the CryptX module) both search for duplicate files in one or more specified directories, using a three-, four-, or five-stage process.

Initially, all files in the specified directories (and all of their subdirectories) are listed as potential duplicates. The stages then work as follows:

1. All files with a unique filesize are declared unique and are removed from the list.

2. (optional) Any files which are actually hardlinks to another file are removed, since they don’t actually take up any more disk space.

3. All files for which the first 65536 (for repeats) or 4096 (for repeats.pl) bytes (both adjustable with the −m option) have a unique filehash are declared unique and are removed from the list.

4. All files which have a unique filehash (for the entire file) are declared unique and are removed from the list.

5. (optional) All files with matching filehashes are compared using cmp and are printed to stdout if they match.

This process is MUCH less disk and CPU intensive than creating full hashes for all files. It is implemented using a combination of the filehash, filenode, filesize, rep_hash, rep_node, rep_size, and tempname utilities. The duff, dupd, fdupes, jdupes, and rdfind commands utilize similar strategies.
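
As a rough illustration of the size-then-hash strategy, here is a minimal sh sketch built from standard GNU tools that performs only stages 1 and 4 (it is not the actual implementation, and like repeats it mishandles filenames containing tabs or newlines):

    # Stage 1: record sizes and paths, then keep only files whose
    # size occurs more than once.
    find . -type f -printf '%s\t%p\n' > /tmp/sizes
    awk -F'\t' 'NR == FNR { n[$1]++; next } n[$1] > 1 { print $2 }' \
        /tmp/sizes /tmp/sizes > /tmp/candidates
    # Stage 4: hash the surviving candidates and print groups of
    # files whose full-file hashes match.
    xargs -r -d '\n' sha256sum < /tmp/candidates |
        sort | uniq -w64 --all-repeated=separate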

OPTIONS

−1

Print each set of duplicate files on a single line. This option is available only in repeats.pl.

−a hash_algorithm

Specify which hash algorithm should be used. Choices are 1 (MD5), 2 (SHA1), 3 (SHA224), 4 (SHA256), 5 (SHA384), 6 (SHA512), 7 (BLAKE2B-256), and 8 (BLAKE2B-512). The default is 8, for BLAKE2B-512 hashes.

−h

Print help and quit.

−l

List files that are actually hardlinks as duplicates. Normally, only the first hardlink sharing an i-node number is included as a possible repeat. [This skips stage 2.]

−m bytes_for_partial

Specify the number of bytes read per file in stage 3 (by default, 65536 for repeats and 4096 for repeats.pl).

−p

Perform a final cmp-based "paranoia" check to absolutely ensure that listed duplicates are truly duplicates. Using this option can result in each duplicate being read completely two or three times, which can substantially increase execution time when duplicates of large files are present. [This is stage 5 and is only available in repeats.]

−r ramp_factor

In repeats.pl, stage 3 is run repeatedly in place of stage 4, with the number of bytes read in each round being multiplied by the ramp factor. The default value is 4; see the illustration after this option list.

−v

Verbose output. Write some statistics concerning the number of potential duplicates found at each stage to stderr.

−z

Include even zero-length files in the search. If there is more than one zero-length file, all of those files will be considered duplicates.
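
As an illustration of how the −m and −r values interact (a sketch of the arithmetic only, not code from repeats.pl itself):

    # With the repeats.pl default of -m 4096 and the default ramp
    # factor of 4, successive partial-hash rounds read more of each file:
    m=4096 r=4
    for round in 1 2 3 4; do
        echo "round $round: $m bytes"
        m=$((m * r))
    done
    # Prints 4096, 16384, 65536, and 262144 bytes for rounds 1-4.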

NOTES

If no directory is specified, the current directory is assumed.
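
For example (the directory paths here are hypothetical):

    # Scan two directory trees verbosely, with the final cmp-based check:
    repeats -p -v /home/user/photos /backup/photos

    # The same scan with repeats.pl, printing each duplicate set on one line:
    repeats.pl -1 -v /home/user/photos /backup/photos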

In terms of program history, the repeats utility was written first (in 2004). The repeats.pl utility was written later (in 2020) to explore new algorithms; it currently implements a multi-step stage 3 algorithm that requires less disk I/O than repeats. It still runs slightly slower than repeats on Linux for most data sets, but is actually faster on Cygwin.

BUGS

It must be noted that it is theoretically possible (though freakishly improbable) for two different files to be listed as duplicates if the −p option is not used: if they have the same size and the same filehash, they will be listed as duplicates. The odds of two particular different files (of the same size) being listed as duplicates are approximately 1 in 1.16e77 for the SHA256 hash. Using arguments similar to the classic "birthday paradox" (i.e., the probability of two people sharing the same birthday in a room of only 23 people is greater than 50%), it can be shown that approximately 4.01e38 different files (of exactly the same size) would be needed before the probability of at least one false match exceeds 50%. In other words, it’ll probably never happen. Ever. However, it’s not inconceivable. You have been warned.

For the various hashes, the number of same-sized files required for the probability of a false positive to reach 50% is as follows:

MD5: 2.17e19 files
SHA1: 1.42e24 files
SHA224: 6.11e33 files
SHA256: 4.01e38 files (default prior to version 1.2.0)
SHA384: 7.39e57 files
SHA512: 1.36e77 files
BLAKE2B-256: 4.01e38 files
BLAKE2B-512: 1.36e77 files (default as of version 1.2.0)
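
These thresholds follow from the standard birthday-bound approximation: for a b-bit hash, roughly sqrt(2 * ln(2) * 2^b) same-sized files are needed before the collision probability reaches 50%. A quick illustrative check of the table values (not part of the littleutils tools):

    # Files needed for a 50% collision chance with 128-, 256-, and
    # 512-bit hashes (MD5; SHA256/BLAKE2B-256; SHA512/BLAKE2B-512):
    awk 'BEGIN { for (b = 128; b <= 512; b *= 2)
        printf "%3d bits: %.2e files\n", b, sqrt(2 * log(2) * 2^b) }'
    # Prints 2.17e+19, 4.01e+38, and 1.36e+77, matching the list above.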

See https://en.wikipedia.org/wiki/Birthday_problem and https://en.wikipedia.org/wiki/Birthday_attack for more information. If this extremely remote risk is too much to bear, use the −p option.

Also, repeats and repeats.pl currently lack logic to mark hardlinks (under the −l option) as duplicates without actually reading the entire file multiple times. This will be addressed in a future version of littleutils.

And finally, repeats will malfunction if asked to examine files that have one or more "tab" (0x09) characters in the filename, as tab characters are used as delimiters in the temporary working files that repeats creates. If scanning a data set with embedded "tabs" in the filenames, use repeats.pl instead, as it maintains file lists in memory.
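
To check whether a directory tree contains such filenames before choosing a tool, a one-liner along these lines works ("/some/dir" is a placeholder):

    # List any files whose names contain a tab character:
    find /some/dir -name "$(printf '*\t*')"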

SEE ALSO

filehash(1), filenode(1), filesize(1), perl(1), CryptX(3pm), rep_hash(1), rep_node(1), rep_size(1), duff(1), dupd(1), fdupes(1), jdupes(1), rdfind(1)

COPYRIGHT

Copyright (C) 2004-2021 by Brian Lindholm. This program is free software; you can use it, redistribute it, and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.