"Fossies" - the Fresh Open Source Software Archive

Member "xdelta3-3.0.11/draft-korn-vcdiff.txt" (24 Mar 2015, 60706 Bytes) of package /linux/misc/xdelta3-3.0.11.tar.gz:


As a special service "Fossies" has tried to format the requested text file into HTML format (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

    1                                                      David G. Korn, AT&T Labs
    2 				             Joshua P. MacDonald, UC Berkeley
    3                                                  Jeffrey C. Mogul, Compaq WRL
    4 Internet-Draft                                       Kiem-Phong Vo, AT&T Labs
    5 Expires: 09 November 2002                                    09 November 2001
    6 
    7 
    8         The VCDIFF Generic Differencing and Compression Data Format
    9 
   10                          draft-korn-vcdiff-06.txt
   11 
   12 
   13 
   14 Status of this Memo
   15 
   16     This document is an Internet-Draft and is in full conformance
   17     with all provisions of Section 10 of RFC2026.
   18 
   19     Internet-Drafts are working documents of the Internet Engineering
   20     Task Force (IETF), its areas, and its working groups.  Note that
   21     other groups may also distribute working documents as
   22     Internet-Drafts.
   23 
   24     Internet-Drafts are draft documents valid for a maximum of six
   25     months and may be updated, replaced, or obsoleted by other
   26     documents at any time.  It is inappropriate to use Internet-
   27     Drafts as reference material or to cite them other than as
   28     "work in progress."
   29 
   30     The list of current Internet-Drafts can be accessed at
   31     http://www.ietf.org/ietf/1id-abstracts.txt
   32 
   33     The list of Internet-Draft Shadow Directories can be accessed at
   34     http://www.ietf.org/shadow.html.
   35 
   36 
   37 Abstract
   38 
   39     This memo describes a general, efficient and portable data format
   40     suitable for encoding compressed and/or differencing data so that
   41     they can be easily transported among computers.
   42 
   43 
   44 Table of Contents:
   45 
   46     1.  EXECUTIVE SUMMARY ............................................  2
   47     2.  CONVENTIONS ..................................................  3
   48     3.  DELTA INSTRUCTIONS ...........................................  4
   49     4.  DELTA FILE ORGANIZATION ......................................  5
   50     5.  DELTA INSTRUCTION ENCODING ...................................  9
   51     6.  DECODING A TARGET WINDOW ..................................... 14
   52     7.  APPLICATION-DEFINED CODE TABLES .............................. 16
   53     8.  PERFORMANCE .................................................. 16
   54     9.  FURTHER ISSUES ............................................... 17
   55    10.  SUMMARY ...................................................... 18
   56    11.  ACKNOWLEDGEMENTS ............................................. 18
   57    12.  SECURITY CONSIDERATIONS ...................................... 18
   58    13.  SOURCE CODE AVAILABILITY ..................................... 18
   59    14.  INTELLECTUAL PROPERTY RIGHTS ................................. 18
   60    15.  IANA CONSIDERATIONS .......................................... 19
   61    16.  REFERENCES ................................................... 19
   62    17.  AUTHOR'S ADDRESS ............................................. 20
   63 
   64 
   65 1.  EXECUTIVE SUMMARY
   66 
   67     Compression and differencing techniques can greatly improve storage
   68     and transmission of files and file versions.  Since files are often
   69     transported across machines with distinct architectures and performance
   70     characteristics, such data should be encoded in a form that is portable
   71     and can be decoded with little or no knowledge of the encoders.
   72     This document describes Vcdiff, a compact portable encoding format
   73     designed for these purposes.
   74 
   75     Data differencing is the process of computing a compact and invertible
   76     encoding of a "target file" given a "source file".  Data compression
   77     is similar but without the use of source data.  The UNIX utilities diff,
   78     compress, and gzip are well-known examples of data differencing and
   79     compression tools.  For data differencing, the computed encoding is
   80     called a "delta file", and, for data compression, it is called
   81     a "compressed file".  Delta and compressed files are good for storage
   82     and transmission as they are often smaller than the originals.
   83 
   84     Data differencing and data compression are traditionally treated
   85     as distinct types of data processing.  However, as shown in the Vdelta
   86     technique by Korn and Vo [1], compression can be thought of as a special
   87     case of differencing in which the source data is empty. The basic idea
   88     is to unify the string parsing scheme used in the Lempel-Ziv'77 style
   89     compressors [2], and the block-move technique of Tichy [3].  Loosely
   90     speaking, this works as follows:
   91 
   92         a. Concatenate source and target data.
   93         b. Parse the data from left to right as in LZ'77 but
   94 	   make sure that a parsed segment starts the target data.
   95         c. Start to output when reaching target data.
   96 
   97     Parsing is based on string matching algorithms such as suffix trees [4]
   98     or hashing with different time and space performance characteristics.
   99     Vdelta uses a fast string matching algorithm that requires less memory
  100     than other techniques [5,6].  However, even with this algorithm, the
  101     memory requirement can still be prohibitive for large files.  A common
  102     way to deal with memory limitation is to partition an input file into
  103     chunks called "windows" and process them separately. Here, except for
  104     unpublished work by Vo, little has been done on designing effective
  105     windowing schemes. Current techniques, including Vdelta, simply use
  106     source and target windows with corresponding addresses across source
  107     and target files.
  108 
  109     String matching and windowing algorithms have large influence on the
  110     compression rate of delta and compressed files. However, it is desirable
  111     to have a portable encoding format that is independent of such algorithms.
  112     This enables construction of client-server applications in which a server
  113     may serve clients with unknown computing characteristics.  Unfortunately,
  114     all current differencing and compressing tools, including Vdelta, fall
  115     short in this respect. Their storage formats are closely intertwined
  116     with the implemented string matching and/or windowing algorithms.
  117 
  118     The encoding format Vcdiff proposed here addresses the above issues.
  119     Vcdiff achieves the below characteristics:
  120 
  121 	Output compactness:
  122             The basic encoding format compactly represents compressed or delta
  123 	    files. Applications can further extend the basic encoding format
  124 	    with "secondary encoders" to achieve more compression.
  125 
  126 	Data portability:
  127 	    The basic encoding format is free from machine byte order and
  128 	    word size issues. This allows data to be encoded on one machine
  129 	    and decoded on a different machine with different architecture.
  130 
  131     	Algorithm genericity:
  132 	    The decoding algorithm is independent from string matching and
  133 	    windowing algorithms. This allows competition among implementations
  134 	    of the encoder while keeping the same decoder.
  135 
  136     	Decoding efficiency:
  137 	    Except for secondary encoder issues, the decoding algorithm runs
  138 	    in time proportional to the size of the target file and uses space
  139 	    proportional to the maximal window size.  Vcdiff differs from more
  140 	    conventional compressors in that it uses only byte-aligned
  141 	    data, thus avoiding bit-level operations, which improves
  142 	    decoding speed at the slight cost of compression efficiency.
  143 
  144     The Vcdiff data format and the algorithms for decoding data shall be
  145     described next.  Since Vcdiff treats compression as a special case of
  146     differencing, we shall use the term "delta file" to indicate the
  147     compressed output for both cases.
  148 
  149 
  150 2. CONVENTIONS
  151 
  152     The basic data unit is a byte.  For portability, Vcdiff shall limit
  153     a byte to its lower eight bits even on machines with larger bytes.
  154     The bits in a byte are ordered from right to left so that the least
  155     significant bit (LSB) has value 1, and the most significant bit (MSB),
  156     has value 128.
  157 
  158     For purposes of exposition in this document, we adopt the convention
  159     that the LSB is numbered 0, and the MSB is numbered 7.  Bit numbers
  160     never appear in the encoded format itself.
  161 
  162     Vcdiff encodes unsigned integer values using a portable variable-sized
  163     format (originally introduced in the Sfio library [7]). This encoding
  164     treats an integer as a number in base 128. Then, each digit in this
  165     representation is encoded in the lower seven bits of a byte. Except for
  166     the least significant byte, other bytes have their most significant bit
  167     turned on to indicate that there are still more digits in the encoding.
  168     The two key properties of this integer encoding that are beneficial
  169     to a data compression format are:
  170 
  171 	a. The encoding is portable among systems using 8-bit bytes, and
  172         b. Small values are encoded compactly.
  173 
  174     For example, consider the value 123456789 which can be represented with
  175     four 7-bit digits whose values are 58, 111, 26, 21 in order from most
  176     to least significant. Below is the 8-bit byte encoding of these digits.
  177     Note that the MSBs of 58, 111 and 26 are on.
  178 
  179                  +-------------------------------------------+
  180                  | 10111010 | 11101111 | 10011010 | 00010101 |
  181                  +-------------------------------------------+
  182                    MSB+58     MSB+111    MSB+26     0+21
  183 
  184 
  185     Henceforth, the terms "byte" and "integer" will refer to a byte and an
  186     unsigned integer as described.
  187 
  188 
  189     From time to time, algorithms are exhibited to clarify the descriptions
  190     of parts of the Vcdiff format. On such occasions, the C language will be
  191     used to make precise the algorithms.  The C code shown in this
  192     document is meant for clarification only, and is not part of the
  193     actual specification of the Vcdiff format.
  194 
  195     In this specification, the key words "MUST", "MUST NOT",
  196     "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as
  197     described in RFC2119 [12].
  198 
  199 
  200 3.  DELTA INSTRUCTIONS
  201 
  202     A large target file is partitioned into non-overlapping sections
  203     called "target windows". These target windows are processed separately
  204     and sequentially based on their order in the target file.
  205 
  206     A target window T of length t may be compared against some source data
  207     segment S of length s. By construction, this source data segment S
  208     comes either from the source file, if one is used, or from a part of
  209     the target file earlier than T.  In this way, during decoding, S is
  210     completely known when T is being decoded.
  211 
  212     The choices of T, t, S and s are made by some window selection algorithm
  213     which can greatly affect the size of the encoding. However, as seen later,
  214     these choices are encoded so that no knowledge of the window selection
  215     algorithm is needed during decoding.
  216 
  217     Assume that S[j] represents the jth byte in S, and T[k] represents
  218     the kth byte in T.  Then, for the delta instructions, we treat the data
  219     windows S and T as substrings of a superstring U formed by concatenating
  220     them like this:
  221 
  222         S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]
  223 
  224     The "address" of a byte in S or T is referred to by its location in U.
  225     For example, the address of T[k] is s+k.
  226 
  227     The instructions to encode and direct the reconstruction of a target
  228     window are called delta instructions. There are three types:
  229 
  230 	ADD: This instruction has two arguments, a size x and a sequence of
  231 	    x bytes to be copied.
  232 	COPY: This instruction has two arguments, a size x and an address p
  233 	    in the string U. The arguments specify the substring of U that
  234 	    must be copied. We shall assert that such a substring must be
  235 	    entirely contained in either S or T.
  236 	RUN: This instruction has two arguments, a size x and a byte b that
  237 	    will be repeated x times.
  238 
  239     Below are example source and target windows and the delta instructions
  240     that encode the target window in terms of the source window.
  241 
  242         a b c d e f g h i j k l m n o p
  243         a b c d w x y z e f g h e f g h e f g h e f g h z z z z
  244 
  245         COPY  4, 0
  246         ADD   4, w x y z
  247         COPY  4, 4
  248         COPY 12, 24
  249 	RUN   4, z
  250 
  251 
  252     Thus, the first letter 'a' in the target window is at location 16
  253     in the superstring. Note that the fourth instruction, "COPY 12, 24",
  254     copies data from T itself since address 24 is position 8 in T.
  255     This instruction also shows that it is fine to overlap the data to be
  256     copied with the data being copied from as long as the latter starts
  257     earlier. This enables efficient encoding of periodic sequences,
  258     i.e., sequences with regularly repeated subsequences. The RUN instruction
  259     is a compact way to encode a sequence repeating the same byte even though
  260     such a sequence can be thought of as a periodic sequence with period 1.
  261 
  262     To reconstruct the target window, one simply processes one delta
  263     instruction at a time and copy the data either from the source window
  264     or the being reconstructed target window based on the type of the
  265     instruction and the associated address, if any.
  266 
  267 
  268 4.  DELTA FILE ORGANIZATION
  269 
  270     A Vcdiff delta file starts with a Header section followed by a sequence
  271     of Window sections. The Header section includes magic bytes to identify
  272     the file type, and information concerning data processing beyond the
  273     basic encoding format. The Window sections encode the target windows.
  274 
  275     Below is the overall organization of a delta file. The indented items
  276     refine the ones immediately above them. An item in square brackets may
  277     or may not be present in the file depending on the information encoded
  278     in the Indicator byte above it.
  279 
  280         Header
  281 	    Header1                                  - byte
  282 	    Header2                                  - byte
  283 	    Header3                                  - byte
  284 	    Header4                                  - byte
  285 	    Hdr_Indicator                            - byte
  286 	    [Secondary compressor ID]                - byte
  287 
  288 [@@@ Why is compressor ID not an integer? ]
  289 [@@@ If we aren't defining any secondary compressors yet, then it seems
  290 that defining the [Secondary compressor ID] and the corresponding
  291 VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value.  An
  292 implementation of this specification won't be able to decode a VCDIFF
  293 encoded with this option if it doesn't know about any secondary
  294 compressors.  It seems that you should specify the bits related to
  295 secondary compressors once you have defined the first a secondary
  296 compressor.  I can imagine a secondary-compressor might want to supply
  297 extra information, such as a dictionary of some kind, in which case
  298 this speculative treatment wouldn't go far enough.]
  299 
  300 	    [Length of code table data]              - integer
  301 	    [Code table data]
  302 	      	Size of near cache                   - byte
  303 	        Size of same cache                   - byte
  304 	        Compressed code table data
  305 	Window1
  306 	    Win_Indicator                            - byte
  307 	    [Source segment size]                    - integer
  308 	    [Source segment position]                - integer
  309             The delta encoding of the target window
  310 	        Length of the delta encoding         - integer
  311 	        The delta encoding
  312 	            Size of the target window        - integer
  313 	            Delta_Indicator                  - byte
  314 	            Length of data for ADDs and RUNs - integer
  315 	            Length of instructions and sizes - integer
  316 	            Length of addresses for COPYs    - integer
  317 	            Data section for ADDs and RUNs   - array of bytes
  318 	            Instructions and sizes section   - array of bytes
  319 	            Addresses section for COPYs      - array of bytes
  320 	Window2
  321 	...
  322 
  323 
  324 
  325 4.1 The Header Section
  326 
  327     Each delta file starts with a header section organized as below.
  328     Note the convention that square-brackets enclose optional items.
  329 
  330 	    Header1                                  - byte = 0xE6
  331 	    Header2                                  - byte = 0xD3
  332 	    Header3                                  - byte = 0xD4
  333 
  334 HMMM
  335 
  336 0xD6
  337 0xC3
  338 0xC4
  339 
  340 	    Header4                                  - byte
  341 	    Hdr_Indicator                            - byte
  342 	    [Secondary compressor ID]                - byte
  343 	    [Length of code table data]              - integer
  344 	    [Code table data]
  345 
  346     The first three Header bytes are the ASCII characters 'V', 'C' and 'D'
  347     with their most significant bits turned on (in hexadecimal, the values
  348     are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to
  349     zero. In the future, it might be used to indicate the version of Vcdiff.
  350 
  351     The Hdr_Indicator byte shows if there are any initialization data
  352     required to aid in the reconstruction of data in the Window sections.
  353     This byte MAY have non-zero values for either, both, or neither of
  354     the two bits VCD_DECOMPRESS and VCD_CODETABLE below:
  355 
  356 	    7 6 5 4 3 2 1 0
  357 	   +-+-+-+-+-+-+-+-+
  358 	   | | | | | | | | |
  359 	   +-+-+-+-+-+-+-+-+
  360 	                ^ ^
  361 	                | |
  362 	                | +-- VCD_DECOMPRESS
  363 	                +---- VCD_CODETABLE
  364 
  365     If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary
  366     compressor may have been used to further compress certain parts of the
  367     delta encoding data as described in Sections 4.3 and 6. In that case,
  368     the ID of the secondary compressor is given next. If this bit is zero,
  369     the compressor ID byte is not included.
  370 
  371 [@@@ If we aren't defining any secondary compressors yet, then it seems
  372 this bit has no real value yet..]
  373 
  374     If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an
  375     application-defined code table is to be used for decoding the delta
  376     instructions. This table itself is compressed.  The length of the data
  377     comprising this compressed code table and the data follow next. Section 7
  378     discusses application-defined code tables.  If this bit is zero, the code
  379     table data length and the code table data are not included.
  380 
  381     If both bits are set, then the compressor ID byte is included
  382     before the code table data length and the code table data.
  383 
  384 
  385 4.2 The Format of a Window Section
  386 
  387     Each Window section is organized as follows:
  388 
  389 	    Win_Indicator                            - byte
  390 	    [Source segment length]                  - integer
  391 	    [Source segment position]                - integer
  392             The delta encoding of the target window
  393 
  394 
  395     Below are the detail of the various items:
  396 
  397 [@@@ Here, I want to replace the Win_Indicator with a source-count,
  398 followed by source-count length/position pairs?]
  399 
  400         Win_Indicator:
  401 	    This byte is a set of bits, as shown:
  402 
  403 	    7 6 5 4 3 2 1 0
  404 	   +-+-+-+-+-+-+-+-+
  405 	   | | | | | | | | |
  406 	   +-+-+-+-+-+-+-+-+
  407 	                ^ ^
  408 	                | |
  409 	                | +-- VCD_SOURCE
  410 	                +---- VCD_TARGET
  411 
  412 
  413 	    If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment
  414             of data from the "source" file was used as the corresponding
  415             source window of data to encode the target window. The decoder
  416 	    will use this same source data segment to decode the target window.
  417 
  418 	    If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment
  419             of data from the "target" file was used as the corresponding
  420 	    source window of data to encode the target window. As above, this
  421 	    same source data segment is used to decode the target window.
  422 
  423 	    The Win_Indicator byte MUST NOT have more than one of the bits
  424 	    set (non-zero).  It MAY have none of these bits set.
  425 
  426 	    If one of these bits is set, the byte is followed by two
  427             integers to indicate respectively the length and position of
  428             the source data segment in the relevant file.  If the
  429             indicator byte is zero, the target window was compressed
  430             by itself without comparing against another data segment,
  431             and these two integers are not included.
  432 
  433         The delta encoding of the target window:
  434             This contains the delta encoding of the target window either
  435             in terms of the source data segment (i.e., VCD_SOURCE
  436             or VCD_TARGET was set) or by itself if no source window
  437             is specified. This data format is discussed next.
  438 
  439 
  440 4.3 The Delta Encoding of a Target Window
  441 
  442     The delta encoding of a target window is organized as follows:
  443 
  444 	Length of the delta encoding            - integer
  445 	The delta encoding
  446 	    Length of the target window         - integer
  447 	    Delta_Indicator                     - byte
  448 	    Length of data for ADDs and RUNs    - integer
  449 	    Length of instructions section      - integer
  450 	    Length of addresses for COPYs       - integer
  451 	    Data section for ADDs and RUNs      - array of bytes
  452 	    Instructions and sizes section      - array of bytes
  453 	    Addresses section for COPYs         - array of bytes
  454 
  455 
  456 	Length of the delta encoding:
  457 	    This integer gives the total number of remaining bytes that
  458 	    comprise data of the delta encoding for this target window.
  459 
  460         The delta encoding:
  461 	    This contains the data representing the delta encoding which
  462 	    is described next.
  463 
  464     	Length of the target window:
  465 	    This integer indicates the actual size of the target window
  466             after decompression. A decoder can use this value to allocate
  467             memory to store the uncompressed data.
  468 
  469 	Delta_Indicator:
  470 	    This byte is a set of bits, as shown:
  471 
  472 	    7 6 5 4 3 2 1 0
  473 	   +-+-+-+-+-+-+-+-+
  474 	   | | | | | | | | |
  475 	   +-+-+-+-+-+-+-+-+
  476 	              ^ ^ ^
  477 	              | | |
  478 	              | | +-- VCD_DATACOMP
  479 	              | +---- VCD_INSTCOMP
  480 	              +------ VCD_ADDRCOMP
  481 
  482 		VCD_DATACOMP:	bit value 1.
  483 		VCD_INSTCOMP:	bit value 2.
  484 		VCD_ADDRCOMP:	bit value 4.
  485 
  486             As discussed, the delta encoding consists of COPY, ADD and RUN
  487             instructions. The ADD and RUN instructions have accompanying
  488             unmatched data (that is, data that does not specifically match
  489             any data in the source window or in some earlier part of the
  490             target window) and the COPY instructions have addresses of where
  491 	    the matches occur. OPTIONALLY, these types of data MAY be further
  492 	    compressed using a secondary compressor. Thus, Vcdiff separates
  493             the encoding of the delta instructions into three parts:
  494 
  495 	        a. The unmatched data in the ADD and RUN instructions,
  496 	        b. The delta instructions and accompanying sizes, and
  497                 c. The addresses of the COPY instructions.
  498 
  499             If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these
  500             sections may have been compressed using the specified secondary
  501             compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP),
  502             and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that
  503             the corresponding parts are compressed. Then, these parts MUST
  504 	    be decompressed before decoding the delta instructions.
  505 
  506 	Length of data for ADDs and RUNs:
  507 	    This is the length (in bytes) of the section of data storing
  508             the unmatched data accompanying the ADD and RUN instructions.
  509 
  510 	Length of instructions section:
  511 	    This is the length (in bytes) of the delta instructions and
  512             accompanying sizes.
  513 
  514 	Length of addresses for COPYs:
  515 	    This is the length (in bytes) of the section storing
  516             the addresses of the COPY instructions.
  517 
  518     	Data section for ADDs and RUNs:
  519 	    This sequence of bytes encodes the unmatched data for the ADD
  520             and RUN instructions.
  521 
  522 	Instructions and sizes section:
  523 	    This sequence of bytes encodes the instructions and their sizes.
  524 
  525 	Addresses section for COPYs:
  526 	    This sequence of bytes encodes the addresses of the COPY
  527 	    instructions.
  528 
  529 
  530 5. DELTA INSTRUCTION ENCODING
  531 
  532     The delta instructions described in Section 3 represent the results of
  533     string matching. For many data differencing applications in which the
  534     changes between source and target data are small, any straightforward
  535     representation of these instructions would be adequate.  However, for
  536     applications including data compression, it is important to encode
  537     these instructions well to achieve good compression rates.  From our
  538     experience, the following observations can be made:
  539 
  540     a. The addresses in COPY instructions are locations of matches and
  541        often occur close by or even exactly equal to one another. This is
  542        because data in local regions are often replicated with minor changes.
  543        In turn, this means that coding a newly matched address against some
  544        set of recently matched addresses can be beneficial.
  545 
  546     b. The matches are often short in length and separated by small amounts
  547        of unmatched data. That is, the lengths of COPY and ADD instructions
  548        are often small. This is particularly true of binary data such as
  549        executable files or structured data such as HTML or XML. In such cases,
  550        compression can be improved by combining the encoding of the sizes
  551        and the instruction types as well as combining the encoding of adjacent
  552        delta instructions with sufficiently small data sizes.
  553 
  554     The below subsections discuss how the Vcdiff data format provides
  555     mechanisms enabling encoders to use the above observations to improve
  556     compression rates.
  557 
  558 
  559 5.1 Address Encoding Modes of COPY Instructions
  560 
  561     As mentioned earlier, addresses of COPY instructions often occur close
  562     to one another or are exactly equal. To take advantage of this phenomenon
  563     and encode addresses of COPY instructions more efficiently, the Vcdiff
  564     data format supports the use of two different types of address caches.
  565     Both the encoder and decoder maintain these caches, so that decoder's
  566     caches remain synchronized with the encoder's caches.
  567 
  568     a. A "near" cache is an array with "s_near" slots, each containing an
  569        address used for encoding addresses nearby to previously encoded
  570        addresses (in the positive direction only).  The near cache also
  571        maintains a "next_slot" index to the near cache.  New entries to the
  572        near cache are always inserted in the next_slot index, which maintains
  573        a circular buffer of the s_near most recent addresses.
  574 
  575     b. A "same" cache is an array with "s_same" multiple of 256 slots, each
  576        containing an address.  The same cache maintains a hash table of recent
  577        addresses used for repeated encoding of the exact same address.
  578 
  579 
  580     By default, the parameters s_near and s_same are respectively set to 4
  581     and 3. An encoder MAY modify these values, but then it MUST encode the
  582     new values in the encoding itself, as discussed in Section 7, so that
  583     the decoder can properly set up its own caches.
  584 
  585     At the start of processing a target window, an implementation
  586     (encoder or decoder) initializes all of the slots in both caches
  587     to zero.  The next_slot pointer of the near cache is set
  588     to point to slot zero.
  589 
  590     Each time a COPY instruction is processed by the encoder or
  591     decoder, the implementation's caches are updated as follows, where
  592     "addr" is the address in the COPY instruction.
  593 
  594     a. The slot in the near cache referenced by the next_slot
  595        index is set to addr.  The next_slot index is then incremented
  596        modulo s_near.
  597 
  598     b. The slot in the same cache whose index is addr%(s_same*256)
  599        is set to addr. [We use the C notations of % for modulo and
  600        * for multiplication.]
  601 
  602 
  603 5.2 Example code for maintaining caches
  604 
  605     To make clear the above description, below are example cache data
  606     structures and algorithms to initialize and update them:
  607 
  608         typedef struct _cache_s
  609         {
  610 	    int*  near;      /* array of size s_near        */
  611             int   s_near;
  612             int   next_slot; /* the circular index for near */
  613             int*  same;      /* array of size s_same*256    */
  614             int   s_same;
  615         } Cache_t;
  616 
  617         cache_init(Cache_t* ka)
  618         {
  619 	    int   i;
  620 
  621             ka->next_slot = 0;
  622             for(i = 0; i < ka->s_near; ++i)
  623                  ka->near[i] = 0;
  624 
  625             for(i = 0; i < ka->s_same*256; ++i)
  626                  ka->same[i] = 0;
  627         }
  628 
  629         cache_update(Cache_t* ka, int addr)
  630         {
  631 	    if(ka->s_near > 0)
  632             {   ka->near[ka->next_slot] = addr;
  633                 ka->next_slot = (ka->next_slot + 1) % ka->s_near;
  634             }
  635 
  636             if(ka->s_same > 0)
  637                 ka->same[addr % (ka->s_same*256)] = addr;
  638         }
  639 
  640 
  641 5.3 Encoding of COPY instruction addresses
  642 
  643     The address of a COPY instruction is encoded using different modes
  644     depending on the type of cached address used, if any.
  645 
  646     Let "addr" be the address of a COPY instruction to be decoded and "here"
  647     be the current location in the target data (i.e., the start of the data
  648     about to be encoded or decoded).  Let near[j] be the jth element in
  649     the near cache, and same[k] be the kth element in the same cache.
  650     Below are the possible address modes:
  651 
  652 	VCD_SELF: This mode has value 0. The address was encoded by itself
  653             as an integer.
  654 
  655 	VCD_HERE: This mode has value 1. The address was encoded as
  656 	    the integer value "here - addr".
  657 
  658 	Near modes: The "near modes" are in the range [2,s_near+1]. Let m
  659 	    be the mode of the address encoding. The address was encoded
  660 	    as the integer value "addr - near[m-2]".
  661 
  662 	Same modes: The "same modes" are in the range
  663 	    [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding.
  664 	    The address was encoded as a single byte b such that
  665 	    "addr == same[(m - (s_near+2))*256 + b]".
  666 
  667 
  668 5.3 Example code for encoding and decoding of COPY instruction addresses
  669 
  670     We show example algorithms below to demonstrate use of address modes more
  671     clearly. The encoder has freedom to choose address modes, the sample
  672     addr_encode() algorithm merely shows one way of picking the address
  673     mode. The decoding algorithm addr_decode() will uniquely decode addresses
  674     regardless of the encoder's algorithm choice.
  675 
  676     Note that the address caches are updated immediately after an address is
  677     encoded or decoded. In this way, the decoder is always synchronized with
  678     the encoder.
  679 
  680         int addr_encode(Cache_t* ka, int addr, int here, int* mode)
  681         {
  682 	    int  i, d, bestd, bestm;
  683 
  684 	    /* Attempt to find the address mode that yields the
  685 	     * smallest integer value for "d", the encoded address
  686 	     * value, thereby minimizing the encoded size of the
  687 	     * address. */
  688 
  689             bestd = addr; bestm = VCD_SELF;      /* VCD_SELF == 0 */
  690 
  691             if((d = here-addr) < bestd)
  692                 { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */
  693 
  694             for(i = 0; i < ka->s_near; ++i)
  695                 if((d = addr - ka->near[i]) >= 0 && d < bestd)
  696                     { bestd = d; bestm = i+2; }
  697 
  698             if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
  699                 { bestd = d%256; bestm = ka->s_near + 2 + d/256; }
  700 
  701             cache_update(ka,addr);
  702 
  703             *mode = bestm; /* this returns the address encoding mode */
  704             return  bestd; /* this returns the encoded address       */
  705         }
  706 
  707     Note that the addr_encode() algorithm chooses the best address mode using a
  708     local optimization, but that may not lead to the best encoding efficiency
  709     because different modes lead to different instruction encodings, as    described below.
  710 
  711     The functions addrint() and addrbyte() used in addr_decode() obtain from
  712     the "Addresses section for COPYs" (Section 4.3) an integer or a byte,
  713     respectively. These utilities will not be described here.  We simply
  714     recall that an integer is represented as a compact variable-sized string
  715     of bytes as described in Section 2 (i.e., base 128).
  716 
  717         int addr_decode(Cache_t* ka, int here, int mode)
  718         {   int  addr, m;
  719 
  720             if(mode == VCD_SELF)
  721                  addr = addrint();
  722             else if(mode == VCD_HERE)
  723                  addr = here - addrint();
  724             else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
  725                  addr = ka->near[m] + addrint();
  726             else /* same cache */
  727             {    m = mode - (2 + ka->s_near);
  728                  addr = ka->same[m*256 + addrbyte()];
  729             }
  730 
  731             cache_update(ka, addr);
  732 
  733             return addr;
  734         }
  735 
  736 
  737 5.4 Instruction Codes
  738 
  739     As noted, the data sizes associated with delta instructions are often
  740     small. Thus, compression efficiency can be improved by combining the sizes
  741     and instruction types in a single encoding, as well by combining certain
  742     pairs of adjacent delta instructions. Effective choices of when to perform
  743     such combinations depend on many factors including the data being processed
  744     and the string matching algorithm in use. For example, if many COPY
  745     instructions have the same data sizes, it may be worth to encode these
  746     instructions more compactly than others.
  747 
  748     The Vcdiff data format is designed so that a decoder does not need to be
  749     aware of the choices made in encoding algorithms. This is achieved with the
  750     notion of an "instruction code table" containing 256 entries. Each entry
  751     defines either a single delta instruction or a pair of instructions that
  752     have been combined.  Note that the code table itself only exists in main
  753     memory, not in the delta file (unless using an application-defined code
  754     table, described in Section 7). The encoded data simply includes the index
  755     of each instruction and, since there are only 256 indices, each index
  756     can be represented as a single byte.
  757 
  758     Each instruction code entry contains six fields, each of which
  759     is a single byte with unsigned value:
  760 
  761             +-----------------------------------------------+
  762 	    | inst1 | size1 | mode1 | inst2 | size2 | mode2 |
  763 	    +-----------------------------------------------+
  764 
  765 @@@ could be more compact
  766 
  767     Each triple (inst,size,mode) defines a delta instruction. The meanings
  768     of these fields are as follows:
  769 
  770     inst: An "inst" field can have one of the four values: NOOP (0), ADD (1),
  771 	RUN (2) or COPY (3) to indicate the instruction types. NOOP means
  772 	that no instruction is specified. In this case, both the corresponding
  773 	size and mode fields will be zero.
  774 
  775     size: A "size" field is zero or positive. A value zero means that the
  776 	size associated with the instruction is encoded separately as
  777 	an integer in the "Instructions and sizes section" (Section 6).
  778 	A positive value for "size" defines the actual data size.
  779 	Note that since the size is restricted to a byte, the maximum
  780 	value for any instruction with size implicitly defined in the code
  781 	table is 255.
  782 
  783     mode: A "mode" field is significant only when the associated delta
  784 	instruction is a COPY. It defines the mode used to encode the
  785 	associated addresses. For other instructions, this is always zero.
  786 
  787 
  788 5.5 The Code Table
  789 
  790     Following the discussions on address modes and instruction code tables,
  791     we define a "Code Table" to have the data below:
  792 
  793 	s_near: the size of the near cache,
  794 	s_same: the size of the same cache,
  795 	i_code: the 256-entry instruction code table.
  796 
  797     Vcdiff itself defines a "default code table" in which s_near is 4
  798     and s_same is 3. Thus, there are 9 address modes for a COPY instruction.
  799     The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5
  800     are for addresses coded against the near cache. And, modes 6, 7  and 8
  801     are for addresses coded against the same cache.
  802 
  803     The default instruction code table is depicted below, in a compact
  804     representation that we use only for descriptive purposes.  See section 7
  805     for the specification of how an instruction code table is represented
  806     in the Vcdiff encoding format.  In the depiction, a zero value for
  807     size indicates that the size is separately coded. The mode of non-COPY
  808     instructions is represented as 0 even though they are not used.
  809 
  810 
  811          TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX
  812         ---------------------------------------------------------------
  813      1.  RUN         0        0     NOOP       0        0        0
  814      2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18]
  815      3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34]
  816      4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50]
  817      5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66]
  818      6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82]
  819      7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98]
  820      8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114]
  821      9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130]
  822     10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146]
  823     11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162]
  824     12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174]
  825     13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186]
  826     14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198]
  827     15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210]
  828     16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222]
  829     17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234]
  830     18.  ADD       [1,4]      0     COPY       4        6    [235,238]
  831     19.  ADD       [1,4]      0     COPY       4        7    [239,242]
  832     20.  ADD       [1,4]      0     COPY       4        8    [243,246]
  833     21.  COPY        4      [0,8]   ADD        1        0    [247,255]
  834         ---------------------------------------------------------------
  835 
  836     In the above depiction, each numbered line represents one or more
  837     entries in the actual instruction code table (recall that an entry in
  838     the instruction code table may represent up to two combined delta
  839     instructions.) The last column ("INDEX") shows which index value or
  840     range of index values of the entries covered by that line. The notation
  841     [i,j] means values from i through j, inclusive. The first 6 columns of
  842     a line in the depiction describe the pairs of instructions used for
  843     the corresponding index value(s).
  844 
  845     If a line in the depiction includes a column entry using the [i,j]
  846     notation, this means that the line is instantiated for each value
  847     in the range from i to j, inclusive.  The notation "0, [i,j]" means
  848     that the line is instantiated for the value 0 and for each value
  849     in the range from i to j, inclusive.
  850 
  851     If a line in the depiction includes more than one entry using the [i,j]
  852     notation, implying a "nested loop" to convert the line to a range of
  853     table entries, the first such [i,j] range specifies the outer loop,
  854     and the second specifies the inner loop.
  855 
  856     The below examples should make clear the above description:
  857 
  858     Line 1 shows the single RUN instruction with index 0. As the size field
  859     is 0, this RUN instruction always has its actual size encoded separately.
  860 
  861     Line 2 shows the 18 single ADD instructions. The ADD instruction with
  862     size field 0 (i.e., the actual size is coded separately) has index 1.
  863     ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and
  864     their sizes are as given (so they will not be separately encoded.)
  865 
  866     Following the single ADD instructions are the single COPY instructions
  867     ordered by their address encoding modes. For example, line 11 shows the
  868     COPY instructions with mode 8, i.e., the last of the same cache.
  869     In this case, the COPY instruction with size field 0 has index 147.
  870     Again, the actual size of this instruction will be coded separately.
  871 
  872     Lines 12 to 21 show the pairs of instructions that are combined together.
  873     For example, line 12 depicts the 12 entries in which an ADD instruction
  874     is combined with an immediately following COPY instruction. The entries
  875     with indices 163, 164, 165 represent the pairs in which the ADD
  876     instructions all have size 1 while the COPY instructions has mode
  877     0 (VCD_SELF) and sizes 4, 5 and 6 respectively.
  878 
  879     The last line, line 21, shows the eight instruction pairs where the first
  880     instruction is a COPY and the second is an ADD. In this case, all COPY
  881     instructions have size 4 with mode ranging from 0 to 8 and all the ADD
  882     instructions have size 1. Thus, the entry with largest index 255
  883     combines a COPY instruction of size 4 and mode 8 with an ADD instruction
  884     of size 1.
  885 
  886     The choice of the minimum size 4 for COPY instructions in the default code
  887     table was made from experiments that showed that excluding small matches
  888     (less then 4 bytes long) improved the compression rates.
  889 
  890 
  891 6. DECODING A TARGET WINDOW
  892 
  893     Section 4.3 discusses that the delta instructions and associated data
  894     are encoded in three arrays of bytes:
  895 
  896         Data section for ADDs and RUNs,
  897         Instructions and sizes section, and
  898         Addresses section for COPYs.
  899 
  900 
  901     Further, these data sections may have been further compressed by some
  902     secondary compressor. Assuming that any such compressed data has been
  903     decompressed so that we now have three arrays:
  904 
  905 	inst: bytes coding the instructions and sizes.
  906         data: unmatched data associated with ADDs and RUNs.
  907 	addr: bytes coding the addresses of COPYs.
  908 
  909     These arrays are organized as follows:
  910 
  911 	inst:
  912 	    a sequence of (index, [size1], [size2]) tuples, where "index"
  913             is an index into the instruction code table, and size1 and size2
  914             are integers that MAY or MAY NOT be included in the tuple as
  915             follows. The entry with the given "index" in the instruction
  916             code table potentially defines two delta instructions. If the
  917             first delta instruction is not a VCD_NOOP and its size is zero,
  918             then size1 MUST be present. Otherwise, size1 MUST be omitted and
  919             the size of the instruction (if it is not VCD_NOOP) is as defined
  920             in the table. The presence or absence of size2 is defined
  921             similarly with respect to the second delta instruction.
  922 
  923 	data:
  924 	    a sequence of data values, encoded as bytes.
  925 
  926 	addr:
  927 	    a sequence of address values. Addresses are normally encoded as
  928             integers as described in Section 2 (i.e., base 128).
  929 	    Since the same cache emits addresses in the range [0,255],
  930 	    however, same cache addresses are always encoded as a
  931 	    single byte.
  932 
  933     To summarize, each tuple in the "inst" array includes an index to some
  934     entry in the instruction code table that determines:
  935 
  936     a. Whether one or two instructions were encoded and their types.
  937 
  938     b. If the instructions have their sizes encoded separately, these
  939        sizes will follow, in order, in the tuple.
  940 
  941     c. If the instructions have accompanying data, i.e., ADDs or RUNs,
  942        their data will be in the array "data".
  943 
  944     d. Similarly, if the instructions are COPYs, the coded addresses are
  945        found in the array "addr".
  946 
  947     The decoding procedure simply processes the arrays by reading one code
  948     index at a time, looking up the corresponding instruction code entry,
  949     then consuming the respective sizes, data and addresses following the
  950     directions in this entry. In other words, the decoder maintains an implicit
  951     next-element pointer for each array; "consuming" an instruction tuple,
  952     data, or address value implies incrementing the associated pointer.
  953 
  954     For example, if during the processing of the target window, the next
  955     unconsumed tuple in the inst array has index value 19, then the first
  956     instruction is a COPY, whose size is found as the immediately following
  957     integer in the inst array.  Since the mode of this COPY instruction is
  958     VCD_SELF, the corresponding address is found by consuming the next
  959     integer in the addr array.  The data array is left intact. As the second
  960     instruction for code index 19 is a NOOP, this tuple is finished.
  961 
  962 
  963 7. APPLICATION-DEFINED CODE TABLES
  964 
  965     Although the default code table used in Vcdiff is good for general
  966     purpose encoders, there are times when other code tables may perform
  967     better. For example, to code a file with many identical segments of data,
  968     it may be advantageous to have a COPY instruction with the specific size
  969     of these data segments so that the instruction can be encoded in a single
  970     byte. Such a special code table MUST then be encoded in the delta file
  971     so that the decoder can reconstruct it before decoding the data.
  972 
  973     Vcdiff allows an application-defined code table to be specified
  974     in a delta file with the following data:
  975 
  976 	Size of near cache            - byte
  977 	Size of same cache            - byte
  978 	Compressed code table data
  979 
  980     The "compressed code table data" encodes the delta between the default
  981     code table (source) and the new code table (target) in the same manner as
  982     described in Section 4.3 for encoding a target window in terms of a
  983     source window. This delta is computed using the following steps:
  984 
  985     a.  Convert the new instruction code table into a string, "code", of
  986 	1536 bytes using the below steps in order:
  987 
  988         i. Add in order the 256 bytes representing the types of the first
  989 	   instructions in the instruction pairs.
  990        ii. Add in order the 256 bytes representing the types of the second
  991 	   instructions in the instruction pairs.
  992       iii. Add in order the 256 bytes representing the sizes of the first
  993 	   instructions in the instruction pairs.
  994        iv. Add in order the 256 bytes representing the sizes of the second
  995 	   instructions in the instruction pairs.
  996         v. Add in order the 256 bytes representing the modes of the first
  997 	   instructions in the instruction pairs.
  998        vi. Add in order the 256 bytes representing the modes of the second
  999 	   instructions in the instruction pairs.
 1000 
 1001     b.  Similarly, convert the default instruction code table into
 1002 	a string "dflt".
 1003 
 1004     c.  Treat the string "code" as a target window and "dflt" as the
 1005 	corresponding source data and apply an encoding algorithm to
 1006 	compute the delta encoding of "code" in terms of "dflt".
 1007 	This computation MUST use the default code table for encoding
 1008 	the delta instructions.
 1009 
 1010     The decoder can then reverse the above steps to decode the compressed
 1011     table data using the method of Section 6, employing the default code
 1012     table, to generate the new code table.  Note that the decoder does not
 1013     need to know anything about the details of the encoding algorithm used
 1014     in step (c). The decoder is still able to decode the new code table
 1015     because the Vcdiff format is independent from the choice of encoding
 1016     algorithm, and because the encoder in step (c) uses the known, default
 1017     code table.
 1018 
 1019 
 1020 8. PERFORMANCE
 1021 
 1022     The encoding format is compact. For compression only, using the LZ-77
 1023     string parsing strategy and without any secondary compressors, the typical
 1024     compression rate is better than Unix compress and close to gzip.  For
 1025     differencing, the data format is better than all known methods in
 1026     terms of its stated goal, which is primarily decoding speed and
 1027     encoding efficiency.
 1028 
 1029     We compare the performance of compress, gzip and Vcdiff using the
 1030     archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,
 1031     gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an
 1032     SGI-MIPS3, 400MHZ. Gzip was used at its default compression level.
 1033     Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13).
 1034     As string and window matching typically dominates the computation during
 1035     compression, the Vcdiff compression times were directly due to the
 1036     algorithms used in the Vcodex/Vcdiff software. However, the decompression
 1037     times should be generic and representative of any good implementation
 1038     of the Vcdiff data format. Timing was done by running each program
 1039     three times and taking the average of the total cpu+system times.
 1040 
 1041     Below are the different Vcdiff runs:
 1042 
 1043 	Vcdiff: vcdiff is used as compressor only.
 1044 
 1045 	Vcdiff-d: vcdiff is used as a differencer only. That is, it only
 1046 		compares target data against source data.  Since the files
 1047 		involved are large, they are broken into windows. In this
 1048 		case, each target window starting at some file offset in
 1049 		the target file is compared against a source window with
 1050 		the same file offset (in the source file). The source
 1051 		window is also slightly larger than the target window
 1052 		to increase matching opportunities. The -d option also gives
 1053 		a hint to the string matching algorithm of Vcdiff that
 1054 		the two files are very similar with long stretches of matches.
 1055 		The algorithm takes advantage of this to minimize its
 1056 		processing of source data and save time.
 1057 
 1058 	Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare
 1059 		target data against target data as applicable. Thus, vcdiff
 1060 		both computes differences and compresses data. The windowing
 1061 		algorithm is the same as above. However, the above hint is
 1062 		recinded in this case.
 1063 
 1064 	Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm
 1065 		uses a content-based heuristic to select source data segments
 1066 		that are more likely to match with a given target window.
 1067 		Thus, the source data segment selected for a target window
 1068 		often will not be aligned with the file offsets of this
 1069 		target window.
 1070 
 1071 
 1072                 gcc-2.95.1    gcc-2.95.2    compression   decompression
 1073     raw size      55746560      55797760
 1074     compress         -          19939390       13.85s	      7.09s
 1075     gzip             -          12973443       42.99s         5.35s
 1076     Vcdiff           -          15358786       20.04s         4.65s
 1077     Vcdiff-d         -            100971       10.93s         1.92s
 1078     Vcdiff-dc        -             97246       20.03s         1.84s
 1079     Vcdiff-dcs       -            256445       44.81s         1.84s
 1080 
 1081 		TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1
 1082 
 1083 
 1084     TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the
 1085     sizes of the compressed results. As a pure compressor, the compression
 1086     rate for Vcdiff is worse than gzip and better than compress. The last
 1087     three rows shows that when two file versions are very similar, differencing
 1088     can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use
 1089     the same simple window selection method but Vcdiff-dc also does compression
 1090     so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on
 1091     data content to search for source data that likely will match a given target
 1092     window. Although it does a good job, the heuristic did not always find the
 1093     best matches which are given by the simple algorithm of Vcdiff-d.  As a
 1094     result, the output size is slightly larger. Note also that there is a large
 1095     cost in computing matching windows this way. Finally, the compression times
 1096     of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude
 1097     that the compression feature causes the additional time in Vcdiff-dc
 1098     relative to Vcdiff-d.  However, this is not the case. The hint given to
 1099     the Vcdiff string matching algorithm that the two files are likely to
 1100     have very long stretches of matches helps the algorithm to minimize
 1101     processing of the "source data", thus saving half the time. However, as we
 1102     shall see below when this hint is wrong, the result is even longer time.
 1103 
 1104 
 1105                 gcc-2.95.2    gcc-2.95.3    compression   decompression
 1106     raw size      55797760      55787520
 1107     compress         -          19939453       13.54s	      7.00s
 1108     gzip             -          12998097       42.63s         5.62s
 1109     Vcdiff           -          15371737       20.09s         4.74s
 1110     Vcdiff-d         -          26383849       71.41s         6.41s
 1111     Vcdiff-dc        -          14461203       42.48s         4.82s
 1112     Vcdiff-dcs       -           1248543       61.18s         1.99s
 1113 
 1114 		TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2
 1115 
 1116 
 1117     TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and
 1118     the sizes of the compressed results. In this case, the tar file of
 1119     gcc-2.95.3 is rearranged in a way that makes the straightforward method
 1120     of matching file offsets for source and target windows fail. As a
 1121     result, Vcdiff-d performs rather dismally both in time and output size.
 1122     The large time for Vcdiff-d is directly due to fact that the string
 1123     matching algorithm has to work much harder to find matches when the hint
 1124     that two files have long matching stretches fails to hold. On the other
 1125     hand, Vcdiff-dc does both differencing and compression resulting in good
 1126     output size. Finally, the window searching heuristic used in Vcdiff-dcs is
 1127     effective in finding the right matching source windows for target windows
 1128     resulting a small output size. This shows why the data format needs to
 1129     have a way to specify matching windows to gain performance. Finally,
 1130     we note that the decoding times are always good regardless of how
 1131     the string matching or window searching algorithms perform.
 1132 
 1133 
 1134 9. FURTHER ISSUES
 1135 
 1136     This document does not address a few issues:
 1137 
 1138     Secondary compressors:
 1139         As discussed in Section 4.3, certain sections in the delta encoding
 1140 	of a window may be further compressed by a secondary compressor.
 1141 	In our experience, the basic Vcdiff format is adequate for most
 1142 	purposes so that secondary compressors are seldom needed. In
 1143         particular, for normal use of data differencing where the files to
 1144 	be compared have long stretches of matches, much of the gain in
 1145 	compression rate is already achieved by normal string matching.
 1146 	Thus, the use of secondary compressors is seldom needed in this case.
 1147 	However, for applications beyond differencing of such nearly identical
 1148 	files, secondary compressors may be needed to achieve maximal
 1149 	compressed results.
 1150 
 1151         Therefore, we recommend to leave the Vcdiff data format defined
 1152 	as in this document so that the use of secondary compressors
 1153  	can be implemented when they become needed in the future.
 1154         The formats of the compressed data via such compressors or any
 1155 	compressors that may be defined in the future are left open to
 1156 	their implementations.  These could include Huffman encoding,
 1157 	arithmetic encoding, and splay tree encoding [8,9].
 1158 
 1159     Large file system vs. small file system:
 1160 	As discussed in Section 4, a target window in a large file may be
 1161 	compared against some source window in another file or in the same
 1162 	file (from some earlier part). In that case, the file offset of the
 1163 	source window is specified as a variable-sized integer in the delta
 1164 	encoding. There is a possibility that the encoding was computed on
 1165 	a system supporting much larger files than in a system where
 1166 	the data may be decoded (e.g., 64-bit file systems vs. 32-bit file
 1167 	systems). In that case, some target data may not be recoverable.
 1168 	This problem could afflict any compression format, and ought
 1169 	to be resolved with a generic negotiation mechanism in the
 1170 	appropriate protocol(s).
 1171 
 1172 
 1173 10.  SUMMARY
 1174 
 1175     We have described Vcdiff, a general and portable encoding format for
 1176     compression and differencing. The format is good in that it allows
 1177     implementing a decoder without knowledge of the encoders. Further,
 1178     ignoring the use of secondary compressors not defined within the format,
 1179     the decoding algorithms runs in linear time and requires working space
 1180     proportional to window sizes.
 1181 
 1182 
 1183 
 1184 11. ACKNOWLEDGEMENTS
 1185 
 1186     Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
 1187     who provided much encouragement to publicize Vcdiff. In particular, Jeff
 1188     helped clarifying the description of the data format presented here.
 1189 
 1190 
 1191 
 1192 12. SECURITY CONSIDERATIONS
 1193 
 1194     Vcdiff only provides a format to encode compressed and differenced data.
 1195     It does not address any issues concerning how such data are, in fact,
 1196     stored in a given file system or the run-time memory of a computer system.
 1197     Therefore, we do not anticipate any security issues with respect to Vcdiff.
 1198 
 1199 
 1200 
 1201 13. SOURCE CODE AVAILABILITY
 1202 
 1203     Vcdiff is implemented as a data transforming method in Phong Vo's
 1204     Vcodex library. AT&T Corp. has made the source code for Vcodex available
 1205     for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11].
 1206     The source code and according license is accessible at the below URL:
 1207 
 1208           http://www.research.att.com/sw/tools
 1209 
 1210 
 1211 14. INTELLECTUAL PROPERTY RIGHTS
 1212 
 1213    The IETF has been notified of intellectual property rights claimed in
 1214    regard to some or all of the specification contained in this
 1215    document.  For more information consult the online list of claimed
 1216    rights, at <http://www.ietf.org/ipr.html>.
 1217 
 1218    The IETF takes no position regarding the validity or scope of any
 1219    intellectual property or other rights that might be claimed to
 1220    pertain to the implementation or use of the technology described in
 1221    this document or the extent to which any license under such rights
 1222    might or might not be available; neither does it represent that it
 1223    has made any effort to identify any such rights.  Information on the
 1224    IETF's procedures with respect to rights in standards-track and
 1225    standards-related documentation can be found in BCP-11.  Copies of
 1226    claims of rights made available for publication and any assurances of
 1227    licenses to be made available, or the result of an attempt made to
 1228    obtain a general license or permission for the use of such
 1229    proprietary rights by implementors or users of this specification can
 1230    be obtained from the IETF Secretariat.
 1231 
 1232 
 1233 
 1234 15. IANA CONSIDERATIONS
 1235 
 1236    The Internet Assigned Numbers Authority (IANA) administers the number
 1237    space for Secondary Compressor ID values.  Values and their meaning
 1238    must be documented in an RFC or other peer-reviewed, permanent, and
 1239    readily available reference, in sufficient detail so that
 1240    interoperability between independent implementations is possible.
 1241    Subject to these constraints, name assignments are First Come, First
 1242    Served - see RFC2434 [13].  Legal ID values are in the range 1..255.
 1243 
 1244    This document does not define any values in this number space.
 1245 
 1246 
 1247 16. REFERENCES
 1248 
 1249     [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
 1250         Practical Reusable Unix Software, Editor B. Krishnamurthy,
 1251         John Wiley & Sons, Inc., 1995.
 1252 
 1253     [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
 1254         Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977.
 1255 
 1256     [3] W. Tichy, The String-to-String Correction Problem with Block Moves,
 1257         ACM Transactions on Computer Systems, 2(4):309-321, November 1984.
 1258 
 1259     [4] E.M. McCreight, A Space-Economical Suffix Tree Construction
 1260         Algorithm, Journal of the ACM, 23:262-272, 1976.
 1261 
 1262     [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms,
 1263         IEEE Software Configuration and Maintenance Workshop, 1996.
 1264 
 1265     [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis,
 1266         ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998.
 1267 
 1268     [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
 1269         Proc. of the Summer '91 Usenix Conference, 1991.
 1270 
 1271     [8] D. W. Jones, Application of Splay Trees to Data Compression,
 1272         CACM, 31(8):996:1007.
 1273 
 1274     [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1,
 1275         M&T Books, New York, NY, 1995.
 1276 
 1277    [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy,
 1278         Potential benefits of delta encoding and data compression for HTTP,
 1279         SIGCOMM '97, Cannes, France, 1997.
 1280 
 1281    [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann,
 1282         Y. Goland, and A. Van Hoff, Delta Encoding in HTTP,
 1283         IETF, draft-mogul-http-delta-10, 2001.
 1284 
 1285    [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels,
 1286         RFC 2119, March 1997.
 1287 
 1288    [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA
 1289         Considerations Section in RFCs, RFC2434, October 1998.
 1290 
 1291 
 1292 
 1293 17. AUTHOR'S ADDRESS
 1294 
 1295     Kiem-Phong Vo (main contact)
 1296     AT&T Labs, Room D223
 1297     180 Park Avenue
 1298     Florham Park, NJ 07932
 1299     Email: kpv@research.att.com
 1300     Phone: 1 973 360 8630
 1301 
 1302     David G. Korn
 1303     AT&T Labs, Room D237
 1304     180 Park Avenue
 1305     Florham Park, NJ 07932
 1306     Email: dgk@research.att.com
 1307     Phone: 1 973 360 8602
 1308 
 1309     Jeffrey C. Mogul
 1310     Western Research Laboratory
 1311     Compaq Computer Corporation
 1312     250 University Avenue
 1313     Palo Alto, California, 94305, U.S.A.
 1314     Email: JeffMogul@acm.org
 1315     Phone: 1 650 617 3304 (email preferred)
 1316 
 1317     Joshua P. MacDonald
 1318     Computer Science Division
 1319     University of California, Berkeley
 1320     345 Soda Hall
 1321     Berkeley, CA 94720
 1322     Email: jmacd@cs.berkeley.edu