Using rdiff to optimise upload of similar files

If you have several similar large files to upload over a slow link, you can use ‘rdiff’ to optimise the transfer. ‘rdiff’ compares files block by block and produces a delta file that contains only the blocks that differ.
It is therefore most useful when two files are largely identical and differ only in parts, such as PowerPoint product presentations tailored for two different customers.
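
To get a feel for the block-by-block idea, here is a rough sketch that cuts two files into fixed-size blocks and counts how many blocks of the new file have no identical block in the base file. It is not how ‘rdiff’ is implemented: ‘rdiff’ is built on librsync and uses rolling checksums, so it can also match blocks that have shifted position. The function name, the 2048-byte default block size and the use of `split` and `md5sum` (GNU coreutils) are my own assumptions for illustration, and the sketch assumes both files are non-empty.

```shell
# Sketch only: count fixed-size blocks of NEW that do not occur in BASE.
# Usage: count_changed_blocks BASE NEW [BLOCK_SIZE]
# Prints "<changed blocks> <total blocks in NEW>".
count_changed_blocks() {
    local base="$1" new="$2" block_size="${3:-2048}"
    local tmp block changed=0 total=0
    tmp=$(mktemp -d) || return 1

    # Cut both files into fixed-size blocks (the last block may be shorter).
    split -a 6 -b "$block_size" "$base" "$tmp/base." || return 1
    split -a 6 -b "$block_size" "$new" "$tmp/new." || return 1

    # The "signature": one hash per block of the base file.
    for block in "$tmp"/base.*; do
        md5sum < "$block"
    done | sort -u > "$tmp/signature"

    # A block of the new file counts as changed if its hash
    # does not appear in the signature.
    for block in "$tmp"/new.*; do
        total=$((total + 1))
        if ! grep -q -x -F "$(md5sum < "$block")" "$tmp/signature"; then
            changed=$((changed + 1))
        fi
    done

    rm -rf "$tmp"
    echo "$changed $total"
}
```

Running `count_changed_blocks document-original.doc document-edition-two.doc` gives a crude upper bound on how much of the second file would have to travel in a delta. Because it only compares aligned blocks, a single inserted byte near the start makes every later block look changed, which is exactly the case rolling checksums handle better.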

In this example we have two similar but different documents:

$ dir *.doc

643,584 document-edition-two.doc
634,880 document-original.doc

Generate an MD5 hash to be used later to verify the file integrity:
$ md5 document-edition-two.doc
8280AEAFABC0833D5FEC64CE5FEF6237  document-edition-two.doc

Prepare a “signature” file which contains hash codes of each block in the base file.

$ rdiff signature document-original.doc document.sig

Next I use that signature file to see which blocks are different in the second file and extract them to a delta file (named “document.delta” in this example).

$ rdiff delta document.sig document-edition-two.doc document.delta

$ dir


Note that the “delta” file is only 12% of the size of “document-edition-two.doc”; the relative size depends on how similar the two documents are.
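
If you want to check the ratio for your own files, a small helper along these lines would do it. This is just a sketch: `size_percent` is a name I made up, and it uses integer arithmetic, so the result is rounded down.

```shell
# Sketch only: print the size of PART as a whole-number percentage of WHOLE.
# Usage: size_percent PART WHOLE
size_percent() {
    local part whole
    # wc -c prints the byte count; $(( )) strips any padding around it.
    part=$(( $(wc -c < "$1") ))
    whole=$(( $(wc -c < "$2") ))
    # Refuse to divide by zero for an empty or unreadable file.
    [ "$whole" -gt 0 ] || return 1
    echo $(( part * 100 / whole ))
}
```

Call it with the delta file first and the full document second, for example `size_percent my.delta document-edition-two.doc`, where `my.delta` stands for whatever you named your delta file.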

Now I upload the file “document-original.doc” and the delta file.

On the server I (or the recipients of the files) run ‘rdiff’ to generate the second document from the first and the delta (“document.delta” here).
$ rdiff patch document-original.doc document.delta document-edition-two-reconstructed.doc
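
If you do this often, the sender and receiver steps can be wrapped in two small shell functions. The sketch below assumes ‘rdiff’ is on the PATH and uses only the `signature`, `delta` and `patch` subcommands shown above; the function names `make_delta` and `apply_delta` are hypothetical.

```shell
# Sketch only: thin wrappers around the rdiff steps used in this post.

# Sender side: write the differences between BASE and NEW to DELTA.
# Usage: make_delta BASE NEW DELTA
make_delta() {
    local base="$1" new="$2" delta="$3" tmp status
    tmp=$(mktemp -d) || return 1
    # The signature goes into a fresh temp directory, so rdiff never
    # has to overwrite an existing file.
    rdiff signature "$base" "$tmp/base.sig" &&
        rdiff delta "$tmp/base.sig" "$new" "$delta"
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Receiver side: rebuild the new file from BASE and DELTA.
# Usage: apply_delta BASE DELTA OUTPUT
apply_delta() {
    rdiff patch "$1" "$2" "$3"
}
```

The signature is only needed while the delta is being computed, which is why the sketch throws it away afterwards. Only the base file and the delta have to reach the server.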

Check the MD5 hash to confirm that the second document has been faithfully reproduced.

$ md5 document-edition-two-reconstructed.doc
8280AEAFABC0833D5FEC64CE5FEF6237  document-edition-two-reconstructed.doc
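
Comparing two long hashes by eye is error-prone, so the check can be scripted. This sketch assumes GNU `md5sum` is available instead of the `md5` command used above; `verify_md5` is a made-up name. It lower-cases the expected value so that an upper-case hash like the one printed above can be pasted in unchanged.

```shell
# Sketch only: succeed if FILE has the expected MD5 hash.
# Usage: verify_md5 FILE EXPECTED_HEX
verify_md5() {
    local actual expected
    # md5sum prints "<hash>  -" when reading stdin; keep only the hash.
    actual=$(md5sum < "$1" | cut -d' ' -f1)
    # md5sum prints lower-case hex, so normalise the expected value.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

For example, `verify_md5 document-edition-two-reconstructed.doc 8280AEAFABC0833D5FEC64CE5FEF6237 && echo OK` prints OK only when the reconstructed file matches the hash recorded before the upload.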

Download rdiff for Windows, compiled with Cygwin  
