The rsync algorithm

Posted on 2016-12-18

I was curious about how the nice little program, rsync work. Rsync is something that I use almost every day.

In this post, we describe the elegant rsync algorithm, which is described by this technical report by Andrew Tridgell. As always, Tridgell’s writing is crystal clear and it would perhaps make more sense for you, the reader to read that report than this blog post. But if you want to read some not-so-elegant Haskell code (my fault, not the language’s) that describe the algorithm, read on…

Before we proceed, credits where it is due. I read the rsync technical report along with another earlier blog post that developed the algorithm step by step, in OCaml. Thanks to the authors of that post. The code below is not a direct translation of that post. I developed the code independently after reading the post and the technical report. So, off we start..

Another thing to keep in mind is that this post is only about the rsync algorithm and not a full rsync program implementation.

First, we declare some boiler plate and some imports..

{-# LANGUAGE OverloadedStrings #-}
module Lib where

import Control.Monad.State
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as BL
import Data.Digest.Adler32 (adler32, adler32Update)
import qualified Data.Map as M
import Data.Word (Word8, Word32)
import qualified Crypto.Hash.MD4 as MD4

We declare some helper type synonyms next. Rsync uses a weak checksum called Adler32 (which is 32-bit wide) and a strong hash, MD4, which is 128 bit wide. The weak checksum takes less CPU cycles to calculate. In comparison, MD4 is more costly to calculate.

We represent the MD4 output as a Lazy ByteString.

If two blocks of bytes have the weak signature, that does not mean that the blocks of bytes also match. On the other hand, if they don’t have the same weak signature, we are rather certain that the blocks of bytes do not match. There is no harm in this assumption. However we should remove false positives from the algorithm, so we also make use of strong checksums as we will see below.

type Md4digest       = BS.ByteString
type Adler32checksum = Word32

The big picture

So, we have two hosts A and B. Host A wants to transfer a file (rsync can work with multiple files, but for this discussion, let us talk only about transferring just one file) to the Host B. Let us also say that Host B, already has a version of the file which is similar, but not the same. For example, assume that you are managing a personal website on a server and so, a bunch of files that constitute the website exist on both your development machine and on the server. Sometimes, you add a new file, but you also sometime correct typos or make additions to a few files. So, you have a bunch of files that are similar but not the same.

We could just send the new file from Host A to Host B. But that would be wasteful. Host B already has most of the bytes that constitute the file. So, why send them all over again!

We could take a block of S bytes at a time from the file and ask the host B if it has the those bytes at a particular offset in the file and do the same for the entire file. That would certainly work. If the block does not exist, we send the block. If that block exist, we increment by S bytes etc..

This would work very well when the copy existing in Host B has a few additions at the end of the file or has a few length preserving changes in the middle. But what if the copy of the file in the host B is the same as the one in host A with a few bytes from the beginning removed (or added at the beginning).

Also let us add another constraint: we want to do the entire operation in just one round trip.

algorithm

Here is what rsync does (as per the rsync technical report). I haven’t looked at the rsync source code to verify if it has deviated from the report. There seem to be librsync and rsync and they seem to have some differences when I looked last.

The Host A splits the file into blocks of size S. For each of those blocks, it calculates two signatures:

splitBS :: BL.ByteString -> Integer -> [BL.ByteString]
splitBS bs blockSize | fromIntegral (BL.length bs) < blockSize = [bs]
splitBS bs blockSize = BL.take (fromIntegral blockSize) bs :
  splitBS (BL.drop (fromIntegral blockSize) bs) blockSize

We split the bytestring that represents the file into blocks of size ‘blockSize’. The last block has the length <= ‘blockSize’.

Then, we calculate the signature for each block and create an association list, each element is a tuple of the form:

type Signature     = (Md4digest, Adler32checksum, Int)

.. and here is the function that creates the signatures for the block.

fileSignatures :: BL.ByteString -> Integer -> [Signature]
fileSignatures bs blockSize = zip3 strongsigs weaksigs [0..]
  where blocks     = splitBS bs blockSize
        strongsigs = map blockSig blocks
        weaksigs   = map adler32 blocks

Note that the type signature describe the general shape of the function succinctly. We could optimize this further by traversing the blocks only once.

The two individual signatures ‘blockSig’ and ‘adler32’ uses the library functions that we imported at the beginning.

blockSig :: BL.ByteString -> BS.ByteString
blockSig = MD4.hash . BL.toStrict

weakSig :: BL.ByteString -> Adler32checksum
weakSig = adler32

Host B recieves these signatures. The output of the processing at B can be modelled as a series of instructions. The instruction would be to say that a particular block is the same as another block in the file (from the copy on the Host A) or it would be a series of bytes that constitutes the ones at a particular location. We model it using this declaration:

data Instruction = RChar Word8
                 | RBlk  Int
                 deriving Show

Now, we see the meat of the algorithm that generates these instructions at the Host B. The genius of rsync algorithm is the simplicity of this algorithm.

At host B, we first calculate the weak signature for the first block and compare it with the (weak) signature we got from host A. Remember that weak signature is .. weak. We are reducing a block of bytes into a simple 32-bit scalar number and so collisions can certain happen. So, if the signature match, then we also compare the MD4 hash of the block. If that matches as well, we are more or less certain of a block match and we send the RBlk index instruction, where index is the index of the first block on the Host A that has the matching block signature. We then increment the index by block size.

If the weak signature does not match, we send the first byte of the block that didn’t match using the RChar instruction and then increment the moving window by one and calculate the checksum of that block. Adler32 is a rolling checksum. That is, given the checksum of bytes b0, b1, …. bn-1: given a new byte bn, we can easily calculate the checksum of b1, b2, … bn from the old checksum and bn. This calculation is quite cheap and fast. After calculating the updated Adler32 checksum, we continue from the next byte as before, looking at a block of bytes from that offset.

In the rsync paper, the authors use a variant of Adler32 checksum for the rolling checksum. The 32-bit checksum is split as two 16-bit numbers. The current value of the 16-bit number is calculated from the previous value of the number, the oldest value used to calculate the previous value and the last value in the new window.

For the Haskell implementation, we use a State Monad to hold the Adler32 checksum as the state and produce a list of instructions as a result. The code below implements the above two paragraphs more or less exactly. We use one hash table for Adler32 checksum to list of block numbers lookup and another for MD4 digest to list of block number lookups.

PS: rsync does not use adler32. It uses a similar rolling checksum algorithm (which I haven’t implemented here, instead I am using adler32 from a library).

genInstructions :: [Signature] -> Integer -> BL.ByteString -> [Instruction]
genInstructions f0sigs blockSize fnew =
  evalState (go fnew) sig0
  where
    sig0 = weakSig $ BL.take (fromIntegral blockSize) fnew
    go :: BL.ByteString -> State Adler32checksum [Instruction]
    go fnew | fnew == BL.empty = return []
            | otherwise = do
                let (blk, blks) = BL.splitAt (fromIntegral blockSize) fnew
                adlerSum <- get
                let matches = M.lookup adlerSum f0AdlerTable >>
                      M.lookup (blockSig blk) f0MD4Table
                case matches of
                  Just idxs -> do
                    modify (`adler32Update` blk)
                    is <- go blks
                    return $ RBlk (head idxs) : is
                  Nothing -> do
                    let c = BL.head blk
                    modify (`adler32Update`  BL.singleton c)
                    is <- go (BL.tail (blk `mappend` blks))
                    return $ RChar c : is
    f0AdlerTable = toAdlerMap f0sigs
    f0MD4Table   = toMD4Map f0sigs

I am not very happy with the above function. It could be a lot more elegant than what I have got.

Reconstruction

Once we get the instructions at the Host A, we simply apply these instructions on the byte string that represents the file. Pretty straightforward.

recreate :: BL.ByteString -> Integer -> [Instruction] -> BL.ByteString
recreate f0 blockSize ins =
  let f0blocks = splitBS f0 blockSize
  in
    go f0blocks ins
  where go f0blocks [] = mempty
        go f0blocks (inst:insts) =
          case inst of
            RBlk i  -> (f0blocks !! i) `mappend` go f0blocks insts
            RChar w -> BL.singleton w `mappend` go f0blocks insts

I think that is an elegant algorithm. The report says one could apply pipelining and apply the same to multiple files (i.e. given any number of files, it only takes one round trip). That’s pretty amazing! Hats off to Andrew Tridgell for the algorithm and also for explaining it very clearly.

I am adding a few missing pieces (like a proper rolling checksum) in a git repo: hs-rsync