Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

ddar calculates a 32-bit Rabin fingerprint over a sliding 48-byte window (a Rabin fingerprint is a type of checksum which allows quick calculation of a "slide" with just the previous fingerprint, the new byte coming in, and the old byte going away). The fingerprint is compared against a mask which has n bits set, where 2^n is the desired target block size. If there's a match, then the location of the window is taken to be a block boundary. With random data this should lead to a binomial distribution of block sizes centred around the target block size. But minimum and maximum block sizes are also enforced for pathological cases which skews this slightly.

Blocks whose boundaries are determined by this algorithm are hashed (SHA256 currently). The hash is used to key each block in a reference-counted key/value store.

Then each archive member is just a list of block hashes.



If you were to elaborate on and brush up the description, it would make an excellent HN submission. Good stuff anyways.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: