Prague Stringology Conference 2019

Yehonatan Dude, Michael Hirsch and Yair Toaff

A Fast SIMD-Based Chunking Algorithm

Deduplication is a special case of data compression where repeated chunks of data are stored only once. The input data is divided into chunks using a chunking algorithm and a cryptographically strong hash is calculated on each chunk and used as its unique identifier for further searching and duplicate elimination. As the input stream is processed, a chunk boundary is declared at a byte address in the input stream if some weak hash of a fixed number of preceding bytes (the "hash window") satisfies some criterion. Commonly, a rolling hash like Karp and Rabin (1987) or some cyclic polynomial Lemire and Kaser (2007) is used for the weak hash since these cheaply support moving the hash window forward one byte in the input stream. This work presents a way to calculate n weak rolling hashes at a time using single instruction multiple data (SIMD) instructions available on today's processors. Furthermore, it shows how to calculate chunk boundaries cheaply using other instructions also available on these processors. Empirical results show that the proposed algorithm is four times as fast as previous algorithms, and that these optimizations save up to 25 % of the computation required for deduplication.

Download paper: Article in PostScript Article in PDF BibTeX Reference
 PostScript   PDF   BibTeX reference 
Download presentation: Presentation