Burrows-Wheeler transformation

Main features

The BW method is generally used for images, sound and text.
It works in block mode, the others in streaming mode.
Also called block sorting.
Is used in bzip2.

Time and space complexities

Sorting time complexity is O(n.log(n)). Using method Suffix Array only O(n).

History

It was invented by Michael Burrows and David Wheeler, 1994.

Description

Burrows-Wheeler transformation works in block mode. We read blocks of text with predefined size from input and each block process separately.

First we make all cyclic rotations of input text (here in left) and number them. Then we lexicographically sort the resulting phrases and remember the number of line containing the input string. Now we select from all lexicographically sorted phrases the last symbols and put together the string L or if you like select the last column (L = Last), when we have all phrases written under the others.

Now we use for encoding the string L methods Move To Front and Huffman encoding. Then we can use the RLE method yet, etc. For decoding we use the methods in the reverse order and get back the string L and the value I.

The decoding makes easy putting the string L into column and numbering the single symbols. Next we create the columns F (First) and T.

From column L always choose unworked, lexicographically the least and in order the first symbol L[i]. Then put it on the first free position in column F and into column T on the position T[i] put the number of it´s position in column F. We make it for all symbols of the string L.

In the next step set the index of line i to the value I. Then the symbol L[i] put on the start of output string and then select the line with index T[i]. Now we repeat this while the size of output string is less than size of string L, or if you like while the index of line i isn´t again equal to the value I.

Pseudo-code


 1:  encoding :
 2:  begin
 3:     make all cyclic rotations of input text (phrase begins with marked symbol)
 4:     number obtained phrases
 5:     sort phrases lexicographically using suffix array SA and remember last symbols 
 6:       of each phrase L[i]
 7:     remeber position I of input phrase by sorting
 8:     encode string L using method Move To Front and aplhabet A
 9:     output of Move To Front is sequence numbers C
10:     count frequency N of numbers in sequence C
11:     each number is assigned with codeword of Huffman encoding
12:     output(all in Fibonacci numbers of order 2 - the value I)
13:     output(count of the numbers (symbols) in Huffman tree)
14:     output(all numbers in Huffman tree)
15:     output(the Huffman tree)
16:     output(the C sequence)
17:  end
18:  
19:  decoding :
20:  begin
21:     read the value I
22:     decode count of the numbers in Huffman tree
23:     decode the numbers in Huffman tree
24:     restore codewords of Huffman encoding
25:     decode the sequence C
26:     using method Move To Front to get back the string L
27:     list the column L (Last) and number the lines
28:     while (in L are unworked symbols) do
29:      begin
30:        select unworked and lexikographically sorted the first symbol L[i] from L
31:        put it on the next free position in column F (First) and in T[i] store its index 
32:          in column F
33:      end
34:     select the line i = I
35:     while (output length < length of string L) do
36:      begin
37:        the symbol L[i] put to start of output stream
38:        in next step will work with the line indexed T[i]
39:      end
40:  end

References

M. Burrows and D. Wheeler: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
Burrows-Wheeler Transform. Wikipedia, the free encyclopedia.