Static Huffman coding

Main features

statistic compression method
two-pass compression method
semi-adaptive compression method
asymetric compression method
Optimal prefix code
Compression algorithms PKZIP, JPEG, MP3, BZIP2

Time and space complexities

Time complexity of algorithm: O(n + |Σ| * log|Σ|)
Space complexity of algorithm: O(|Σ|)

Let's specify: n = size of input data and |Σ| = size of input alphabet. The algorithm says that input text is passed two times (once to get frequencies and once to get codes for characters), than the binary tree is created (not necessary total), which has |Σ| leaf nodes and depth is log₂|Σ|, so its size is about 2*|Σ| (count of nodes). This tree is passed because of getting code for each input character, this passing last |Σ|*log₂|Σ|. We can use O(|Σ|*log|Σ|) for time complexity of sorting. Than the time compexity is 2*n + |Σ|*log₂|Σ| (withouth sorting) so then O(n + |Σ|*log|Σ|) and the space complexity 2*|Σ| so then O(|Σ|).

History

The story of David Huffman and his coding

The year 1951 was written. At universities of all over the world many similar problems were solved, like at the one where David Huffman was studying. One professor allowed to his students that they didn't have to pass an exam, when they could solve one difficult problem. It consisted in availability of the shortest prefix coding. In that time this problem hadn't been solved yet. But students didn't know that. There were known only ineffective methods (top-down methods). David Huffman (born in 1925) studying at university of Ohio didn't want to do an exam. No because he didn't understand to theme, but he wanted to avoid the exam. The problem seemed to be insolvable. When he wanted to stop inquiry and prepare to the exam regularly, he looked at the paper with notes and in that time he got an idea...

Later already as a Doctor of Science he published his idea in work called A Method for the Construction of Minimum Redundancy Codes. His solution using binary tree was very simple but very elegant together. But also it was very effective in practice. However he has never patented his genial idea (he didn't want to). He always said to his son: "Look, it's only an algorithm."

Today, the most various variations of Huffman coding (for example adaptive variant) are mostly used in some compression algorithms (PKZIP, JPEG, MP3, BZIP2). Interesting is, that the algorithm from unix program bzip2 first used arithmetic coding. But when the firm IBM acquired over ten patents on this algorithm between years 1977-2001 and it was impossible to implement it effectively without using this methods, programmers of this open-source program bzip 2 decided to use Huffman coding.

Description

First, the algorithm counts frequencies (probabilities) of all single characters in input alphabet. According to these frequencies it generates binary tree, which has edges with value 0 or 1 and in its leaf nodes there are characters of input alphabet. Likewise it holds, that the highest is probability of character, the nearer is appearance of node to the root of the tree. Then this characters are replaced with binary code, which correspond to concatenation of values of edges, which we pass on the way from the root to the given leaf node. Since one of the values 0 or 1 always appears on edges leading to the right successor (the other to the left successor) and characters of input alphabet are located only in leaf nodes, the resulting code is prefix. It is necessary to attach this tree or at least the information about frequencies of appearance of single characters in input data to the resulting sequence. Then the decompression is available.

Pseudo-code


 1:  begin
 2:     count frequencies of single characters (source units)
 3:     output(frequencies using Fibonacci Codes of degree 2)
 4:     sort them to non-decreasing sequence
 5:     create a leaf node (character, frequency c, left son = NULL, right son = NULL) 
 6:  	   of the tree for each character and put nodes into queue F
 7:     while (|F|>=2) do 
 8:      begin
 9:        pop the first two nodes (u1, u2) with the lowest 
10:  	      frequencies from sorted queue
11:        create a node evaluated with sum of the chosen units, 
12:  	      successors are chosen units (eps, c(u1)+c(u2), u1, u2)
13:        insert new node into queue
14:      end
15:     node evaluate with way from root to leaf node (left son 1, right son 0)
16:     create output from coded intput characters
17:  end

References

D. Huffman: A Method for the Construction of Minimum Redundancy Codes. Proceedings of the I.R.E, volume 40, pp. 1098-1101, September 1952.
David A. Huffman. Wikipedia, the free encyclopedia.