
Little essay about the various methods and viewpoints of crunching
~ version June 1998 ~
by Joa

Courtesy of fravia's page of reverse engineering

Here is the second part of Joa's essay.
See also the first part (Introduction)


Joa's work is extremely clear and will guide and help you to understand these matters, which for obvious reasons no real reverser should ever underestimate. My advice: read, re-read and then - once you have understood the basics - search the web and deepen your knowledge (and then contribute :-)


Little essay about the various methods and viewpoints of crunching.

Part II: Probability and Crunching (Huffman isn't the only way)



By Joa Koester (JoKo2000@HotMail.Com) 
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it. 
If you have corrections, comments, better examples than the ones used 
in this essay, etc. - drop me a line.


OK, let's dive into the world of crunchers and munchers...

In the Intro I told you that crunching is a process of reducing
the SPACE data needs for storage, but not the MEANING (the
information transmitted in the symbols).
The measure 'Entropy' is used here much like in physics: the more
entropy a symbol has, the LESS information is transmitted through it -
less information compared to the whole of all symbols, that is.
This of course means there is a lot of redundancy in it - meaning
a big 'crunch me, crunch me' for us. So we should concentrate
on the symbols that have a very high entropy. How we calculate
an entropy for our purposes depends on the data we want to crunch
and on the algorithm we want to crunch it with.
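
The simplest such measure is a plain frequency count: just count how
often every byte value occurs. A minimal C sketch of that idea (the
name count_frequencies is mine, and I assume the data already sits in
a memory buffer):

#include <stdio.h>

/* Count how often every possible byte value occurs in a buffer.
   These counts are the raw material for the probability-based
   crunchers discussed here. */
void count_frequencies(const unsigned char *data, long len,
                       long counts[256])
{
    long i;
    for (i = 0; i < 256; i++)
        counts[i] = 0;
    for (i = 0; i < len; i++)
        counts[data[i]]++;
}

int main(void)
{
    /* the overstated example used below: 10 a's, 5 b's, 3 c's, 2 d's */
    const unsigned char *text =
        (const unsigned char *)"aaaaaaaaaabbbbbcccdd";
    long counts[256];
    int c;

    count_frequencies(text, 20, counts);
    for (c = 'a'; c <= 'd'; c++)
        printf("%c = %ld\n", c, counts[c]);
    return 0;
}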
I also told you that a certain student called Huffman once came
up with an algorithm to compress data. His idea is based on the
fact that in data (especially speech and text) some symbols happen
to occur more often than others. Here is one overstated example:

aaaaaaaaaabbbbbcccdd

What we have here is an algorithm based on probability. That means
that some symbols are no longer transmitted the way they used to be,
because the algorithm creates new symbols in respect to the
probabilities of the old ones. Some of these new symbols will be
shorter, and the less frequently appearing symbols will be transformed
into longer symbols:

Before:

a = 61h = %0110 0001
b = 62h = %0110 0010
c = 63h = %0110 0011
d = 64h = %0110 0100

and 

after:

????

How do we compute such a new symbol table? OK, before we explore
this algorithm, let me reduce the data from full ASCII bytes to their
lower nibbles. That should be enough to explain the algorithm:

Before:

a = %0001
b = %0010
c = %0011
d = %0100

and we have the following frequencies:

a = 10
b =  5
c =  3 
d =  2
   ===
    20

How do we get from these counts to new symbols? We build the famous
Huffman-tree: take the two symbols with the lowest counts (d = 2 and
c = 3) and melt them together under a new knot with the count 5. Take
the two lowest counts again (b = 5 and our new knot with 5) and melt
them into a knot with the count 10. Finally this knot and a = 10 are
melted into the root with the count 20. Now we walk from the root down
to every symbol, writing a 0 for each 'zero' branch and a 1 for each
'one' branch, and get:

a = %0
b = %10
c = %110
d = %111

Instead of 20 * 4 = 80 bits our data now occupies only
10*1 + 5*2 + 3*3 + 2*3 = 35 bits. Not bad, eh?

Decoding works the other way round: we start at the root, read bit
after bit and climb down the tree until we reach a leaf. Then we send
the symbol of that leaf to the output and start again at the root.
In pseudocode:

while ( !End_Of_Data() )
    knot = root
    symbol = NONE
    while (symbol == NONE)
       bit = Get_Next_Bit()
       if ( (bit == 0) && (knot->zero != NULL) )
          knot = knot->zero
       else 
          if (knot->one != NULL) knot = knot->one

       if (knot->one == NULL) //is enough. knot->zero is NULL, too
          //symbol found. great. To the output with it
          symbol = knot->symbol
       end if
    end while   
    Do_Output_Char(symbol)
end while

Done. 
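
If you want to play with this, here is a small, self-contained C
sketch of the tree-building step. It is only my own illustration of
the textbook procedure (names like build_tree and the naive 'search
the two smallest' loop are choices of mine, not canon): it melts the
counts of our a/b/c/d example together and prints the resulting codes.

#include <stdio.h>
#include <stdlib.h>

#define SYMS 4   /* our example alphabet: a, b, c, d */

struct knot {
    long count;
    int symbol;                     /* -1 for inner knots */
    struct knot *zero, *one;
};

/* Melt the two knots with the lowest counts into a new parent
   knot, over and over, until only the root is left. */
static struct knot *build_tree(const char *syms, const long *counts, int n)
{
    struct knot *pool[SYMS];
    int used = n, i;

    for (i = 0; i < n; i++) {
        pool[i] = calloc(1, sizeof(struct knot));
        pool[i]->count  = counts[i];
        pool[i]->symbol = syms[i];
    }
    while (used > 1) {
        int lo1 = 0, lo2 = 1, j, first, second;
        struct knot *parent;

        if (pool[lo2]->count < pool[lo1]->count) { lo1 = 1; lo2 = 0; }
        for (j = 2; j < used; j++) {            /* the two smallest */
            if (pool[j]->count < pool[lo1]->count) { lo2 = lo1; lo1 = j; }
            else if (pool[j]->count < pool[lo2]->count) lo2 = j;
        }
        first  = lo1 < lo2 ? lo1 : lo2;
        second = lo1 < lo2 ? lo2 : lo1;

        parent = calloc(1, sizeof(struct knot));
        parent->count  = pool[first]->count + pool[second]->count;
        parent->symbol = -1;
        parent->zero   = pool[first];   /* which child gets 0 or 1 is  */
        parent->one    = pool[second];  /* arbitrary - only the LENGTHS
                                           of the codes matter         */
        pool[first]  = parent;
        pool[second] = pool[--used];
    }
    return pool[0];
}

/* Walk the tree and print the bit string leading to every leaf. */
static void print_codes(const struct knot *k, char *path, int depth)
{
    if (k->zero == NULL) {              /* a leaf - print its code */
        path[depth] = '\0';
        printf("%c = %%%s\n", k->symbol, path);
        return;
    }
    path[depth] = '0'; print_codes(k->zero, path, depth + 1);
    path[depth] = '1'; print_codes(k->one,  path, depth + 1);
}

int main(void)
{
    const char syms[SYMS]   = { 'a', 'b', 'c', 'd' };
    const long counts[SYMS] = { 10, 5, 3, 2 };
    char path[SYMS];        /* deep enough for this tiny alphabet */

    print_codes(build_tree(syms, counts, SYMS), path, 0);
    return 0;
}

A real cruncher must of course also store the tree (or the counts) in
the crunched output, so that the decruncher can rebuild the very same
tree before it starts decoding.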


         
Hm, I have another little problem...

What is it, Watson?


I did as you told me. Took a table of 256 longs, counted the 256 chars,
sorted them and let the computer calculate an optimized Huffman-tree.
But all I got was a tree with all leaves having a length of 8 - all chars
are encoded with 8 bits and there is no crunching in that. What happened?

Ah, Dr. Watson,
you did not remember my sentence about Entropy. The file you examined
had all byte values spread nearly equally over the whole length of the
file. So - statistically speaking - the file is equalized. No surprises
here. With a Huffman-tree that considers the whole file you won't get
any crunching here, I fear.

But what do I do now?

Well, there are three scenarios that will satisfy your desire:
- We use a different algorithm
- We use a brutal tree
- We use an adapting technique

??? Tell me about the brutal tree.

OK, I got this one from the 1001 packer on the C64 (anyone ringing a
bell? 1001 Crew? No? Where have the old days gone...)
They used a tree I can only label a 'brutal tree'. Here's the idea:
when we have data so equally distributed that no ordinary
Huffman-tree will help us, we help ourselves. We take all the
byte counters and sort them. Then we create a table of the 256 chars
ordered by the sorted counters, so that the most frequent
chars appear at the top. Now we have the 256 possible bytes sorted.
So far so good.
Let's think about a byte. How do we read a byte? Isn't it:

#33 = $21 = %0010 0001  ?

What if we changed our view of the binary to a 2 + 6 split:

#33 = $21 = %00 100001

Our 256 chars would now fall into 4 blocks (00, 01, 10, 11) of 64
chars each, addressed by 6 bits inside the block. What if we now
'huffman' these first two bits to:

0
10
110
111

So that we would have %0 100001 instead of %00 100001. Every byte that
is in the first 64-char block is coded with 7 bits, while the others
are coded with either 8 or 9 bits. We can only hope that there are
enough bytes from the first block in the file to outweigh the two
9-bit blocks. But it is amazing how well this little trick works. And
the speed! It is so easy to implement and so fast. Even on the C64 it
took only a fraction of a second to decode a program as big as the
main memory area (other packers took several seconds for the same
size).
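
To make the scheme concrete, here is a little C sketch of such a
brutal-tree encoder. This is my own illustration, not the 1001 Crew's
actual code - the names are made up and the 'bit writer' simply prints
'0' and '1' characters so you can watch what happens:

#include <stdio.h>
#include <stdlib.h>

/* Prefix codes for the four 64-char blocks, as described above:
   block 0 -> %0, block 1 -> %10, block 2 -> %110, block 3 -> %111 */
static const unsigned code_bits[4] = { 0, 2, 6, 7 };
static const int      code_len [4] = { 1, 2, 3, 3 };

struct entry { long count; int byte; };

static int by_count_desc(const void *a, const void *b)
{
    long ca = ((const struct entry *)a)->count;
    long cb = ((const struct entry *)b)->count;
    return (cb > ca) - (cb < ca);
}

static void put_bit(int b) { putchar('0' + b); }  /* naive bit writer */

/* Write the block prefix plus the 6-bit index inside the block. */
static void encode_byte(int rank)
{
    int block = rank >> 6;
    int i;
    for (i = code_len[block] - 1; i >= 0; i--)
        put_bit((code_bits[block] >> i) & 1);
    for (i = 5; i >= 0; i--)
        put_bit((rank >> i) & 1);
    putchar(' ');
}

int main(void)
{
    const unsigned char *msg = (const unsigned char *)"this is a test";
    struct entry table[256];
    int rank_of[256];
    int i;

    for (i = 0; i < 256; i++) { table[i].count = 0; table[i].byte = i; }
    for (i = 0; msg[i]; i++) table[msg[i]].count++;

    /* sort descending, so the most frequent bytes land in the
       cheap 7-bit block */
    qsort(table, 256, sizeof(table[0]), by_count_desc);
    for (i = 0; i < 256; i++) rank_of[table[i].byte] = i;

    for (i = 0; msg[i]; i++) encode_byte(rank_of[msg[i]]);
    putchar('\n');
    return 0;
}

The arithmetic behind the hope mentioned above is simple: a byte from
the first block saves one bit, a byte from the second block is neutral
and a byte from the last two blocks costs one extra bit. So the trick
pays off as soon as the 64 most frequent values together occur more
often than the 128 least frequent ones. The decruncher only needs the
sorted 256-byte table to undo the whole thing.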


Wow, cool. But what about your different algorithms 
and this 'adapting technique' ?


Well, this is another story and will be told in the next chapter...




I hope you enjoyed this chapter. 
There will be more - I promise.
Just have a little patience.

Greetings from a programmer

Joa

In the meantime, after having read Joa's work above, you may enjoy a look at a huffer and an unhuffer, both written in C some time ago, and donated to the world, by Shaun Case... once you are done studying them, maybe you'll want to have a look also at sixpack, another more advanced (adaptive Huffman encoding) compressor, written in C as well, by Philip Gage, for a data compression competition some time ago... since this code includes SIX different algorithms, I believe it could be useful for readers as a 'passage' to the next part of Joa's essay (if Joa doesn't agree with me I'll kill these links of course).



