
Little essay about the various methods and viewpoints of crunching
~ version June 1998 ~
by Joa ~ Part IV

Courtesy of fravia's page of reverse engineering

Well, Joa continues his fundamental paper on crunching; this is part IV (since the preceding one was part 3 :-)
enjoy!

05 June 98 Joa ~ crunchi1.htm Little essay about the various methods and viewpoints of crunching papers ~fra_0126
10 June 98 Joa ~ crunchi2.htm Little essay about the various methods and viewpoints of crunching II papers ~fra_0129
17 June 98 Joa ~ crunchi3.htm Little essay about the various methods and viewpoints of crunching III papers ~fra_012E
17 June 98 Joa ~ crunchi4.htm Little essay about the various methods and viewpoints of crunching IV papers ~fra_012F


Little essay about the various methods and viewpoints of crunching.

Part IV: Leaving Huffman and entering the realms of R'lyeh, eh, RLE



By Joa Koester (JoKo2000@HotMail.Com) 
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it. 
If you have corrections, comments, better examples than the ones used 
in this essay, etc. - drop me a line.


OK, let's dive into the world of crunchers and munchers...


Last time we discussed the possible uses of the Huffman technique and
the possibility of using it in an adaptive way. I would like
to conclude this with the remark that the Huffman way is pretty 
useless for 'normal' progs. You achieve about 5% - 10% if you
are lucky. Used on normal texts, which contain only characters with
ASCII < 128, there will be at least a 25% ratio. Normally there
will be a ratio of about 30%-35%. But there you have a limited
alphabet (the letters in lower and upper case and the numbers
plus some extra symbols). What we want are algorithms usable
for more general (random) data. One of these algorithms (or
better: the principle behind it) is RLE.

What does RLE mean?

RLE, Dr. Watson, stands for Run Length Encoding. Its varying 
implementations are mostly used when speed is important.
In the Amiga days there was a nice picture format - IFF.
It had an RLE mode (or was RLE always on? Can't remember anymore...)
where the single lines of a picture were crunched with an RLE
algorithm, leading to a space-saving of about 30% to 80% depending
on the picture. Nice, isn't it?

But what is the basic idea of RLE?

Ah, Watson, always so eager, aren't you? Well, your desire will
be satisfied... 

Imagine a file with some data and some zeros:

abcdefg00000000000000000000

(you will see data like this a lot in uncompressed bitmaps)

Now, we want to crunch it. We would like to crunch the RUN of the
zeros. For this we have to tell the decoder the LENGTH  of the run. 
One obvious idea would be:

abcdefg{0x14}0

where {0x14} would be a byte which tells the 
decompressor how often it has to copy the following byte. 
That gives us 9 bytes instead of 27 bytes: a 66.66% saving.
Now you know where the name RLE comes from. Believe me, this is
one of the most primitive ways of thinking up a crunch method.
But as with all most primitive ways it is extremely fast and 
astoundingly stable. 
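
Whatever switch format we settle on later, the core of every RLE
cruncher is the same: scanning how long the current run is. A minimal
helper in C (a sketch of mine, names and parameters are not from any
particular implementation):

#include <stddef.h>

/* Measure the run of equal bytes starting at src[pos], up to max_run.
   Every RLE coder, whatever its output format, needs this scan as its
   core loop. */
static size_t run_length(const unsigned char *src, size_t len,
                         size_t pos, size_t max_run)
{
    size_t n = 1;
    while (n < max_run && pos + n < len && src[pos + n] == src[pos])
        n++;
    return n;
}

The encoder then decides, run by run, whether crunching pays off -
which is exactly the question we turn to below.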

--- Little private note ----
As a programmer you don't win a prize for the most complicated 
algorithm you can think of (because you aren't paid (enough) for it 
and in most cases there is no time for it) (except in
the field of security, where complicated algs are expected. 
And yet - a programmer has to debug his own routines, so he won't 
use tricks he won't understand a few weeks later). So you don't do it.
You program on a more stable level: KISS. Keep It Simple, Stupid!
If there is an easier way of doing it, choose it. Use preformulated
libraries like the C++ STL or (yuk) MFC or Borland VCL. That's the
way programmers program their applications. Hugh!
--- Little private note end ---

One immediate question is of course coming up:
What if the zeros had appeared 0x61 times (the value for 'a')?
The output would have been:
abcdefg{0x61}0 = abcdefga0

How does the decruncher know when to activate a copy loop (crunch) or 
just copy the actual bytes (no-crunch)?

We have to set up a rule that both the coder and the decoder
use without exceptions. The coder has to build its output in a way that
the decoder can decode its input without problems.

So, what we need is a kind of SWITCH. A switch which the encoder sends.
Linked with the decision of implementing the switch is automatically 
the question: WHEN do I crunch? 
Do I crunch when there are 10 equal bytes following in a run? Certainly.
There will be enough ways to tell the decoder that a run is coming up.
Do I crunch when there are 5 equal bytes following in a run? Yep. Same answer.
Do I crunch when there are 4 equal bytes following in a run? Yep. Same answer.
Do I crunch when there are 3 equal bytes following in a run? Hm, let me think about it.
Do I crunch when there are 2 equal bytes following in a run? Eh, (panic) don't know. HELP!

OK, OK. What you need is a little overview of some ideas I observed while 
analysing some RLE methods. In most crunching algorithms there will be telling
bits or bytes which tell the decoder to switch between crunch and normal mode. After
such a switch the following bits/bytes are interpreted completely differently.

- IFF like.
  The IFF format was built upon the idea of signed chars. I don't remember the 
  algorithm in all detail, but enough to explain the idea. If someone has the
  original IFF-Readme's he should correct me here. But the basic idea is the 
  following:
  There are TELLING bytes, interpreted as signed chars.
  The crunched file starts with a telling byte.
  The switch was the 8th bit (the sign bit) of this char. 
  When the decoder encountered a negative char it knew that, by masking with 0x7f,
  it would get the run length of the following byte. It copied the following
  byte (decoded value + 2) times and went on with the next byte. +2 because there had to
  be at least 2 bytes, so you could add 2 in mind, giving a length of a
  crunch-run from 2 (encoded as %1 + (2 - 2 = 0 = %000 0000) = 0x80) 
  to 129 (encoded as %1 + (129 - 2 = 127 = %111 1111) = 0xff).
  If the byte was not negative, that meant that the next X + 1 bytes were to 
  be copied as they were. X + 1 because there HAS to be at least one byte, so 
  you can add 1 in mind, giving us a range 
  from 1 byte (encoded as %0 + (1 - 1 = 0 = %000 0000) = 0x00) 
  to 128 (encoded as %0 + (128 - 1 = 127 = %111 1111) = 0x7f). 
  The minimum crunching point was 2 equal following bytes. They would be
  coded as two bytes (0x80,0x??). Yes, I know, there is no saving here, but the other
  consequence would have been to encode these two bytes as no-crunch, leading
  to a three-byte coding (0x01, 0x??, 0x??). 
  In cases like this it is most important to keep the garbage down, or else you are
  increasing the filesize with it.
  One example:
  
  abcdefg00000000000000000000 (= abcdefg + 20 '0')
  
  would be coded as:
  
  {0x06}abcdefg{0x92}0
  0x06 because we have 7 bytes here and the decoder adds one for you.
  0x92 = %10010010 = %1 + 0010010 = 128 + 18. 18 because the decoder adds 2 for you (18 + 2 = 20).

  It is important that you store the original file length with the crunched file, or your 
  decoder doesn't know when to stop. 

  If you have a file of total random data (like an already crunched file) you will
  add (filesize / 128) bytes to it. So, if your Word '95 with 3,845,120 bytes were
  total random data it would be enlarged by 30,040 bytes. But run the alg on uncompressed
  bitmaps and watch them shrink.

  Remark: IFF is a very good and simple-to-implement algorithm and it's a shame that
          the Amiga is dead nowadays. I see IFF only in the AIFF sound format. ;(
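
  For the concrete picture, here is a minimal decoder sketch in C for the
  scheme exactly as described above (function name and buffer handling are
  mine; the real IFF ByteRun1 packer differs in some details, so take it
  as an illustration, not as the Amiga original):

  #include <stddef.h>

  /* Decode the IFF-like RLE stream: telling byte with the sign bit as
     switch, runs of 2..129, literal blocks of 1..128.
     Returns the number of bytes written, or 0 on malformed input. */
  size_t rle_iff_decode(const unsigned char *src, size_t srclen,
                        unsigned char *dst, size_t dstlen)
  {
      size_t i = 0, o = 0;
      while (i < srclen) {
          unsigned char t = src[i++];
          if (t & 0x80) {                    /* sign bit set: a run   */
              size_t count = (t & 0x7f) + 2; /* decoder adds 2 for us */
              if (i >= srclen || o + count > dstlen)
                  return 0;
              unsigned char c = src[i++];
              while (count--)
                  dst[o++] = c;
          } else {                           /* literal bytes follow  */
              size_t count = (size_t)t + 1;  /* decoder adds 1 for us */
              if (i + count > srclen || o + count > dstlen)
                  return 0;
              while (count--)
                  dst[o++] = src[i++];
          }
      }
      return o;
  }

  Fed with the example above ({0x06}abcdefg{0x92}0) it writes abcdefg
  followed by twenty '0' bytes into dst.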


- one for eight
  We have a telling byte which is viewed as a row of eight status bits. 1 means we have
  a crunch, 0 means we have garbage (or the other way round - your choice).
  Crunch and garbage have unsigned counting bytes, giving us a range from 1 to 256.
  After we have worked thru all 8 bits, the next telling byte is read (thus
  forming a package, remember?).
  The crunched file starts with such a telling byte.
  One example:

  abcdefg00000000000000000000xyz111222333444abcdefg = 49 bytes

  We step thru the first 7 bytes and consider them Garbage. 
  The next 20 bytes are Crunch.
  Then 3 bytes Garbage.
  Then 3 bytes Crunch.
  Then 3 bytes Crunch.
  Then 3 bytes Crunch.
  Then 3 bytes Crunch.
  Then 7 bytes Garbage.

  So our telling byte would be: %0 1 0 1 1 1 1 0 = 0x5e.
  The counting bytes would be (decremented by one, remember): 
  0x06, 0x13, 0x02, 0x02, 0x02, 0x02, 0x02, 0x06

  The crunching algorithm can already start at two equal bytes, and we should do it:
  coded as garbage they would cost one counting byte plus those two bytes = 3 bytes,
  while coded as a crunch they cost only the counting byte plus the byte itself = 2 bytes.

  It is important that you store the original file length with the crunched file, or your 
  decoder doesn't know when to stop. 

  The whole package would look like this:
  {0x5e}{0x06}abcdefg{0x13}0{0x02}xyz{0x02}1{0x02}2{0x02}3{0x02}4{0x06}abcdefg = 31 bytes

  The IFF-coder would have produced:
  {0x06}abcdefg{0x92}0{0x02}xyz{0x81}1{0x81}2{0x81}3{0x81}4{0x06}abcdefg = 30 bytes

  While this algorithm is a little bit more complicated than the IFF alg, it has the advantage
  of having a range of 256 instead of 129 bytes. In big files this can lead to some significant
  savings.

  I lifted this algorithm from the bit level to the byte level. I saw both implementations and
  from my observations the byte level is much faster. On the bit level you read a telling BIT. 
  If it's a one (crunch) you read 8 bits as a counter, add one, read the next 8 bits as the char 
  and copy that decoded byte ? times into your destination buffer. 
  If it's a zero you read the next 8 bits 
  as a counter, add one, and, reading ? times 8 bits as chars, you copy the next ? bytes. It's the
  same mechanism, but implemented on bytes it's sooooo much faster. The bit level is just more
  pseudo-crypting, because you can't read anything anymore :-).

  If you have a file of total random data (like an already crunched file) you will
  add (filesize / 256) + ((filesize / 256) / 8) bytes to it. 
  So, if your Word '95 with 3,845,120 bytes were total random data it would be 
  enlarged by 15,020 + 1,878 = 16,898 bytes. 
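
  A matching decoder sketch in C (again my own sketch: I take the telling
  byte MSB-first so it fits the 0x5e example, and outlen is the stored
  original file length that tells us when to stop):

  #include <stddef.h>

  /* Decode the 'one for eight' stream: a telling byte carries eight
     status bits, 1 = crunch, 0 = garbage; each field is a counting byte
     (stored decremented by one, range 1..256) followed by one byte
     (crunch) or that many bytes (garbage).
     Returns 1 on success, 0 on malformed input. */
  int rle_1for8_decode(const unsigned char *src, size_t srclen,
                       unsigned char *dst, size_t outlen)
  {
      size_t i = 0, o = 0;
      while (o < outlen) {
          if (i >= srclen)
              return 0;
          unsigned char telling = src[i++];
          for (int bit = 7; bit >= 0 && o < outlen; bit--) {
              if (i >= srclen)
                  return 0;
              size_t count = (size_t)src[i++] + 1;  /* 1..256 */
              if (o + count > outlen)
                  return 0;
              if (telling & (1u << bit)) {          /* crunch  */
                  if (i >= srclen)
                      return 0;
                  unsigned char c = src[i++];
                  while (count--)
                      dst[o++] = c;
              } else {                              /* garbage */
                  if (i + count > srclen)
                      return 0;
                  while (count--)
                      dst[o++] = src[i++];
              }
          }
      }
      return 1;
  }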
  

- No Garbage
  This one is most interesting. As you saw above, one problem arising is the coding of garbage.
  This idea implements a way of RLE without the problems of garbage, with the
  disadvantage of only being able to crunch runs of 3 or more bytes instead of 2 like the
  others.
  The idea is that the decoder itself can recognize the runs, without garbage coding. If the decoder
  somehow recognizes that we have a crunch here, it just has to read a counter byte 
  and can start the copy routine. 

  Well, the easiest way of making the decoder recognize a byte-run is to let the run partially 
  STAY in the source. That means that we leave at least two bytes of the run in the source
  to make the decoder able to recognize a crunch. Then we add a counting byte, with a range from
  1 to 256, telling how many more of this byte will FOLLOW - and there are no garbage bytes anymore.
  1 to 256 because we only crunch starting at three equal bytes in a row, so the byte will at least 
  be copied once more, thus adding one to the range of a byte.
  In practice this means you will have a run with a certain length. When the length is more than
  two and shorter than 259 you subtract 3 from it. When you have a length of 3 you subtract 3, giving
  a zero as output. As the decoder will at least copy one byte, the run is perfect. If you have a run
  of 258 you subtract 3, giving 255 as output - exactly what fits into an unsigned char. As two
  bytes are coded the normal way and the decoder adds one to the counter, we have our 258 
  runlength encoded.

  One example:
  abcdefg00000000000000000000xyz111222333444abcdefg = 49 bytes

  abcdefg               is encoded as is
  00000000000000000000  is encoded as crunch: 20 - 3 = 17 bytes will follow, so 00{0x11}
  xyz                   is encoded as is
  111                   is encoded as crunch: 3  - 3 = 0  bytes will follow, so 11{0x00}
  222                   is encoded as crunch: 3  - 3 = 0  bytes will follow, so 22{0x00}
  333                   is encoded as crunch: 3  - 3 = 0  bytes will follow, so 33{0x00}
  444                   is encoded as crunch: 3  - 3 = 0  bytes will follow, so 44{0x00}
  abcdefg               is encoded as is

  makes altogether:

  abcdefg00{0x11}xyz11{0x00}22{0x00}33{0x00}44{0x00}abcdefg = 32 bytes


  If you have a file of total random data (like an already crunched file) you will
  add almost 0 (!) bytes to it. That's an important feature. This algorithm can be
  let loose on all data you know. In the worst case the output will be practically the same
  as the input and no harm is done.
  So, if your Word '95 with 3,845,120 bytes were total random data it would
  still be about 3,845,120 bytes.
  One thing to watch out for: a bare pair of equal bytes in the source is a special case.
  The decoder would take it for the start of a run and expect a counting byte, so the
  encoder has to deal with it (for example by shifting the counting convention by one,
  counter = 0 to 255 more bytes, and paying one extra byte per pair); strictly speaking,
  only data in which no two adjacent bytes are equal passes through completely unchanged.
  The disadvantage is of course the inability to crunch two consecutive bytes.
  But hey, you can't have it all, can you?
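
  Again a minimal decoder sketch in C, following the text's convention
  (stored counter + 1 more copies; it assumes the encoder always folds
  maximal runs and handles the bare-pair special case mentioned above):

  #include <stddef.h>

  /* Decode the 'no garbage' stream: bytes are copied verbatim; whenever
     two equal bytes in a row have just been copied, a counting byte
     follows and count+1 more copies are emitted (runs of 3..258).
     Returns the number of bytes written, or 0 on malformed input. */
  size_t rle_nogarbage_decode(const unsigned char *src, size_t srclen,
                              unsigned char *dst, size_t dstlen)
  {
      size_t i = 0, o = 0;
      while (i < srclen) {
          if (o >= dstlen)
              return 0;
          dst[o] = src[i++];
          o++;
          if (o >= 2 && dst[o - 1] == dst[o - 2]) { /* a pair: run ahead */
              if (i >= srclen)
                  return 0;
              size_t count = (size_t)src[i++] + 1;  /* 1..256 more       */
              if (o + count > dstlen)
                  return 0;
              while (count--) {
                  dst[o] = dst[o - 1];
                  o++;
              }
          }
      }
      return o;
  }

  Decoding abcdefg00{0x11}... from the example: after the two '0' bytes
  the decoder reads 0x11 = 17, copies 17 + 1 = 18 more zeros, and the
  run of 20 is restored.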


There are of course a lot more possibilities of coding RLE. I hope you get the idea.
Don't let the statistics deceive you. It strongly depends on the input data which
of the above-mentioned algorithms is best suited for the task. One good strategy 
would be to load about 10%-20% of the file (if it's big) and test-crunch it with 
ALL your algs. Then choose the best one and crunch the whole file. This will not
work on smaller files, where you should load the whole file into a buffer (say, 64K
or 128K or so). For most RLE algs speed is the most critical factor. Keep this 
always in mind when you invent your own RLE crunchers.
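
A sketch of that strategy in C (everything here is hypothetical glue:
the encoder functions are assumed to exist with this signature and to
return the crunched size, or 0 if they failed):

#include <stddef.h>

/* An RLE encoder: crunch src into dst, return the crunched size
   (0 = failed / didn't fit). The signature is an assumption of mine. */
typedef size_t (*rle_encoder)(const unsigned char *src, size_t srclen,
                              unsigned char *dst, size_t dstlen);

/* Test-crunch a sample with every alg and return the index of the one
   that produced the smallest output, or -1 if all failed. */
int pick_best_alg(const rle_encoder *algs, int nalgs,
                  const unsigned char *sample, size_t sample_len,
                  unsigned char *work, size_t work_len)
{
    int best = -1;
    size_t best_size = (size_t)-1;
    for (int a = 0; a < nalgs; a++) {
        size_t size = algs[a](sample, sample_len, work, work_len);
        if (size > 0 && size < best_size) {
            best_size = size;
            best = a;
        }
    }
    return best;
}

Call it with the first 10%-20% of a big file as the sample and your
IFF, one-for-eight and no-garbage encoders in algs, then crunch the
whole file with the winner.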

When I started writing little crunchers of my own I started with the DEcruncher.
That always worked fine for me. First think of a file and a possible coding. Write
it down. Create a file with this coding. Write a decruncher. And then think of a 
way to generate such a code. The central point of all crunch algorithms will be
the pointers into your source buffer and into internal arrays and/or some
history tables. Try it. It's easier than you may think right now.


Eh, I have observed something important, I think.

What is it, Watson?

In your example (abcdefg00000000000000000000xyz111222333444abcdefg) the 
sequence 'abcdefg' happens to appear two times. After having observed this
I examined some files and texts and I realized that sequences ('and', 'or', 'in'...)
(re)appear a lot more often in files than equal-byte runs do. 
Can we crunch these sequences, too?

Oh yes, Watson, we can! Jacob Ziv and Abraham Lempel published papers dealing
exactly with this problem in 1977 and 1978. The theories described there are
the basis of what we today call the LZxxx algorithms. Programs like
UNIX compress or ZIP or RAR are based on these theories. But this is a 
little bit more complicated and better dealt with in a chapter of its own.



I hope you enjoyed this chapter. 
There will be more - I promise.
Just have a little patience.

Greetings from a programmer

Joa


