papers
+HCU papers

Little essay about the various methods and viewpoints of crunching
~ version June 1999 ~
by Joa ----- Part. VIII (Burrows - Wheeler - Transformation (BWT)

Courtesy of fravia's page of reverse engineering

Well, Joa continues his fundamental paper on crunching, this is part VIII
enjoy!

12 December 98 Joa ~ crunchi1.htm Little essay about the various methods and viewpoints of crunching papers ~fra_0126
10 June 98 Joa ~ crunchi2.htm Little essay about the various methods and viewpoints of crunching II papers ~fra_0129
17 June 98 Joa ~ crunchi3.htm Little essay about the various methods and viewpoints of crunching III papers ~fra_012E
17 June 98 Joa ~ crunchi4.htm Little essay about the various methods and viewpoints of crunching IV papers ~fra_012F
17 June 98 Joa ~ crunchi5.htm Little essay about the various methods and viewpoints of crunching V papers ~fra_012F
17 June 98 Joa ~ crunchi6.htm Little essay about the various methods and viewpoints of crunching VI papers ~fra_012F
17 June 98 Joa ~ crunchi7.htm Little essay about the various methods and viewpoints of crunching VII papers ~fra_xxxx


Little essay about the various methods and viewpoints of crunching.

Part VIII: Burrows - Wheeler - Transformation (BWT)



By Joa Koester (JoKo2000@HotMail.Com)
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay etc. - drop me a line.
hello, hello, well, shave my legs and call me Grandpa if that wasn't a long delay, but i still have some serious difficulties accessing the Internet recently. I hope that i can access you more frequently in the near future. And to sweet up the time till then a little bit here my next essay about crunching methods: THE BURROWS - WHEELER - TRANSFORMATION (tataaa :) As we discussed last time the mechanism of Arithmetic Crunching some of you asked me about the source (the input of the cruncher) and that sometimes it would take the cruncher a long time to adjust to the source data. Yes, there is a problem with this kind of crunching: the source itself should be transformed to be better crunchable. But how do we transform a data to be better crunchable with a fast algorithm? Well, the arithmetic cruncher would love to crunch data like (all hex): 00, 01, 05, 02, 08, c2, 03, 01, c0, 0a, 12, 01, 02.... but how do we transform a sequence like {0a, f2, f2, 3c, 0a, 77, 44, 0a} into such cruncherfriendly data? Well, luckily for us, there is an algorithm available for us that does exactly what we need. This alg is called Move-To-Front (MTF). The MTF-Algorithm MTF is - once you had the idea - a pretty simple algorithm. It transforms the input data into some other data, it is reversible and it is pretty fast. The basic idea is that we don't output the data itself but the index of the char in some table (in our case the table has the usual possible 256 entries). After the output we save the byte, shift the rest below one position up and MOVE the saved byte TO the FRONT of the table. A simple char-Table will show you what i mean: Let's presume the input is {0a, f2, f2, 3c, 0a, 77, 44, 0a} That's 8 entries. But we have only 5 distinct values. As we want a reference table, we only need the distinct values, so for the sake of our example we take a reference table with only 5 entries (which are - for the example - filled with the distinct input values as starting values. In a real case the table would have 256 entries filled with the 256 possible values for start): Table Value [ ] 0 0a 1 f2 2 3c 3 77 4 44 Now let's transform the first byte: 0a We search our table for the value 0a and we find it at offset 00. And that's our output: 00. Now we shift the rest one position upwards and move the 0a at offset 00 to offset 00. Luckily for us the loop terminates immediately and our table looks now exactly like before: Table Value [ ] 0 0a 1 f2 2 3c 3 77 4 44 And now the second byte: f2 f2 is found at position 01 and this is our output: 01. Now we save the f2 at position 01, move all other entries below the position 01 one up and move the saved f2 to position 00. Let's watch the manipulation of the table in slow motion: Table Value Table Value Table Value [ ] [ ] [ ] 0 0a f2 is saved. Copy 0 0a insert the -> 0 f2 1 f2 1 0a saved f2 at 1 0a 2 3c one position up: 2 3c offset 00: 2 3c 3 77 3 77 3 77 4 44 4 44 4 44 The next byte is again: f2 As we find it at position 00 our output is 00. Our moving loop terminates immediately and now let's have a look at our table and our output so far: Table Value [ ] 0 f2 1 0a 2 3c 3 77 4 44 Output so far: 00, 01, 00 Oh, ah, fascinating. Our input has surely been transformed into something more likely to be crunched by an Arithmetic Cruncher system. Let's continue with the next bytes: Inputbyte is 3c 3c is found at offset 02. So we save the 3c at offset 02, move the rest below position 02 one up and move 3c to offset 00: Table Value Table Value Table Value [ ] [ ] [ ] 0 f2 3c is saved. Copy 0 f2 insert the -> 0 3c 1 0a the entries below 1 f2 saved f2 at 1 f2 2 3c 2 0a offset 00: 2 0a 3 77 4 77 4 77 4 44 5 44 5 44 Output so far: 00, 01, 00, 02 The next input byte is 0a. It is found at offset 02 which is our output. Now we move the 0a at position 02 to the front: Table Value Table Value Table Value [ ] [ ] [ ] 0 3c 0a is saved. Copy 0 3c insert the -> 0 0a 1 f2 the entries below 1 3c saved 0a at 1 3c 2 0a 2 f2 offset 00: 2 f2 3 77 4 77 4 77 4 44 5 44 5 44 Output so far: 00, 01, 00, 02, 02 Now for the next input byte: 77. 77 is found at offset 03 which is output. This is how our table is altered: Table Value Table Value Table Value [ ] [ ] [ ] 0 0a 77 is saved. Copy 0 0a insert the -> 0 77 1 3c the entries below 1 0a saved 77 at 1 0a 2 f2 one position up: 2 3c offset 00: 2 3c 3 77 <4 f2 4 f2 4 44 5 44 5 44 Output so far: 00, 01, 00, 02, 02, 03 After the next input byte (44) our output is 04 and our table looks like this: Table Value [ ] 0 44 1 77 2 0a 4 3c 5 f2 Output so far: 00, 01, 00, 02, 02, 03, 04 After the final input byte (0a) our output is 02 and our table looks like this: Table Value [ ] 0 0a 1 44 2 77 4 3c 5 f2 Output so far: 00, 01, 00, 02, 02, 03, 04, 02 As you can see these output bytes are far more crunchable than the original bytes. But is it really reversible? Let's retransform the first few bytes together. At first we build the basic transformation table which is the same as the cruncher uses (remember how cruncher and decruncher have to use the same model?): Table Value [ ] 0 0a 1 f2 2 3c 3 77 4 44 Again, in a real version the table would be filled with the standard 256 possible values of the ASCII-TABLE. Now let's retransform the first byte: 00 We look for the 00th byte in our table, find a 0a and output it. Then we take the 00 as starting point for the shifting and moving and proceed as we already know from the first pass. As the loop terminates immediately we can continue with the next input byte. The next input byte is a 01. Now, at offset 01 we find a f2, output it and proceed: Save the char at offset 01, shift all entries below offset 01 one position up and move the saved byte into position 00. After that our table looks like this: Table Value [ ] 0 f2 1 0a 2 3c 3 77 4 44 As you can already see the table looks exactly like the table when we had our first pass and transformed the char f2. If we would continue to scan thru the rest of the input (the former output) we would see that the transformation table would be exactly like the one in the first pass. So, let's recapitulate: the MTF-Algorithm is fast, easy to implement and a good preprocessor for Huffman or (even better) Arithmetic Crunching. It's also a nice basic idea for crypting data. Yes, but, ehm... Yes, Watson? Well, i thought, that this essay was about the Burrows-Wheeler-Transformation, not about the MTF-algorithm. You are right. But the authors of the BWT, Michael Burrows and David Wheeler recommended that data should be crunched in such an order: Data -> (maybe a simple RLE cruncher ->) BWT -> MTF -> Huffman or Arithmetic Cruncher Ah, i see. Yes, and now as the basic premises are all clear let's examine the BWT. What is the BWT? Well, a short answer would be: an alg that sorts data, saves a certain aspekt of the sorted data, saves a starting index and that's it. What? No compression? No, Watson, no compression. The transformation is just what it is: A transformation. But think about it for a moment. A reversible transformation has a big advantage for designers: You can inspect and/or manipulate the transformed data in another way than you can with the original data. Take the MTF for example. It transforms the input data into some other data. But this other data is in most cases much better crunchable than the original data. And as the process is completely reversible you don't have any disadvantage. You could write a special cruncher sensitive to the output of MTF processes with which you could achieve better results than achieved with normal data. You could say that this kind of transforming data, manipulating that other data and retransforming the data to the original kind is some form of indirection. Indirection is very crucial for programmers (and philosophers); it enables you to solve and discuss problems on a much more comfortable level than the original problem stated. Keep this in your mind: if you have some problems, viewing the problem from a differenct angle gives you much more possibilities than trying to solve the problem with brute force. But back to the BWT. So, the BWT is just some kind of indirection. And it has to be reversible. But how does this work together with sorting? Well, the BWT algorithm gives you the possibility to unsort your data... Yes, you read right. With BWT you can unsort the saved data you constructed by BWT. OK, the data you save itself is not sorted, but is has some interesting features: 1. It has in some way the FOLLOWING context of the original data. 2. It enables you to unsort the data and rebuild the original data. 3. The bigger the chunks of data you sort, the better your results for the crunching stage. So, speedwise it's a matter of how fast your sorting algorithm is. crunchwise it's a matter of how much memory you have. So, for example let's transform the word CRACKER. To do this we need an array to all possible shifted forms of the word. If you take the word CRACKER, it has 7 letters. So we can shift it 7 times, from shift 0 to shift 6. The result is something like this: CRACKER RACKERC ACKERCR CKERCRA KERCRAC ERCRACK RCRACKE How do arrange this? Well, one easy way would be to duplicate the source data in RAM and just install enough charpointers (7 in that case). CRACKERCRACKER S0 ^^^^^^^ S1 ^^^^^^^ S2 ^^^^^^^ S3 ^^^^^^^ S4 ^^^^^^^ S5 ^^^^^^^ S6 ^^^^^^^ When we later would sort the data we would have a very simple compare logic. Another way would be just 7 charpointers and the compare part of the sort logic would have to wrap around if it reached the end of the block: S0 CRACKER S1 RACKERC S2 ACKERCR S3 CKERCRA S4 KERCRAC S5 ERCRACK S6 RCRACKE I assume the latter for now. But it's a matter of taste, nothing else. However, what we do next is to sort the data. As pointers we take our 7 charpointer and we should only swap the string pointers as swapping some 100 KByte could take a looong time if done some 100 times :). Again: if you sort that way i do, your compare function has to recognize the end of the block to wrap around. Otherwise your sorting will fail. (You should save the pointer of S1, as you will need it later.) So, after sorting our new set of strings looks like this: S2 ACKERCR S3 CKERCRA S0 CRACKER S5 ERCRACK S4 KERCRAC S1 RACKERC S6 RCRACKE Now, what next? Ok, let's take a closer look to the first and last column of the strings: S2 A CKERC R S3 C KERCR A S0 C RACKE R S5 E RCRAC K S4 K ERCRA C S1 R ACKER C S6 R CRACK E The first column ist SORTED. Well, we can expect it to be sorted, as we have just sorted the data :) And what about the last column? Well, the least we can say - as we are having a wrap-around look - is - that the first and the last column both contain the word (of course) in some more or less strange order and - the letter in the last column is the prefix letter of the letter in the first column in the same row. Now, what we save to disk is the LAST column and an index. The index is the position of the first letter of the original word in the saved data (our last column). When we rotate the last column by 90 degrees it looks like: RARKCCE. And the 'C' of CRACKER is at position 5 (zero-based). It is important to take the correct offset of the correct 'C' as it is the starting point for the re-transformation. There is an easy way to find the right 'C'. There are two 'C's in our little example. But we need the 'C' which starts the word CRACKER. If you look at the table, it is the 'C' of the last column of S1. Why? Because S0 is the original (shifted 0 times). And S1 is the original shifted once: S0 C RACKE R S1 R ACKER C That's why i'd save the pointer S1. As we want the first char of the word, but have to use the last column, we only can take the char in the last colum of S1. So, all, we have to do is to count thru our charpointers until we reach S1: 0: S2 A CKERC R 1: S3 C KERCR A 2: S0 C RACKE R 3: S5 E RCRAC K 4: S4 K ERCRA C 5: S1 R ACKER C attachment to this essay (and if fravia+ allows so :). If you find the attachment, take your compiler and compile the progs. The correct sequence for crunching / decrunching is: Crunch: SourceProg -> RLE -> BWT -> MTF -> ARI -> CrunchProg Decrunch: CrunchProg -> UNARI -> UNMTF -> UNBWT -> UNRLE -> SourceProg. Anyway, as always i'm open to suggestions, questions etc. Have just a little patience, that's all (maybe i should get a private INet-Access anyway :). Ok, see you next time Greetings from a programmer Joa




You are deep inside fravia's pages of reverse engineering, choose your way out!

redhomepage red+ORC redanonimity academy redcounter measures redbots' wars
redtools redour tools redhow to use our tools
redjavascript wars redreality cracking redacademy database redprogrammer's corner redhow to protect better
redantismut CGI-scripts redcocktails redsearch_page redhow to search redmail_fravia+
redIs reverse engineering legal?