Data compression == AI??
Over the past week I've been working on implementing LLMZip, a new algorithm for compressing text using an LLM. It gives each token a score based on how likely the LLM thought that token was to come next: the better the prediction, the lower the score. Those scores are generally low, since the LLM is decent at its job, and a file full of small numbers is easy to compress. So far in practice I'm getting output that is 60% of the size it would be without the LLM encoding.
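To make the idea concrete, here is a minimal sketch in Python. It is not the LLMZip implementation. It assumes the "score" is the rank of the actual next symbol in the model's sorted list of guesses, it works on characters rather than tokens, and it swaps the LLM for a toy character-pair counter. The names `build_model`, `ranked_guesses`, `encode` and `decode` are made up for illustration, as is the tiny corpus.

```python
import zlib
from collections import Counter, defaultdict


def build_model(training_text):
    """Toy stand-in for the LLM: count which character follows which."""
    follow = defaultdict(Counter)
    for prev, nxt in zip(training_text, training_text[1:]):
        follow[prev][nxt] += 1
    return follow


def ranked_guesses(model, prev, alphabet):
    """Every symbol in the alphabet, most likely first, given the previous one."""
    counts = model.get(prev, {})
    # Ties are broken alphabetically so encoder and decoder always agree.
    return sorted(alphabet, key=lambda ch: (-counts.get(ch, 0), ch))


def encode(text, model, alphabet):
    """Replace each character by its rank in the model's guesses (0 = top guess)."""
    ranks, prev = [], ""
    for ch in text:
        ranks.append(ranked_guesses(model, prev, alphabet).index(ch))
        prev = ch
    return ranks


def decode(ranks, model, alphabet):
    """Invert encode() by replaying the same predictions."""
    out, prev = [], ""
    for rank in ranks:
        prev = ranked_guesses(model, prev, alphabet)[rank]
        out.append(prev)
    return "".join(out)


# Hypothetical training text and message, chosen only to keep the demo small.
corpus = "the quick brown fox jumps over the lazy dog and then the other dog " * 20
message = "the dog jumps over the other fox and then the lazy fox "

alphabet = sorted(set(corpus))
model = build_model(corpus)
ranks = encode(message, model, alphabet)

print("ranks:", ranks)
print("zlib on raw text:", len(zlib.compress(message.encode())), "bytes")
print("zlib on ranks:   ", len(zlib.compress(bytes(ranks))), "bytes")
```

The decoder needs exactly the same model as the encoder, which is why the scheme is lossless: both sides replay the same predictions and only the ranks travel. How much smaller the output gets depends entirely on how good the predictor is, so the byte counts from this toy say nothing about what a real LLM achieves.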
What surprised me is that there is a whole body of work connecting AI and information theory, and in light of it this success is not surprising at all. There is a prize called the Hutter Prize for efficient compression of a large subset of Wikipedia. The competition is specifically aimed at driving improvements in the state of the art in artificial intelligence, on the theory that to be able to compress and decompress Wikipedia you have to, in some real sense, actually know Wikipedia.
In Claude Shannon's The Mathematical Theory of Communication, in order to demonstrate how language can be seen as a statistical system, he goes through successive approximations to English, starting with assuming that all characters are equally likely:
Zero-order approximation (symbols independent and equiprobable).
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
First-order approximation (symbols independent but with frequencies of English text).
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
Second-order approximation (digram structure as in English).
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
Third-order approximation (trigram structure as in English).
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
He then carries on to word-level approximations (word frequencies, then word pairs), etc. His point is that with each successive approximation we know more about which symbol is coming next, so we can encode our data to take advantage of that. For instance, when designing Morse code, Morse took advantage of the letter frequencies of English, giving the shortest code (a single dot) to the most common letter, E, and so on. This makes messages smaller, on average, than they would be if the code were indifferent to the frequencies of letters.
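The saving from knowing letter frequencies can be put in numbers. The sketch below compares the bits per symbol needed when every symbol is treated as equally likely (Shannon's zero-order case) with the entropy of the actual symbol frequencies (his first-order case). The function name `entropy_bits_per_symbol` and the sample sentence are my own choices for illustration, and a sample this short only gives a rough estimate.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(text):
    """Shannon entropy of the symbol frequencies in text, in bits per symbol."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical sample; any reasonably long English text would do.
sample = (
    "there is a lot of extra padding that can be squeezed out "
    "if you want to make your data smaller in order to transmit it or store it"
)

# Zero-order: all distinct symbols equally likely.
uniform_bits = math.log2(len(set(sample)))
# First-order: symbols independent, but weighted by how often they occur.
weighted_bits = entropy_bits_per_symbol(sample)

print(f"equiprobable symbols: {uniform_bits:.2f} bits per symbol")
print(f"actual frequencies:   {weighted_bits:.2f} bits per symbol")
```

The second figure can never be larger than the first, and the gap between them is what a frequency-aware code like Morse's gets to exploit.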
Another way of putting this is that English is redundant. There is a lot of extra padding that can be squeezed out if you want to make your data smaller in order to transmit it or store it.
When you look at the successive approximations to English, it's striking that this is exactly the same approach that people take in natural language processing to artificially generating language. Shannon gets as far as Markov chains, which were state of the art for years. Now we have neural networks (RNNs and transformers) that can do the same thing, predicting/generating the next word, to a much higher degree of accuracy.
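Shannon's approximations are easy to reproduce with a character-level Markov chain. The following is a rough sketch rather than anything from his paper: `build_chain` and `generate` are hypothetical names, the training text is a placeholder, and `order` here means the number of characters of context, so `order=0` corresponds to his first-order approximation, `order=1` to the digram case, and `order=2` to the trigram case.

```python
import random
from collections import defaultdict


def build_chain(text, order):
    """Map each run of `order` characters to the characters seen right after it."""
    chain = defaultdict(list)
    for i in range(len(text) - order):
        chain[text[i:i + order]].append(text[i + order])
    return chain


def generate(chain, order, length, seed_text, rng):
    """Extend seed_text one character at a time by sampling from the chain."""
    out = seed_text
    while len(out) < length:
        context = out[len(out) - order:]
        followers = chain.get(context)
        if not followers:
            break  # this context only appeared at the very end of the training text
        # Followers are stored with repeats, so choice() is frequency-weighted.
        out += rng.choice(followers)
    return out


# Placeholder training text; a real experiment would want far more of it.
training = (
    "another way of putting this is that english is redundant "
    "there is a lot of extra padding that can be squeezed out "
    "if you want to make your data smaller in order to transmit it or store it "
)

rng = random.Random(0)
for order in (0, 1, 2, 3):
    chain = build_chain(training, order)
    print(order, generate(chain, order, 60, training[:order], rng))
```

With more context the output looks more like English, which is the same effect as in Shannon's samples. A neural language model does the same job, except that it computes the next-symbol distribution instead of looking it up in a table of counts.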
What's strange, though, is that this simple task, "predict the next word", is so powerful.
Here's an explanation of this from Blaise Agüera y Arcas:
Consider what it takes for the model to learn how to predict blanked-out portions of the following sentence from Wikipedia:
"Mount Melbourne is a 2,733-metre-high (8,967 ft) ice-covered stratovolcano in Victoria Land, Antarctica, between Wood Bay and Terra Nova Bay […] The volcano is uneroded and forms a cone with a base area of 25 by 55 kilometres (16 mi × 34 mi)."
If a word like “volcano” were blanked out, this would be a test of reading comprehension (What are we talking about? A kind of volcano). If “cone” were blanked out, it would be a test of general knowledge (Are volcanoes shaped like cubes, spheres, cones, something else?). If “Mount Melbourne” were blanked out, it would be a test of specialized knowledge (in this case, of esoteric geography). If “25 by 55” were blanked out, it would be a test of unit conversion knowledge and basic arithmetic. In short, one can see how pretraining on general texts like Wikipedia forces the model to learn a great deal about both language and about the world.
This all remains pretty deeply mysterious to me. The information theory/AI connection is definitely something I'm going to be thinking about.