Finally got a chance (after 5 months!) to write something less formal and talk about my personal thoughts on the gzip paper
I realized our paper has received some attention, with most reactions on two extreme sides. I personally think our paper is similar to most papers in that it has both inspirational parts and limitations. So I really appreciate people like Sebastian and Abhinav who introduced the idea in a responsible way.
tl;dr
In this blog, I will briefly recap the main idea, discuss the tie-breaking strategy and the added results, go through a practical guide and the limitations, and share my personal thoughts on future work.
This might be clearer with some text examples:
x_1 = “US’s Stark Industries has developed a 12-gram flying microrobot.”
x_2 = “The latest tiny flying robot has been unveiled in US.”
x_3 = “Quicksilver won the gold medal in the 400 individual medley.”
If we only compress x_1, x_2, and x_3 individually, the compressed lengths alone tell us little about how related they are. Now what about compressing concatenations like x_1x_2 and x_1x_3?

Using len(gzip.compress(x.encode('utf-8'))) as the compressed length C(x), we get C(x_1x_2) - C(x_1) < C(x_1x_3) - C(x_1).

Why is the first difference smaller? Because the information in x_2 largely overlaps with that in x_1 (both describe a tiny flying robot), gzip can reuse the patterns it has already seen in x_1 when compressing x_2; x_3, in contrast, shares almost nothing with x_1.
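In code, that check looks like this (a minimal sketch; the exact byte counts depend on the gzip implementation, but the first difference should come out smaller):

```python
import gzip

x1 = "US's Stark Industries has developed a 12-gram flying microrobot."
x2 = "The latest tiny flying robot has been unveiled in US."
x3 = "Quicksilver won the gold medal in the 400 individual medley."

def clen(s: str) -> int:
    """Compressed length C(s) in bytes."""
    return len(gzip.compress(s.encode('utf-8')))

# Extra bytes needed for x2 / x3 once x1 is already in the stream:
print(clen(x1 + " " + x2) - clen(x1))  # smaller - x2 overlaps with x1
print(clen(x1 + " " + x3) - clen(x1))  # larger - x3 shares little with x1
```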
But you may also notice that only using the compressed length of a concatenation like C(x_1x_2) is problematic: it is dominated by the absolute lengths of the strings, so it cannot directly serve as a distance metric across objects of different sizes.

This is the motivation for a formally defined metric, Information Distance.^{[3]}
Information Distance builds on Kolmogorov complexity: K(x) is the length of the shortest binary program that outputs x,^{[4]} and K(x|y) is the length of the shortest binary program that outputs x when given y. Information distance E(x, y) = max{K(x|y), K(y|x)} is then the length of the shortest program that converts between the two objects. For example, for two images that are simple transformations of each other, one of the programs that can convert one to the other can be just the few instructions describing that transformation, no matter how large the images themselves are.

Information Distance is a desirable distance metric, as among all "reasonable"^{[5]} distance metrics, it is minimal (up to a constant): whenever any admissible metric considers two objects close, information distance does too.
Since Kolmogorov complexity is uncomputable, so is information distance. To have a computable version of information distance, Normalized Compression Distance (NCD) is defined by replacing K with the compressed length C(x) produced by a real compressor:

NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
With the help of NCD, we can calculate the distance between any two objects. So when we have a test data point, we can compute its NCD to every training sample and let a k-nearest-neighbor (kNN) classifier pick the label - no parameters, no training.
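Putting the pieces together, here is a condensed sketch of the method (simplified from the paper; the official implementation lives in the npc_gzip repo, and the tie handling below is deliberately naive - more on that next):

```python
import gzip
from collections import Counter

def clen(s: str) -> int:
    return len(gzip.compress(s.encode('utf-8')))

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance via gzip."""
    cx, cy = clen(x), clen(y)
    return (clen(x + " " + y) - min(cx, cy)) / max(cx, cy)

def classify(test: str, train: list[tuple[str, str]], k: int = 2) -> str:
    """Majority vote among the k training samples nearest to `test`
    under NCD. `train` is a list of (text, label) pairs."""
    nearest = sorted(train, key=lambda pair: ncd(test, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # ties broken arbitrarily here!
```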
There have been a lot of discussions about how ties are broken in the kNN step.^{[7]}

Traditionally in kNN, the label is decided by a "majority vote" among the k nearest neighbors, and even the choice of k is a perennial discussion.^{[6]} When we use "majority vote", there are chances we have different classes with the same number of votes. For example, when k=2 and the two nearest neighbors carry different labels, each label gets exactly one vote.
The simplest strategy is random - just pick among the tied classes randomly (this can be set via the rand=True flag in the original codebase). As you can imagine, the accuracy with random will be unstable, depending on how lucky we are.
This becomes trickier under few-shot settings, as the random draw of training samples adds even more randomness.
What should we do if we'd like a more stable accuracy? Increase the number of trials. Currently we have the repeated random strategy: repeat each experiment several times with fresh randomness and report the aggregate.
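For instance, a tiny helper along these lines summarizes repeated trials (run_trial is a hypothetical stand-in for one full few-shot experiment under a given seed; the 95% interval assumes rough normality):

```python
import statistics

def repeated_accuracy(run_trial, n_trials: int = 5) -> tuple[float, float]:
    """Run an experiment under several seeds and summarize the spread.
    `run_trial(seed)` is a hypothetical callable returning one accuracy."""
    accs = [run_trial(seed) for seed in range(n_trials)]
    mean = statistics.mean(accs)
    # 1.96 * standard error as a rough 95% confidence half-width
    half_width = 1.96 * statistics.stdev(accs) / len(accs) ** 0.5
    return mean, half_width
```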
So in the paper we report the result with a more deterministic strategy - maximum, which assumes we are super lucky and always guess the label correctly when there is a tie. In other words, this is the upper bound of the accuracy we can get with gzip+kNN. We also use the maximum strategy for the other methods utilizing kNN, so the comparison stays consistent.
Below is the logic of the accuracy calculation, which was thought to be a bug that amounts to calculating top-k accuracy.^{[7]} I will explain why the code doesn't calculate top-k accuracy.
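The snippet itself is in the repo; here is a paraphrase of its scoring behavior for the three strategies (my reconstruction of the logic described above, not the verbatim npc_gzip code):

```python
import random
from collections import Counter

def scored_correct(knn_labels: list[str], truth: str,
                   strategy: str = "maximum") -> bool:
    """Does one prediction count as correct, given the labels of the
    k nearest neighbors? Paraphrased tie-breaking logic."""
    votes = Counter(knn_labels)
    top = max(votes.values())
    tied = [label for label, v in votes.items() if v == top]
    if strategy == "random":    # rand=True: pick among tied classes randomly
        return random.choice(tied) == truth
    if strategy == "maximum":   # always lucky: correct if truth is in the tie
        return truth in tied
    if strategy == "minimum":   # always unlucky: needs an outright win
        return len(tied) == 1 and tied[0] == truth
    raise ValueError(f"unknown strategy: {strategy}")
```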
What is the definition of top-k accuracy? It computes the number of times the correct label is among the top k labels predicted (ranked by predicted scores).^{[8]} [Def.1]
Now recall that discriminative models like BERT output a score for every class before picking the most probable one. Take ternary classification as an example: suppose we have only three classes - tech, sports, and politics - then each model gives a predicted probability per class, like below:
BERT:
tech: 0.73
sports: 0.24
politics: 0.03
While for models like gzip+kNN, all we have are the labels of the k nearest neighbors (here k=3):

kNN:
tech
tech
sports
Suppose the ground truth is sports and we are calculating top-3 accuracy. For BERT, we will score every time: with only three classes, the ground truth is always somewhere in the top 3 scores. But for gzip+kNN, there is no score ranking over all classes to take a "top 3" from - the majority vote above is tech, so under the maximum strategy this sample is simply wrong. The difference between discriminative models like BERT and gzip+kNN is that the closest thing the latter can report is:

The percentage of times the ground truth is among the nearest k labels. [Def.2]
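To make the contrast concrete, here is the example above in code (sklearn's top_k_accuracy_score implements [Def.1]; the class order politics, sports, tech is assumed for the score vector):

```python
from collections import Counter
from sklearn.metrics import top_k_accuracy_score

# [Def.1] for BERT: with only 3 classes, top-3 accuracy is trivially 1.0
# (sklearn even warns that k >= n_classes yields a perfect score).
y_true = [1]                    # index of "sports"
y_score = [[0.03, 0.24, 0.73]]  # politics, sports, tech
print(top_k_accuracy_score(y_true, y_score, k=3, labels=[0, 1, 2]))  # 1.0

# gzip+kNN under "maximum": tech has an outright majority among the
# 3 nearest labels, so the prediction is tech and this sample is wrong.
neighbors = ["tech", "tech", "sports"]
print(Counter(neighbors).most_common(1)[0][0] == "sports")  # False

# [Def.2]: is the ground truth among the nearest k labels at all?
print("sports" in neighbors)  # True - different from both of the above
```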
[Def.2] is clearly different from [Def.1], the definition of top-k accuracy - in kNN there are no predicted scores to rank. But why is it thought to be calculating top-k accuracy? In summary, when k=2, the maximum strategy marks a sample correct exactly when the ground truth appears among the 2 nearest labels, which coincides with [Def.2] and superficially resembles "top-2 accuracy".
Although the reported accuracy is not top-k accuracy, reporting maximum alone is not enough; including random and minimum (the pessimistic counterpart, which assumes we always guess wrong on a tie) provides a more holistic point of view. So here is the updated table that I posted during the discussion:
Results for W2V and SentBERT under the random strategy are also included for comparison.
So if we only compare the results under random, we can see that under 5-shot settings, gzip (rand) outperforms all methods on Kinyarwanda, Kirundi, Filipino, and SogouNews, but is worse than mBERT on Swahili, on which mBERT has been pretrained. In the full-dataset setting, gzip (rand) is only highest on SogouNews, is competitive on Kinyarwanda and Swahili, and is bad on Kirundi and the updated DengueFilipino.
Side note: I also received comments saying it's unfair to compare BERT to gzip because BERT has not been pretrained on low-resource languages... That made me think "fair comparison" is a relatively subjective standard - why does no one mention it's unfair to compare gzip to BERT, which has been pretrained on billions of tokens that gzip and the other methods have no access to? Probably because BERT is so conveniently accessible and is a standard in all kinds of benchmarks. So to me it is more like an unspoken academic norm than veritas. Other norms include, but are not limited to, "you are not expected to use commercial/proprietary models like ChatGPT when doing evaluation in academic papers". I'm not trying to self-defend here - as I mentioned, it was my mistake not to include the results of random and minimum in the paper - but I do think there tends to be bias towards what's fair and what's unfair based on certain norms in general.
But practically, when do we want to use gzip+kNN?
Our paper only explores “topic classification”, which is just a subset of text classification. So the findings mentioned in the paper may not be extrapolated to other tasks like sentiment analysis and natural language inference.
As for the limitations of the method itself, the top concern is speed: classifying one test sample requires compressing its concatenation with every training sample, so the number of NCD computations grows with both the training and the test set size. Lempel-Ziv Jaccard Distance was proposed exactly as a faster alternative to NCD for large sequences.^{[9]}
Beyond the method of gzip+kNN, there is another compressor-based route: compress per class - concatenate the training samples of each class and assign a test sample to the class whose text helps compress it best.
The drawback of this method is that its effectiveness depends on the "search space" of the compressor (e.g., the window size for gzip). Compressors with a smaller search space may fail to utilize all the information in the training samples. But better compressors and implementations can overcome this. For example, the repo FTCC uses a better compressor and implementation than what I did in the paper.
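A minimal sketch of that per-class idea (my paraphrase using gzip; FTCC itself relies on a different, faster compressor and a more careful implementation):

```python
import gzip

def clen(s: str) -> int:
    return len(gzip.compress(s.encode('utf-8')))

def classify_per_class(test: str, train: dict[str, list[str]]) -> str:
    """Assign `test` to the class whose concatenated training text helps
    compress it best. Note gzip only 'sees' a ~32KB window, so most of a
    large class_text is wasted - exactly the search-space drawback above."""
    best_label, best_cost = None, float("inf")
    for label, texts in train.items():
        class_text = " ".join(texts)
        cost = clen(class_text + " " + test) - clen(class_text)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label
```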
Possibilities!
As you can see, gzip is just a tool that helps us to approximate information distance. The generalized version of this method is plotted below:
This framework is called Non-Parametric learning with Compression (NPC). Both the "Compressor" and the "Aggregation" parts are replaceable. We can even use neural compressors, which I illustrate in more detail in my other blog, as well as other aggregation methods.
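In code, the skeleton of NPC might look like this (a sketch of the abstraction only, using gzip/bz2/lzma from the stdlib as interchangeable compressors and kNN over NCD as one possible aggregator):

```python
import bz2
import gzip
import lzma
from typing import Callable

# Any bytes -> bytes compressor can play the "Compressor" role.
COMPRESSORS: dict[str, Callable[[bytes], bytes]] = {
    "gzip": gzip.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

def make_ncd(compress: Callable[[bytes], bytes]) -> Callable[[str, str], float]:
    """Build an NCD function on top of an arbitrary compressor."""
    def clen(s: str) -> int:
        return len(compress(s.encode("utf-8")))
    def ncd(x: str, y: str) -> float:
        cx, cy = clen(x), clen(y)
        return (clen(x + " " + y) - min(cx, cy)) / max(cx, cy)
    return ncd

# Swap the compressor without touching the aggregation (e.g., kNN) at all:
ncd_bz2 = make_ncd(COMPRESSORS["bz2"])
```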
That’s not all - we can even go beyond this framework. What we want is to approximate information distance, but we never approximated it directly; instead, we approximated its components with compressors that were optimized for compression, not for the distance itself. Although information distance is not differentiable, we have numerous methods for dealing with non-differentiability, so another direction to explore is to optimize (an approximation of) information distance directly. After all, whether it's a sentence-transformer or a siamese network, the fundamental principle is the same: learn representations guided by a similarity metric.
[1] Jiang, Zhiying, et al. ““Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors.” Findings of the Association for Computational Linguistics: ACL 2023. 2023.
[2] https://aclanthology.org/2023.findings-acl.426.pdf#page=16
[3] Charles H Bennett, Péter Gács, Ming Li, Paul MB Vitányi, and Wojciech H Zurek. Information distance. IEEE Transactions on Information Theory, 44(4):1407–1423, 1998.
[4] Some people may ask whether the language matters. According to invariance theorem, for any given description language, the optimal one is at least as efficient as the given description language with some constant overhead.
[5] Technically, we should call it “admissible”, which basically constrains the distance metric to be meaningful (e.g., it rules out degenerate metrics like D(x, y) = 1 for every pair of distinct x and y).
[6] https://stackoverflow.com/questions/11568897/value-of-k-in-k-nearest-neighbor-algorithm
[7] https://github.com/bazingagin/npc_gzip/issues/3
[8] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.top_k_accuracy_score.html
[9] Raff, Edward, and Charles Nicholas. “An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance.” Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 2017.