What the DAAM: Interpreting Stable Diffusion Using Cross Attention
Tang, Raphael,
Liu, Linqing,
Pandey, Akshat,
Jiang, Zhiying,
Yang, Gefei,
Kumar, Karun,
Stenetorp, Pontus,
Lin, Jimmy,
and Ture, Ferhan
In Proceedings of the Association for Computational Linguistics (ACL), Best Paper Award,
2023
Diffusion models are a milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce attribution maps, we upscale and aggregate cross-attention maps in the denoising module, naming our method DAAM. We validate it by testing its segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. On two generated datasets, we attain a competitive 58.8–64.8 mIoU on noun segmentation and fair to good mean opinion scores (3.4–4.2) on generalized attribution. Then, we apply DAAM to study the role of syntax in the pixel space across head–dependent heat map interaction patterns for ten common dependency relations. We show that, for some relations, the head map consistently subsumes the dependent, while the opposite is true for others. Finally, we study several semantic phenomena, focusing on feature entanglement; we find that the presence of cohyponyms worsens generation quality by 9%, and descriptive adjectives attend too broadly. We are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future research. Our code is at https://github.com/castorini/daam.
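To make the upscale-and-aggregate idea concrete, here is a minimal PyTorch sketch, not the authors' exact implementation: the input shape, the bilinear interpolation mode, and the max-normalization are simplifying assumptions made for illustration.

```python
# Illustrative DAAM-style aggregation. Assumes cross-attention maps have
# already been captured per (layer, timestep) during denoising; the shape
# convention and interpolation mode below are assumptions, not the paper's spec.
import torch
import torch.nn.functional as F

def aggregate_word_heat_map(cross_attn_maps, token_idx, out_size=512):
    """Aggregate cross-attention into a per-word attribution map.

    cross_attn_maps: list of tensors of shape (heads, h, w, num_tokens),
        one per (layer, timestep) pair captured during denoising.
    token_idx: index of the word's token in the prompt.
    out_size: side length of the generated image in pixels.
    """
    heat = torch.zeros(out_size, out_size)
    for attn in cross_attn_maps:
        # Select this token's attention scores: (heads, h, w).
        word_attn = attn[..., token_idx]
        # Upscale each head's low-resolution map to image resolution.
        upscaled = F.interpolate(
            word_attn.unsqueeze(1),          # (heads, 1, h, w)
            size=(out_size, out_size),
            mode='bilinear',
            align_corners=False,
        ).squeeze(1)                          # (heads, out_size, out_size)
        # Sum over heads, accumulating across layers and timesteps.
        heat += upscaled.sum(dim=0)
    # Normalize to [0, 1] for visualization.
    return heat / heat.max()
```

The released code at the repository linked above packages this bookkeeping behind a tracing interface for Stable Diffusion pipelines, so the per-layer, per-timestep attention capture shown here as a given input is handled automatically.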