Headless Language Models: Learning without Predicting with Contrastive Weight Tying
UPDATE: This paper was accepted to ICLR 2024! 🎉
Abstract
Self-supervised pre-training of language models usually consists of predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages, substantially reducing training computational requirements by up to 20 times, while simultaneously enhancing downstream performance and data efficiency. We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
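To make the idea concrete, here is a minimal PyTorch-style sketch of what a contrastive weight-tying objective could look like: output representations are scored against the input embeddings of the target tokens, with the other targets in the batch acting as negatives, so no softmax over the full vocabulary is ever computed. The function name, the cosine normalization, the temperature value, and the in-batch negative sampling are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def cwt_loss(hidden_states: torch.Tensor,
             target_ids: torch.Tensor,
             input_embeddings: torch.nn.Embedding,
             temperature: float = 0.05) -> torch.Tensor:
    """Contrastive objective between output states and tied input embeddings.

    hidden_states:    (batch, seq_len, dim) transformer outputs at prediction positions
    target_ids:       (batch, seq_len) ids of the tokens to reconstruct
    input_embeddings: the model's (shared) input embedding table
    """
    dim = hidden_states.size(-1)
    h = hidden_states.reshape(-1, dim)                    # (N, dim) output states
    e = input_embeddings(target_ids).reshape(-1, dim)     # (N, dim) target embeddings

    # Similarity between every output state and every target embedding in the batch.
    h = F.normalize(h, dim=-1)
    e = F.normalize(e, dim=-1)
    logits = h @ e.t() / temperature                      # (N, N)

    # The i-th output state should match the i-th target embedding;
    # the remaining targets serve as in-batch negatives (InfoNCE-style loss).
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)
```

The key point of the sketch is that the loss only involves a (batch-sized) similarity matrix rather than a projection onto the whole vocabulary, which is where the compute savings described in the abstract come from.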
This paper was co-authored with my PhD supervisors Éric Villemonte de la Clergerie and Benoît Sagot from Inria's ALMAnaCH team.
Here is the PDF version of the paper, which you can also find on OpenReview:
Please cite as:
@inproceedings{godey2024headless,
  title={Headless Language Models: Learning without Predicting with Contrastive Weight Tying},
  author={Nathan Godey and {\'E}ric Villemonte de la Clergerie and Beno{\^\i}t Sagot},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=ONPECq0Rk7}
}
This work was funded by the PRAIRIE institute as part of a PhD contract at Inria Paris and Sorbonne Université.