Publications

2024

11 Apr Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
29 Feb On the Scaling Laws of Geographical Representation in Language Models
22 Jan Anisotropy Is Inherent to Self-Attention in Transformers

2023

15 Sep Headless Language Models: Learning without Predicting with Contrastive Weight Tying
09 Jun MANTa: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling
