Publications (5)
- Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
- On the Scaling Laws of Geographical Representation in Language Models
- Anisotropy Is Inherent to Self-Attention in Transformers
- Headless Language Models: Learning without Predicting with Contrastive Weight Tying
- MANTa: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling