Featured (7)
- Lost in Backpropagation: The LM Head is a Gradient Bottleneck
- Gaperon: A Peppered English-French Generative Language Model Suite
- Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression
- Improving Representations for Language Modeling (PhD thesis)
- Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
- Headless Language Models: Learning without Predicting with Contrastive Weight Tying
- MANTa: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling