Featured (7)

Lost in Backpropagation: The LM Head is a Gradient Bottleneck, Mar 10, 2026
Gaperon: A Peppered English-French Generative Language Model Suite, Oct 29, 2025
Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression, Mar 4, 2025
Improving Representations for Language Modeling (PhD thesis), Sep 15, 2024
Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck, Apr 11, 2024
Headless Language Models: Learning without Predicting with Contrastive Weight Tying, Sep 15, 2023
MANTa: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling, Jun 9, 2023

Trending Tags

thesis anisotropy softmax data-curation language-models pretraining attention award biomedical compression