MANTa: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling
Last year, my first paper was published as a Findings paper at EMNLP 2022! It was a joint effort with Roman Castagné and was co-authored with my PhD supervisors Éric Villemonte de la Clergerie and Benoît Sagot from Inria's ALMAnaCH team.
It introduces a differentiable tokenization module that can be plugged into many language models, enabling end-to-end neural language modeling.
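To give an intuition of what "differentiable tokenization" means, here is a minimal sketch (not the paper's implementation; all module and variable names are illustrative): a small module predicts soft segmentation boundaries over the input bytes, and byte embeddings are pooled into a shorter sequence of block embeddings that a standard language model can consume, so the segmentation is learned by gradient descent together with the rest of the model.

```python
# Conceptual sketch only, assuming a PyTorch setting. This is NOT MANTa's code:
# it just illustrates the idea of a soft, learnable segmentation in front of an LM.
import torch
import torch.nn as nn

class SoftTokenizer(nn.Module):
    def __init__(self, vocab_size=256, dim=64, num_blocks=16):
        super().__init__()
        self.byte_emb = nn.Embedding(vocab_size, dim)
        self.boundary = nn.Linear(dim, 1)   # per-byte boundary logit
        self.num_blocks = num_blocks

    def forward(self, byte_ids):            # byte_ids: (batch, seq_len)
        x = self.byte_emb(byte_ids)         # (batch, seq_len, dim)
        p = torch.sigmoid(self.boundary(x)) # (batch, seq_len, 1) boundary probs
        # Cumulative boundary mass gives each byte a soft "block position".
        pos = torch.cumsum(p, dim=1).squeeze(-1)                  # (batch, seq_len)
        centers = torch.arange(self.num_blocks, device=x.device).float()
        # Soft assignment of each byte to each block, then weighted pooling.
        w = torch.softmax(-(pos.unsqueeze(-1) - centers) ** 2, dim=-1)
        blocks = torch.einsum("bsk,bsd->bkd", w, x)               # (batch, blocks, dim)
        return blocks  # shorter sequence to feed to a standard language model

byte_ids = torch.randint(0, 256, (2, 128))
print(SoftTokenizer()(byte_ids).shape)  # torch.Size([2, 16, 64])
```

Since every step above is differentiable, the loss of the downstream language model can back-propagate into the segmentation module, which is the core idea the paper builds on.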
Here is the PDF version of the paper, which you can also find here:
We also released our models’ weights and implementation on HuggingFace 🤗
Please cite as:
@inproceedings{godey-etal-2022-manta,
title = "{MANT}a: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling",
author = "Godey, Nathan and
Castagn{\'e}, Roman and
de la Clergerie, {\'E}ric and
Sagot, Beno{\^\i}t",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.207",
pages = "2859--2870",
}
This work was funded by the PRAIRIE institute as part of a PhD contract at Inria Paris and Sorbonne Université.
This post is licensed under CC BY 4.0 by the author.