This is a summary of the models available in 🤗 Transformers. It assumes you're familiar with the original transformer model. For a gentle introduction, check the annotated transformer. Here we focus on the high-level differences between the models; you can check each of them in more detail in their respective documentation. Also check out the pretrained model page to see the checkpoints available for each type of model and all the community models.
Each one of the models in the library falls into one of the following categories:
- Autoregressive models
- Autoencoding models
- Sequence-to-sequence models
- Multimodal models
- Retrieval-based models
Autoregressive models
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A typical example of such models is GPT.
- Original GPT
- GPT-2
- CTRL
- Transformer-XL
- Reformer
- XLNet
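Because the pretraining objective is next-token prediction, these models can generate text out of the box. As a minimal sketch (the checkpoint and generation settings below are just example choices, not a recommendation), here is how a causal language model can be used for generation with the pipeline API:

```python
from transformers import pipeline

# Causal language modeling: each token is predicted from the tokens before it,
# so at inference time the model can extend a prompt token by token.
generator = pipeline("text-generation", model="gpt2")  # example checkpoint

prompt = "The transformer architecture is"
outputs = generator(prompt, max_length=40, do_sample=True, top_k=50)
print(outputs[0]["generated_text"])
```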
Autoencoding models
Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is sentence classification or token classification. A typical example of such models is BERT.
Note that the only difference between autoregressive models and autoencoding models is in the way the model is pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first introduced.
- BERT
- ALBERT
- RoBERTa
- DistilBERT
- XLM
- XLM-RoBERTa
- FlauBERT
- ELECTRA
- Funnel Transformer
- Longformer
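To make the corruption-and-reconstruction objective concrete, the sketch below runs a masked language model through the fill-mask pipeline (the checkpoint and input sentence are just example choices):

```python
from transformers import pipeline

# Masked language modeling: some input tokens are replaced by a mask token and the
# model reconstructs them using context from both the left and the right.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # example checkpoint

for prediction in fill_mask("Autoencoding models try to [MASK] the original sentence."):
    print(prediction["token_str"], round(prediction["score"], 3))
```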
Sequence-to-sequence models
Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation tasks or by transforming other tasks into sequence-to-sequence problems. They can be fine-tuned to many tasks, but their most natural applications are translation, summarization and question answering. The original transformer model is an example of such a model (only for translation), while T5 is an example that can be fine-tuned on other tasks.
- BART
- Pegasus
- MarianMT
- T5
- MT5
- MBart
- ProphetNet
- XLM-ProphetNet
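As a small sketch of the encoder-decoder setup (model name and task are example choices), T5 can be used through the translation pipeline; summarization and question answering work the same way, since T5 casts every task as text-to-text:

```python
from transformers import pipeline

# Sequence-to-sequence: the encoder reads the whole input, then the decoder generates
# the output autoregressively.
translator = pipeline("translation_en_to_fr", model="t5-small")  # example checkpoint

result = translator("Sequence-to-sequence models use both an encoder and a decoder.")
print(result[0]["translation_text"])
```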
Multimodal models
Multimodal models mix text inputs with other kinds of input (e.g. images) and are more specific to a given task.
- MMBT
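The details are model-specific, but a common pattern is to project the non-text modality into the same embedding space as the text tokens and let a single transformer attend over the joint sequence. The sketch below is a hypothetical toy illustration of that pattern (all names and sizes are made up); it is not the MMBT implementation:

```python
import torch
import torch.nn as nn

# Hypothetical toy model: image features are projected into the token embedding space,
# concatenated with the text embeddings, and encoded jointly by a transformer encoder.
class ToyMultimodalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, image_feat_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.image_proj = nn.Linear(image_feat_dim, hidden)  # map image features to "token" space
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, 2)  # e.g. a binary classification head

    def forward(self, token_ids, image_feats):
        tokens = self.text_embed(token_ids)         # (batch, text_len, hidden)
        images = self.image_proj(image_feats)       # (batch, n_regions, hidden)
        joint = torch.cat([images, tokens], dim=1)  # one joint sequence
        encoded = self.encoder(joint)
        return self.classifier(encoded[:, 0])       # classify from the first position

model = ToyMultimodalEncoder()
logits = model(torch.randint(0, 30522, (1, 16)), torch.randn(1, 4, 2048))
print(logits.shape)  # torch.Size([1, 2])
```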
Retrieval-based models
Some models use document retrieval during (pre)training and inference, for example for open-domain question answering.
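For instance, DPR splits retrieval into a question encoder and a context encoder, trained so that a question and a relevant passage have a high inner product. Below is a minimal retrieval sketch, assuming the standard DPR checkpoints (the question and passages are made up):

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Encode the question and candidate passages into the same vector space, then rank
# passages by inner product with the question embedding.
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "Where is the Eiffel Tower?"
passages = [
    "The Eiffel Tower is a wrought-iron tower in Paris.",
    "BERT is pretrained with masked language modeling.",
]

with torch.no_grad():
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output
    p_emb = torch.cat([ctx_enc(**ctx_tok(p, return_tensors="pt")).pooler_output for p in passages])

scores = q_emb @ p_emb.T              # (1, num_passages) relevance scores
print(passages[int(scores.argmax())])  # the best-matching passage
```

The retrieved passage would then be fed, together with the question, to a reader or generator model to produce the final answer.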