Summary of The Tasks

NLP

发布日期: 2021-02-02

This page shows the most frequent use-cases when using the library. The models available allow for many different configurations and a great versatility in use-cases. The most simple ones are presented here, showcasing usage for tasks such as question answering, sequence classification, named entity recognition and others.

These examples leverage auto-models, which are classes that will instantiate a model according to a given checkpoint, automatically selecting the correct model architecture. Please check the AutoModel documentation for more information. Feel free to modify the code to be more specific and adapt it to your specific use-case.

In order for a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These checkpoints are usually pre-trained on a large corpus of data and fine-tuned on a specific task. This means the following:

Not all models were fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can leverage one of the run_$TASK.py scripts in the examples directory.
Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use-case and domain. As mentioned previously, you may leverage the examples scripts to fine-tune your model, or you may create your own training script.

In order to do an inference on a task, several mechanisms are made available by the library:

Pipelines: very easy-to-use abstractions, which require as little as two lines of code.
Direct model use: Less abstractions, but more flexibility and power via a direct access to a tokenizer (PyTorch/TensorFlow) and full inference capacity.

Sequence Classification

文本序列到类别

Sequence classification is the task of classifying sequences according to a given number of classes.

Extractive Question Answering

文本序列到类别

Extractive Question Answering is the task of extracting an answer from a text given a question.

Language Modeling

文本序列到文本序列

Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling, GPT-2 with causal language modeling.

Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or on scientific papers e.g. LysandreJik/arxiv-nlp.

Masked Language Modeling

Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for downstream tasks requiring bi-directional context, such as SQuAD (question answering, see Lewis, Lui, Goyal et al., part 4.2).

Causal Language Modeling

Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting for generation tasks.

Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the input sequence.

Text Generation

In text generation (a.k.a open-ended text generation) the goal is to create a coherent portion of text that is a continuation from the given context.

Named Entity Recognition

文本序列到类别

Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset, which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the run_ner.py (PyTorch), run_pl_ner.py (leveraging pytorch-lightning) or the run_tf_ner.py (TensorFlow) scripts.

Summarization

文本序列到文本序列

Summarization is the task of summarizing a document or an article into a shorter text.

An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created for the task of summarization. If you would like to fine-tune a model on a summarization task, various approaches are described in this document.

Translation

文本序列到文本序列

Translation is the task of translating a text from one language to another.

An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a translation task, various approaches are described in this document.

CarlYoung

http://yc1999.github.io/2021/02/02/summary-of-the-tasks/