PyTorch NLP Tutorial

By James Montantes, Exxact Corporation.

PyTorch is one of the most popular Deep Learning frameworks that is based on Python and is supported by Facebook.

In this article we will be looking into the classes that PyTorch provides for helping with Natural Language Processing (NLP).

There are 6 classes in PyTorch that can be used for NLP related tasks using recurrent layers:

  • torch.nn.RNN
  • torch.nn.LSTM
  • torch.nn.GRU
  • torch.nn.RNNCell
  • torch.nn.LSTMCell
  • torch.nn.GRUCell

Understanding these classes, their parameters, their inputs and their outputs is key to getting started with building your own neural networks for Natural Language Processing (NLP) in PyTorch.

If you have started your NLP journey, chances are that you have encountered a similar type of diagram (if not, we recommend that you check out this excellent and often-cited article by Chris Olah — Understanding LSTM Networks):

Such unrolled diagrams are used by teachers to provide students with a simple-to-grasp explanation of the recurrent structure of such neural networks. Going from these pretty, unrolled diagrams and intuitive explanations to the PyTorch API can prove to be challenging.


Hence, in this article, we aim to bridge that gap by explaining the parameters, inputs and the outputs of the relevant classes in PyTorch in a clear and descriptive manner.

PyTorch basically has 2 levels of classes for building recurrent networks:

  • Multi-layer classes — nn.RNN, nn.GRU and nn.LSTM
    Objects of these classes are capable of representing deep bidirectional recurrent neural networks (or, as the class names suggest, one of their more evolved architectures — the Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM) networks).
  • Cell-level classes — nn.RNNCell, nn.GRUCell and nn.LSTMCell
    Objects of these classes can represent only a single cell (again, a simple RNN, LSTM or GRU cell) that can handle one timestep of the input data. (Remember, these cells don’t have cuDNN optimisation and thus don’t have any fused operations, etc.)

All the classes in the same level share the same API. Hence, understanding the parameters, inputs and outputs of any one of the classes in both the above levels is enough.

To make the explanations simple, we will use the simplest classes — torch.nn.RNN and torch.nn.RNNCell.

torch.nn.RNN:

We will use the following diagram to explain the API —


  • input_size — The number of expected features in the input x

This represents the dimension of vector x[i] (i.e, any of the vectors from x[0] to x[t] in the above diagram). Note that it is easy to confuse this with the sequence length, which is the total number of cells that we get after unrolling the RNN as above.

  • hidden_size — The number of features in the hidden state h

This represents the dimension of vector h[i] (i.e, any of the vectors from h[0] to h[t] in the above diagram). Together, hidden_size and input_size are necessary and sufficient in determining the shape of the weight matrices of the network.

  • num_layers — Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two RNNs together to form a stacked RNN, with the second RNN taking in outputs of the first RNN and computing the final results. Default: 1

This parameter is used to build deep RNNs like these:

Here red cells represent the inputs, green blocks represent the RNN cells and blue blocks represent the output.

So for the above diagram, we would set the num_layers parameter to 3.

  • nonlinearity — The non-linearity to use. Can be either ‘tanh’ or ‘relu’. Default: ‘tanh’

This is self-explanatory.

  • bias — If False, then the layer does not use bias weights b_ih and b_hh. Default: True

In the Deep Learning community, some people find that removing/using bias does not affect the model’s performance. Hence, this boolean parameter.

  • batch_first — If True, then the input and output tensors are provided as (batch, seq, feature). Default: False
  • dropout — If non-zero, introduces a Dropout layer on the outputs of each RNN layer except the last layer, with dropout probability equal to dropout. Default: 0

This parameter is used to control the dropout regularisation method in the RNN architecture.

  • bidirectional — If True, becomes a bidirectional RNN. Default: False

Creating a bidirectional RNN is as simple as setting this parameter to True!

So, to make an RNN in PyTorch, we need to pass 2 mandatory parameters to the class — input_size and hidden_size.
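As a quick sanity check (a minimal sketch; the sizes chosen here are arbitrary), constructing an RNN with just these two parameters and inspecting its weight matrices shows how input_size and hidden_size fully determine their shapes:

```python
import torch.nn as nn

# input_size=5, hidden_size=10 are arbitrary illustrative values
rnn = nn.RNN(input_size=5, hidden_size=10)

# input-to-hidden weights: (hidden_size, input_size)
print(rnn.weight_ih_l0.shape)  # torch.Size([10, 5])
# hidden-to-hidden weights: (hidden_size, hidden_size)
print(rnn.weight_hh_l0.shape)  # torch.Size([10, 10])
```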

Once we have created an object, we can “call” the object with the relevant inputs and it returns outputs.


We need to pass 2 inputs to the object — input and h_0 :

  • input — This is a tensor of shape (seq_len, batch, input_size). In order to work with variable-length inputs, we pack the shorter input sequences. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.
  • h_0 — This is a tensor of shape (num_layers * num_directions, batch, hidden_size). num_directions is 2 for bidirectional RNNs and 1 otherwise. This tensor contains the initial hidden state for each element in the batch.


In a similar manner, the object returns 2 outputs to us — output and h_n :

  • output — This is a tensor of shape (seq_len, batch, num_directions * hidden_size). It contains the output features (h_k) from the last layer of the RNN, for each k.
  • h_n — This is a tensor of size (num_layers * num_directions, batch, hidden_size). It contains the hidden state for k = seq_len.
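Putting these input and output shapes together in a minimal sketch (the sizes are arbitrary; num_layers=2 and bidirectional=True are used to exercise num_directions):

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size, num_layers = 7, 3, 5, 10, 2
num_directions = 2  # bidirectional RNN

rnn = nn.RNN(input_size, hidden_size, num_layers=num_layers, bidirectional=True)

x = torch.randn(seq_len, batch, input_size)
h_0 = torch.zeros(num_layers * num_directions, batch, hidden_size)

output, h_n = rnn(x, h_0)
print(output.shape)  # (seq_len, batch, num_directions * hidden_size) -> [7, 3, 20]
print(h_n.shape)     # (num_layers * num_directions, batch, hidden_size) -> [4, 3, 10]
```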

As mentioned before, both torch.nn.GRU and torch.nn.LSTM have the same API, i.e, they accept the same set of parameters, accept inputs in the same format, and return outputs in the same format too.

torch.nn.RNNCell:

Since this represents only a single cell of the RNN, it accepts only 4 parameters, all of which have the same meaning as they did in torch.nn.RNN .


  • input_size — The number of expected features in the input x
  • hidden_size — The number of features in the hidden state h
  • bias — If False, then the layer does not use bias weights b_ih and b_hh. Default: True
  • nonlinearity — The non-linearity to use. Can be either ‘tanh’ or ‘relu’. Default: ‘tanh’

Again, since this is just a single cell of an RNN, the input and output dimensions are much simpler —

Inputs (input, hidden):

  • input — this is a tensor of shape (batch, input_size) that contains the input features.
  • hidden — this is a tensor of shape (batch, hidden_size) that contains the initial hidden states for each of the elements in the batch.


Outputs:

  • h’ — this is a tensor of shape (batch, hidden_size) and it gives us the hidden state for the next time step.
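Because the cell handles only a single timestep, unrolling it over a sequence is done with an explicit loop (a minimal sketch; sizes are arbitrary):

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 3, 7, 5, 10
cell = nn.RNNCell(input_size, hidden_size)

xs = torch.randn(seq_len, batch, input_size)  # one (batch, input_size) slice per timestep
h = torch.zeros(batch, hidden_size)           # initial hidden state

for t in range(seq_len):
    h = cell(xs[t], h)  # h' from one step becomes the hidden input for the next

print(h.shape)  # torch.Size([3, 10])
```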

This was all about getting started with the PyTorch framework for Natural Language Processing (NLP). If you are looking for ideas on what is possible and what you can build, check out — Deep Learning for Natural Language Processing using RNNs and CNNs.

Original. Reposted with permission.




A list of NLP(Natural Language Processing) tutorials built on PyTorch.

NLP Tutorial


Table of Contents

A step-by-step tutorial on how to implement and adapt to simple real-world NLP tasks.

Text Classification

News Category Classification

This repo provides a simple PyTorch implementation of Text Classification, with simple annotation. Here we use the HuffPost news corpus, including the corresponding category for each article. The classification model trained on this dataset identifies the category of a news article based on its headline and description.
Keywords: CBoW, LSTM, fastText, Text categorization

IMDb Movie Review Classification

This text classification tutorial trains a transformer model on the IMDb movie review dataset for sentiment analysis. It provides a simple PyTorch implementation, with simple annotation.
Keywords: Transformer, Sentiment analysis

Question-Answer Matching

This repo provides a simple PyTorch implementation of Question-Answer matching. Here we use the corpus from Stack Exchange to build embeddings for entire questions. Using those embeddings, we find similar questions for a given question, and show the corresponding answers to the ones found.
Keywords: CBoW, TF-IDF, LSTM with variable-length sequences

Movie Review Classification (Korean NLP)

This repo provides a simple Keras implementation of TextCNN for Text Classification. Here we use a movie review corpus written in Korean. The model trained on this dataset identifies the sentiment based on the review text.
Keywords: TextCNN, Sentiment analysis

Neural Machine Translation

English to French Translation - seq2seq

This neural machine translation tutorial trains a seq2seq model on a set of many thousands of English to French translation pairs to translate from English to French. It provides an intrinsic/extrinsic comparison of various sequence-to-sequence (seq2seq) models in translation.
Keywords: sequence-to-sequence network (seq2seq), Attention, Autoregressive, Teacher forcing

French to English Translation - Transformer

This neural machine translation tutorial trains a Transformer model on a set of many thousands of French to English translation pairs to translate from French to English. It provides a simple PyTorch implementation, with simple annotation.
Keywords: Transformer, SentencePiece

Natural Language Understanding

Neural Language Model

This repo provides a simple PyTorch implementation of a Neural Language Model for natural language understanding. Here we implement unidirectional/bidirectional language models, and pre-train language representations from unlabeled text (Wikipedia corpus).
Keywords: Autoregressive language model, Perplexity


Deep Learning For NLP with PyTorch and Torchtext

PyTorch has been an awesome deep learning framework to work with. However, when it comes to NLP, somehow I could not find as good a utility library as torchvision. It turns out PyTorch has torchtext, which, in my opinion, lacks examples of how to use it, and the documentation [6] could be improved. Moreover, there are some great tutorials like [1] and [2], but we still need more examples.

This article’s purpose is to give readers sample code showing how to use torchtext: in particular, how to use pre-trained word embeddings, how to use the dataset API, how to use the iterator API for minibatching, and finally how to use all of these together to train a model.

There have been some alternatives for pre-trained word embeddings, such as Stanza (Stanford NLP) [3], Gensim [4], and spaCy [5], but in this article I wanted to focus on doing word embedding with torchtext.

Available Word Embedding

You can see the list of pre-trained word embeddings at torchtext. At the time of writing, there are 3 supported pre-trained word embedding classes: GloVe, FastText, and CharNGram, with no additional detail on how to load them. The exhaustive list is stated there, but it took me some time to read, so I will lay it out here.


There are two ways we can load pre-trained word embeddings: initiating a word embedding object directly, or using a Field instance.

Using Field Instance

You need some toy dataset to use this so let’s set one up.

import pandas as pd

df = pd.DataFrame([
    ['my name is Jack', 'Y'],
    ['Hi I am Jack', 'Y'],
    ['Hello There!', 'Y'],
    ['Hi I am cooking', 'N'],
    ['Hello are you there?', 'N'],
    ['There is a bird there', 'N'],
], columns=['text', 'label'])

then we can construct objects that hold metadata of feature column and label column.

from torchtext.data import Field

# (the Field options below are typical settings; the originals were lost in extraction)
text_field = Field(
    sequential=True,
    tokenize='basic_english',
    fix_length=5,
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(lambda x: text_field.preprocess(x))

# load fastext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab

to get the real instance of the pre-trained word embedding, you can use the vocab's vectors attribute:

vocab.vectors
Initiate Word Embedding Object

Each of these snippets downloads a large set of word embeddings, so be patient and do not execute all of the code below at once.


The FastText object has one parameter, language, which can be either ‘simple’ or ‘en’. Currently only 300 embedding dimensions are supported, as mentioned in the embedding list above.

from torchtext.vocab import FastText
embedding = FastText('simple')


from torchtext.vocab import CharNGram
embedding_charngram = CharNGram()


The GloVe object has 2 parameters: name and dim. You can look up the available embedding list to see what each parameter supports.

from torchtext.vocab import GloVe
embedding_glove = GloVe(name='6B', dim=100)

Using Word Embedding

Using word embeddings through the torchtext API is super easy! Say you have the vocab instance built above; its stoi ("string to index") mapping behaves like a Python dict:

print(vocab.stoi['are'])    # known token, in my case prints 12
print(vocab.stoi['crazy'])  # unknown token, prints 0

As you can see, it handles unknown tokens without throwing an error! If you play with encoding words into integers, you will notice that by default the unknown token is encoded as 0, while the pad token is encoded as 1.

Assuming the variable df has been defined as above, we now proceed to prepare the data by constructing a Field for both the feature and label.

from torchtext.data import Field

# (the Field options below are typical settings; the originals were lost in extraction)
text_field = Field(
    sequential=True,
    tokenize='basic_english',
    fix_length=5,
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(
    lambda x: text_field.preprocess(x)
)

# load fastext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab

A bit of warning here: the dataset's split() method may return 3 datasets (train, val, test) instead of the 2 values defined.

I did not find any ready-made API to load a pandas DataFrame into a torchtext dataset, but it is pretty easy to build one.

from torchtext.data import Dataset, Example

ltoi = {l: i for i, l in enumerate(df['label'].unique())}
df['label'] = df['label'].apply(lambda y: ltoi[y])

class DataFrameDataset(Dataset):
    def __init__(self, df: pd.DataFrame, fields: list):
        super(DataFrameDataset, self).__init__(
            [Example.fromlist(list(r), fields) for i, r in df.iterrows()],
            fields
        )
we can now construct the DataFrameDataset and initiate it with the pandas dataframe.

train_dataset, test_dataset = DataFrameDataset(
    df=df,
    fields=(
        ('text', text_field),
        ('label', label_field)
    )
).split()

we then use the BucketIterator class to easily construct a minibatching iterator.

from torchtext.data import BucketIterator

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset),
    batch_sizes=(2, 2),
    sort=False
)

Remember to use sort=False, otherwise iterating will raise an error: we haven’t defined a sort function, yet sorting is somehow enabled by default.

A little note: while I do agree that we should use the DataLoader API to handle minibatches, at this moment I have not explored how to use it together with torchtext.

Let’s define an arbitrary PyTorch model using 1 embedding layer and 1 linear layer. In the current example, I do not use a pre-trained word embedding; instead I use a new, untrained word embedding.

import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

class ModelParam(object):
    def __init__(self, param_dict: dict = dict()):
        self.input_size = param_dict.get('input_size', 0)
        self.vocab_size = param_dict.get('vocab_size')
        self.embedding_dim = param_dict.get('embedding_dim', 300)
        self.target_dim = param_dict.get('target_dim', 2)

class MyModel(nn.Module):
    def __init__(self, model_param: ModelParam):
        super().__init__()
        self.embedding = nn.Embedding(
            model_param.vocab_size,
            model_param.embedding_dim
        )
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim,
            model_param.target_dim
        )

    def forward(self, x):
        features = self.embedding(x).view(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features

Then I can easily iterate the training (and testing) routine as follows.
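The training routine itself did not survive extraction. A minimal sketch of such a loop, assuming each batch exposes .text and .label attributes as BucketIterator batches do — here the iterator is simulated with random tensors so the snippet is self-contained:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from collections import namedtuple

# hypothetical stand-in for BucketIterator batches (.text, .label)
Batch = namedtuple('Batch', ['text', 'label'])
vocab_size, fix_length, target_dim, batch_size = 100, 5, 2, 2

model = nn.Sequential(
    nn.Embedding(vocab_size, 16),
    nn.Flatten(),        # -> (batch, fix_length * embedding_dim)
    nn.ReLU(),
    nn.Linear(fix_length * 16, target_dim),
)
optimizer = Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# simulated minibatches; in practice: for batch in train_iter
fake_iter = [Batch(torch.randint(0, vocab_size, (batch_size, fix_length)),
                   torch.randint(0, target_dim, (batch_size,)))
             for _ in range(3)]

for epoch in range(2):
    for batch in fake_iter:
        optimizer.zero_grad()
        logits = model(batch.text)        # real batches are (fix_length, batch): transpose first
        loss = criterion(logits, batch.label)
        loss.backward()
        optimizer.step()
```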

Reusing The Pre-trained Word Embedding

It is easy to modify the currently defined model into one that uses a pre-trained embedding.

class MyModelWithPretrainedEmbedding(nn.Module):
    def __init__(self, model_param: ModelParam, embedding):
        super().__init__()
        self.embedding = embedding
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim,
            model_param.target_dim
        )

    def forward(self, x):
        features = self.embedding[x].reshape(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features

I made 3 lines of modifications. You should notice that I have changed the constructor to accept an embedding as input. Additionally, I have changed view to reshape, and used the get operator (embedding[x]) instead of the call operator (embedding(x)) to access the embedding.

model = MyModelWithPretrainedEmbedding(model_param, vocab.vectors)

I have finished laying out my own exploration of using torchtext to handle text data in PyTorch. I began writing this article because I had trouble using torchtext with the tutorials currently available on the internet. I hope this article may reduce the overhead for others too.

Need help writing this code? Here’s a link to the Google Colab notebook.

Link to Google Colab

[1] Nie, A. A Tutorial on Torchtext. 2017.

[2] Text Classification with TorchText Tutorial.

[3] Stanza Documentation.

[4] Gensim Documentation.

[5] Spacy Documentation.

[6] Torchtext Documentation.


NLP From Scratch: Translation with a Sequence to Sequence Network and Attention¶



Author: Sean Robertson

This is the third and final tutorial on doing “NLP From Scratch”, where we write our own classes and functions to preprocess the data to do our NLP modeling tasks. We hope after you complete this tutorial that you’ll proceed to learn how torchtext can handle much of this preprocessing for you in the three tutorials immediately following this one.

In this project we will be teaching a neural network to translate from French to English.

[KEY: > input, = target, < output]

> il est en train de peindre un tableau .
= he is painting a picture .
< he is painting a picture .

> pourquoi ne pas essayer ce vin delicieux ?
= why not try that delicious wine ?
< why not try that delicious wine ?

> elle n est pas poete mais romanciere .
= she is not a poet but a novelist .
< she not not a poet but a novelist .

> vous etes trop maigre .
= you re too skinny .
< you re all alone .

… to varying degrees of success.

This is made possible by the simple but powerful idea of the sequence to sequence network, in which two recurrent neural networks work together to transform one sequence to another. An encoder network condenses an input sequence into a vector, and a decoder network unfolds that vector into a new sequence.

To improve upon this model we’ll use an attention mechanism, which lets the decoder learn to focus over a specific range of the input sequence.

Recommended Reading:

I assume you have at least installed PyTorch, know Python, and understand Tensors:

It would also be useful to know about Sequence to Sequence networks and how they work:

You will also find the previous tutorials on NLP From Scratch: Classifying Names with a Character-Level RNN and NLP From Scratch: Generating Names with a Character-Level RNN helpful as those concepts are very similar to the Encoder and Decoder models, respectively.



Loading data files¶

The data for this project is a set of many thousands of English to French translation pairs.

This question on Open Data Stack Exchange pointed me to an open translation site with downloads available, and better yet, someone did the extra work of splitting language pairs into individual text files.

The English to French pairs are too big to include in the repo, so download the data before continuing. The file is a tab-separated list of translation pairs:


Download the data from here and extract it to the current directory.

Similar to the character encoding used in the character-level RNN tutorials, we will be representing each word in a language as a one-hot vector, or giant vector of zeros except for a single one (at the index of the word). Compared to the dozens of characters that might exist in a language, there are many many more words, so the encoding vector is much larger. We will however cheat a bit and trim the data to only use a few thousand words per language.
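A minimal sketch of this word-level one-hot encoding, using a hypothetical four-word vocabulary:

```python
import torch

# hypothetical tiny vocabulary: word -> index
word2index = {"SOS": 0, "EOS": 1, "chat": 2, "noir": 3}

def one_hot(word):
    vec = torch.zeros(len(word2index))
    vec[word2index[word]] = 1.0  # a single one at the word's index
    return vec

print(one_hot("chat"))  # tensor([0., 0., 1., 0.])
```

With a real vocabulary of thousands of words, the vector grows accordingly — which is why the tutorial trims the data to a few thousand words per language.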

We’ll need a unique index per word to use as the inputs and targets of the networks later. To keep track of all this we will use a helper class called Lang which has word → index (word2index) and index → word (index2word) dictionaries, as well as a count of each word (word2count) which will be used to replace rare words later.

SOS_token = 0
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

The files are all in Unicode, to simplify we will turn Unicode characters to ASCII, make everything lowercase, and trim most punctuation.

import re
import unicodedata

# Turn a Unicode string to plain ASCII
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

To read the data file we will split the file into lines, and then split lines into pairs. The files are all English → Other Language, so if we want to translate from Other Language → English I added the reverse flag to reverse the pairs.

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8').\
        read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

Since there are a lot of example sentences and we want to train something quickly, we’ll trim the data set to only relatively short and simple sentences. Here the maximum length is 10 words (that includes ending punctuation) and we’re filtering to sentences that translate to the form “I am” or “He is” etc. (accounting for apostrophes replaced earlier).

MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

The full process for preparing the data is:

  • Read text file and split into lines, split lines into pairs
  • Normalize text, filter by length and content
  • Make word lists from sentences in pairs
import random

def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))


Reading lines...
Read 135842 sentence pairs
Trimmed to 10599 sentence pairs
Counting words...
Counted words:
fra 4345
eng 2803
['je suis en quelque sorte fatigue .', 'i m sort of tired .']

The Seq2Seq Model¶

A Recurrent Neural Network, or RNN, is a network that operates on a sequence and uses its own output as input for subsequent steps.

A Sequence to Sequence network, or seq2seq network, or Encoder Decoder network, is a model consisting of two RNNs called the encoder and decoder. The encoder reads an input sequence and outputs a single vector, and the decoder reads that vector to produce an output sequence.

Unlike sequence prediction with a single RNN, where every input corresponds to an output, the seq2seq model frees us from sequence length and order, which makes it ideal for translation between two languages.

Consider the sentence “Je ne suis pas le chat noir” → “I am not the black cat”. Most of the words in the input sentence have a direct translation in the output sentence, but are in slightly different orders, e.g. “chat noir” and “black cat”. Because of the “ne/pas” construction there is also one more word in the input sentence. It would be difficult to produce a correct translation directly from the sequence of input words.

With a seq2seq model the encoder creates a single vector which, in the ideal case, encodes the “meaning” of the input sequence into a single vector — a single point in some N dimensional space of sentences.

The Encoder¶

The encoder of a seq2seq network is a RNN that outputs some value for every word from the input sentence. For every input word the encoder outputs a vector and a hidden state, and uses the hidden state for the next input word.
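The encoder's code block did not survive extraction; a sketch in the spirit of the tutorial — a GRU over an embedding, processing one word index per call (dimensions and layer choices are illustrative):

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # one word index in; one output vector and an updated hidden state out
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)
```

The training code later in the tutorial calls the encoder one timestep at a time and uses initHidden() for the initial state, which this sketch matches.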


The Decoder¶

The decoder is another RNN that takes the encoder output vector(s) and outputs a sequence of words to create the translation.

Simple Decoder¶

In the simplest seq2seq decoder we use only last output of the encoder. This last output is sometimes called the context vector as it encodes context from the entire sequence. This context vector is used as the initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and hidden state. The initial input token is the start-of-string token, and the first hidden state is the context vector (the encoder’s last hidden state).
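The simple decoder's code block was likewise lost in extraction; a sketch along the same lines, consuming one token and one hidden state per step and emitting log-probabilities over the output vocabulary (dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        # log-probabilities over the output vocabulary
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden
```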


I encourage you to train and observe the results of this model, but to save space we’ll be going straight for the gold and introducing the Attention Mechanism.

Attention Decoder¶

If only the context vector is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence.

Attention allows the decoder network to “focus” on a different part of the encoder’s outputs for every step of the decoder’s own outputs. First we calculate a set of attention weights. These will be multiplied by the encoder output vectors to create a weighted combination. The result should contain information about that specific part of the input sequence, and thus help the decoder choose the right output words.

Calculating the attention weights is done with another feed-forward layer, using the decoder’s input and hidden state as inputs. Because there are sentences of all sizes in the training data, to actually create and train this layer we have to choose a maximum sentence length (input length, for encoder outputs) that it can apply to. Sentences of the maximum length will use all the attention weights, while shorter sentences will only use the first few.
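The attention-weight computation described above can be sketched as follows (a minimal sketch with illustrative dimensions; attn stands in for the feed-forward layer sized to the maximum sentence length):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, max_length = 8, 10
attn = nn.Linear(hidden_size * 2, max_length)  # feed-forward attention layer

embedded = torch.randn(1, hidden_size)          # embedded decoder input
hidden = torch.randn(1, hidden_size)            # decoder hidden state
encoder_outputs = torch.randn(max_length, hidden_size)

# weights over the (padded) input positions, summing to 1
attn_weights = F.softmax(attn(torch.cat((embedded, hidden), dim=1)), dim=1)

# weighted combination of encoder outputs
attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                         encoder_outputs.unsqueeze(0))
print(attn_applied.shape)  # torch.Size([1, 1, 8])
```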



Preparing Training Data¶

To train, for each pair we will need an input tensor (indexes of the words in the input sentence) and target tensor (indexes of the words in the target sentence). While creating these vectors we will append the EOS token to both sequences.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

Training the Model¶

To train we run the input sentence through the encoder, and keep track of every output and the latest hidden state. Then the decoder is given the SOS token as its first input, and the last hidden state of the encoder as its first hidden state.

“Teacher forcing” is the concept of using the real target outputs as each next input, instead of using the decoder’s guess as the next input. Using teacher forcing causes it to converge faster but when the trained network is exploited, it may exhibit instability.

You can observe outputs of teacher-forced networks that read with coherent grammar but wander far from the correct translation - intuitively it has learned to represent the output grammar and can “pick up” the meaning once the teacher tells it the first few words, but it has not properly learned how to create the sentence from the translation in the first place.

Because of the freedom PyTorch’s autograd gives us, we can randomly choose to use teacher forcing or not with a simple if statement. Turn up teacher_forcing_ratio to use more of it.

teacher_forcing_ratio = 0.5

def train(input_tensor, target_tensor, encoder, decoder,
          encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing
    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input
            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length

This is a helper function to print time elapsed and estimated time remaining given the current time and progress %.

import time
import math

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

The whole training process looks like this:

  • Start a timer
  • Initialize optimizers and criterion
  • Create set of training pairs
  • Start empty losses array for plotting

Then we call train many times and occasionally print the progress (% of examples, time so far, estimated time) and average loss.

def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)

Plotting results¶

Plotting is done with matplotlib, using the array of loss values saved while training.

import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)


Evaluation is mostly the same as training, but there are no targets so we simply feed the decoder’s predictions back to itself for each step. Every time it predicts a word we add it to the output string, and if it predicts the EOS token we stop there. We also store the decoder’s attention outputs for display later.

def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(
                input_tensor[ei], encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]

We can evaluate random sentences from the training set and print out the input, target, and output to make some subjective quality judgements:

def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

Training and Evaluating¶

With all these helper functions in place (it looks like extra work, but it makes it easier to run multiple experiments) we can actually initialize a network and start training.

Remember that the input sentences were heavily filtered. For this small dataset we can use relatively small networks of 256 hidden nodes and a single GRU layer. After about 40 minutes on a MacBook CPU we’ll get some reasonable results.

hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

trainIters(encoder1, attn_decoder1, 75000, print_every=5000)

Note: If you run this notebook you can train, interrupt the kernel, evaluate, and continue training later. Comment out the lines where the encoder and decoder are initialized and run trainIters again.

(Loss plots produced by showPlot during training; images omitted from this extract.)


1m 32s (- 21m 36s) (5000 6%) 2.8327
3m 2s (- 19m 44s) (10000 13%) 2.2748
4m 31s (- 18m 5s) (15000 20%) 1.9830
6m 1s (- 16m 34s) (20000 26%) 1.7042
7m 28s (- 14m 57s) (25000 33%) 1.5309
8m 55s (- 13m 23s) (30000 40%) 1.3746
10m 23s (- 11m 52s) (35000 46%) 1.2225
11m 52s (- 10m 23s) (40000 53%) 1.0765
13m 21s (- 8m 54s) (45000 60%) 0.9878
14m 49s (- 7m 24s) (50000 66%) 0.8937
16m 17s (- 5m 55s) (55000 73%) 0.8348
17m 45s (- 4m 26s) (60000 80%) 0.7430
19m 12s (- 2m 57s) (65000 86%) 0.6902
20m 40s (- 1m 28s) (70000 93%) 0.6129
22m 7s (- 0m 0s) (75000 100%) 0.5760


> tu es tres religieuse n est ce pas ?
= you re very religious aren t you ?
< you re very religious aren t you ? <EOS>

> je suis terriblement fatiguee .
= i m awfully tired .
< i m sick tired . <EOS>

> je te vire .
= i m firing you .
< i m firing you . <EOS>

> vous n etes pas si interessants .
= you re not that interesting .
< you re not that interesting . <EOS>

> nous avons honte .
= we re ashamed .
< we re ashamed . <EOS>

> vous etes une merveilleuse amie .
= you re a wonderful friend .
< you re a wonderful friend . <EOS>

> je suis tres inquiet a ton sujet .
= i m very worried about you .
< i m very worried about you . <EOS>

> vous n etes pas malades .
= you re not sick .
< you re not being . <EOS>

> vous allez trop vite .
= you re going too fast .
< you re going too fast . <EOS>

> il est son ami .
= he is her friend .
< he is her friend . <EOS>

Visualizing Attention¶

A useful property of the attention mechanism is its highly interpretable outputs. Because it is used to weight specific encoder outputs of the input sequence, we can imagine looking where the network is focused most at each time step.
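To see why these weights are interpretable, here is a standalone numeric sketch (toy numbers, not from the tutorial): the attention weights are a softmax distribution over input positions, and the context vector handed to the decoder is the corresponding weighted sum of encoder outputs, so a large weight means that input step dominates the context.

```python
import numpy as np

# Toy encoder outputs: 4 input steps, hidden size 3.
encoder_outputs = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0],
                            [1.0, 1.0, 1.0]])

# Unnormalized attention scores for a single decoder step.
scores = np.array([2.0, 0.1, 0.1, 0.1])
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over input steps

# Context vector fed to the decoder: weighted sum of encoder outputs.
context = weights @ encoder_outputs
print(weights.argmax())  # 0 -- the decoder is "looking at" input step 0
```

Plotting one such weight row per output step is exactly what the attention matrices below show.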

You could simply run plt.matshow(attentions) to see attention output displayed as a matrix, with the columns being input steps and rows being output steps:

output_words, attentions = evaluate(
    encoder1, attn_decoder1, "je suis trop froid .")
plt.matshow(attentions.numpy())

For a better viewing experience we will do the extra work of adding axes and labels:

def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(
        encoder1, attn_decoder1, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


evaluateAndShowAttention("elle a cinq ans de moins que moi .")
evaluateAndShowAttention("elle est trop petit .")
evaluateAndShowAttention("je ne crains pas de mourir .")
evaluateAndShowAttention("c est un jeune directeur plein de talent .")


input = elle a cinq ans de moins que moi .
output = she is five years older than me . <EOS>

input = elle est trop petit .
output = she is too short . <EOS>

input = je ne crains pas de mourir .
output = i m not afraid to die . <EOS>

input = c est un jeune directeur plein de talent .
output = he is a dumb worker worker . <EOS>


Exercises¶

  • Try with a different dataset
    • Another language pair
    • Human → Machine (e.g. IOT commands)
    • Chat → Response
    • Question → Answer
  • Replace the embeddings with pre-trained word embeddings such as word2vec or GloVe
  • Try with more layers, more hidden units, and more sentences. Compare the training time and results.
  • If you use a translation file where pairs have two of the same phrase, you can use this as an autoencoder. Try this:
    • Train as an autoencoder
    • Save only the Encoder network
    • Train a new Decoder for translation from there
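The pre-trained-embeddings exercise above can be sketched as follows (a minimal, self-contained example; the random matrix here stands in for vectors loaded from a GloVe or word2vec file):

```python
import torch
import torch.nn as nn

# Stand-in for pre-trained vectors: in practice, row i would be the
# GloVe/word2vec vector for vocabulary word i.
pretrained = torch.randn(10, 50)  # vocab size 10, embedding dim 50

# freeze=False lets the embeddings keep training along with the rest
# of the network; freeze=True would keep them fixed.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

ids = torch.tensor([[1, 4, 7]])  # a batch of one 3-word sentence
vectors = embedding(ids)
print(vectors.shape)  # torch.Size([1, 3, 50])
```

Such a layer could replace the nn.Embedding used inside the encoder and decoder, provided the vocabulary indices match the rows of the pre-trained matrix.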

Total running time of the script: (22 minutes 16.832 seconds)


© Copyright 2021, PyTorch.




Deep Learning for NLP with Pytorch¶

Author: Robert Guthrie

This tutorial will walk you through the key ideas of deep learning programming using Pytorch. Many of the concepts (such as the computation graph abstraction and autograd) are not unique to Pytorch and are relevant to any deep learning toolkit out there.

I am writing this tutorial to focus specifically on NLP for people who have never written code in any deep learning framework (e.g, TensorFlow, Theano, Keras, Dynet). It assumes working knowledge of core NLP problems: part-of-speech tagging, language modeling, etc. It also assumes familiarity with neural networks at the level of an intro AI class (such as one from the Russel and Norvig book). Usually, these courses cover the basic backpropagation algorithm on feed-forward neural networks, and make the point that they are chains of compositions of linearities and non-linearities. This tutorial aims to get you started writing deep learning code, given you have this prerequisite knowledge.

Note this is about models, not data. For all of the models, I just create a few test examples with small dimensionality so you can see how the weights change as it trains. If you have some real data you want to try, you should be able to rip out any of the models from this notebook and use them on it.


graykode / nlp-tutorial


nlp-tutorial is a tutorial for those who are studying NLP (Natural Language Processing) using PyTorch. Most of the models in NLP were implemented with less than 100 lines of code (excluding comments and blank lines).

  • [08-14-2020] Old TensorFlow v1 code is archived in the archive folder. For beginner readability, only PyTorch version 1.0 or higher is supported.

Curriculum - (Example Purpose)

1. Basic Embedding Model

2. CNN(Convolutional Neural Network)

3. RNN(Recurrent Neural Network)

4. Attention Mechanism

5. Model based on Transformer


  • Python 3.5+
  • PyTorch 1.0.0+


