NLP research has evolved from the era of punch cards and batch processing, in which the analysis of a single sentence could take up to 7 minutes, to the era of Google and its peers, in which millions of webpages can be processed in less than a second.
NLP research now increasingly focuses on deep learning models. For decades, ML approaches targeting NLP problems were based on shallow models (e.g., SVMs and logistic regression) trained on very high-dimensional, sparse features. In the last few years, neural networks based on dense vector representations have been producing superior results on various NLP tasks.
Representations of Words for Models:
- Word Embeddings (Word2Vec, Bag of Words, n-grams, GloVe); see the Word2Vec sketch after this list
- Character Embeddings
- Contextual Word Embeddings
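As a quick illustration of the static (non-contextual) embeddings above, here is a minimal sketch of training Word2Vec vectors with gensim; the toy corpus, the gensim 4.x API usage, and the hyperparameters are my own illustrative assumptions, not something taken from the papers below.

```python
# A minimal Word2Vec sketch, assuming gensim >= 4.0 and a toy corpus.
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (purely illustrative).
corpus = [
    ["nlp", "has", "evolved", "rapidly"],
    ["deep", "learning", "models", "dominate", "nlp"],
    ["word", "embeddings", "map", "words", "to", "dense", "vectors"],
]

# Train a small skip-gram model (sg=1); each word gets a 50-dimensional vector.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["nlp"]             # dense vector for the word "nlp"
print(vector.shape)                  # (50,)
print(model.wv.most_similar("nlp"))  # nearest neighbours in embedding space
```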
*ELMo (Embeddings from Language Models) is a method that provides deep contextual embeddings. It produces a word embedding for each context in which the word is used, allowing different representations for different senses of the same word. ELMo builds its representation from a bidirectional language model, which consists of two language models (LMs): a forward LM and a backward LM. The forward LM takes the input representation of each token k and passes it through L layers of a forward LSTM to obtain hidden representations; each of these, being the hidden state of an RNN, is context dependent. Given the first k-1 tokens, the forward LM predicts the kth token. The backward LM models the same joint probability of the sequence by predicting previous tokens given the future ones; in other words, it is the forward LM applied to the sequence in reversed order. The hidden representations from both LMs are concatenated to compose the final token vector.
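To make the biLM idea concrete, here is a toy PyTorch sketch (my own, not the actual ELMo implementation) in which a forward and a backward LSTM run over the token embeddings and their hidden states are concatenated per token; real ELMo additionally uses character convolutions, two biLM layers, and a learned weighting over the layers.

```python
# Toy sketch of the ELMo core idea: concatenate forward and backward
# LSTM hidden states so every token's vector depends on its full context.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128

embedding = nn.Embedding(vocab_size, emb_dim)
# bidirectional=True runs a forward and a backward LSTM and concatenates
# their hidden states at every time step.
bilm = nn.LSTM(emb_dim, hidden_dim, num_layers=1,
               batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (1, 6))  # one sentence of 6 token ids
contextual, _ = bilm(embedding(tokens))        # shape: (1, 6, 2 * hidden_dim)
print(contextual.shape)                        # torch.Size([1, 6, 256])
```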
Devlin et al. proposed BERT, which uses a Transformer network to pre-train a language model from which word embeddings can be extracted. Unlike ELMo, BERT uses different pre-training tasks for language modeling. In one task, BERT randomly masks a percentage of the words in a sentence and predicts only those masked words; in the other, BERT predicts whether a given sentence follows another (next-sentence prediction). These approaches to contextual word embeddings promise better-quality word representations, and the pre-trained deep language models also give downstream tasks a head start in the form of transfer learning.
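Below is a minimal sketch of extracting contextual token embeddings from a pre-trained BERT model using the Hugging Face transformers library; the library, the model name, and the example sentence are illustrative assumptions and not part of the original write-up.

```python
# Assumed usage of the Hugging Face `transformers` library to pull
# contextual embeddings out of a pre-trained BERT model.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentence = "The bank raised interest rates."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub-word) token: shape (1, num_tokens, 768).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```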
Models for Natural Language Processing:
- Convolutional Neural Network:
- The use of CNNs for sentence modeling traces back to Collobert and Weston, whose model performs word-wise class predictions. In their model, a lookup table transforms each word into a vector of user-defined dimension; an input of n words is thus transformed into a sequence of vectors by applying the lookup table to each of its words.
- Sentence Modelling: Let w_i be the word embedding of the ith word in a sentence. A sentence with n words is then represented as an embedding matrix of size n × d, where d is the dimension of the word embeddings.
- In a CNN, a number of convolutional filters, also called kernels, of different widths slide over the entire word-embedding matrix. Each kernel extracts a specific n-gram pattern. A convolution layer is usually followed by max pooling, which subsamples the output of each filter, typically by taking its maximum. Max pooling provides a fixed-length output, which is generally required for classification: regardless of the filter sizes, it always maps the input to a fixed number of output dimensions (see the text-CNN sketch after this list).
- Stacking convolutions helps mine the sentence more deeply, capturing increasingly abstract representations with rich semantic information. With deeper layers, the kernels cover a larger part of the sentence, until they finally cover it fully and create a global summary of the sentence features. This architecture thus models complete sentences into sentence representations.
- However, many NLP tasks, such as NER, POS tagging, and SRL, require word-level predictions. To adapt CNNs for such tasks, a window approach is used, which assumes that the tag of a word primarily depends on its neighboring words. For each word, a fixed-size window centered on it is taken and only the sub-sentence within that window is considered. A standalone CNN is applied to this sub-sentence as explained earlier, and the prediction is attributed to the word at the center of the window. The ultimate goal of word-level classification is generally to assign a sequence of labels to the entire sentence. See the paper for more on the time-delay neural network (TDNN), a CNN-inspired technique that considers all windows of words in the sentence at the same time (a window-approach sketch also follows this list).
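Here is a sketch of the sentence-level CNN described above, written as an assumed PyTorch implementation in the style of a standard text CNN: kernels of several widths slide over the n × d embedding matrix, and max pooling over time yields a fixed-length sentence vector for classification. Layer sizes and hyperparameters are illustrative choices.

```python
# Assumed PyTorch text CNN: convolution filters of several widths + max pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, num_filters=64,
                 kernel_sizes=(2, 3, 4), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # One Conv1d per kernel width; each filter detects an n-gram pattern.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, n)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, d, n)
        # Max pooling over time gives a fixed-size vector per filter,
        # regardless of sentence length or kernel width.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # (batch, num_classes)

model = TextCNN(vocab_size=5000)
logits = model(torch.randint(0, 5000, (8, 20)))        # 8 sentences, 20 tokens each
print(logits.shape)                                    # torch.Size([8, 2])
```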
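And here is a sketch of the window approach for word-level tasks. For simplicity it classifies the concatenated window embeddings with a small feed-forward layer (in the spirit of Collobert and Weston's window approach) rather than running a full CNN per window; the window size, tagset size, and padding index are assumptions.

```python
# Assumed window-based word tagger: each word is tagged from a fixed
# window of neighbouring words centred on it.
import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, window=2, num_tags=10):
        super().__init__()
        self.window = window
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.classifier = nn.Sequential(
            nn.Linear((2 * window + 1) * emb_dim, 128),
            nn.Tanh(),
            nn.Linear(128, num_tags),
        )

    def forward(self, token_ids):                            # (n,) one sentence
        # Pad so every word has a full window of neighbours.
        padded = nn.functional.pad(token_ids, (self.window, self.window))
        windows = padded.unfold(0, 2 * self.window + 1, 1)   # (n, 2*window+1)
        emb = self.embedding(windows).flatten(1)             # (n, (2w+1)*d)
        return self.classifier(emb)                          # one tag-score row per word

tagger = WindowTagger(vocab_size=5000)
scores = tagger(torch.randint(1, 5000, (7,)))                # 7-word sentence
print(scores.shape)                                          # torch.Size([7, 10])
```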
- Recurrent Neural Networks
- RNNs are built around the idea of processing sequential information. The term "recurrent" applies because they perform the same computation on each element of the sequence, with the output depending on the previous computations.
- RNNs capture the inherent sequential nature of language, where the units are characters, words, or even sentences.
- Unlike CNNs, RNNs have a flexible number of computational steps, which provides better modeling capability and makes it possible to capture unbounded context.
- CNNs and RNNs have different objectives when modeling a sentence. While RNNs try to compose a representation of an arbitrarily long sentence together with unbounded context, CNNs try to extract the most important n-grams, so long-term dependencies are usually ignored.
- LSTMs and GRUs are extensions of RNNs with gating mechanisms that mitigate the vanishing and exploding gradient problems.
- Each unit in an RNN takes the current input and the previous time step's hidden state, applies a non-linear transformation such as tanh or ReLU, and uses weight matrices U, V, and W that are shared across time steps (see the sketch after this list).
- RNNs are used for word-level classification, sentence-level classification, and language generation (a sentence-classification sketch also follows this list).
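To spell out the recurrence mentioned above, here is a minimal NumPy sketch of a vanilla RNN step with the shared weights U (input to hidden), W (hidden to hidden), and V (hidden to output); all sizes are arbitrary illustrative choices.

```python
# Vanilla RNN recurrence: the same U, W, V are reused at every time step,
# and h_t depends on the current input and the previous hidden state.
import numpy as np

input_dim, hidden_dim, output_dim = 8, 16, 4
rng = np.random.default_rng(0)

U = rng.normal(size=(hidden_dim, input_dim))    # input  -> hidden
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden -> hidden (recurrent)
V = rng.normal(size=(output_dim, hidden_dim))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sequence = [rng.normal(size=input_dim) for _ in range(5)]  # 5 time steps
h = np.zeros(hidden_dim)                                   # initial hidden state

for x_t in sequence:
    h = np.tanh(U @ x_t + W @ h)      # h_t = tanh(U x_t + W h_{t-1})
    y_t = softmax(V @ h)              # output distribution at step t
print(y_t.shape)                      # (4,)
```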
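And as an example of sentence-level classification with a gated RNN, here is an assumed PyTorch sketch of an LSTM classifier that summarizes a sentence with its final hidden state; the vocabulary size, dimensions, and class count are placeholders.

```python
# Assumed LSTM sentence classifier: the final hidden state summarises
# the sentence and is fed to a linear classifier.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embedding(token_ids))
        return self.fc(h_n[-1])                    # classify from the last hidden state

model = LSTMClassifier(vocab_size=5000)
logits = model(torch.randint(0, 5000, (4, 12)))    # 4 sentences, 12 tokens each
print(logits.shape)                                # torch.Size([4, 2])
```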
Papers I referred to this week:
- Recent Trends in Deep Learning Based Natural Language Processing – https://ieeexplore.ieee.org/abstract/document/8416973
- Combination of CNN and RNNs for Sentiment Analysis of Short Texts – https://www.aclweb.org/anthology/papers/C/C16/C16-1229/
- Comparative Analysis of CNN and RNN in NLP – https://arxiv.org/abs/1702.01923
Videos/tutorials I referred to:
- Simple Deep Neural Networks for Text Classification – https://www.youtube.com/watch?v=wNBaNhvL4pg
- NLP – Text Preprocessing and Text Classification (using Python) – https://www.youtube.com/watch?v=nxhCyeRR75Q&list=PLIG2x2RJ_4LTF-IIu7-J3y_yg8LRe1WZq
- 8. Text Classification Using Convolutional Neural Networks (2019) – https://www.youtube.com/watch?v=8YsZXTpFRO0
- Recurrent Neural Network – The Math of Intelligence (Week 5) – https://www.youtube.com/watch?v=BwmddtPFWtA