Natural Language Processing with Deep Learning

YouTube Video List

Syllabus



Differential, Gradient, Jacobian, Chain Rule

Review of differential calculus theory



word2vec

Some word representations:

The idea is to predict the surrounding words in a window of length $m$ around every word.

Bag of words, Hierarchical softmax

Count-based distributional models vs Neural network-based models

The objective function is to maximize the log probability of any context word given the current center word:

where $\theta$ represents all variables and $T$ is the total number of words (not vocabulary).
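
For reference, the standard skip-gram form of this objective (Mikolov et al., 2013), written as an average log probability over a window of size $m$, is:

$$J(\theta)=\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\le j\le m,\,j\neq 0}\log p(w_{t+j}\mid w_t)$$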

The simplest formulation for $p(w_{t+j}|w_t)$ is (o for outside, c for center):

Here, every word $w$ has two vectors: $v_w$ when it appears as the center word (the condition in $p(\cdot|w)$) and $u_w$ when it appears as an outside word (the outcome in $p(w|\cdot)$).
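
Concretely, with $v_c$ the center-word vector and $u_o$ an outside-word vector (the same $u$, $v$ notation as the sum $\sum_{w=1}^V\exp(u_w^Tv_c)$ used below), the softmax formulation is:

$$p(o\mid c)=\frac{\exp(u_o^{T}v_c)}{\sum_{w=1}^{V}\exp(u_w^{T}v_c)}$$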

To maximize $J(\theta)$, we calculate the gradient:

We can use Stochastic Gradient Descent to accelerate the optimization: update parameters after each window $t$.
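
A single SGD step is then the usual update applied to the gradient of the objective restricted to window $t$ (with $\alpha$ a learning rate):

$$\theta^{\text{new}}=\theta^{\text{old}}-\alpha\,\nabla_{\theta}J_t(\theta)$$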

Skip-gram model

Word2Vec Tutorial - The Skip-Gram Model

Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.

Since it’s expensive to calculate the sum $\sum\limits_{w=1}^V\exp(u_w^Tv_c)$, we change the objective function as follows:

where $\sigma(x)=\frac{1}{1+e^{-x}}$, $P(w)=\frac{U(w)^{3/4}}{Z}$, and $U(w)$ is the unigram distribution (the frequency of the word $w$); the $3/4$ power makes less frequent words be sampled more often.
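
For reference, the negative-sampling objective from Mikolov et al. (2013) replaces, for a center/outside pair $(c,o)$, the softmax term with the following (to be maximized, with $k$ noise words drawn from $P(w)$):

$$\log\sigma(u_o^{T}v_c)+\sum_{i=1}^{k}\mathbb{E}_{w_i\sim P(w)}\left[\log\sigma(-u_{w_i}^{T}v_c)\right]$$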

Negative sampling

We choose $k$ negative samples (instead of summing over all words in the vocabulary), and the objective is to maximize the probability of the true outside word while minimizing the probability of the sampled noise words.

Here $k$ is 5-20 for small training sets and 2-5 for large datasets.
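
As a small illustration, here is a sketch of drawing $k$ negative samples from the $P(w)=U(w)^{3/4}/Z$ noise distribution described above; the word counts and function names are made up for the example, not taken from the notes.

```python
import numpy as np

def make_noise_distribution(counts):
    """Turn raw unigram counts U(w) into the smoothed noise distribution P(w)."""
    words = list(counts)
    probs = np.array([counts[w] for w in words], dtype=np.float64) ** 0.75
    probs /= probs.sum()  # divide by Z so the probabilities sum to 1
    return words, probs

def sample_negatives(words, probs, k, exclude):
    """Draw k negative words, skipping any word in `exclude` (e.g. the true pair)."""
    samples = []
    while len(samples) < k:
        w = np.random.choice(words, p=probs)
        if w not in exclude:
            samples.append(str(w))
    return samples

# Toy example: frequent words are still likely, but the 3/4 power flattens the gap.
words, probs = make_noise_distribution({"the": 1000, "bank": 50, "river": 20})
print(sample_negatives(words, probs, k=2, exclude={"bank"}))
```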

Subsampling of frequent words

Some words occur very frequently, like “the”, “in”, “a”, etc., so we can apply a subsampling scheme:

Each word $w_i$ in the training set is discarded with probability $P(w_i)$:

where
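
For reference, the discard probability used in Mikolov et al. (2013) is, with $f(w_i)$ the frequency of word $w_i$ and $t$ a chosen threshold (around $10^{-5}$ in the paper):

$$P(w_i)=1-\sqrt{\frac{t}{f(w_i)}}$$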

Word pairs

Since “New York” has a much different meaning than the individual words “New” and “York”, it makes sense to treat “New York”, wherever it occurs in the text, as a single word with its own word vector representation.

Continuous bag of words model (CBOW)

Predict the center word from the sum of the surrounding word vectors, instead of predicting surrounding words one at a time from the center word as in the skip-gram model.
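
Following the sum formulation above, a sketch of the CBOW prediction is (with $\hat{v}$ the sum of the context vectors and the same $u$, $v$ notation as before):

$$\hat{v}=\sum_{-m\le j\le m,\,j\neq 0}v_{t+j},\qquad p(w_t\mid\text{context})=\frac{\exp(u_{w_t}^{T}\hat{v})}{\sum_{w=1}^{V}\exp(u_w^{T}\hat{v})}$$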

GloVe

Pennington, Jeffrey, Richard Socher, and Christopher Manning. “Glove: Global vectors for word representation.” Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

The model is trained on the word–word co-occurrence counts $X_{ij}$ (how often word $j$ occurs in the context of word $i$).
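
For reference, the weighted least-squares objective from the paper is (with $w_i$, $\tilde{w}_j$ the word and context vectors, $b_i$, $\tilde{b}_j$ biases, and $f$ a weighting function that caps the influence of very frequent pairs):

$$J=\sum_{i,j=1}^{V}f(X_{ij})\left(w_i^{T}\tilde{w}_j+b_i+\tilde{b}_j-\log X_{ij}\right)^{2}$$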

Evaluation

Intrinsic Evaluation: Word Vector Analogies

a:b :: c:?

e.g. man:woman :: king:?, i.e. the relation between “man” and “woman” is like the relation between “king” and which word?

This can actually be calculated by:
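
One standard formulation is to pick the word whose vector is most similar (by cosine similarity) to $x_b-x_a+x_c$:

$$d=\arg\max_{i}\frac{(x_b-x_a+x_c)^{T}x_i}{\lVert x_b-x_a+x_c\rVert}$$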

Some analogies:

What about ambiguity?

Improving Word Representations via Global Context and Multiple Word Prototypes (Huang et al., 2012)

Cluster word windows around words, then retrain with each word assigned to multiple different clusters: bank1, bank2, etc.

Extrinsic Evaluation

Named entity recognition: find a person, organization, or location.



Word Window Classification

Cross entropy

Cross-entropy can be re-written in terms of the entropy and Kullback-Leibler divergence between the two distributions

Assume a ground truth (or gold, or target) probability distribution that is 1 at the right class and 0 everywhere else, $p = [0,\cdots,0,1,0,\cdots,0]$, and let $q$ be our computed distribution; then the cross entropy is:

Since p is one-hot, the only term left is the negative log probability of the true class y:
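
In formulas (standard definitions, with $C$ classes):

$$H(p,q)=-\sum_{c=1}^{C}p(c)\log q(c)=H(p)+D_{KL}(p\,\Vert\,q)$$

and since $p$ is one-hot at the true class $y$, $H(p)=0$ and the loss reduces to

$$H(p,q)=-\log q(y)$$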

Logistic regression

For a word vector $x$, the class probability can be computed as:

where

We want to maximize the probability of the correct class y, so we minimize $J=-\log p(y|x)$.
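
For a softmax (multinomial logistic regression) classifier with weight matrix $W\in\mathbb{R}^{C\times d}$, consistent with the $f_{y_i}=(Wx_i)_{y_i}$ notation used below, the probability is:

$$p(y\mid x)=\frac{\exp(W_{y\cdot}\,x)}{\sum_{c=1}^{C}\exp(W_{c\cdot}\,x)}$$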

Gradients

The corresponding gradients are:
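
Writing $f=Wx$ and $q=\text{softmax}(f)$, the standard softmax cross-entropy gradients with respect to the scores, the weights, and the word vector are (with $p$ the one-hot target):

$$\frac{\partial J}{\partial f}=q-p,\qquad\frac{\partial J}{\partial W}=(q-p)\,x^{T},\qquad\frac{\partial J}{\partial x}=W^{T}(q-p)$$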

For the full dataset $(x_i,y_i)$, the cross entropy loss function (with regularization) is:

where $f_{y_i}$ is the $y_i$-th element of $Wx_i$.
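
Written out, one standard form of this loss is the following (the exact shape of the regularization term is an assumption here):

$$J(\theta)=\frac{1}{N}\sum_{i=1}^{N}-\log\frac{e^{f_{y_i}}}{\sum_{c=1}^{C}e^{f_c}}+\lambda\sum_{k}\theta_k^{2}$$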

In classification with word vectors, the parameters are not only $W$ but also the word vectors $x$. So the number of parameters is $Cd+Vd$ ($V$ is the size of the vocabulary), and if there isn’t enough data we may have overfitting problems.

Window classification

The context is very important for classifying words. For example, “to sanction” can mean “to permit” or “to punish”; it all depends on the context.

So we can classify a word in a context window of neighboring words. For example, with window length 2, $x_{\text{window}}=x\in\mathbb{R}^{5d}$. We can compute the gradient the same way as before; now $\nabla_x J\in\mathbb{R}^{5d}$ and we need to update the word vectors separately.
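
Concretely, the window vector is the concatenation of the word vectors in the window; for window length 2 around position $t$:

$$x_{\text{window}}=\left[\,x_{t-2}\,;\,x_{t-1}\,;\,x_{t}\,;\,x_{t+1}\,;\,x_{t+2}\,\right]\in\mathbb{R}^{5d}$$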

Neural networks

Now we use a single layer and an unnormalized score:
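
A minimal sketch of this architecture, consistent with the parameters $U$, $W$, $b$ listed below, is a single hidden layer followed by a linear score:

$$z=Wx+b,\qquad a=f(z),\qquad s=U^{T}a$$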

Max-margin loss

The idea is to make the score of the true window larger and the corrupt window’s score lower (until they are good enough):

where, for example, if we’d like to build a location classifier, $s$ is the score of a window whose center word is a location and $s_c$ the score of a corrupt window whose center word is not.
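
With this notation, the max-margin loss (to minimize, with the common choice of margin 1) is:

$$J=\max\left(0,\,1-s+s_c\right)$$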

Now the parameters are $U,W,b,x$. By noting ($\circ$ means element-wise multiplication)

The gradients are
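
A sketch of these gradients for the score $s=U^{T}f(Wx+b)$, writing $z=Wx+b$, $a=f(z)$, and $\delta=U\circ f'(z)$ (the $\delta$ shorthand is introduced here for compactness):

$$\frac{\partial s}{\partial U}=a,\qquad\frac{\partial s}{\partial W}=\delta\,x^{T},\qquad\frac{\partial s}{\partial b}=\delta,\qquad\frac{\partial s}{\partial x}=W^{T}\delta$$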



Recurrent Neural Networks

Recurrent Neural Network

Given a list of word vectors $x_1,\cdots,x_{t-1},x_t,x_{t+1},\cdots,x_T$.

At a single time step,

where
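
A standard formulation is, with $h_t$ the hidden state, $x_t$ the input word vector, and $\hat{y}_t$ a softmax distribution over the vocabulary:

$$h_t=\sigma\!\left(W^{(hh)}h_{t-1}+W^{(hx)}x_t\right),\qquad\hat{y}_t=\text{softmax}\!\left(W^{(S)}h_t\right)$$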

The corresponding loss function is

The total evaluation is (to minimize)
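
A standard choice is the cross-entropy over the vocabulary at each time step, averaged over the sequence; a common evaluation measure is then the perplexity $2^{E}$, which is minimized:

$$E_t=-\sum_{j=1}^{|V|}y_{t,j}\log\hat{y}_{t,j},\qquad E=\frac{1}{T}\sum_{t=1}^{T}E_t$$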

Simpler RNN

We now consider a simpler RNN:
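
A common simplified form used for analyzing the gradients (output details omitted) is:

$$h_t=W\,f(h_{t-1})+W^{(hx)}x_t,\qquad\hat{y}_t=W^{(S)}f(h_t)$$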

The total error is the sum of the errors at each time step $t$,

and applying the chain rule gives:

as well as

where
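
Written out, the backpropagation-through-time factors referred to above are:

$$\frac{\partial E}{\partial W}=\sum_{t=1}^{T}\frac{\partial E_t}{\partial W},\qquad\frac{\partial E_t}{\partial W}=\sum_{k=1}^{t}\frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$$

$$\frac{\partial h_t}{\partial h_k}=\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}},\qquad\frac{\partial h_j}{\partial h_{j-1}}=W^{T}\,\text{diag}\!\left[f'(h_{j-1})\right]$$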

The vanishing/exploding gradient problem

We can analyze the norms of the Jacobians

where $\beta_W,\beta_h$ are upper bounds of the norms.

Then

where the right side can vanish or explode quickly.
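
In formulas, bounding each factor and then the product:

$$\left\lVert\frac{\partial h_j}{\partial h_{j-1}}\right\rVert\le\left\lVert W^{T}\right\rVert\,\left\lVert\text{diag}\!\left[f'(h_{j-1})\right]\right\rVert\le\beta_W\,\beta_h$$

$$\left\lVert\frac{\partial h_t}{\partial h_k}\right\rVert=\left\lVert\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right\rVert\le\left(\beta_W\,\beta_h\right)^{t-k}$$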

Bidirectional RNN

Bidirectional Recurrent Neural Network

Deep Bidirectional RNN

Deep Bidirectional Recurrent Neural Network



Machine Translation

RNN Translation Model Extensions

Machine Translation Recurrent Neural Network

Gated Recurrent Units

Machine Translation Gated Recurrent Units

where $\circ$ means element-wise multiplication.
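
For reference, one common writing of the GRU equations (conventions for which gate weights the previous state versus the new memory vary slightly across sources):

$$z_t=\sigma\!\left(W^{(z)}x_t+U^{(z)}h_{t-1}\right)\quad\text{(update gate)}$$

$$r_t=\sigma\!\left(W^{(r)}x_t+U^{(r)}h_{t-1}\right)\quad\text{(reset gate)}$$

$$\tilde{h}_t=\tanh\!\left(Wx_t+r_t\circ Uh_{t-1}\right)\quad\text{(new memory)}$$

$$h_t=z_t\circ h_{t-1}+(1-z_t)\circ\tilde{h}_t$$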

Long Short-Term Memories (LSTMs)

Understanding LSTM Networks

LSTMs

where
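
For reference, one common writing of the LSTM equations (input, forget, and output gates, new memory cell, final memory cell, and hidden state; the exact parametrization varies slightly across sources):

$$i_t=\sigma\!\left(W^{(i)}x_t+U^{(i)}h_{t-1}\right),\qquad f_t=\sigma\!\left(W^{(f)}x_t+U^{(f)}h_{t-1}\right),\qquad o_t=\sigma\!\left(W^{(o)}x_t+U^{(o)}h_{t-1}\right)$$

$$\tilde{c}_t=\tanh\!\left(W^{(c)}x_t+U^{(c)}h_{t-1}\right),\qquad c_t=f_t\circ c_{t-1}+i_t\circ\tilde{c}_t,\qquad h_t=o_t\circ\tanh(c_t)$$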

Pointer Sentinel Mixture Models



Not finished

Notes up to lecture 8