2 : Word Representation

NLP Input

Words are the basic input for most NLP applications

However, this depends: the input could also be characters (especially in languages like Chinese), or smaller units such as lemmas, roots or subword units

How do we represent words?

First, we do pre-processing

This could be:

  • Tokenisation (splitting words from punctuation)
  • Punctuation removal (removing punctuation altogether)
  • Normalisation (conflating token variants to the same form, e.g. removing capitalisation)
  • Stop word removal (removing words which carry little meaning, e.g. 'the')
  • Rare word removal (removing words for which sufficient statistics cannot be computed, e.g. below a certain count threshold)
  • Lemmatisation (reducing words to their base/dictionary form)
  • Stemming (reducing words to their root/stem)
  • Sub-word unit conversion (splitting words into subparts based on frequency, which handles rare words and reduces vocabulary size)
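
Here is a minimal sketch of a few of these steps (tokenisation, punctuation removal, normalisation and stop word removal) in plain Python; the regex tokeniser and the tiny stop word list are illustrative simplifications, not a standard pipeline:

```python
import re

# Tiny illustrative stop word list; real pipelines use much larger ones
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}

def preprocess(text):
    # Tokenisation: split words from punctuation with a simple regex
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Punctuation removal: keep only alphanumeric tokens
    tokens = [t for t in tokens if t.isalnum()]
    # Normalisation: conflate variants by lowercasing
    tokens = [t.lower() for t in tokens]
    # Stop word removal
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat chased the mouse!"))  # ['cat', 'chased', 'mouse']
```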

Byte Pair Encoding

This is a type of sub-word unit conversion, following the algorithm:

1. Count the frequency of each token in the token vocabulary Vt
2. Segment each token into characters
3. Add an end-of-token marker, giving a new symbol vocabulary Vc
4. Merge the most frequent pair of adjacent symbols and update Vc
5. Repeat step 4 for a maximum number of merge operations


The algorithm above is done at training time on a vocabulary
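
A rough sketch of this training procedure (not a reference implementation; the toy token counts at the bottom are hypothetical):

```python
from collections import Counter

def learn_bpe(token_counts, num_merges):
    """Learn BPE merges from a dict of {token: frequency}."""
    # Segment tokens into characters and add an end-of-token marker </w>
    vocab = {tuple(word) + ("</w>",): count for word, count in token_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by token frequency
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Merge the most frequent pair and update the vocabulary
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# Toy corpus counts (hypothetical numbers)
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
print(merges)
```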

When you want to represent a word at test time, you do the following:

1. Get all symbol bigrams in the word
2. Find the symbol pair that appeared earliest among the learned merges (Vc)
3. Apply that merge to the word
4. Repeat from step 1 until no learned merge applies
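
A minimal sketch of this test-time procedure, assuming an ordered list of merges like the one produced by the training sketch above (the hard-coded merges here are hypothetical):

```python
def apply_bpe(word, merges):
    """Segment a word using an ordered list of learned merges."""
    symbols = list(word) + ["</w>"]
    # Rank merges so that earlier-learned merges have higher priority
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Get all symbol bigrams in the word
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Find the pair that was learned earliest
        candidates = [p for p in pairs if p in rank]
        if not candidates:
            break
        best = min(candidates, key=lambda p: rank[p])
        # Apply the merge to the word
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merges learned at training time
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
print(apply_bpe("lowest", merges))  # ['low', 'est</w>']
```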

Preprocessing

Preprocessing is very important

It's where the linguistic insights start in NLP

The order of the steps usually matters; for example, you need to normalise before applying BPE

Ways of representing words

Once pre-processing is done, there are many ways of representing words, including:

  • Words themselves (Bag-of-Words)
  • Mapping words to concepts in a lexical database
  • Features of words
  • By co-occurrence with other words - count-based
  • By co-occurrence with other words - prediction-based (word embeddings)

Bag-of-Words (BOW)

This is where sentences are represented by words themselves

For example, the following two sentences would be represented as follows:

  1. The cat chased the mouse
  2. The kitten ran after the rat
     ...  after  cat  chase  kitten  mouse  rat  run  the  ...
  1.  0     0     1     1      0       1     0    0    1    0
  2.  0     1     0     0      1       0     1    1    1    0

The table has a 1 if the word is present, and a 0 if it is not

This is a binary vector; you could also use a frequency or TF-IDF vector

It is very sparse

It cannot model relatedness between words
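
As a sketch, a binary bag-of-words vector can be built like this (using raw surface forms rather than the lemmatised forms shown in the table above):

```python
sentences = ["the cat chased the mouse", "the kitten ran after the rat"]

# Build a fixed vocabulary from the corpus
vocab = sorted({word for s in sentences for word in s.split()})

def bow_vector(sentence, vocab):
    # Binary vector: 1 if the word is present, 0 otherwise
    words = set(sentence.split())
    return [1 if w in words else 0 for w in vocab]

for s in sentences:
    print(bow_vector(s, vocab))
```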

Mapping of words to concepts

This is where words are mapped to a concept, so the following example would use the mapping:

  • Cat & kitten → feline mammal
  • Mouse & rat → rodent mammal

To become:

     ...  after  feline  chase  rodent  run  the  ...
  1.  0     0      1       1      1      0    1    0
  2.  0     1      1       0      1      1    1    0

This is a smaller vector, and better encodes similarity

However, some issues are:

  • It misses nuances and new meanings (e.g. is a mouse = rat?)
  • Disambiguation is hard (does mouse go to rodent or computer device?)
  • Databases don't exist for most languages

Features extracted from words

This is an example which could be used for sentiment analysis

The feature may be the percentage of positive and negative words

For example, the following sentences would have the following representation:

  1. Vivid and sharp display, amazing camera, easy to use with one hand, extremely fast, and more responsive than any other phone I have used
  2. Unnecessary, don't upgrade if you're thinking about it, too expensive
     % positive  % negative
  1.      6           0
  2.      0           2

This requires domain knowledge (e.g. which words are positive, which are negative), and is also task specific
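
A minimal sketch of this kind of feature extraction, using small hand-crafted (hypothetical) positive and negative word lists:

```python
# Hypothetical hand-crafted sentiment lexicons (the domain knowledge)
POSITIVE = {"vivid", "sharp", "amazing", "easy", "fast", "responsive"}
NEGATIVE = {"unnecessary", "expensive"}

def sentiment_features(text):
    words = text.lower().replace(",", "").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Percentage of positive and negative words in the text
    return 100 * pos / len(words), 100 * neg / len(words)

print(sentiment_features("Vivid and sharp display, amazing camera, easy to use"))
```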

Co-occurrence (distributional hypothesis)

A word's meaning is given by the words that frequently appear with it in corpora

This is moving from discrete to continuous representation

Vector space models represent words in a continuous vector space where semantically similar words are mapped to nearby points

The approaches are:

  • Count based (co-occurrence matrix)
  • Prediction based (word2vec, GloVe, BERT)

Co-occurrence matrix (count-based)

Define a word by the words it co-occurs with in the corpus

Using the following 5 sentences as an example:

  • Cats eat mice
  • Dogs eat food
  • Trees are green
  • Kittens eat mice
  • Puppies eat food

We can represent a word by its context:

          cat  eat  mouse  kitten  dog  food  puppy  tree  are  green
  cat      -    1     1      0      0    0      0     0     0     0
  eat      1    -     1      1      1    1      1     0     0     0
  mouse    1    1     -      1      0    0      0     0     0     0
  kitten   0    1     1      -      0    0      0     0     0     0
  dog      0    1     0      0      -    1      0     0     0     0
  food     0    1     0      0      1    -      1     0     0     0
  puppy    0    1     0      0      0    1      -     0     0     0
  tree     0    0     0      0      0    0      0     -     1     1
  are      0    0     0      0      0    0      0     1     -     1
  green    0    0     0      0      0    0      0     1     1     -

This is a symmetric matrix

  • We should remove stop words
  • We can also use frequency or proportion of times, instead of just binary
  • Can weight terms by TF-IDF
  • Allows for some notion of similarity, e.g. the $L_2$ norm: $\sqrt{\sum_{i=1}^n (q_i - d_i)^2}$
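
A small sketch of building such a matrix from the five example sentences (two words co-occur here if they appear in the same sentence, and the sentences are pre-lemmatised for simplicity):

```python
import numpy as np

# The five example sentences, already lemmatised
sentences = [["cat", "eat", "mouse"], ["dog", "eat", "food"],
             ["tree", "are", "green"], ["kitten", "eat", "mouse"],
             ["puppy", "eat", "food"]]

words = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(words)}

# Binary co-occurrence: 1 if two words appear in the same sentence
M = np.zeros((len(words), len(words)), dtype=int)
for s in sentences:
    for a in s:
        for b in s:
            if a != b:
                M[index[a], index[b]] = 1

print(words)
print(M)
```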

If we calculate this $L_2$ distance between the vectors for cat and kitten, we get 1.41, whereas cat and dog give 2.65 (smaller = closer)

Cosine similarity is another way of measuring how close two vectors are (larger = more similar):

$\frac{\sum_{i=1}^n q_i d_i}{\sqrt{\sum_{i=1}^n q_i^2}\sqrt{\sum_{i=1}^n d_i^2}}$
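
Both measures are easy to compute on rows of the co-occurrence matrix. The rows below are illustrative (treating each word as co-occurring with itself), so the exact numbers may differ slightly from those quoted above, but the ordering is the same:

```python
import numpy as np

def l2_distance(q, d):
    # Square root of the sum of squared differences
    return np.sqrt(np.sum((q - d) ** 2))

def cosine_similarity(q, d):
    # Dot product divided by the product of the vector norms
    return np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

# Example rows of a co-occurrence matrix (illustrative values only)
cat    = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
kitten = np.array([0, 1, 1, 1, 0, 0, 0, 0, 0, 0])
dog    = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0])

print(l2_distance(cat, kitten), l2_distance(cat, dog))          # kitten is closer to cat
print(cosine_similarity(cat, kitten), cosine_similarity(cat, dog))
```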

TF-IDF is a common preprocessing step

This is where terms are weighted higher if they occur in some documents but not in others

For example, the word 'the' isn't very telling, as it occurs in most sentences, and therefore it's downweighted

$\text{TF-IDF}_{w,d,D} = TF_{w,d} \cdot IDF_{w,D}$

Where:

$TF_{w,d} = freq_{w,d}$ (the frequency of the word in the document) and $IDF_{w,D} = \log \frac{|D|}{df_w}$ (the log of the size of the collection over the number of documents in the collection which contain that word)
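
A minimal sketch of these formulas on a toy document collection (raw term frequency and a plain logarithm, as defined above; real implementations usually add smoothing):

```python
import math
from collections import Counter

documents = [
    "the cat chased the mouse".split(),
    "the kitten ran after the rat".split(),
    "the dog ate the food".split(),
]

def tf_idf(word, doc, documents):
    # TF: frequency of the word in the document
    tf = Counter(doc)[word]
    # IDF: log of collection size over number of documents containing the word
    df = sum(word in d for d in documents)
    idf = math.log(len(documents) / df)
    return tf * idf

print(tf_idf("the", documents[0], documents))  # common word -> weight 0
print(tf_idf("cat", documents[0], documents))  # rarer word -> higher weight
```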

Prediction based - Word Embeddings (word2vec)

However, one problem with these count-based vectors is that they are still very large and very sparse

Therefore, the solution is to learn word representations from data, and embed them in a smaller space

  • Implicitly model similarity between words in corpora
  • Already start with lower dimensionality space
  • Instead of counting co-occurrences, predict the words that appear in a word's context
  • Computationally efficient
  • Adding new words to the model scales with corpus size

This is the core idea of deep learning

There are two common models:

  • Continuous Bag-of-Words (CBOW)

    • This predicts the target word $w_t$ from the context words
    • $w_{t-j}$ ... $w_{t-2}$ $w_{t-1}$ ? $w_{t+1}$ $w_{t+2}$ ... $w_{t+j}$
  • Skip-gram model

    • This predicts the context words from the target word
    • ? ? ? $w_t$ ? ? ?

Focusing on skip-gram models, in a nutshell:

  • They're a neural network with one hidden layer
  • w(t) is the target word, given as input as a one-hot vector
  • The hidden layer performs the dot product between the weight matrix W and the input vector w(t); no activation function is used
  • The result of the dot product at the hidden layer is passed to the output layer, which computes the dot product between the output vector of the hidden layer and the weight matrix of the output layer (W')
  • The softmax activation function is applied to compute the probability of any vocabulary word appearing in the context of w(t)
  • At training time, output is a one-hot vector for each true context word
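
A rough numpy sketch of this forward pass, with random weights and toy sizes, purely to show the shapes involved (it is not a trained model):

```python
import numpy as np

vocab_size, hidden_size = 10, 5   # toy sizes (e.g. 10k and 300 in the text)
rng = np.random.default_rng(0)

W = rng.normal(size=(vocab_size, hidden_size))        # input -> hidden weights
W_prime = rng.normal(size=(hidden_size, vocab_size))  # hidden -> output weights

# w(t) as a one-hot input vector
x = np.zeros(vocab_size)
x[3] = 1

h = x @ W                 # hidden layer: just selects row 3 of W (no activation)
scores = h @ W_prime      # output layer scores for every vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

print(probs.shape, probs.sum())  # (10,) ~1.0
```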

Training

During training, we use pairs of target and context words to learn W and W'

We get these from running a window across input sentences, and picking words which are within that window of each word

For example, using the sentence "the quick brown fox jumps over the lazy dog" with window size 2, our first run would give (the, quick) and (the, brown)
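
A small sketch of generating these (target, context) pairs with a window size of 2:

```python
def training_pairs(sentence, window=2):
    words = sentence.split()
    pairs = []
    for i, target in enumerate(words):
        # Context words within `window` positions of the target
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

pairs = training_pairs("the quick brown fox jumps over the lazy dog")
print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```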

The goal is to find word representations that are useful for predicting the context words $w_{t+j}$ given a word $w_t$, by maximising the average log probability over a context window of size $c$
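
Written out, this is the standard skip-gram objective over a corpus of $T$ words:

$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$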

This is done using stochastic gradient descent with cross entropy as the loss function

The hidden layer will have a set dimension, say 300, and the vocabulary a set size, say 10k words

The embedding matrix is the weight matrix with 10k rows (one per word in V) and 300 columns (one per hidden unit)

The lookup table can be generated by multiplying a 1x10k one-hot vector by the 10kx300 matrix, i.e. selecting the matrix row corresponding to the position of the 1
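
A tiny numpy sketch of that lookup, with toy sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3))   # embedding matrix: 10 words x 3 dimensions

one_hot = np.zeros(10)
one_hot[7] = 1                 # one-hot vector for word 7

# The product just selects row 7 of W - the word's embedding
assert np.allclose(one_hot @ W, W[7])
```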

However, a caveat is that to compute the softmax in a single forward pass, you need to sum over the entire vocabulary

This is prohibitively expensive for large vocabularies, e.g. for 300 dimensions and 10k words this is 3 million weights in each of the hidden layer and the output layer

Therefore, approximation is used

This is done by removing the softmax, and using a binary classification objective (logistic regression) to discriminate real target words ($w_t$) from other noise words

So for example, using the sentence from before, to predict 'quick' from 'the', we select k noisy words which aren't in the context window of 'the'

Say k=1, and the noisy word is 'sheep'

We'd compute the loss function using both of these, so that the objective at this timestep is now:

$\log Q_\Theta(D=1 \mid \text{the, quick}) + \log Q_\Theta(D=0 \mid \text{the, sheep})$
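
A sketch of this objective for one positive pair and k sampled noise words, using a logistic-regression-style score $\sigma(u \cdot v)$ as a stand-in for $Q_\Theta$; the vectors here are random placeholders, not trained embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim = 5
target = rng.normal(size=dim)     # embedding of 'the' (stand-in)
positive = rng.normal(size=dim)   # context embedding of 'quick'
noise = [rng.normal(size=dim)]    # k=1 noise word, e.g. 'sheep'

# Maximise: log Q(D=1 | the, quick) + sum over noise of log Q(D=0 | the, noise)
objective = np.log(sigmoid(target @ positive))
objective += sum(np.log(1 - sigmoid(target @ n)) for n in noise)
print(objective)
```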

Negative sampling is done randomly or by frequency, usually with about 5-20 negative words on small datasets, and only 2-5 needed on larger ones

This speeds things up: for the 300x10k weight matrix we had, we only need to update the weights for the positive word (quick) plus the 5 other words we want the output to be 0 for

This is 6 output neurons, so 6x300 = 1800 weight values in total, much less than the 3M before

Testing

After training, we remove the output layer, and at test time we take the one-hot vector of a word; its product with W is the word embedding of the input word

W is the word embedding matrix, W' is the context matrix

What these vectors capture

Relationships, such as "man is to woman as uncle is to aunt"

Plurals, such as "apples is to apple as cars is to car"

Therefore, you can do arithmetic, such as: king - man + woman = queen

Other skip-gram tricks

  • You can also treat common word pairs or phrases as single "words"
  • Subsampling frequent words to decrease the number of training examples
  • Using hierarchical softmax

Other types of word embeddings

  • CBOW
  • GloVe
  • fastText
  • Contextualised word embeddings

    • ELMo
    • ULMFIT
    • BERT
    • RoBERTa

Having a quick look at one of these, ELMo: context matters in the embedding of a word

  • Instead of a fixed embedding for each word, it looks at the entire sentence before assigning each word an embedding
  • It uses a pre-trained, 2-layer, bi-directional, RNN-based language model that predicts the next word given the words before it
  • It extracts the hidden state of each layer for the input sequence of words
  • It then computes a weighted sum of those hidden states to obtain an embedding for each word
  • The weight of each hidden state is task-dependent, and is learned

BERT was inspired by this, but is based on Transformers rather than RNNs, so it has a more powerful use of context

  • It has an encoder that reads the entire sentence at once
  • Some % of the words are masked, then it's tasked with predicting them based on context
