Transformer Notes


Blog : http://jalammar.github.io/illustrated-transformer/

Paper: Attention Is All You Need

What is a Transformer?

From 1,000 feet, it converts an input sequence to an output sequence (usually used for machine translation).

From 100 feet, it consists of multiple layers of encoders and decoders.

From 10 feet, there are 6 layers of encoders and 6 layers of decoders.

Question: why 6 layers? Why not more or fewer?

From 1 foot, each encoder has a self-attention layer and a feed-forward neural network.

Each input word is embedded into a vector, fed into the self-attention layer, and then into the feed-forward layer.
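Below is a minimal PyTorch sketch of that flow, just to make the shapes concrete. Positional encoding, residual connections, and layer normalization are omitted, and the sizes are illustrative rather than the paper's (d_model = 512, 8 heads, d_ff = 2048).

```python
import torch
import torch.nn as nn

d_model, n_heads, d_ff, vocab_size = 64, 4, 256, 1000  # illustrative sizes

embed = nn.Embedding(vocab_size, d_model)              # word id -> vector
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
)

tokens = torch.randint(0, vocab_size, (1, 5))  # one sentence of 5 token ids
x = embed(tokens)                              # embed the input words
attn_out, _ = self_attn(x, x, x)               # self-attention over the sentence
out = feed_forward(attn_out)                   # position-wise feed-forward
print(out.shape)                               # torch.Size([1, 5, 64])
```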

Word embedding

Word embedding maps each word to a vector, and semantically similar words end up close to each other.

Before word embedding was invented, one-hot encoding was used, where each word is represented by a giant vector whose length is the vocabulary size and only one element is set to one.
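A toy sketch contrasting the two representations (the vocabulary and dimensions are made up for illustration):

```python
import numpy as np

vocab = ["king", "queen", "apple"]     # toy vocabulary
vocab_size, embed_dim = len(vocab), 4  # illustrative sizes

def one_hot(word):
    # A vocab_size-length vector with a single 1.
    vec = np.zeros(vocab_size)
    vec[vocab.index(word)] = 1.0
    return vec

# Dense embedding: a vocab_size x embed_dim lookup table,
# randomly initialized here but learned in practice.
embedding_table = np.random.randn(vocab_size, embed_dim)

def embed(word):
    return embedding_table[vocab.index(word)]

print(one_hot("king"))  # length == vocab_size, all zeros except one entry
print(embed("king"))    # length == embed_dim, dense and learned
```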

Word embedding comes with two benefits:

1. Dimension reduction, which saves computation resources.

2. Word semantics are encoded in the vector.

How the embedding is trained

CBOW (the word2vec approach) is short for Continuous Bag-of-Words. It sets a window over the sentence and removes one target word, then lets the model predict the target word from the context words within the window.

The vectors created by Word Embedding preserve these similarities, so words that regularly occur nearby in text will also be in close proximity in vector space.
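A minimal sketch of how the CBOW (context, target) training pairs are formed, assuming a window of 2 words on each side:

```python
# Slide a window over the sentence; the model is trained to predict the
# removed target word from the average of its context word vectors.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []  # (context_words, target_word)
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, target))

print(pairs[4])  # (['brown', 'fox', 'over', 'the'], 'jumps')
```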

How about using this idea to encode music notes? We could have a note -> vector embedding for each musician.

Self-attention

Self-attention represents how much attention the model should pay to each context word, and to the source word itself, across the whole sentence.

For each word, we generate Q, K, and V from its embedding (or the output of the previous encoder layer) by multiplying it with the trained Wq, Wk, and Wv matrices.

Then we calculate the attention for each word using the algorithm described below.
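The algorithm is the scaled dot-product attention from the paper: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. A minimal single-head sketch (dimensions are illustrative, not the paper's d_k = 64):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how much each word attends to every word
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of the value vectors

n_words, d_model, d_k = 3, 8, 4        # toy sizes
x = np.random.randn(n_words, d_model)  # embeddings or previous encoder output
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))  # trained in practice
Q, K, V = x @ Wq, x @ Wk, x @ Wv
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```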

Question: What do Query, Key, and Value represent?

Query:

Multi-head attention

Residual connection

https://stats.stackexchange.com/questions/321054/what-are-residual-connections-in-rnns
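For reference, a residual connection just adds a sublayer's input back to its output; in the Transformer each sublayer is wrapped this way (layer normalization, which the paper also applies, is omitted in this sketch):

```python
import numpy as np

def residual(sublayer, x):
    # The identity path lets information and gradients flow straight through,
    # even when the sublayer's transformation is deep or hard to learn.
    return x + sublayer(x)

W = np.random.randn(8, 8)    # stand-in for a sublayer's weights
x = np.random.randn(5, 8)    # 5 positions, d_model = 8
out = residual(lambda h: np.tanh(h @ W), x)
print(out.shape)             # (5, 8)
```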

My takeaways on the Transformer

Benefits

  1. Parallelism across sentences; the only sequential part is the decoder, which generates each sentence's output one token at a time.

Limitations

  1. The Transformer only considers context within a single sentence; no cross-sentence context is taken into account.
