This is complicated. Don't worry if you're confused.
Definition
Self-attention works the same way as described in Attention mechanism, except that the encoding happens within a single sequence. It helps the model understand and embed how parts of the input relate to other parts of the same input (tokens or chunks).
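For reference, this is the standard scaled dot-product attention formula that both code examples below implement, where the queries Q, keys K, and values V are all projections of the same input and d_k is the key dimension:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$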
Coding Example
from numpy import array
from numpy import random
from scipy.special import softmax
# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])
# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])
# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))
# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V
# scoring the query vectors against all key vectors
scores = Q @ K.transpose()
# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)
# computing the attention by a weighted sum of the value vectors
attention = weights @ V
print(attention)
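The result is a 4×3 array: one row per word, where each row is a weighted sum of the four value vectors.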
Let's rewrite this as a PyTorch model to make it cleaner.
from torch import nn
import torch
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    def __init__(self, feature_size):
        super().__init__()
        self.feature_size = feature_size
        # Q, K and V are all computed from the same source via learned linear projections
        self.k = nn.Linear(feature_size, feature_size)
        self.q = nn.Linear(feature_size, feature_size)
        self.v = nn.Linear(feature_size, feature_size)

    def forward(self, x, mask=None):
        # linear transformations of the input into keys, queries and values
        keys = self.k(x)
        queries = self.q(x)
        values = self.v(x)
        # scale the scores by sqrt(d_k), as in the NumPy example above
        scaling_factor = self.feature_size ** 0.5
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / scaling_factor
        # optional mask: blocked positions get -inf, so softmax gives them weight 0
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # apply softmax over the key dimension
        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, values)
        return output, attention_weights
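A quick sanity check of the module, reusing the same four toy embeddings from the NumPy example (the values are arbitrary; this just confirms the shapes):

layer = AttentionLayer(feature_size=3)
x = torch.tensor([[1., 0., 0.],
                  [0., 1., 0.],
                  [1., 1., 0.],
                  [0., 0., 1.]])
output, attention_weights = layer(x)
print(output.shape)             # torch.Size([4, 3]): one attended vector per word
print(attention_weights.shape)  # torch.Size([4, 4]): one weight per word pair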
matmul is just matrix multiplication: every entry of the output is the dot product of a row of the first matrix with a column of the second.
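A tiny illustration with made-up values, just to show that equivalence:

import torch
A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[5., 6.], [7., 8.]])
print(torch.matmul(A, B))        # tensor([[19., 22.], [43., 50.]])
print(torch.dot(A[0], B[:, 0]))  # tensor(19.), the top-left entry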
Since the key, query, and value projections are part of the model's parameters, they are updated automatically during training.
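You can verify this by listing the registered parameters of the AttentionLayer defined above; any optimizer you hand them to will update all of them:

layer = AttentionLayer(feature_size=3)
for name, param in layer.named_parameters():
    print(name, tuple(param.shape))
# k.weight (3, 3)
# k.bias (3,)
# q.weight (3, 3) ... and likewise for q.bias, v.weight, v.bias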
How are the Query and Key values calculated?
We use a neural network to learn the query and key projections. Training is end to end: we have no labels for the keys or queries themselves, but back-propagation still computes their gradients because we have labels for the final output of the model. The computational graph just becomes bigger.
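A minimal sketch of that idea, using a dummy loss on the final output only (the loss here is made up purely for illustration):

layer = AttentionLayer(feature_size=3)
x = torch.randn(4, 3)
output, _ = layer(x)
loss = output.sum()  # no labels for queries or keys, only a loss on the output
loss.backward()
# back-propagation still reaches the query and key projections
print(layer.q.weight.grad.shape)  # torch.Size([3, 3])
print(layer.k.weight.grad.shape)  # torch.Size([3, 3])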