This is complicated. Don't worry if you're confused.
Definition
Self-attention works the same way as described in Attention mechanism, except that the encoding happens within a single sequence. It helps the model understand and embed how parts of the input relate to other parts of the same input (tokens or chunks).
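For reference, this is the standard scaled dot-product attention formula that both code examples below implement, where the queries Q, keys K, and values V are all projections of the same input and d_k is the key dimension:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$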
Coding Example
from numpy import array
from numpy import random
from scipy.special import softmax
# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])
# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])
# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))
# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V
# scoring the query vectors against all key vectors
scores = Q @ K.transpose()
# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)
# computing the attention by a weighted sum of the value vectors
attention = weights @ V
print(attention)
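The result is a 4×3 array: one row per word, where each row is a weighted sum of the four value vectors.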
Let's rewrite this as a PyTorch model to make it cleaner.
from torch import nn
import torch
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    def __init__(self, feature_size):
        super().__init__()
        self.feature_size = feature_size
        # Q, K and V are all computed from the same source via learned linear projections
        self.k = nn.Linear(feature_size, feature_size)
        self.q = nn.Linear(feature_size, feature_size)
        self.v = nn.Linear(feature_size, feature_size)

    def forward(self, x, mask=None):
        # linear transformations of the input into keys, queries and values
        keys = self.k(x)
        queries = self.q(x)
        values = self.v(x)
        # scale the scores by sqrt(d_k), as in the NumPy example above
        scaling_factor = self.feature_size ** 0.5
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / scaling_factor
        # optional mask: blocked positions get -inf, so softmax gives them weight 0
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # apply softmax over the key dimension
        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, values)
        return output, attention_weights
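A quick sanity check of the module, reusing the same four toy embeddings from the NumPy example (the values are arbitrary; this just confirms the shapes):

layer = AttentionLayer(feature_size=3)
x = torch.tensor([[1., 0., 0.],
                  [0., 1., 0.],
                  [1., 1., 0.],
                  [0., 0., 1.]])
output, attention_weights = layer(x)
print(output.shape)             # torch.Size([4, 3]): one attended vector per word
print(attention_weights.shape)  # torch.Size([4, 4]): one weight per word pair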
matmul is just matrix multiplication: every entry of the output is the dot product of a row of the first matrix with a column of the second.
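A tiny illustration with made-up values, just to show that equivalence:

import torch
A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[5., 6.], [7., 8.]])
print(torch.matmul(A, B))        # tensor([[19., 22.], [43., 50.]])
print(torch.dot(A[0], B[:, 0]))  # tensor(19.), the top-left entry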
Since the key, query, and value projections are part of the model's parameters, they are updated automatically during training.
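You can verify this by listing the registered parameters of the AttentionLayer defined above; any optimizer you hand them to will update all of them:

layer = AttentionLayer(feature_size=3)
for name, param in layer.named_parameters():
    print(name, tuple(param.shape))
# k.weight (3, 3)
# k.bias (3,)
# q.weight (3, 3) ... and likewise for q.bias, v.weight, v.bias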
How are the Query and Key values calculated?
We use a neural network to learn the query and key projections. Training is end to end: we have no labels for the keys or queries themselves, but back-propagation still computes their gradients because we have labels for the final output of the model. The computational graph just becomes bigger.
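A minimal sketch of that idea, using a dummy loss on the final output only (the loss here is made up purely for illustration):

layer = AttentionLayer(feature_size=3)
x = torch.randn(4, 3)
output, _ = layer(x)
loss = output.sum()  # no labels for queries or keys, only a loss on the output
loss.backward()
# back-propagation still reaches the query and key projections
print(layer.q.weight.grad.shape)  # torch.Size([3, 3])
print(layer.k.weight.grad.shape)  # torch.Size([3, 3])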