Definition

The attention mechanism is an innovation that allows models to focus on specific parts of the input when making predictions. It assigns an importance score to each input token, highlighting critical information while suppressing less relevant details.
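
To make this concrete, here is a minimal sketch of scaled dot-product attention (the variant used in Transformers) in NumPy. The function names and toy shapes are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max per row for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    # Raw scores: how strongly each query matches each key.
    scores = Q @ K.T / np.sqrt(d_k)
    # Normalize into weights that sum to 1 per query: these are the
    # importance scores that highlight some tokens and suppress others.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of the value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens
K = rng.normal(size=(6, 8))   # 6 key tokens
V = rng.normal(size=(6, 8))   # one value vector per key
out, w = attention(Q, K, V)
print(out.shape, w.shape)     # (4, 8) (4, 6)
```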

The raw input is encoded into embeddings that can carry long-range context information. Once attention has enriched these embeddings with that context, they can be processed by layers without built-in recurrence or attention, which on their own are unable to deal with long-range dependencies.
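
A toy sketch of that pipeline, with made-up names and sizes: self-attention first mixes information across positions, after which a position-wise feed-forward layer, which cannot mix positions by itself, can still operate on context-enriched embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 6, 8, 16

x = rng.normal(size=(n_tokens, d_model))   # token embeddings

# Self-attention: every token attends to every other token, so each
# output row can contain information from any position in the input.
scores = x @ x.T / np.sqrt(d_model)
scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ x                          # context-enriched embeddings

# Position-wise feed-forward layer: applied to each token independently,
# it cannot mix positions on its own, but it no longer has to, because
# the attention step above already injected the long-range context.
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
out = np.maximum(context @ W1, 0.0) @ W2       # ReLU feed-forward

print(out.shape)  # (6, 8): same shape, but each row now reflects context
```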

Before Transformers, the problem was that long inputs could not be handled properly: there is too much information to keep track of, and important parts of the input can get "lost". LSTMs in particular suffered from this issue.

Different types of attention

When to use it

You use attention when parts of the input can only be understood by looking at other parts of the input, i.e. the context.

Example:

You have a long text:

First part: a news article about racially motivated killings.

Second part: user comments, one of them saying: "They finally cleaned that neighborhood up".

We have a model that detects hate speech in social media comments. It needs to link the seemingly harmless sentence "They finally cleaned that neighborhood up" to key parts of the news article it was posted beneath, as sketched below.
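
Here is a hedged sketch of how cross-attention could make that link: the comment tokens act as queries over the article tokens. The random embeddings below are stand-ins; in a real system they would come from a trained encoder:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
article_tokens = ["racially", "motivated", "killings", "in", "the", "neighborhood"]
comment_tokens = ["they", "finally", "cleaned", "that", "neighborhood", "up"]

# Stand-in embeddings; a trained encoder would produce meaningful ones.
article_emb = rng.normal(size=(len(article_tokens), d))
comment_emb = rng.normal(size=(len(comment_tokens), d))

# Cross-attention: comment tokens are the queries, article tokens the
# keys and values, so each comment token can pull in article context.
scores = comment_emb @ article_emb.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ article_emb   # comment enriched with article context

# A downstream classifier would now see each comment token together with
# the article passages it attends to, instead of the comment in isolation.
print(weights.shape)  # (6, 6): one weight per (comment token, article token) pair
```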

Pre-trained attention models