A Gentle Introduction to Attention Masking in Transformer Models


Attention mechanisms in transformer models need to handle various constraints that prevent the model from attending to certain positions. This post explores how attention masking implements these constraints and how it is used in modern language models.

Let’s get started.

Photo by Caleb Jack. Some rights reserved.

Overview

This post is divided into four parts; they are:

  • Why Attention Masking is Needed
  • Implementation of Attention Masks
  • Mask Creation
  • Using PyTorch’s Built-in Attention with Masks

Why Attention Masking is Needed

In the previous post, you learned how attention mechanisms allow models to focus on relevant parts of sequences. However, there are several scenarios where you want to prevent the model from attending to certain positions:

  1. Causal Masking: In language modeling and text generation, the model should only attend to previous tokens, not future ones. Causal masks prevent information leakage from the future during training.
  2. Padding Masking: When processing batches of sequences with different lengths, shorter sequences are padded with special tokens. The model should ignore these padding tokens. This is the most common use of masking during inference.
  3. Custom Masking: In some applications, we might want to prevent attention to specific tokens or positions based on domain-specific rules.

Consider a language model being trained on the sentence “The cat sat on the mat”. When the model learns to predict the word “sat”, it should only consider “The cat”, not “on the mat”; otherwise it would be cheating by looking at the future.

For causal masking, if you train a model with “The cat sat on the mat” as input, you would use the following mask:

$$
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
$$

This mask is a lower triangular matrix of all 1’s. A 1 at element $(i,j)$ means that query token $i$ may attend to key token $j$. The lower triangular structure ensures that each token attends only to itself and the tokens before it, even though the full sequence is fed to the model at once during training.
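As a quick check, a 0/1 mask with exactly this pattern can be built with torch.tril(); the sequence length of 6 matches the example sentence (this snippet is illustrative, not code from the original module):

```python
import torch

seq_len = 6  # "The cat sat on the mat" has six tokens
# 1 means the query token may attend to the key token, 0 means it may not
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.int))
print(causal_mask)  # lower triangular matrix of 1's, matching the matrix above
```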

Some models, such as BERT, are “bidirectional” and are trained to predict randomly masked tokens rather than the next token, so their attention masks do not follow the causal, lower triangular pattern.

During inference, you might pass a batch of sequences to the model. Suppose the batch contains three sequences of unequal length: after tokenization, the shorter sequences are padded with a special padding token so that every row of the batch has the same length.
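For illustration, suppose the padded batch of token IDs looks like the tensor below (the token IDs are made up, 0 is assumed to be the padding token, and the lengths are chosen to match the mask that follows):

```python
import torch

PAD = 0  # assumed ID of the padding token

# Three sequences with 6, 3, and 4 real tokens, each padded to length 6
padded_ids = torch.tensor([
    [ 5,  6,  7,  8,  9, 10],        # 6 real tokens, no padding
    [11, 12, 13, PAD, PAD, PAD],     # 3 real tokens + 3 padding tokens
    [14, 15, 16, 17, PAD, PAD],      # 4 real tokens + 2 padding tokens
])
```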

To ensure the model ignores padding tokens, you create a mask like this:

$$
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0
\end{bmatrix}
$$

Here, positions corresponding to padding tokens are set to 0, while all other positions are set to 1.
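This 0/1 mask can be computed directly by comparing the batch against the padding token ID; a minimal sketch using the padded_ids tensor above:

```python
# 1 for real tokens, 0 for padding tokens; shape (batch_size, seq_len)
padding_mask = (padded_ids != PAD).int()
print(padding_mask)  # reproduces the 3x6 matrix shown above
```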

Implementation of Attention Masks

Building on the attention module from the previous post, you can modify it to support masking:
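Since the module from the previous post is not reproduced here, below is a minimal sketch of what such a masked multi-head attention module might look like. The class name MultiHeadAttention, the argument names, and the default dropout rate are assumptions; the structure follows the usual recipe of project, split into heads, score, mask, softmax, dropout, and recombine:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head attention with optional masking and dropout (minimal sketch)."""

    def __init__(self, dim, num_heads, dropout=0.1):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # projection matrices for query, key, value, and output
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        # project and split into heads: (batch, heads, seq_len, head_dim)
        q = self.q_proj(query).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(key).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(value).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # scaled dot-product attention scores: (batch, heads, query_len, key_len)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        if mask is not None:
            if mask.dim() == 3:           # (batch, query_len, key_len)
                mask = mask.unsqueeze(1)  # broadcast across heads
            if mask.is_floating_point():
                # additive mask of 0 and -inf values: add it to the scores
                scores = scores + mask
            else:
                # 1/0 (or True/False) mask: fill blocked positions with -inf
                scores = scores.masked_fill(mask == 0, float("-inf"))

        weights = self.dropout(torch.softmax(scores, dim=-1))

        # weighted sum of values, then merge the heads back together
        out = torch.matmul(weights, v).transpose(1, 2).contiguous()
        out = out.view(batch_size, -1, self.num_heads * self.head_dim)
        return self.out_proj(out)
```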

This is the standard implementation of multi-head attention with masking and dropout. The mask is applied to attention scores before softmax. In mathematical terms, the mask is a matrix $M$ such that:

$$
\text{Attention}(Q, K, V, M) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}} + M\right)V
$$

The mask must be applied before the softmax because softmax normalizes across an entire row, and you don’t want masked elements to receive any weight. Recall that softmax computes:

$$
\text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^n \exp(x_j)}
$$

Because $\exp(-\infty) = 0$, adding $-\infty$ to a position makes it contribute nothing to the softmax output. This is what masked_fill() accomplishes in the code above.

Given this implementation, you can also use a mask directly if it is a matrix of $-\infty$ and 0 values:
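For example, with the sketch above, a causal mask written as 0’s and $-\infty$’s can be passed in as-is; it is simply added to the attention scores (the dimensions below are arbitrary):

```python
seq_len = 6
# 0 where attention is allowed, -inf strictly above the diagonal
additive_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

attention = MultiHeadAttention(dim=64, num_heads=8)
x = torch.randn(2, seq_len, 64)                  # (batch_size, seq_len, dim)
output = attention(x, x, x, mask=additive_mask)  # 2D mask broadcasts over batch and heads
```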

The next section will show you how to create masks for different use cases.

Mask Creation

Since masks are essential and widely used, it’s valuable to create dedicated functions for mask generation:
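The helpers below are one possible way to write these functions, sketched to match the descriptions that follow; the function names and the choice of additive 0/$-\infty$ masks are assumptions rather than the original post’s code:

```python
import torch

def create_causal_mask(seq_len):
    """Additive causal mask: 0 on and below the diagonal, -inf strictly above it."""
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

def create_padding_mask(batch, pad_token_id=0):
    """Additive padding mask of shape (batch_size, seq_len, seq_len)."""
    # "padded" has the same shape as batch: -inf at padding positions, 0 elsewhere
    padded = torch.zeros(batch.shape)
    padded[batch == pad_token_id] = float("-inf")
    batch_size, seq_len = batch.shape
    mask = torch.zeros(batch_size, seq_len, seq_len)
    mask = mask + padded.unsqueeze(1)  # add once for the columns (key positions)
    mask = mask + padded.unsqueeze(2)  # add once for the rows (query positions)
    return mask

print(create_causal_mask(6))
print(create_padding_mask(padded_ids, pad_token_id=PAD).shape)  # batch from earlier
```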

These are the two most common mask types; you can extend them for other use cases. In create_causal_mask(), you build a square matrix that is $-\infty$ strictly above the diagonal and 0 elsewhere. Positions holding 0 allow attention.

In create_padding_mask(), you first identify padding tokens in the batch and record them in the padded tensor, which has the same shape as batch ($-\infty$ at padding positions, 0 elsewhere). The output mask has shape (batch_size, seq_len, seq_len); it starts as all 0’s and the padded tensor is added to it twice, once broadcast along the rows and once along the columns, so any query-key pair that touches a padding token ends up at $-\infty$.

Running this code prints a $6 \times 6$ causal mask that is 0 on and below the diagonal and $-\infty$ strictly above it, and shows that the padding mask for the three-sequence batch has shape (3, 6, 6), with $-\infty$ in every row and column that touches a padding token.

These masks can be used directly as the mask argument in the forward() method of the MultiHeadAttention class above.
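For instance, continuing with the hypothetical module and helpers sketched above:

```python
x = torch.randn(3, 6, 64)  # (batch_size, seq_len, dim), matching the padded batch
attention = MultiHeadAttention(dim=64, num_heads=8)

causal_out = attention(x, x, x, mask=create_causal_mask(6))
padding_out = attention(x, x, x, mask=create_padding_mask(padded_ids, pad_token_id=PAD))
```

One caveat: query rows that are themselves padding tokens end up fully masked, so their attention weights (and outputs) come out as NaN; those positions are normally ignored downstream anyway.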

Using PyTorch’s Built-in Attention with Masks

The matrix multiplication and softmax operations in the forward() method above can be replaced with PyTorch’s built-in scaled dot-product attention (SDPA) function, torch.nn.functional.scaled_dot_product_attention():
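A sketch of what that replacement might look like inside forward(), assuming q, k, and v have already been projected and split into heads as before:

```python
import torch.nn.functional as F

# q, k, v: (batch, heads, seq_len, head_dim); mask: boolean or additive float, broadcastable
attn_output = F.scaled_dot_product_attention(
    q, k, v,
    attn_mask=mask,                                      # None, boolean (True = attend), or 0/-inf float mask
    dropout_p=self.dropout.p if self.training else 0.0,  # only apply dropout while training
)
# attn_output: (batch, heads, seq_len, head_dim); merge heads and project as before
```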

All other parts of the code remain the same, including the projection matrices and mask creation functions.

Alternatively, you can use PyTorch’s built-in MultiheadAttention class. Using it with masks is straightforward:
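A minimal example of self-attention with a padding mask, reusing the padded batch from earlier (the dimensions and the random input are just for illustration):

```python
import torch
import torch.nn as nn

dim, num_heads = 64, 8
mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)

x = torch.randn(3, 6, dim)             # (batch_size, seq_len, dim)
# True marks padding positions that attention should ignore
key_padding_mask = padded_ids == PAD

output, weights = mha(x, x, x, key_padding_mask=key_padding_mask)
print(output.shape)                    # torch.Size([3, 6, 64])
```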

You only need to specify the dimension size and number of heads when creating the attention layer. The class handles all projection matrices and dropout internally. Note that you should set batch_first=True to use input tensors with shape (batch_size, seq_len, dim).

The code above demonstrates using MultiheadAttention for self-attention, where the same tensor x serves as query, key, and value. If your input tensor contains padding tokens, you can use key_padding_mask to indicate where attention should be masked.

For more precise control over attention masking, you can use the attn_mask argument:
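A sketch of this approach, reusing the objects defined above; create_mask() is a hypothetical helper that builds a boolean (query, key) mask from the padded token IDs, where True marks positions that must not be attended to:

```python
def create_mask(batch, pad_token_id=0):
    """Boolean mask of shape (batch_size, query_len, key_len); True blocks attention."""
    is_pad = batch == pad_token_id                      # (batch_size, seq_len)
    seq_len = batch.shape[1]
    # every query position is blocked from attending to padding key positions
    return is_pad.unsqueeze(1).expand(-1, seq_len, -1)

mask = create_mask(padded_ids, pad_token_id=PAD)        # (batch_size, query_len, key_len)
# duplicate once per head to get (batch_size * num_heads, query_len, key_len)
mask = mask.repeat_interleave(num_heads, dim=0)

output, weights = mha(x, x, x, attn_mask=mask)
```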

Using attn_mask requires more setup because it expects a 3D mask of shape (batch_size * num_heads, query_len, key_len). The create_mask() function builds a 3D mask of shape (batch_size, query_len, key_len) marking the padding token positions in the query-key matrix. You then duplicate the mask once per attention head along the batch dimension (for example with repeat_interleave()) to obtain the shape expected by the built-in MultiheadAttention class.

Summary

In this post, you learned about attention masking in transformer models. Specifically, you learned about:

  • Why attention masking is necessary for preventing information leakage and handling variable-length sequences
  • Different types of masks and their applications
  • How to implement attention masking in both custom and PyTorch’s built-in attention mechanisms
