Natural language processing (NLP) has evolved significantly with transformer-based models. A key innovation in these models is positional encodings, which help capture the sequential nature of language. In this post, you will learn about:
- Why positional encodings are necessary in transformer models
- Different types of positional encodings and their characteristics
- How to implement various positional encoding schemes
- How positional encodings are used in modern language models
Let’s get started!
Overview
This post is divided into five parts; they are:
- Understanding Positional Encodings
- Sinusoidal Positional Encodings
- Learned Positional Encodings
- Rotary Positional Encodings (RoPE)
- Relative Positional Encodings
Understanding Positional Encodings
Consider these two sentences: “The fox jumps over the dog” and “The dog jumps over the fox”. They contain the same words but in different orders. In recurrent neural networks, the model processes words sequentially, naturally capturing this difference. However, transformer models process all words in parallel, making them unable to distinguish between these sentences without additional information.
Positional encodings solve this problem by providing information about each token’s position in the sequence. Each token is converted into a vector through the model’s embedding layer, with the vector size called the “hidden dimension”. Positional encoding adds position information by creating a vector of the same hidden dimension.
The positional encodings are combined with the token embeddings, either by adding a position vector to each embedding or, in some schemes, by transforming the queries and keys inside the attention module. During the dot-product attention operation, this position information lets the model weigh relationships between tokens according to where they appear, which allows it to distinguish between sentences with the same words in different orders.
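A minimal sketch of the additive case (the tensors here are random placeholders with illustrative sizes):

```python
import torch

seq_len, hidden_dim = 6, 16
token_embeddings = torch.randn(seq_len, hidden_dim)       # from the model's embedding layer
positional_encodings = torch.randn(seq_len, hidden_dim)   # one vector per position, same hidden dim
model_input = token_embeddings + positional_encodings     # fed into the transformer layers
```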
The most common types of positional encodings are:
- Sinusoidal Positional Encodings (used in the original Transformer): Uses constant vectors built with sine and cosine functions
- Learned Positional Encodings (used in BERT and GPT): Vectors are learned during training
- Rotary Positional Encodings (RoPE, used in Llama models): Uses constant vectors built with rotational matrices
- Relative Positional Encodings (used in T5 and MPT): Based on distances between tokens rather than absolute positions
- Attention with Linear Bias (ALiBi, used in Falcon models): A bias term added to attention scores based on token distances
Each type has unique advantages and limitations, which we’ll explore in detail.
Sinusoidal Positional Encodings
The original Transformer paper introduced sinusoidal positional encodings. Deterministic functions are used to generate unique patterns for each position, as shown in the following formulas:
$$
\begin{aligned}
PE(p, 2i) &= \sin\left(\frac{p}{10000^{2i/d}}\right) \\
PE(p, 2i+1) &= \cos\left(\frac{p}{10000^{2i/d}}\right)
\end{aligned}
$$
Where $d$ is the hidden dimension (must be even), and $i$ ranges from 0 to $d/2-1$. The positional encoding $PE(p, k)$ represents the $k$-th element in the vector for position $p$. The constant 10000 was suggested by the original Transformer paper. It should be larger than the maximum sequence length.
Here’s the PyTorch implementation:
```python
import torch
import numpy as np

def create_sinusoidal_encodings(seq_len, dim):
    N = 10000
    i = torch.arange(0, dim // 2)
    div_term = torch.exp(-np.log(N) * (2 * i / dim))
    position = torch.arange(seq_len).unsqueeze(1)

    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Example usage
seq_len = 512
dim = 768
positional_encodings = create_sinusoidal_encodings(seq_len, dim)
sequence = torch.randn(seq_len, dim)   # placeholder for the token embeddings
sequence = sequence + positional_encodings
```
In this implementation, `div_term` computes $1/N^{2i/d}$ for $i=0$ to $d/2-1$, and the `position` matrix has shape `(seq_len, 1)`. Their product inside the sine and cosine functions is a matrix of shape `(seq_len, dim//2)`, and the results are interleaved into the output matrix `pe` of shape `(seq_len, dim)`.
Sinusoidal encodings have two key advantages: they’re deterministic and can extrapolate to longer sequences than seen during training. The relative position between tokens can be easily computed from the dot product of their positional encoding vectors, thanks to the properties of sinusoidal functions.
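As a quick numerical check of this property, reusing `create_sinusoidal_encodings()` from above (the positions and offsets here are arbitrary):

```python
pe = create_sinusoidal_encodings(seq_len=128, dim=64)
# sin(mw)sin(nw) + cos(mw)cos(nw) = cos((m - n)w), so the dot product
# depends only on the offset m - n, not on the absolute positions
a = torch.dot(pe[10], pe[14])            # offset 4
b = torch.dot(pe[90], pe[94])            # offset 4
print(torch.allclose(a, b, atol=1e-4))   # True
```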
However, these encodings don’t adapt to data characteristics and may be less effective for very long sequences.
Learned Positional Encodings
Models like GPT-2 use learned positional encodings. Here’s the PyTorch implementation:
```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, dim):
        super().__init__()
        self.position_embeddings = nn.Embedding(max_seq_len, dim)

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device).expand(x.size(0), -1)
        position_embeddings = self.position_embeddings(positions)
        return x + position_embeddings

# Example usage
model = LearnedPositionalEncoding(max_seq_len=512, dim=768)
```
The `nn.Embedding` layer acts as a lookup table mapping integer indices to vectors of dimension `dim`. In the `forward()` function, the `positions` tensor has shape `(batch_size, seq_len)`; looking it up in the embedding layer produces a tensor of shape `(batch_size, seq_len, dim)` that matches the input `x`. The positional encoding is added to `x` before the attention operation.
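A short usage sketch, with an arbitrary batch size and sequence length:

```python
model = LearnedPositionalEncoding(max_seq_len=512, dim=768)
x = torch.randn(2, 128, 768)   # a batch of 2 sequences of 128 token embeddings
out = model(x)                 # looks up positions 0..127 and adds them to x
print(out.shape)               # torch.Size([2, 128, 768])
```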
Learned positional encodings adapt to data characteristics through training, potentially offering better performance when trained properly. However, they can’t extrapolate to longer sequences and may overfit. They also increase model size since they’re part of the model parameters.
Rotary Positional Encodings (RoPE)
Most modern large language models use rotary positional encodings (RoPE). RoPE encodes relative positions through rotation matrices, with rotation angles that follow a geometric progression across the embedding dimensions. The formulas are:
$$
\begin{aligned}
\hat{x}_m^{(i)} &= x_m^{(i)} \cos(m\theta_i) - x_m^{(d/2+i)} \sin(m\theta_i) \\
\hat{x}_m^{(d/2+i)} &= x_m^{(d/2+i)} \cos(m\theta_i) + x_m^{(i)} \sin(m\theta_i)
\end{aligned}
$$
where $\theta_i = 10000^{-2i/d}$, with $d$ as the embedding dimension, $m$ the position index, and $i$ ranging from 0 to $d/2-1$. In matrix form, it is:
$$
\mathbf{\hat{x}}_m = \mathbf{R}_m\mathbf{x}_m = \begin{bmatrix}
\cos(m\theta_i) & -\sin(m\theta_i) \\
\sin(m\theta_i) & \cos(m\theta_i)
\end{bmatrix} \mathbf{x}_m
$$
where $\mathbf{x}_m$ denotes the pair of elements $(x_m^{(i)}, x_m^{(d/2+i)})$ of the vector at position $m$, and $\mathbf{\hat{x}}_m$ is the rotated pair.
Here’s the PyTorch implementation:
```python
import torch
import torch.nn as nn

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(x, cos, sin):
    return (x * cos) + (rotate_half(x) * sin)

class RotaryPositionalEncoding(nn.Module):
    def __init__(self, dim, max_seq_len=512):
        super().__init__()
        N = 10000
        inv_freq = 1. / (N ** (torch.arange(0, dim, 2).float() / dim))
        position = torch.arange(max_seq_len).float()
        sinusoid_inp = torch.outer(position, inv_freq)
        # Duplicate along the last dimension so the cached cos/sin match the
        # full dimension `dim` after rotate_half() splits the vector in two
        emb = torch.cat((sinusoid_inp, sinusoid_inp), dim=-1)
        self.register_buffer("cos", emb.cos())
        self.register_buffer("sin", emb.sin())

    def forward(self, x, seq_len=None):
        if seq_len is None:
            seq_len = x.size(1)
        cos = self.cos[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin[:seq_len].view(1, seq_len, 1, -1)
        return apply_rotary_pos_emb(x, cos, sin)
```
The `register_buffer()` calls cache the sine and cosine values for efficiency. The `inv_freq` variable computes $\theta_i$ for all $i$, `position` represents $m$ (indices from 0 to `max_seq_len - 1`), and `sinusoid_inp` holds $m\theta_i$ in a matrix of shape `(max_seq_len, dim//2)`, which is duplicated along the last dimension so that `cos` and `sin` cover the full vector dimension. The `rotate_half()` function converts a vector $(x_1, \dots, x_{d/2}, x_{d/2+1}, \dots, x_d)$ to $(-x_{d/2+1}, \dots, -x_d, x_1, \dots, x_{d/2})$. Then, `apply_rotary_pos_emb()` applies the rotation matrix to the input.
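As a quick usage sketch, assuming queries and keys are shaped `(batch, seq_len, num_heads, head_dim)` to match the broadcasting in `forward()`:

```python
rope = RotaryPositionalEncoding(dim=64, max_seq_len=512)
q = torch.randn(2, 128, 8, 64)   # (batch, seq_len, num_heads, head_dim)
q_rotated = rope(q)
print(q_rotated.shape)           # torch.Size([2, 128, 8, 64])
```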
RoPE offers several advantages:
- The rotation matrix $\mathbf{R}_m$ geometrically rotates the 2D input vector by an angle $m\theta_i$
- The transpose $\mathbf{R}_m^\top = \mathbf{R}_m^{-1}$ represents the reverse rotation. Hence the relative positions can be easily computed as $\mathbf{R}_{m-n} = \mathbf{R}_m\mathbf{R}_n^\top$ (see the numerical check after this list)
- It can extrapolate to longer sequences due to the geometric progression of angles
- Since $\cos^2t+\sin^2t=1$, RoPE preserves vector norms of $\mathbf{x}_m$, aiding training stability
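The relative-position property in the second point can be checked numerically with the class above. In this sketch, the same query and key vectors are placed at every position, so the dot products of the rotated vectors depend only on the offset between positions:

```python
rope = RotaryPositionalEncoding(dim=64, max_seq_len=512)
q_vec, k_vec = torch.randn(64), torch.randn(64)
# Place the same vectors at every position, then apply the rotations
q = q_vec.view(1, 1, 1, 64).expand(1, 512, 1, 64)
k = k_vec.view(1, 1, 1, 64).expand(1, 512, 1, 64)
q_rot, k_rot = rope(q), rope(k)
# The attention score between positions m and n depends only on m - n
a = (q_rot[0, 10, 0] * k_rot[0, 13, 0]).sum()     # offset 3
b = (q_rot[0, 200, 0] * k_rot[0, 203, 0]).sum()   # offset 3
print(torch.allclose(a, b, atol=1e-4))            # True
```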
Relative Positional Encodings
While the previous schemes encode absolute token positions, what usually matters is the relative position between tokens. Here's a simplified implementation of relative positional encodings:
```python
import torch
import torch.nn as nn

class RelativePositionalEncoding(nn.Module):
    def __init__(self, max_relative_position, d_model):
        super().__init__()
        self.max_relative_position = max_relative_position
        self.relative_attention_bias = nn.Parameter(
            torch.randn(2 * max_relative_position + 1, d_model)
        )

    def forward(self, length):
        context_position = torch.arange(length, dtype=torch.long)[:, None]
        memory_position = torch.arange(length, dtype=torch.long)[None, :]
        relative_position = memory_position - context_position
        relative_position_bucket = relative_position + self.max_relative_position
        return self.relative_attention_bias[relative_position_bucket]
```
The `relative_position` matrix has shape `(length, length)`, with each element representing the relative position between tokens $i$ and $j$. It is computed by subtracting the $N\times 1$ matrix `context_position` from the $1\times N$ matrix `memory_position`.
The `relative_position_bucket` shifts these values to be non-negative, and the position encoding vectors are then looked up from the `relative_attention_bias` tensor.
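A short usage sketch with arbitrary sizes; the returned tensor would then be incorporated into the attention computation:

```python
model = RelativePositionalEncoding(max_relative_position=128, d_model=64)
bias = model(10)      # relative encodings for a sequence of length 10
print(bias.shape)     # torch.Size([10, 10, 64])
```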
Relative positional encodings naturally handle variable-length sequences and work well for tasks like translation, making them a natural choice for models like T5.
Attention with Linear Bias (ALiBi) is a related approach that adds a bias matrix to the attention scores instead of manipulating the input sequence. In the code above, `relative_position_bucket` is used to look up a sequence of vectors as the positional encoding, which is added to the input sequence in the attention module. In ALiBi, the input sequence is used directly to compute the attention scores; afterwards, a matrix of token distances (analogous to `relative_position_bucket`) is scaled and added to the attention score matrix before the softmax operation. The scaling factor in ALiBi is computed as $m_h=1/2^{8h/H}$, where $h$ is the head index and $H$ is the total number of attention heads.
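Below is a minimal sketch of building such a bias, following the slope formula above; it uses absolute token distances and illustrative sizes rather than any particular model's implementation:

```python
import torch

def alibi_bias(seq_len, num_heads):
    # Distance |i - j| between query position i and key position j
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs().float()   # (seq_len, seq_len)
    # Per-head slopes m_h = 1 / 2^(8h/H) for h = 1..H
    heads = torch.arange(1, num_heads + 1).float()
    slopes = 1.0 / (2 ** (8 * heads / num_heads))             # (num_heads,)
    # Negative bias: distant tokens are penalized more strongly
    return -slopes.view(-1, 1, 1) * distance                  # (num_heads, seq_len, seq_len)

# The bias is added to the raw per-head attention scores before the softmax
scores = torch.randn(4, 16, 16)                  # (num_heads, seq_len, seq_len), illustrative
scores = scores + alibi_bias(seq_len=16, num_heads=4)
weights = torch.softmax(scores, dim=-1)
```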
Summary
In this article, you learned about positional encodings and their importance in transformer models. In particular, you learned:
- Positional encodings are necessary because transformers process tokens in parallel
- Different types of positional encodings have different advantages and limitations
- Sinusoidal encodings are deterministic and can extrapolate to longer sequences
- Learned encodings are simple but can’t extrapolate
- RoPE encodes relative positions through rotations and can extrapolate to longer sequences
- Relative positional encodings focus on inter-token distances