Interpolation in Positional Encodings and Using YaRN for Larger Context Window


Transformer models are trained with a fixed sequence length, but during inference, they may need to process sequences of different lengths. This poses a challenge because positional encodings are computed based on the sequence length. The model might struggle with positional encodings it hasn’t encountered during training.

The ability to handle varying sequence lengths is crucial for a model. This post explores how different positional encoding methods address this challenge.

Let’s get started.

Photo by enkuu smile_. Some rights reserved.

Overview

This post is divided into three parts; they are:

  • Interpolation and Extrapolation in Sinusoidal Encodings and RoPE
  • Interpolation in Learned Encodings
  • YaRN for Larger Context Window

Interpolation and Extrapolation in Sinusoidal Encodings and RoPE

Sinusoidal encodings excel at extrapolation due to their use of continuous functions:

$$
\begin{aligned}
PE(p, 2i) &= \sin\left(\frac{p}{10000^{2i/d}}\right) \\
PE(p, 2i+1) &= \cos\left(\frac{p}{10000^{2i/d}}\right)
\end{aligned}
$$

You can simply substitute $p$ with a larger value to obtain the positional encoding for a longer sequence. This is extrapolation.
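As a concrete illustration, the following is a minimal sketch of a sinusoidal encoding function that accepts arbitrary positions (the function name sinusoidal_pe and the example lengths are assumptions for illustration):

```python
import torch

def sinusoidal_pe(positions: torch.Tensor, d: int) -> torch.Tensor:
    """Sinusoidal positional encodings for the given positions.

    positions: 1-D tensor of positions (integer or fractional)
    d:         embedding dimension (must be even)
    returns:   tensor of shape (len(positions), d)
    """
    i = torch.arange(0, d, 2).float()              # even indices 2i = 0, 2, ..., d-2
    inv_freq = 1.0 / 10000 ** (i / d)              # 1 / 10000^(2i/d)
    angles = torch.outer(positions.float(), inv_freq)
    pe = torch.zeros(len(positions), d)
    pe[:, 0::2] = torch.sin(angles)                # PE(p, 2i)
    pe[:, 1::2] = torch.cos(angles)                # PE(p, 2i+1)
    return pe

# extrapolation: a model trained with L=512 can still be given encodings for
# positions 512..1023, because sin and cos are defined for any position
pe_long = sinusoidal_pe(torch.arange(1024), d=64)
```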

Alternatively, you can use interpolation. Instead of using $p$ as an integer from 0 to $L-1$ for sequence length $L$, you can make $p$ a floating point number in the same range to represent a length $L'>L$. That is:

$$p = \frac{L}{L'}p'$$

where $p'$ is the integer position from 0 to $L'-1$ (the actual position in the longer sequence).
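Reusing the sinusoidal_pe sketch above, interpolation only changes which positions are fed in: the integer positions of the longer sequence are scaled by $L/L'$ back into the original range:

```python
import torch

L, L_new, d = 512, 1024, 64
p_new = torch.arange(L_new).float()   # integer positions 0 .. L'-1
p = p_new * (L / L_new)               # fractional positions squeezed into [0, L)
pe_interp = sinusoidal_pe(p, d)       # sin/cos accept non-integer positions
```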

These techniques also apply to RoPE.

The function that generates sinusoidal positional encodings or RoPE requires no modification and can handle sequences of any length. However, you may need to fine-tune the model to ensure it can process positions it hasn’t seen during training. For example, the Llama 2 model uses RoPE and was trained with a maximum sequence length of 4K tokens. Code Llama, a programming-focused model fine-tuned from Llama 2, extended the context to 16K tokens with a relatively short long-context fine-tuning stage, and it can extrapolate to sequences of up to 100K tokens.

Interpolation in Learned Encodings

Learned positional encodings retrieve position encoding vectors from a lookup table. This means the sequence length is fixed by the table size, making extrapolation impossible. However, interpolation can still handle sequences longer than the training length. For a sequence of length $L'>L$ where $L$ is the original sequence length, the encoding vector for position $p'=0, \dots, L'-1$ is:

$$P_{p'} = \frac{p-n}{m-n}P_m + \frac{m-p}{m-n}P_n$$

where $p = \frac{L}{L'}p'$ and $m,n$ are integers such that $m=n+1$ and $n\le p\le m$. In PyTorch:
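(The following is a minimal sketch; the function name interpolate_positions and the example table size are assumptions for illustration.)

```python
import torch

def interpolate_positions(pos_table: torch.Tensor, new_len: int) -> torch.Tensor:
    """Linearly interpolate a learned positional encoding table.

    pos_table: tensor of shape (L, d), one learned vector per trained position
    new_len:   target sequence length L' > L
    returns:   tensor of shape (L', d)
    """
    L, d = pos_table.shape
    # map each new position p' in [0, L'-1] to a fractional position p = (L/L') p'
    p = torch.arange(new_len).float() * (L / new_len)
    n = p.floor().long().clamp(max=L - 1)     # lower neighbor n
    m = (n + 1).clamp(max=L - 1)              # upper neighbor m = n + 1
    w = (p - n.float()).unsqueeze(1)          # weight (p - n); note m - n = 1
    return (1 - w) * pos_table[n] + w * pos_table[m]

# usage: stretch a table trained for 512 positions to cover 1024 positions
table = torch.randn(512, 64)
longer = interpolate_positions(table, 1024)   # shape (1024, 64)
```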

This is a basic interpolation implementation. However, there is no guarantee that the model can handle the interpolated encodings without performance degradation unless it is fine-tuned.

YaRN for Larger Context Window

RoPE is currently the most widely used positional encoding in large language models. Recent research has focused on improving RoPE’s extrapolation capabilities.

YaRN is a method that extends RoPE to handle longer sequences, proving more effective than the interpolation method described above. Recall that RoPE sinusoids are computed as:

$$
\begin{aligned}
\theta_i &= \frac{1}{10000^{2i/d}} \\
\hat{x}_m^{(i)} &= x_m^{(i)} \cos(m\theta_i) + x_m^{(d/2+i)} \sin(m\theta_i) \\
\hat{x}_m^{(d/2+i)} &= x_m^{(d/2+i)} \cos(m\theta_i) - x_m^{(i)} \sin(m\theta_i)
\end{aligned}
$$

for position $m$ in the sequence, where vector $x_m$ has dimension $d$, and $x_m^{(i)}$ is the $i$-th element of vector $x_m$. YaRN modifies the formula to:

$$
\begin{aligned}
s &= \frac{L'}{L} \\
\theta_i &= \frac{1}{10000^{2i/d}} \\
r(i) &= \frac{L}{2\pi(10000^{2i/d})} \\
\gamma(r) &= \begin{cases}
0, & \text{if } r < \alpha \\
\dfrac{r - \alpha}{\beta - \alpha}, & \text{if } \alpha \le r \le \beta \\
1, & \text{if } r > \beta
\end{cases} \\
\theta_i' &= \Big[1-\gamma\big(r(i)\big)\Big]\frac{\theta_i}{s} + \gamma\big(r(i)\big)\theta_i \\
\sqrt{\frac{1}{t}} &= 0.1\ln(s) + 1 \\
\hat{x}_m^{(i)} &= \sqrt{\frac{1}{t}} \Big[x_m^{(i)} \cos(m\theta_i') + x_m^{(d/2+i)} \sin(m\theta_i')\Big] \\
\hat{x}_m^{(d/2+i)} &= \sqrt{\frac{1}{t}} \Big[x_m^{(d/2+i)} \cos(m\theta_i') - x_m^{(i)} \sin(m\theta_i')\Big]
\end{aligned}
$$

In code:
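(A minimal sketch follows; the helper names yarn_rope_frequencies and apply_yarn_rope, and the default values of α and β, are assumptions for illustration.)

```python
import math
import torch

def yarn_rope_frequencies(dim, orig_len, new_len, base=10000.0, alpha=1.0, beta=32.0):
    """Compute YaRN-adjusted RoPE frequencies and the scaling factor.

    dim:         head dimension d (must be even)
    orig_len:    training context length L
    new_len:     target context length L'
    alpha, beta: thresholds of the NTK-by-parts ramp function
    """
    s = new_len / orig_len                               # s = L' / L
    # original RoPE frequencies: theta_i = 1 / base^(2i/d)
    theta = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    # interpolated frequencies: theta_i multiplied by 1/s
    inv_freq_interpolation = theta / s
    # r(i): number of full rotations the i-th sinusoid makes over the original context
    r = orig_len * theta / (2 * math.pi)
    # ramp gamma(r): 0 means pure interpolation, 1 means pure extrapolation
    gamma = ((r - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    # blend the two per dimension (NTK-by-parts): this is theta_i'
    inv_freq = (1 - gamma) * inv_freq_interpolation + gamma * theta
    # scaling factor: sqrt(1/t) = 0.1 ln(s) + 1
    mscale = 0.1 * math.log(s) + 1.0
    return inv_freq, mscale

def apply_yarn_rope(x, inv_freq, mscale):
    """Rotate x of shape (seq_len, dim) using the split-half RoPE layout."""
    seq_len, dim = x.shape
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)                  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, : dim // 2], x[:, dim // 2 :]
    out = torch.cat([x1 * cos + x2 * sin, x2 * cos - x1 * sin], dim=-1)
    return mscale * out
```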

YaRN’s key innovation is scaling RoPE sinusoid frequencies unevenly when expanding sequence length from $L$ to $L'$. This approach is called “NTK-by-parts” interpolation.

Consider the $\cos(m\theta_i)$ term in the RoPE formula, where $m$ ranges from 0 to $L'-1$ for the new sequence length $L'$. In inv_freq_interpolation, $\theta_i$ is multiplied by a factor of $1/s = L/L'$ to create the interpolation effect. Using the original $\cos(m\theta_i)$ with a larger $m$ constitutes extrapolation.

In NTK-by-parts interpolation, instead of $\cos(m\theta_i)$, you use $\cos(m\theta_i')$, where $\theta_i'$ blends interpolation and extrapolation. The weight between interpolation and extrapolation follows the formula above and is implemented as inv_freq in the code.

YaRN improves upon NTK-by-parts interpolation by adding the scaling factor $\sqrt{1/t}$. This enhancement improves model performance under longer context lengths, as measured by lower perplexity (i.e., higher accuracy in next-token prediction).
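Continuing the sketch above (the same assumed yarn_rope_frequencies helper), you can inspect how the blend and the scaling factor behave when stretching a 4K context to 32K:

```python
inv_freq, mscale = yarn_rope_frequencies(dim=128, orig_len=4096, new_len=32768)
# low-index (high-frequency) dimensions keep their original theta_i (extrapolation),
# while high-index (low-frequency) dimensions are divided by s = 8 (interpolation)
print(inv_freq[:4], inv_freq[-4:])
print(mscale)   # 0.1 * ln(8) + 1, approximately 1.208
```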

Further Readings

Below are some papers that are related to the topic:
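  • Jianlin Su et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding,” 2021. arXiv:2104.09864
  • Shouyuan Chen et al. “Extending Context Window of Large Language Models via Positional Interpolation,” 2023. arXiv:2306.15595
  • Bowen Peng et al. “YaRN: Efficient Context Window Extension of Large Language Models,” 2023. arXiv:2309.00071
  • Baptiste Rozière et al. “Code Llama: Open Foundation Models for Code,” 2023. arXiv:2308.12950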

Summary

In this post, you learned how models trained with shorter context lengths can process longer input sequences. Specifically:

  • Sinusoidal encodings and RoPE can be easily extrapolated
  • Learned encodings only support interpolation
  • YaRN provides an advanced method for scaling RoPE to longer sequence lengths

The goal of scaling positional encodings is to enable models to handle longer input sequences without retraining from scratch. This is not an exhaustive list of techniques, as more advanced methods continue to build upon these foundational ideas.
