Transformer models are trained with a fixed maximum sequence length, but during inference they may need to process longer sequences. This poses a challenge because positional encodings are tied to token positions: the model may struggle with positions it has never encountered during training.
The ability to handle varying sequence lengths is crucial for a model. This post explores how different positional encoding methods address this challenge.
Let’s get started.

Interpolation in Positional Encodings and Using YaRN for Larger Context Window
Photo by enkuu smile_. Some rights reserved.
Overview
This post is divided into three parts; they are:
- Interpolation and Extrapolation in Sinusoidal Encodings and RoPE
- Interpolation in Learned Encodings
- YaRN for Larger Context Window
Interpolation and Extrapolation in Sinusoidal Encodings and RoPE
Sinusoidal encodings excel at extrapolation due to their use of continuous functions:
$$
\begin{aligned}
PE(p, 2i) &= \sin\left(\frac{p}{10000^{2i/d}}\right) \\
PE(p, 2i+1) &= \cos\left(\frac{p}{10000^{2i/d}}\right)
\end{aligned}
$$
You can simply substitute $p$ with a larger value to obtain the positional encoding for a longer sequence. This is extrapolation.
Alternatively, you can use interpolation. Instead of using $p$ as an integer from 0 to $L-1$ for sequence length $L$, you can make $p$ a floating point number in the same range to represent a length $L'>L$. That is:
$$p = \frac{L}{L'}p'$$
where $p'$ is the integer position from 0 to $L'-1$ (the actual position in the longer sequence).
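To make the two options concrete, here is a minimal PyTorch sketch. The helper `sinusoidal_encoding` is written here for illustration only; extrapolation feeds in larger integer positions, while interpolation rescales them into the trained range first.

```python
import torch

def sinusoidal_encoding(positions, d, base=10000.0):
    """Compute sinusoidal encodings for (possibly fractional) positions."""
    i = torch.arange(0, d, 2).float()                # dimension index 2i
    angles = positions.unsqueeze(-1) / base ** (i / d)
    pe = torch.zeros(len(positions), d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

L, L_prime, d = 512, 2048, 64
# Extrapolation: use the larger integer positions directly
pe_extra = sinusoidal_encoding(torch.arange(L_prime).float(), d)
# Interpolation: squeeze positions 0..L'-1 back into the trained range 0..L-1
p = torch.arange(L_prime).float() * (L / L_prime)
pe_interp = sinusoidal_encoding(p, d)
```

Both produce an encoding matrix of shape $(L', d)$; the interpolated version never uses angles outside the range seen during training.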
These techniques also apply to RoPE.
The function generating sinusoidal positional encodings or RoPE requires no modification and can handle sequences of any length. However, you may need to fine-tune the model to ensure it can process encodings it hasn't seen during training. For example, the Llama 2 model uses RoPE and was trained with a maximum sequence length of 4K tokens. Code Llama, a programming-focused model fine-tuned from Llama 2, was further fine-tuned on 16K-token sequences with a relatively small amount of additional training, and can handle contexts of up to 100K tokens at inference time.
Interpolation in Learned Encodings
Learned positional encodings retrieve position encoding vectors from a lookup table. This means the sequence length is fixed by the table size, making extrapolation impossible. However, interpolation can still handle sequences longer than the training length. For a sequence of length $L'>L$ where $L$ is the original sequence length, the encoding vector for position $p'=0, \dots, L'-1$ is:
$$P_{p'} = \frac{p-n}{m-n}P_m + \frac{m-p}{m-n}P_n$$
where $p = \frac{L}{L'}p'$ and $m,n$ are integers such that $m=n+1$ and $n\le p\le m$. In PyTorch:
```python
import torch
import torch.nn as nn

class ExtrapolatingLearnedEncoding(nn.Module):
    def __init__(self, max_trained_len, d):
        super().__init__()
        self.max_trained_len = max_trained_len
        self.position_embeddings = nn.Embedding(max_trained_len, d)

    def forward(self, x):
        seq_len = x.size(1)
        if seq_len <= self.max_trained_len:
            # Normal case: use learned embeddings directly
            positions = torch.arange(seq_len, device=x.device)
            return x + self.position_embeddings(positions)
        else:
            # Extrapolation case: use interpolation
            positions = torch.arange(seq_len, device=x.device)
            # Squeeze the longer range of positions into the trained range
            scale = (self.max_trained_len - 1) / (seq_len - 1)
            scaled_positions = positions * scale
            # Get floor and ceiling positions
            pos_floor = torch.floor(scaled_positions).long()
            pos_ceil = torch.ceil(scaled_positions).long()
            # Get weights for interpolation
            weights = (scaled_positions - pos_floor.float()).unsqueeze(-1)
            # Interpolate between the two neighboring learned vectors
            emb_floor = self.position_embeddings(pos_floor)
            emb_ceil = self.position_embeddings(pos_ceil)
            return x + (1 - weights) * emb_floor + weights * emb_ceil
```
This is a basic interpolation implementation. However, without fine-tuning on longer sequences, there is no guarantee the model can handle the interpolated encodings without performance degradation.
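Even so, a quick sketch (batch size, lengths, and dimensions chosen only for illustration) confirms that the module above runs on inputs longer than its trained length:

```python
# Illustrative check: a sequence of 800 tokens with a table trained for 512
enc = ExtrapolatingLearnedEncoding(max_trained_len=512, d=64)
x = torch.randn(2, 800, 64)   # (batch, seq_len, model dim)
print(enc(x).shape)           # torch.Size([2, 800, 64])
```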
YaRN for Larger Context Window
RoPE is currently the most widely used positional encoding in large language models. Recent research has focused on improving RoPE’s extrapolation capabilities.
YaRN is a method that extends RoPE to handle longer sequences, proving more effective than the interpolation method described above. Recall that RoPE sinusoids are computed as:
$$
\begin{aligned}
\theta_i &= \frac{1}{10000^{2i/d}} \\
\hat{x}_m^{(i)} &= x_m^{(i)} \cos(m\theta_i) + x_m^{(d/2+i)} \sin(m\theta_i) \\
\hat{x}_m^{(d/2+i)} &= x_m^{(d/2+i)} \cos(m\theta_i) - x_m^{(i)} \sin(m\theta_i)
\end{aligned}
$$
for position $m$ in the sequence, where vector $x_m$ has dimension $d$, and $x_m^{(i)}$ is the $i$-th element of vector $x_m$. YaRN modifies the formula to:
$$
\begin{aligned}
s &= \frac{L'}{L} \\
\theta_i &= \frac{1}{10000^{2i/d}} \\
r(i) &= \frac{L}{2\pi(10000^{2i/d})} \\
\gamma(r) &= \begin{cases}
0, & \text{if } r < \alpha \\
\dfrac{r - \alpha}{\beta - \alpha}, & \text{if } \alpha \le r \le \beta \\
1, & \text{if } r > \beta
\end{cases} \\
\theta_i' &= \Big[1-\gamma\big(r(i)\big)\Big]\frac{\theta_i}{s} + \gamma\big(r(i)\big)\theta_i \\
\sqrt{\frac{1}{t}} &= 0.1\ln(s) + 1 \\
\hat{x}_m^{(i)} &= \sqrt{\frac{1}{t}} \Big[x_m^{(i)} \cos(m\theta_i') + x_m^{(d/2+i)} \sin(m\theta_i')\Big] \\
\hat{x}_m^{(d/2+i)} &= \sqrt{\frac{1}{t}} \Big[x_m^{(d/2+i)} \cos(m\theta_i') - x_m^{(i)} \sin(m\theta_i')\Big]
\end{aligned}
$$
In code:
```python
import numpy as np
import torch
import torch.nn as nn

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(x, cos, sin):
    return (x * cos) + (rotate_half(x) * sin)

class YaRN(nn.Module):
    def __init__(self, dim, orig_seq_len=512, scale=4, alpha=1, beta=32):
        super().__init__()
        N = 10000
        pos_freq = N ** (torch.arange(0, dim, 2).float() / dim)
        inv_freq_extrapolation = 1. / pos_freq
        inv_freq_interpolation = 1. / (scale * pos_freq)

        # Boundaries of the ramp function, in units of dimension index
        low = dim * np.log(orig_seq_len / (2 * np.pi * beta)) / (2 * np.log(N))
        high = dim * np.log(orig_seq_len / (2 * np.pi * alpha)) / (2 * np.log(N))
        low = max(np.floor(low), 0)
        high = min(np.ceil(high), dim - 1)

        # Blend interpolated and extrapolated frequencies per dimension
        linear_func = (torch.arange(dim // 2).float() - low) / (high - low)
        ramp_func = torch.clamp(linear_func, 0, 1)
        inv_freq_factor = 1 - ramp_func
        inv_freq = inv_freq_interpolation * (1 - inv_freq_factor) + \
                   inv_freq_extrapolation * inv_freq_factor

        # Original RoPE multiplied with a scaling factor
        scaling_factor = 0.1 * np.log(scale) + 1.0
        position = torch.arange(orig_seq_len * scale).float()
        sinusoid_inp = torch.outer(position, inv_freq)
        # Duplicate along the last dim so cos/sin match the full head dimension
        sinusoid_inp = torch.cat((sinusoid_inp, sinusoid_inp), dim=-1)
        self.register_buffer("cos", sinusoid_inp.cos() * scaling_factor)
        self.register_buffer("sin", sinusoid_inp.sin() * scaling_factor)

    def forward(self, x, seq_len=None):
        # x is expected to have shape (batch, seq_len, num_heads, head_dim)
        if seq_len is None:
            seq_len = x.size(1)
        cos = self.cos[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin[:seq_len].view(1, seq_len, 1, -1)
        return apply_rotary_pos_emb(x, cos, sin)
```
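As a usage sketch, the module rotates query and key tensors before the attention product. The `(batch, seq_len, num_heads, head_dim)` layout is what this implementation's `forward()` assumes; the numbers below are illustrative only.

```python
# Illustrative: a model trained at 512 tokens, extended 4x to 2048
rope = YaRN(dim=64, orig_seq_len=512, scale=4)
q = torch.randn(2, 2048, 8, 64)   # (batch, seq_len, num_heads, head_dim)
k = torch.randn(2, 2048, 8, 64)
q_rot, k_rot = rope(q), rope(k)
print(q_rot.shape, k_rot.shape)   # both torch.Size([2, 2048, 8, 64])
```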
YaRN's key innovation is scaling the RoPE sinusoid frequencies unevenly when expanding the sequence length from $L$ to $L'$. This approach is called "NTK-by-parts" interpolation.
Consider the $\cos(m\theta_i)$ term in the RoPE formula, where $m$ ranges from 0 to $L'-1$ for the new sequence length $L'$. In `inv_freq_interpolation`, a factor of $1/s = L/L'$ is multiplied to $\theta_i$ to create the interpolation effect. Using the original $\cos(m\theta_i)$ with a larger $m$ constitutes extrapolation.
In NTK-by-parts, instead of $\cos(m\theta_i)$, you use $\cos(m\theta_i')$, where $\theta_i'$ blends interpolation and extrapolation. The weight between the two follows the ramp function $\gamma$ above, implemented as `inv_freq` in the code.
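To see the uneven scaling in numbers, the short sketch below (with illustrative values $d=64$, $L=512$, $s=4$, $\alpha=1$, $\beta=32$) computes $r(i)$, $\gamma$, and the resulting ratio $\theta_i'/\theta_i$ for a few dimensions:

```python
import numpy as np

d, L, s = 64, 512, 4          # head dim, original length, scale factor (illustrative)
alpha, beta = 1, 32

for i in [0, 8, 16, 24, 31]:  # a few of the d/2 = 32 frequency indices
    theta = 1.0 / 10000 ** (2 * i / d)
    r = L * theta / (2 * np.pi)                       # rotations over the original context
    gamma = np.clip((r - alpha) / (beta - alpha), 0, 1)
    theta_new = (1 - gamma) * theta / s + gamma * theta
    print(f"i={i:2d}  r={r:8.2f}  gamma={gamma:.2f}  theta'/theta={theta_new/theta:.2f}")
```

Dimensions that rotate many times over the original context keep their original frequency (pure extrapolation), while slowly rotating dimensions are scaled down by the full factor $1/s$.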
YaRN improves upon NTK-by-parts interpolation by adding the scaling factor $\sqrt{1/t}$ to the rotated embeddings; with $s=4$, for example, $\sqrt{1/t} = 0.1\ln(4) + 1 \approx 1.14$. This enhancement improves model performance at longer context lengths, measured by lower perplexity (i.e., higher accuracy in next-token prediction).
Further Readings
Below are some papers that are related to the topic:
- Jianlin Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding," 2021. arXiv:2104.09864
- Shouyuan Chen et al., "Extending Context Window of Large Language Models via Positional Interpolation," 2023. arXiv:2306.15595
- Bowen Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models," 2023. arXiv:2309.00071
- Baptiste Rozière et al., "Code Llama: Open Foundation Models for Code," 2023. arXiv:2308.12950
Summary
In this post, you learned how models trained with shorter context lengths can process longer input sequences. Specifically:
- Sinusoidal encodings and RoPE can be easily extrapolated
- Learned encodings only support interpolation
- YaRN provides an advanced method for scaling RoPE to longer sequence lengths
The goal of scaling positional encodings is to enable models to handle longer input sequences without retraining. This is not an exhaustive list, as more advanced methods continue to build upon these foundational ideas.