Interpolation in Positional Encodings and Using YaRN for Larger Context Window


Transformer models are trained with a fixed sequence length, but during inference, they may need to process sequences of different lengths. This poses a challenge because positional encodings are computed based on the sequence length. The model might struggle with positional encodings it hasn’t encountered during training.

The ability to handle varying sequence lengths is crucial for a model. This post explores how different positional encoding methods address this challenge.

Let’s get started.

Photo by enkuu smile_. Some rights reserved.

Overview

This post is divided into three parts; they are:

  • Interpolation and Extrapolation in Sinusoidal Encodings and RoPE
  • Interpolation in Learned Encodings
  • YaRN for Larger Context Window

Interpolation and Extrapolation in Sinusoidal Encodings and RoPE

Sinusoidal encodings excel at extrapolation due to their use of continuous functions:

$$
\begin{aligned}
PE(p, 2i) &= \sin\left(\frac{p}{10000^{2i/d}}\right) \\
PE(p, 2i+1) &= \cos\left(\frac{p}{10000^{2i/d}}\right)
\end{aligned}
$$

You can simply substitute $p$ with a larger value to obtain the positional encoding for a longer sequence. This is extrapolation.
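As a concrete illustration, the following is a minimal sketch of a sinusoidal encoding function that accepts arbitrary positions (the function name sinusoidal_pe and the example lengths are assumptions for illustration):

```python
import torch

def sinusoidal_pe(positions: torch.Tensor, d: int) -> torch.Tensor:
    """Sinusoidal positional encodings for the given positions.

    positions: 1-D tensor of positions (integer or fractional)
    d:         embedding dimension (must be even)
    returns:   tensor of shape (len(positions), d)
    """
    i = torch.arange(0, d, 2).float()              # even indices 2i = 0, 2, ..., d-2
    inv_freq = 1.0 / 10000 ** (i / d)              # 1 / 10000^(2i/d)
    angles = torch.outer(positions.float(), inv_freq)
    pe = torch.zeros(len(positions), d)
    pe[:, 0::2] = torch.sin(angles)                # PE(p, 2i)
    pe[:, 1::2] = torch.cos(angles)                # PE(p, 2i+1)
    return pe

# extrapolation: a model trained with L=512 can still be given encodings for
# positions 512..1023, because sin and cos are defined for any position
pe_long = sinusoidal_pe(torch.arange(1024), d=64)
```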

Alternatively, you can use interpolation. Instead of using $p$ as an integer from 0 to $L-1$ for sequence length $L$, you can make $p$ a floating point number in the same range to represent a length $L'>L$. That is:

$$p = \frac{L}{L'}p'$$

where $p'$ is the integer position from 0 to $L'-1$ (the actual position in the longer sequence).
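Reusing the sinusoidal_pe sketch above, interpolation only changes which positions are fed in: the integer positions of the longer sequence are scaled by $L/L'$ back into the original range:

```python
import torch

L, L_new, d = 512, 1024, 64
p_new = torch.arange(L_new).float()   # integer positions 0 .. L'-1
p = p_new * (L / L_new)               # fractional positions squeezed into [0, L)
pe_interp = sinusoidal_pe(p, d)       # sin/cos accept non-integer positions
```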

These techniques also apply to RoPE.

The function that generates sinusoidal positional encodings or RoPE requires no modification and can handle sequences of any length. However, you may need to fine-tune the model to ensure it can process positions it hasn’t seen during training. For example, the Llama 2 model uses RoPE and was trained with a maximum sequence length of 4K tokens. Code Llama, a programming-focused model fine-tuned from Llama 2, extended the context to 16K tokens with a relatively short long-context fine-tuning stage, and it can extrapolate to sequences of up to 100K tokens.

Interpolation in Learned Encodings

Learned positional encodings retrieve position encoding vectors from a lookup table. This means the sequence length is fixed by the table size, making extrapolation impossible. However, interpolation can still handle sequences longer than the training length. For a sequence of length $L'>L$ where $L$ is the original sequence length, the encoding vector for position $p'=0, \dots, L'-1$ is:

$$P_{p'} = \frac{p-n}{m-n}P_m + \frac{m-p}{m-n}P_n$$

where $p = \frac{L}{L'}p'$ and $m,n$ are integers such that $m=n+1$ and $n\le p\le m$. In PyTorch:
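(The following is a minimal sketch; the function name interpolate_positions and the example table size are assumptions for illustration.)

```python
import torch

def interpolate_positions(pos_table: torch.Tensor, new_len: int) -> torch.Tensor:
    """Linearly interpolate a learned positional encoding table.

    pos_table: tensor of shape (L, d), one learned vector per trained position
    new_len:   target sequence length L' > L
    returns:   tensor of shape (L', d)
    """
    L, d = pos_table.shape
    # map each new position p' in [0, L'-1] to a fractional position p = (L/L') p'
    p = torch.arange(new_len).float() * (L / new_len)
    n = p.floor().long().clamp(max=L - 1)     # lower neighbor n
    m = (n + 1).clamp(max=L - 1)              # upper neighbor m = n + 1
    w = (p - n.float()).unsqueeze(1)          # weight (p - n); note m - n = 1
    return (1 - w) * pos_table[n] + w * pos_table[m]

# usage: stretch a table trained for 512 positions to cover 1024 positions
table = torch.randn(512, 64)
longer = interpolate_positions(table, 1024)   # shape (1024, 64)
```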

This is a basic interpolation implementation. However, there is no guarantee that the model can handle the interpolated encodings without performance degradation unless it is fine-tuned.

YaRN for Larger Context Window

RoPE is currently the most widely used positional encoding in large language models. Recent research has focused on improving RoPE’s extrapolation capabilities.

YaRN is a method that extends RoPE to handle longer sequences, proving more effective than the interpolation method described above. Recall that RoPE sinusoids are computed as:

$$
\begin{aligned}
\theta_i &= \frac{1}{10000^{2i/d}} \\
\hat{x}_m^{(i)} &= x_m^{(i)} \cos(m\theta_i) + x_m^{(d/2+i)} \sin(m\theta_i) \\
\hat{x}_m^{(d/2+i)} &= x_m^{(d/2+i)} \cos(m\theta_i) - x_m^{(i)} \sin(m\theta_i)
\end{aligned}
$$

for position $m$ in the sequence, where vector $x_m$ has dimension $d$, and $x_m^{(i)}$ is the $i$-th element of vector $x_m$. YaRN modifies the formula to:

$$
\begin{aligned}
s &= \frac{L'}{L} \\
\theta_i &= \frac{1}{10000^{2i/d}} \\
r(i) &= \frac{L}{2\pi(10000^{2i/d})} \\
\gamma(r) &= \begin{cases}
0, & \text{if } r < \alpha \\
\dfrac{r - \alpha}{\beta - \alpha}, & \text{if } \alpha \le r \le \beta \\
1, & \text{if } r > \beta
\end{cases} \\
\theta_i' &= \Big[1-\gamma\big(r(i)\big)\Big]\frac{\theta_i}{s} + \gamma\big(r(i)\big)\theta_i \\
\sqrt{\frac{1}{t}} &= 0.1\ln(s) + 1 \\
\hat{x}_m^{(i)} &= \sqrt{\frac{1}{t}} \Big[x_m^{(i)} \cos(m\theta_i') + x_m^{(d/2+i)} \sin(m\theta_i')\Big] \\
\hat{x}_m^{(d/2+i)} &= \sqrt{\frac{1}{t}} \Big[x_m^{(d/2+i)} \cos(m\theta_i') - x_m^{(i)} \sin(m\theta_i')\Big]
\end{aligned}
$$

In code:
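(A minimal sketch follows; the helper names yarn_rope_frequencies and apply_yarn_rope, and the default values of α and β, are assumptions for illustration.)

```python
import math
import torch

def yarn_rope_frequencies(dim, orig_len, new_len, base=10000.0, alpha=1.0, beta=32.0):
    """Compute YaRN-adjusted RoPE frequencies and the scaling factor.

    dim:         head dimension d (must be even)
    orig_len:    training context length L
    new_len:     target context length L'
    alpha, beta: thresholds of the NTK-by-parts ramp function
    """
    s = new_len / orig_len                               # s = L' / L
    # original RoPE frequencies: theta_i = 1 / base^(2i/d)
    theta = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    # interpolated frequencies: theta_i multiplied by 1/s
    inv_freq_interpolation = theta / s
    # r(i): number of full rotations the i-th sinusoid makes over the original context
    r = orig_len * theta / (2 * math.pi)
    # ramp gamma(r): 0 means pure interpolation, 1 means pure extrapolation
    gamma = ((r - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    # blend the two per dimension (NTK-by-parts): this is theta_i'
    inv_freq = (1 - gamma) * inv_freq_interpolation + gamma * theta
    # scaling factor: sqrt(1/t) = 0.1 ln(s) + 1
    mscale = 0.1 * math.log(s) + 1.0
    return inv_freq, mscale

def apply_yarn_rope(x, inv_freq, mscale):
    """Rotate x of shape (seq_len, dim) using the split-half RoPE layout."""
    seq_len, dim = x.shape
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)                  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, : dim // 2], x[:, dim // 2 :]
    out = torch.cat([x1 * cos + x2 * sin, x2 * cos - x1 * sin], dim=-1)
    return mscale * out
```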

YaRN’s key innovation is scaling RoPE sinusoid frequencies unevenly when expanding sequence length from $L$ to $L'$. This approach is called “NTK-by-parts” interpolation.

Consider the $\cos(m\theta_i)$ term in the RoPE formula, where $m$ ranges from 0 to $L'-1$ for the new sequence length $L'$. In inv_freq_interpolation, $\theta_i$ is multiplied by a factor of $1/s = L/L'$ to create the interpolation effect. Using the original $\cos(m\theta_i)$ with a larger $m$ constitutes extrapolation.

In NTK-by-parts interpolation, instead of $\cos(m\theta_i)$, you use $\cos(m\theta_i')$, where $\theta_i'$ blends interpolation and extrapolation. The weight between interpolation and extrapolation follows the formula above and is implemented as inv_freq in the code.

YaRN improves upon NTK-by-parts interpolation by adding the scaling factor $\sqrt{1/t}$. This enhancement improves model performance under longer context lengths, as measured by lower perplexity (i.e., higher accuracy in next-token prediction).
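Continuing the sketch above (the same assumed yarn_rope_frequencies helper), you can inspect how the blend and the scaling factor behave when stretching a 4K context to 32K:

```python
inv_freq, mscale = yarn_rope_frequencies(dim=128, orig_len=4096, new_len=32768)
# low-index (high-frequency) dimensions keep their original theta_i (extrapolation),
# while high-index (low-frequency) dimensions are divided by s = 8 (interpolation)
print(inv_freq[:4], inv_freq[-4:])
print(mscale)   # 0.1 * ln(8) + 1, approximately 1.208
```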

Further Readings

Below are some papers that are related to the topic:
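  • Jianlin Su et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding,” 2021. arXiv:2104.09864
  • Shouyuan Chen et al. “Extending Context Window of Large Language Models via Positional Interpolation,” 2023. arXiv:2306.15595
  • Bowen Peng et al. “YaRN: Efficient Context Window Extension of Large Language Models,” 2023. arXiv:2309.00071
  • Baptiste Rozière et al. “Code Llama: Open Foundation Models for Code,” 2023. arXiv:2308.12950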

Summary

In this post, you learned how models trained with shorter context lengths can process longer input sequences. Specifically:

  • Sinusoidal encodings and RoPE can be easily extrapolated
  • Learned encodings only support interpolation
  • YaRN provides an advanced method for scaling RoPE to longer sequence lengths

The goal of scaling positional encodings is to enable models to handle longer input sequences without retraining from scratch. This is not an exhaustive list of techniques, as more advanced methods continue to build upon these foundational ideas.
