
L2 Norms and Memory Efficiency

18 Dec 2016

Suppose you want to find the pairwise $\text{L2}$ distances between a set $X$ of $n$ vectors and a set $Y$ of $m$ vectors in $\mathbb{R}^k$. (Example: 5,000 test images against 50,000 training images, with each image stored as a 3,072-length vector.) Considering that

\[\|x-y\|^2 = \sum (x_i - y_i)^2,\]

a naive approach might be to store the pairwise difference vectors $ \{ x - y \} _ {x \in X, y \in Y}$ in an $n \times m \times k$ array, with the aim of squaring the components of each difference vector and summing them up. This approach uses (on the order of) $k$ times too much memory. Let’s see why.
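To make the cost concrete, here is a rough sketch of the naive approach in NumPy (the array names and sizes are illustrative, not from the example above):

```python
import numpy as np

# Small sizes so the snippet actually runs; at the sizes in the example
# above (n=5,000, m=50,000, k=3,072, float32), the (n, m, k) array alone
# would take roughly 3 TB of memory.
n, m, k = 4, 6, 3
X = np.random.randn(n, k)
Y = np.random.randn(m, k)

# Naive approach: materialize every pairwise difference vector.
diffs = X[:, None, :] - Y[None, :, :]      # shape (n, m, k)
dists_sq = np.sum(diffs ** 2, axis=2)      # shape (n, m)
```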

Notice that if we expand the distance formula, we have

\[\|x-y\|^2 = \sum (x_i - y_i)^2 = \sum x_i^2 - 2 \sum x_i y_i + \sum y_i^2 = x \cdot x - 2\, x \cdot y + y \cdot y.\]

The key insight from the last expression is that the squared distance $D(x,y) = \|x-y\|^2$ needs only three scalar pieces of information: the dot products of $x$ with itself, $y$ with itself, and $x$ with $y$. In particular, it doesn’t care what the components of $x-y$ are.
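As a quick sanity check of the identity (a small illustrative snippet; the vector names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3072)
y = rng.standard_normal(3072)

direct   = np.sum((x - y) ** 2)            # expand the difference vector
expanded = x @ x - 2 * (x @ y) + y @ y     # three dot products, no difference vector
print(np.allclose(direct, expanded))       # True
```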

If we were tasked with finding the distance between just two vectors, it wouldn’t cost much to store the components of a single difference vector of length $k$. But if we have to find pairwise distances between two large sets of vectors, storing all the difference vectors costs $n \times m \times k$ entries.

Instead, we just need to store the following:

- the $n$ dot products $x \cdot x$ for each $x \in X$,
- the $m$ dot products $y \cdot y$ for each $y \in Y$, and
- the $n \times m$ dot products $x \cdot y$ for each pair,

which is $n + m + nm$ numbers rather than $nmk$.

In practice, assuming we have matrices $X _ {n \times k}$ and $Y _ {m \times k}$, this information corresponds to the diagonals of $XX^T$ and $YY^T$ (the rowwise squared norms of $X$ and $Y$), together with the matrix $XY^T$.
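A minimal NumPy sketch of this vectorized computation, assuming $X$ and $Y$ are stored as row-wise matrices as above (function and variable names here are my own):

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared L2 distances between rows of X (n x k) and rows of Y (m x k)."""
    # Rowwise squared norms: the diagonals of X X^T and Y Y^T,
    # computed directly without forming the full Gram matrices.
    x_sq = np.sum(X ** 2, axis=1)          # shape (n,)
    y_sq = np.sum(Y ** 2, axis=1)          # shape (m,)
    cross = X @ Y.T                        # shape (n, m), the only large term
    # ||x - y||^2 = x.x - 2 x.y + y.y, broadcast over all pairs
    return x_sq[:, None] - 2 * cross + y_sq[None, :]

# Usage at the sizes from the example: the result is (5,000 x 50,000),
# about 1 GB in float32, instead of the ~3 TB the naive approach needs.
X = np.random.randn(5000, 3072).astype(np.float32)
Y = np.random.randn(50000, 3072).astype(np.float32)
D_sq = pairwise_sq_dists(X, Y)             # shape (5000, 50000)
```

One practical note: with float32 inputs, rounding can produce tiny negative entries in the result, so it is common to clamp with `np.maximum(D_sq, 0)` before taking a square root.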