Bayan Bennett

Embeddings in Machine Learning

I first came across the concept of embeddings while developing the RNN typing practice app.

Even though I am just beginning to understand the range of uses for embeddings, I thought it would be useful to write down some of the basics.

First, let's look at what I knew before embeddings: one-hot vectors.

Refresher on one-hot vectors

Remember one-hot vectors? No? Well, do you remember unit vectors from math class? Still no? Okay: assume that we have three labels, $\begin{bmatrix} 🍎 & 🍊 & 🍌 \end{bmatrix}$. We want to represent these values in a way that machines can understand. Initially, we might be tempted to assign the values $\begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$, but the issue here is that we don't necessarily want a 🍌 to equal three 🍎.

We could instead assign a vector to each label, where the dimension of each vector is equal to the number of labels. In this case we have three labels, so three dimensions.

$$\begin{matrix} 🍎 & 🍊 & 🍌 \\ \begin{bmatrix} 1\\0\\0 \end{bmatrix} & \begin{bmatrix} 0\\1\\0 \end{bmatrix} & \begin{bmatrix} 0\\0\\1 \end{bmatrix} \end{matrix}$$
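As a quick illustration, here is a minimal sketch of building these one-hot vectors with NumPy (the label names and their ordering are just for this example):

```python
import numpy as np

# Labels in a fixed, arbitrary order: apple, orange, banana.
labels = ["apple", "orange", "banana"]

# The identity matrix gives one row per label, with a 1 at that label's index.
one_hot = np.eye(len(labels))

print(one_hot[labels.index("apple")])   # [1. 0. 0.]
print(one_hot[labels.index("banana")])  # [0. 0. 1.]
```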

Embeddings

Think about a company with a million books in its catalogue that it would like to use as input data. It is not practical to create a system that needs a million-dimensional vector to represent each book. Unless they are specifically looking for something unique to each book, there must be some aspects that some books share. How can we capture that?

With one-hot vectors, each vector is unique, and the dot product of any two distinct vectors is always 0. However, what if this is not desired, and we want there to be some similarity between vectors (a non-zero dot product)? Going back to our example above, what if, in our application, we are looking at the shapes of fruit? It is logical that there would be some overlap of values.

$$\begin{matrix} 🍎 & 🍊 & 🍌 & 🥭 \\ \begin{bmatrix} 0.6\\0.3\\0.1 \end{bmatrix} & \begin{bmatrix} 0.3\\0.6\\0.1 \end{bmatrix} & \begin{bmatrix} 0.2\\0.1\\0.7 \end{bmatrix} & \begin{bmatrix} 0.3\\0.4\\0.3 \end{bmatrix} \end{matrix}$$

An 🍎 and an 🍊 share a similar shape, so their values are similar. A 🍌 is quite a unique shape, but slightly more like an 🍎 than an 🍊. I also snuck a 🥭 in there. A 🥭 is a bit more oblong, but has a round top, so its numbers are a mix. One of the main benefits of using embeddings is that the number of labels can exceed the dimension of the embedding. In this case, we have 4 labels but a dimension of 3.
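To make the dot-product idea concrete, here is a small sketch comparing the orthogonal one-hot vectors with the hand-picked shape vectors from the table above (these are the illustrative numbers, not trained values):

```python
import numpy as np

# Distinct one-hot vectors are orthogonal: their dot product is always 0.
apple_oh = np.array([1, 0, 0])
orange_oh = np.array([0, 1, 0])
print(apple_oh @ orange_oh)  # 0

# Hand-picked "shape" vectors: apple, orange, banana, mango.
apple = np.array([0.6, 0.3, 0.1])
orange = np.array([0.3, 0.6, 0.1])
banana = np.array([0.2, 0.1, 0.7])
mango = np.array([0.3, 0.4, 0.3])

print(apple @ orange)   # 0.37 -> apples and oranges are similarly round
print(apple @ banana)   # 0.22 -> a banana is slightly more apple-like...
print(orange @ banana)  # 0.19 -> ...than orange-like
```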

However, these numbers are just me, a human, squinting my eyes and pressing numbers on my keypad. In practice, these values are trainable, and we let the model adjust them during training. If our data were the ratio of each fruit's length to its width, then after training, the values might vaguely resemble what I put above.
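In a model, this trainable lookup table is typically an embedding layer. Here is a minimal sketch, assuming TensorFlow/Keras (other frameworks have equivalent layers); the fruit ids are just for illustration:

```python
import tensorflow as tf

# Four labels (apple, orange, banana, mango) embedded in three dimensions.
# The layer's (4, 3) weight matrix starts out random and is adjusted by
# backpropagation along with the rest of the model.
embedding = tf.keras.layers.Embedding(input_dim=4, output_dim=3)

fruit_ids = tf.constant([0, 1, 2, 3])  # integer label ids in...
vectors = embedding(fruit_ids)         # ...dense vectors out, shape (4, 3)

print(vectors.numpy())
print(embedding.trainable_weights)  # the single trainable lookup table
```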

Embedding dimensions

Here's where the alchemy begins: the embedding dimension hyperparameter. Google's Machine Learning Crash Course on Embeddings mentions the following as a good starting point.

$$dimensions \approx \sqrt[4]{possible\ values}$$

For the RNN typing practice app, there are 28 values (a-z, space, and a null character for masking). Following the above formula would have resulted in a dimension of 2 or 3. After training, I didn't get acceptable results. I thought some more about the problem and theorized that setting the dimension to 28 might work: the dimension would then be high enough for each character to be compared against every other character. I ran tests with dimensions of $\begin{bmatrix} 2, & 28, & 28^2 \Rightarrow 784 \end{bmatrix}$, and 28 performed the best. Is there a better value? Possibly, but I was happy with the results.
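For reference, here is a quick sketch of what that starting-point heuristic suggests for the vocabulary sizes mentioned in this post:

```python
# Starting point from the crash course: the fourth root of the number of values.
def suggested_dimensions(possible_values: int) -> float:
    return possible_values ** 0.25

print(suggested_dimensions(28))         # ~2.3 -> round to 2 or 3
print(suggested_dimensions(1_000_000))  # ~31.6 -> roughly 32 for a million books
```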

TL;DR

Embeddings are a basic method for encoding label information into a vector. This representation can start out arbitrary and be trained so that similar labels end up with similar vectors. The ability to represent many labels with far fewer dimensions is one of the main benefits. Additional information can also be encoded, such as positional information, which is used prominently in transformers.

Β© 2022 Bayan Bennett