Pseudo-English—Typing Practice w/ Machine Learning
The finished project is located here: https://www.bayanbennett.com/projects/rnn-typing-practice
Objective
Generate English-looking words using a recurrent neural network.
Trivial Methods
Random Letters
const getRandom = (distribution) => {
  const randomIndex = Math.floor(Math.random() * distribution.length);
  return distribution[randomIndex];
};

const alphabet = "abcdefghijklmnopqrstuvwxyz";
const randomLetter = getRandom(alphabet);
Unsurprisingly, the output bears no resemblance to English words, and the generated character sequences were painful to type. Here are a few examples of five-letter words:
snyam iqunm nbspl onrmx wjavb nmlgj
arkpt ppqjn zgwce nhnxl rwpud uqhuq
yjwpt vlxaw uxibk rfkqa hepxb uvxaw
Weighted Random Letters
What if we generated sequences that had the same distribution of letters as English? I obtained the letter frequencies from Wikipedia and created a JSON file that maps each letter of the alphabet to its relative frequency.
// letter-frequencies.json
{
"a": 0.08497, "b": 0.01492, "c": 0.02202, "d": 0.04253,
"e": 0.11162, "f": 0.02228, "g": 0.02015, "h": 0.06094,
"i": 0.07546, "j": 0.00153, "k": 0.01292, "l": 0.04025,
"m": 0.02406, "n": 0.06749, "o": 0.07507, "p": 0.01929,
"q": 0.00095, "r": 0.07587, "s": 0.06327, "t": 0.09356,
"u": 0.02758, "v": 0.00978, "w": 0.02560, "x": 0.00150,
"y": 0.01994, "z": 0.00077
}
The idea here is to create a large sequence of letters whose distribution closely matches the frequencies above. Math.random has a uniform distribution, so when we select random letters from that sequence, the probability of picking a given letter matches its frequency in English.
const TARGET_DISTRIBUTION_LENGTH = 1e4; // 10,000
const letterFrequencyMap = require("./letter-frequencies.json");
const letterFrequencyEntries = Object.entries(letterFrequencyMap);

const reduceLetterDistribution = (result, [letter, frequency]) => {
  const num = Math.round(TARGET_DISTRIBUTION_LENGTH * frequency);
  const letters = letter.repeat(num);
  return result.concat(letters);
};

const letterDistribution = letterFrequencyEntries
  .reduce(reduceLetterDistribution, "");

const randomLetter = getRandom(letterDistribution);
The increase in the number of vowels was noticeable, but the generated sequences still failed to resemble English words. Here are a few examples of five-letter words:
aoitv aertc cereb dettt rtrsl ararm
oftoi rurtd ehwra rnfdr rdden kidda
nieri eeond cntoe rirtp srnye enshk
Markov Chains
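A natural middle ground between letter frequencies and a full neural network is a character-level Markov chain, where the probability of the next letter depends on the letter that precedes it. The sketch below is illustrative only and is not the code used in this project; it assumes a wordList array of lowercase English words and uses hypothetical ^ and $ markers for the start and end of a word.

// Count how often each character follows each other character (first-order bigrams)
const buildBigramCounts = (wordList) => {
  const counts = {};
  for (const word of wordList) {
    const chars = `^${word}$`; // "^" and "$" are hypothetical start/end markers
    for (let i = 0; i < chars.length - 1; i++) {
      const current = chars[i];
      const next = chars[i + 1];
      counts[current] = counts[current] || {};
      counts[current][next] = (counts[current][next] || 0) + 1;
    }
  }
  return counts;
};

// Walk the chain, picking each next character in proportion to its count
const sampleWord = (counts, maxLength = 10) => {
  let current = "^";
  let word = "";
  while (word.length < maxLength) {
    const entries = Object.entries(counts[current]);
    const total = entries.reduce((sum, [, count]) => sum + count, 0);
    let threshold = Math.random() * total;
    for (const [next, count] of entries) {
      threshold -= count;
      if (threshold <= 0) {
        current = next;
        break;
      }
    }
    if (current === "$") break;
    word += current;
  }
  return word;
};

Letter pairs that never occur in the training words can no longer be generated, but the model still only sees one character of context at a time.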
Recurrent Neural Networks
Neural networks are usually memoryless: the system retains no information from previous steps. RNNs are a type of neural network in which the previous state of the network is an input to the current step.
- Input: a character
- Output: a tensor with the probabilities for the next character
Neural networks are inherently bad at processing inputs of varying length; there are ways around this (such as positional encoding in transformers), but with RNNs the inputs are a consistent size: a single character. Natural language processing has a natural affinity for RNNs, since languages are unidirectional (LTR or RTL) and the order of the characters is important. For example, although united and untied differ only by two swapped characters, they have opposite meanings (see: antigram).
The model below is based on the TensorFlow Text generation with an RNN tutorial.
Input Layer with Embedding
This was the first time I encountered the concept of an embedding layer. It was a fascinating concept and I was excited to start using it.
I wrote a short post summarizing embeddings here: https://bayanbennett.com/posts/embeddings-in-machine-learning
const generateEmbeddingLayer = (batchSize, outputDim) =>
  tf.layers.embedding({
    inputDim: vocabSize,
    outputDim,
    maskZero: true,
    batchInputShape: [batchSize, null],
  });
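The maskZero: true option tells downstream layers to treat the integer 0 (the null-character padding added later) as masked. For context, the tensor shapes this layer works with look roughly like this, assuming the 28-character vocabulary defined later:

// input:  [batchSize, sequenceLength] of integer-encoded characters, e.g. [500, maxLength]
// output: [batchSize, sequenceLength, outputDim], one dense embedding vector per character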
Gated Recurrent Unit (GRU)
I don't have enough knowledge to justify why a GRU was chosen, so I deferred to the implementation in the aforementioned TensorFlow tutorial.
const generateRnnLayer = (units) =>
  tf.layers.gru({
    units,
    returnSequences: true,
    recurrentInitializer: "glorotUniform",
    activation: "softmax",
  });
Putting it all together
Since we are sequentially feeding the output of one layer into the input of another, tf.Sequential is the class of model that we should use.
const generateModel = (embeddingDim, rnnUnits, batchSize) => {
  const layers = [
    generateEmbeddingLayer(batchSize, embeddingDim),
    generateRnnLayer(rnnUnits),
  ];
  return tf.sequential({ layers });
};
Training Data
I used Princeton's WordNet 3.1 data set as a source for words.
Since I was only interested in the words themselves, I parsed each file and extracted just the words. Entries containing spaces were split into separate words. Words matching any of the following criteria were also removed:
- Words with diacritics
- Single character words
- Words with numbers
- Roman numerals
- Duplicate words
Dataset Generator
Both tf.LayersModel and tf.Sequential have the .fitDataset method, which is a convenient way of fitting a dataset. We need to create a tf.data.Dataset, but first, here are some helper functions:
// utils.js
// index 0 is the null character (used for padding), index 1 is the space that ends each word
const characters = Array.from("\0 abcdefghijklmnopqrstuvwxyz");
const mapCharToInt = Object.fromEntries(
  characters.map((char, index) => [char, index])
);
const vocabSize = characters.length;
const int2Char = (int) => characters[int];
const char2Int = (char) => mapCharToInt[char];
// dataset.js
const wordsJson = require("./wordnet-3.1/word-set.json");
const wordsArray = Array.from(wordsJson);

// add 1 to max length to accommodate a single space that follows each word
const maxLength = wordsArray.reduce((max, s) => Math.max(max, s.length), 0) + 1;

const data = wordsArray.map((word) => {
  const paddedWordInt = word
    .concat(" ")
    .padEnd(maxLength, "\0")
    .split("")
    .map(char2Int);
  return { input: paddedWordInt, expected: paddedWordInt.slice(1).concat(0) };
});
function* dataGenerator() {
  for (let { input, expected } of data) {
    /* If I try to make the tensors inside `wordsArray.map`,
     * I get an error on the second epoch of training */
    yield { xs: tf.tensor1d(input), ys: tf.tensor1d(expected) };
  }
}
module.exports.dataset = tf.data.generator(dataGenerator);
Note that we need all the inputs to be the same length, so we pad all words with null characters, which will be converted to the integer 0 by the char2Int function.
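To make the shape of a training pair concrete, here is what a short word like cat would produce if maxLength were 6 (the real maxLength is derived from the longest word in the set):

// "cat" -> "cat " -> "cat \0\0"  (append a space, then pad with nulls to maxLength)
// input:    ["c", "a", "t", " ", "\0", "\0"]  ->  [4, 2, 21, 1, 0, 0]
// expected: the input shifted left by one character, with a 0 appended
//                                                   [2, 21, 1, 0, 0, 0]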
Generating and compiling the model
Here it is, the moment we've been building towards:
const BATCH_SIZE = 500;
const batchedData = dataset.shuffle(10 * BATCH_SIZE).batch(BATCH_SIZE, false);
const model = generateModel(vocabSize, vocabSize, BATCH_SIZE);
const optimizer = tf.train.rmsprop(1e-2);
model.compile({
  optimizer,
  loss: "sparseCategoricalCrossentropy",
  metrics: tf.metrics.sparseCategoricalAccuracy,
});
model.fitDataset(batchedData, { epochs: 100 });
A batch size of 500 was selected because that was roughly the largest batch I could fit without running out of memory.
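To sample new words from the trained model, one approach (a sketch only, using a hypothetical generateWord helper rather than the project's actual code) is to rebuild the model with a batch size of 1, copy in the trained weights, and repeatedly feed the growing character sequence back in, sampling the next character from the predicted distribution until a space or null character appears.

// Assumes training has finished (.fitDataset returns a Promise); tf.tidy omitted for brevity
const inferenceModel = generateModel(vocabSize, vocabSize, 1);
inferenceModel.setWeights(model.getWeights());

const generateWord = (seedChar = "a", maxChars = 12) => {
  const ints = [char2Int(seedChar)];
  for (let i = 0; i < maxChars; i++) {
    // Feed the whole prefix; the output has shape [1, sequenceLength, vocabSize]
    const output = inferenceModel.predict(tf.tensor2d([ints]));
    const lastStep = output
      .squeeze([0]) // -> [sequenceLength, vocabSize]
      .slice([ints.length - 1, 0], [1, vocabSize]) // distribution for the last position
      .squeeze();
    // Sample an index; the final `true` marks the input as already-normalized probabilities
    const nextInt = tf.multinomial(lastStep, 1, undefined, true).dataSync()[0];
    if (nextInt === char2Int(" ") || nextInt === char2Int("\0")) break;
    ints.push(nextInt);
  }
  return ints.map(int2Char).join("");
};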
Examples
ineco uno kam whya qunaben qunobin
xexaela sadinon zaninab mecoomasph
anonyus lyatra fema inimo unenones
It's not perfect, but it produces words that vaguely appear to come from another Romance or Germanic language. The model.json and weights.bin files total only 44 kB. This is important, since simpler models generally run inference faster and are light enough for the end user to download without affecting perceived page performance.
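For reference, those two files are what tfjs-node writes when a trained model is saved to disk; the path below is illustrative:

// Inside an async function: writes model.json (topology + weight manifest) and weights.bin
await model.save("file://./pseudo-english-model");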
The next step is where the fun begins, building a typing practice web app!