
Kublai: sevglzioelkehcnwezshieoinkosy.ezwslegyoegaaese.kzevglezwsleakclxeliyziyniq
Uh, are you OK, Kublai?

Kublai: Hey, I’m confused too! Someone sent me a “monoalphabetic substitution cipher” and I don’t even know what that is!
Oh, a monoalphabetic substitution cipher is much simpler than it sounds. For each letter of the alphabet, you designate some replacement letter, and then apply these replacements to each letter in the original message to produce the ciphertext. It’s like ROT13 or the Caesar cipher, but the alphabet can be scrambled with an arbitrary permutation. Since your ciphertext includes periods but no spaces, my guess is that instead of just scrambling the 26 letters of the alphabet, we also include periods and spaces for a total of 28 “letters”.
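For concreteness, a toy version of the scheme over that 28-symbol alphabet looks something like this (an illustration only, with names made up for this sketch; Kublai’s key is obviously unknown):
// The 28-symbol alphabet: a-z, then space, then period.
const ALPHABET: &[u8; 28] = b"abcdefghijklmnopqrstuvwxyz .";

// Encrypt by replacing each symbol with its counterpart in `key`, which is
// some permutation of ALPHABET. Decryption is the same thing with the
// inverse permutation.
fn encrypt(plaintext: &str, key: &[u8; 28]) -> String {
    plaintext
        .bytes()
        .map(|c| {
            let i = ALPHABET.iter().position(|&a| a == c).expect("unsupported character");
            key[i] as char
        })
        .collect()
}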

Kublai: Yeah yeah whatever, but, how do we crack the cipher? Wait, I know! Let’s use a ✨generative AI language model✨ written in 🚀blazingly fast fearless concurrency Rust🚀!
Dang, I think you might be on to something. Let’s do it!
✨Language models✨
Our task is to find the permutation of the alphabet that, when applied to the ciphertext, yields a string that looks the most “Englishy”, which is probably the original message. That is, we want a string that has the highest probability of being valid English text.

Kublai: Let’s use an ✨AI language model✨!
Exactly! A ✨language model✨ is just a model that can estimate the likelihood of a string, which is exactly what we need. For instance, gzip is a ✨language model✨, where its estimate of the likelihood of a string $s$ is proportional to $2^{-|gzip(s)|}$. We can even train gzip and use it to generate text! (When trained on my blog, gzip generated “you jucrypaost’scord”. Zstd is faster but even less coherent.)
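As a sketch of that formula (using the flate2 crate to call gzip from Rust, which isn’t necessarily what my actual code does):
use flate2::{write::GzEncoder, Compression};
use std::io::Write;

// Score a string by how well gzip compresses it. Per the formula above,
// log2 P(s) is proportional to -|gzip(s)|, measured here in bits
// (8 bits per compressed byte). More "Englishy" text compresses better.
fn gzip_log2_likelihood(s: &str) -> f64 {
    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(s.as_bytes()).unwrap();
    let compressed = enc.finish().unwrap();
    -8.0 * compressed.len() as f64
}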

Kublai: Wow, so gzip is ✨AI✨!!
For performance reasons, we’ll forgo ✨transformers✨ and instead use a simpler image recognition architecture for our ✨language model✨.

Kublai: Huh, image recognition? Strings are images now?
Yes! We can embed each letter into a vector in an $N$-dimensional latent space, which turns our string into a matrix with height $N$. Now it’s an image! There’s another parallel: Imagine instead that we’re estimating the likelihood that an image contains a camel. We don’t care if the camel is in the bottom left corner or center of the image, so we’d like our model to be translation-invariant. Likewise, if we’re estimating the likelihood that a string is valid English text, we don’t really care if the word “Kublai” shows up at the beginning or middle of the string.

Kublai: Cool, so now we can estimate the likelihood of a string. But how can we find a permutation that produces the string with the highest likelihood? Oh, let’s use ✨generative AI✨!
Good idea. Instead of outputting just a single float, we can make our ✨language model✨ predict what each character should be based on its surrounding context. Then, we can identify which characters differ from their predictions and use that information to improve our alphabet permutation. To train this model, all we need is a lot of English text. Note that during training, we have to mask the current character we’re trying to predict, or else the model can just look at the answer and cheat.
Here’s the architecture:
- Embedding layer to 12 dimensions
- 12x33 convolution with 512 channels (center stripe masked)
- ReLU
- 512x512 FC layer
- ReLU
- 512x512 FC layer
- ReLU
- 512x28 FC layer
- Softmax
I implemented it in 🚀blazingly fast Rust🚀 using tch-rs and trained it on a GPU for a few hours using the WikiText-103 dataset. The code is really trashy so I’ll defer showing it until the end of this post.
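In tch-rs, the model looks roughly like this (a sketch with made-up names, not my actual code; the real version also zeroes the center column of the convolution kernel, the “center stripe”, so a position can’t peek at its own character):
use tch::nn;
use tch::nn::Module;

// Rough sketch of the architecture above in tch-rs.
fn char_predictor(vs: &nn::Path) -> impl Module {
    let conv_cfg = nn::ConvConfig { padding: 16, ..Default::default() }; // keep sequence length
    nn::seq()
        // [batch, seq] of character indices -> [batch, seq, 12]
        .add(nn::embedding(vs / "emb", 28, 12, Default::default()))
        .add_fn(|xs| xs.transpose(1, 2)) // -> [batch, 12, seq] for conv1d
        // the 12x33 convolution with 512 channels
        .add(nn::conv1d(vs / "conv", 12, 512, 33, conv_cfg))
        .add_fn(|xs| xs.relu().transpose(1, 2)) // -> [batch, seq, 512]
        .add(nn::linear(vs / "fc1", 512, 512, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::linear(vs / "fc2", 512, 512, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::linear(vs / "out", 512, 28, Default::default()))
        // per-position (log-)softmax over the 28 possible characters
        .add_fn(|xs| xs.log_softmax(-1, tch::Kind::Float))
}
Training is then just a per-position cross-entropy between these predictions and the characters actually in the training text.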
Now let’s crack some ciphers! Except… this method doesn’t work. The ✨language model✨ only works well when fed valid or nearly-valid English text. Starting with a random permutation and hence a gibberish string, the ✨language model✨ gets confused and outputs garbage as well. We need some other way to find a nearly-correct permutation, and then use the ✨language model✨ to fix the remaining typos.
Dumb brute force

Kublai: I know, let’s use more ✨language models✨!
Here’s a very very simple ✨language model✨:
const VOCAB: usize = 28; // the 28 "letters": a-z, space, period
const LETTER_FREQS: [f32; VOCAB] = [
    -2.7440639503257156, -4.3881183352746, -3.9830585967599133, -3.338597298934179, -2.2996092900998364, -4.0120112763964775, -4.130611311302383, -2.9885467899954916, -2.8851769079051217, -6.889575331006274, -5.15651085790723, -3.461925426670114, -3.779473184027688, -2.8729751718671457, -2.77598577757311, -4.357324003441537, -7.006877096099109, -3.0310144905461676, -3.001198702038712, -2.6516498426171307, -3.7717428198441567, -4.808103331776061, -3.933129954510123, -6.7211616715489635, -4.01909939164524, -8.050873628703975, -1.676946064410797, -4.684835363922925
];
// Log probability of a string of letter indices: just sum the log frequencies.
fn unigram_log_prob(s: &[u8]) -> f32 {
    let mut lp = 0.;
    for c in s {
        lp += LETTER_FREQS[*c as usize];
    }
    lp
}

Kublai: I see a tensor of floats, so it must be ✨AI✨!
Nah, it’s just an array of the log frequency of every English letter. Then to predict the probability of a string, we multiply together the probabilities of each letter (equivalently, add up the log frequencies, which is what the loop above does). Easy as that.
We can improve this silly model by using bigram (pairs of consecutive letters) or trigram frequencies.
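For instance, a bigram scorer looks something like this (the 28x28 table of bigram log frequencies, estimated from English text, isn’t shown here, and the function name is made up for this sketch):
// Same idea as the unigram model, but indexed by pairs of consecutive letters.
// `bigram_freqs` is a 28x28 table of log frequencies estimated from English text.
fn bigram_log_prob(s: &[u8], bigram_freqs: &[[f32; 28]; 28]) -> f32 {
    s.windows(2)
        .map(|w| bigram_freqs[w[0] as usize][w[1] as usize])
        .sum()
}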
Now, we can use a very effective and sophisticated algorithm known as “doing super dumb brute force a gazillion times”. Basically, we’ll start with a permutation based on letter frequencies. Then we’ll randomly try swapping two letters in the permutation and see whether that makes the text more likely under the bigram or trigram frequencies. Sometimes we still want to accept a swap even when it makes things worse, to escape local optima; there’s an algorithm called Metropolis-Hastings that makes this trade-off mathematically precise. Repeat that 10000 times and check whether the final permutation decodes the text well. Then repeat this entire procedure 50000 times using multithreading and pick the best answer. Thanks to 🚀blazingly fast fearless concurrency Rust🚀, it only takes a few seconds to run the algorithm an absurd number of times.
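One run of that loop looks roughly like this (a sketch using the rand crate, with made-up names: “decode” applies a candidate permutation to the ciphertext, “score” would be the bigram or trigram scorer, and “temp” controls how often we accept bad swaps):
use rand::Rng;

// One run of the search: repeatedly propose swapping two letters in the
// permutation, keep swaps that raise the score, and occasionally keep a bad
// swap anyway (with probability exp((new - cur) / temp)) to escape local optima.
fn search_run(
    ciphertext: &[u8],
    mut perm: [u8; 28],
    score: impl Fn(&[u8]) -> f32,
    iters: usize,
    temp: f32,
) -> [u8; 28] {
    // Apply a candidate permutation to the ciphertext (both use indices 0..28).
    let decode = |perm: &[u8; 28]| -> Vec<u8> {
        ciphertext.iter().map(|&c| perm[c as usize]).collect()
    };
    let mut rng = rand::thread_rng();
    let mut cur = score(&decode(&perm));
    for _ in 0..iters {
        let (i, j) = (rng.gen_range(0..28), rng.gen_range(0..28));
        perm.swap(i, j);
        let new = score(&decode(&perm));
        if new >= cur || rng.gen::<f32>() < ((new - cur) / temp).exp() {
            cur = new; // keep the swap
        } else {
            perm.swap(i, j); // undo it
        }
    }
    perm
}
Each run is independent, so the 50000 restarts can be spread across threads (that’s the fearless concurrency part) and whichever permutation scores best gets kept.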
This dumb brute force algorithm works surprisingly well, and we can fix any remaining typos using the fancy ✨language model✨ from earlier. The last piece of the puzzle is that my code has a lot of tuneable hyperparameters, so I tried various values and picked what seemed to work best. Additionally, my code is designed to run quickly on a CPU, so there might be room for improvement by using a GPU for inference in addition to training. For more details, consult my garbage-quality code.

Kublai: Cool! We need to turn this ✨generative AI language model✨ written in 🚀blazingly-fast Rust🚀 into a startup now! I bet we can scam billions from clueless investors!