Learn N-Grams Interactively!

The Problem of Encoding

Let’s face it: if you go deep into it, a computer is really dumb. A computer’s software is highly dependent on its hardware. An 8-bit microprocessor like the MOS 6502 can be used to build a computer like the Apple II, or it can be used to build a small handheld game whose only means of output is a tiny LED screen. From the mid-70s to the late 90s, most programmers coded in the Assembly language of the microprocessor they were targeting. And when you code in Assembly, one thing becomes clear to you:

The only thing a computer understands is numbers!

So, when we feed a text to a computer and tell it ‘Classify this with this model!’, if you just hand it the raw text, encoded in UTF-8 as a series of 8-bit octets, it will tell you ‘W0t?’

That’s true. Most modern text is encoded in UTF-8, Unicode’s official codec for characters. UTF-8 uses sequences of one to four octets of 8 bits each, which the peasantry call bytes (that’s actually kinda not true, because a byte can be 6 or 9 bits depending on the architecture of your computer; a byte only came to be synonymous with an octet because the IBM PC used Intel processors, which treat a byte as an octet), to encode all the possible characters in the Unicode space. And oh, baby, Unicode’s LARGE!

So you can’t just say, OK, let’s do label encoding, let’s say the letter a is 0 and the emoji 🎉 is 1229. Why not? Because only god knows how many distinct characters are in a large body of text, plus, you will have to account for characters which will be inputted by the end user. Basically, you have to account for a gazillion labels!

So what do we do? We could, of course, use one-hot encoding instead of label encoding, but then we would be dealing with an array the size of the Unicode space instead of just an integer for each character. Or, like a good boy, we could take the text into a latent space.
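To see the difference, here is a tiny, hypothetical sketch of both approaches; the four-character ‘vocabulary’ is obviously made up:

# A toy "vocabulary"; in reality this would span a huge chunk of Unicode
vocab = ["a", "b", "c", "🎉"]

# Label encoding: one integer per character
labels = {ch: i for i, ch in enumerate(vocab)}
print(labels["🎉"])     # 3

# One-hot encoding: one vector per character, as long as the entire vocabulary
one_hot = {ch: [1 if i == j else 0 for j in range(len(vocab))]
           for i, ch in enumerate(vocab)}
print(one_hot["🎉"])    # [0, 0, 0, 1] -- now imagine this with a million columns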

But what is this Latent Space?

Do you know what a vector is? Well, I bet you do. But let me explain. A vector is an entity that has a size and a direction in the Cartesian coordinate system. Take, par exemple, this one:

[Figure: a vector drawn from the origin to the point (2, 3)]

What you see drawn above is this vector:

\[\vec{v} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}\]

Ok, buddy, so what?

Now imagine this vector holds attributes instead of coordinates. These attributes can be anything we wish them to be. But what would make such vectors a so-called latent space is that they represent a text encoded in a form considerably smaller than the actual text. Imagine this piece of Python code:

my_text = "This is my text! I love it!"
print(hash(my_text))

That would print a number: a hash, which, in its own way, is a latent space. A non-reversible latent space, however. A vector embedding is a reversible latent space. You can give a large text, maybe the text of The Count of Monte Cristo or the Prayers of the Great Shield, to such an encoder, and it will output a small representation of the text. You can always reverse it back. But you can also do stuff with it! A lot of stuff! Know what I’m saying?

This latent space is often called an embedding. And no, my dear Crypto Bro, the latent space is still too large to be used in a smart contract!

Embeddings, from TF-IDF, to Bag of Words, to BERT, to…

One simple embedding vector is TF-IDF. It’s so simple, Scikit-Learn can do it. TF-IDF stands for Term Frequency times Inverse Document Frequency. Term frequencies are the counts of each word in a document, and the inverse part means you weight each of these word counts down by the number of documents the word occurs in. It’s widely used in shallow ML. A classical “Bag of Words” vector is simply a vector of word counts or frequencies; it’s the simplest form of latent space. On the deep side of the spectrum we have embeddings that are ACTUAL latent spaces, such as BERT encodings. BERT is a neural network developed by Google that takes your text into a latent space made up of many attributes, not just frequency. BERT has an ‘attention’ layer, so it pays attention to the relevancy of each word, and as a result, to redundancy. Like it knows that in this sentence:

I scream, you scream, we all scream for ice cream!

There are some redundancies so it factors them in. But how does it do that? How does it know that you scream and I scream and we all scream are all single units made up of smaller units?
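We will get to that in a moment. But first, here is what the shallow end of the spectrum looks like in code: a minimal Bag of Words and TF-IDF sketch, assuming scikit-learn is installed (the two example documents are just made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "I scream, you scream, we all scream for ice cream!",
    "I love Chicago deep dish pizza!",
]

# Bag of Words: one row per document, one column per word, raw counts inside
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.vocabulary_)   # word -> column index
print(counts.toarray())

# TF-IDF: the same counts, down-weighted for words that appear in many documents
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray())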

Bag of n-grams

So, now that we know well and good what a latent space is, let us talk about them grams!

What is an n-gram?

An n-gram is basically a 1-tuple, 2-tuple, 3-tuple, or longer tuple of words that represents a single meaningful unit in computational linguistics. n-grams are about computational semantics, not grammar, despite what their name might suggest.

Note: the n-grams we consider in this piece are n-grams of words, but they are often used for characters as well.

n-grams come into play when we are tokenizing a sentence. We can tokenize a sentence into 1-grams, 2-grams, 3-grams or more. But beyond a certain n, the n-grams become so rare that it doesn’t make computational sense to go any further.

Tokenization is basically splitting of a sentence into meaningful units

So, let’s say we want to split the famous sentence “I scream, you scream, we all scream for ice cream” into 1-grams. That would simply be:

I Scream You Scream We All Scream For Ice Cream

Now let’s get all the 2-, 3-, 4- and 5-grams for a more complex text, shown below:

I love Chicago deep dish pizza!
New York style pizza is also good.
San Francisco pizza, it can be very good

After removing the stopwords and punctuation using this code:

from nltk.util import ngrams
import re
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

sw_set = set(stopwords.words('english'))

sent = '''
I love Chicago deep dish pizza!
New York style pizza is also good.
San Francisco pizza, it can be very good
'''

sent = sent.lower()

# Strip punctuation but leave the whitespace intact
sent_rem = re.sub(r"(?!\s)\W", "", sent)
print("Sentence without punctuation is: ", sent_rem)

# Split on any whitespace (spaces *and* newlines), drop empty tokens and stopwords
sent_wo_sw = ' '.join([w for w in re.split(r"\s+", sent_rem) if w and w not in sw_set])

print("W/O stopwords: ", sent_wo_sw)

# Keep the tokens as an ordered list; a set would scramble the word order
# and the "n-grams" would no longer be n-grams of the original text
sent_split = sent_wo_sw.split(" ")

for n in (2, 3, 4, 5):
    print(f"\n{n}-grams are: ")
    print('\n'.join(', '.join(ng) for ng in ngrams(sent_split, n)))

[nltk_data] Downloading package stopwords to /home/chubak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Sentence without punctuation is:  
i love chicago deep dish pizza
new york style pizza is also good
san francisco pizza it can be very good

W/O stopwords:  love chicago deep dish pizza new york style pizza also good san francisco pizza good

2-grams are: 
love, chicago
chicago, deep
deep, dish
dish, pizza
pizza, new
new, york
york, style
style, pizza
pizza, also
also, good
good, san
san, francisco
francisco, pizza
pizza, good

3-grams are: 
love, chicago, deep
chicago, deep, dish
deep, dish, pizza
dish, pizza, new
pizza, new, york
new, york, style
york, style, pizza
style, pizza, also
pizza, also, good
also, good, san
good, san, francisco
san, francisco, pizza
francisco, pizza, good

4-grams are: 
love, chicago, deep, dish
chicago, deep, dish, pizza
deep, dish, pizza, new
dish, pizza, new, york
pizza, new, york, style
new, york, style, pizza
york, style, pizza, also
style, pizza, also, good
pizza, also, good, san
also, good, san, francisco
good, san, francisco, pizza
san, francisco, pizza, good

5-grams are: 
love, chicago, deep, dish, pizza
chicago, deep, dish, pizza, new
deep, dish, pizza, new, york
dish, pizza, new, york, style
pizza, new, york, style, pizza
new, york, style, pizza, also
york, style, pizza, also, good
style, pizza, also, good, san
pizza, also, good, san, francisco
also, good, san, francisco, pizza
good, san, francisco, pizza, good

As we go further, this formula starts taking shape in our minds:

\[N_{\text{n-grams}} = \vert S\vert - n + 1\]

Where \(\vert S \vert\) is the size of the sentence expressed in words. Basically, the number of n-grams is the number of words minus n, plus one: every word starts an n-gram except the last n - 1 words.
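You can sanity-check that against the output above with a couple of lines (same token list, same nltk helper):

from nltk.util import ngrams

tokens = ("love chicago deep dish pizza new york style pizza "
          "also good san francisco pizza good").split()

for n in (2, 3, 4, 5):
    count = len(list(ngrams(tokens, n)))
    print(n, count, len(tokens) - n + 1)   # the last two columns always agree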

Use the controls below to generate n-grams for any sentence you want. This will give you a much clearer idea of what they are.

Uses of n-grams

So what are n-grams good for?

One use we can think of is filtering out the n-grams that are too frequent in the latent space. Take the n-gram you, scream in the ice cream example as an instance of what a repetitive n-gram looks like.

That example is admittedly a bit off, because when tokenizing into n-grams, stopwords such as for, to, on, etc. should be removed, simply because they are too frequent.

But there are other uses of n-grams which we will discuss now.

Using n-grams in Conjunction with Stats

Using n-grams to Fill in the Holes

If we, for example, have a bigram (that is to say, a 2-gram), we can approximate the probability of a word given all the previous words by using only the conditional probability of the one preceding word. In the bigram {hello, world}, the conditional probability of world can be estimated using the following formula:

\[P(\text{world}|\text{hello}) \simeq P(w_{\text{world}}|w_{1}^{\text{hello}}) \simeq P(w_{\text{world}}|w_{\text{hello}})\]

This is basically what’s called a Markov assumption, since this bigram makes a Markov chain model. In fact, we can generalize this to trigrams, where it would be \(P(w_n \mid w_{n - 2}, w_{n - 1})\).
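As a rough sketch of that Markov assumption, here is how you might estimate P(next word | previous word) by plain counting over a toy corpus (the corpus is made up, and real models smooth these counts):

from collections import Counter

# A made-up toy corpus; in practice you would count over millions of sentences
corpus = "hello world hello there hello world".split()

# Count bigrams and the words that start them
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def p(word, prev):
    """MLE estimate of P(word | prev) under the Markov assumption."""
    return bigram_counts[(prev, word)] / context_counts[prev]

print(p("world", "hello"))   # 2/3: "hello" is followed by "world" in 2 of its 3 appearances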

[Figure: a small Markov chain over the words “dish”, “style”, and “pizza”, showing the transition probabilities p(pizza|style) and p(pizza|dish)]

A Markov Chain can be described as a sequence of random ‘states’ where each new state is conditionally dependent on the previous state. So let’s say…

A Room Full of Monkeys and Typewriters: Generating Text with n-grams

There’s an age-old question: can a room full of monkeys with typewriters, randomly and flippantly pressing the keys, collectively come up with a Shakespeare play?

Actually, we can come up with an answer using basic n-grams. Not n-grams of words, but n-grams of characters.

Let’s say we want these subordinate brute beasts to at least come up with one sentence:

to_be_or_not_to_be

Let’s say we wish to calculate the probabilities of the characters in this sentence appearing after the letter o.

\[\begin{cases} o & \underset{25\%}{\rightarrow} & t \\ o & \underset{25\%}{\rightarrow} & r \\ o & \underset{50\%}{\rightarrow} & \_ \end{cases}\]

And we also have these:

\[\begin{cases} t & \underset{67\%}{\rightarrow} & o \\ t & \underset{33\%}{\rightarrow} & \_ \end{cases}\]

We can express this as:

't' -> ['o', '_', 'o']
'o' -> ['_', 'r', 't', '_']

Now we wish to generate a new text with these grams. We have to cold-start it off with a seed, that is, a starting letter. Then we keep adding and adding iteratively until we reach a letter that has nothing after it.
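Here is a minimal Python sketch of the whole procedure, building the character-bigram table from the sentence and then sampling from it (the 20-character cap is just an arbitrary guard):

import random
from collections import defaultdict

sentence = "to_be_or_not_to_be"

# Character bigrams: for every letter, record which letters followed it
followers = defaultdict(list)
for cur, nxt in zip(sentence, sentence[1:]):
    followers[cur].append(nxt)

print(dict(followers))   # 't': ['o', '_', 'o'], 'o': ['_', 'r', 't', '_'], ...

def generate(seed, max_len=20):
    """Start from a seed letter and keep sampling a follower until nothing follows."""
    out = seed
    while len(out) < max_len and followers[out[-1]]:
        out += random.choice(followers[out[-1]])
    return out

print(generate('t'))   # e.g. 'to_be_or_be_not_to_b'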

Press the button below, select an initial letter, and click “Generate”.

[Result]

Now, instead of bigrams of characters, let’s do bigrams of phonemes. The basis is the same. Select an initial phoneme and see the code do its magic.

[Result]

Using n-grams and Stats to tell the Future!

So, as it turns out, if you have data in the form of n-grams, you can associatively predict the next part of the n-gram.

So let’s say, if you have all the 2-, 3-, 4- and 5-grams of a user’s keypresses on a keyboard, you can predict in succession which key they are going to press next!

And this is exactly what this little project does.
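Under the hood, the idea can be sketched in a few lines of Python; this is a toy illustration, not the project’s actual code, and the keypress log is invented:

from collections import Counter, defaultdict

# A made-up log of keypresses
keys = list("hello hello help hello")

# For every pair of consecutive keys, count which key came next (a trigram model)
n = 3
history = defaultdict(Counter)
for *context, nxt in zip(*(keys[i:] for i in range(n))):
    history[tuple(context)][nxt] += 1

def predict(last_keys):
    """Return the most likely next key given the last n - 1 keys, or None."""
    counts = history.get(tuple(last_keys))
    return counts.most_common(1)[0][0] if counts else None

print(predict(['h', 'e']))   # 'l': in the log, 'h', 'e' was always followed by 'l'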

So it’s not just NLP that n-grams are useful in, but also probability. In fact, we could say that successive failures to get heads (!) in a coin-toss game is by itself a practice in n-grams.

Using n-grams to Fix and Highlight Errors

It is estimated that there are over 760 million EFL (English as a Foreign Language) and 350 million ESL (English as a Second Language) speakers around the globe. With this volume of people who were not exposed to English from day one, and who map English grammar and semantics onto that of their native tongue, there is definitely going to be a need for grammar-correction software.

But how would such software go about detecting wrong grammar?

One way, of course, is n-grams. But n-grams alone cannot be used to detect errors. We need tagging alongside them; then we annotate various related corpora with these tags and train a shallow or a deep classifier on bigrams or trigrams of words.

Imagine these sentences:

1. I accuse you of make a bad decision!
2. I accuse you of making a bad decision!

Use the n-gram generator above to get the trigrams. If you do so, you’ll realize the only difference between these two sentences is one simple n-gram.

We just need to do a Maximum Likelihood calculation (basically Bayesian stats) with the tagged n-grams to catch the error. The likelihood of the first sentence’s trigram accuse, make, bad is less than a set threshold (most often 0.5), so the label would be 0, and that sentence would be flagged as wrong.
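A rough sketch of that thresholding, with completely made-up corpus counts (a real system would estimate them from a large tagged corpus):

# Hypothetical counts harvested from a large, stopword-stripped corpus
trigram_counts = {("accuse", "making", "bad"): 420, ("accuse", "make", "bad"): 2}
bigram_counts = {("accuse", "making"): 500, ("accuse", "make"): 350}

def mle(trigram):
    """Maximum-likelihood estimate of P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
    return trigram_counts.get(trigram, 0) / bigram_counts.get(trigram[:2], 1)

THRESHOLD = 0.5
for tri in [("accuse", "making", "bad"), ("accuse", "make", "bad")]:
    verdict = "looks fine" if mle(tri) >= THRESHOLD else "probably an error"
    print(tri, round(mle(tri), 3), verdict)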

Making Sense of Nonsense Using n-grams

Every teenager who’s used Markov chains to generate text knows this: they most often generate total nonsense. The idea of generating ‘sensible’ (not sensual!) text using n-grams is that the last word (\(x^{n}\)) of the n-gram can be inferred from the other words that appear in it. So in the context of a bigram, say jimbo soup, the context of jimbo can be inferred from soup. Remember this from 2 minutes ago?

\[P(x^{(t + 1)} | x^t, \dots, x^{(1)})\]

And remember Bayes’ rule, which we invoked half a minute ago? (Or less than 5 seconds ago, depending on whether you’re actually reading this article.)

\[P(x^{t + 1} | x^t, \dots, x^{t - n + 2}) = \frac{P(x^{t + 1}, x^t, \dots, x^{t - n + 2})}{P(x^t, \dots, x^{t - n + 2})}\]

So imagine we have the following strings of words:

- deep dish samurai septuagenarian carwash next
- The septuagenarian Samurai ordered deep dish pizza next to the carwash.

Now imagine we have scoured thousands of texts, and we know this to be true:

P(pizza | deep, dish) = 0.8

So we apply this to the first text, and nope! But the second text, yay!
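In code, that check could look like this toy sketch; the 0.8 probability above is the made-up one we just postulated, so all we really test here is whether the trigram ever shows up:

texts = [
    "deep dish samurai septuagenarian carwash next",
    "The septuagenarian Samurai ordered deep dish pizza next to the carwash.",
]

def has_trigram(text, trigram):
    """True if the three words appear consecutively in the text."""
    words = [w.strip(".,!").lower() for w in text.split()]
    return any(tuple(words[i:i + 3]) == trigram for i in range(len(words) - 2))

# P(pizza | deep, dish) = 0.8 only helps a text that actually contains "deep dish pizza"
for t in texts:
    print(has_trigram(t, ("deep", "dish", "pizza")), "-", t)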

And that’s how we make sense of a text using n-grams.

About the Author

The author of this article is Chubak Bidpaa. When he’s not doing [insert inane bullshit] he’s doing [insert another inane bullshit]. The Markdown for this article is self-contained and can be downloaded from here. The HTML is also self-contained and can be downloaded from here. Chubak Bidpaa can be contacted on Discord at Chubak#7400. He can be reached via email at [email protected]. Make sure to tag your email as [Gentooman] or [Gentoomen].