The 3 steps of NLP pre-processing

Caspar Pagel
4 min read · Apr 19, 2022

Trying to understand the different methods, algorithms and tools used in NLP (natural language processing) can be quite overwhelming if you are just starting out.

Therefore, this article will serve as a brief introduction to the different steps required to prepare your data for NLP-specific tasks. We will go over the most common algorithms, when to use them and where to learn more about them.


Step 1: Cleaning

Regardless of what you want to do with your text data, the first thing to do is get it tidied up! This reduces storage use and usually improves the running time as well as the general performance of your model.

The basic methods include:
Lower-casing: It’s self-explanatory. We convert each word into lower-case to get a more uniform representation.

Removing stop-words: This means removing words like “and”, which hold little meaning on their own and are largely irrelevant for the context. Additionally, we most likely want to remove any punctuation.

Stemming: Transforming all words into their root by cutting off the endings. E.g. “coding” and “coded” become “code”.

Lemmatization: Similar to stemming, the goal is to get the root form, but now we use the grammatical root (the lemma). E.g. “was” and “were” become “be”.

These are mostly things that you can implement yourself with a few lines of code. Of course, many machine learning libraries such as Sklearn and NLTK already include functions to perform text cleaning. Furthermore, the Python “re” module can be used to remove unwanted characters with regular expressions.
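To make this concrete, here is a rough sketch of the cleaning steps using NLTK and the built-in re module. The sample sentence and variable names are purely illustrative, not from any particular project:

```python
# A rough sketch of the cleaning steps, assuming NLTK is installed.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

text = "The developers were coding and coded new features!"

# Lower-case and strip punctuation with a regular expression
text = re.sub(r"[^a-z\s]", "", text.lower())

# Remove stop-words such as "the", "were" and "and"
words = [w for w in text.split() if w not in stopwords.words("english")]

# Stemming: chop off word endings ("coding" -> "code")
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])

# Lemmatization: reduce to the grammatical root ("coded" -> "code")
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in words])
```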


Step 2: Tokenization

After we’ve cleaned our text data, it’s ready for tokenization! This means splitting the text into smaller parts (e.g. sentences into words).

The simplest form of tokenization consists of going over our data and creating a token for each word. Afterwards, we end up with an array containing all the words of our cleaned text.
It’s important to mention that the way we tokenize can have a significant impact on the performance of our model.
There are other methods such as n-gram tokenizers, where we split the text into sets of n consecutive words. This means that a bi-gram (n=2) would produce two-word tokens like “have fun”.

If you simply want to tokenize each word, Python’s built-in split() function does exactly that! Libraries such as NLTK or Sklearn also include tokenizers, which often clean the text for you as well.
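As a quick illustration (the example sentence is made up), this is roughly what word and bi-gram tokenization look like with plain Python and NLTK:

```python
# A small sketch of word and n-gram tokenization, assuming NLTK is installed.
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download("punkt")

text = "natural language processing is fun"

# Simplest approach: split on whitespace with Python's built-in split()
print(text.split())  # ['natural', 'language', 'processing', 'is', 'fun']

# NLTK's tokenizer handles punctuation and contractions more carefully
tokens = word_tokenize(text)

# Bi-grams (n=2): pairs of consecutive words
print(list(ngrams(tokens, 2)))
# [('natural', 'language'), ('language', 'processing'), ...]
```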

Step 3: Vectorization

Now it gets interesting! In order for the computer to work with text data, we need to convert it into a numeric format. To do this, we use vectorization.

The most popular and simple methods include:
Bag of words (BoW): Here we count the number of times each unique word (token) appears in our data. This way we get a vector where each “row” represents a word and the value represents its count.

Term frequency-inverse document frequency (TF-IDF): TF-IDF is a metric that measures the importance of each word in a collection of texts (corpus). For the TF part, we measure how frequently each term appears in a document (just like in BoW).

The IDF value is the total number of documents divided by the number of documents in which a certain term appears (usually log-scaled).
Combined, TF-IDF means multiplying the TF and the IDF values. The higher the result, the more important that particular word is.
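As a minimal sketch of both ideas, scikit-learn’s CountVectorizer and TfidfVectorizer handle the counting and weighting for you. The tiny corpus below is made up purely for illustration:

```python
# A minimal sketch of BoW and TF-IDF vectorization with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "i love natural language processing",
    "language models process natural text",
    "i love text",
]

# Bag of words: count how often each unique token appears in each document
bow = CountVectorizer()
counts = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: term counts weighted down by how many documents contain each term,
# so words that appear everywhere get a lower score
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```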

But neither of these methods captures the relationships between words, since they only look at how often words appear relative to the entire text.
Therefore, more advanced methods like Word2Vec with its CBoW and Skip-gram architectures could be interesting. (We won’t be covering them today.)

Further Reading:

If you are interested in learning more about certain methods or algorithms mentioned above, I can recommend the platforms AnalyticsVidhya, MonkeyLearn and of course Medium for further research.
Also, here’s a list with some useful articles in no particular order:
- TF-IDF
- Tokenization (with code)
- Algorithms explained (for document similarity)
- BOW


Hey, thanks for reading my first blog post! I hope you found it helpful and enjoyed it!


Caspar Pagel

A programmer interested in building a better world with AI, science and philosophy