Everything you need to start with Preprocessing for NLP
NLP stands for Natural Language Processing. As the name suggests, it means converting human (natural) language into something a machine can understand and act on, for tasks like summarization, chatbots, translation, and search.
Just as we clean data in a ‘pandas dataframe’ before feeding it to a model, raw text also has to be preprocessed.
Take a look at this example -
I am a natural swimmer. I love swimming. Who doesn’t like to swim?
It’s easy to understand that it is about swimming. If someone you know can’t comprehend it, you can translate it into a language they are comfortable with. But is it the same for a machine?
‘swimmer’, ‘swimming’, ‘swim’
The essence of those 3 words is the same; they are just different forms and parts of speech. A machine, however, converts each character to its ASCII code and then to binary, so to it all three are different. Compare ‘art’ with ‘artist’, and then ‘swim’ with ‘artist’. You can tell that the first pair is related and the second is not; a machine cannot. That is why we preprocess: we prep the raw data before using it for the actual process.
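To make that concrete, here is a minimal sketch of what the three words look like once they are turned into bytes. The helper name `to_binary` is my own, made up for illustration.

```python
def to_binary(word: str) -> str:
    """Return the bytes of `word` as space-separated 8-bit binary strings."""
    return " ".join(format(byte, "08b") for byte in word.encode("utf-8"))


for word in ("swim", "swimming", "swimmer"):
    print(f"{word:>9}: {to_binary(word)}")

# The words share a prefix, but compared as whole values they are simply unequal.
print("swim" == "swimming")
```

Nothing in those bit patterns says that the words are related, which is the gap preprocessing tries to close.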
Before you begin to preprocess, understand your need. That will help you decide which of the following steps should be skipped, reduced, or expanded. How relevant is each one to your need?
- Case. Are “Great” and “great” supposed to be the same or different? Case folding is the reduction of a word to lowercase.
- Numbers. Convert 1 to one, one to 1, or remove them.
- Punctuation. How relevant and how many do you want to keep?
- Stopwords. Words that occur frequently and carry little weight, like a, an, the, and, at. Remove them, or customize the list to keep the ones you need.
- URLs, hashtags, handles, mentions, etc. Remove or keep as needed.
- ‘swim’, ‘swimming’, ‘swimmer’. Want to reduce each word to its dictionary root form? That is lemmatization. Want to simply strip the suffixes from the words? That is stemming.
- So far I have been asking you to analyze parts of the string, such as words. Splitting text into words is word tokenization.
- Do you need sentences for further steps? Then you need sentence tokenization.
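Several items on this checklist can be sketched with nothing but the Python standard library. The version below is only one possible arrangement: the function names, the regular expressions, and the choice to remove (rather than keep or convert) numbers, punctuation, URLs, hashtags, and handles are all my assumptions, and the stopword list is just the five words mentioned above.

```python
import re
import string
from typing import List

# Illustrative stopword list taken from the examples above; real lists are longer.
STOPWORDS = {"a", "an", "the", "and", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
SOCIAL_RE = re.compile(r"[@#]\w+")  # handles, mentions, and hashtags
NUMBER_RE = re.compile(r"\d+")
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Naive sentence tokenization: split after '.', '!' or '?' plus whitespace."""
    return [s for s in SENTENCE_END_RE.split(text.strip()) if s]


def preprocess(text: str) -> List[str]:
    """Lowercase, strip URLs/handles/numbers/punctuation, tokenize, drop stopwords."""
    text = text.lower()                # case folding
    text = URL_RE.sub(" ", text)       # remove URLs
    text = SOCIAL_RE.sub(" ", text)    # remove hashtags and handles
    text = NUMBER_RE.sub(" ", text)    # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()              # whitespace word tokenization
    return [t for t in tokens if t not in STOPWORDS]


sample = "I am a natural swimmer. I love swimming. Who doesn't like to swim?"
print(split_sentences(sample))
print(preprocess(sample))
```

If your task needs numbers or hashtags, drop the matching line instead of running every step. Note that `string.punctuation` only covers ASCII punctuation, so curly quotes would survive this sketch.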
Have a list of everything you need to do? Refer to NLTK, which contains packages to help with each of these steps.
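As a starting point, here is a small sketch that uses two NLTK pieces that, to my knowledge, work without downloading extra data: `RegexpTokenizer` and `PorterStemmer`. NLTK also offers `word_tokenize`, `sent_tokenize`, and `WordNetLemmatizer`, but those need resources fetched with `nltk.download` first, so I leave them out here. The function name `tokenize_and_stem` is made up for this example.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Keep runs of letters, digits, and underscores; punctuation is dropped.
tokenizer = RegexpTokenizer(r"\w+")
stemmer = PorterStemmer()


def tokenize_and_stem(text):
    """Lowercase the text, split it into word tokens, and stem each token."""
    tokens = tokenizer.tokenize(text.lower())
    return [stemmer.stem(token) for token in tokens]


print(tokenize_and_stem("I love swimming. Who doesn't like to swim?"))
```

Stemming only chops suffixes by rule, so check its output on your own vocabulary before deciding between it and lemmatization.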
Thanks for reading! Have a nice day!