Basics of Natural Language Processing with NLTK

A key element of Artificial Intelligence, Natural Language Processing is the machine manipulation of textual data in order to “understand” it, that is, to analyze it for insights and/or generate new text. In Python, this is most commonly done with NLTK.

The basic operations of Natural Language Processing – NLP – aim at giving a machine some understanding of a text. This generally means obtaining insights from written textual data (which may be spoken language transcribed into text) or generating new text. Text processing works by converting written text into numbers, so that a machine can then apply operations that lead to various results.

With the Python programming language, this is most commonly done through the Natural Language Toolkit library, NLTK. The most fundamental operations can easily be implemented with the functions and code detailed hereafter to obtain powerful results. They can also serve as first steps toward more complex algorithms for machine learning, sentiment analysis, text classification, text generation, etc.

Note that this quick summary makes extensive use of the NLTK tutorial playlist by Sentdex (Harrison Kinsley) on YouTube and the corresponding code on his website, pythonprogramming.net. It also relies on other code, tutorials and resources found around the web and collected here (with links to the original sources), as well as the deeper and more complete presentation of NLTK in the official NLTK book.

Basic NLP vocabulary

To make sure we understand what we are dealing with here, and more generally in NLP literature and code, a basic understanding of the following vocabulary will be required.

Corpus (plural: Corpora): a body of text, generally containing a lot of texts of the same type.

Example: movie reviews, US President “State of the Union” speeches, Bible text, medical journals, all English books, etc.

Lexicon: the words considered and their meanings.

Example: For the entire English language, that would be an English dictionary. It can also be more specific, depending on the context of the NLP at hand, such as business vocabulary, investor vocabulary, investor “bull” vocabulary, etc. In that case “bull” (bull = positive about the market) would have a different meaning than in general English (bull = male animal).

Lemma: in lexicography, a lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

Example: In English, run, runs, ran and running are different forms of run, the lemma by which they are indexed.

Stop words: stop words are words that carry little meaning for data analysis. They are often eliminated during pre-processing, before any data analysis is implemented.

Examples of English stopwords: a, the, and, of…

Part of speech: the part of speech is the function of a word in a sentence: noun, verb, adjective, pronoun, etc.

N-gram: in computational linguistics, an n-gram is a contiguous sequence of n items: phonemes, syllables, letters, words or base pairs, from a given sample of text or speech.

Examples from the Google n-gram corpus:

3-grams:

  • ceramics collectables collectibles
  • ceramics collectables fine
  • ceramics collected by

4-grams:

  • serve as the incoming
  • serve as the incubator
  • serve as the independent

Using NLTK

To load the NLTK library in Python make sure to use the following import command:

import nltk

and the following command, which downloads all the corpora (bodies of text) that can then be used to train NLP machine learning algorithms. This only needs to be done once.

nltk.download()

In the download window that opens, choose “all” and press “Download”.

Note that NLTK is primarily developed for the English language. For support in other languages, you will need to find related corpora and use them to train your NLTK algorithms. However, many resources are available for the most common languages, and NLTK’s stopwords corpus is included for multiple languages.

Tokenization

Since paragraphs often present ideas in a number of sentences, it can be useful to keep track of sentence groups in paragraphs and analyze these sentences and the words they contain. A simple and yet powerful operation that can then be conducted is tokenization.

In natural language processing, tokenization is the operation of separating a string of text into words and/or sentences. The resulting list of words/sentences can then be further analyzed, notably to measure word frequency, which can be a first step to understanding what a text is about.

In Python, tokenization is done with the following code with NLTK:

from nltk.tokenize import sent_tokenize, word_tokenize

example_text = "This is an example text about Mr. Naturalangproc. With a second sentence that's a question?"

sent_tokenize(example_text)

word_tokenize(example_text)

Note that NLTK is programmed to recognize “Mr.” not as the end of a sentence but as a single token. It also treats punctuation marks as tokens in their own right.
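To see what this involves beyond simple splitting, here is a rough plain-Python sketch (not NLTK’s actual algorithm) that keeps “Mr.” as one token and emits punctuation marks as separate tokens; the example sentence is hypothetical:

```python
import re

text = "This is an example about Mr. Smith. Is that a question?"

# Very rough approximation of word_tokenize: keep the known abbreviation
# "Mr." together, match runs of word characters, and emit punctuation
# marks as tokens of their own.
tokens = re.findall(r"Mr\.|\w+|[^\w\s]", text)
print(tokens)
# ['This', 'is', 'an', 'example', 'about', 'Mr.', 'Smith', '.',
#  'Is', 'that', 'a', 'question', '?']
```

A real tokenizer handles many more abbreviations, contractions and edge cases, which is why NLTK’s trained tokenizers are preferred in practice.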

For more information on Tokenization, check this resource from Stanford University, and the detailed presentation of the nltk.tokenize package.

Removing stop words

Stop words tend to be of little use for textual data analysis. Therefore, in the pre-processing phase, you may want to remove them from your token lists in certain cases.

The set of stop words is already defined in NLTK, and it is therefore very easy to use it for data preparation with Python.

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

Just change the “english” parameter to another language to get the list of stopwords in that language.

And here is an example from PythonProgramming of how to use stopwords: removing stopwords from a tokenized sentence.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

# The previous one-liner is equivalent to:
# filtered_sentence = []
# for w in word_tokens:
#     if w not in stop_words:
#         filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
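One detail worth noting: membership in stop_words is case-sensitive, so tokens are commonly lowercased before the test. A minimal plain-Python sketch, with a tiny hand-picked stopword set standing in for NLTK’s much longer English list:

```python
# Tiny illustrative stopword set; NLTK's English list is much larger.
stop_words = {"this", "is", "a", "the", "off"}

tokens = ["This", "is", "a", "sample", "sentence"]

# Lowercase each token before the membership test so "This" is
# filtered out just like "this" would be.
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['sample', 'sentence']
```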

Stemming and lemmatization

Stemming refers to the operation of reverting to a word’s root. Plurals, conjugated verbs or words that correspond to a specific part-of-speech (or function in the sentence), such as adverbs, superlatives, etc. are composed of a stem, which conveys the meaning of the word, with additional affixes providing an indication of function in the sentence.

The same word can thus be present in a corpus under different forms, for example rapid/rapidly or eat/eaten/eating. In order to reduce dataset size and increase relevance, we may want to revert each word to its stem so as to better analyze the vocabulary in a text. That is what stemming does: it removes the different affixes (-ly, -ing, -s, etc.).
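To illustrate the idea (this is a toy sketch, not the Porter algorithm), a stemmer can be pictured as an ordered list of suffix-stripping rules with conditions on the remaining stem:

```python
# Ordered suffix list; real stemmers apply many more rules, with
# conditions on vowel/consonant patterns in the stem.
SUFFIXES = ["ing", "ly", "ed", "s"]

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("rapidly"), toy_stem("eating"), toy_stem("cats"))
# rapid eat cat
```

Real stemmers produce stems that are not always valid words (as the “gener” example below shows), which is the main trade-off against lemmatization.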

In NLTK, the stemmers available are Porter and Snowball, with Snowball (also known as Porter2) being generally favored. Here is how to use them:

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

With the results showing why the Snowball stemmer is better:

>>> print(SnowballStemmer("english").stem("generously"))
generous
>>> print(SnowballStemmer("porter").stem("generously"))
gener

More info on stemmers with NLTK, and how to use them, is available here.

Lemmatization is very similar to stemming, except that the root a word is reverted to is its “lemma”: a valid dictionary word (with a meaning) in its base, singular form, not just the truncated stem of a word. It is therefore generally preferred to stemming, as the results of lemmatization are more natural.

Here is how to code with NLTK:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

Lemmatization returns a valid word; for nouns, that is the singular form.

>>> print(lemmatizer.lemmatize("cactus"))
cactus
>>> print(lemmatizer.lemmatize("geese"))
goose

However, note that lemmatizer.lemmatize() treats words as nouns by default (pos="n"). To lemmatize other parts of speech, you need to set the pos parameter explicitly, e.g. pos="a" for adjectives:

>>> print(lemmatizer.lemmatize("better"))
better
>>> print(lemmatizer.lemmatize("better", pos="a"))
good
>>> print(lemmatizer.lemmatize("best", pos="a"))
best

Parts of speech tagging

Part of Speech (PoS) tagging is labeling the part of speech of every single word of a text, identifying if words are nouns, verbs, adjectives, pronouns, etc. It is used to go deeper into the comprehension of a body of text, allowing the analysis of each word.

The code below, from PythonProgramming, details how to use part-of-speech tagging with NLTK, creating a list of words with their part-of-speech tags. It requires some text to train the unsupervised machine learning tokenizer PunktSentenceTokenizer.

You can notably use the NLTK corpora downloaded above. Check the different NLTK corpora for more text available to train and use PoS tagging.

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))


process_content()

And here is the list of all the different part-of-speech tags used for the English language (the Penn Treebank tag set used by nltk.pos_tag).

POS tag list:

CC	coordinating conjunction
CD	cardinal digit
DT	determiner
EX	existential there (like: "there is" ... think of it like "there exists")
FW	foreign word
IN	preposition/subordinating conjunction
JJ	adjective	'big'
JJR	adjective, comparative	'bigger'
JJS	adjective, superlative	'biggest'
LS	list marker	1)
MD	modal	could, will
NN	noun, singular 'desk'
NNS	noun plural	'desks'
NNP	proper noun, singular	'Harrison'
NNPS	proper noun, plural	'Americans'
PDT	predeterminer	'all the kids'
POS	possessive ending	parent's
PRP	personal pronoun	I, he, she
PRP$	possessive pronoun	my, his, hers
RB	adverb	very, silently,
RBR	adverb, comparative	better
RBS	adverb, superlative	best
RP	particle	give up
TO	to	go 'to' the store.
UH	interjection	errrrrrrrm
VB	verb, base form	take
VBD	verb, past tense	took
VBG	verb, gerund/present participle	taking
VBN	verb, past participle	taken
VBP	verb, sing. present, non-3d	take
VBZ	verb, 3rd person sing. present	takes
WDT	wh-determiner	which
WP	wh-pronoun	who, what
WP$	possessive wh-pronoun	whose
WRB	wh-adverb	where, when
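Once text is tagged, these codes are typically used to filter words by function. For instance, keeping only nouns means keeping tags that start with NN; the tagged pairs below are a hypothetical sample in the format returned by nltk.pos_tag():

```python
# (word, tag) pairs in the format returned by nltk.pos_tag()
tagged = [("Congress", "NNP"), ("passed", "VBD"), ("the", "DT"),
          ("new", "JJ"), ("bill", "NN")]

# NN, NNS, NNP and NNPS all share the NN prefix, so a prefix test
# captures every noun variant at once.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)  # ['Congress', 'bill']
```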

Chunking and chinking

Chunking is the process of extracting parts, or “chunks”, from a body of text with regular expressions. It is used to extract groups of words that would otherwise be split by standard tokenizers. This is especially useful for extracting particular descriptive expressions from a text, such as “noun + adjective” or “noun + verb”, and more complex patterns.

Chunking relies on the following key regular expression quantifiers. More details on regular expressions, with all the different formats, can be found here.

+ = match 1 or more repetitions
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line

Here is an example to define chunks with regular expressions, using the part of speech codes used previously. Chunks can then be defined with much more granular filters through more complex queries.

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

The output of chunking is an NLTK tree containing the matching occurrences of the expression; it can be displayed as a parse tree with .draw(), which opens NLTK’s built-in Tkinter-based viewer.

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)  # 'tagged' comes from the PoS tagging example above
chunked.draw()

A complete example of chunking by Sentdex can be found here with the corresponding video.
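To make the {<NNP>+} part of such a grammar concrete: it groups maximal runs of consecutive proper nouns. A plain-Python sketch of that grouping, on a hypothetical tagged sentence (RegexpParser itself works by matching the grammar against the sequence of tags):

```python
from itertools import groupby

# Hypothetical tagged sentence in pos_tag() format.
tagged = [("Mr", "NNP"), ("Smith", "NNP"), ("ran", "VBD"), ("quickly", "RB")]

# Group consecutive tokens by whether their tag is NNP, and keep
# only the NNP runs -- the equivalent of the {<NNP>+} chunk rule.
chunks = [
    [word for word, _ in group]
    for is_nnp, group in groupby(tagged, key=lambda wt: wt[1] == "NNP")
    if is_nnp
]
print(chunks)  # [['Mr', 'Smith']]
```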

Chinking is the process of removing parts from the chunks defined before: it makes exceptions to the chunk selections, removing sub-chunks from larger chunks. Here is an example of removing a chink from a chunk.

chunkGram = r"""Chunk: {<.*>+}
                        }<VB.?|IN|DT|TO>+{"""

N-grams

An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, such as a string of text. It relies on the analysis of consecutive items in the text: letters, phonemes, words…

In NLP, n-grams are particularly useful for analyzing sequences of words, in order to compute the frequency of word collocations and predict the next possible word in a given request.

In NLTK, use the following code to import the ngrams module:

from nltk.util import ngrams

Make sure you have clean, regular text (no code tags or other markers) before using ngrams, so the text can be processed into tokens and bigrams.

import collections

# first get individual words ('text' is your cleaned input string)
tokenized = text.split()

# and get a list of all the bi-grams.
# Change the parameter for tri-grams, four-grams and so on.
bigrams = ngrams(tokenized, 2)

Now we can analyze the frequencies of bigrams thanks to Python's built-in collections.Counter:

# get the frequency of each bigram in our corpus
bigram_freq = collections.Counter(bigrams)

# what are the ten most common bigrams in this corpus?
bigram_freq.most_common(10)
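Under the hood, ngrams() is essentially a sliding window over the token list; a plain-Python equivalent makes this clear (the example sentence is arbitrary):

```python
import collections

def my_ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigrams = my_ngrams(tokens, 2)
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]

print(collections.Counter(bigrams).most_common(1))
# [(('to', 'be'), 2)]
```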

The same process works for trigrams, four-grams, and so on. More information on this code, from Rachael Tatman on Kaggle, and on how to use ngrams can be found here. And here is also the complete source from NLTK for processing ngrams.

Named Entity Recognition

Named Entity Recognition is used to extract named entities (particular types of proper nouns) from a body of text. The function also returns the type of each named entity, according to the following types.

NE Type	Examples
ORGANIZATION	Georgia-Pacific Corp., WHO
PERSON	Eddy Bonte, President Obama
LOCATION	Murray River, Mount Everest
DATE	June, 2008-06-29
TIME	two fifty a m, 1:30 p.m.
MONEY	175 million Canadian Dollars, GBP 10.40
PERCENT	twenty pct, 18.75 %
FACILITY	Washington Monument, Stonehenge
GPE	South East Asia, Midlothian

Using nltk.ne_chunk(), NLTK returns a tree with the recognized named entities and their types. Adding the parameter binary=True will only mark the named entities, without defining their types. The results can also be drawn as parse trees with .draw().

Here is the example from PythonProgramming.

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))


process_content()

Using corpora

Any natural language processing program needs text to run on. When training machine learning algorithms on natural language, the more text used, the more accurate the model will be, so it is advisable to use entire corpora of text to train better ML algorithms.

Luckily, NLTK comes with a number of lengthy, useful, labeled corpora. To use them, you may download a given corpus as needed, or more simply use the nltk.download() function presented above to download all available NLTK corpora to your local machine.

The NLTK corpora are already formatted as .txt files that can easily be processed by a machine, with hundreds of text datasets. Note that you may want to open them with a formatting tool (such as Notepad++) to view them in a human-readable format. The NLTK corpora notably include:

  • Shakespeare plays
  • Chat messages exchanges
  • Positive/Negative movie reviews
  • Gutenberg Bible
  • State of the Union speeches
  • Sentiwordnet (sentiment database)
  • Twitter samples
  • WordNet: see below

The complete list of datasets available can be found in the NLTK Corpora list.

To open any corpus, use the following command, with the example for the Gutenberg Bible corpus:

from nltk.corpus import gutenberg

text = gutenberg.raw("bible-kjv.txt")

Here is the complete tutorial on how to access and use the NLTK corpora.

Other free corpora can be found here.

Using WordNet

WordNet is a large lexical database of English developed at Princeton. It includes synonyms, antonyms, definitions and example uses of words, and can be used directly in Python programs thanks to NLTK.

To use WordNet use the following import:

from nltk.corpus import wordnet

Then you can use WordNet for a number of purposes, including returning synonyms, antonyms, definitions and example uses. Note that WordNet returns lists, so you can simply pass an index to obtain the first (or any) element.

synonyms = wordnet.synsets("program")

print(synonyms[0].name())
>> plan.n.01

print(synonyms[0].lemmas()[0].name())
>> plan

print(synonyms[0].definition())
>> a series of steps to be carried out or goals to be accomplished

print(synonyms[0].examples())
>> ['they drew up a six-step plan', 'they discussed plans for a new bond issue']


syns = wordnet.synsets("good")

print(syns[0].lemmas()[0].antonyms()[0].name())
>> evilness

WordNet can also compare the semantic similarity between words. It returns a score between 0 and 1 based on the lexical meaning of words, as defined in the Wu and Palmer paper.

w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))
>> 0.9090909090909091

w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('car.n.01')
print(w1.wup_similarity(w2))
>> 0.6956521739130435

w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('cactus.n.01')
print(w1.wup_similarity(w2))
>> 0.38095238095238093
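For intuition, the Wu-Palmer score is computed from depths in the WordNet hypernym taxonomy as 2 * depth(LCS) / (depth(a) + depth(b)), where the LCS is the least common subsumer (the deepest shared ancestor) of the two synsets. A toy computation with made-up depths, not WordNet’s actual values:

```python
# Hypothetical taxonomy depths for illustration only.
depths = {"vehicle": 5, "ship": 7, "boat": 7}

# Assume 'vehicle' is the least common subsumer of 'ship' and 'boat'.
lcs_depth = depths["vehicle"]
wup = 2 * lcs_depth / (depths["ship"] + depths["boat"])
print(round(wup, 3))  # 0.714
```

The closer two synsets sit to their common ancestor, the closer the score gets to 1, which matches the ship/boat vs. ship/cactus results above.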

More information on WordNet can be found on the Princeton WordNet website. Note that even though the original WordNet was built for the English language, usable lexical databases for a number of other languages have also been assembled. More information about WordNets in other languages can be found here.

Lots of simple yet powerful NLP tools. Nice! 🙂 Of course, if you want to dive further into a particular method check the sources and extra resources linked above. Any other basic that should be included? Any update? Which one would you use to build a great NLP tool? Couple it with Machine Learning?

Let me know what you’d build in the comments!
