A key element of Artificial Intelligence, Natural Language Processing is the manipulation of textual data by a machine in order to "understand" it, that is to say, to analyze it to obtain insights and/or generate new text. In Python, this is most commonly done with NLTK.
The basic operations of Natural Language Processing (NLP) aim at reaching some understanding of a text by a machine. This generally means obtaining insights from written textual data (possibly spoken language transcribed into text) or generating new text. Text processing works by converting written text into numbers so that a machine can apply operations to them and produce various results.
With the Python programming language, this is most commonly done through the Natural Language Toolkit library, NLTK. These fundamental operations can easily be implemented with the NLTK functions/code detailed hereafter to obtain powerful results. They can obviously also serve as first steps before more complex algorithms for machine learning, sentiment analysis, text classification, text generation, etc.
Remark that this quick summary makes extensive use of the NLTK tutorial playlist by Sentdex (Harrison Kinsley) on YouTube and the corresponding code on his website, pythonprogramming.net. It also relies on other code, tutorials and resources found around the web and collected here (with links to the original sources), as well as the deeper and more complete presentation of NLTK that can be found in the official NLTK book.
Basic NLP vocabulary
To make sure we understand what we are dealing with here, and more generally in NLP literature and code, a basic grasp of the following vocabulary is required.
Corpus (plural: corpora): a body of text, generally containing many texts of the same type.
Example: movie reviews, US President “State of the Union” speeches, Bible text, medical journals, all English books, etc.
Lexicon: the words considered and their meanings.
Example: for the entire English language, that would be an English dictionary. It can also be more specific, depending on the NLP task at hand, such as a business vocabulary, an investor vocabulary, an investor "bull" vocabulary, etc. In that case "bull" (positive about the market) would have a different meaning than in general English (a male animal).
Lemma: in lexicography, a lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
Example: In English, run, runs, ran and running are different forms of run, the lemma by which they are indexed.
Stop words: words that carry little meaning for data analysis. They are often removed during pre-processing, before any data analysis is implemented.
Examples of English stopwords: a, the, and, of…
Part of speech: the part of speech is the function of a word in a sentence: noun, verb, adjective, pronoun, etc.
N-gram: in computational linguistics, an n-gram is a contiguous sequence of n items: phonemes, syllables, letters, words or base pairs, from a given sample of text or speech.
Examples from the Google n-gram corpus:
3-grams:
- ceramics collectables collectibles
- ceramics collectables fine
- ceramics collected by
4-grams:
- serve as the incoming
- serve as the incubator
- serve as the independent
Using NLTK
To load the NLTK library in Python make sure to use the following import command:
import nltk
and the following command, which will download all the corpora (bodies of text) that can then be used to train NLP machine learning algorithms. This operation only needs to be done once.
nltk.download()
In the download box that opens, choose "all" and press "Download".
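Alternatively, if you do not want to download everything, you can fetch only the resources used in the examples of this guide. Here is a minimal sketch; the selection of resource identifiers is an assumption based on what the code below needs:

import nltk

# Download only the resources used in the examples below
nltk.download('punkt')                       # sentence/word tokenizers
nltk.download('stopwords')                   # stop word lists
nltk.download('wordnet')                     # WordNet lexical database (lemmatizer)
nltk.download('averaged_perceptron_tagger')  # part of speech tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the NE chunker
nltk.download('state_union')                 # State of the Union corpus
nltk.download('gutenberg')                   # Gutenberg corpus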
Remark that NLTK is primarily developed for the English language. For support in other languages, you will need to find related corpora and use them to train your NLTK algorithms. However, many resources are available for the most common languages, and the NLTK stopwords corpus is included for multiple languages.
Tokenization
Since paragraphs often present ideas over several sentences, it can be useful to keep track of the sentences in a paragraph and to analyze these sentences and the words they contain. A simple yet powerful operation for this is tokenization.
In natural language processing, tokenization is the operation of separating a string of text into words and/or sentences. The resulting list of words/sentences can then be further analyzed, notably to measure word frequency, which can be a first step to understanding what a text is about.
In Python, tokenization is done with the following NLTK code:
from nltk import sent_tokenize, word_tokenize

example_text = "This is an example text about Mr. Naturalangproc. With a second sentence that's a question?"

print(sent_tokenize(example_text))
print(word_tokenize(example_text))
Remark that NLTK recognizes "Mr." as a word in itself rather than as the end of a sentence. It also treats punctuation marks as tokens in their own right.
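As a quick illustration of the word-frequency idea mentioned above, here is a minimal sketch (the example sentence is made up) that counts the most common tokens with NLTK's FreqDist:

from nltk import FreqDist, word_tokenize

example_text = "The cat sat on the mat. The cat slept."

# Count how many times each (lowercased) token appears
tokens = word_tokenize(example_text.lower())
freq = FreqDist(tokens)
print(freq.most_common(3))  # the three most frequent tokens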
For more information on Tokenization, check this resource from Stanford University, and the detailed presentation of the nltk.tokenize package.
Removing stop words
Stop words tend to be of little use for textual data analysis. Therefore, in the pre-processing phase, you may want to remove them from your tokens in certain cases.
The set of stop words is already defined in NLTK, and it is therefore very easy to use it for data preparation with Python.
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
Just change the “english” parameter to another language to get the list of stopwords in that language.
And here is an example from PythonProgramming of how to use stopwords: removing stopwords from a tokenized sentence.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if w not in stop_words]

# The previous one-liner is equivalent to:
# filtered_sentence = []
# for w in word_tokens:
#     if w not in stop_words:
#         filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
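In practice, you may also want to lowercase tokens before comparing them to the stop word list (which is in lowercase) and drop punctuation tokens as well. Here is a small sketch building on the example above; the combined filter is just one possible choice:

import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))

# Drop stop words (case-insensitively) and punctuation tokens
filtered = [w for w in word_tokenize(example_sent)
            if w.lower() not in stop_words and w not in string.punctuation]
print(filtered)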
Stemming and lemmatization
Stemming refers to the operation of reducing a word to its root. Plurals, conjugated verbs or words that correspond to a specific part of speech (or function in the sentence), such as adverbs, superlatives, etc., are composed of a stem, which conveys the meaning of the word, plus affixes that indicate its function in the sentence.
The same word can thus be present in a corpus under different forms. Examples could be rapid/rapidly, eat/eaten/eating, etc. So in order to reduce datasets and increase relevance, we may want to revert to the word's stem so as to better analyze the vocabulary of a text. That is to say, stemming the words, which removes the different affixes (-ly, -ing, -s, etc.).
In NLTK, the stemmers available are Porter and Snowball, with Snowball (also called Porter2) being generally favored. Here is how to use them:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
With the results showing why the Snowball stemmer is better:
>>> print(SnowballStemmer("english").stem("generously"))
generous
>>> print(SnowballStemmer("porter").stem("generously"))
gener
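To stem a whole text rather than a single word, simply apply the stemmer to each token. A minimal sketch with a made-up sentence:

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")
new_text = "It is important to be very pythonly while you are pythoning with python."

# Stem every token of the sentence
print([stemmer.stem(w) for w in word_tokenize(new_text)])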
More info on stemmers with NLTK, and how to use them, is available here.
Lemmatization is very similar to stemming, except that the root to which a word is reduced is its "lemma": a valid word (with a meaning) in its canonical, singular form, not just the truncated stem of a word. It is therefore generally preferred to stemming in most cases, as the results of lemmatization are more natural.
Here is how to use it with NLTK:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
Lemmatization will always return a valid word in its singular form.
>>> print(lemmatizer.lemmatize("cactus"))
cactus
>>> print(lemmatizer.lemmatize("geese"))
goose
However, note that lemmatizer.lemmatize() treats the word as a noun by default (pos="n"). To lemmatize other parts of speech, you need to pass the pos parameter explicitly, for example pos="a" for adjectives:
>>> print(lemmatizer.lemmatize("better"))
better
>>> print(lemmatizer.lemmatize("better", pos="a"))
good
>>> print(lemmatizer.lemmatize("best", pos="a"))
best
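Since passing the right pos makes a real difference, a common pattern is to derive it from the part of speech tags covered in the next section. Here is a minimal sketch of that idea; the wordnet_pos helper and the example sentence are hypothetical, and the mapping simply sends Penn Treebank tag prefixes to WordNet's part of speech constants:

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def wordnet_pos(treebank_tag):
    # Hypothetical helper: map a Penn Treebank tag to a WordNet part of speech
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

sentence = "The geese were running better than the dogs."
for word, tag in pos_tag(word_tokenize(sentence)):
    print(word, "->", lemmatizer.lemmatize(word, pos=wordnet_pos(tag)))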
Parts of speech tagging
Part of Speech (PoS) tagging is labeling the part of speech of every single word of a text, identifying if words are nouns, verbs, adjectives, pronouns, etc. It is used to go deeper into the comprehension of a body of text, allowing the analysis of each word.
The code below, from PythonProgramming, details how to use part of speech tagging with NLTK, creating a list of words paired with their part of speech. It requires some text to train the unsupervised sentence tokenizer PunktSentenceTokenizer.
You can notably use the NLTK corpus which was downloaded above. Check the different NLTK corpora for more text available to train and use PoS tagging.
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# Train the sentence tokenizer on one speech, then apply it to another
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        # Tag the words of the first five sentences of the sample text
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
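For a quick test without training a sentence tokenizer, you can also tag a single tokenized sentence directly; a minimal sketch with a made-up sentence:

import nltk
from nltk.tokenize import word_tokenize

sentence = "NLTK makes part of speech tagging straightforward."
print(nltk.pos_tag(word_tokenize(sentence)))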
And here is the list of all the part of speech tags used for the English language:
POS tag list:
- CC: coordinating conjunction
- CD: cardinal digit
- DT: determiner
- EX: existential there (like "there is"... think of it like "there exists")
- FW: foreign word
- IN: preposition/subordinating conjunction
- JJ: adjective ('big')
- JJR: adjective, comparative ('bigger')
- JJS: adjective, superlative ('biggest')
- LS: list marker (1))
- MD: modal (could, will)
- NN: noun, singular ('desk')
- NNS: noun, plural ('desks')
- NNP: proper noun, singular ('Harrison')
- NNPS: proper noun, plural ('Americans')
- PDT: predeterminer ('all the kids')
- POS: possessive ending (parent's)
- PRP: personal pronoun (I, he, she)
- PRP$: possessive pronoun (my, his, hers)
- RB: adverb (very, silently)
- RBR: adverb, comparative (better)
- RBS: adverb, superlative (best)
- RP: particle (give up)
- TO: to (go 'to' the store)
- UH: interjection (errrrrrrrm)
- VB: verb, base form (take)
- VBD: verb, past tense (took)
- VBG: verb, gerund/present participle (taking)
- VBN: verb, past participle (taken)
- VBP: verb, singular present, non-3rd person (take)
- VBZ: verb, 3rd person singular present (takes)
- WDT: wh-determiner (which)
- WP: wh-pronoun (who, what)
- WP$: possessive wh-pronoun (whose)
- WRB: wh-adverb (where, when)
Chunking and chinking
Chunking is the process of extracting parts, or "chunks", from a body of text with regular expressions. It is used to extract groups of words that would otherwise be split by standard tokenizers. This is especially useful to extract particular descriptive expressions from a text, such as "noun + adjective" or "noun + verb", and more complex patterns.
Chunking relies on the following key regular expression operators. More details on regular expressions, with all the different formats, can be found here.
+ = match 1 or more repetitions
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line
Here is an example to define chunks with regular expressions, using the part of speech codes used previously. Chunks can then be defined with much more granular filters through more complex queries.
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
The output of chunking is a tree containing the matching occurrences of the expression; it can be displayed as a parse tree with .draw(), which opens a drawing window.
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}""" chunkParser = nltk.RegexpParser(chunkGram) chunked = chunkParser.parse(tagged) chunked.draw()
A complete example of chunking by Sentdex can be found here with the corresponding video.
Chinking is the process of removing parts from the chunks seen before. Chinking is making exceptions to the chunk selections, removing sub-chunks from larger chunks. Here is an example of removing a chink from a chunk.
chunkGram = r"""Chunk: {<.*>+} }<VB.?|IN|DT|TO>+{"""
N-grams
An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, such as a string of text. It relies on the analysis of consecutive items in the text: letters, phonemes, words…
In NLP, n-grams are particularly useful to analyze sequences of words, so as to compute the frequency of word collocations and predict the next possible word in a given sequence.
In NLTK, use the following code to import the ngrams module:
from nltk.util import ngrams
Make sure you have clean, regular text (no code tags or other markers...) before using ngrams, then split the text into tokens and build the bigrams.
# first get individual words
tokenized = text.split()

# and get a list of all the bi-grams.
# Change the parameter for tri-grams, four-grams and so on.
Bigrams = ngrams(tokenized, 2)
Now we can analyze the frequencies of bigrams thanks to Python's built-in collections.Counter.
import collections

# get the frequency of each bigram in our corpus
BigramFreq = collections.Counter(Bigrams)

# what are the ten most popular ngrams in this corpus?
print(BigramFreq.most_common(10))
The same process would work for trigrams, four-grams, and so on. More information on this code from Rachael Tatman on Kaggle and how to use ngrams can be found here. And also here is the complete source from NLTK to process ngrams.
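To illustrate the "predict the next word" idea mentioned above, here is a minimal sketch using NLTK's ConditionalFreqDist over the bigrams; the toy text is made up:

from nltk import ConditionalFreqDist
from nltk.util import ngrams

text = "the cat sat on the mat and the cat slept on the sofa"
tokenized = text.split()

# Condition each bigram on its first word: cfd[word] counts the possible next words
cfd = ConditionalFreqDist(ngrams(tokenized, 2))
print(cfd["the"].most_common(3))  # most likely words to follow "the"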
Named Entity Recognition
Named Entity Recognition is used to extract named entities (particular types of nouns) out of a body of text. The function also returns the type of each named entity, according to the following categories:
NE Type | Examples
---|---
ORGANIZATION | Georgia-Pacific Corp., WHO
PERSON | Eddy Bonte, President Obama
LOCATION | Murray River, Mount Everest
DATE | June, 2008-06-29
TIME | two fifty a m, 1:30 p.m.
MONEY | 175 million Canadian Dollars, GBP 10.40
PERCENT | twenty pct, 18.75 %
FACILITY | Washington Monument, Stonehenge
GPE | South East Asia, Midlothian
Using nltk.ne_chunk(), NLTK returns a tree with the recognized named entities and their types. Adding the parameter binary=True will only mark the named entities, without specifying their types. The results can also be drawn as parse trees with .draw().
Here is the example from PythonProgramming.
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))

process_content()
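If you want the recognized entities as a list rather than a drawing, you can walk the tree returned by nltk.ne_chunk. A minimal, self-contained sketch with a made-up sentence (without binary=True, so that entity types are kept as subtree labels):

import nltk
from nltk.tokenize import word_tokenize

sentence = "President Obama visited the Washington Monument with officials from the WHO."
tagged = nltk.pos_tag(word_tokenize(sentence))
tree = nltk.ne_chunk(tagged)

# Named entities are the subtrees whose label is an entity type (PERSON, ORGANIZATION, ...)
entities = [(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
            for subtree in tree.subtrees(filter=lambda t: t.label() != 'S')]
print(entities)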
Using corpora
Any natural language processing program will need to run on some text. To train machine learning algorithms on natural language, the more text used, the more accurate the model will be. So it is advised to use entire corpora of text to train better ML algorithms.
Luckily, NLTK comes with a number of lengthy, useful, and labeled corpora. To use them, you may download a given corpus as needed, or more simply use the nltk.download() function presented above to download all the available NLTK corpora and use them on your local machine.
The NLTK corpora are already formatted as .txt files that can easily be processed by a machine, with hundreds of text datasets. Note that you may want to open them with a formatting tool (such as Notepad++) to view them in a human-readable format. The NLTK corpora notably include:
- Shakespeare plays
- Chat message exchanges
- Positive/Negative movie reviews
- Gutenberg Bible
- State of the Union speeches
- Sentiwordnet (sentiment database)
- Twitter samples
- WordNet: see below
The complete list of datasets available can be found in the NLTK Corpora list.
To open any corpus, use the following command, here with the example of the King James Bible text from the Gutenberg corpus:
from nltk.corpus import gutenberg

text = gutenberg.raw("bible-kjv.txt")
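Besides .raw(), the corpus readers also expose the text already split into words and sentences, which saves a tokenization step. A short sketch with the same corpus:

from nltk.corpus import gutenberg

print(gutenberg.fileids()[:5])            # some of the files available in the corpus
words = gutenberg.words("bible-kjv.txt")  # the text as a list of words
sents = gutenberg.sents("bible-kjv.txt")  # the text as a list of sentences
print(len(words), "words,", len(sents), "sentences")
print(sents[0])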
Here is the complete tutorial on how to access and use the NLTK corpora.
Other free corpora can be found here.
Using WordNet
WordNet is a large lexical database of English developed by Princeton University. It includes synonyms, antonyms, definitions and example uses of words, and it can be used directly in Python programs thanks to NLTK.
To use WordNet use the following import:
from nltk.corpus import wordnet
Then you can use WordNet for a number of purposes, including returning synonyms, antonyms, definitions and example uses. Remark that WordNet returns a list of synsets, so you can simply use an index to obtain the first (or any) entry in the list.
synonyms = wordnet.synsets("program")

print(synonyms[0].name())
>> plan.n.01
print(synonyms[0].lemmas()[0].name())
>> plan
print(synonyms[0].definition())
>> a series of steps to be carried out or goals to be accomplished
print(synonyms[0].examples())
>> ['they drew up a six-step plan', 'they discussed plans for a new bond issue']

syns = wordnet.synsets("good")
print(syns[0].lemmas()[0].antonyms()[0].name())
>> evilness
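You can also collect every synonym and antonym of a word across all of its synsets, a pattern similar to the one shown in the PythonProgramming tutorial:

from nltk.corpus import wordnet

synonyms = []
antonyms = []

# Go through every synset of "good" and collect the lemma names and their antonyms
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))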
WordNet can also compare the semantic similarity between words. It returns a similarity score between 0 and 1, based on the lexical meaning of the words, as defined in the Wu and Palmer paper.
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))
>> 0.9090909090909091

w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('car.n.01')
print(w1.wup_similarity(w2))
>> 0.6956521739130435

w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('cactus.n.01')
print(w1.wup_similarity(w2))
>> 0.38095238095238093
More information on WordNet can be found on the Princeton WordNet website. Note that even though the original WordNet was built for the English language, lexical databases for a number of other languages have also been assembled. More information about these other-language WordNets can be found here.
Lots of simple yet powerful NLP tools. Nice! 🙂 Of course, if you want to dive further into a particular method, check the sources and extra resources linked above. Any other basics that should be included? Any updates? Which one would you use to build a great NLP tool? Would you couple it with Machine Learning?
Let me know what you’d build in the comments!