In natural language processing (NLP), tokenization is the operation of splitting a string of text into words and/or sentences. The resulting list of words or sentences can then be analyzed further, for example to measure word frequency, which is often a first step toward understanding what a text is about.
See the basics of NLP with NLTK to program tokenization in Python.
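As a concrete illustration, here is a minimal sketch of sentence and word tokenization plus a frequency count using NLTK (the library mentioned above). The sample text is illustrative, and the sketch assumes the punkt tokenizer models have been downloaded (some newer NLTK versions name the resource "punkt_tab" instead of "punkt"):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the punkt sentence tokenizer models if not already present.
# (Newer NLTK versions may require "punkt_tab" instead.)
nltk.download("punkt", quiet=True)

text = "Tokenization splits text into units. It is often the first step in NLP."

# Sentence tokenization: split the text into a list of sentences.
sentences = sent_tokenize(text)
print(sentences)
# ['Tokenization splits text into units.', 'It is often the first step in NLP.']

# Word tokenization: split the text into words and punctuation tokens.
words = word_tokenize(text)
print(words)

# Word frequency: count how often each alphabetic token appears.
freq = nltk.FreqDist(w.lower() for w in words if w.isalpha())
print(freq.most_common(3))
```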
Read more about tokenization on Wikipedia.