NLTK, which stands for Natural Language Toolkit, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) in the Python programming language.
Developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania, NLTK makes the following operations easy to perform:
- Lexical analysis: Word and text tokenizer
- n-grams and collocations
- Part-of-speech tagger
- Tree model and Text chunker for capturing
- Named-entity recognition
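As a rough illustration of a few of these operations, the sketch below tokenizes a sentence, builds bigrams, and extracts noun-phrase chunks. The sample sentence, the hand-written part-of-speech tags, and the one-rule chunk grammar are assumptions made for the example. It deliberately uses only components that need no extra data downloads; the trained tokenizer and tagger (`nltk.word_tokenize`, `nltk.pos_tag`) require models fetched with `nltk.download` first.

```python
from nltk import RegexpParser
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams

# Sample sentence (an assumption for this example).
text = "The quick brown fox jumps over the lazy dog."

# Lexical analysis: a regex-based tokenizer that separates words from punctuation.
tokens = wordpunct_tokenize(text)

# n-grams: every pair of consecutive tokens.
bigrams = list(ngrams(tokens, 2))

# Chunking: the tags below are written by hand for illustration;
# in practice they would come from a part-of-speech tagger.
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), (".", "."),
]

# One-rule grammar: an optional determiner, any adjectives, then a noun.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged)

# Collect the words under each NP subtree of the resulting parse tree.
noun_phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]

print(tokens)
print(bigrams[:3])
print(noun_phrases)
```

The chunker works on (word, tag) pairs, so the same grammar can be applied to the output of any tagger that emits Penn Treebank style tags.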
NLTK also provides access to more than 50 corpora and lexical resources, including WordNet, as well as a number of other NLP resources.
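A minimal sketch of querying one of these resources, WordNet, is shown here. The helper name `first_definitions` and the query word are hypothetical, chosen for the example. Corpora are not bundled with the library itself, so the sketch assumes the WordNet data may be absent and returns an empty list in that case.

```python
from nltk.corpus import wordnet


def first_definitions(word, limit=3):
    """Return up to `limit` WordNet glosses for `word`.

    Hypothetical helper for illustration. Returns [] when the WordNet
    data has not been installed (install it once with nltk.download("wordnet")).
    """
    try:
        synsets = wordnet.synsets(word)
    except LookupError:
        # Raised by NLTK's lazy corpus loader when the data files are missing.
        return []
    return [synset.definition() for synset in synsets[:limit]]


definitions = first_definitions("language")
for gloss in definitions:
    print(gloss)
```

Catching `LookupError` keeps the example usable on a fresh installation, at the cost of silently returning nothing; a real program would more likely report the missing corpus.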
For more information on NLTK, see the NLTK website and the NLTK Wikipedia page. A complete, free book on NLTK, updated for NLTK 3 and Python 3, is also available online, with the following chapters:
- Preface
- Language Processing and Python
- Accessing Text Corpora and Lexical Resources
- Processing Raw Text
- Writing Structured Programs
- Categorizing and Tagging Words (minor fixes still required)
- Learning to Classify Text
- Extracting Information from Text
- Analyzing Sentence Structure
- Building Feature Based Grammars
- Analyzing the Meaning of Sentences (minor fixes still required)
- Managing Linguistic Data (minor fixes still required)
- Afterword: Facing the Language Challenge