Natural language processing (NLP) using Python NLTK (Simple Examples) (2024)

The Natural Language Toolkit, or NLTK, is a Python library created for symbolic and statistical natural language processing tasks.

It aims to make natural language processing accessible to everyone, for English as well as for other human languages.

Table of Contents

  • 1 Installing Python NLTK
  • 2 Text Preprocessing
  • 3 Sentence and word tokenization
  • 4 Stopwords removal
  • 5 Stemming
  • 6 Lemmatization
  • 7 Part of Speech tagging
  • 8 Named Entity Recognition
  • 9 Understanding synsets
  • 10 Semantic relationships
  • 11 Measuring semantic similarity
  • 12 Context-free grammar
  • 13 Parse trees
  • 14 Chunking
  • 15 Chinking
  • 16 N-Grams
  • 17 Sentiment Analysis
  • 18 Information Retrieval
  • 19 Frequency Distribution
  • 20 Further Reading

Installing Python NLTK

To get started, you need to install NLTK on your computer. In a Jupyter notebook, run the following command (in a regular terminal, drop the leading !):

!pip install nltk

After installation, you need to import NLTK and download the necessary packages.

import nltk

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('words')

Here’s the output that you should expect:

[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/user/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to...

The above commands download several NLTK packages using nltk.download().

You will need these to perform tasks such as part of speech tagging, stopword removal, and lemmatization.
With the Natural Language Toolkit installed, we are now ready to explore the next steps of preprocessing.

Text Preprocessing

Text preprocessing is the practice of cleaning and preparing text data for machine learning algorithms. The primary steps include tokenizing, removing stop words, stemming, lemmatizing, and more.

These steps help reduce the complexity of the data and extract meaningful information from it.
In the coming sections of this tutorial, we’ll walk you through each of these steps using NLTK.

Sentence and word tokenization

Tokenization is the process of breaking down text into words, phrases, symbols, or other meaningful elements called tokens. The input to the tokenizer is a unicode text, and the output is a list of sentences or words.

In NLTK, we have two types of tokenizers – the word tokenizer and the sentence tokenizer.
Let’s see an example:

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Natural language processing is fascinating. It involves many tasks such as text classification, sentiment analysis, and more."

sentences = sent_tokenize(text)
print(sentences)

words = word_tokenize(text)
print(words)

Output:

['Natural language processing is fascinating.', 'It involves many tasks such as text classification, sentiment analysis, and more.']
['Natural', 'language', 'processing', 'is', 'fascinating', '.', 'It', 'involves', 'many', 'tasks', 'such', 'as', 'text', 'classification', ',', 'sentiment', 'analysis', ',', 'and', 'more', '.']

The sent_tokenize function splits the text into sentences, and the word_tokenize function splits the text into words. As you can see, punctuation is also treated as a separate token.
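To make the idea of tokens more concrete, here is a minimal plain-Python sketch that splits a sentence with a regular expression. It is only an approximation for illustration: the pattern and the helper name simple_word_tokenize are assumptions of this sketch, and NLTK's word_tokenize applies a more elaborate set of rules (for contractions and abbreviations, for instance).

```python
import re


def simple_word_tokenize(text):
    """Split text into words and single punctuation marks.

    Hypothetical helper for illustration; this is not how NLTK tokenizes.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_word_tokenize("Natural language processing is fascinating. It involves many tasks.")
print(tokens)
```

As in the NLTK output above, each punctuation mark ends up as its own token.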

Stopwords removal

In natural language processing, stopwords are words that you want to ignore, so you filter them out when you’re processing your text.

These are usually words that occur very frequently in any text and do not convey much meaning, such as “is”, “an”, “the”, “in”, etc.
NLTK comes with a predefined list of stopwords in several languages, including English.
Let’s use NLTK to filter out stopwords from our list of tokenized words:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Natural language processing is fascinating. It involves many tasks such as text classification, sentiment analysis, and more."

stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_words = [word for word in words if word.casefold() not in stop_words]
print(filtered_words)

Output:

['Natural', 'language', 'processing', 'fascinating', '.', 'involves', 'many', 'tasks', 'text', 'classification', ',', 'sentiment', 'analysis', ',', '.']

In this piece of code, we first import the stopwords from NLTK, tokenize the text, and then filter out the stopwords. The casefold() method is used to ignore the case while comparing words to the stop words list.
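The effect of casefold() is easiest to see side by side. The sketch below uses a tiny hand-written stop list (an assumption for illustration; NLTK's English list is much longer) and filters the same tokens with and without case normalization.

```python
# A tiny hand-written stop list, used here only for illustration.
stop_words = {"is", "it", "such", "as", "and", "more"}

words = ["Natural", "language", "processing", "is", "fascinating", ".",
         "It", "involves", "many", "tasks"]

# Case-sensitive lookup keeps "It", because only "it" is in the set.
case_sensitive = [w for w in words if w not in stop_words]

# casefold() normalizes case before the lookup, so "It" is removed too.
case_insensitive = [w for w in words if w.casefold() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Stop lists are stored in lowercase, so skipping the normalization step quietly leaves capitalized stopwords such as sentence-initial "It" in the result.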

Stemming

Stemming is the process of reducing inflected words (such as “running” or “runs”) to their root form (e.g., “run”). The ‘root’ in this case may not be a real word, just a canonical form of the original; in the output below, “was” becomes “wa”. NLTK provides several well-known stemmers, such as PorterStemmer.
Here’s how to use NLTK’s PorterStemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

porter_stemmer = PorterStemmer()
words = word_tokenize(text)
stemmed_words = [porter_stemmer.stem(word) for word in words]
print(stemmed_words)

Output:

['he', 'wa', 'run', 'and', 'eat', 'at', 'same', 'time', '.', 'he', 'ha', 'bad', 'habit', 'of', 'swim', 'after', 'play', 'long', 'hour', 'in', 'the', 'sun', '.']

In this piece of code, we first tokenize the text, and then we pass each word into the stem function of our stemmer.

Note how the words “running”, “eating”, “swimming”, and “playing” have been reduced to their root form: “run”, “eat”, “swim”, and “play”, respectively.

Lemmatization

Lemmatization is a process that takes into consideration the morphological analysis of the words and efficiently reduces a word to its base or root form.

Unlike stemming, it reduces inflected words properly, ensuring that the root word, also known as the lemma, belongs to the language.
We’ll use the WordNet lexical database for lemmatization. WordNetLemmatizer is a class which gets the lemma of a word.
Here’s an example:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

lemmatizer = WordNetLemmatizer()
words = word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)

Output:

['He', 'wa', 'running', 'and', 'eating', 'at', 'same', 'time', '.', 'He', 'ha', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'long', 'hour', 'in', 'the', 'Sun', '.']

In this code, we first tokenize the text, and then we pass each word into the lemmatize function of our lemmatizer.

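Note how “was” and “has” came out as “wa” and “ha”, while “running” and “eating” were left unchanged. These are not real lemmas: lemmatize treats every word as a noun unless told otherwise, so it only strips what looks like a plural ending. Passing the part of speech fixes this; for example, lemmatizer.lemmatize('was', pos='v') returns “be” and lemmatizer.lemmatize('running', pos='v') returns “run”.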

Part of Speech tagging

Part of speech (POS) tagging is the process of marking a word in a text as corresponding to a particular part of speech (noun, verb, adjective, etc.), based on both its definition and its context.

The NLTK library has a function called pos_tag to label words with a part of speech descriptor.
Let’s see it in action:

from nltk.tokenize import word_tokenize
from nltk import pos_tag

text = "Natural language processing is fascinating. It involves many tasks such as text classification, sentiment analysis, and more."

words = word_tokenize(text)
tagged_words = pos_tag(words)
print(tagged_words)

Output:

[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ('fascinating', 'VBG'), ('.', '.'), ('It', 'PRP'), ('involves', 'VBZ'), ('many', 'JJ'), ('tasks', 'NNS'), ('such', 'JJ'), ('as', 'IN'), ('text', 'NN'), ('classification', 'NN'), (',', ','), ('sentiment', 'NN'), ('analysis', 'NN'), (',', ','), ('and', 'CC'), ('more', 'JJR'), ('.', '.')]

The pos_tag function returns a list of tuples, each pairing a word with a tag that represents its part of speech. For instance, ‘NN’ stands for a singular noun, ‘JJ’ is an adjective, ‘VBZ’ is a third-person singular present-tense verb, and so on.

Here’s a list of some common POS (Part of Speech) tags used in NLTK, along with their meaning:

Tag     Meaning
CC      Coordinating conjunction
CD      Cardinal number
DT      Determiner
EX      Existential there
FW      Foreign word
IN      Preposition or subordinating conjunction
JJ      Adjective
JJR     Adjective, comparative
JJS     Adjective, superlative
LS      List item marker
MD      Modal
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
PDT     Predeterminer
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
SYM     Symbol
TO      to
UH      Interjection
VB      Verb, base form
VBD     Verb, past tense
VBG     Verb, gerund or present participle
VBN     Verb, past participle
VBP     Verb, non-3rd person singular present
VBZ     Verb, 3rd person singular present
WDT     Wh-determiner
WP      Wh-pronoun
WP$     Possessive wh-pronoun
WRB     Wh-adverb

These tags are part of the Penn Treebank tagset.
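Because related tags share a prefix (all noun tags start with NN, all verb tags with VB, and so on), it is easy to collapse them into coarser classes. Here is a small sketch that groups a few of the tagged words from the output above; the COARSE mapping and its class names are assumptions made for this example, not part of NLTK.

```python
from collections import defaultdict

# (word, tag) pairs copied from the pos_tag output shown earlier.
tagged_words = [
    ("Natural", "JJ"), ("language", "NN"), ("processing", "NN"),
    ("is", "VBZ"), ("fascinating", "VBG"), ("tasks", "NNS"),
]

# Coarse classes keyed by the first two letters of the tag.
# This mapping is an assumption of the sketch.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

groups = defaultdict(list)
for word, tag in tagged_words:
    groups[COARSE.get(tag[:2], "other")].append(word)

print(dict(groups))
```

A grouping like this is handy when you only care about, say, all nouns, regardless of whether they are singular, plural, or proper.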

Named Entity Recognition

Named Entity Recognition (NER) is the process of locating and classifying named entities present in your text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Let’s try out a simple example:

from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

text = "John works at Google in Mountain View, California."

words = word_tokenize(text)
tagged_words = pos_tag(words)
named_entities = ne_chunk(tagged_words)
print(named_entities)

The output will be a tree with named entities as subtrees. The label of the subtree will indicate the type of the entity (e.g., PERSON, ORGANIZATION, GPE). For instance:

(S (PERSON John/NNP) works/VBZ at/IN (ORGANIZATION Google/NNP) in/IN (GPE Mountain/NNP View/NNP) ,/, (GPE California/NNP) ./.)

In this code, we first tokenize the text and then tag each word with its part of speech.

The ne_chunk function then identifies the named entities. In the result, ‘John’ is recognized as a person, ‘Google’ as an organization, and ‘Mountain View’ and ‘California’ as geographical locations.

Understanding synsets

A synset (or synonym set) is a collection of synonyms that are interchangeable in some contexts.

These are a very useful resource for building knowledge graphs, semantic links, or for finding the meaning of a word in a context.

NLTK provides an interface to the WordNet API, which can be used to look up words and their synonyms, definitions, and examples.
Let’s demonstrate how to use this:

from nltk.corpus import wordnet

syn = wordnet.synsets("dog")[0]

print(f"Synset name: {syn.name()}")
print(f"Lemma names: {syn.lemma_names()}")
print(f"Definition: {syn.definition()}")
print(f"Examples: {syn.examples()}")

Output:

Synset name: dog.n.01
Lemma names: ['dog', 'domestic_dog', 'Canis_familiaris']
Definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
Examples: ['the dog barked all night']

In this code, wordnet.synsets("dog")[0] gives us the first synset of the word “dog”. The name method returns the name of the synset, lemma_names gives all synonyms, definition provides a brief definition, and examples provide usage examples.

Semantic relationships

Semantic relationships between words are an integral part of natural language understanding tasks. NLTK provides easy-to-use interfaces to explore these relationships:

  • Hyponyms: More specific terms. For example, ‘poodle’ is a hyponym of ‘dog’.
  • Hypernyms: More general terms. For example, ‘dog’ is a hypernym of ‘poodle’.
  • Antonyms: Opposite terms. For example, ‘good’ is an antonym of ‘evil’.

Here’s how to explore these relationships using NLTK:

from nltk.corpus import wordnet

syn = wordnet.synsets('dog')[0]

# Get hyponyms for dog
hyponyms = syn.hyponyms()
print("Hyponyms of 'dog':", [h.lemmas()[0].name() for h in hyponyms])

syn = wordnet.synsets('poodle')[0]

# Get hypernyms for poodle
hypernyms = syn.hypernyms()
print("Hypernyms of 'poodle':", [h.lemmas()[0].name() for h in hypernyms])

# Get antonyms
synsets = wordnet.synsets('good')
antonym = None

# Search for an antonym in all synsets/lemmas
for syn in synsets:
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonym = lemma.antonyms()[0].name()
            break
    if antonym:
        break

if antonym:
    print("Antonym of 'good':", antonym)
else:
    print("No antonym found for 'good'")

Output:

Hyponyms of 'dog': ['basenji', 'corgi', 'cur', 'dalmatian', 'Great_Pyrenees', 'griffon', 'hunting_dog', 'lapdog', 'Leonberg', 'Mexican_hairless', 'Newfoundland', 'pooch', 'poodle', 'pug', 'puppy', 'spitz', 'toy_dog', 'working_dog']
Hypernyms of 'poodle': ['dog']
Antonym of 'good': evil

The hyponyms method gives a list of more specific terms (hyponyms), while hypernyms gives a list of more general terms (hypernyms).

For antonyms, the code first iterates over all synsets of “good”, and then over all lemmas of each synset. As soon as it finds a lemma with an antonym, it breaks out of both loops and prints the antonym.

If no antonym is found after checking all synsets and lemmas, it prints a message to indicate that no antonym was found.

Measuring semantic similarity

We can also measure the semantic similarity between two words based on the distance between these words in the hypernym tree.
Here is an example:

from nltk.corpus import wordnet

# Get the first synset for each word
dog = wordnet.synsets('dog')[0]
cat = wordnet.synsets('cat')[0]

# Get the similarity value
similarity = dog.path_similarity(cat)
print("Semantic similarity between 'dog' and 'cat':", similarity)

Output:

Semantic similarity between 'dog' and 'cat': 0.2

In this code, we first get the first synset of each word using wordnet.synsets(). Then we measure the semantic similarity between these synsets using path_similarity().

In this example, we are only comparing the first sense of each word. If you want a more comprehensive measure of similarity, you may need to compare all senses of the words and possibly aggregate the similarity scores in some way.

Here’s an example of how to do this:

from nltk.corpus import wordnet

# Get all synsets for each word
synsets_dog = wordnet.synsets('dog')
synsets_cat = wordnet.synsets('cat')

# Initialize max similarity
max_similarity = 0

# Compare all pairs of synsets
for synset_dog in synsets_dog:
    for synset_cat in synsets_cat:
        similarity = synset_dog.path_similarity(synset_cat)
        if similarity is not None:  # If the words are connected in the hypernym/hyponym taxonomy
            max_similarity = max(max_similarity, similarity)

print("Comprehensive semantic similarity between 'dog' and 'cat': ", max_similarity)

Output:

Comprehensive semantic similarity between 'dog' and 'cat': 0.2

In this script, we first get all synsets of each word using wordnet.synsets(). Then we initialize the max similarity to 0.

We compare all pairs of synsets and update the max similarity each time we find a higher similarity. Finally, we print the max similarity.
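To see where a score such as 0.2 comes from, it helps to compute it by hand. As far as we understand it, path_similarity is 1 / (d + 1), where d is the number of edges on the shortest path between the two synsets. The sketch below applies that formula to a tiny hand-written excerpt of the hierarchy; the HYPERNYM dictionary is a simplification made for this example, not real WordNet data access.

```python
from collections import deque

# Hand-written, simplified excerpt of the noun hierarchy (child -> parent).
# The real WordNet graph is far larger; these links are illustrative.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
}


def shortest_path_length(a, b):
    """Breadth-first search over the hypernym links, in both directions."""
    neighbors = {}
    for child, parent in HYPERNYM.items():
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two terms are not connected


def path_similarity(a, b):
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)


print(path_similarity("dog", "cat"))  # 4 edges apart, so 1 / 5
```

In this toy graph, “dog” and “cat” meet at “carnivore” after two steps each, which gives four edges and a similarity of 0.2.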

Context-free grammar

In natural language processing, a context-free grammar (CFG) is a formal grammar which is used to generate all possible sentences in a given formal language.
Here’s how you can define a CFG in NLTK and generate sentences from it:

from nltk import CFG
from nltk.parse.generate import generate

grammar = CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "dog" | "cat" | "cookie" | "park"
    P -> "in" | "on" | "by" | "with"
""")

for sentence in generate(grammar, n=10):  # generating only 10 sentences
    print(' '.join(sentence))

Output:

John saw John
John saw Mary
John saw Bob
John saw a dog
John saw a cat
John saw a cookie
John saw a park
John saw an dog
John saw an cat
John saw an cookie

In this code, we first define a context-free grammar in NLTK using the CFG.fromstring method.

The string contains the rules of the CFG in the format "LHS -> RHS", where LHS is a single non-terminal symbol, and RHS is a sequence of terminal and non-terminal symbols.

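Then we generate sentences from the CFG using the nltk.parse.generate.generate function. Note that the grammar knows nothing about agreement, which is why it happily produces “an dog”.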
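If you are curious how generating sentences from rules works, here is a minimal plain-Python sketch. It uses a reduced, non-recursive version of the grammar above, stored as a dictionary, and expands each symbol into every word sequence it can produce. The GRAMMAR dictionary and the expand helper are assumptions of this sketch; NLTK's generate function is more general and, for example, accepts a depth limit for recursive grammars.

```python
from itertools import product

# A reduced, non-recursive version of the grammar above: each
# non-terminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["John"], ["Det", "N"]],
    "Det": [["a"], ["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["saw"]],
}


def expand(symbol):
    """Yield every word sequence that can be derived from symbol."""
    if symbol not in GRAMMAR:  # a terminal stands for itself
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:  # try each alternative in order
        # Combine every expansion of each symbol on the right-hand side.
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield [word for part in parts for word in part]


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))
print(sentences[:3])
```

With 5 possible noun phrases on each side of the verb, this small grammar yields 25 sentences in total.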

Parse trees

A parse tree or parsing tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar.
Here’s how you can generate a parse tree for a sentence given a context-free grammar:

from nltk import CFG
from nltk.parse import RecursiveDescentParser

grammar = CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "dog" | "cat" | "cookie" | "park"
    P -> "in" | "on" | "by" | "with"
""")

rd_parser = RecursiveDescentParser(grammar)
sentence = 'John saw a cat'.split()

for tree in rd_parser.parse(sentence):
    print(tree)

Output:

(S (NP John) (VP (V saw) (NP (Det a) (N cat))))

In this code, we first define a context-free grammar in NLTK using the CFG.fromstring method. Then we create a RecursiveDescentParser instance with the given grammar.

After that, we provide a sentence as a list of words to the parse method of the RecursiveDescentParser instance. This method returns a generator which generates all possible parse trees for the given sentence.

Chunking

Chunking is the process of extracting phrases from unstructured text. Individual tokens do not always capture the meaning of the text, so it is often better to treat a phrase such as “South Africa” as a single unit instead of the separate words ‘South’ and ‘Africa’.
Here’s how you can do noun phrase chunking in NLTK:

from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.chunk import RegexpParser

sentence = "The big cat ate the little mouse who was after fresh cheese"

# PoS tagging
tagged = pos_tag(word_tokenize(sentence))

# Define your grammar using regular expressions
grammar = '''
    NP: {<DT>?<JJ>*<NN>}  # NP
'''

chunk_parser = RegexpParser(grammar)
result = chunk_parser.parse(tagged)
print(result)

Output:

(S (NP The/DT big/JJ cat/NN) ate/VBD (NP the/DT little/JJ mouse/NN) who/WP was/VBD after/IN (NP fresh/JJ cheese/NN))

In this code, we first tokenized and PoS tagged our sentence. Then we defined a grammar for a noun phrase (NP): an optional determiner (DT), followed by any number of adjectives (JJ), and then a singular noun (NN).

Then we created a chunk parser with this grammar using RegexpParser, and finally parsed our tagged sentence.
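The tag pattern in the grammar reads much like an ordinary regular expression, and you can mimic the idea in plain Python. The sketch below joins the tags into one string and lets the re module find noun-phrase matches in it. The tagged list is copied by hand from the output above, and np_chunks is a hypothetical helper loosely inspired by how tag patterns are matched; it is not NLTK's implementation.

```python
import re

# (word, tag) pairs copied by hand from the chunking output above.
tagged = [
    ("The", "DT"), ("big", "JJ"), ("cat", "NN"), ("ate", "VBD"),
    ("the", "DT"), ("little", "JJ"), ("mouse", "NN"), ("who", "WP"),
    ("was", "VBD"), ("after", "IN"), ("fresh", "JJ"), ("cheese", "NN"),
]

# Optional determiner, any number of adjectives, then a singular noun.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def np_chunks(tagged):
    """Return the words of every noun phrase matched by NP_PATTERN."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each tag starts with "<", so counting "<" characters before an
        # offset converts a character position into a token index.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


print(np_chunks(tagged))
```

The three matches correspond to the three NP subtrees in the NLTK result: “The big cat”, “the little mouse”, and “fresh cheese”.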

Chinking

Chinking is the process of removing a sequence of tokens from a chunk. If the matching sequence of tokens spans an entire chunk, then the entire chunk is removed; if the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before.

If the sequence is at the periphery of the chunk, these tokens are removed, and a smaller chunk remains.
Here’s how you can do chinking in NLTK:

from nltk import pos_tag, RegexpParser
from nltk.tokenize import word_tokenize

sentence = "The big cat ate the little mouse who was after fresh cheese"

# PoS tagging
tagged = pos_tag(word_tokenize(sentence))

# We are removing from the chunk one or more verbs, prepositions, determiners, or the word 'to'.
grammar = r"""
    NP:
        {<.*>+}              # Chunk everything
        }<VBD|IN|DT|TO>+{    # Chink sequences of VBD, IN, DT, TO
"""

chunk_parser = RegexpParser(grammar)
result = chunk_parser.parse(tagged)
print(result)

Output:

(S The/DT (NP big/JJ cat/NN) ate/VBD the/DT (NP little/JJ mouse/NN who/WP) was/VBD after/IN (NP fresh/JJ cheese/NN))

In this code, we first tokenize and PoS tag our sentence. Then we define a grammar for chinking: we are removing from the chunk one or more verbs, prepositions, determiners, or the word ‘to’.

Then we create a chunk parser with this grammar using RegexpParser, and finally parse our tagged sentence.
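Chinking can be pictured as cutting holes into one big chunk. The following plain-Python sketch starts from the whole tagged sentence and keeps only the runs of tokens whose tags are not in the chink set. The tagged list is written out by hand using the tags from the chunking section, and the chink helper is an assumption of this sketch rather than NLTK's own code.

```python
from itertools import groupby

# (word, tag) pairs written out by hand, using the tags shown earlier.
tagged = [
    ("The", "DT"), ("big", "JJ"), ("cat", "NN"), ("ate", "VBD"),
    ("the", "DT"), ("little", "JJ"), ("mouse", "NN"), ("who", "WP"),
    ("was", "VBD"), ("after", "IN"), ("fresh", "JJ"), ("cheese", "NN"),
]

CHINK_TAGS = {"VBD", "IN", "DT", "TO"}


def chink(tagged, chink_tags):
    """Treat everything as one chunk, then cut out runs of chink tags."""
    chunks = []
    # groupby splits the sentence into alternating runs of
    # chinked and non-chinked tokens.
    for is_chink, group in groupby(tagged, key=lambda pair: pair[1] in chink_tags):
        if not is_chink:
            chunks.append([word for word, _ in group])
    return chunks


print(chink(tagged, CHINK_TAGS))
```

Removing the determiners, verbs, and the preposition leaves three chunks: “big cat”, “little mouse who”, and “fresh cheese”.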

N-Grams

N-grams of texts are extensively used in text mining and natural language processing tasks.

An n-gram is a sequence of n consecutive words. To compute the n-grams of a text, you slide a window of n words across it, typically moving one word forward at each step.
Here’s how to generate bigrams using NLTK:

from nltk import ngrams
from nltk.tokenize import word_tokenize

sentence = "The big cat ate the little mouse who was after fresh cheese"

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Generate bigrams
bigrams = list(ngrams(tokens, 2))
print(bigrams)

Output:

[('The', 'big'), ('big', 'cat'), ('cat', 'ate'), ('ate', 'the'), ('the', 'little'), ('little', 'mouse'), ('mouse', 'who'), ('who', 'was'), ('was', 'after'), ('after', 'fresh'), ('fresh', 'cheese')]

In the code above, we first tokenize our sentence, then generate bigrams using the ngrams function from NLTK.

The second argument to the ngrams function is the number of grams, in this case, 2. Hence, we get pairs of consecutive words.
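The sliding window itself takes only a couple of lines of plain Python. This sketch (sliding_ngrams is a hypothetical helper name, not an NLTK function) builds n shifted copies of the token list and zips them together.

```python
def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so incomplete windows
    # at the end of the list are dropped automatically.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "The big cat ate the little mouse".split()
print(sliding_ngrams(tokens, 2))
print(sliding_ngrams(tokens, 3))
```

A list of k tokens produces k - n + 1 n-grams, so the seven tokens here give six bigrams and five trigrams.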

Sentiment Analysis

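For sentiment analysis, NLTK has a built-in module, nltk.sentiment.vader. VADER is a rule-based analyzer: it combines a sentiment lexicon, whose word scores were rated by human annotators, with grammatical heuristics for things like negation, intensifiers, punctuation, and capitalization. It needs the vader_lexicon resource, which the installation step above did not download, so run nltk.download('vader_lexicon') first.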

Here’s a basic example of how you can perform sentiment analysis using NLTK:

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
text = "Python is an awesome programming language."
print(sia.polarity_scores(text))

Output:

{'neg': 0.0, 'neu': 0.439, 'pos': 0.561, 'compound': 0.6249}

In the code above, we first create a SentimentIntensityAnalyzer object. Then we feed a piece of text to the analyzer and print the resulting sentiment scores.

The output is a dictionary that contains four keys: neg, neu, pos, and compound. The neg, neu, and pos values represent the proportions of negative, neutral, and positive sentiment in the text, respectively.

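The compound score is a summary metric for the overall sentiment of the text. The valence scores of the individual words are summed, adjusted by the heuristics, and then normalized to a value between -1 (most negative) and +1 (most positive). It is not derived from the other three values.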
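To illustrate the normalization step, here is a short sketch of the formula as we understand it from the VADER implementation: the summed valence x is mapped to x / sqrt(x² + alpha), with alpha set to 15. The valence of 3.1 for “awesome” is an assumption of this example; it was chosen because it reproduces the compound score shown above.

```python
import math


def normalize(score, alpha=15):
    """Map an unbounded valence sum into the range (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Assumption: "awesome" is the only sentiment-bearing word in the
# example sentence and carries a lexicon valence of 3.1.
print(round(normalize(3.1), 4))
```

Because of the square root in the denominator, the result approaches but never reaches -1 or +1, no matter how many strongly worded terms a text contains.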

Information Retrieval

Information retrieval is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources.

In terms of NLP and text mining, information retrieval is a critical component.
Here’s an example of how you can retrieve information about specific tokens using NLTK:

from nltk.text import Text
from nltk.tokenize import word_tokenize

raw_text = "Python is a high-level programming language. Python is an interpreted language. Python is interactive."

tokens = word_tokenize(raw_text)
text = Text(tokens)

# concordance() prints every occurrence of the word with its surrounding context
text.concordance("Python")

Output:

Displaying 3 of 3 matches:
Python is a high-level programming languag
a high-level programming language . Python is an interpreted language . Python
Python is an interpreted language . Python is interactive .

In this code, we first tokenize our text and create a Text object with our tokens. Then we call the concordance method on our Text object with the word “Python”. The concordance method shows every occurrence of the word together with the text around it; each line is cut to a fixed width, which is why words at the edges may be truncated. Note that concordance prints its results itself and returns None, so there is no need to wrap the call in print().
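The idea behind a concordance, often called “keyword in context”, can be sketched in a few lines of plain Python. The helper below (keyword_in_context is a hypothetical name, and the window of three tokens is an arbitrary choice) collects a few tokens to the left and right of every match. Unlike NLTK, it measures context in tokens rather than characters and does no alignment.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return one context string per occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Brackets mark the matched keyword in each line.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = ("Python is a high-level programming language . "
          "Python is an interpreted language . Python is interactive .").split()

for line in keyword_in_context(tokens, "Python"):
    print(line)
```

Seeing a word next to its neighbors like this is a quick way to check how it is actually used in a corpus.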

Frequency Distribution

Frequency Distribution is used to count the frequency of each word in a text. It is a distribution because it tells us how the total number of word tokens in the text are distributed across the types of words.

Here’s how to calculate frequency distribution using NLTK:

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

text = "Python is an interpreted, high-level, general-purpose programming language."

# Tokenize the sentence
tokens = word_tokenize(text)

# Create frequency distribution
fdist = FreqDist(tokens)

# Print the frequency of each word
for word, freq in fdist.items():
    print(f'{word}: {freq}')

Output:

Python: 1
is: 1
an: 1
interpreted: 1
,: 2
high-level: 1
general-purpose: 1
programming: 1
language: 1
.: 1

In the code above, we first tokenize our text and then create a frequency distribution with the FreqDist class from NLTK. We pass our tokens to the FreqDist class.

The FreqDist object behaves like a dictionary that maps each token to its count. Its items method yields (token, count) pairs, which we loop over to print each token and its frequency. Punctuation marks are tokens too, which is why the comma shows up with a count of 2.
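FreqDist builds on Python's collections.Counter, so the same counting idea can be tried with the standard library alone. In this sketch the token list is written out by hand to match the tokens shown in the output above.

```python
from collections import Counter

# Tokens written out by hand, matching the output shown above.
tokens = ["Python", "is", "an", "interpreted", ",", "high-level", ",",
          "general-purpose", "programming", "language", "."]

counts = Counter(tokens)

print(counts[","])            # how often the comma occurs
print(counts.most_common(1))  # the single most frequent token
```

Methods such as most_common are a convenient way to pull out the top few tokens instead of printing the whole distribution.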

Further Reading

https://www.nltk.org/book/

Mokhtar Ebrahim

Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.
