Introduction to NLTK: Tokenization, Stemming, Lemmatization, POS Tagging - GeeksforGeeks (2024)

Natural Language Toolkit (NLTK) is one of the largest Python libraries for performing various Natural Language Processing (NLP) tasks. Its API covers a wide range of them, from rudimentary tasks such as text pre-processing to tasks like vectorized representation of text. In this article, we will get familiar with the basics of NLTK and perform some crucial NLP tasks: Tokenization, Stemming, Lemmatization, and POS Tagging.

Table of Contents

  • What is the Natural Language Toolkit (NLTK)?
  • Tokenization
  • Stemming and Lemmatization
  • Stemming
  • Lemmatization
  • Part of Speech Tagging

What is the Natural Language Toolkit (NLTK)?

As discussed earlier, NLTK is a Python library for performing an array of tasks on human language data. It can perform a variety of operations on textual data, such as classification, tokenization, stemming, tagging, parsing, semantic reasoning, etc.

Installation:
NLTK can be installed using pip. In a Jupyter notebook, run the following cell; in a terminal, drop the leading "!".

! pip install nltk

Accessing Additional Resources:
To use additional resources, such as resources for languages other than English, you can run the following in a Python script. It downloads every available NLTK data package, so it can take a while, but it has to be done only once, the first time you use NLTK on your system.

Python
import nltk
nltk.download('all')

Now, having installed NLTK successfully in our system, let’s perform some basic operations on text data using NLTK.

Tokenization

Tokenization refers to breaking down text into smaller units. It entails splitting paragraphs into sentences and sentences into words. It is one of the initial steps of any NLP pipeline. Let us have a look at the two major kinds of tokenization that NLTK provides:

Word Tokenization

It involves breaking down the text into words.

Example:
"I study Machine Learning on GeeksforGeeks." will be word-tokenized as
['I', 'study', 'Machine', 'Learning', 'on', 'GeeksforGeeks', '.'].

Sentence Tokenization

It involves breaking down the text into individual sentences.

Example:
"I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP."
will be sentence-tokenized as
['I study Machine Learning on GeeksforGeeks.', "Currently, I'm studying NLP."]

In Python, both these tokenizations can be implemented in NLTK as follows:

Python
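# Tokenization using NLTK
# (requires the tokenizer data downloaded in the previous step)
from nltk import word_tokenize, sent_tokenize
# adjacent string literals are joined; note the space after the first sentence
sent = ("GeeksforGeeks is a great learning platform. "
        "It is one of the best for Computer Science students.")
print(word_tokenize(sent))
print(sent_tokenize(sent))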

Output:

['GeeksforGeeks', 'is', 'a', 'great', 'learning', 'platform', '.',
'It', 'is', 'one', 'of', 'the', 'best', 'for', 'Computer', 'Science', 'students', '.']
['GeeksforGeeks is a great learning platform.',
'It is one of the best for Computer Science students.']
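To see what a tokenizer adds over plain whitespace splitting, here is a minimal sketch that uses only Python's standard library. The function names simple_word_tokenize and simple_sent_tokenize are hypothetical helpers written for this illustration; they are not part of NLTK and not how NLTK's tokenizers work internally. Real tokenizers handle contractions, abbreviations, and many other cases that these two regular expressions ignore.

```python
import re

def simple_word_tokenize(text):
    """Toy word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Toy sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP."

# str.split() leaves punctuation glued to the words ('GeeksforGeeks.', 'Currently,')
print(text.split())

# The toy tokenizer separates punctuation into its own tokens
print(simple_word_tokenize(text))

# The toy splitter returns one string per sentence
print(simple_sent_tokenize(text))
```

Note that the toy word tokenizer splits "I'm" at the apostrophe in a naive way, which is one reason to prefer a library tokenizer for real text.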

Stemming and Lemmatization

When working with Natural Language, we are not much interested in the form of words – rather, we are concerned with the meaning that the words intend to convey. Thus, we try to map every word of the language to its root/base form. This process is called canonicalization.

E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same action – hence, we can map them all to their base form i.e. ‘play’.

Now, there are two widely used canonicalization techniques: Stemming and Lemmatization.

Stemming

Stemming generates the base word from the inflected word by removing the affixes of the word. It has a set of pre-defined rules that govern the dropping of these affixes. It must be noted that stemmers might not always result in semantically meaningful base words. Stemmers are faster and computationally less expensive than lemmatizers.

In the following code, we will be stemming words using Porter Stemmer – one of the most widely used stemmers:

Python
from nltk.stem import PorterStemmer
# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))

Output:

play
play
play
play

We can see that all the variations of the word ‘play’ have been reduced to the same word, ‘play’. In this case, the output is a meaningful word. However, this is not always the case, because a stemmer only strips affixes according to its rules. (A lemmatizer, covered in the next section, instead relies on stored groups of inflected forms; there is no removal of affixes as in the case of a stemmer.) Let us take an example.

Python
from nltk.stem import PorterStemmer
# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("Communication"))

Output:

commun

The stemmer reduces the word ‘communication’ to a base word ‘commun’ which is meaningless in itself.
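The idea of rule-based affix removal can be sketched in a few lines of plain Python. This is a toy illustration, not the Porter algorithm: the suffix list SUFFIX_RULES and the function toy_stem are hypothetical and were chosen only so that the examples from this section come out the same way.

```python
# Hypothetical suffix rules, tried in order; the first match wins
SUFFIX_RULES = ["ication", "ing", "ed", "s"]

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["play", "plays", "played", "playing", "Communication", "news"]:
    print(w, "->", toy_stem(w))
```

Because the rules know nothing about meaning, 'Communication' becomes 'commun' and 'news' becomes 'new'. This is the kind of result that makes stemmers fast but occasionally wrong.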

Lemmatization

Lemmatization involves grouping together the inflected forms of the same word. This way, we can arrive at a base form of any word that is itself a meaningful word. The base form here is called the lemma.

Lemmatizers are slower and computationally more expensive than stemmers.

Example:
'play', 'plays', 'played', and 'playing' have 'play' as the lemma.

In Python, lemmatization can be implemented in NLTK as follows:

Python
from nltk.stem import WordNetLemmatizer
# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))

Output:

play
play
play
play

Please note that the lemmatizer takes the Part of Speech of the word as a second function argument ('v' for verb above). If it is omitted, the word is treated as a noun ('n').

Also, the lemmatizer returns dictionary words rather than truncated stems, and a word it cannot reduce is returned unchanged. Let us take the same example as we took in the case of stemmers.

Python
from nltk.stem import WordNetLemmatizer
# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("Communication", 'v'))

Output:

Communication
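The contrast with stemming can be illustrated with a toy lookup-based lemmatizer. This is only a sketch under stated assumptions: LEMMA_TABLE and toy_lemmatize are hypothetical, the table holds a handful of hand-picked entries, and a real lemmatizer relies on a much larger lexical resource. The sketch shows why the part of speech matters and why an unknown word comes back unchanged.

```python
# Hypothetical lookup table: (inflected form, part of speech) -> lemma
LEMMA_TABLE = {
    ("plays", "v"): "play",
    ("played", "v"): "play",
    ("playing", "v"): "play",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; return it unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("played", "v"))         # found in the table as a verb
print(toy_lemmatize("meeting", "v"))        # verb reading
print(toy_lemmatize("meeting", "n"))        # noun reading keeps the full word
print(toy_lemmatize("Communication", "v"))  # not in the table, returned as-is
```

The same surface form 'meeting' maps to different lemmas depending on the part of speech, which is information a suffix-stripping stemmer never sees.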

Part of Speech Tagging

Part of Speech (POS) tagging refers to assigning a part of speech (noun, verb, determiner, and so on) to each word of a sentence. It is significant as it helps to give a better syntactic overview of a sentence.

Example:
"GeeksforGeeks is a Computer Science platform."
Let's see how NLTK's POS tagger will tag this sentence.

In Python, POS tagging can be implemented in NLTK as follows:

Python
from nltk import pos_tag, word_tokenize
text = "GeeksforGeeks is a Computer Science platform."
tokenized_text = word_tokenize(text)
# pos_tag returns a list of (token, tag) pairs
tags = pos_tag(tokenized_text)
print(tags)

Output:

[('GeeksforGeeks', 'NNP'),
('is', 'VBZ'),
('a', 'DT'),
('Computer', 'NNP'),
('Science', 'NNP'),
('platform', 'NN'),
('.', '.')]
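The tags above follow the Penn Treebank tag set (NNP, VBZ, DT, ...), whereas the lemmatizer shown earlier expects a single letter such as 'n' or 'v'. A common way to connect the two steps is a small mapping function. The sketch below is one possible version: penn_to_wordnet is a hypothetical helper, not an NLTK function, and falling back to 'n' for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet-style POS letter (assumed fallback: 'n')."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else

# The tagged output from the example above, copied in by hand
tagged = [("GeeksforGeeks", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("Computer", "NNP"), ("Science", "NNP"), ("platform", "NN"), (".", ".")]

for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

The resulting letter can then be passed as the second argument to lemmatizer.lemmatize(word, pos), so that each token is lemmatized according to its tagged part of speech.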

Conclusion

In conclusion, the Natural Language Toolkit (NLTK) is a powerful Python library that offers a wide range of tools for Natural Language Processing (NLP). From fundamental tasks like text pre-processing to more advanced operations such as semantic reasoning, NLTK provides a versatile API that caters to the diverse needs of language-related tasks.



