
Natural Language Processing with Python NLTK part 7 - Lemmatizing

Natural Language Processing - Lemmatizing 



Stemming and lemmatizing go hand in hand. Both reduce a word to a base form, but in different ways. Stemming simply cuts off the end of the word, so the result is not always a real word. Lemmatizing removes the inflectional part and returns the vocabulary (dictionary) form of the word, known as the lemma. Let's understand this with a simple example.

# Requires the WordNet corpus: run nltk.download("wordnet") once beforehand.
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create the stemmer and the lemmatizer once and reuse them in every loop
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# stemming and lemmatizing verbs
words_verbs = ["run", "ran", "running", "gave", "took", "shot"]

print("*************Stemming verbs********************")
for w in words_verbs:
    # stem the word
    print(stemmer.stem(w))

print("*************Lemmatizing verbs********************")
for w in words_verbs:
    # lemmatize the word as a verb
    print(lemmatizer.lemmatize(w, pos="v"))

# stemming and lemmatizing nouns
words_nouns = ["goons", "clocks", "machines", "wolves", "shelves"]

print("*************Stemming nouns********************")
for x in words_nouns:
    print(stemmer.stem(x))

print("*************Lemmatizing nouns********************")
for x in words_nouns:
    # lemmatize the word as a noun
    print(lemmatizer.lemmatize(x, pos="n"))

# lemmatizing adjectives
words_adjective = ["better", "slower", "slowest", "strongest", "busiest"]

print("*************Lemmatizing adjectives********************")
for x in words_adjective:
    # lemmatize the word as an adjective
    print(lemmatizer.lemmatize(x, pos="a"))

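In the example above I have lemmatized verbs, nouns and adjectives separately to show the effect. The pos argument tells the lemmatizer which part of speech to assume for the word; if it is left out, the word is treated as a noun. Run the script and compare the stemmed and lemmatized output for each group to see the difference between stemming and lemmatizing.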
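To make the distinction concrete without depending on NLTK's data, here is a minimal sketch of the two ideas. Everything in it is hypothetical and for illustration only: `toy_stem`, `toy_lemmatize`, the suffix list and the small lookup table are my own names, and neither function is the algorithm that `PorterStemmer` or `WordNetLemmatizer` actually uses. It only shows that stemming chops suffixes blindly, while lemmatizing looks the word up in a vocabulary.

```python
# Toy illustration only: neither function is NLTK's real algorithm.

# Hypothetical suffix list for the toy stemmer.
SUFFIXES = ("ing", "es", "s")

# Hypothetical miniature vocabulary mapping inflected forms to lemmas.
TOY_LEMMAS = {
    "ran": "run",
    "running": "run",
    "wolves": "wolf",
    "better": "good",
}


def toy_stem(word):
    """Strip the first matching suffix without checking the result is a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


for w in ["running", "ran", "wolves", "better"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The toy stemmer turns "running" into "runn" and "wolves" into "wolv", which are not real words, and it cannot relate "ran" to "run" at all. The lookup-based version returns a vocabulary word in each case, but only for words that are in its table.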

