Natural Language Processing with Python NLTK part 7 - Lemmatizing

Natural Language Processing - Lemmatizing 



Stemming and lemmatizing go hand in hand. Both processes reduce a word to a base form, but in different ways. In stemming we simply cut off the end of the word, which may or may not leave a meaningful word, whereas lemmatizing is more concerned with producing a meaningful word: it removes the inflectional part and returns the vocabulary (dictionary) form of the word. Let's understand this with a simple example.

from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create the stemmer and lemmatizer once, instead of inside every loop
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# lemmatizing verbs
words_verbs = ["run", "ran", "running", "gave", "took", "shot"]

print("*************Stemming verbs********************")
for w in words_verbs:
    # Stemming the words
    print(stemmer.stem(w))

print("*************Lemmatizing verbs********************")
for w in words_verbs:
    # Lemmatize the words as verbs (pos="v")
    print(lemmatizer.lemmatize(w, pos="v"))

# lemmatizing nouns
print("*************Stemming nouns********************")
words_nouns = ["goons", "clocks", "machines", "wolves", "shelves"]
for w in words_nouns:
    print(stemmer.stem(w))

print("*************Lemmatizing nouns********************")
for w in words_nouns:
    # Lemmatize the words as nouns (pos="n")
    print(lemmatizer.lemmatize(w, pos="n"))

print("*************Lemmatizing adjectives********************")
words_adjective = ["better", "slower", "slowest", "strongest", "busiest"]
for w in words_adjective:
    # Lemmatize the words as adjectives (pos="a")
    print(lemmatizer.lemmatize(w, pos="a"))

In the example above I have lemmatized verbs, nouns and adjectives separately to show the effect of the pos argument. Comparing the two outputs, you can now clearly see the difference between stemming and lemmatizing.
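One thing to keep in mind: if you leave out the pos argument, lemmatize() treats every word as a noun, so "ran" would come back unchanged. When lemmatizing a whole sentence, a common approach (a sketch I'm adding here, not part of the original example) is to run nltk.pos_tag first and map its Penn Treebank tags onto the four POS letters WordNet understands. The helper names penn_to_wordnet and lemmatize_sentence below are my own, not from NLTK:

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag (from nltk.pos_tag) to a WordNet POS letter.
    # WordNet only distinguishes adjectives (a), verbs (v), adverbs (r)
    # and nouns (n); noun is the safe default.
    if tag.startswith("J"):
        return "a"
    if tag.startswith("V"):
        return "v"
    if tag.startswith("R"):
        return "r"
    return "n"

def lemmatize_sentence(sentence):
    # Tag each token, then lemmatize it with the matching WordNet POS.
    from nltk import pos_tag, word_tokenize
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in pos_tag(word_tokenize(sentence))]
```

With this, calling lemmatize_sentence("The wolves were running") should give back the lemmas of each word according to its tagged part of speech, instead of treating everything as a noun.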

