
Natural Language Processing with Python NLTK part 5 - Chunking and Chinking

Natural Language Processing


Using regular expression modifiers, we can chunk the PoS-tagged words from the earlier example. Chunking groups tagged words that match a chunk rule, which is written as a regular expression over PoS tags. Chinking defines what to exclude from that selection.

Here is a list of regular expression modifiers in Python:

  • {1,3} = expect 1 to 3 repetitions of the preceding expression, e.g. \d{1,3} matches 1 to 3 digits
  • + = match 1 or more repetitions
  • ? = match 0 or 1 repetitions
  • * = match 0 or more repetitions
  • $ = matches at the end of a string
  • ^ = matches at the start of a string
  • | = matches either/or. Example: x|y will match either x or y
  • [] = a character set or range, e.g. [a-z] matches any one lowercase letter
  • {x} = expect exactly x repetitions of the preceding expression
  • {x,y} = expect x to y repetitions of the preceding expression
source: https://pythonprogramming.net/regular-expressions-regex-tutorial-python-3/
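
To make the modifiers above concrete, here is a minimal sketch using Python's built-in re module. The sample strings and variable names are made up for illustration. The last lines apply the same idea to PoS tag names, which is how the chunk rules later in this post use them (NLTK tag patterns are close to, but not identical to, plain regular expressions).

```python
import re

# {1,3}: between one and three digits in a row
digit_runs = re.findall(r"\d{1,3}", "7 42 12345")
print(digit_runs)  # ['7', '42', '123', '45']

# + (one or more), * (zero or more), ? (zero or one)
print(bool(re.fullmatch(r"ab+", "abbb")))       # True
print(bool(re.fullmatch(r"ab*", "a")))          # True
print(bool(re.fullmatch(r"colou?r", "color")))  # True

# ^ and $ anchor a match to the start and the end of the string
print(bool(re.search(r"^Hello", "Hello world")))  # True
print(bool(re.search(r"world$", "Hello world")))  # True

# | is alternation, [] is a character set
animals = re.findall(r"cat|dog", "cat dog bird")
vowels = re.findall(r"[aeiou]", "chunk and chink")
print(animals)  # ['cat', 'dog']
print(vowels)   # ['u', 'a', 'i']

# The same modifiers applied to tag names: "NN.?" means "NN" followed by
# at most one more character, so it covers NN, NNS and NNP but not NNPS.
tags = ["NN", "NNS", "NNP", "NNPS", "VB"]
noun_like = [t for t in tags if re.fullmatch(r"NN.?", t)]
print(noun_like)  # ['NN', 'NNS', 'NNP']
```

This is why the rules below list NNPS separately: the pattern NN.? on its own stops one character short of it.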

Chunking


import nltk
from nltk.tokenize import word_tokenize

# POS tagging
sent = "This will be chunked. This is for Test. World is awesome. Hello world."

print(nltk.pos_tag(word_tokenize(sent)))

# chunk rule: a run of noun tags followed by a run of verb tags (each tag pattern is optional)
chunkRule = r"""chunk: {<NN.?>*<NNS.?>*<NNP.?>*<NNPS.?>*<VB.?>*<VBD.?>*<VBG.?>*<VBN.?>*<VBP.?>*<VBZ.?>*}"""

My_parser = nltk.RegexpParser(chunkRule)
chunked = My_parser.parse(nltk.pos_tag(word_tokenize(sent)))

print(chunked)


This prints the list of (word, tag) pairs, followed by the chunk tree.


You can also view the tree graphically, which is easier to understand. Note that draw() opens a Tkinter window, so it needs a Python installation with Tk support and a display. matplotlib is not required.

My_parser = nltk.RegexpParser(chunkRule)
chunked = My_parser.parse(nltk.pos_tag(word_tokenize(sent)))

# draw the tree
chunked.draw()
print(chunked)

This opens a window showing the tree.
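
If no display is available, the chunks can also be pulled out of the tree in code. The following is a minimal sketch, not the code from this post: the tokens are tagged by hand (an assumption, so that no tagger model has to be downloaded), and the rule is a shortened version of the one above. It relies on Tree.subtrees() with a filter and on Tree.leaves().

```python
import nltk

# Hand-tagged tokens (assumed tags, written by hand for illustration)
tagged = [("World", "NNP"), ("is", "VBZ"), ("awesome", "JJ"), (".", ".")]

# Simplified rule: one or more noun tags followed by zero or more verb tags
parser = nltk.RegexpParser(r"chunk: {<NN.?>+<VB.?>*}")
tree = parser.parse(tagged)

# Collect the words of every subtree labelled "chunk"
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "chunk")
]
print(chunks)  # ['World is']
```

The adjective and the full stop stay outside the chunk because neither tag matches the rule.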





Chinking

Chinking is the process of excluding tokens from a chunk. In this example we chunk every tag and then chink (exclude) the nouns.

import nltk
from nltk.tokenize import word_tokenize

sent = "This will be the day that I will chink all the nouns. Everything will be there. Except nouns"

print(nltk.pos_tag(word_tokenize(sent)))

# first chunk everything and chink only nouns
chunkRule = r"""Chunk: {<.*>+}
                        }<NN.?|NNS|NNP|NNPS>+{"""

MyParser = nltk.RegexpParser(chunkRule)
chunked = MyParser.parse(nltk.pos_tag(word_tokenize(sent)))

chunked.draw()
print(chunked)

Outputs:
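
To see the effect of chinking on a small input, here is a minimal sketch that uses the same grammar as above on a hand-tagged sentence. The sentence and its tags are assumptions made for illustration, so the result does not depend on a tagger model.

```python
import nltk

# Hand-tagged tokens (assumed tags, written by hand for illustration)
tagged = [("The", "DT"), ("dog", "NN"), ("barked", "VBD"), ("loudly", "RB")]

# Same grammar as above: chunk everything, then chink the nouns
grammar = r"""Chunk: {<.*>+}
                     }<NN.?|NNS|NNP|NNPS>+{"""
tree = nltk.RegexpParser(grammar).parse(tagged)

# Words that ended up inside a Chunk subtree
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]

# Tokens left directly under the root, i.e. the chinked nouns
chinked = [child for child in tree if not isinstance(child, nltk.Tree)]

print(chunks)   # [['The'], ['barked', 'loudly']]
print(chinked)  # [('dog', 'NN')]
```

Removing the noun from the middle splits the single chunk into two, which is the typical effect of a chink rule.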






