Natural Language Processing
Using regular expression modifiers, we can chunk the PoS-tagged words from the earlier example. Chunking is done with a regular expression that defines a chunk rule; chinking defines what to exclude from the selection.
Here is a list of regular expression modifiers in Python:
- {1,3} = match 1 to 3 repetitions of the preceding pattern (e.g. 1-3 digits, or "places")
- + = match 1 or more repetitions
- ? = match 0 or 1 repetitions
- * = match 0 or more repetitions
- $ = matches at the end of a string
- ^ = matches at the start of a string
- | = matches either/or; for example, x|y will match either x or y
- [] = matches a range or set of characters
- {x} = match exactly x repetitions of the preceding pattern
- {x,y} = match x to y repetitions of the preceding pattern
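The modifiers above can be tried out quickly with Python's built-in `re` module before using them in chunk rules. The patterns and strings below are illustrative examples, not taken from the chunking code:

```python
import re

# Each line demonstrates one modifier from the list above
print(re.search(r"\d{1,3}", "abc42def").group())   # {1,3}: 1-3 digits -> "42"
print(bool(re.fullmatch(r"ab+", "abbb")))          # + : one or more b's
print(bool(re.fullmatch(r"colou?r", "color")))     # ? : optional "u"
print(bool(re.fullmatch(r"ab*", "a")))             # * : zero or more b's
print(bool(re.search(r"end$", "the end")))         # $ : end of string
print(bool(re.search(r"^start", "start here")))    # ^ : start of string
print(bool(re.fullmatch(r"cat|dog", "dog")))       # | : either/or
print(bool(re.fullmatch(r"[a-c]+", "cab")))        # []: character range
print(bool(re.fullmatch(r"a{3}", "aaa")))          # {x}: exactly x repeats
```

Every `bool(...)` above prints `True`, since each pattern matches its sample string.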
Chunking
```python
import nltk
from nltk.tokenize import word_tokenize

# POS tagging
sent = "This will be chunked. This is for Test. World is awesome. Hello world."
print(nltk.pos_tag(word_tokenize(sent)))

# creating a regular expression for chunking verbs and nouns
chunkRule = r"""chunk: {<NN.?>*<NNS.?>*<NNP.?>*<NNPS.?>*<VB.?>*<VBD.?>*<VBG.?>*<VBN.?>*<VBP.?>*<VBZ.?>*}"""
My_parser = nltk.RegexpParser(chunkRule)
chunked = My_parser.parse(nltk.pos_tag(word_tokenize(sent)))
print(chunked)
```
This will give the output as follows:
You can also view the tree graphically, which is easier to understand. Note that NLTK's draw() method uses the tkinter GUI toolkit, so it requires a graphical display.
```python
My_parser = nltk.RegexpParser(chunkRule)
chunked = My_parser.parse(nltk.pos_tag(word_tokenize(sent)))
# draw the tree
chunked.draw()
print(chunked)
```
This will draw the tree as follows.
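Besides drawing the tree, you can walk it programmatically to pull out the chunked phrases. The sketch below uses hand-written PoS tags (an assumption for illustration, so no tagger model download is needed) with a simplified one-line chunk rule:

```python
import nltk

# Hand-tagged tokens: illustrative tags, not produced by pos_tag
tagged = [("Hello", "NNP"), ("world", "NN"), ("is", "VBZ"), ("awesome", "JJ")]

# Chunk one or more consecutive nouns or verbs
parser = nltk.RegexpParser(r"chunk: {<NN.?|VB.?>+}")
tree = parser.parse(tagged)

# Collect the words inside each "chunk" subtree
for subtree in tree.subtrees(filter=lambda t: t.label() == "chunk"):
    print(" ".join(word for word, tag in subtree.leaves()))  # -> Hello world is
```

Here `subtrees()` filters on the label given in the chunk rule, and `leaves()` returns the (word, tag) pairs inside that chunk.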
Chinking
Chinking is the process of excluding tokens from a chunk. In this example we will chunk all the tags and then chink (exclude) the nouns.
```python
import nltk
from nltk.tokenize import word_tokenize

sent = "This will be the day that I will chink all the nouns. Everything will be there. Except nouns"
print(nltk.pos_tag(word_tokenize(sent)))

# first chunk everything, then chink only the nouns
chunkRule = r"""Chunk: {<.*>+}
                }<NN.?|NNS|NNP|NNPS>+{"""
MyParser = nltk.RegexpParser(chunkRule)
chunked = MyParser.parse(nltk.pos_tag(word_tokenize(sent)))
chunked.draw()
print(chunked)
```
Outputs: