Natural Language Processing
Using regular expression modifiers, we can chunk the PoS-tagged words from the earlier example. Chunking is done with a regular expression that defines a chunk rule; chinking defines what to exclude from the selection.
Here is a list of regular expression modifiers in Python:
- {1,3} = match 1 to 3 repetitions of the preceding pattern (e.g. 1-3 digits, or "places")
- + = match 1 or more repetitions
- ? = match 0 or 1 repetitions
- * = match 0 or more repetitions
- $ = matches at the end of a string
- ^ = matches at the start of a string
- | = matches either/or; for example, x|y will match either x or y
- [] = matches a range or set of characters
- {x} = match exactly x repetitions of the preceding pattern
- {x,y} = match x to y repetitions of the preceding pattern
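The modifiers above can be tried out quickly with Python's built-in `re` module before using them in chunk rules. The patterns and strings below are illustrative examples, not taken from the chunking code:

```python
import re

# Each line demonstrates one modifier from the list above
print(re.search(r"\d{1,3}", "abc42def").group())   # {1,3}: 1-3 digits -> "42"
print(bool(re.fullmatch(r"ab+", "abbb")))          # + : one or more b's
print(bool(re.fullmatch(r"colou?r", "color")))     # ? : optional "u"
print(bool(re.fullmatch(r"ab*", "a")))             # * : zero or more b's
print(bool(re.search(r"end$", "the end")))         # $ : end of string
print(bool(re.search(r"^start", "start here")))    # ^ : start of string
print(bool(re.fullmatch(r"cat|dog", "dog")))       # | : either/or
print(bool(re.fullmatch(r"[a-c]+", "cab")))        # []: character range
print(bool(re.fullmatch(r"a{3}", "aaa")))          # {x}: exactly x repeats
```

Every `bool(...)` above prints `True`, since each pattern matches its sample string.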
Chunking
```python
import nltk
from nltk.tokenize import word_tokenize

# POS tagging
sent = "This will be chunked. This is for Test. World is awesome. Hello world."
print(nltk.pos_tag(word_tokenize(sent)))

# creating a regular expression for chunking verbs and nouns
chunkRule = r"""chunk: {<NN.?>*<NNS.?>*<NNP.?>*<NNPS.?>*<VB.?>*<VBD.?>*<VBG.?>*<VBN.?>*<VBP.?>*<VBZ.?>*}"""
My_parser = nltk.RegexpParser(chunkRule)
chunked = My_parser.parse(nltk.pos_tag(word_tokenize(sent)))
print(chunked)
```
This will give the output as follows:
You can also view the tree graphically, which is easier to understand. Note that NLTK's draw() method uses the tkinter GUI toolkit, so it requires a graphical display.
```python
My_parser = nltk.RegexpParser(chunkRule)
chunked = My_parser.parse(nltk.pos_tag(word_tokenize(sent)))
# draw the tree
chunked.draw()
print(chunked)
```
This will draw the tree as follows.
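Besides drawing the tree, you can walk it programmatically to pull out the chunked phrases. The sketch below uses hand-written PoS tags (an assumption for illustration, so no tagger model download is needed) with a simplified one-line chunk rule:

```python
import nltk

# Hand-tagged tokens: illustrative tags, not produced by pos_tag
tagged = [("Hello", "NNP"), ("world", "NN"), ("is", "VBZ"), ("awesome", "JJ")]

# Chunk one or more consecutive nouns or verbs
parser = nltk.RegexpParser(r"chunk: {<NN.?|VB.?>+}")
tree = parser.parse(tagged)

# Collect the words inside each "chunk" subtree
for subtree in tree.subtrees(filter=lambda t: t.label() == "chunk"):
    print(" ".join(word for word, tag in subtree.leaves()))  # -> Hello world is
```

Here `subtrees()` filters on the label given in the chunk rule, and `leaves()` returns the (word, tag) pairs inside that chunk.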
Chinking
Chinking is the process of excluding tokens from a chunk. In this example we will chunk all the tags and then chink (exclude) the nouns.
```python
import nltk
from nltk.tokenize import word_tokenize

sent = "This will be the day that I will chink all the nouns. Everything will be there. Except nouns"
print(nltk.pos_tag(word_tokenize(sent)))

# first chunk everything, then chink only the nouns
chunkRule = r"""Chunk: {<.*>+}
                }<NN.?|NNS|NNP|NNPS>+{"""
MyParser = nltk.RegexpParser(chunkRule)
chunked = MyParser.parse(nltk.pos_tag(word_tokenize(sent)))
chunked.draw()
print(chunked)
```
Outputs: