
Natural Language Processing with Python NLTK part 6 - Named Entity Recognition

Natural Language Processing - NER

A named entity is a specific reference to something, such as a person, an organization or a place. NLTK provides named entity recognition (NER), which recognizes certain types of entities in text. Those types are as follows:

NE Type        Examples
ORGANIZATION   Georgia-Pacific Corp., WHO
PERSON         Eddy Bonte, President Obama
LOCATION       Murray River, Mount Everest
DATE           June, 2008-06-29
TIME           two fifty a m, 1:30 p.m.
MONEY          175 million Canadian Dollars, GBP 10.40
PERCENT        twenty pct, 18.75 %
FACILITY       Washington Monument, Stonehenge
GPE            South East Asia, Midlothian

Simple example on NER:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

para = " America is a country. John is a name. "

sent = sent_tokenize(para)

for s in sent:
    word = word_tokenize(s)
    tag = nltk.pos_tag(word)
    namedEntity = nltk.ne_chunk(tag)
    namedEntity.draw()

Each of the two sentences is POS-tagged, and NLTK then identifies the named entities in it. The draw() call opens a window showing the resulting tree for each sentence. The result looks like this.
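Since draw() needs a graphical display, it can be handy to get the same information as plain text. The following is a minimal sketch, not part of the original example: the helper names extract_entities and named_entities are made up for illustration. It assumes the tree returned by ne_chunk holds each entity as a subtree with a label() method and every other token as a (word, tag) tuple.

```python
def extract_entities(tree):
    """Return (entity_text, label) pairs from an ne_chunk-style tree.

    Entity chunks are subtrees and have a .label() method; ordinary
    tokens are plain (word, pos_tag) tuples and are skipped.
    """
    entities = []
    for node in tree:
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            entities.append((text, node.label()))
    return entities


def named_entities(para, binary=False):
    """Run sent_tokenize -> word_tokenize -> pos_tag -> ne_chunk on a paragraph."""
    # Imported here so extract_entities can be used on its own,
    # even where the NLTK models have not been downloaded.
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    found = []
    for s in sent_tokenize(para):
        tree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(s)), binary=binary)
        found.extend(extract_entities(tree))
    return found
```

Calling print(named_entities("America is a country. John is a name.")) should list the entities found in each sentence together with their types. The tokenizer, tagger and chunker models have to be fetched once with nltk.download(); the exact resource names differ between NLTK versions.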



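NLTK identifies America and John as named entities. What if a named entity spans two words, such as Sri Lanka or Saudi Arabia? A simple way to get it as a single named entity is to enable binary mode by passing binary=True to ne_chunk. In binary mode the chunker only marks a span as a named entity (NE) instead of assigning it a type such as PERSON or GPE.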

For this, I have changed the paragraph in the previous code as follows:

para = " Saudi Arabia is a country. John Peters is a name. "

sent = sent_tokenize(para)

for s in sent:
    word = word_tokenize(s)
    tag = nltk.pos_tag(word)
    # binary=True: label every entity as NE instead of assigning a type
    namedEntity = nltk.ne_chunk(tag, binary=True)
    namedEntity.draw()

The output without enabling binary is as follows.





The result after setting binary=True is as follows.



You can see that each named entity is grouped into a single chunk, labeled NE.
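To compare the two modes without looking at the drawings, the chunks can be collected side by side. This is only a sketch under a few assumptions: entity_spans and compare_binary are hypothetical helper names, and chunk is expected to be a function that behaves like nltk.ne_chunk(tagged, binary=...).

```python
def entity_spans(tree):
    """Collect (text, label) for every chunk directly under the sentence root."""
    return [
        (" ".join(word for word, _tag in node), node.label())
        for node in tree
        if hasattr(node, "label")  # plain (word, tag) tuples have no label()
    ]


def compare_binary(tagged, chunk):
    """Chunk one POS-tagged sentence with binary=False and binary=True.

    `chunk` is assumed to have the same call shape as nltk.ne_chunk.
    """
    return {
        "typed": entity_spans(chunk(tagged, binary=False)),
        "binary": entity_spans(chunk(tagged, binary=True)),
    }
```

With the variables from the loop above, compare_binary(tag, nltk.ne_chunk) returns a dictionary holding the typed chunks and the NE-only chunks for the same sentence, so any difference in grouping is easy to spot.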
