Natural Language Processing - NER
Named entities are specific references to something, such as a person, an organization, or a place. As part of processing text, NLTK lets us use named entity recognition (NER) to recognize certain types of entities. Those types are as follows:
NE Type | Examples |
---|---|
ORGANIZATION | Georgia-Pacific Corp., WHO |
PERSON | Eddy Bonte, President Obama |
LOCATION | Murray River, Mount Everest |
DATE | June, 2008-06-29 |
TIME | two fifty a m, 1:30 p.m. |
MONEY | 175 million Canadian Dollars, GBP 10.40 |
PERCENT | twenty pct, 18.75 % |
FACILITY | Washington Monument, Stonehenge |
GPE | South East Asia, Midlothian |
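A simple example of NER:

    import nltk
    from nltk.tokenize import word_tokenize, sent_tokenize

    # Requires the NLTK tokenizer, POS tagger, and NE chunker data (see nltk.download)
    para = " America is a country. John is a name. "
    sent = sent_tokenize(para)  # split the text into sentences
    for s in sent:
        word = word_tokenize(s)             # split the sentence into words
        tag = nltk.pos_tag(word)            # tag each word with its part of speech
        namedEntity = nltk.ne_chunk(tag)    # group tagged words into named entities
        namedEntity.draw()                  # show the resulting tree in a window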
The two sentences are tagged, and the NLTK library identifies the named entities in each one. Calling draw() opens a window that shows the resulting tree for each sentence.
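When a graphical window is not convenient, the same tree can be walked in code to list the entities as plain text. The following is a minimal sketch, not part of the original example: `extract_entities` is a hypothetical helper name, and `sample` is a hand-built tree that only imitates the shape of an `ne_chunk` result (the `GPE` label is assumed for illustration, since the real label depends on the model).

```python
from nltk import Tree


def extract_entities(tree):
    """Return (label, text) pairs for each entity chunk directly under the root."""
    entities = []
    for node in tree:
        # Entity chunks are nested Tree objects; plain tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built stand-in for an ne_chunk result (labels are assumed, not model output).
sample = Tree("S", [
    Tree("GPE", [("America", "NNP")]),
    ("is", "VBZ"),
    ("a", "DT"),
    ("country", "NN"),
    (".", "."),
])

print(extract_entities(sample))  # prints [('GPE', 'America')]
```

In the loop above, the same helper could be called as `extract_entities(namedEntity)` in place of `namedEntity.draw()`.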
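NLTK identifies America and John as named entities. What if the named entity has two words, like Sri Lanka or Saudi Arabia? There is a simple way to get it as a single named entity: enabling the binary option. With binary=True, the chunker only marks whether a span is a named entity (labelled NE) instead of assigning a type such as PERSON or GPE.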
For this, I changed the text in the previous code to para = " Saudi Arabia is a country. John Peters is a name. " and passed binary=True to ne_chunk:

    para = " Saudi Arabia is a country. John Peters is a name. "
    sent = sent_tokenize(para)
    for s in sent:
        word = word_tokenize(s)
        tag = nltk.pos_tag(word)
        # making binary = true
        namedEntity = nltk.ne_chunk(tag, binary=True)
        namedEntity.draw()
The output without enabling binary is as follows.