
Natural Language Processing with Python NLTK part 1 - Tokenizer





To start this series of NLP articles, we will try the tokenizers in the NLTK package. A tokenizer breaks a piece of text into smaller units, such as sentences or words, depending on which tokenizer you use. In this post I will use sent_tokenize, word_tokenize and TweetTokenizer, each of which has its own job to do.


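import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, TweetTokenizer

# sent_tokenize and word_tokenize need the Punkt models; if they are missing,
# run nltk.download("punkt") once (newer NLTK releases ask for "punkt_tab")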

para = "Hello there this is the blog about NLP. In this blog I have made some posts. " \
       "I can come up with new content."
tweet = "#Fun night. :) Feeling crazy #TGIF"

# tokenizing the paragraph into sentences and words
sent = sent_tokenize(para)
word = word_tokenize(para)

# print the sentence and word counts
print("this paragraph has " + str(len(sent)) + " sentences and " + str(len(word)) + " words")

# print each sentence, numbered from 1
for k, s in enumerate(sent, start=1):
    print("sentence " + str(k) + " = " + s)

# print each word, numbered from 1 (punctuation marks come out as tokens too)
for k, w in enumerate(word, start=1):
    print("word " + str(k) + " = " + w)

# compare the two tokenizers on the same tweet
print("Comparing word_tokenize and TweetTokenizer")
print(word_tokenize(tweet))
print(TweetTokenizer().tokenize(tweet))

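The code first splits the paragraph into sentences and words using sent_tokenize and word_tokenize. Running it prints the sentence and word counts, then each sentence and each word in turn, and finally the tweet as tokenized by word_tokenize and by TweetTokenizer.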



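As you can see, sent_tokenize has split the paragraph into separate sentences, and word_tokenize has split it into word-level substrings, with punctuation marks as tokens of their own. Notice that at the end I have compared word_tokenize with TweetTokenizer. See how the smiley :) is kept as a single token by TweetTokenizer but split into two separate tokens, : and ), by word_tokenize. The hashtags behave the same way: TweetTokenizer keeps #Fun and #TGIF whole, while word_tokenize separates the # from the word.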
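To get a feel for why a tweet-aware tokenizer keeps smileys and hashtags together, here is a rough sketch that uses only the standard re module. The pattern and the function name simple_tweet_tokenize are made up for illustration and are far simpler than the rules NLTK actually uses. The idea is just that the emoticon and hashtag alternatives are tried before the generic word and punctuation ones.

```python
import re

# Hypothetical, simplified pattern (not NLTK's real rules): alternatives are
# tried left to right, so emoticons and hashtags win over plain punctuation.
TOKEN_PATTERN = re.compile(
    r"""
    [:;=][-']?[)(DPp]      # a few common emoticons such as :) ;-) :D
    | \#\w+                # hashtags such as #TGIF
    | \w+                  # plain words
    | [^\w\s]              # any other single punctuation character
    """,
    re.VERBOSE,
)


def simple_tweet_tokenize(text):
    """Return the tokens of text, keeping emoticons and hashtags in one piece."""
    return TOKEN_PATTERN.findall(text)


tweet = "#Fun night. :) Feeling crazy #TGIF"
tokens = simple_tweet_tokenize(tweet)
print(tokens)
# ['#Fun', 'night', '.', ':)', 'Feeling', 'crazy', '#TGIF']
```

If you delete the first two alternatives from the pattern, the smiley falls apart into : and ) and each hashtag loses its #, which is roughly what happens when a general-purpose word tokenizer is used on tweets.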
