Natural Language Processing with Python NLTK part 1 - Tokenizer

Natural Language Processing




To start off this series of NLP articles, we will first try the tokenizers in the NLTK package. A tokenizer breaks a paragraph into substrings or sentences, depending on which tokenizer you use. In this post I will use sent_tokenize, word_tokenize, and TweetTokenizer, each of which has its own specific job.


import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, TweetTokenizer

# the sentence and word tokenizers need the punkt models (download once)
nltk.download('punkt')

para = "Hello there this is the blog about NLP. In this blog I have made some posts. " \
       "I can come up with new content."
tweet = "#Fun night. :) Feeling crazy #TGIF"

# tokenizing the paragraph into sentences and words
sent = sent_tokenize(para)
word = word_tokenize(para)

# printing the output
print(f"this paragraph has {len(sent)} sentences and {len(word)} words")

# print each sentence
for k, s in enumerate(sent, start=1):
    print(f"sentence {k} = {s}")

# print each word
for k, w in enumerate(word, start=1):
    print(f"word {k} = {w}")

# Comparing different kinds of tokenizer
print("Comparing word_tokenizer and TweetTokenizer")
print(word_tokenize(tweet))
print(TweetTokenizer().tokenize(tweet))

The code first tokenizes the paragraph into sentences and words using sent_tokenize and word_tokenize. The output of this code will be as follows.



As you can see, sent_tokenize has split the paragraph into separate sentences, and word_tokenize has split it into individual tokens. Notice that at the end I have compared word_tokenize and TweetTokenizer. See how a smiley such as ":)" is kept as a single token by TweetTokenizer but split into two separate tokens by word_tokenize.
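TweetTokenizer also takes a few constructor options that are handy for noisy social-media text. As a small sketch (the tweet string here is just an assumed example), preserve_case=False lowercases tokens (except emoticons), strip_handles drops @-mentions, and reduce_len shortens long runs of a repeated character:

```python
from nltk.tokenize import TweetTokenizer

# a made-up tweet for illustration
tweet = "@user LOOOOVE this #TGIF night!!! :)"

# strip_handles removes the @user mention; reduce_len caps repeated
# characters at three (LOOOOVE -> looove); preserve_case=False lowercases
# every token except emoticons like :)
tok = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
print(tok.tokenize(tweet))
```

With these options the @-mention disappears from the output entirely, while the smiley survives untouched.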
