Natural Language Processing
Starting with the NLP articles, we will first try the tokenizers in the NLTK package. A tokenizer breaks a paragraph into relevant substrings or sentences, depending on which tokenizer you use. Here I will use sent_tokenize, word_tokenize and TweetTokenizer, each of which has its own specific job.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, TweetTokenizer

# nltk.download('punkt')  # uncomment on the first run to download the tokenizer models

para = "Hello there this is the blog about NLP. In this blog I have made some posts. " \
       "I can come up with new content."
tweet = "#Fun night. :) Feeling crazy #TGIF"

# tokenizing the paragraph into sentences and words
sent = sent_tokenize(para)
word = word_tokenize(para)

# printing the output
print("this paragraph has " + str(len(sent)) + " sentences and " + str(len(word)) + " words")

# print each sentence
k = 1
for i in sent:
    print("sentence " + str(k) + " = " + i)
    k += 1

# print each word
k = 1
for i in word:
    print("word " + str(k) + " = " + i)
    k += 1

# comparing different kinds of tokenizers
print("Comparing word_tokenizer and TweetTokenizer")
print(word_tokenize(tweet))
print(TweetTokenizer().tokenize(tweet))
The code first tokenizes the paragraph into sentences and words using sent_tokenize and word_tokenize. The output of this code will be as follows.
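If you run the script (assuming the punkt models are installed), the printout should look something like this; I have truncated the word list to keep it short:

this paragraph has 3 sentences and 26 words
sentence 1 = Hello there this is the blog about NLP.
sentence 2 = In this blog I have made some posts.
sentence 3 = I can come up with new content.
word 1 = Hello
word 2 = there
word 3 = this
...
word 26 = .

Note that the count is 26 rather than 23 because word_tokenize splits each full stop into its own token.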
As you can see, sent_tokenize has separated the paragraph into individual sentences and word_tokenize has split it into substrings. Notice that at the end I have added a comparison between word_tokenize and TweetTokenizer. See how the smiley is kept as a single token by TweetTokenizer but broken into two separate pieces by word_tokenize.
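For the sample tweet, the two print lines at the end should produce token lists roughly like these (note the smiley and the hashtags):

['#', 'Fun', 'night', '.', ':', ')', 'Feeling', 'crazy', '#', 'TGIF']
['#Fun', 'night', '.', ':)', 'Feeling', 'crazy', '#TGIF']

TweetTokenizer also accepts a couple of handy constructor options, strip_handles and reduce_len, which remove @mentions and shorten exaggerated character runs. Here is a minimal sketch based on the example in the NLTK documentation:

from nltk.tokenize import TweetTokenizer

# strip_handles removes @mentions; reduce_len shortens any run of
# three or more repeated characters down to three
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tknzr.tokenize("@remy: This is waaaaayyyy too much for you!!!!!!"))
# prints roughly: [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']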