Natural Language Processing: Tokenization

What is it that is in your hands (your smartphone) day and night? Look at it and pay attention!

Don’t you think there is a mini-computer in your hands? Why not? It is a computer, after all. Now let’s begin our task: think about how a computer understands our language. That is where the story of NLP begins, with machine translation, and with ideas reaching back as far as the 17th century. After many long trials we arrive at Alan Turing: in 1950 he published his famous article “Computing Machinery and Intelligence”, which proposed what is now called the Turing Test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human.

Alan Turing (1912–1954)

So, coming to our point: Natural Language Processing (NLP) is the ability of a computer program to understand human language as it is spoken. NLP is a component of Artificial Intelligence (AI).

Here are some important events in the history of Natural Language Processing:

1950 - NLP started when Alan Turing published his article “Computing Machinery and Intelligence”

1950s - Attempts to automate translation between Russian and English

1960s - The work of Chomsky and others on formal language theory and generative syntax

1990s - Probabilistic and data-driven models had become quite standard

2000s - Large amounts of spoken and textual data became available

So far we have covered a little history, the key pioneer, and a short definition of NLP. Aren’t you curious now about how NLP works and what the process involves?


I will try to satisfy that curiosity. Shall we start?

The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be speech or written text.

Natural language generation is the process of producing meaningful phrases and sentences in the form of natural language from some internal representation.

Some terminology we need first:

  • Phonology − It is the study of how sounds are organized systematically.
  • Morphology − It is the study of the construction of words from primitive meaningful units.
  • Morpheme − It is a primitive unit of meaning in a language.
  • Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.
  • Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.
  • Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.
  • Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
  • World Knowledge − It includes general knowledge about the world.

You could call the following the architecture, or working model, of NLP.

The pipeline starts with the input text, which is readily available. The next step is morphological processing, which covers how words are constructed from smaller meaningful units. After that comes syntax analysis, which detects the grammar and works out the structure of the sentence.

Here we go with an example: for a first pass at syntax analysis we can use regular expressions. Syntax refers to the principles and rules that govern the sentence structure of any individual language; a small sketch of this step follows below.
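To make that pipeline concrete, here is a minimal sketch using NLTK. It is only an illustration under my own assumptions: it assumes NLTK is installed with its “punkt” tokenizer and “averaged_perceptron_tagger” models downloaded, and the noun-phrase chunk grammar is simply an example of a regular-expression rule, not one taken from the original post.

import nltk

# One-time model downloads (uncomment on the first run)
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

sentence = "Diamonds were first mined in India"

# Morphological processing: split the raw text into word tokens
tokens = nltk.word_tokenize(sentence)

# Syntax analysis, step 1: tag each token with its part of speech
tagged = nltk.pos_tag(tokens)

# Syntax analysis, step 2: a regular-expression chunk grammar that groups
# an optional determiner, any adjectives, and a noun into a noun phrase (NP)
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
tree = chunker.parse(tagged)

print(tagged)   # e.g. [('Diamonds', 'NNS'), ('were', 'VBD'), ...]
print(tree)     # a small parse tree with the detected noun phrases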

So far we have mostly discussed theory; now let's try to explain things with a simple example:

Syntax and Semantics: syntax is the grammatical structure of the text, while semantics is the meaning that is being conveyed.

To implement these two kinds of analysis you need a language the machine can work with, that is, a programming language (Java, C, C++, Python, R, and many more).

Some techniques we need to understand:

  1. Tokenization:-

For example, take the text sequence “Diamonds were first mined in India”.

(General understanding) Tokenization is the task of breaking a sentence into fragments separated by whitespace; certain characters, like punctuation, digits, and emoticons, are usually removed. These fragments are called tokens.
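As a rough sketch of that idea in plain Python (the trailing “!” is added here only to show punctuation being stripped, and the variable names are my own):

import re

text = "Diamonds were first mined in India!"

# Remove everything that is not a letter or whitespace (punctuation, digits, etc.)
cleaned = re.sub(r"[^a-zA-Z\s]", " ", text)

# Break the cleaned sentence into fragments on whitespace: these are the tokens
tokens = cleaned.lower().split()

print(tokens)   # ['diamonds', 'were', 'first', 'mined', 'in', 'india']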

Tokens composed of one word -> uni-grams

Tokens composed of two consecutive words -> bi-grams

Tokens composed of three consecutive words -> tri-grams

What are n-grams? An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus.
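Here is a quick sketch of building uni-grams, bi-grams, and tri-grams from the sample sentence, assuming NLTK is installed (nltk.util.ngrams is the helper used here, and simple whitespace splitting stands in for proper tokenization):

from nltk.util import ngrams

tokens = "Diamonds were first mined in India".split()

unigrams = list(ngrams(tokens, 1))   # tokens of one word
bigrams = list(ngrams(tokens, 2))    # two consecutive words
trigrams = list(ngrams(tokens, 3))   # three consecutive words

print(bigrams)
# [('Diamonds', 'were'), ('were', 'first'), ('first', 'mined'),
#  ('mined', 'in'), ('in', 'India')]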

import re

import nltk

s = "Diamonds were first mined in India"

Here we import the necessary packages and take one sample sentence.

The helper function that ties everything together was shown as a screenshot in the original post; here is what it does.
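Since the screenshot is not reproduced here, the sketch below is my reconstruction of what the surrounding text describes (lower-casing, cleaning with re.sub, then tokenizing). The function name clean_and_tokenize and the use of nltk.word_tokenize are assumptions on my part.

import re
import nltk

# nltk.download('punkt')   # needed once for nltk.word_tokenize

def clean_and_tokenize(sent):
    # Convert the whole sentence to lower case
    sent = sent.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    sent = re.sub(r'[^a-zA-Z0-9\s]', ' ', sent)
    # Split the cleaned sentence into word tokens
    return nltk.word_tokenize(sent)

s = "Diamonds were first mined in India"
print(clean_and_tokenize(s))   # ['diamonds', 'were', 'first', 'mined', 'in', 'india']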

Step by step: sent.lower() converts the sentence to lower case, and the next line applies

re.sub(r'[^a-zA-Z0-9\s]', ' ', sent)

Inside the square brackets, the ^ negates the character class, so a-zA-Z0-9 here matches any character excluding a to z, A to Z, and 0 to 9.

\s matches a single whitespace character (space, newline, carriage return, tab, form feed), and its counterpart \S matches any non-whitespace character. The whole pattern [^a-zA-Z0-9\s] therefore matches everything except letters, digits, and whitespace, and re.sub replaces those characters with a space.

The rest of the code is plain Python and should read easily if you are comfortable with functions.

That is the small piece of tokenization code I wanted to walk you through.

The next blog in this NLP series will cover POS tagging, parsing, and Named Entity Recognition.

For further reading on regular expressions and NLP, please see:

  1. https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/?utm_source=blog&utm_medium=learning-path-nlp-2020
  2. https://www.tutorialspoint.com/natural_language_processing/index.htm
  3. NLP with python: http://www.nltk.org/book/

*************** Have a wonderful read! ***************

************************************************************