Natural Language Processing: Tokenization
Think about what is in your hands day and night: your smartphone. Isn't it a mini-computer? After all, it is just a small example of a computer. Now let's begin our task: how does a computer understand our language? The story of NLP begins with machine translation, with ideas about mechanical translation dating back to the 17th century. After long trials we arrived at Turing's work: in 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing Test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human.
So, coming to our point: Natural Language Processing (NLP) is the ability of a computer program to understand human language as it is spoken. NLP is a component of Artificial Intelligence (AI).
Here are some important events in the history of Natural Language Processing:
1950- NLP research began when Alan Turing published his article "Computing Machinery and Intelligence."
1950s- Attempts to automate translation between Russian and English
1960s- The work of Chomsky and others on formal language theory and generative syntax
1990s- Probabilistic and data-driven models became standard
2000s- Large amounts of spoken and textual data became available
So far we have covered a bit of history, the key figures, and a short definition of NLP. Aren't you curious now about how NLP works and what process it involves?
Let me try to satisfy that curiosity. Shall we start?
The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be speech or written text.
Natural language generation is the process of producing meaningful phrases and sentences in the form of natural language from some internal representation.
Some terminology we will need:
- Phonology − It is the study of organizing sound systematically.
- Morphology − It is a study of the construction of words from primitive meaningful units.
- Morpheme − It is a primitive unit of meaning in a language.
- Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.
- Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.
- Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.
- Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
- World Knowledge − It includes general knowledge about the world.
The following can be called the architecture, or working model, of NLP.
From the figure above, we can see that analysis requires input, which is readily available. The next step is morphological processing, which involves studying how words are constructed. After that comes syntax analysis, which covers grammar detection and the structure of sentence formation.
Here we go with an example: for syntax analysis we need regular expressions. Syntax refers to the principles and rules that govern sentence structure in any individual language.
So far we have discussed theory; now let's try to explain with a simple example.
Syntax and Semantics: syntax is the grammatical structure of the text; semantics is the meaning that is being conveyed.
To implement these two, you need a machine-understandable language, i.e., a programming language (Java, C, C++, Python, R, and many more).
Some techniques we need to understand:
For example, take the text sequence "Diamonds were first mined in India".
(General understanding) Tokenization is the task of breaking a sentence into fragments separated by whitespace; certain characters are often removed, such as punctuation and emoticons. These fragments are called tokens.
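As a minimal sketch of that idea in plain Python (no NLP library; the variable names are my own), we can strip punctuation and then split on whitespace:

```python
import string

# a sample sentence with punctuation to strip
sent = "Diamonds were first mined in India!"

# remove punctuation characters, then split on whitespace
cleaned = sent.translate(str.maketrans("", "", string.punctuation))
tokens = cleaned.split()
print(tokens)  # ['Diamonds', 'were', 'first', 'mined', 'in', 'India']
```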
- Tokens composed of one word -> uni-grams
- Tokens composed of two consecutive words -> bi-grams
- Tokens composed of three consecutive words -> tri-grams
What are n-grams? An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. The n-grams are typically collected from a text.
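A small sketch of how n-grams over word tokens could be generated (the helper name `ngrams` is my own, not from the article):

```python
def ngrams(tokens, n):
    # slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Diamonds were first mined in India".split()
print(ngrams(tokens, 1))  # uni-grams: [('Diamonds',), ('were',), ...]
print(ngrams(tokens, 2))  # bi-grams:  [('Diamonds', 'were'), ('were', 'first'), ...]
print(ngrams(tokens, 3))  # tri-grams: [('Diamonds', 'were', 'first'), ...]
```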
s = "Diamonds were first mined in India"
We import the necessary packages and take one sample sentence.
The function shown in the figure works as follows:
sent.lower() converts the sentence to lower case. Next, the figure calls
re.sub(r'[^a-zA-Z0-9\s]', ' ', sent)
- [^a-zA-Z0-9]: matches any character excluding a to z, A to Z, and 0 to 9.
- \s: matches a single whitespace character (space, newline, return, tab, form feed); its opposite, \S, matches any non-whitespace character.
The rest of the code will be familiar to anyone comfortable with functions.
This is the small piece of tokenization code I have tried to explain.
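Putting the steps above together, the routine might look like this (a sketch reconstructing the code from the figure; the function name `tokenize` is my own):

```python
import re

def tokenize(sent):
    sent = sent.lower()                          # convert to lower case
    sent = re.sub(r'[^a-zA-Z0-9\s]', ' ', sent)  # replace anything that is not a letter, digit, or whitespace
    return sent.split()                          # split on whitespace into tokens

s = "Diamonds were first mined in India"
print(tokenize(s))  # ['diamonds', 'were', 'first', 'mined', 'in', 'india']
```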
The next blog on NLP will cover POS tagging, parsing, and named entity recognition.
For further reading on regular expressions and NLP, please see:
- NLP with Python: http://www.nltk.org/book/
*************** Have a wonderful read! ***************