text tagging

developing text processing data products: part I

Folks, this is going to be a series of pieces (more than one blog post about the same topic) about text processing. In this series, I intend to discuss some of my experiences and also take the opportunity to organize the discussion on my blog. In the past, I touched upon some text processing recipes purely from an application point of view; however, I have spent much of my career working with text content in automation and analytics, and I owe it to myself to write more on text processing. Without further ado, let's jump into the series.

Text processing is extracting information that can be used for decision making from a text, say a book, paper, or article — WITHOUT HAVING TO READ IT. I used to read the newspaper to find out the weather forecast (five years ago), but these days if we need information about something, we type it into a search engine to get results (and ads). But imagine someone who needs information on a day-to-day basis for decision making, where he or she must read many articles, news pieces, or books in a short period of time. Information is power, but timing matters too. Hence the need for text processing data products.

Some basic things that we can do with text data:

Crawling: extracting data/content from a website using a crawler.

Tokenization: the process of splitting a string into small units (tokens) in an array, typically based on the whitespace between words.

Stemming: reducing a word to a shorter base form, like the root of the word, so it can be used to search for all of its variations.

Stop word removal: straightforward; removing frequently occurring words like a, an, and the.

Parsing: the process of breaking down a sentence into a set of phrases and further breaking down each phrase into noun, verb, adjective, etc. Closely related to part-of-speech tagging. Cool topic.
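The first few of these steps can be sketched in plain Python. This is a toy illustration only — the function names, the suffix list, and the tiny stop word set are my own inventions for the example; a real pipeline would use a library such as NLTK, with a proper stemmer like Porter's.

```python
# Toy sketches of tokenization, stop word removal, and stemming.
# All names and rules here are illustrative assumptions, not a real library.

def tokenize(text):
    # Split a string into tokens based on the whitespace between words.
    return text.split()

STOP_WORDS = {"a", "an", "the", "is", "of"}

def remove_stop_words(tokens):
    # Drop frequently occurring function words.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(word):
    # Naive suffix-stripping stemmer (a toy, not the Porter algorithm).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = tokenize("the engines are processing the texts")
filtered = remove_stop_words(tokens)
stems = [stem(t) for t in filtered]
print(filtered)  # ['engines', 'are', 'processing', 'texts']
print(stems)     # ['engin', 'are', 'process', 'text']
```

Note how crude stemming maps "engines" and (hypothetically) "engine" toward the same root, which is exactly what makes searching across word variations possible.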

These are just my experiences and my knowledge, and as always I try to write in a way that anyone can read and understand. More later.


Text extraction and tagging using Natural Language Processing

Another beautiful product that I came across is Open Calais. It is a web service offered by Thomson Reuters that lets the user automatically extract semantic information (metadata) from a given story or text document. The following is a demo that gives a clear understanding of what Open Calais is capable of doing.

We can work with Open Calais through its API. First, the user needs to register with Open Calais to receive an authentication key. We need the auth key to connect remotely to the Open Calais server so we can use the web service.

Your system needs simplejson. For our ease, we can use python-calais, a Python wrapper class for the API.
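If these dependencies are not already on your system, something like the following should work — assuming pip is available, and noting that the exact package name for the wrapper is an assumption (it can also be downloaded directly from its project page):

```shell
# Install the JSON library the wrapper depends on
pip install simplejson
# Install the python-calais wrapper (package name is an assumption;
# older versions were distributed as a standalone calais.py module)
pip install python-calais
```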

# Interacting with the Python interface to the OpenCalais API
# Author: Pavan Narayanan (pavankn11@gmail.com), Nov 2013
# This program uses the Calais class to extract information and produce
# summary output on text (Python 2 syntax, matching the python-calais wrapper)

# The Calais class submits text to the web service; its analyze methods
# return CalaisResponse objects carrying the extracted metadata
from calais import Calais

# Authenticate with the OpenCalais API using the authentication key
authkey = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
calais = Calais(authkey, submitter="Pavan Narayanan")

# Use analyze if you have a single statement to extract terms from (like a tweet)
termExtract = calais.analyze("George Bush was the President of the United States of America until 2009. Barack Obama is the new President of the United States now.")

# Use analyze_url to extract all the important information from a web link
termExtractURL = calais.analyze_url("http://www.india-briefing.com/news/india-deepens-trade-investment-russia-7041.html")

def termExtractPrint():
    # Print the metadata extracted from the plain-text input
    termExtract.print_summary()
    termExtract.print_topics()
    termExtract.print_relations()
    termExtract.print_entities()

def termExtractURLPrint():
    # Print the metadata extracted from the web link
    termExtractURL.print_summary()
    termExtractURL.print_topics()
    termExtractURL.print_relations()
    termExtractURL.print_entities()

print "\nThe terms extracted for the given tweet/string are:"
termExtractPrint()

print "\nThe terms extracted for the given URL are:"
termExtractURLPrint()

We get the following output when testing with this web link:

http://www.thehindu.com/news/national/russias-latest-addition-to-indian-military-might/article5358813.ece

The output:

SUMMARY:
Calais Request ID: 7a2abbfb-2f5e-3789-1426-a3b110f33492
External ID: http://www.thehindu.com/news/national/russias-latest-addition-to-indian-military-might/article5358813.ece
Title:
Language: English
Extractions:
21 entities
1 topics
1 language
9 relations
TOPICS:
Politics
ENTITIES:
City: Moscow (0.21)
Country: India (0.73)
Organization: Navy (0.26)
IndustryTerm: protracted carrier reconstruction project (0.26)
Organization: India-Russia Intergovernmental Commission on Military Technical Cooperation (0.21)
Person: A.K. Antony (0.74)
Country: Russia (0.73)
Organization: Russian government (0.33)
PublishedMedium: The Hindu (0.06)
ProvinceOrState: Karnataka (0.06)
Person: D.K. Joshi (0.07)
IndustryTerm: carrier deck (0.07)
Product: MiG-29k (0.07)
Position: Admiral (0.07)
Organization: Indian Navy (0.41)
Position: Chief (0.07)
Position: Defence Minister (0.41)
IndustryTerm: carrier project (0.16)
IndustryTerm: newly-inducted aircraft carrier (0.31)
Organization: Russian Navy (0.26)
City: Sevmash Shipyard (0.52)