opencalais

Text extraction and tagging using Natural Language Processing

Another beautiful product that I came across is Open Calais. It is a web service offered by Thomson Reuters that lets the user automatically extract semantic information (metadata) from a given story or text document. The following demo gives a clear picture of what Open Calais can do.

We can work with Open Calais through its API. First, the user needs to register with Open Calais to receive an authentication key. We need this key to connect remotely to the Open Calais server and use the web service.
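Since the key is a personal credential, it may be safer to keep it out of the script. The following is only a sketch of one way to do that, reading the key from an environment variable; the variable name `OPENCALAIS_API_KEY` and the helper `load_auth_key` are my own assumptions and are not part of Open Calais or its Python wrapper.

```python
import os

# Hypothetical variable name chosen for this sketch; Open Calais does not mandate one.
ENV_VAR = "OPENCALAIS_API_KEY"


def load_auth_key(environ=None):
    """Return the key stored under ENV_VAR, or raise if it is missing or blank."""
    environ = os.environ if environ is None else environ
    key = environ.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError("Set %s to your Open Calais key" % ENV_VAR)
    return key
```

With this in place, `authkey = load_auth_key()` could stand in for a key typed directly into the program.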

Your system needs simplejson installed. For convenience, we can use python-calais, the Python wrapper class for Calais.

# Interacting with Python interface to the OpenCalais API
# Author: Pavan Narayanan (pavankn11@gmail.com), Nov 2013
# This program will make use of Calais and CalaisResponse classes to extract information and produce summary output on text

# Calais class will extract information from the given source — a tweet or a news piece from a weblink
from calais import Calais

# Authenticating the OpenCalais API through the authentication key
authkey = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
calais = Calais(authkey, submitter="Pavan Narayanan")

# use analyze if you have one statement to extract terms (like a tweet)
termExtract = calais.analyze("George Bush was the President of the United States of America until 2009. Barack Obama is the new President of the United States now.")

# use analyze_url to extract all important information from the weblink
termExtractURL = calais.analyze_url("http://www.india-briefing.com/news/india-deepens-trade-investment-russia-7041.html")

# CalaisResponse will print information after analyzing text information parsed earlier
from calais import CalaisResponse

def termExtractPrint():
    termExtract.print_summary()
    termExtract.print_topics()
    termExtract.print_relations()
    termExtract.print_entities()

def termExtractURLPrint():
    termExtractURL.print_summary()
    termExtractURL.print_topics()
    termExtractURL.print_relations()
    termExtractURL.print_entities()

print("\nThe Terms Extracted for the tweet/given string are")
termExtractPrint()

print("\nThe Terms Extracted for the given URL are")
termExtractURLPrint()
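The two printing functions above differ only in the object they act on, so a single helper that takes the response as a parameter would do the same job. Here is a minimal sketch of that idea. The names `print_all` and `FakeResponse` are hypothetical and exist only so the example runs without a key or a network connection; the four method names are the ones used in the program above.

```python
def print_all(response):
    """Call the four print methods used above on any response object."""
    for method_name in ("print_summary", "print_topics",
                        "print_relations", "print_entities"):
        getattr(response, method_name)()


class FakeResponse(object):
    """Offline stand-in for a Calais response; it only records which methods ran."""

    def __init__(self):
        self.calls = []

    def print_summary(self):
        self.calls.append("summary")

    def print_topics(self):
        self.calls.append("topics")

    def print_relations(self):
        self.calls.append("relations")

    def print_entities(self):
        self.calls.append("entities")


fake = FakeResponse()
print_all(fake)
print(fake.calls)
```

In the real program, `print_all(termExtract)` and `print_all(termExtractURL)` would take the place of the two separate functions.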

Testing analyze_url with the following web link gives the output shown below.

http://www.thehindu.com/news/national/russias-latest-addition-to-indian-military-might/article5358813.ece

The output:

SUMMARY:
Calais Request ID: 7a2abbfb-2f5e-3789-1426-a3b110f33492
External ID: http://www.thehindu.com/news/national/russias-latest-addition-to-indian-military-might/article5358813.ece
Title:
Language: English
Extractions:
21 entities
1 topics
1 language
9 relations
TOPICS:
Politics
ENTITIES:
City: Moscow (0.21)
Country: India (0.73)
Organization: Navy (0.26)
IndustryTerm: protracted carrier reconstruction project (0.26)
Organization: India-Russia Intergovernmental Commission on Military Technical Cooperation (0.21)
Person: A.K. Antony (0.74)
Country: Russia (0.73)
Organization: Russian government (0.33)
PublishedMedium: The Hindu (0.06)
ProvinceOrState: Karnataka (0.06)
Person: D.K. Joshi (0.07)
IndustryTerm: carrier deck (0.07)
Product: MiG-29k (0.07)
Position: Admiral (0.07)
Organization: Indian Navy (0.41)
Position: Chief (0.07)
Position: Defence Minister (0.41)
IndustryTerm: carrier project (0.16)
IndustryTerm: newly-inducted aircraft carrier (0.31)
Organization: Russian Navy (0.26)
City: Sevmash Shipyard (0.52)
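
The number in parentheses after each entity appears to be its relevance score, so a quick way to focus on what the story is mainly about is to keep only the high-scoring lines. The sketch below works on the printed text itself, using a few lines copied from the output above. The helper names `parse_entities` and `most_relevant` and the 0.5 threshold are my own choices for illustration, not something Open Calais prescribes.

```python
import re

# Matches printed entity lines such as "Country: India (0.73)".
ENTITY_LINE = re.compile(
    r"^(?P<type>\w+): (?P<name>.+) \((?P<relevance>\d+\.\d+)\)$")


def parse_entities(text):
    """Return (type, name, relevance) tuples for every entity line in text."""
    entities = []
    for line in text.splitlines():
        match = ENTITY_LINE.match(line.strip())
        if match:
            entities.append((match.group("type"),
                             match.group("name"),
                             float(match.group("relevance"))))
    return entities


def most_relevant(entities, threshold):
    """Keep entities scoring at least threshold, highest score first."""
    kept = [entity for entity in entities if entity[2] >= threshold]
    return sorted(kept, key=lambda entity: entity[2], reverse=True)


# A few lines copied from the output above.
sample = """City: Moscow (0.21)
Country: India (0.73)
Person: A.K. Antony (0.74)
Country: Russia (0.73)
PublishedMedium: The Hindu (0.06)"""

for entity_type, name, relevance in most_relevant(parse_entities(sample), 0.5):
    print("%s: %s (%.2f)" % (entity_type, name, relevance))
```

Lines that are not entities, such as the SUMMARY and TOPICS sections, do not match the pattern and are skipped.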
