Developing text processing data products: Part II

I want to share a few challenges I have run into. The main challenge when working with text is that text data is usually a complex collection of words and sentences whose meaning and interpretation can differ by geographic location. Not to mention the formatting and special characters that we have to deal with.

Although text can be extracted from the web in XML and JSON formats, several stages of cleaning and processing still need to be done before we can even begin to analyze it. If we skip this step, we end up with “garbage in, garbage out”.
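To make the cleaning stage a bit more concrete, here is a minimal sketch using only the Python standard library. It is not a complete pipeline, just the kind of steps I have in mind: decoding HTML entities, dropping leftover tags, normalizing Unicode, and collapsing whitespace. The JSON payload, its "body" field, and the `clean_text` helper are all made up for illustration.

```python
import html
import json
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize a raw text snippet pulled from a web payload (illustrative helper)."""
    text = html.unescape(raw)                   # decode entities such as &amp; and &nbsp;
    text = re.sub(r"<[^>]+>", " ", text)        # replace leftover HTML tags with a space
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters (e.g. no-break space)
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace, tabs, newlines
    return text.strip()


# Hypothetical JSON payload; the "body" field name is an assumption.
payload = '{"body": "<p>Text  data &amp; <b>special</b> characters\\n\\tneed cleaning.</p>"}'
record = json.loads(payload)
cleaned = clean_text(record["body"])
print(cleaned)  # Text data & special characters need cleaning.
```

Real-world data usually needs more than this (encoding fixes, language-specific rules, and so on), and a regular expression is only a rough way to strip tags, so treat the steps above as a starting point.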

After data cleansing, one type of analysis we can perform is part-of-speech (PoS) tagging. Essentially, PoS tagging labels each word in a sentence with its part of speech: noun, verb, adverb, conjunction, etc. Once the PoS analysis is done, we can use the output for various applications, for instance counting words to find the most frequently occurring noun, or investigating the sentiment of a text.
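As a rough illustration of the noun-counting idea, the sketch below assumes the tagging has already been done and that the tagger returns (word, tag) pairs using Penn Treebank style tags, where noun tags start with "NN". The tagged sentence is hand-written for this example rather than real tagger output, and `most_common_nouns` is a hypothetical helper name.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for tagger output.
# Assumption: Penn Treebank style tags (NN, NNS, NNP, NNPS are nouns).
tagged = [
    ("The", "DT"), ("parser", "NN"), ("reads", "VBZ"), ("text", "NN"),
    ("and", "CC"), ("tags", "VBZ"), ("the", "DT"), ("text", "NN"),
    ("quickly", "RB"), (".", "."),
]


def most_common_nouns(tagged_tokens, n=3):
    """Return the n most frequent nouns as (word, count) pairs."""
    nouns = [word.lower() for word, tag in tagged_tokens if tag.startswith("NN")]
    return Counter(nouns).most_common(n)


top = most_common_nouns(tagged)
print(top)  # [('text', 2), ('parser', 1)]
```

If a package uses a different tag set, the `startswith("NN")` check would need to change to match it.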

Extracting the information is one challenge. Finding the right software/package is another. I am not trying to market any package/API here, but it appears that each package can have its own style of interpretation and produce different output. Let's look at some of these variations soon. More later.
