Philip Resnik on natural language processing (NLP) and sentiment analysis.
My worry is compounded by the fact that social media sentiment analyses are being presented without the basic caveats you invoke in related polling scenarios. When you analyze social media you have not only a surfeit of conventional accuracy concerns like sampling error and selection bias (how well does the population of people whose posts you're analyzing represent the population you're trying to describe?), but also the problem of "automation bias" — in this case trusting that the automatic text analysis is correct. Yet the very same news organization that reports traditional opinion poll results with error bars and a careful note about the sample size will present Twitter sentiment analysis numbers as raw percentages, without the slightest hint of qualification.
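For readers who have not seen how a poll's error bar is derived, here is a minimal sketch of the standard normal-approximation margin of error for a sample proportion. The 52% figure and the sample of 1,000 posts are hypothetical numbers chosen for illustration, and the calculation assumes a simple random sample, which a collection of social media posts usually is not.

```python
import math

def margin_of_error(proportion, sample_size, z=1.96):
    """Half-width of the normal-approximation confidence interval
    for a sample proportion (z=1.96 corresponds to about 95%)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical figures: 52% of 1,000 sampled posts classified as positive.
moe = margin_of_error(0.52, 1000)
print(f"52% +/- {100 * moe:.1f} points")  # 52% +/- 3.1 points
```

This interval covers sampling error only. It says nothing about selection bias or about mistakes made by the automatic classifier, so the real uncertainty in a social media sentiment figure is larger than this number suggests.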
What's the alternative? Twenty years ago the NLP community broke past the failures of the knowledge engineering era by making a major methodological shift to machine learning and statistical approaches. Instead of building expert knowledge into systems manually, we discovered the power of having human experts annotate or label language data, allowing a supervised learning system to train on examples of the inputs it will see, paired with the answers we want it to produce. (We call such algorithms "supervised" because the training examples include the answers we're looking for.) Today's state-of-the-art NLP still incorporates manually constructed knowledge prudently where it helps, but it is fundamentally an enterprise driven by labeled training data. As Pang and Lee discuss in their widely read survey of the field, sentiment analysis is no exception, and it has correspondingly seen "a large shift in direction towards data-driven approaches", including a "very active line of work" applying supervised text categorization algorithms.
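To make the supervised setup concrete, here is a minimal sketch of a text categorizer trained on labeled examples. It uses multinomial Naive Bayes with add-one smoothing, which is one common choice and not necessarily what any particular system uses. The four training sentences and the function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_examples):
    """Collect per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in labeled_examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in text.lower().split():
            if token in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical human-annotated data: inputs paired with the desired answers.
TRAINING = [
    ("great phone love it", "positive"),
    ("love the battery great screen", "positive"),
    ("terrible service hate it", "negative"),
    ("hate the screen terrible battery", "negative"),
]

model = train_naive_bayes(TRAINING)
print(predict(model, "love this great phone"))     # positive
print(predict(model, "terrible battery hate it"))  # negative
```

The classifier knows nothing about language beyond the word statistics in its labeled examples, which is why the quantity and coverage of annotated data matter so much.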
Nonetheless, I've argued recently that NLP's first statistical revolution is now being followed by a second technological revolution, one driven in large part by the needs of large-scale social media analysis. The problem is that, faced with an ocean of wildly diverse language, there's no way to annotate enough training data so that supervised machine learning systems work well on whatever you throw at them. As a result, we are seeing the rise of semi-supervised methods. These let you bootstrap your learning using smaller quantities of high-quality annotated training examples (that's the "supervised"), together with lots of unannotated examples of the inputs your system will see (that's the "semi").
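One simple way to bootstrap from a small labeled set is self-training: train on the labeled seed, label the unannotated examples the model is confident about, add them to the training set, and repeat. The sketch below assumes that technique (semi-supervised learning covers many others) on top of a small Naive Bayes classifier. The seed sentences, the unlabeled pool, the function names, and the 0.6 confidence threshold are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Collect per-label word counts from (text, label) pairs."""
    word_counts, label_counts, vocab = defaultdict(Counter), Counter(), set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_with_confidence(model, text):
    """Return (best label, posterior probability of that label)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, count in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(count / total)
        for token in text.lower().split():
            if token in vocab:  # unseen words carry no evidence
                score += math.log((word_counts[label][token] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def self_train(labeled, unlabeled, threshold=0.6, max_rounds=5):
    """Repeatedly move confidently labeled texts into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, remaining = [], []
        for text in pool:
            label, confidence = predict_with_confidence(model, text)
            if confidence >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = remaining
    return train(labeled), pool

# Hypothetical data: two annotated seeds plus a pool of unannotated text.
SEED = [("great phone love it", "positive"),
        ("terrible service hate it", "negative")]
UNLABELED = ["love the gorgeous display", "great gorgeous screen",
             "hate the sluggish display", "terrible sluggish screen",
             "package arrived today"]

model, leftover = self_train(SEED, UNLABELED)
print(predict_with_confidence(model, "sluggish")[0])  # negative
print(leftover)                                       # ['package arrived today']
```

The seed-only model has never seen "sluggish" and can only guess, while the self-trained model has picked up the word from unannotated text. The same mechanism can also amplify early mistakes, since a wrong confident label gets treated as truth in later rounds, which is one reason the quality of the seed annotations matters.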
As for sentiment analysis, by all means, let's continue to be excited about bringing NLP to the masses, and let's get them excited about it, too. But at the same time, let's avoid extravagant claims about computers understanding the meaning of text or the intent behind it. At this stage of the game, machine analysis should be a tool to support human insight, and its proper use should involve a clear recognition of its limitations.