Sentiment Analysis - A Solution Overview





Date: Friday, July 02, 2010

Introduction

User-generated content on the web has grown manyfold in the last few years, and much of it takes the form of reviews, commentaries, ratings and, now, tweets. Users express their opinions through these forms. The tasks of identifying opinions, monitoring them, summarizing them and organizing them are collectively termed sentiment analysis or opinion mining.

Sentiment analysis is of real value to companies managing their brands and reputation. Traditionally, brand and reputation management has been done via surveys, focus groups and user conferences. While these are not likely to go away in the near future, the ability to monitor brands in real time is a value-add that cannot be overestimated.

Sentiment analysis involves elements of natural language processing (NLP), text mining, machine learning and data analytics. Research in opinion mining has been ongoing for several years, and many models and techniques have been proposed. The theory is well understood, and tools and solutions are available to implement a sentiment analysis system. Companies in the text analytics area are usually the first to come up with such solutions, but there is an increasing presence of new startup firms creating a buzz in this domain.

In this article we do not delve into the theory and algorithms involved in sentiment analysis. Instead, we look at the entire process, from identifying the opinion sources to visualizing the results.

Solution Overview

Mining opinions is different from regular text mining because direct keywords cannot be used to search for opinions. Sentiments or opinions are usually not directly evident but latent. Though sentiments themselves are domain independent, the words used to describe them can vary from domain to domain.
Sentiments are expressed in different ways: as overall scores (star ratings), as pros and cons on features or aspects of the object of opinion, as rants and raves, and so on. Opinions are subjective and often comparative in nature, and multiple opinions may be expressed in a single passage. The process of identifying, classifying, extracting and summarizing opinions is therefore unique and demanding enough that specialized systems are needed.

The following steps describe the process of sentiment analysis. The process is independent of the methods employed for analyzing sentiments: the steps describe what the process is, not how it is implemented. The steps are:

1. Identifying and classifying data sources: Deals with the different structures of opinion-based text, such as reviews, editorials and news, based on the sources of opinions.
2. Retrieval and Storage of Data: Deals with the challenges of handling very large volumes of data and of extracting sentiment phrases from text passages in the identified data sources.
3. Sentiment Classification: Involves categorizing sentiments as positive, negative or neutral.
4. Sentiment Summarization: The sentiments are summarized into aggregate scores for positive and negative orientation, along with relevant snippets.
5. Visualization: The final step involves building dashboards and trackers that help the user segment and view the data in a useful manner, to get better insights into the sentiments being expressed.

Identifying and classifying data sources:

Opinions appear all over the web, and some sites are more reputed than others. Formats differ, and opinions are often interspersed with commerce or other types of content. Separating opinions from other types of text is referred to as Genre Classification. At this stage the orientation of the opinion is not known, only that the passage contains opinions.
Opinions are usually subjective statements, and techniques such as bag of terms and part-of-speech (POS) taggers are used for this classification. A study by Finn et al. [i] demonstrates that POS tagger methods were superior to other methods. Data sources can also be classified by their reputation, relevance and popularity. A weight can then be ascribed to each source and used when scoring opinions.

Retrieval and Storage of Data:

Spiders and crawlers are the most common and scalable way of retrieving content. Focused or topical crawlers may be employed when we are looking for opinions on a certain set of topics. Once the data has been retrieved, it needs to be stored, and though disk space is cheap, we can end up with terabytes of data that drive up storage costs very quickly. The entire retrieval and storage process could be managed on a cloud platform such as Amazon EC2. The pre-processing and cleansing of data can be done in the cloud, and only the relevant data, i.e. the opinions, need be stored locally.

Sentiment Classification:

After we have extracted opinions from the crawled text, we come to a key step: determining the semantic orientation of the sentiment being expressed, i.e. whether it is negative, positive or neutral. When it is not neutral, we also need to grade it to see how positive or negative it is. In a typical opinion passage, several sentiments are expressed on several aspects or features of the opinion target, and most classifiers work at the phrase or sentence level rather than at the word or document level.

NLP or machine learning based approaches are used in this step. NLP techniques involve using opinion words, detecting subjective parts of speech with a POS tagger, and building sentiment lexicons. A sentiment lexicon differs from lexicons such as WordNet by including the semantic orientation of adjectives, making them opinion words.
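As a minimal sketch of the lexicon-based approach described above, the following Python scores a sentence by summing the orientations of the opinion words it contains. The toy LEXICON, the NEGATORS set, the threshold and the function names are all illustrative assumptions rather than part of any published lexicon or tool. A real system would load a full sentiment lexicon and rely on a POS tagger instead of plain word matching.

```python
import re

# Hypothetical toy lexicon: opinion word -> semantic orientation in [-1.0, 1.0].
LEXICON = {
    "good": 0.6,
    "great": 0.8,
    "excellent": 1.0,
    "poor": -0.6,
    "bad": -0.7,
    "terrible": -1.0,
}

# Words assumed (for this sketch only) to flip the orientation of the next word.
NEGATORS = {"not", "never", "no"}


def score_sentence(sentence: str) -> float:
    """Sum the orientations of opinion words, flipping the sign after a negator."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    total = 0.0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        if token in LEXICON:
            value = LEXICON[token]
            total += -value if negate else value
        # Simplification: a negator only affects the word immediately after it.
        negate = False
    return total


def classify(sentence: str, threshold: float = 0.1) -> str:
    """Map the numeric score to positive, negative or neutral."""
    score = score_sentence(sentence)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

Calling classify("The screen is not good") yields "negative" under these assumptions, while a sentence with no lexicon words is treated as neutral. The numeric score doubles as a rough grade of how positive or negative the sentence is.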
SentiWordNet [ii] is one such sentiment lexicon, and it is publicly available. When detecting sentences with opinions, special patterns such as an adjective (JJ) followed by a noun (NN) are used for pattern matching.

While NLP methods are rule-based, machine learning methods use probabilistic classifiers such as Naïve Bayes and large-margin classifiers such as Support Vector Machines. Sentiment classification is treated as a special type of topic classification, and by applying it to more than just single words (bi-grams, tri-grams, n-grams), classification of sentiments becomes possible.

Classification of product reviews further classifies sentiments by the features on which the opinion is expressed. In such cases an additional step is introduced in which the main features themselves are discerned, and POS tagging (nouns and noun phrases) can be used for this purpose. In addition to the orientation, we also need to note the level of the sentiment. This is useful for qualitative analysis as well as for attaching a numeric grade to the sentiment.

Sentiment Summarization:

In this step the classified sentiments are aggregated by orientation, i.e. all negative sentiments are aggregated into one summary and all positive sentiments into another. A qualitative, representative set of sentences is also presented as a snapshot along with the aggregation. We can also attach a score, such as a numeric rating, to the summary if the individual sentiments carry grades.

Visualization:

While summarization gives us an overall score and sentiment, it can be short on detail, and certain angles and perspectives will not be evident in a simple summary. Visualization helps us get better insights into the sentiments. The views are very similar to traditional OLAP views. We can create dashboards and drill-down charts along dimensions such as geography and customer segment. We can also incorporate tracking tools such as trend graphs and alerts (e.g. on a sudden spurt in negative publicity).
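The summarization step described above can be sketched as a simple aggregation. This is only an illustration under stated assumptions: the Sentiment record, its field names, the [0, 1] grade range and the choice of the most strongly graded snippets as the representative snapshot are all hypothetical, not a prescribed design.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sentiment:
    text: str         # the opinion snippet
    orientation: str  # "positive", "negative" or "neutral"
    grade: float      # strength of the sentiment, assumed to lie in [0, 1]


def summarize(sentiments, max_snippets=2):
    """Group sentiments by orientation; report count, mean grade and top snippets."""
    groups = defaultdict(list)
    for s in sentiments:
        groups[s.orientation].append(s)

    summary = {}
    for orientation, items in groups.items():
        # The most strongly graded snippets serve as the representative snapshot.
        strongest = sorted(items, key=lambda s: s.grade, reverse=True)
        summary[orientation] = {
            "count": len(items),
            "mean_grade": mean(s.grade for s in items),
            "snippets": [s.text for s in strongest[:max_snippets]],
        }
    return summary
```

Passing a list of Sentiment records to summarize returns one entry per orientation that actually occurs, which a dashboard or trend graph could then consume. Weighting each record by the reputation of its source would be a natural extension, but it is left out here to keep the sketch short.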
To support dashboard and drill-down views of this kind, we need to ensure that the sentiment object stored in the database is annotated with meta-information such as the source of the review, the topic or product it belongs to, and other properties that can be used in the business analysis.

The Technologies and Tools

As mentioned at the start of this document, sentiment analysis encompasses several areas and is a complex process, but there are tools available to speed up development. Sentiment analysis services and products are available in the market, and using them may be the optimal path for companies whose core business is not analytics or information technology. ScoutLabs, Attensity and Lexalytics are a few of the names in this area. For organizations that want to build their own sentiment analyzer tools, APIs such as LingPipe, OpenNLP and the Evri Sentiment web APIs are available. The Apache Lucene ecosystem (Nutch, Hadoop and Mahout) would be a compelling choice for data retrieval and pre-processing as well as for running scalable machine learning algorithms. Flex is a well-proven technology for building the widgets needed for the business analysis views.

Conclusion

While sentiment analysis is a very hot topic in text analytics and research has been ongoing for several years, it is the advent of user-generated content, which has changed the dynamics of brand management, that has pushed this technology to the forefront. Though these systems add real value, they are not highly accurate at this time. However, the momentum in both academic research and commercial applications suggests that steady improvements can be expected. The underlying concepts and technologies are complex, and even though there are tools and technologies for each component of such a system, building an integrated solution requires subject matter expertise in areas such as machine learning and NLP.
Unless a company’s core competency lies in these areas, it may be better to buy than to build. While text analytics companies offer sentiment analysis systems, these are not off-the-shelf solutions and need to be customized. Given the complexity and the evolving nature of this field, partnering with integrators and solution developers could be a quick way for most companies to bring up a solution.

The author is the Senior Architect and Head of SPE (Software Product Engineering) Labs, MindTree