Hi there! We’re pleased to announce the release of our first Dossier. Each Dossier will contain our suggested sources on the subjects we like the most: Text Mining, Computer Vision, Machine Learning, Speech Technologies, High Performance Computing and Big Data. You can expect a couple of Dossiers each month.
We suggest you bookmark each Dossier so you can return to it whenever you need essential documents, papers and articles on the technologies that will shape the future of the digital world.
Enjoy, and give your knowledge a boost.
General & Introductory
|Bo Pang and Lillian Lee|
|Why is this here? This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. The focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. It includes material on summarization of evaluative text and on broader issues regarding privacy, vulnerability to manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.|
|Why is this here? This is the first text I ever read about Opinion Mining. Chapter eleven of this book is a short summary of the topic and helps in understanding the fundamentals. The book also offers insight into several related topics, which helps in acquiring the basic concepts.|
|Why is this here? A comprehensive introductory and survey text. It covers all important topics and the latest developments in the field with over 400 references. It is suitable for students, researchers and practitioners who are interested in social media analysis in general and sentiment analysis in particular. A good starting point if you want to get the big picture.|
Text processing & Feature Extraction
|Why is this here? Almost all text classification technologies in use today are enabled either by statistical methods, such as Machine Learning algorithms, or by symbolic ones (which require much more expert knowledge). Machine Learning methods work on numeric feature vectors, so when working with text we need a pre-processing step that transforms the string representation into a numeric feature vector. By far the most common transformation is the ‘bag of words’. One can optionally perform feature selection to be more discriminating about which words to provide as input to the learning algorithm, which can help improve accuracy and scalability. This paper describes and benchmarks a variety of feature selection methods, which is useful when the time comes to choose one.|
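
To make the bag-of-words and feature selection steps described above concrete, here is a minimal sketch in plain Python. It is only an illustration under our own assumptions: the helper names (`bag_of_words`, `chi_square`, `select_top_k`), the toy reviews and the choice of the chi-square statistic are ours, and the paper benchmarks other selection metrics as well.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, count vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for word, count in Counter(toks).items():
            vec[index[word]] = count
        vectors.append(vec)
    return vocab, vectors


def chi_square(vectors, labels, j):
    """Chi-square score between the presence of feature j and a binary label."""
    n = len(vectors)
    a = sum(1 for v, y in zip(vectors, labels) if v[j] > 0 and y == 1)
    b = sum(1 for v, y in zip(vectors, labels) if v[j] > 0 and y == 0)
    c = sum(1 for v, y in zip(vectors, labels) if v[j] == 0 and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_top_k(vocab, vectors, labels, k):
    """Keep the k words with the highest chi-square score (ties broken alphabetically)."""
    scored = [(chi_square(vectors, labels, j), w) for j, w in enumerate(vocab)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [w for _, w in scored[:k]]


# Toy data, invented for illustration: 1 = positive review, 0 = negative review.
docs = ["great great room", "great service", "terrible room", "terrible noisy service"]
labels = [1, 1, 0, 0]
vocab, vectors = bag_of_words(docs)
print(select_top_k(vocab, vectors, labels, 2))  # ['great', 'terrible']
```

Words such as "room" and "service" appear equally often in both classes, so they score zero and are dropped, which is the kind of pruning feature selection is meant to achieve.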
Classification Algorithms & Analysis
|Bin Wang, Bruce Spencer, Charles X. Ling and Harry Zhang|
|Why is this here? When it comes to text classification using Machine Learning, the most successful and widely used techniques have been supervised ones. This means we need a lot of annotated data to feed our algorithms, so they become dependent on how much data we have and on how well that data represents the universe of possible pieces of natural language text (or a sub-universe, if we are working in a certain domain). Obtaining annotated data may mean crafting it with the help of domain experts (for example, linguists), which can be a very costly task. This is why many people have tried over time to reduce the amount of supervised data without reducing the accuracy of the algorithms. This paper presents two contributions in the area of Semi-supervised Classification for Subjectivity that can achieve performance comparable to supervised learning models. This is a basic approach (consider the date of the publication), and nowadays people are exploring unsupervised approaches using deep learning.|
|Michael Wiegand and Dietrich Klakow|
|Why is this here? This paper shows the effect of chaining supervised classifiers with an open-domain rule-based classifier whose output serves as their labeled training data. The benefit of this method is that no labeled training data is required, so it presents another way of avoiding the task of getting annotated data. Still, it allows in-domain knowledge to be captured by training the supervised classifier on in-domain features, and thus the resulting self-trained classifier is usually significantly better than the open-domain classifier by itself. This publication also compares these kinds of methods with Semi-supervised approaches.|
|Yingcai Wu, Furu Wei, Shixia Liu, Norman Au, Weiwei Cui, Hong Zhou, and Huamin Qu|
|Why is this here? Unlike factual information, opinions and sentiment data are essentially subjective. An opinion from a single holder is usually not sufficient for drawing conclusions or taking action. This means that one usually wants to analyze quite a large amount of data and present it in a summarized way. There are many techniques for this task, from simple charts to complex visualization systems. Since basic approaches might sound trivial, I wanted to show how deep one can go in this field. OpinionSeer is an interactive visualization system that can visually analyze a large collection of online hotel customer reviews. The system is built on a visualization-centric opinion mining technique that considers uncertainty in order to model and analyze customer opinions faithfully. To provide multiple-level exploration, the authors introduce subjective logic to handle and organize subjective opinions with degrees of uncertainty.|
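
To give a feel for what "opinions with degrees of uncertainty" can mean, here is a small sketch of a binomial opinion in subjective logic, built from counts of positive and negative evidence. This is the textbook evidence mapping and not necessarily the formulation used in OpinionSeer; the names `Opinion` and `opinion_from_evidence`, the prior weight of 2 and the example counts are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Binomial opinion: belief + disbelief + uncertainty == 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5  # prior probability used when there is no evidence

    def expected(self):
        """Expected probability: belief plus the base-rate share of the uncertainty."""
        return self.belief + self.base_rate * self.uncertainty


def opinion_from_evidence(positive, negative, prior_weight=2.0, base_rate=0.5):
    """Map evidence counts to an opinion; less evidence means more uncertainty."""
    total = positive + negative + prior_weight
    return Opinion(
        belief=positive / total,
        disbelief=negative / total,
        uncertainty=prior_weight / total,
        base_rate=base_rate,
    )


# Hypothetical review counts for two hotels with the same 80% positive share.
few_reviews = opinion_from_evidence(positive=4, negative=1)
many_reviews = opinion_from_evidence(positive=80, negative=20)
print(few_reviews.uncertainty > many_reviews.uncertainty)  # True
```

Both hotels have the same proportion of positive reviews, but the one with only five reviews carries far more uncertainty, which is the kind of distinction a plain average hides.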