Social media platforms let users share their opinions through textual or multimedia content. In many settings, this becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often ask to monitor social media as sources for understanding the stance, opinion, and sentiment of their customers, audience and potential audience. This is crucial for them because it let them understand the trends and future commercial and marketing opportunities.
However, all this relies on a solid and reliable data collection phase, that grants that all the analyses, extractions and predictions are applied on clean, solid and focused data. Indeed, the typical topic-based collection of social media content performed through keyword-based search typically entails very noisy results.
We recently implemented a simple study aiming at cleaning the data collected from social content, within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content based on supervised machine learning techniques, i.e. classification, over data collected from social media platforms based on keywords regarding a specific topic. We define a general method for this and then we validate it through an experiment of data extraction from Twitter, with respect to a set of famous cultural institutions in Italy, including theaters, museums, and other venues.
For this case, we collaborated with domain experts to label the dataset, and then we evaluated and compared the performance of classifiers that are trained with different feature extraction strategies.
The work has been presented at the KDWEB workshop at the ICWE 2018 conference.
A preprint of the paper can be downloaded and cited as reported here:
Emre Calisir, Marco Brambilla. The Problem of Data Cleaning for Knowledge Extraction from Social Media. KDWeb Workshop 2018, co-located with ICWE 2018, Caceres, Spain, June 2018.
The slides used in the workshop are available online here:
[slideshare id=101101050&doc=kdweb2018-datacleaning-v3-180607091825]