MAVIS Twitter dataset: A collection of tweets and sentiment analysis in Spanish about vaccines and diseases during the period 2015-2018

González, Alejandro Rodríguez; Tuñas, Juan Manuel; Santamaría, Lucia Prieto; Peces-Barba, Diego Fernandez; Ruiz, Ernestina Menasalvas; Jaramillo, Almudena; Cotarelo, Manuel; Fernández, Antonio J. Conejo; Arce, Amalia; Gil, Angel

Description

The MAVIS dataset comprises a full knowledge base of Twitter messages published in Spanish during the period 2015-2018, in the context of sentiment analysis of specific vaccines and their related diseases. The diseases and vaccines covered are:

- Invasive meningococcal disease (“EMI” in Spanish): Bexsero, Trumenba, Nimenrix
- Invasive pneumococcal disease (“ENI” in Spanish)
- Influenza
- Hepatitis
- Rotavirus: Rotarix, Rotateq
- Measles (“Sarampión” in Spanish) and MMR (“Triple vírica” in Spanish)
- Sepsis
- Whooping cough (“Tosferina” in Spanish)
- Chickenpox (“Varicela” in Spanish): Varivax, Varilrix; and Shingles (“Zoster” in Spanish)
- Human papillomavirus infection (“VPH” in Spanish): Cervarix, Gardasil

Tweets were manually classified as having a negative or non-negative sentiment by 5 experts. In addition, an automatic classification was performed with 3 different tools: IBM Watson (now Watson Tone Analyzer, https://www.ibm.com/watson/services/tone-analyzer/), Google Cloud Natural Language (https://cloud.google.com/natural-language), and Meaning Cloud (https://www.meaningcloud.com/). IBM Watson and Google Cloud Natural Language returned a numerical sentiment score ranging from -1 to 1, while Meaning Cloud returned a categorical variable with the values ‘P+’, ‘P’, ‘NEU’, ‘N’ and ‘N+’, which were converted to 1, 2, 3, 4 and 5 respectively.

Using these variables (the IBM Watson, Google Cloud Natural Language, and Meaning Cloud annotations as inputs, and the experts’ classification as the target label), a machine learning metamodel was developed. Tweets were also annotated with the sentiment output given by this classifier.

The provided data includes intrinsic tweet information, intrinsic information about the users who posted the tweets, the keywords mentioned in each tweet, and the annotations that the experts, the tools, and the model gave to each tweet.

Funding: This dataset was obtained with funding from MSD, Spain under MAVIS Study (VEAP ID: 7789).
Studies using this dataset at the time of publication:

Rodríguez-González et al., “Creating a metamodel based on machine learning to identify the sentiment of vaccine and disease-related messages in Twitter: the MAVIS study” in 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Jul. 2020, p. 6. DOI: 10.1109/CBMS49503.2020.00053

Rodríguez-González et al., “Identifying Polarity in Tweets from an Imbalanced Dataset about Diseases and Vaccines Using a Meta-Model Based on Machine Learning Techniques” in Applied Sciences, 2020, 10. DOI: 10.3390/app10249019
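The encoding step in the description (Meaning Cloud labels converted to 1-5, alongside two scores in the range -1 to 1) can be illustrated with a minimal Python sketch. Only the label-to-integer mapping comes from the description; the function name, argument names, feature order, and example values are hypothetical and do not reflect the dataset's actual column names or the authors' implementation.

```python
# Mapping stated in the dataset description: Meaning Cloud polarity label -> integer code.
MEANINGCLOUD_CODES = {"P+": 1, "P": 2, "NEU": 3, "N": 4, "N+": 5}


def build_feature_row(watson_score, google_score, meaningcloud_label):
    """Return the three-feature vector [watson, google, meaningcloud_code].

    Argument names and feature order are illustrative assumptions,
    not the dataset's actual column names.
    """
    # Both numerical tools are described as returning scores in [-1, 1].
    for name, score in (("watson_score", watson_score),
                        ("google_score", google_score)):
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"{name} must lie in [-1, 1], got {score}")
    if meaningcloud_label not in MEANINGCLOUD_CODES:
        raise ValueError(f"unknown Meaning Cloud label: {meaningcloud_label!r}")
    return [float(watson_score), float(google_score),
            MEANINGCLOUD_CODES[meaningcloud_label]]


# Hypothetical annotations for one tweet (made-up values, not taken from the dataset).
row = build_feature_row(-0.6, -0.3, "N")
print(row)
```

A row built this way, paired with the experts' negative/non-negative label as the target, is the kind of input a metamodel classifier would be trained on.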

Metrics

Dataset Index: 1.9
FAIR Score: 77%
Citations: 0
Mentions: 0

Publication Details

DOI

Publisher: Zenodo

Assigned Domain

Subfield: Health
Field: Social Sciences
Domain: Social Sciences
Confidence Score: 99%
Source: OpenAlex

Keywords: twitter, dataset, vaccines, sentiment analysis

Normalization Factors

FT: 13.46
CTw: 1.00
MTw: 1.00