Version 1.0

Clinically-relevant COVID-19 tweets authored by health-care professionals from January to June 2020

Wu, Julia; Sivaraman, Venkatesh; Kumar, Dheekshita; Banda, Juan M.; Sontag, David

Description

The rapid evolution of the COVID-19 pandemic has underscored the need to quickly disseminate the latest clinical knowledge during a public-health emergency. One surprisingly effective platform for healthcare professionals (HCPs) to share knowledge and experiences from the front lines has been social media (for example, the "#medtwitter" community on Twitter). However, identifying clinically-relevant content in social media without manual labeling is a challenge because of the sheer volume of irrelevant data. This dataset attempts to automatically extract tweets authored by HCPs and then filter for clinically-relevant content.

The dataset is derived from a large set of English tweets related to COVID-19 (retweets and bots removed) from January to June 2020 (version 14). We use a regex-based filter on user names, screen names, and bios to identify likely HCPs, narrowing the set from around 52 million tweets to around 1 million. We augment the dataset with any additional tweets from threads in which at least one tweet is already present in the dataset. The result is tweets_level_0.csv. This set contains almost all self-declared HCPs but also includes some false positives; we therefore develop an iterative relevance filtering pipeline that uses topic modeling and MetaMap concept annotation to identify and enrich clinically-relevant content. Subsequent files represent the outputs of each iteration of filtering. Please see our preprint for more details about our filtering method.

Each CSV file includes the following fields: "id" (the tweet ID, accessible using the Twitter API), "thread_id" (a generated value that is shared by multiple tweets in the same thread), and "date" (the date that the tweet was posted). Due to Twitter policies, we cannot provide the contents of the tweets, and ask that you "hydrate" the tweets using a Twitter API tool such as twarc. Note that some tweets may have been deleted since the collection of our dataset and will no longer be available.
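As a minimal sketch of how the files might be prepared for hydration, the Python below reads one of the CSV files using only the three documented fields ("id", "thread_id", "date"), groups tweet IDs by thread, and writes the IDs to a plain-text file with one ID per line. The function names and the output file name are illustrative assumptions, not part of the dataset or of twarc.

```python
import csv
from collections import defaultdict
from pathlib import Path


def load_rows(csv_path):
    """Read a dataset CSV with the documented fields: id, thread_id, date."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Keep every value as a string so large tweet IDs are never
        # rounded or reformatted by numeric conversion.
        return [
            {"id": row["id"], "thread_id": row["thread_id"], "date": row["date"]}
            for row in csv.DictReader(handle)
        ]


def group_by_thread(rows):
    """Map each thread_id to the tweet IDs that share it, in file order."""
    threads = defaultdict(list)
    for row in rows:
        threads[row["thread_id"]].append(row["id"])
    return dict(threads)


def write_id_file(rows, out_path):
    """Write one tweet ID per line and return the number of IDs written."""
    ids = [row["id"] for row in rows]
    Path(out_path).write_text("\n".join(ids) + "\n", encoding="utf-8")
    return len(ids)


# Example use (hypothetical output name):
#   rows = load_rows("tweets_level_0.csv")
#   write_id_file(rows, "ids.txt")
```

The resulting ID file can then be passed to a hydration tool. With twarc installed and configured with API credentials, this is typically a command along the lines of `twarc2 hydrate ids.txt hydrated.jsonl`; the exact invocation depends on the twarc version, so check its documentation. Deleted tweets will simply be missing from the hydrated output.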

Metrics

Dataset Index: 1.8
FAIR Score: 85%
Citations: 0
Mentions: 0

Publication Details

DOI

Publisher: Zenodo

Assigned Domain

Subfield: Health
Field: Social Sciences
Domain: Social Sciences
Confidence Score: 98%
Source: OpenAlex

Keywords

social media, twitter, nlp, health care, covid-19, covid19, physician

Normalization Factors

FT: 15.38
CTw: 1.00
MTw: 1.00