Scholar Data

Datasets

Malware Repositories and Their Authors on GitHub

This dataset comes from a study of the origins and motivations behind malware repositories on GitHub. We profile the GitHub users who created these repositories and assess their intent. We identified roughly 14,000 GitHub users linked to malware repositories and used a large language model (LLM) to classify them by perceived intent: 3,339 Malicious, 3,354 Likely Malicious, and 7,574 Benign.

The three groups differ sharply in their profile characteristics. Malicious authors typically have sparse profiles focused on malicious activity, while Benign authors have well-rounded profiles and actively contribute to cybersecurity education and research. Likely Malicious authors show a range of engagement levels, which reflects the complexity and diversity of this community.

We provide two datasets. The first is a list of malware repositories: we collected and extended the set of malware repositories on GitHub in 2022, following the original papers. The second is a CSV file with GitHub user information and a maliciousness classification label for each user.

malware_repos.txt

Purpose: This file contains a curated list of GitHub repositories identified as containing malware. The repositories were identified following the methodology of the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."

Contents: A plain text file in which each line names one unique repository in the format username/reponame. This format makes it easy to locate each repository on GitHub for further analysis or review.

Usage: The list is a resource for researchers and cybersecurity professionals who study malware, examine its distribution on platforms like GitHub, or develop defenses against it.

obfuscated_github_user_dataset.csv

Purpose: This CSV file accompanies the list of malware repositories and contains detailed but obfuscated profile information for the GitHub users who authored them. The obfuscation protects user privacy and complies with ethical standards, given the sensitivity of associating individuals with potentially malicious activity.

Contents: The columns cover different aspects of user profiles: obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Sensitive information has been masked or replaced with generic placeholders to prevent user identification.

Usage: The dataset supports research on the behaviors, patterns, and characteristics of users who create malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, and predictive modeling while respecting the ethical considerations above.
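The file formats described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the function names, the "label" column name, and the sample rows are all hypothetical, so the actual CSV header should be checked before use. Only the username/reponame line format and the three class names come from the description.

```python
import csv
import io
from collections import Counter


def parse_repo_list(text):
    """Parse lines of the form username/reponame into (owner, repo) tuples.

    Blank lines and lines that do not contain exactly one slash with
    non-empty parts on both sides are skipped.
    """
    repos = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        owner, sep, repo = line.partition("/")
        if not sep or not owner or not repo or "/" in repo:
            continue
        repos.append((owner, repo))
    return repos


def repo_url(owner, repo):
    """Build the GitHub web URL for a repository."""
    return f"https://github.com/{owner}/{repo}"


def count_labels(csv_text, label_column="label"):
    """Count rows per classification label.

    ASSUMPTION: the column name "label" is a placeholder; pass the
    real column name from the CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


if __name__ == "__main__":
    # Made-up sample data for illustration only, not taken from the dataset.
    sample_repos = "example-user/example-repo\nanother-user/another-repo\n"
    sample_csv = "login,label\nuser_1,Benign\nuser_2,Malicious\nuser_3,Benign\n"

    for owner, repo in parse_repo_list(sample_repos):
        print(repo_url(owner, repo))
    print(count_labels(sample_csv))
```

With the real files, the same functions can be applied to the contents of malware_repos.txt and obfuscated_github_user_dataset.csv after reading them from disk, for example to compare the label counts against the class sizes reported in the description.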

Authors

Tania, Nishat Ara ;
Masud, Md Rayhanul ;
Rokon, Md Omar Faruk ;
Zhang, Qian ;
Faloutsos, Michalis

0 Citations | 0 Mentions | 77% FAIR | 1.9 Dataset Index

10.5281/zenodo.10806592 | 2024

Malware Repositories and Their Authors on GitHub

Authors

Tania, Nishat Ara ;
Masud, Md Rayhanul ;
Rokon, Md Omar Faruk ;
Zhang, Qian ;
Faloutsos, Michalis

0 Citations | 0 Mentions | 77% FAIR | 1.9 Dataset Index

10.5281/zenodo.10806593 | 2024

Automated Organization Profile
Walmart Global Tech
