VSCode Dataset: Fine-Grained Code Changes (Diff-Level) Linked to Issue Descriptions for Software Traceability Analysis

Zholobetskyi, Stanislav; Andriichuk, Oleh

Description

Dataset: Visual Studio Code Commit–Task Traceability
Project: microsoft/vscode | Organization: Microsoft | Language: TypeScript, JSON (.ts .json .css)

1. CONTEXT AND MOTIVATION

This dataset supports research into traceability between task descriptions and code changes in a large commercial-grade open-source IDE. Visual Studio Code is Microsoft's primary open-source editor, written almost entirely in TypeScript, with one of the largest GitHub Issues trackers in existence (180,000+ issues). This dataset contains commits matched to a filtered subset of issues where a clear commit reference was found, yielding 65,546 unique linked issues. The project's professional Microsoft engineering culture, strict commit conventions, and exceptionally large commit history (2.6 million commits) make it a distinct counterpoint to community-led projects. The low overall linking rate (12.8 %) reflects the fact that the majority of commits are automated, tooling, and release management activity without explicit issue references.

2. COLLECTION METHODOLOGY

Source:          GitHub repository: https://github.com/microsoft/vscode
Commits:         Extracted via the GitHub API (all branches, full history)
Task linking:    Issue numbers extracted from commit messages using a numeric pattern (#NNNN or
                 plain integers referencing GitHub Issues); each resolved reference was verified
                 against the GitHub Issues API
Issue content:   Title, description body, and comment thread fetched via GitHub Issues API
                 (https://github.com/microsoft/vscode/issues) for each linked issue number
Time range:      2015-11-13 to 2026-04-01
Anonymization:   Author names and e-mail addresses replaced with sequential pseudonyms
                 (User1, User2, …) prior to publication

3. DATASET STRUCTURE

Table COMMITS: one row per commit
  ID            INTEGER  Primary key
  SHA           TEXT     Full commit hash
  AUTHOR_NAME   TEXT     Anonymized author pseudonym (e.g. User42)
  AUTHOR_EMAIL  TEXT     Anonymized e-mail address
  CMT_DATE      TEXT     Commit timestamp (ISO-8601 with timezone)
  MESSAGE       BLOB     Full commit message text
  PATH          BLOB     List of file paths changed in this commit
  DIFF          BLOB     Unified diff of the commit
  TASK_NAME     TEXT     Linked GitHub issue number (NULL if no link detected)

Table TASK: one row per unique linked issue
  ID            INTEGER  Primary key (autoincrement)
  NAME          TEXT     Issue number (matches TASK_NAME in COMMITS)
  TITLE         TEXT     Issue title as fetched from GitHub
  DESCRIPTION   TEXT     Issue body text
  COMMENTS      TEXT     Serialized comment thread

4. BASIC STATISTICS

Total commits:                 2,609,179
Commits with linked task:        334,683  (12.8 %)
Commits without linked task:   2,274,496  (87.2 %)
Unique linked issues:             65,546
Issues with description text:     60,488  (92.3 %)
Unique authors (anonymized):       3,153
Date range:                    2015-11-13 to 2026-04-01
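The two tables described under DATASET STRUCTURE can be joined on TASK_NAME = NAME to pair each commit with its issue text, and the linking rate reported under BASIC STATISTICS is the share of commits whose TASK_NAME is not NULL. The Python sketch below illustrates both queries. It rests on assumptions: the record does not state the storage format, so SQLite is assumed from the column types; the in-memory database, the sample rows, and the helper names (build_demo_db, linking_rate, commits_with_issues) are hypothetical and are not taken from the dataset.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database mirroring the documented COMMITS and TASK tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE COMMITS (
            ID INTEGER PRIMARY KEY,
            SHA TEXT, AUTHOR_NAME TEXT, AUTHOR_EMAIL TEXT, CMT_DATE TEXT,
            MESSAGE BLOB, PATH BLOB, DIFF BLOB, TASK_NAME TEXT
        );
        CREATE TABLE TASK (
            ID INTEGER PRIMARY KEY AUTOINCREMENT,
            NAME TEXT, TITLE TEXT, DESCRIPTION TEXT, COMMENTS TEXT
        );
        """
    )
    # Hypothetical rows for illustration only; none of them come from the dataset.
    conn.executemany(
        "INSERT INTO COMMITS (SHA, AUTHOR_NAME, CMT_DATE, MESSAGE, TASK_NAME) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            ("a1", "User1", "2020-01-01T10:00:00+00:00", b"Fix crash, fixes #101", "101"),
            ("b2", "User2", "2020-01-02T11:00:00+00:00", b"Bump version", None),
            ("c3", "User1", "2020-01-03T12:00:00+00:00", b"Follow-up for #101", "101"),
            ("d4", "User3", "2020-01-04T13:00:00+00:00", b"Update build script", None),
        ],
    )
    conn.execute(
        "INSERT INTO TASK (NAME, TITLE, DESCRIPTION, COMMENTS) VALUES (?, ?, ?, ?)",
        ("101", "Editor crashes on open", "Steps to reproduce ...", "[]"),
    )
    conn.commit()
    return conn


def linking_rate(conn: sqlite3.Connection) -> float:
    """Fraction of commits with a non-NULL TASK_NAME (COUNT(col) skips NULLs)."""
    total, linked = conn.execute(
        "SELECT COUNT(*), COUNT(TASK_NAME) FROM COMMITS"
    ).fetchone()
    return linked / total if total else 0.0


def commits_with_issues(conn: sqlite3.Connection) -> list[tuple]:
    """Pair each linked commit with the title of its issue."""
    return conn.execute(
        """
        SELECT c.SHA, t.NAME, t.TITLE
        FROM COMMITS AS c
        JOIN TASK AS t ON t.NAME = c.TASK_NAME
        ORDER BY c.ID
        """
    ).fetchall()
```

Against the published files, the same two queries would presumably be run on a connection opened to the distributed database rather than on the demo rows; with the totals listed above, the first query should give 334,683 / 2,609,179, or about 12.8 %.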

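The task-linking step described under COLLECTION METHODOLOGY extracts issue numbers from commit messages with a numeric pattern. A minimal Python sketch of the #NNNN form follows. The exact pattern used by the authors is not given, so the regular expression and the function name extract_issue_numbers are assumptions; the sketch does not cover the plain-integer form or the verification against the GitHub Issues API.

```python
import re

# Assumed pattern: a '#' followed by digits, ending at a word boundary.
ISSUE_REF = re.compile(r"#(\d+)\b")


def extract_issue_numbers(message: str) -> list[int]:
    """Return candidate issue numbers in order of first appearance, without duplicates."""
    seen: list[int] = []
    for match in ISSUE_REF.finditer(message):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen
```

Each number returned this way is only a candidate: per the methodology, a reference counts as a link only after it has been checked against the issue tracker.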
Publication Details

Publisher

Zenodo
