VSCode Dataset: Fine-Grained Code Changes (Diff-Level) Linked to Issue Descriptions for Software Traceability Analysis
Dataset: Visual Studio Code Commit–Task Traceability
Project: microsoft/vscode | Organization: Microsoft | Language: TypeScript, JSON (.ts .json .css)

1. CONTEXT AND MOTIVATION

This dataset supports research into traceability between task descriptions and code changes in a large commercial-grade open-source IDE. Visual Studio Code is Microsoft's primary open-source editor, written almost entirely in TypeScript, with one of the largest GitHub Issues trackers in existence (180,000+ issues). This dataset contains commits matched to a filtered subset of issues where a clear commit reference was found, yielding 65,546 unique linked issues. The project's professional Microsoft engineering culture, strict commit conventions, and exceptionally large commit history (2.6 million commits) make it a distinct counterpoint to community-led projects. The low overall linking rate (12.8 %) reflects the fact that the majority of commits are automated, tooling, and release management activity without explicit issue references.

2. COLLECTION METHODOLOGY

Source: GitHub repository, https://github.com/microsoft/vscode
Commits: Extracted via the GitHub API (all branches, full history)
Task linking: Issue numbers extracted from commit messages using a numeric pattern (#NNNN or plain integers referencing GitHub Issues); each resolved reference was verified against the GitHub Issues API
Issue content: Title, description body, and comment thread fetched via GitHub Issues API (https://github.com/microsoft/vscode/issues) for each linked issue number
Time range: 2015-11-13 to 2026-04-01
Anonymization: Author names and e-mail addresses replaced with sequential pseudonyms (User1, User2, …) prior to publication

3. DATASET STRUCTURE

Table COMMITS (one row per commit)

  Column        Type     Description
  ID            INTEGER  Primary key
  SHA           TEXT     Full commit hash
  AUTHOR_NAME   TEXT     Anonymized author pseudonym (e.g. User42)
  AUTHOR_EMAIL  TEXT     Anonymized e-mail (e.g. [email protected])
  CMT_DATE      TEXT     Commit timestamp (ISO-8601 with timezone)
  MESSAGE       BLOB     Full commit message text
  PATH          BLOB     List of file paths changed in this commit
  DIFF          BLOB     Unified diff of the commit
  TASK_NAME     TEXT     Linked GitHub issue number (NULL if no link detected)

Table TASK (one row per unique linked issue)

  Column        Type     Description
  ID            INTEGER  Primary key (autoincrement)
  NAME          TEXT     Issue number (matches TASK_NAME in COMMITS)
  TITLE         TEXT     Issue title as fetched from GitHub
  DESCRIPTION   TEXT     Issue body text
  COMMENTS      TEXT     Serialized comment thread

4. BASIC STATISTICS

Total commits:                  2,609,179
Commits with linked task:       334,683 (12.8 %)
Commits without linked task:    2,274,496 (87.2 %)
Unique linked issues:           65,546
Issues with description text:   60,488 (92.3 %)
Unique authors (anonymized):    3,153
Date range:                     2015-11-13 to 2026-04-01
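
The COMMITS and TASK tables and the linking rate reported above can be illustrated with a minimal sketch. It rests on several assumptions that this description does not confirm. It assumes the dataset is distributed as an SQLite database, which the column types suggest but do not prove. It assumes a simple `#NNNN` pattern stands in for the actual linking rule; the "plain integers" case and the API verification step are omitted. All rows, SHAs, messages, and helper names (`extract_issue_refs`, `build_toy_db`, `linking_rate`, `linked_pairs`) are hypothetical and exist only for illustration.

```python
import re
import sqlite3

# Assumption: only "#NNNN" references are matched; the dataset's exact pattern
# (which also covers plain integers) is not published in this description.
ISSUE_REF = re.compile(r"#(\d+)")


def extract_issue_refs(message: str) -> list[str]:
    """Return the issue numbers written as #NNNN in a commit message."""
    return ISSUE_REF.findall(message)


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database mirroring the documented schema, with toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE COMMITS (
            ID INTEGER PRIMARY KEY,
            SHA TEXT,
            AUTHOR_NAME TEXT,
            AUTHOR_EMAIL TEXT,
            CMT_DATE TEXT,
            MESSAGE BLOB,
            PATH BLOB,
            DIFF BLOB,
            TASK_NAME TEXT
        );
        CREATE TABLE TASK (
            ID INTEGER PRIMARY KEY AUTOINCREMENT,
            NAME TEXT,
            TITLE TEXT,
            DESCRIPTION TEXT,
            COMMENTS TEXT
        );
        """
    )
    # Hypothetical commits; TASK_NAME is the first #NNNN reference, or NULL.
    toy_commits = [
        ("aaa111", "User1", "Fix tab order, fixes #1234"),
        ("bbb222", "User2", "Bump version"),
        ("ccc333", "User1", "Update build scripts"),
    ]
    for sha, author, message in toy_commits:
        refs = extract_issue_refs(message)
        conn.execute(
            "INSERT INTO COMMITS (SHA, AUTHOR_NAME, MESSAGE, TASK_NAME) "
            "VALUES (?, ?, ?, ?)",
            (sha, author, message, refs[0] if refs else None),
        )
    # Hypothetical issue matching the single linked commit.
    conn.execute(
        "INSERT INTO TASK (NAME, TITLE, DESCRIPTION, COMMENTS) VALUES (?, ?, ?, ?)",
        ("1234", "Tab order is wrong", "Hypothetical issue body", "[]"),
    )
    conn.commit()
    return conn


def linking_rate(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (linked commits, total commits); COUNT(col) skips NULL values."""
    linked, total = conn.execute(
        "SELECT COUNT(TASK_NAME), COUNT(*) FROM COMMITS"
    ).fetchone()
    return linked, total


def linked_pairs(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Join each linked commit to its issue title via TASK_NAME = NAME."""
    return conn.execute(
        "SELECT c.SHA, t.TITLE FROM COMMITS AS c "
        "JOIN TASK AS t ON t.NAME = c.TASK_NAME ORDER BY c.ID"
    ).fetchall()


if __name__ == "__main__":
    db = build_toy_db()
    linked, total = linking_rate(db)
    print(f"{linked} of {total} commits linked ({100 * linked / total:.1f} %)")
    print(linked_pairs(db))
```

Applied to the published totals, the same ratio gives 334,683 / 2,609,179, or about 12.8 %. In the real data the MESSAGE, PATH, and DIFF columns are declared as BLOB, so they may come back as bytes and need decoding before any regular expression is applied.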