Version 2

How solution snippets are presented in answers posted on Stack Overflow and how they could be potentially reused.

Anonymous

Description

Researchers use datasets of Question-Solution pairs to train machine learning models, for example for source code generation. A Question-Solution pair has two parts: a programming question and its corresponding Solution Snippet, which is source code that solves the question. Such datasets can be obtained from a number of platforms; in this study, the Question-Solution pairs were obtained from Stack Overflow (SO). However, datasets of Question-Solution pairs extracted from SO have two limitations: (1) the Solution Snippets may be only partially correct and/or may not answer the question, and (2) information about how the Solution Snippets could be reused is not available. These limitations can adversely affect the predictability of a machine learning model. Therefore, I conducted an empirical study to categorize how Solution Snippets are presented in SO answers and how they can be adapted for reuse. I identified eight categories of how Solution Snippets are presented in SO answers and five categories of how they could be adapted. Based on these results, I identified several potential reasons why creating datasets of Question-Solution pairs is not always easy. The first categorization shows that locating the Solution Snippet is challenging when the answer contains several code blocks; the researcher must then also identify which code within the chosen code block is the Solution Snippet. The second categorization shows that most Solution Snippets appear difficult to adapt for reuse, and that how they could be adapted is not explicitly stated in them. These insights shed light on how to create better-quality datasets in order to improve the predictability of machine learning models.
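To make the first difficulty concrete, the following is a minimal sketch of pulling code blocks out of an answer body. It is not the procedure used in the study. It assumes answer bodies are available as HTML in which block-level code is wrapped in `<pre><code>` elements; the names `CodeBlockExtractor` and `extract_code_blocks` and the sample answer are hypothetical, invented for illustration.

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collects the text of every <pre><code> block in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished code blocks, in document order
        self._in_pre = False    # currently inside <pre>
        self._in_code = False   # currently inside <code> nested in <pre>
        self._buffer = []       # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
        elif tag == "code" and self._in_pre:
            self._in_code = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "code" and self._in_code:
            self.blocks.append("".join(self._buffer))
            self._in_code = False
        elif tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        # Inline <code> outside <pre> is ignored on purpose.
        if self._in_code:
            self._buffer.append(data)


def extract_code_blocks(answer_html):
    """Return the code blocks of one answer body (hypothetical helper)."""
    parser = CodeBlockExtractor()
    parser.feed(answer_html)
    parser.close()
    return parser.blocks


# Hypothetical answer body, invented for illustration only.
answer = (
    "<p>Your version fails:</p>"
    "<pre><code>print(items[len(items)])</code></pre>"
    "<p>Use this instead:</p>"
    "<pre><code>print(items[-1])</code></pre>"
)
blocks = extract_code_blocks(answer)
```

Here `blocks` holds two candidates, and only the surrounding prose indicates that the second one is the Solution Snippet. A rule such as "take the first code block" would pair the question with the failing code, which illustrates why locating the snippet needs more than tag extraction.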

Metrics

Dataset Index

0.3

FAIR Score

79%

Citations

0

Mentions

0

Publication Details

DOI

Publisher

Zenodo

Assigned Domain

Subfield

Artificial Intelligence

Field

Computer Science

Domain

Physical Sciences

Confidence Score

98%

Source

OpenAlex

Keywords

Stack Overflow, Qualitative research, Solution snippets, Code blocks, Code reuse

Normalization Factors

FT

13.46

CTw

1.00

MTw

1.00