How solution snippets are presented in answers posted on Stack Overflow and how they could be potentially reused.
Description
Researchers use datasets of Question-Solution pairs to train machine learning models, for example for source code generation. A Question-Solution pair has two parts: a programming question and its corresponding Solution Snippet, which is source code that solves the question. Such datasets can be obtained from a number of platforms; in this study, the Question-Solution pairs were obtained from Stack Overflow (SO). However, datasets of Question-Solution pairs extracted from SO have two limitations: (1) the Solution Snippets are partially correct and/or do not answer the questions, and (2) information about how the Solution Snippets could be reused is not available. These limitations can adversely affect the predictability of a machine learning model. Therefore, I conducted an empirical study to categorize how Solution Snippets are presented in SO answers and how they can be adapted for reuse. I identified eight categories of how Solution Snippets are presented in SO answers and five categories of how Solution Snippets could be adapted. Based on these results, I identified several potential reasons why it is not always easy to create datasets of Question-Solution pairs. The first categorization shows that finding the correct location of the Solution Snippet is challenging when the answer contains several code blocks; the researcher must then identify which code within the chosen code block is the Solution Snippet. The second categorization shows that most Solution Snippets appear difficult to adapt for reuse, and that how they could be adapted is not explicitly stated in them. These insights shed light on how to create better-quality datasets in order to improve the predictability of machine learning models.
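The localization problem described above can be illustrated with a minimal sketch. It assumes the answer body is available as HTML in which code blocks appear as `<pre><code>` elements; the names `CodeBlockExtractor` and `extract_code_blocks` and the sample answer are hypothetical and are not taken from the study. The sketch only lists candidate code blocks; it does not decide which one is the Solution Snippet, which is the hard part the study points to.

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collects the text of every <pre><code> block in an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._in_pre = False
        self._in_code = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
        elif tag == "code" and self._in_pre:
            # Inline <code> outside <pre> is ignored on purpose.
            self._in_code = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "code" and self._in_code:
            self.blocks.append("".join(self._buffer))
            self._in_code = False
        elif tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_code:
            self._buffer.append(data)


def extract_code_blocks(answer_html):
    """Return the code blocks of an answer body, in document order."""
    parser = CodeBlockExtractor()
    parser.feed(answer_html)
    parser.close()
    return parser.blocks


# Hypothetical answer body, invented for illustration only.
answer_html = (
    "<p>Your code fails because of the loop:</p>"
    "<pre><code>for i in range(len(xs)): xs.pop(i)</code></pre>"
    "<p>Use a comprehension instead:</p>"
    "<pre><code>xs = [x for x in xs if x &gt; 0]</code></pre>"
)

blocks = extract_code_blocks(answer_html)
for index, block in enumerate(blocks):
    print(index, repr(block))
```

In this invented answer the first block restates the asker's faulty code and only the second is a solution, so picking the first or the longest block would produce a wrong Question-Solution pair.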
Publication Details
Subfield: Artificial Intelligence
Field: Computer Science
Domain: Physical Sciences
Confidence Score: 98%
Source: OpenAlex