Scholar Data

Replication package for the Empirical Software Engineering (EMSE) Journal submitted article: "Analyzing the Ripple Effects of Refactoring"

This is a replication package and online appendix for the EMSE Journal paper "Analyzing the Ripple Effects of Refactoring".

Contents

This repository contains the following:

- INSTALL: Detailed installation instructions for each of the tools used, as well as the required Python dependencies.
- Figures: Graphical content shared in the manuscript.
  - Appendix:
    - posterior-update.pdf: Line plot displaying the posterior probability distribution of a refactoring showing scarcity in its Ripple Effect.
  - Background:
    - ripple-example-vert.pdf: Graphical representation of the theoretical idea of the Ripple Effect applied to the context of refactoring activity.
  - Results:
    - data-analysis-diagram.pdf: Workflow diagram of the data analysis process followed in the study.
    - RQ1:
      - LCSDistribution.pdf: LCS distribution by RE duration quartiles.
      - RQ1 - Ripple Effect Distribution by Macro Type - With Lines and Outliers.pdf: LCS distribution considering outliers.
      - RQ1_RT-NoRef.pdf: Distribution of RE duration by refactoring family type.
    - RQ2:
      - CPDistribution.pdf: Change Proneness distribution by RE duration quartiles.
      - DPDistribution.pdf: Defect Proneness distribution by RE duration quartiles.
    - RQ3:
      - cpeloc.pdf: Change Efficiency distribution by RE duration quartiles.
  - Study Design:
    - data-collection-diagram.pdf: Workflow diagram of the data collection process followed in the study.
    - Workflow.pdf: Workflow diagram of the study design process followed in the study.
    - creation-year-hist.pdf: Histogram displaying the distribution of the study context projects by creation year.
    - progression-per-age.pdf: Stacked ridge plot of multiple software attributes across projects of different ages.
    - yearly_commit_activity_boxplot.pdf: Multiple-boxplot figure displaying the projects' activity level in terms of commit activity.
- Data: A folder containing all raw data extracted.
  - Commits Diff: Contains the commit diff data between subsequent refactoring commits per mined project.
  - Commits Hash: Contains the list of commits with detected refactoring activity per mined project.
  - Issues: Contains the list of issues reported per mined project.
  - Refactor Types: Contains the list of detected refactoring types with their global counts per mined project.
  - Refactoring Commits: Contains the list of commits with detected refactoring activity per mined project, with the mined refactoring content retrieved from RefactoringMiner.
  - RefactoringMiner Output: Raw RefactoringMiner output.
  - Zipped analyzed software repositories: Zipped folder with the software repositories as cloned at the time this study was executed.
  - Unique projects: List of unique project full names from the PANDORA dataset (the original dataset did not provide a clean list of projects, so we built one by removing the duplicates).
  - change_proneness_data: Contains the output files from the change and defect proneness calculation process per analyzed project.
  - dev_effort_data: Contains the output files from the developer's effort calculation process per analyzed project.
  - merged_results: Contains the output files from the two previous computation processes, merged per analyzed project.
  - global_refactoring_counts.json: Contains the summary refactoring counts from all the projects.
  - mined_total_commit_counts.json: Contains the summary counts of commits that reported refactoring activity according to RefactoringMiner.
  - project_refactoring_stats.json: Contains summary counts per project: the number of commits analyzed, the number of refactorings analyzed, and the number of refactorings with RE persistence history.
  - basic_statistics_table.csv: Reports summary counts for all the projects initially considered in the study context. These counts include attributes such as Commits, Issues, and GitHub Stars, among others, in order to describe the shape of the projects analyzed in this study.
  - descriptive_stats_table.csv: Reports the summary descriptive statistics derived from the basic_statistics_table.csv table and displayed in the manuscript.
  - sensitivity_analysis_bug_fixing.xlsx: Excel sheets with the manual annotations made to assess the efficiency of the strategy adopted to label commits as bug-fixing.
  - bug_fixing_validation_sample.csv: Sampled commits used to perform the manual annotations.
- Results: Data files containing the analyzed results to answer the Research Questions.
  - Raw results: CSV file with the raw results from the analyzed refactoring cases; it gathers the majority of the mined data into one single file.
  - RQ1: "To what extent does the RE of a refactoring activity persist in code?"
    - LCSDistribution.jrp: JMP analysis file to compute RQ1.
    - Hypothesis Testing:
      - RefactoringFamilyTypeDunn.xlsx: Outcome of the Dunn's test on the significance of the Refactoring Family.
      - RefactoringFamilyVSREjmp.jmp: JMP analysis table file with the hypothesis testing on the impact of the refactoring family on the RE.
      - RefactoringFamilyVSREjmp.xlsx: Outcome of the hypothesis testing on the impact of the refactoring family on the RE.
  - RQ2: "What is the long-term effect of refactoring on change and defect proneness?"
    - CPDistribution.jrp: JMP analysis file to compute RQ2 on the impact of the refactoring effect on the change proneness.
    - DPDistribution.jrp: JMP analysis file to compute RQ2 on the impact of the refactoring effect on the defect proneness.
    - Hypothesis Testing:
      - CP.xlsx: Results from the hypothesis testing performed on the impact of the refactoring effect on the change proneness.
      - DP.xlsx: Results from the hypothesis testing performed on the impact of the refactoring effect on the defect proneness.
      - Spearman.xlsx: Spearman's rho correlation test between the RE and both CP and DP.
  - RQ3: "What is the benefit/effort ratio of long-term refactoring?"
    - cpeloc.jrp: JMP analysis file to compute RQ3 on the impact of the RE on the benefit/effort ratio of performing refactoring activity.
    - HT.xlsx: Results from the hypothesis testing performed.
- Scripts: A folder containing all the scripts.
  - components/: Contains the utilities leveraged during the project and the main scripts to run the initial RefactoringMiner data collection.
    - utility.py: Script with the common global variables used across the rest of the scripts.
    - refactoring_miner.py: Script dedicated to mining refactoring activity with RefactoringMiner.
    - its_miner.py: Script dedicated to mining Issue Tracking System data.
    - get_commit_diff.py: Version control miner for commit diff extraction.
    - get_github_url.py: Script to obtain GitHub repository URL links.
    - helper.py: Script that supports the other scripts with shared helper functions.
  - 00_main.py: Performs the initial refactoring data collection with RefactoringMiner.
  - 01_proneness_calculator.py: Calculates the change and defect proneness of the mined refactorings over the entire change history of the analyzed projects.
  - 02_get_bcp.py: Calculates the Ripple Effects of the analyzed refactorings through the Bayesian Conditional Probability and LCS approaches.
  - 03_dev_effort.py: Calculates the Change Efficiency detected during the post-refactoring lifetime of the code in the analyzed Java class.
  - 04_merge_crossproject_data.py: Merges the data collected per project into a cross-project dataset.
  - 05_bcp_example_collector.py: Script used to mine the specific commit data for the example provided in the Appendix.
  - 06_summary_statistics.py: Script used to create the basic_statistics_table.csv table with the software attributes of the initially considered projects.
  - 07_get_activity.py: Script used to create the yearly_commit_activity_boxplot.pdf figure with the yearly activity of the studied projects in terms of commits.
  - 08_sensitivity_analysis.py: Script to perform the sensitivity analysis on the BCP threshold selection.
  - 09_fetch_manual_validation_sample.py: Script to sample the observations used for the bug-fixing manual validation experiment.
  - 10_check_commit_uniqueness_in_samples.py: Script to check that there are no duplicates in the sample of commits used for the manual annotations in the experiment.
- Appendix example: A folder containing all the content generated to provide the demonstration in the Appendix B section of the manuscript.
  - bcp-examples.R: R code to replicate the results presented in the example.
  - metadata_jmeter.csv: Metadata in CSV format with the input data used for the implementation of this example.
  - metadata_jmeter.xlsx: Metadata in XLSX format with the input data used for the implementation of this example.
  - posterior-update.pdf: Figure demonstrating the progression of the analyzed example refactoring.

License

All generated data is provided under the Creative Commons 4.0 Attribution License.
All scripts are provided under the MIT License.
All the analysed projects must be used in accordance with their respective licenses (shared in each project when applicable).

Running the code

NOTE 1: Please find the DATA_PATH global variable in the components/utility.py script and set it to the path where the program should create all the needed results. The idea is that the base path is the location of this replication package on your machine, with data appended as the location for the data files.

NOTE 2: The different stages of the study execution are split in the main.py script. Through the boolean definitions in commons.py, practitioners can decide which stages they want to manipulate or re-execute without affecting the other stages. For a complete execution, set all the boolean global variables to True.

Stage 1: MAIN

Executes script 00_main.py to run the initial data collection, which incorporates the following sub-steps:

- Gets the unique projects from the source dataset.
- Collects issue-tracking systems' data from GitHub.
- Collects commits from GitHub.
- Clones the repository to be analyzed.
- Runs RefactoringMiner to mine all the refactoring data.
- Creates the following directories and files within the path DATA_PATH/output_data/:
  - average_time_between_refactorings/
  - commits_diff/
  - commits_hash/
  - developers_effort/
  - interefactoring_commit_period/
  - issues
  - refactoring_types
  - refactoring_commits
  - refactoring_miner_output
  - sample_refactoring_commits
  - project_commit_hashes.json
  - split_project_commit_hashes.json
  - unique_projects.json

Stage 2: CALCULATING DEFECT PRONENESS AND CHANGE PRONENESS

Executes script 01_proneness_calculator.py, which incorporates the following sub-steps:

- Reorders and locates the refactoring data and mined commits for each project.
- Calculates the change proneness and defect proneness for each refactoring case (that is, the commit in which it was introduced and the Java class affected).
- Writes logs for each of the projects, thereby connecting the execution with proneness_status_monitoring.py.
- Creates the following directories and files within the path DATA_PATH/:
  - change_proneness_data/
  - change_proneness_data/{project_name}
  - change_proneness_data/{project_name}/proneness_results.csv
  - change_proneness_data/{project_name}/proneness_results.pkl
  - change_proneness_data/{project_name}/ordered_commits.csv
  - change_proneness_data/{project_name}/ordered_commits.pkl
  - change_proneness_data/{project_name}/ordered_refactorings.csv
  - change_proneness_data/{project_name}/ordered_refactorings.pkl
  - change_proneness_data/{project_name}/processed_refactorings.txt
  - change_proneness_data/{project_name}/class_change_history

NOTE on this last directory: each CSV file in it contains the class change history of one refactoring mined by RefactoringMiner in the corresponding repository. This helped to make sure that the same class was being analyzed even if it was later renamed or moved to a different path.
Let's dive into a sample file.

Sample CSV file name: id1_id2_refactoring-type_project-name:

- id1: Refactoring identifier; a raw counter giving the order in which the refactorings were analyzed.
- id2: Refactoring type identifier; it counts how many refactorings of the same type as the one in this file have been analyzed so far.
- refactoring-type: Lowercased name of the refactoring type.
- project-name: Lowercased project name.

Logic behind Change Proneness calculation:

(The notation follows the one used in the final table so that the reader can more easily relate each process to the final outcome.)

- cp: Or "Raw Change Proneness", depicts the changes performed in the affected class in a commit where the same refactoring type was applied to that class, as compared to the previous similar case. The first row in the table therefore gives the number of changes made in the class relative to the refactoring commit mined by RefactoringMiner in that class, the second row gives the changes relative to the source code of the class at the history point of the first row, and so on.

  $$\displaystyle \mathcal{C}(R_{i}, r_{j}) = {\nu_{\mathcal{C}}(R_{i})}_{r_{j-1} \rightarrow r_{j}}$$

- cp/eloc: Or "Adjusted Change Proneness", depicts the same metric as before, adjusted by the effective lines of code found in that Java class (extracted from SCC). The formula is rewritten as follows:

  $$\displaystyle \mathcal{C}(R_{i}, r_{j}) = \frac{{\nu_{\mathcal{C}}(R_{i})}_{r_{j-1} \rightarrow r_{j}}}{ELOC_j}$$

Logic behind Defect Proneness calculation:

Performs fuzzy matching of regular expressions based on the commonly used issue tickets in Issue Tracking Systems such as GitHub or JIRA. 
The patterns are:

- [A-Z]{2,}-\d+
- \d{4,}
- #\d+

If one of these patterns is found in the commit message, we consider it a defect-inducing commit, and approximately a refactoring one as well (note that this is a best-effort approach).

Stage 3: CALCULATING THE RIPPLE EFFECTS

We mainly used two approaches, implemented in the script 02_get_bcp.py:

- Bayesian Probability Approximation (bayesian_remaining_probs):
  - At each refactoring commit where the same Java class has been affected by the same refactoring type as the one collected by RefactoringMiner, a fraction of the class is modified.
  - The new cumulative probability of change accounts for both the previous modifications and the new changes introduced in the current commit.
  - The remaining probability tracks how much of the original code still exists over time through probabilistic approximation.

  $$ \displaystyle P_c = 1 - (1 - P_{c_{prev}}) \times (1 - CR) $$

  where:
  - $P_c$ is the cumulative probability of change at the current commit,
  - $P_{c_{prev}}$ is the cumulative probability of change from the previous step,
  - $CR$ is the change ratio (i.e., proportion of modifications in the file).
- Longest Common Subsequence (LCS) Approach (lcs_remaining_probs):
  - Definition: An LCS is the longest subsequence common to all sequences in a set of sequences.
  - For each commit $r_j$, the probability of original code persistence is computed as:

$$\displaystyle P_r = S = \frac{|LCS(A, B)|}{\max(|A|, |B|)}$$

where:

- $P_r$ is the posterior probability of code persistence,
- $S$ is the LCS similarity ratio (same as in compute_similarity),
- $A$ represents the original file content at the refactoring commit,
- $B$ represents the modified file content at commit $r_j$.

The files created within the DATA_PATH directory in this stage are the following:

- change_proneness_data/{project_name}/bcp_proneness_results.csv
- change_proneness_data/{project_name}/bcp_results.pkl
- bcp_logs/{project_name}.log
- bcp_estimation_global.log

Similarly, during the process the script bcp_status_monitoring.py can be launched to get hourly reports on the process.

Stage 4: CALCULATING DEVELOPER'S EFFORT

Here as well, we mainly used two approaches, implemented in the script 03_dev_effort.py:

- Inter-Refactoring Touched Lines of Code (tloc)

  Anchored on the initial refactoring commit mined by RefactoringMiner, it collects the TLOC from each subsequent refactoring of the same type that affected the same Java class. It continues this approach with all subsequent cases, always comparing against the anchored refactoring commit version of the Java class.

  $$\displaystyle TLOC = \sum_{k=i}^{j} \left( |A_k| + |D_k| \right)$$

  where:
  - $|A_k|$ represents the number of added lines in commit $k$,
  - $|D_k|$ represents the number of deleted lines in commit $k$,
  - the summation considers all commits from $R_i$ to $R_j$.

  NOTE: In the results table the summation is not applied, so each cell in the column contains only the TLOC of its own commit. To obtain the total, the summation must be performed.

- RAW Inter-Refactoring Touched Lines of Code (raw_tloc)

  It focuses on each subsequent refactoring commit made to the same class with the same refactoring type and computes the TLOC based on the parent commit. The computation is therefore as follows:

  $$\displaystyle \text{Raw TLOC} = |A_i| + |D_i|$$

  where:
  - $|A_i|$ represents the number of added lines in commit $R_i$,
  - $|D_i|$ represents the number of deleted lines in commit $R_i$.

  NOTE: A summation could be done here as well, but given the definition of this metric it would not be very meaningful.

The files created within the DATA_PATH directory in this stage are the following:

- dev_effort_data/{project_name}.csv
- dev_effort_data/{project_name}.pkl
- dev_effort_logs/{project_name}.log
- dev_effort_global.log

Stage 5: GENERATING THE FINAL RESULTS FILE

This stage simply merges the results generated so far into one file per project (DATA_PATH/merged_results/{project_name}_merged.csv), followed by a global merge of all projects into DATA_PATH/cross_project_raw_results.csv.

The rest of the analysis is done with the JMP software.

Stage 6: MINING DATA FOR APPENDIX EXAMPLE

Small script to gather and build the content displayed in the Appendix, which demonstrates the limitations of the Bayesian BCP approach in some corner cases of the study context. After executing the Python script 05_bcp_example_collector.py, the user should execute the commands in the R script bcp_example.R.

Stage 7: GENERATING SUMMARY STATISTICS

The script 06_summary_statistics.py generates the table with the set of summary attributes describing the shape of each project considered in the study context. The output table can be found in basic_statistics_table.csv.

For an aggregated display of summary descriptive statistics, you can run the summary-statistics.R script to obtain the table descriptive-stats-table.csv, shown in the Study Context section of the manuscript.
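As an illustration of the two Stage 3 formulas, the following is a minimal Python sketch and not the code from 02_get_bcp.py. The function names (bayesian_remaining, lcs_length, lcs_similarity) are hypothetical, and two points are assumptions made only for this example: the cumulative probability of change starts at 0 at the anchor refactoring commit, and file contents are compared as sequences of lines.

```python
from typing import Iterable, List, Sequence


def bayesian_remaining(change_ratios: Iterable[float]) -> List[float]:
    """Return 1 - P_c after each commit, with P_c = 1 - (1 - P_prev) * (1 - CR)."""
    p_change = 0.0  # assumption: nothing has changed yet at the anchor commit
    remaining = []
    for cr in change_ratios:
        if not 0.0 <= cr <= 1.0:
            raise ValueError("change ratio must lie in [0, 1]")
        p_change = 1.0 - (1.0 - p_change) * (1.0 - cr)
        remaining.append(1.0 - p_change)
    return remaining


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """S = |LCS(A, B)| / max(|A|, |B|)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty files count as identical
    return lcs_length(a, b) / longest


if __name__ == "__main__":
    # Two later commits, each touching half of the class.
    print(bayesian_remaining([0.5, 0.5]))
    original = ["int a;", "int b;", "int c;"]
    modified = ["int a;", "int c;"]
    print(lcs_similarity(original, modified))
```

With these inputs, the remaining probability drops from 0.5 to 0.25 after the second commit, while the LCS ratio for the three-line file with one line removed is 2/3.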
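The issue-reference matching described under the Defect Proneness logic in Stage 2 can be sketched in Python as follows. This is a hedged illustration using the three patterns listed there; the helper name mentions_issue and the per-pattern comments are assumptions for this example and are not taken from 01_proneness_calculator.py.

```python
import re

# The three patterns listed in the Defect Proneness description.
ISSUE_PATTERNS = [
    re.compile(r"[A-Z]{2,}-\d+"),  # project-key style reference such as ABC-123 (interpretation)
    re.compile(r"\d{4,}"),         # bare number with four or more digits
    re.compile(r"#\d+"),           # hash-prefixed reference such as #42 (interpretation)
]


def mentions_issue(message: str) -> bool:
    """Return True if any of the patterns occurs anywhere in the commit message."""
    return any(pattern.search(message) for pattern in ISSUE_PATTERNS)


if __name__ == "__main__":
    for msg in ("Fix NPE, closes #42", "Rename helper methods", "Update copyright to 2024"):
        print(msg, "->", mentions_issue(msg))
```

The last message shows why the approach is best effort: any four-digit number, such as a year, also matches the second pattern.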

This is a subset of the data from the MIRACLE iCCD all-sky imagers operated by the Finnish Meteorological Institute (FMI) and the Sodankylä Geophysical Observatory (SGO), University of Oulu. The subset includes 142 events with wave-like auroral forms from FMI's internal Selected ASC data set, collected from the MIRACLE database. It is used in the paper "Radio emissions reveal Alfvénic activity and electron acceleration prior to substorm onset" by S. Y. Wu et al., Nature Communications (accepted 2025).

The original images were captured with a fish-eye lens. They are stored in 8-bit gray-scale JPG format and have a size of 512 × 512 pixels, corresponding to an average spatial resolution of approximately 1 km near the zenith at an ionospheric height of 110 km. The images were captured with a 557.7 nm narrow-band filter at a cadence of approximately 20 seconds. Observations were collected from five stations: Abisko (ABK) at geographic latitude 68.36°N and geographic longitude 18.82°E, Muonio (MUO) at 68.02°N 23.53°E, Kevo (KEV) at 69.76°N 27.01°E, Kilpisjärvi (KIL) at 69.02°N 20.87°E, and Sodankylä (SOD) at 67.42°N 26.39°E. The iCCD imagers at Kilpisjärvi, Kevo, Abisko and Muonio were owned and operated by the Finnish Meteorological Institute. The iCCD imager at Sodankylä was owned and operated by the Sodankylä Geophysical Observatory, University of Oulu. The image orientation calibrations were made at FMI. For more details about MIRACLE and the iCCD imagers, see: https://space.fmi.fi/MIRACLE/ASC/index.php

Automated Organization Profile
University of Oulu

Datasets

Replication Package of the EMSE Journal Article: "Analyzing the Ripple Effects of Refactoring" (Version: v1.0)

Replication Package for "Alone or in Combination? A Practitioner's Perspective on the Use of Generative AI for Static Code Quality Analysis" (Version: 1)

Contrasting genetic differentiation of urban and rural populations of two grassland lepidopterans across Europe

Data for Potassium Recovery from Mechanochemically Activated Phlogopite via Ultrasound Assisted Leaching

Engineering Disorder in Droplet Packings through Polydispersity and Adhesion

Subset of the MIRACLE iCCD all-sky camera network images at 557.7nm Dec 1996 – Dec 2007

Dataset for "Modeling 129Xe NMR chemical shift sensitivity in carbon nanotube systems"
