Automated Author Profile
Esposito, Matteo
University of Oulu
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Average Dataset Index per dataset
Total Datasets
Total datasets for this author
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the author's datasets
Total Mentions
Total mentions of the author's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 7.9 (sum of the Dataset Index scores of 6 datasets)
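As a quick arithmetic check on the figures above, the average Dataset Index per dataset follows from dividing the S-Index by the number of datasets. The sketch below is illustrative only: the function name is hypothetical, and because the individual Dataset Index scores are not listed on this page, it uses only the reported total and count.

```python
def average_dataset_index(s_index: float, dataset_count: int) -> float:
    """Average Dataset Index, given the S-Index (the sum of Dataset Index scores)."""
    if dataset_count <= 0:
        raise ValueError("dataset_count must be positive")
    return s_index / dataset_count


# Figures reported on this profile: S-Index 7.9 across 6 datasets.
average = average_dataset_index(7.9, 6)
print(f"Average Dataset Index per dataset: {average:.2f}")  # prints 1.32
```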
S-Index Over Time
Cumulative Citations Over Time
Cumulative Mentions Over Time
Datasets
CppSATD is the first dedicated C++ SATD (self-admitted technical debt) comments dataset comprising multi-class SATD comments along with their code context. It is the largest available multi-type SATD dataset with surrounding code context, comprising over 13k manually annotated SATD comments grouped under different types of SATD.
- silver_standard_cpp.csv: The silver standard dataset comprises a total of 531,367 source code comments from all five repositories identified. We perform an automated annotation for SATD on the entire dataset using the established SATD text patterns from the PENTACET corpus. The automated annotation resulted in 18,973 SATD comments. The remaining 512,394 are treated as NON-SATD comments. This dataset does not involve any human verification but is solely based on the SATD patterns from the PENTACET corpus.
- gold_standard_cpp.csv: We utilize the SATD comments from the silver standard data. We manually verified the 18,973 comments to validate whether they are SATD. To ascertain that the 512,394 are NON-SATD comments at a 99% confidence level and a 1% margin of error, we take a random sample of 16,125 source code comments from the 512,394 comments. Overall, the manual verification resulted in a total of 13,069 SATD comments and 22,029 NON-SATD comments. Each of the 13,069 SATD comments is further classified into one of the five types of SATD identified in the seminal work that introduced the concept of SATD.
- repos.zip: This includes the source code of the 5 C++ repositories from which we collected the data.
- Official_workshop_document.pdf: This document was created to ensure a consistent understanding of SATD types among the authors.
Authors
- Sridharan, Murali
- Pham, Phuoc
- Esposito, Matteo
- Lenarduzzi, Valentina
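The counts in the CppSATD description above can be cross-checked with a little arithmetic, and the size of the NON-SATD sample can be approximated with a standard sample-size formula. The sketch below is a hedged illustration, not the authors' procedure: the description does not say which formula or z-value was used, so Cochran's formula with a finite population correction and p = 0.5 is an assumption, and the helper name sample_size is hypothetical. Under these assumptions the estimate lands in the same range as the reported 16,125 rather than matching it exactly.

```python
import math
from statistics import NormalDist

# Counts reported in the CppSATD description.
total_comments = 531_367
silver_satd = 18_973
silver_non_satd = 512_394
non_satd_sample = 16_125
gold_satd = 13_069
gold_non_satd = 22_029

# The silver standard labels every comment as either SATD or NON-SATD.
assert silver_satd + silver_non_satd == total_comments

# The gold standard covers the verified SATD candidates plus the NON-SATD sample.
assert silver_satd + non_satd_sample == gold_satd + gold_non_satd


def sample_size(population: int, confidence: float, margin: float, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction (assumed formula)."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinite population.
    n0 = z**2 * p * (1 - p) / margin**2
    # Shrink the estimate to account for the finite population.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


estimate = sample_size(silver_non_satd, confidence=0.99, margin=0.01)
print(f"Estimated sample size: {estimate} (reported: {non_satd_sample})")
```

Choosing p = 0.5 is the conservative option, since it maximizes the required sample size when the true proportion is unknown.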
CppSATD is the first dedicated C++ SATD comments dataset comprising multi-class SATD comments along with their code context. It is the largest available multi-type SATD dataset with surrounding code context, comprising over 13k manually annotated SATD comments grouped under different types of SATD.
- silver_standard_cpp.csv: The silver standard dataset comprises a total of 531,367 source code comments from all five repositories identified. We perform an automated annotation for SATD on the entire dataset using the established SATD text patterns from the PENTACET corpus. The automated annotation resulted in 18,973 SATD comments. The remaining 512,394 are treated as NON-SATD comments. This dataset does not involve any human verification but is solely based on the SATD patterns from the PENTACET corpus.
- gold_standard_cpp.csv: We utilize the SATD comments from the silver standard data. We manually verified the 18,973 comments to validate whether they are SATD. To ascertain that the 512,394 are NON-SATD comments at a 99% confidence level and a 1% margin of error, we take a random sample of 16,125 source code comments from the 512,394 comments. Overall, the manual verification resulted in a total of 13,069 SATD comments and 22,029 NON-SATD comments. Each of the 13,069 SATD comments is further classified into one of the five types of SATD identified in the seminal work that introduced the concept of SATD.
- repos.zip: This includes the source code of the 5 C++ repositories from which we collected the data.
- Official_workshop_document.pdf: This document was created to ensure a consistent understanding of SATD types among the authors.
Authors
- Sridharan, Murali
- Pham, Phuoc
- Esposito, Matteo
- Lenarduzzi, Valentina
CppSATD is the first dedicated C++ SATD comments dataset comprising multi-class SATD comments along with their code context. It is the largest available multi-type SATD dataset with surrounding code context, comprising over 13k manually annotated SATD comments grouped under different types of SATD.
- silver_standard_cpp.csv: The silver standard dataset comprises a total of 531,367 source code comments from all five repositories identified. We perform an automated annotation for SATD on the entire dataset using the established SATD text patterns from the PENTACET corpus. The automated annotation resulted in 18,973 SATD comments. The remaining 512,394 are treated as NON-SATD comments. This dataset does not involve any human verification but is solely based on the SATD patterns from the PENTACET corpus.
- gold_standard_cpp.csv: We utilize the SATD comments from the silver standard data. We manually verified the 18,973 comments to validate whether they are SATD. To ascertain that the 512,394 are NON-SATD comments at a 99% confidence level and a 1% margin of error, we take a random sample of 16,125 source code comments from the 512,394 comments. Overall, the manual verification resulted in a total of 13,069 SATD comments and 22,029 NON-SATD comments. Each of the 13,069 SATD comments is further classified into one of the five types of SATD identified in the seminal work that introduced the concept of SATD.
- repos.zip: This includes the source code of the 5 C++ repositories from which we collected the data.
Authors
- Sridharan, Murali
- Pham, Phuoc
- Esposito, Matteo
- Lenarduzzi, Valentina
CppSATD is the first dedicated C++ SATD comments dataset comprising multi-class SATD comments along with their code context. It is the largest available multi-type SATD dataset with surrounding code context, comprising over 13k manually annotated SATD comments grouped under different types of SATD.
- silver_standard_cpp.csv: The silver standard dataset comprises a total of 531,367 source code comments from all five repositories identified. We perform an automated annotation for SATD on the entire dataset using the established SATD text patterns from the PENTACET corpus. The automated annotation resulted in 18,973 SATD comments. The remaining 512,394 are treated as NON-SATD comments. This dataset does not involve any human verification but is solely based on the SATD patterns from the PENTACET corpus.
- gold_standard_cpp.csv: We utilize the SATD comments from the silver standard data. We manually verified the 18,973 comments to validate whether they are SATD. To ascertain that the 512,394 are NON-SATD comments at a 99% confidence level and a 1% margin of error, we take a random sample of 16,125 source code comments from the 512,394 comments. Overall, the manual verification resulted in a total of 13,069 SATD comments and 22,029 NON-SATD comments. Each of the 13,069 SATD comments is further classified into one of the five types of SATD identified in the seminal work that introduced the concept of SATD.
- repos.zip: This includes the source code of the 5 C++ repositories from which we collected the data.
Authors
- Sridharan, Murali
- Pham, Phuoc
- Esposito, Matteo
- Lenarduzzi, Valentina
LO2 dataset
This is the data repository for the LO2 dataset. Here is an overview of the contents.
- lo2-data.zip: This is the main dataset. This is the completely unedited output of our data collection process. Note that the uncompressed size is around 540 GB. For more information, see the paper and the data-appendix in this repository.
- lo2-sample.zip: This is a sample that contains the data used for preliminary analysis. It contains only service logs and the most relevant metrics for the first 100 runs. Furthermore, the metrics are combined on a run level into a single CSV file to make them easier to utilize.
- data-appendix.pdf: This document contains further details and stats about the full dataset. These include file size distributions, empty file analysis, log type analysis and the appearance of an unknown file.
- lo2-scripts.zip: Various scripts for processing the data to create the sample, to conduct the preliminary analysis and to create the statistics seen in the data-appendix.
  - csv_generator.py, csv_merge*.py: These scripts create and combine the metrics into CSV files. They need to be run in order. Merging runs to global is very memory intensive.
  - findempty.py: Finds empty files in the folders. As some are expected to be empty, it also counts the unexpected ones. Used in data-appendix.
  - loglead_lo2.py: Script for the preliminary analysis of the logs for error detection. Requires LogLead version 1.2.1.
  - logstats.py: Counts log lines and their type. Used for creating the figure of number of lines per type and service.
  - node_exporter_metrics.txt: Metric descriptions exported from Prometheus (text file).
  - pca.py: The Principal Component Analysis script used for preliminary analysis.
  - reduce_logs.py: Very important for fair analysis, as the beginning of the files contains some initialization rows that leak information regarding correctness.
  - requirements.txt: Required Python libraries to run the scripts.
  - sizedist.py: Creates distributions of file sizes per filename for the data-appendix.
Version v2: Fixed LogLead version number and minor changes in scripts.
Authors
- Bakhtin, Alexander
- Nyyssölä, Jesse
- Wang, Yuqing
- Ahmad, Noman
- Ping, Ke
- Esposito, Matteo
- Mäntylä, Mika
- Taibi, Davide
LO2 dataset
This is the data repository for the LO2 dataset. Here is an overview of the contents.
- lo2-data.zip: This is the main dataset. This is the completely unedited output of our data collection process. Note that the uncompressed size is around 540 GB. For more information, see the paper and the data-appendix in this repository.
- lo2-sample.zip: This is a sample that contains the data used for preliminary analysis. It contains only service logs and the most relevant metrics for the first 100 runs. Furthermore, the metrics are combined on a run level into a single CSV file to make them easier to utilize.
- data-appendix.pdf: This document contains further details and stats about the full dataset. These include file size distributions, empty file analysis, log type analysis and the appearance of an unknown file.
- lo2-scripts.zip: Various scripts for processing the data to create the sample, to conduct the preliminary analysis and to create the statistics seen in the data-appendix.
  - csv_generator.py, csv_merge*.py: These scripts create and combine the metrics into CSV files. They need to be run in order. Merging runs to global is very memory intensive.
  - findempty.py: Finds empty files in the folders. As some are expected to be empty, it also counts the unexpected ones. Used in data-appendix.
  - loglead_lo2.py: Script for the preliminary analysis of the logs for error detection. Requires LogLead version 1.2.0.
  - logstats.py: Counts log lines and their type. Used for creating the figure of number of lines per type and service.
  - node_exporter_metrics.txt: Metric descriptions exported from Prometheus (text file).
  - pca.py: The Principal Component Analysis script used for preliminary analysis.
  - reduce_logs.py: Very important for fair analysis, as the beginning of the files contains some initialization rows that leak information regarding correctness.
  - requirements.txt: Required Python libraries to run the scripts.
  - sizedist.py: Creates distributions of file sizes per filename for the data-appendix.
Authors
- Bakhtin, Alexander
- Nyyssölä, Jesse
- Wang, Yuqing
- Ahmad, Noman
- Ping, Ke
- Esposito, Matteo
- Mäntylä, Mika
- Taibi, Davide