Published on 09 January 2025

SPIKE-QA: A 50K size English dataset for SLM

Li, Jiawen

Description

SPIKE-QA is a human-indicated QA dataset generated by the GPT4o-small model and collected and merged by the author with a Python script. It contains 50,262 Q&A pairs without time information: each sample is a single, independent question and answer (zero-shot). The topics range from basic science such as physics, chemistry, and math to complex generation problems and everyday chat.

The dataset is provided as a set of Excel tables, each with two columns named "Question" and "Answer". The file SPIKE-QA.csv is the complete dataset in CSV form. The data were collected by prompting GPT so that its output came as a list of tuple pairs, like lis=[("Question1", "Answer1"),("Question2", "Answer2"),...], which was then transformed with a Python script.

The dataset is probably too small to pre-train an LLM from scratch and seems suited mainly to parameter tuning, though paraphrasing the samples might be one way to turn it into a more useful resource. It could also be used for model evaluation because of the diversity and varying length of its samples. Most importantly, it is accessible: the dataset is a CSV file, which makes it easy for beginners to practice with.

Copyright reserved by the author (ORCID: 0009-0002-1449-2803). An alternative DOI for this dataset is 10.34740/kaggle/dsv/10346351.
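The tuple-list format quoted in the description maps directly onto a two-column CSV. The following is a minimal sketch of such a conversion using only Python's standard library; it is not the author's actual script, which is not shown here. The sample pairs and the helper names pairs_to_csv and read_pairs are hypothetical; only the "Question" and "Answer" column names come from the description.

```python
import csv
import io

# Hypothetical sample in the tuple-list shape quoted in the description.
pairs = [
    ("Question1", "Answer1"),
    ("Question2", "Answer2"),
]


def pairs_to_csv(pairs, out):
    """Write (question, answer) tuples as CSV rows under a Question/Answer header."""
    writer = csv.writer(out)
    writer.writerow(["Question", "Answer"])
    writer.writerows(pairs)


def read_pairs(src):
    """Read a Question/Answer CSV back into a list of (question, answer) tuples."""
    reader = csv.DictReader(src)
    return [(row["Question"], row["Answer"]) for row in reader]


# Round trip through an in-memory buffer so the sketch needs no files on disk.
buffer = io.StringIO()
pairs_to_csv(pairs, buffer)
buffer.seek(0)
round_trip = read_pairs(buffer)
```

To work with a file on disk instead of the in-memory buffer, the same helpers can be passed a handle opened with open(path, newline="", encoding="utf-8"), where path would point at a local copy of SPIKE-QA.csv.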

Metrics

Dataset Index: 0.5
FAIR Score: 73%
Citations: 0
Mentions: 0

Publication Details

DOI

Publisher: Zenodo

Assigned Domain

Subfield: General Social Sciences
Field: Social Sciences
Domain: Social Sciences
Confidence Score: 32%
Source: Scholar Data Model

Normalization Factors

FT: 50.00
CTw: 1.00
MTw: 1.00