Published on 12 April 2022

Version 0.3

Spanish Biomedical Crawled Corpus

Carrino, Casimiro Pio; Silveira-Ocampo, Joaquín; Gonzalez-Agirre, Aitor; Gutiérrez-Fandiño, Asier; Krallinger, Martin; Villegas, Marta

Description

The largest Spanish biomedical and health corpus to date, gathered with a massive Spanish health-domain crawler: more than 3,000 URLs were downloaded and preprocessed. All the collected data were processed to produce the CoWeSe (Corpus Web Salud Español) resource, a large-scale, high-quality corpus intended for biomedical and health NLP in Spanish. This version is enlarged, with less restrictive document and sentence deduplication.

Citation

If you use this resource in your work, please cite our paper:

@misc{carrino2021spanish,
  title={Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models},
  author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Ona de Gibert Bonet and Asier Gutiérrez-Fandiño and Aitor Gonzalez-Agirre and Martin Krallinger and Marta Villegas},
  year={2021},
  eprint={2109.07765},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Copyright (c) 2022 Secretaría de Estado de Digitalización e Inteligencia Artificial

Metrics

Dataset Index: 0.9
FAIR Score: 81%
Citations: 0
Mentions: 0

Publication Details

DOI

Publisher: Zenodo

Assigned Domain

Subfield: Molecular Biology
Field: Biochemistry, Genetics and Molecular Biology
Domain: Life Sciences
Confidence Score: 95%
Source: OpenAlex

Keywords

Spanish, Biomedical, Corpus, Crawling

Normalization Factors

FT: 30.77
CTw: 1.00
MTw: 1.00