Published on 01 January 2026
Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering
Description
This repository contains the train and test data used in the experiments with the Claim2Vec model, the first multilingual embedding model optimized to represent fact-check claims as vectors in an improved semantic embedding space.

MultiClaim - Train
49K multilingual claim pairs, each annotated as similar or dissimilar using three large language models. All claim pairs belong to topic group 1.

Content:
CID_1, CID_2 - Fact-checked claim IDs from the original dataset, MultiClaimNet
CLAIM_1, CLAIM_2 - Claims in their original language
Translation_1, Translation_2 - Claims in English translation
Language_1, Language_2 - Languages of the claims
Label - 1/0 (similar/dissimilar)

MultiClaim - Test
A subset of the MultiClaim clusters from MultiClaimNet discussing topic group 2. This set is composed of 42.4K claims grouped into 16K clusters.

Preprint: https://arxiv.org/abs/2604.09812

References
If you use claim pairs from the Claim2Vec research in any publication, project, tool, or in any other form, please cite the following paper:

@misc{panchendrarajan2026claim2ve,
  title={Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering},
  author={Rrubaa Panchendrarajan and Arkaitz Zubiaga},
  year={2026},
  eprint={2604.09812},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.09812},
}
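
Example: reading the train pairs
The column layout described for MultiClaim - Train could be read along the following lines. This is only a minimal sketch under stated assumptions: the description does not name the file or its format, so a comma-separated file with a header row is assumed; the second translation column is assumed to be Translation_2; and the two sample rows are synthetic placeholders, not records from the dataset. The helper names load_pairs and label_counts are hypothetical.

```python
import csv
import io
from collections import Counter

# Column names as listed in the description.
# Assumption: the second translation column is named Translation_2.
COLUMNS = [
    "CID_1", "CID_2",
    "CLAIM_1", "CLAIM_2",
    "Translation_1", "Translation_2",
    "Language_1", "Language_2",
    "Label",
]

# Synthetic placeholder rows for illustration only (NOT real dataset records).
# Assumption: the data is comma-separated with a header row.
SAMPLE_CSV = """CID_1,CID_2,CLAIM_1,CLAIM_2,Translation_1,Translation_2,Language_1,Language_2,Label
a1,b1,placeholder claim A,placeholder claim B,placeholder claim A,placeholder claim B,xx,yy,1
a2,b2,placeholder claim C,placeholder claim D,placeholder claim C,placeholder claim D,xx,yy,0
"""


def load_pairs(handle):
    """Read claim pairs from an open text handle and cast Label to int."""
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    pairs = []
    for row in reader:
        row["Label"] = int(row["Label"])  # 1 = similar, 0 = dissimilar
        pairs.append(row)
    return pairs


def label_counts(pairs):
    """Count how many pairs are labeled similar (1) and dissimilar (0)."""
    return Counter(pair["Label"] for pair in pairs)


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
counts = label_counts(pairs)
print(counts)
```

To use this with the actual data, replace io.StringIO(SAMPLE_CSV) with an open handle to the downloaded train file, adjusting the delimiter and column names if they differ from the assumptions above.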