Automated Organization Profile

Bloomberg

Current S-Index

7,198.1

Sum of Dataset Indices for all datasets

Average Dataset Index per Dataset

232.2

Average Dataset Index per dataset

Total Datasets

31

Total datasets in this organization
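The summary figures above are internally consistent: the S-Index is the sum of Dataset Indices across the organization's datasets, and the average is that sum divided by the dataset count. A minimal sketch using the numbers shown here:

```python
# The profile's S-Index is the sum of Dataset Indices; the per-dataset
# average is that sum divided by the number of datasets in the organization.
s_index = 7198.1      # sum of Dataset Indices for all datasets
total_datasets = 31   # total datasets in this organization

average_index = round(s_index / total_datasets, 1)
print(average_index)  # → 232.2
```

Note that a single high-scoring dataset can dominate this average: here, one entry contributes a Dataset Index of 7157.4 out of the 7198.1 total.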

Average FAIR Score

55.1%

Average FAIR Score per dataset

Total Citations

16,949

Total citations to the organization's datasets

Total Mentions

9

Total mentions of the organization's datasets

[Charts: S-Index Interpretation · S-Index Over Time · Cumulative Citations Over Time · Cumulative Mentions Over Time]

Datasets

MetaGraph for FinNLP Research: A Large-Scale Knowledge Graph of GenAI in Financial NLP (2022–2025)

No description available

Authors

  • Pedinotti, Paolo
0 Citations · 0 Mentions · 13% FAIR · 0.3 Dataset Index
10.5281/zenodo.16795446 · August 2025

MetaGraph for FinNLP Research: A Large-Scale Knowledge Graph of GenAI in Financial NLP (2022–2025)

No description available

Authors

  • Pedinotti, Paolo
0 Citations · 0 Mentions · 13% FAIR · 0.3 Dataset Index
10.5281/zenodo.16795445 · August 2025

TempTabQA: Temporal Question Answering for Semi-Structured Tables (Version: 1.0)

This repository contains resources, namely TempTabQA, developed for the paper: Gupta, V., Kandoi, P., Vora, M., Zhang, S., He, Y., Reinanda, R., Srikumar, V., "TempTabQA: Temporal Question Answering for Semi-Structured Tables". In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023. TempTabQA is a dataset comprising 11,454 question-answer pairs extracted from Wikipedia Infobox tables and annotated by human annotators. We provide two test sets instead of one: the Head set, covering popular, frequent domains, and the Tail set, covering rarer domains. The annotation files follow the structure below:
  • Maindataqapairs: QA pairs split into train, dev, head, and tail sets, in both csv and json formats
  • Tables: Wikipedia category and table metadata in csv, json, and html formats
Carefully read the LICENCE for non-academic usage. Note: wherever required, consider the year 2022 as the build date for the dataset.
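The QA splits described above are distributed in both csv and json form. A minimal sketch of loading a csv split with the standard library, using a made-up two-row sample; the column names (`question`, `answer`) are illustrative assumptions, not documented in this summary:

```python
import csv
import io

# Hypothetical miniature of a TempTabQA split in csv form. The real files
# (under Maindataqapairs/, split into train/dev/head/tail) may use different
# column names; "question" and "answer" here are assumptions for illustration.
sample_csv = """question,answer
How many years after its founding did the company go public?,5
Who held the position longest?,Alice
"""

qa_pairs = list(csv.DictReader(io.StringIO(sample_csv)))
print(len(qa_pairs))          # → 2
print(qa_pairs[0]["answer"])  # → 5
```

The same pattern extends to the Head and Tail test sets, which differ only in the domains they draw from.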

Authors

  • Gupta, Vivek
  • Zhang, Shuo
0 Citations · 0 Mentions · 77% FAIR · 1.7 Dataset Index
10.5281/zenodo.10022926 · October 2023

TempTabQA: Temporal Question Answering for Semi-Structured Tables (Version: 1.0)

This repository contains resources, namely TempTabQA, developed for the paper: Gupta, V., Kandoi, P., Vora, M., Zhang, S., He, Y., Reinanda, R., Srikumar, V., "TempTabQA: Temporal Question Answering for Semi-Structured Tables". In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023. TempTabQA is a dataset comprising 11,454 question-answer pairs extracted from Wikipedia Infobox tables and annotated by human annotators. We provide two test sets instead of one: the Head set, covering popular, frequent domains, and the Tail set, covering rarer domains. The annotation files follow the structure below:
  • Maindataqapairs: QA pairs split into train, dev, head, and tail sets, in both csv and json formats
  • Tables: Wikipedia category and table metadata in csv, json, and html formats
Carefully read the LICENCE for non-academic usage. Note: wherever required, consider the year 2022 as the build date for the dataset.

Authors

  • Gupta, Vivek
  • Zhang, Shuo
0 Citations · 0 Mentions · 77% FAIR · 1.7 Dataset Index
10.5281/zenodo.10022927 · October 2023

The English Headline Treebank corpus

This repository contains the evaluation sets used in

A. Benton, T. Shi, O. İrsoy, and I. Malioutov. "Weakly Supervised Headline Dependency Parsing". Findings of EMNLP, 2022.
This dataset contains parse annotations for English news headlines and a script to produce conllu files joined with original headline text. Parse annotations are joined to the corresponding text by running:
LDC_NYT_DIR="/PATH/TO/UNTARRED/LDC2008T19/"  # path to untarred LDC2008T19
python build_eht.py --nyt_dir ${LDC_NYT_DIR} --num_proc 4
This will download the Google sentence compression (GSC) dataset and build conllu files for the GSC examples. If you have the New York Times Annotated Corpus (LDC2008T19) untarred locally, this will also join annotations to the NYT examples (location passed via --nyt_dir). Increase the argument to --num_proc to process more shards from the NYT corpus in parallel and reduce build time. The above was tested with python 3.9.7. The EHT evaluation sets, with gold-annotated POS tags and dependency relations, are built as EHT/gsc.test.conllu and EHT/nyt.test.conllu. Silver, projected trees, which we used to train and validate our models, are built under GSC_projected. These are not gold parse trees (they are projected predictions from the article lead sentence) and are shared purely for reproducibility's sake.
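The built files use the standard CoNLL-U format: one token per line with ten tab-separated fields, where the UPOS, HEAD, and DEPREL fields carry the POS tags and dependency relations described above. A minimal reading sketch over a synthetic two-token headline (the example sentence is made up, not taken from the corpus):

```python
# Minimal sketch of reading a CoNLL-U file such as the built EHT/gsc.test.conllu.
# CoNLL-U token lines have 10 tab-separated fields:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
conllu = """# text = Markets rally
1\tMarkets\tmarket\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\trally\trally\tVERB\t_\t_\t0\troot\t_\t_

"""

tokens = []
for line in conllu.splitlines():
    if not line or line.startswith("#"):
        continue  # skip sentence-level comments and blank separator lines
    fields = line.split("\t")
    tokens.append({"form": fields[1], "upos": fields[3],
                   "head": int(fields[6]), "deprel": fields[7]})

print([t["deprel"] for t in tokens])  # → ['nsubj', 'root']
```

A HEAD of 0 marks the root of the dependency tree; every other token points at the ID of its syntactic head.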

Authors

  • Benton, Adrian
  • Shi, Tianze
  • Irsoy, Ozan
  • Malioutov, Igor
0 Citations · 0 Mentions · 85% FAIR · 2.1 Dataset Index
10.5281/zenodo.7312046 · November 2022

The English Headline Treebank corpus

This repository contains the evaluation sets used in

A. Benton, T. Shi, O. İrsoy, and I. Malioutov. "Weakly Supervised Headline Dependency Parsing". Findings of EMNLP, 2022.
This dataset contains parse annotations for English news headlines and a script to produce conllu files joined with original headline text. Parse annotations are joined to the corresponding text by running:
LDC_NYT_DIR="/PATH/TO/UNTARRED/LDC2008T19/"  # path to untarred LDC2008T19
python build_eht.py --nyt_dir ${LDC_NYT_DIR} --num_proc 4
This will download the Google sentence compression (GSC) dataset and build conllu files for the GSC examples. If you have the New York Times Annotated Corpus (LDC2008T19) untarred locally, this will also join annotations to the NYT examples (location passed via --nyt_dir). Increase the argument to --num_proc to process more shards from the NYT corpus in parallel and reduce build time. The above was tested with python 3.9.7. The EHT evaluation sets, with gold-annotated POS tags and dependency relations, are built as EHT/gsc.test.conllu and EHT/nyt.test.conllu. Silver, projected trees, which we used to train and validate our models, are built under GSC_projected. These are not gold parse trees (they are projected predictions from the article lead sentence) and are shared purely for reproducibility's sake.

Authors

  • Benton, Adrian
  • Shi, Tianze
  • Irsoy, Ozan
  • Malioutov, Igor
0 Citations · 0 Mentions · 85% FAIR · 2.1 Dataset Index
10.5281/zenodo.7312045 · November 2022

Right for the Right Reason: Evidence Extraction for Trustworthy Tabular Reasoning (Version: 1.0.0)

This repository contains resources developed for the paper: Gupta, V., Zhang, S., Vempala, A., He, Y., Choji, T., Srikumar, V., "Right for the Right Reason: Evidence Extraction for Trustworthy Tabular Reasoning". In: Proceedings of the Association for Computational Linguistics 2022 (ACL '22), May 2022. It includes the relevant-row markings for the train set of the InfoTabS dataset (https://infotabs.github.io/), Gupta et al. 2020 [1]. We followed the protocol of Gupta et al. (2022) [2], which annotated the development and test sets (alpha1, alpha2, alpha3): one table and three distinct hypotheses formed a HIT. We divided the tasks equally into 110 batches, each batch having 51 HITs and each HIT having three examples. In total, we collected 81,282 annotations from 90 distinct annotators. Twenty-five annotators completed over 1,000 tasks each, corresponding to 87.75% of the examples, indicating a long-tail distribution of the annotations. Overall, 16,248 training-set table-hypothesis pairs were successfully labeled with evidence rows. On average, we obtain an 89.49% F1-score, with equal precision and recall, for annotation agreement when compared with the majority vote. The repository also includes the annotation template used on the mTurk platform for crowdsourcing. The cited datasets were used in this work. Files to access the annotations follow the structure below:
  • annotation_batches/batches_test: final results ".csv" files for all development and test set batches (taken from Gupta et al. 2022)
  • annotation_batches/batches_train: our annotated results ".csv" files for all train-set batches
  • annotation_batches/README.md: readme describing the annotation batches
  • main_template_row_relevant.html: the annotation template used for each HIT, i.e. marking the relevant rows for each instance
  • annotation_stats.md: details of the annotation statistics
  • release_mturk: details of the released batches, i.e. the csv for each released batch
Files to recreate the annotation statistics and pre-processed data:
  • results_test: the pre-processed batch csv for each dev and test set batch; the integrated one computes the agreement stats for all batches (taken from Gupta et al. 2022)
  • results_train: similar to results_test, except it contains the pre-processed batch csv for the train set
  • scripts: the scripts needed to create the csv files in results_test and results_train; each script's title denotes the statistic it computes
  • src: python files used by the scripts to create the relevant statistics
References: [1] InfoTabS: Inference on Tables as Semi-structured Data, Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, Vivek Srikumar, ACL 2020. [2] Is My Model Using the Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning, Vivek Gupta, Riyaz A. Bhat, Atreya Ghosal, Manish Srivastava, Maneesh Singh, Vivek Srikumar, TACL 2022, presented at ACL 2022.
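The agreement figure quoted above (an 89.49% F1-score against the majority vote, with equal precision and recall) can be computed per example as the set-overlap F1 between an annotator's selected evidence rows and the majority-vote rows. A minimal sketch with made-up row indices; the helper name `row_f1` is illustrative, not from the released scripts:

```python
# Set-overlap agreement between one annotator's evidence rows and the
# majority-vote rows: precision over the annotator's rows, recall over
# the majority rows, combined as F1. Row indices below are made up.
def row_f1(annotator_rows, majority_rows):
    annotator, majority = set(annotator_rows), set(majority_rows)
    overlap = len(annotator & majority)
    if overlap == 0:
        return 0.0
    precision = overlap / len(annotator)
    recall = overlap / len(majority)
    return 2 * precision * recall / (precision + recall)

print(round(row_f1([1, 3, 4], [1, 3]), 4))  # → 0.8
```

When an annotator selects exactly as many rows as the majority vote, precision equals recall, and the F1-score coincides with both, matching the "equal precision and recall" note above.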

Authors

  • Gupta, Vivek
  • Zhang, Shuo
  • Vempala, Alakananda
  • He, Yujie
  • Choji, Temma
  • Srikumar, Vivek
0 Citations · 0 Mentions · 77% FAIR · 1.7 Dataset Index
10.5281/zenodo.6578592 · May 2022

Right for the Right Reason: Evidence Extraction for Trustworthy Tabular Reasoning (Version: 1.0.0)

This repository contains resources developed for the paper: Gupta, V., Zhang, S., Vempala, A., He, Y., Choji, T., Srikumar, V., "Right for the Right Reason: Evidence Extraction for Trustworthy Tabular Reasoning". In: Proceedings of the Association for Computational Linguistics 2022 (ACL '22), May 2022. It includes the relevant-row markings for the train set of the InfoTabS dataset (https://infotabs.github.io/), Gupta et al. 2020 [1]. We followed the protocol of Gupta et al. (2022) [2], which annotated the development and test sets (alpha1, alpha2, alpha3): one table and three distinct hypotheses formed a HIT. We divided the tasks equally into 110 batches, each batch having 51 HITs and each HIT having three examples. In total, we collected 81,282 annotations from 90 distinct annotators. Twenty-five annotators completed over 1,000 tasks each, corresponding to 87.75% of the examples, indicating a long-tail distribution of the annotations. Overall, 16,248 training-set table-hypothesis pairs were successfully labeled with evidence rows. On average, we obtain an 89.49% F1-score, with equal precision and recall, for annotation agreement when compared with the majority vote. The repository also includes the annotation template used on the mTurk platform for crowdsourcing. The cited datasets were used in this work. Files to access the annotations follow the structure below:
  • annotation_batches/batches_test: final results ".csv" files for all development and test set batches (taken from Gupta et al. 2022)
  • annotation_batches/batches_train: our annotated results ".csv" files for all train-set batches
  • annotation_batches/README.md: readme describing the annotation batches
  • main_template_row_relevant.html: the annotation template used for each HIT, i.e. marking the relevant rows for each instance
  • annotation_stats.md: details of the annotation statistics
  • release_mturk: details of the released batches, i.e. the csv for each released batch
Files to recreate the annotation statistics and pre-processed data:
  • results_test: the pre-processed batch csv for each dev and test set batch; the integrated one computes the agreement stats for all batches (taken from Gupta et al. 2022)
  • results_train: similar to results_test, except it contains the pre-processed batch csv for the train set
  • scripts: the scripts needed to create the csv files in results_test and results_train; each script's title denotes the statistic it computes
  • src: python files used by the scripts to create the relevant statistics
References: [1] InfoTabS: Inference on Tables as Semi-structured Data, Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, Vivek Srikumar, ACL 2020. [2] Is My Model Using the Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning, Vivek Gupta, Riyaz A. Bhat, Atreya Ghosal, Manish Srivastava, Maneesh Singh, Vivek Srikumar, TACL 2022, presented at ACL 2022.

Authors

  • Gupta, Vivek
  • Zhang, Shuo
  • Vempala, Alakananda
  • He, Yujie
  • Choji, Temma
  • Srikumar, Vivek
0 Citations · 0 Mentions · 13% FAIR · 0.3 Dataset Index
10.5281/zenodo.6578593 · May 2022

Learning Rich Representation of Keyphrases from Text

In this work, we explore how to learn task-specific language models aimed towards learning rich representation of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in discriminative as well as generative settings. In the discriminative setting, we introduce a new pre-training objective - Keyphrase Boundary Infilling with Replacement (KBIR), showing large gains in performance (up to 9.26 points in F1) over SOTA, when LM pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART - KeyBART, that reproduces the keyphrases related to the input text in the CatSeq format, instead of the denoised original input. This also led to gains in performance (up to 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we also fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), abstractive summarization and achieve comparable performance with that of the SOTA, showing that learning rich representation of keyphrases is indeed beneficial for many other fundamental NLP tasks. As a part of this zip file we release the KBIR model which is continually pre-trained on RoBERTa-Large and also the KeyBART model which is continually pre-trained on BART-Large. Both these models can be used in place of a RoBERTa-Large or BART-Large model in PyTorch codebases and also with HuggingFace.

Authors

  • Kulkarni, Mayank
  • Mahata, Debanjan
  • Arora, Ravneet
  • Bhowmik, Rajarshi
16,944 Citations · 0 Mentions · 56% FAIR · 7157.4 Dataset Index
10.5281/zenodo.5781449 · December 2021

Learning Rich Representation of Keyphrases from Text (Version: 2)

In this work, we explore how to learn task-specific language models aimed towards learning rich representation of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in discriminative as well as generative settings. In the discriminative setting, we introduce a new pre-training objective - Keyphrase Boundary Infilling with Replacement (KBIR), showing large gains in performance (up to 9.26 points in F1) over SOTA, when LM pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART - KeyBART, that reproduces the keyphrases related to the input text in the CatSeq format, instead of the denoised original input. This also led to gains in performance (up to 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we also fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), abstractive summarization and achieve comparable performance with that of the SOTA, showing that learning rich representation of keyphrases is indeed beneficial for many other fundamental NLP tasks. As a part of this zip file we release the KBIR model which is continually pre-trained on RoBERTa-Large and also the KeyBART model which is continually pre-trained on BART-Large. Both these models can be used in place of a RoBERTa-Large or BART-Large model in PyTorch codebases and also with HuggingFace.

Authors

  • Kulkarni, Mayank
  • Mahata, Debanjan
  • Arora, Ravneet
  • Bhowmik, Rajarshi
0 Citations · 0 Mentions · 85% FAIR · 1.8 Dataset Index
10.5281/zenodo.5784384 · December 2021