Automated Organization Profile: Bloomberg
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Mean Dataset Index across this organization's datasets
Total Datasets
Total datasets in this organization
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the organization's datasets
Total Mentions
Total mentions of the organization's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-Index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 7198.1 (the sum of Dataset Index scores across 31 datasets)
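The aggregation described above is a plain sum; as a minimal sketch with hypothetical Dataset Index scores (illustrative values, not the real scores behind the 7198.1 figure):

```python
# Hypothetical Dataset Index scores for an organization's claimed datasets.
dataset_index_scores = [400.0, 250.5, 120.25, 77.25]

# S-Index: the sum of Dataset Index scores across all claimed datasets.
s_index = sum(dataset_index_scores)

# "Average Dataset Index per dataset" from the profile summary.
average_index = s_index / len(dataset_index_scores)

print(f"S-Index: {s_index:.1f}  (avg per dataset: {average_index:.2f})")
```

Adding a dataset, or new citations and mentions to an existing dataset, can only raise the sum, which is why the S-Index grows monotonically over time.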
S-Index Over Time
Cumulative Citations Over Time
Cumulative Mentions Over Time
Datasets
No description available
Authors
- Pedinotti, Paolo
This repository contains resources, namely TempTabQA, developed for the paper: Gupta, V., Kandoi, P., Vora, M., Zhang, S., He, Y., Reinanda, R., Srikumar, V., "TempTabQA: Temporal Question Answering for Semi-Structured Tables". In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023.
TempTabQA comprises 11,454 question-answer pairs extracted from Wikipedia Infobox tables and annotated by human annotators. We provide two test sets instead of one: the Head set with popular, frequent domains, and the Tail set with rarer domains. The annotation files follow the structure below:
- Maindataqapairs: train, dev, head, and tail splits, in both csv and json formats
- Tables: Wikipedia category and table metadata in csv, json, and html formats
Carefully read the LICENCE for non-academic usage. Note: wherever required, consider 2022 as the build date for the dataset.
Authors
- Gupta, Vivek
- Zhang, Shuo
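Since the TempTabQA splits are released as csv files, loading one split could look like the following sketch; the path in the usage comment is hypothetical, and only the header-driven csv layout is assumed:

```python
import csv
from pathlib import Path

def load_split(path):
    """Load one TempTabQA split csv as a list of row dicts;
    column names are taken from the file's own header row."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# e.g. head_set = load_split("path/to/head.csv")  # hypothetical path
```

Using `csv.DictReader` avoids hard-coding column names, so the same helper works for the train, dev, head, and tail splits.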
This repository contains the evaluation sets used in:
A. Benton, T. Shi, O. İrsoy, and I. Malioutov. "Weakly Supervised Headline Dependency Parsing". Findings of EMNLP, 2022.
This dataset contains parse annotations for English news headlines and a script to produce conllu files joined with the original headline text. Parse annotations are joined to the corresponding text by running:

  LDC_NYT_DIR="/PATH/TO/UNTARRED/LDC2008T19/"  # path to untarred LDC2008T19
  python build_eht.py --nyt_dir ${LDC_NYT_DIR} --num_proc 4

This will download the Google sentence compression (GSC) dataset and build conllu files for the GSC examples. If you have the New York Times Annotated Corpus (LDC2008T19) untarred locally, this will also join annotations to the NYT examples (location passed via --nyt_dir). Increase the argument to --num_proc to process more shards of the NYT corpus in parallel and reduce build time. The above was tested with Python 3.9.7.
The EHT evaluation sets, with gold-annotated POS tags and dependency relations, are built as EHT/gsc.test.conllu and EHT/nyt.test.conllu. Silver, projected trees, which we used to train and validate our models, are built under GSC_projected. These are not gold parse trees (they are projected predictions from the article lead sentence) and are shared purely for reproducibility's sake.
Authors
- Benton, Adrian
- Shi, Tianze
- Irsoy, Ozan
- Malioutov, Igor
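The build step above emits standard 10-column CoNLL-U files; a minimal stdlib reader for that format could look like the sketch below (the particular tuple of fields kept per token is a choice for illustration, not part of the release):

```python
from pathlib import Path

def read_conllu(path):
    """Yield sentences as lists of (form, upos, head, deprel) tuples."""
    sent = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():              # blank line ends a sentence
            if sent:
                yield sent
                sent = []
        elif line.startswith("#"):        # sentence-level metadata, e.g. "# text = ..."
            continue
        else:
            cols = line.split("\t")       # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            if "-" in cols[0] or "." in cols[0]:   # skip multiword-token / empty-node rows
                continue
            sent.append((cols[1], cols[3], int(cols[6]), cols[7]))
    if sent:                              # file may not end with a blank line
        yield sent

# e.g. for sent in read_conllu("EHT/gsc.test.conllu"): ...
```

A HEAD value of 0 marks the syntactic root of the headline, per the CoNLL-U convention.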
This repository contains resources developed for the paper: Gupta, V., Zhang, S., Vempala, A., He, Y., Choji, T., Srikumar, V., "Right for the Right Reason: Evidence Extraction for Trustworthy Tabular Reasoning". In: Proceedings of the Association for Computational Linguistics 2022 (ACL '22), May 2022. It includes the relevant-row markings for the train set of the InfoTabS dataset (https://infotabs.github.io/), Gupta et al. 2020 [1]. We followed the protocol of Gupta et al. (2022) [2], which annotated the development and test sets (alpha1, alpha2, alpha3): one table and three distinct hypotheses formed a HIT. We divided the tasks equally into 110 batches, each batch having 51 HITs and each HIT three examples. In total, we collected 81,282 annotations from 90 distinct annotators. Twenty-five annotators completed over 1,000 tasks each, corresponding to 87.75% of the examples, indicating a tail distribution in the annotations. Overall, 16,248 training-set table-hypothesis pairs were successfully labeled with evidence rows. On average, we obtain an 89.49% F1-score, with equal precision and recall, for annotation agreement when compared with the majority vote. The repository also includes the annotation template used on the mTurk platform for crowdsourcing. The cited datasets were used in this work.
Files to access the annotation follow the structure below:
- annotation_batches
  - batches_test: final results ".csv" files for all development and test set batches (taken from Gupta et al. 2022)
  - batches_train: our annotated results ".csv" files for all train set batches
  - README.md: readme with details of the annotation batches
- main_template_row_relevant.html: the annotation template used for each HIT, i.e. marking the relevant rows for each instance
- annotation_stats.md: details of the annotation statistics
- release_mturk: release batch details, i.e. csv for the corresponding released batches
Files to recreate the annotation statistics and pre-processed data:
- results_test: pre-processed batch csv for each dev and test set batch; the integrated one computes the agreement stats for all batches (taken from Gupta et al. 2022)
- results_train: similar to results_test, except it contains the pre-processed batch csv for the train set
- scripts: the scripts needed to create the csv files in results_test and results_train; each script's title denotes the statistic it computes
- src: python files used by the scripts to create the relevant statistics
References:
[1] InfoTabS: Inference on Tables as Semi-structured Data, Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, Vivek Srikumar, ACL 2020
[2] Is My Model Using The Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning, Vivek Gupta, Riyaz A. Bhat, Atreya Ghosal, Manish Srivastava, Maneesh Singh, Vivek Srikumar, TACL 2022, presented at ACL 2022
Authors
- Gupta, Vivek
- Zhang, Shuo
- Vempala, Alakananda
- He, Yujie
- Choji, Temma
- Srikumar, Vivek
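The agreement figures above compare each annotator's relevant-row set against the majority vote; a sketch of that per-example precision/recall/F1, with invented row sets:

```python
def prf1(predicted_rows, gold_rows):
    """Precision, recall, and F1 of one annotator's relevant-row set
    against the majority-vote set for the same example."""
    predicted, gold = set(predicted_rows), set(gold_rows)
    tp = len(predicted & gold)                       # rows both agree are relevant
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented example: the annotator marks rows {1, 3, 4}; the majority vote is {1, 3}.
p, r, f = prf1({1, 3, 4}, {1, 3})
```

One way precision and recall come out equal, as in the reported average, is when an annotator marks the same number of rows as the majority vote.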
In this work, we explore how to learn task-specific language models aimed at learning rich representations of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in both discriminative and generative settings. In the discriminative setting, we introduce a new pre-training objective, Keyphrase Boundary Infilling with Replacement (KBIR), showing large gains in performance (up to 9.26 points in F1) over SOTA when an LM pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART, KeyBART, that reproduces the keyphrases related to the input text in the CatSeq format, instead of the denoised original input. This also leads to gains in performance (up to 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization, and achieve performance comparable with the SOTA, showing that learning rich representations of keyphrases is indeed beneficial for many other fundamental NLP tasks.
As part of this zip file, we release the KBIR model, which is continually pre-trained on RoBERTa-Large, and the KeyBART model, which is continually pre-trained on BART-Large. Both models can be used in place of a RoBERTa-Large or BART-Large model in PyTorch codebases and with HuggingFace.
Authors
- Kulkarni, Mayank
- Mahata, Debanjan
- Arora, Ravneet
- Bhowmik, Rajarshi
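KeyBART's target side concatenates the keyphrases in CatSeq format; as a sketch, assuming a ";" separator between phrases (check the released model card for the exact convention) and using invented phrases:

```python
def to_catseq(keyphrases, sep=";"):
    """Concatenate keyphrases into a single CatSeq-style target string
    (separator choice is an assumption for illustration)."""
    return sep.join(kp.strip().lower() for kp in keyphrases)

# Invented phrases for an NLP abstract:
target = to_catseq(["Keyphrase Generation", "pre-training", "BART"])
# target == "keyphrase generation;pre-training;bart"
```

Framing generation as producing this single flat string is what lets a standard sequence-to-sequence model like BART emit a variable number of keyphrases.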