Automated Organization Profile

Instituto de Medicina Molecular João Lobo Antunes

Current S-Index

8.2

Sum of Dataset Indices for all datasets

Average Dataset Index per Dataset

0.9

Average Dataset Index per dataset

Total Datasets

9

Total datasets in this organization

Average FAIR Score

43.4%

Average FAIR Score per dataset

Total Citations

2

Total citations to the organization's datasets

Total Mentions

0

Total mentions of the organization's datasets

S-Index Interpretation

[Charts: S-Index Over Time; Cumulative Citations Over Time; Cumulative Mentions Over Time]

Datasets

Paired datasets to study alternative splicing regulation by individual RNA-binding proteins

This project stores datasets generated to study the regulation of alternative splicing using deep learning models (e.g., SpliceAI). In particular, these datasets were used to perform ablation studies (sequence perturbations at motif locations) to evaluate their effects on the deep learning model.

I used public RNA-Seq data from the ENCODE consortium to identify exons sensitive to the knockdown of RNA-binding proteins (RBPs). The idea is that exons sensitive to RBP knockdowns are more likely to be directly or indirectly regulated by such RBPs, hence providing hints on their regulation mechanisms. Importantly, I also generated paired control exons, which were not alternatively spliced upon RBP knockdown but have similar GC composition and length compared to the knockdown-sensitive exons (target exon and surrounding introns). These control sets were generated to account for potential confounding factors of gene architecture features and, therefore, focus only on RBP binding motifs and their regulatory logic.

Information about the files

After uncompressing the 'paired_dataset.tar.gz' file, a directory with multiple files will be created with the following structure:

  • 0_rMATS_ES_events.tsv.gz: Summary tables of differential splicing analysis, with deltaPSI estimates referring to Ctrl - Knockdown groups. Important columns: 'target_coordinates' refers to the 1-based coordinates of the alternatively spliced exon, and 'group' indicates the individual knockdown experiments where the exon was observed to be alternatively spliced.
  • 0_rMATs_ES_non_changing_events.tsv.gz: Summary tables of differential splicing analysis, but in this case containing all non-changing events (|dPSI| < 0.025).
  • 1_KD_exons_dPSI0.1.tsv.gz: Table with knockdown-sensitive exons along with values for gene architecture features along the exon triplet (exon upstream, intron upstream, cassette exon, intron downstream, exon downstream).
  • 1_Ctrl_exons_dPSI0.025.tsv.gz: Same as '1_KD_exons_dPSI0.1.tsv.gz', but for all non-changing events.
  • 2_paired_datasets.tsv.gz: Paired datasets in tidy format, where knockdown-sensitive exons and their control pairs come in consecutive lines. The 'rbp_name' column refers to the individual knockdown experiment where that exon was observed.
  • 2_paired_datasets_negative_dPSI.tsv.gz, 2_paired_datasets_positive_dPSI.tsv.gz: Same as '2_paired_datasets.tsv.gz', but knockdown-sensitive exons are split according to the direction of dPSI observed in the RNA-Seq data (along with the respective control pair).
  • 2_paired_datasets_individualRBPs: This folder contains the paired datasets in wide format, where a single line contains both the knockdown-sensitive exon and its control pair. In addition, each paired dataset (knockdown of an individual RBP) is written to a separate file.

Details of the sh knockdown RNA-Seq analysis

Because the authors of the ENCODE study (Van Nostrand E.L. et al., 2020) analyzed knockdown RNA-Seq data using an older version of the human genome (hg19) along with old genome annotations (GENCODE v19), I reanalyzed ENCODE data aligned to the hg38 genome build. I used rMATS v4.1.2 on each RBP knockdown experiment to detect differentially spliced events between the two knockdown replicates and the two control replicates. rMATS was run with GENCODE annotations v44 and with --cstat 0.05. Significant knockdown-sensitive events were identified with |deltaPSI| > 0.1, using a False Discovery Rate cutoff of 0.05. Non-changing events, assumed to be knockdown-agnostic controls, were defined as those exhibiting negligible deltaPSI variation (|deltaPSI| < 0.025).

To ensure the high quality of the exon sets, further analytical steps were performed. First, I applied a read coverage filter, retaining events where the median coverage across replicates per condition for the isoform with more read counts was higher than 7. Then, I focused exclusively on exon skipping events in protein-coding genes, and filtered out unannotated exons (pseudoexons) as well as first or last exons of genes. In addition, I excluded duplicate exon skipping events by picking the transcript with the highest biological importance (based on the presence of transcript flags such as MANE selected, CCDS, or APPRIS). A total of 15,235 events were detected across all RBP knockdown experiments (N=72, splicing-associated RBPs with data available for the HepG2 cell line), covering 6,659 unique exons.
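The two thresholds described above can be illustrated with a minimal sketch. The function name, its parameter names, and the label strings are hypothetical and are not taken from the dataset or from rMATS output; the sketch also assumes that the FDR cutoff is applied as a strict "less than" comparison, and it leaves out the read coverage and annotation filters.

```python
def classify_event(delta_psi: float, fdr: float,
                   kd_cutoff: float = 0.1,
                   ctrl_cutoff: float = 0.025,
                   fdr_cutoff: float = 0.05) -> str:
    """Label one splicing event using the thresholds from the description.

    All names here are illustrative; they are not the dataset's column names.
    """
    # Knockdown-sensitive: large absolute deltaPSI and significant FDR.
    if abs(delta_psi) > kd_cutoff and fdr < fdr_cutoff:
        return "knockdown_sensitive"
    # Non-changing control candidate: negligible absolute deltaPSI.
    if abs(delta_psi) < ctrl_cutoff:
        return "non_changing"
    # Everything in between belongs to neither set.
    return "unclassified"


print(classify_event(-0.2, 0.001))
print(classify_event(0.01, 0.9))
```

Events with an intermediate deltaPSI, or with a large deltaPSI but a non-significant FDR, fall into neither set under this reading.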

Authors

  • Pedro Barbosa
0 Citations, 0 Mentions, 13% FAIR, 0.1 Dataset Index
10.5281/zenodo.11193458, May 2024

Paired datasets to study alternative splicing regulation by individual RNA-binding proteins

This project stores datasets generated to study the regulation of alternative splicing using deep learning models (e.g., SpliceAI). In particular, these datasets were used to perform ablation studies (sequence perturbations at motif locations) to evaluate their effects on the deep learning model.

I used public RNA-Seq data from the ENCODE consortium to identify exons sensitive to the knockdown of RNA-binding proteins (RBPs). The idea is that exons sensitive to RBP knockdowns are more likely to be directly or indirectly regulated by such RBPs, hence providing hints on their regulation mechanisms. Importantly, I also generated paired control exons, which were not alternatively spliced upon RBP knockdown but have similar GC composition and length compared to the knockdown-sensitive exons (target exon and surrounding introns). These control sets were generated to account for potential confounding factors of gene architecture features and, therefore, focus only on RBP binding motifs and their regulatory logic.

Information about the files

After uncompressing the 'paired_dataset.tar.gz' file, a directory with multiple files will be created with the following structure:

  • 0_rMATS_ES_events.tsv.gz: Summary tables of differential splicing analysis, with deltaPSI estimates referring to Ctrl - Knockdown groups. Important columns: 'target_coordinates' refers to the 1-based coordinates of the alternatively spliced exon, and 'group' indicates the individual knockdown experiments where the exon was observed to be alternatively spliced.
  • 0_rMATs_ES_non_changing_events.tsv.gz: Summary tables of differential splicing analysis, but in this case containing all non-changing events (|dPSI| < 0.025).
  • 1_KD_exons_dPSI0.1.tsv.gz: Table with knockdown-sensitive exons along with values for gene architecture features along the exon triplet (exon upstream, intron upstream, cassette exon, intron downstream, exon downstream).
  • 1_Ctrl_exons_dPSI0.025.tsv.gz: Same as '1_KD_exons_dPSI0.1.tsv.gz', but for all non-changing events.
  • 2_paired_datasets.tsv.gz: Paired datasets in tidy format, where knockdown-sensitive exons and their control pairs come in consecutive lines. The 'rbp_name' column refers to the individual knockdown experiment where that exon was observed.
  • 2_paired_datasets_negative_dPSI.tsv.gz, 2_paired_datasets_positive_dPSI.tsv.gz: Same as '2_paired_datasets.tsv.gz', but knockdown-sensitive exons are split according to the direction of dPSI observed in the RNA-Seq data (along with the respective control pair).
  • 2_paired_datasets_individualRBPs: This folder contains the paired datasets in wide format, where a single line contains both the knockdown-sensitive exon and its control pair. In addition, each paired dataset (knockdown of an individual RBP) is written to a separate file.

Details of the sh knockdown RNA-Seq analysis

Because the authors of the ENCODE study (Van Nostrand E.L. et al., 2020) analyzed knockdown RNA-Seq data using an older version of the human genome (hg19) along with old genome annotations (GENCODE v19), I reanalyzed ENCODE data aligned to the hg38 genome build. I used rMATS v4.1.2 on each RBP knockdown experiment to detect differentially spliced events between the two knockdown replicates and the two control replicates. rMATS was run with GENCODE annotations v44 and with --cstat 0.05. Significant knockdown-sensitive events were identified with |deltaPSI| > 0.1, using a False Discovery Rate cutoff of 0.05. Non-changing events, assumed to be knockdown-agnostic controls, were defined as those exhibiting negligible deltaPSI variation (|deltaPSI| < 0.025).

To ensure the high quality of the exon sets, further analytical steps were performed. First, I applied a read coverage filter, retaining events where the median coverage across replicates per condition for the isoform with more read counts was higher than 7. Then, I focused exclusively on exon skipping events in protein-coding genes, and filtered out unannotated exons (pseudoexons) as well as first or last exons of genes. In addition, I excluded duplicate exon skipping events by picking the transcript with the highest biological importance (based on the presence of transcript flags such as MANE selected, CCDS, or APPRIS). A total of 15,235 events were detected across all RBP knockdown experiments (N=72, splicing-associated RBPs with data available for the HepG2 cell line), covering 6,659 unique exons.
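As a rough illustration of the tidy layout described for '2_paired_datasets.tsv.gz', the sketch below reads a gzipped TSV and groups consecutive rows into pairs. It assumes that the knockdown-sensitive row comes first and its control second, which the description implies but does not state outright; the function name is hypothetical, and no column names beyond those in the file header are assumed.

```python
import csv
import gzip


def read_pairs(path):
    """Yield (knockdown_row, control_row) dicts from a tidy paired TSV.

    Assumes rows come in consecutive pairs, knockdown first (an assumption).
    """
    with gzip.open(path, "rt", newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows (one pair per two lines)")
    # Step through the table two rows at a time.
    for i in range(0, len(rows), 2):
        yield rows[i], rows[i + 1]
```

A call such as `for kd, ctrl in read_pairs("2_paired_datasets.tsv.gz"): ...` would then give access to both members of a pair at once, similar in spirit to the wide-format files in '2_paired_datasets_individualRBPs'.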

Authors

  • Pedro Barbosa
0 Citations, 0 Mentions, 13% FAIR, 0.1 Dataset Index
10.5281/zenodo.11193459, May 2024

Local synthetic datasets generation - manuscript data (Version: 2.0)

Datasets generated under the work "Semantically Rich Local Dataset Generation for Explainable AI in Genomics".

Relevant files and directories:

  • datasets.tar.gz: Full copy of the GitHub repository, except it contains the datasets generated in the manuscript.

The repo structure is organized as follows:

  • 1_hyperparameter_search: Contains the Optuna output and the datasets generated from the top 5 trials of each strategy.
  • 2_performanceComparison: Contains the datasets generated for the top trial of each strategy across more seeds.
  • 3_ablation_studies: Directory with the output of all experiments that evaluated the impact of some hyperparameters on the evolutionary search.
  • 4_generalization: Datasets generated from diverse input sequences.
  • data/cache: Fasta of the human genome along with transcript cache used to extract exon triplets.
  • figures.ipynb: Notebook to generate the manuscript figures.

Authors

  • Pedro Barbosa
0 Citations, 0 Mentions, 13% FAIR, 0.1 Dataset Index
10.5281/zenodo.10607868, April 2024

Local synthetic datasets generation - manuscript data (Version: 2.0)

Datasets generated under the work "Semantically Rich Local Dataset Generation for Explainable AI in Genomics".

Relevant files and directories:

  • datasets.tar.gz: Full copy of the GitHub repository, except it contains the datasets generated in the manuscript.

The repo structure is organized as follows:

  • 1_hyperparameter_search: Contains the Optuna output and the datasets generated from the top 5 trials of each strategy.
  • 2_performanceComparison: Contains the datasets generated for the top trial of each strategy across more seeds.
  • 3_ablation_studies: Directory with the output of all experiments that evaluated the impact of some hyperparameters on the evolutionary search.
  • 4_generalization: Datasets generated from diverse input sequences.
  • data/cache: Fasta of the human genome along with transcript cache used to extract exon triplets.
  • figures.ipynb: Notebook to generate the manuscript figures.

Authors

  • Pedro Barbosa
0 Citations, 0 Mentions, 69% FAIR, 0.7 Dataset Index
10.5281/zenodo.10955718, April 2024

Local synthetic datasets generation - manuscript data (Version: 1.0)

Datasets generated under the work "Semantically Rich Local Dataset Generation for Explainable AI in Genomics".

Relevant files and directories:

  • datasets.tar.gz: Full copy of the GitHub repository, except it contains the datasets generated in the manuscript.

The repo structure is organized as follows:

  • 1_hyperparameter_search: Contains the Optuna output and the datasets generated from the top 5 trials of each strategy.
  • 2_performanceComparison: Contains the datasets generated for the top trial of each strategy across more seeds.
  • 3_ablation_studies: Directory with the output of all experiments that evaluated the impact of some hyperparameters on the evolutionary search.
  • 4_generalization: Datasets generated from diverse input sequences.
  • data/cache: Fasta of the human genome along with transcript cache used to extract exon triplets.
  • figures.ipynb: Notebook to generate the manuscript figures.

Authors

  • Pedro Barbosa
0 Citations, 0 Mentions, 44% FAIR, 0.5 Dataset Index
10.5281/zenodo.10607869, February 2024

LMAS Test Dataset - NIBSC Gut DNA Reference

The twenty bacterial replicons of the National Institute for Biological Standards and Control (NIBSC) Gut Community Standards were used as reference. The reference includes the following strains:

Species | Culture collection number | Accession numbers | Status | Gut-Mix-RR Coverage (x) | Gut-HiLo-RR Coverage (x)
Akkermansia muciniphila | DSM 22959 | NC_010655 | Complete genome | 33,33 | 0,94
Alistipes finegoldii | DSM 17242 | NC_018011 | Complete genome | 16,94 | 4,85
Anaerostipes hadrus | DSM 3319 | NZ_KB290627 | Complete genome | 24,04 | 6,89
Bacteroides thetaiotaomicron | DSM 2079 | NC_004663 | Complete genome | 5,99 | 17,19
Bacteroides uniformis | DSM 6597 | GCF_000154205 | Scaffold | 10,81 | 3,1
Bifidobacterium longum subsp. infantis | DSM 20088 | NC_011593 | Complete genome | 29,52 | 84,63
Bifidobacterium longum subsp. longum | DSM 20219 | GCF_900104835 | Contig | 40,19 | 115,1
Blautia wexlerae | DSM 19850 | GCF_000484655 | Scaffold | 11,68 | 0,34
Clostridium butyricum | DSM 10702 | GCF_000409755 | Contig | 11,21 | 32,09
Collinsella aerofaciens | DSM 13712 | GCF_902501475 | Contig | 44,03 | 12,61
Escherichia coli | DSM 1103 | CP009072 | Complete genome | 8,86 | 25,34
Eubacterium hallii | DSM 3353 | GCF_000173975 | Contig | 21,79 | 6,25
Faecalibacterium prausnitzii | DSM 17677 | NZ_CP048437 | Complete genome | 24,66 | 0,72
Lactobacillus gasseri | DSM 20077 | NC_008530 | Complete genome | 66 | 1,91
Parabacteroides distasonis | DSM 20701 | NC_009615 | Complete genome | 10,2 | 29,26
Prevotella copri | DSM 18205 | GCF_000157935 | Scaffold | 19,23 | 55,11
Prevotella melaninogenica | DSM 7089 | NC_014370.1 and NC_014371.1 | 2 Chromosomes | 23,49 | 6,73
Roseburia hominis | DSM 16839 | NC_015977 | Complete genome | 18,31 | 5,24
Roseburia intestinalis | DSM 14610 | NZ_LR027880 | Complete genome | 12,04 | 0,34
Ruminococcus gauvreauii | DSM 19829 | GCF_000425525 | Scaffold | 14,04 | 0,41

The raw sequence data of the mock communities, with an even and staggered distribution of species, is available at:

  • SRR11487941 - Gut-Mix-RR Illumina MiSeq sample
  • SRR11487935 - Gut-Mix-HiLo Illumina MiSeq sample
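The coverage values in this record are written with a comma as the decimal separator, which most parsers will not read as a number. The sketch below shows one way to normalize them before use; the function name is hypothetical, and the three rows are copied from the record above purely as sample input.

```python
def parse_decimal(text: str) -> float:
    """Convert a decimal written with a comma or a dot separator to a float."""
    return float(text.strip().replace(",", "."))


# (species, Gut-Mix-RR coverage, Gut-HiLo-RR coverage), as written in the record.
rows = [
    ("Akkermansia muciniphila", "33,33", "0,94"),
    ("Bifidobacterium longum subsp. longum", "40,19", "115,1"),
    ("Lactobacillus gasseri", "66", "1,91"),
]

# Map each species to its two coverage values as floats.
coverage = {name: (parse_decimal(mix), parse_decimal(hilo))
            for name, mix, hilo in rows}

for name, (mix, hilo) in coverage.items():
    print(f"{name}: Gut-Mix-RR {mix}x, Gut-HiLo-RR {hilo}x")
```

This simple replacement is only safe because none of the values use a thousands separator.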

Authors

  • Mendes, CI
0 Citations, 0 Mentions, 13% FAIR, 0.3 Dataset Index
10.5281/zenodo.7092694, September 2022

LMAS Test Dataset - NIBSC Gut DNA Reference

The twenty bacterial replicons of the National Institute for Biological Standards and Control (NIBSC) Gut Community Standards were used as reference. The reference includes the following strains:

Species | Culture collection number | Accession numbers | Status | Gut-Mix-RR Coverage (x) | Gut-HiLo-RR Coverage (x)
Akkermansia muciniphila | DSM 22959 | NC_010655 | Complete genome | 33.33 | 0.94
Alistipes finegoldii | DSM 17242 | NC_018011 | Complete genome | 16.94 | 4.85
Anaerostipes hadrus | DSM 3319 | NZ_KB290627 | Complete genome | 24.04 | 6.89
Bacteroides thetaiotaomicron | DSM 2079 | NC_004663 | Complete genome | 5.99 | 17.19
Bacteroides uniformis | DSM 6597 | GCF_000154205 | Scaffold | 10.81 | 3.10
Bifidobacterium longum subsp. infantis | DSM 20088 | NC_011593 | Complete genome | 29.52 | 84.63
Bifidobacterium longum subsp. longum | DSM 20219 | GCF_900104835 | Contig | 40.19 | 115.10
Blautia wexlerae | DSM 19850 | GCF_000484655 | Scaffold | 11.68 | 0.34
Clostridium butyricum | DSM 10702 | GCF_000409755 | Contig | 11.21 | 32.09
Collinsella aerofaciens | DSM 13712 | GCF_902501475 | Contig | 44.03 | 12.61
Escherichia coli | DSM 1103 | CP009072 | Complete genome | 8.86 | 25.34
Eubacterium hallii | DSM 3353 | GCF_000173975 | Contig | 21.79 | 6.25
Faecalibacterium prausnitzii | DSM 17677 | NZ_CP048437 | Complete genome | 24.66 | 0.72
Lactobacillus gasseri | DSM 20077 | NC_008530 | Complete genome | 66.00 | 1.91
Parabacteroides distasonis | DSM 20701 | NC_009615 | Complete genome | 10.20 | 29.26
Prevotella copri | DSM 18205 | GCF_000157935 | Scaffold | 19.23 | 55.11
Prevotella melaninogenica | DSM 7089 | NC_014370.1 and NC_014371.1 | 2 Chromosomes | 23.49 | 6.73
Roseburia hominis | DSM 16839 | NC_015977 | Complete genome | 18.31 | 5.24
Roseburia intestinalis | DSM 14610 | NZ_LR027880 | Complete genome | 12.04 | 0.34
Ruminococcus gauvreauii | DSM 19829 | GCF_000425525 | Scaffold | 14.04 | 0.41

The raw sequence data of the mock communities, with an even and staggered distribution of species, is available at:

  • SRR11487941 - Gut-Mix-RR Illumina MiSeq sample
  • SRR11487935 - Gut-Mix-HiLo Illumina MiSeq sample

Authors

  • Mendes, CI
2 Citations, 0 Mentions, 73% FAIR, 2.5 Dataset Index
10.5281/zenodo.7092693, September 2022

LMAS Test Dataset - NIBSC Gut DNA Reference

The twenty bacterial replicons of the National Institute for Biological Standards and Control (NIBSC) Gut Community Standards were used as reference. The reference includes the following strains:

Species | Culture collection number | Accession numbers | Status | Gut-Mix-RR Coverage (x) | Gut-HiLo-RR Coverage (x)
Akkermansia muciniphila | DSM 22959 | NC_010655 | Complete genome | 33,33 | 0,94
Alistipes finegoldii | DSM 17242 | NC_018011 | Complete genome | 16,94 | 4,85
Anaerostipes hadrus | DSM 3319 | NZ_KB290627 | Complete genome | 24,04 | 6,89
Bacteroides thetaiotaomicron | DSM 2079 | NC_004663 | Complete genome | 5,99 | 17,19
Bacteroides uniformis | DSM 6597 | GCF_000154205 | Scaffold | 10,81 | 3,1
Bifidobacterium longum subsp. infantis | DSM 20088 | NC_011593 | Complete genome | 29,52 | 84,63
Bifidobacterium longum subsp. longum | DSM 20219 | GCF_900104835 | Contig | 40,19 | 115,1
Blautia wexlerae | DSM 19850 | GCF_000484655 | Scaffold | 11,68 | 0,34
Clostridium butyricum | DSM 10702 | GCF_000409755 | Contig | 11,21 | 32,09
Collinsella aerofaciens | DSM 13712 | GCF_902501475 | Contig | 44,03 | 12,61
Escherichia coli | DSM 1103 | CP009072 | Complete genome | 8,86 | 25,34
Eubacterium hallii | DSM 3353 | GCF_000173975 | Contig | 21,79 | 6,25
Faecalibacterium prausnitzii | DSM 17677 | NZ_CP048437 | Complete genome | 24,66 | 0,72
Lactobacillus gasseri | DSM 20077 | NC_008530 | Complete genome | 66 | 1,91
Parabacteroides distasonis | DSM 20701 | NC_009615 | Complete genome | 10,2 | 29,26
Prevotella copri | DSM 18205 | GCF_000157935 | Scaffold | 19,23 | 55,11
Prevotella melaninogenica | DSM 7089 | NC_014370.1 and NC_014371.1 | 2 Chromosomes | 23,49 | 6,73
Roseburia hominis | DSM 16839 | NC_015977 | Complete genome | 18,31 | 5,24
Roseburia intestinalis | DSM 14610 | NZ_LR027880 | Complete genome | 12,04 | 0,34
Ruminococcus gauvreauii | DSM 19829 | GCF_000425525 | Scaffold | 14,04 | 0,41

The raw sequence data of the mock communities, with an even and staggered distribution of species, is available at:

  • SRR11487941 - Gut-Mix-RR Illumina MiSeq sample
  • SRR11487935 - Gut-Mix-HiLo Illumina MiSeq sample

Authors

  • Mendes, CI
0 Citations, 0 Mentions, 77% FAIR, 1.9 Dataset Index
10.5281/zenodo.7108658, September 2022

LMAS Test Dataset - NIBSC Gut DNA Reference

The twenty bacterial replicons of the National Institute for Biological Standards and Control (NIBSC) Gut Community Standards were used as reference. The reference includes the following strains:

Species | Culture collection number | Accession numbers | Status | Gut-Mix-RR Coverage (x) | Gut-HiLo-RR Coverage (x)
Akkermansia muciniphila | DSM 22959 | NC_010655 | Complete genome | 33.33 | 0.94
Alistipes finegoldii | DSM 17242 | NC_018011 | Complete genome | 16.94 | 4.85
Anaerostipes hadrus | DSM 3319 | NZ_KB290627 | Complete genome | 24.04 | 6.89
Bacteroides thetaiotaomicron | DSM 2079 | NC_004663 | Complete genome | 5.99 | 17.19
Bacteroides uniformis | DSM 6597 | GCF_000154205 | Scaffold | 10.81 | 3.10
Bifidobacterium longum subsp. infantis | DSM 20088 | NC_011593 | Complete genome | 29.52 | 84.63
Bifidobacterium longum subsp. longum | DSM 20219 | GCF_900104835 | Contig | 40.19 | 115.10
Blautia wexlerae | DSM 19850 | GCF_000484655 | Scaffold | 11.68 | 0.34
Clostridium butyricum | DSM 10702 | GCF_000409755 | Contig | 11.21 | 32.09
Collinsella aerofaciens | DSM 13712 | GCF_902501475 | Contig | 44.03 | 12.61
Escherichia coli | DSM 1103 | CP009072 | Complete genome | 8.86 | 25.34
Eubacterium hallii | DSM 3353 | GCF_000173975 | Contig | 21.79 | 6.25
Faecalibacterium prausnitzii | DSM 17677 | NZ_CP048437 | Complete genome | 24.66 | 0.72
Lactobacillus gasseri | DSM 20077 | NC_008530 | Complete genome | 66.00 | 1.91
Parabacteroides distasonis | DSM 20701 | NC_009615 | Complete genome | 10.20 | 29.26
Prevotella copri | DSM 18205 | GCF_000157935 | Scaffold | 19.23 | 55.11
Prevotella melaninogenica | DSM 7089 | NC_014370.1 and NC_014371.1 | 2 Chromosomes | 23.49 | 6.73
Roseburia hominis | DSM 16839 | NC_015977 | Complete genome | 18.31 | 5.24
Roseburia intestinalis | DSM 14610 | NZ_LR027880 | Complete genome | 12.04 | 0.34
Ruminococcus gauvreauii | DSM 19829 | GCF_000425525 | Scaffold | 14.04 | 0.41

The raw sequence data of the mock communities, with an even and staggered distribution of species, is available at:

  • SRR11487941 - Gut-Mix-RR Illumina MiSeq sample
  • SRR11487935 - Gut-Mix-HiLo Illumina MiSeq sample

Authors

  • Mendes, CI
0 Citations, 0 Mentions, 73% FAIR, 1.8 Dataset Index
10.5281/zenodo.7108669, September 2022
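
The summary figures at the top of this profile are described as a sum and averages over the per-dataset values, so they can be cross-checked against the nine records listed above. The sketch below does this with the values as displayed on the page; the variable names are illustrative. Because the displayed per-dataset values are rounded, the recomputed totals need not match the reported 8.2 and 43.4% exactly, and the reason for any gap is an assumption, not something the page states.

```python
# Per-dataset values in the order the records appear in this profile.
dataset_indices = [0.1, 0.1, 0.1, 0.7, 0.5, 0.3, 2.5, 1.9, 1.8]
fair_scores = [13, 13, 13, 69, 44, 13, 73, 77, 73]  # percent
citations = [0, 0, 0, 0, 0, 0, 2, 0, 0]

# S-Index is described as the sum of Dataset Indices for all datasets.
s_index = sum(dataset_indices)
avg_index = s_index / len(dataset_indices)
avg_fair = sum(fair_scores) / len(fair_scores)
total_citations = sum(citations)

print(f"S-Index from displayed values: {s_index:.1f}")
print(f"Average Dataset Index: {avg_index:.1f}")
print(f"Average FAIR Score: {avg_fair:.1f}%")
print(f"Total Citations: {total_citations}")
```

The average Dataset Index and the citation total agree with the reported 0.9 and 2, while the recomputed S-Index and average FAIR Score come out slightly below the reported figures.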