Automated Author Profile
DepMap, Broad
Current S-Index
Sum of Dataset Indices for all datasets
Average Dataset Index per Dataset
Average Dataset Index per dataset
Total Datasets
Total datasets for this author
Average FAIR Score
Average FAIR Score per dataset
Total Citations
Total citations to the author's datasets
Total Mentions
Total mentions of the author's datasets
S-Index Interpretation
The S-Index (Sharing Index) is a comprehensive metric that represents the cumulative impact of all your datasets. It is calculated as the sum of Dataset Index scores across all your claimed datasets.
What it means:
- A higher S-Index indicates greater overall impact of your datasets relative to typical datasets in their fields of research
- The S-Index grows as you add more datasets or as existing datasets gain more citations and mentions
- It provides a single number to track your research data impact over time
Current S-Index: 103.5 (sum of the Dataset Index scores of 45 datasets)
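As a rough illustration of the arithmetic described above, the sketch below sums per-dataset scores into an S-Index and derives the average per dataset. The function names and the three example scores are hypothetical; only the totals 103.5 and 45 are taken from this profile.

```python
def s_index(dataset_indices):
    """Sum of Dataset Index scores across all claimed datasets."""
    return sum(dataset_indices)


def average_dataset_index(dataset_indices):
    """Mean Dataset Index per dataset; 0.0 when no datasets are claimed."""
    if not dataset_indices:
        return 0.0
    return s_index(dataset_indices) / len(dataset_indices)


# Hypothetical scores for three datasets (not taken from this profile).
example_scores = [1.5, 4.0, 2.0]
print(s_index(example_scores))                # 7.5
print(average_dataset_index(example_scores))  # 2.5

# Using the totals reported on this profile: 103.5 across 45 datasets.
print(round(103.5 / 45, 2))  # 2.3
```

Because the S-Index is a plain sum, adding a dataset can only raise it when that dataset's index is positive, while the average can move in either direction.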
S-Index Over Time
Cumulative Citations Over Time
Cumulative Mentions Over Time
Datasets
This dataset supports the "Gene expression has more power for predicting in vitro cancer cell vulnerabilities than genomics" preprint by Dempster et al. To generate the figure panels seen in the preprint using these data, use FigurePanelGeneration.ipynb. This study includes five datasets (citations and details in manuscript):
- Achilles: The Broad Institute's DepMap public 19Q4 CRISPR knockout screens processed with CERES
- Score: The Wellcome Sanger Institute's Project Score CRISPR knockout screens processed with CERES
- RNAi: The DEMETER2-processed combined dataset, which includes RNAi data from the Achilles, DRIVE, and Marcotte breast screens
- PRISM: The PRISM pooled in vitro repurposing primary screen of compounds
- GDSC17: In vitro cancer drug screens performed by Sanger
The files of most interest to a biologist are the Summary.csv files. If you are interested in trying machine learning, Features.hdf5 and Target.hdf5 contain the data munged into a form convenient for standard supervised machine learning algorithms.
Some large files are in the binary HDF5 format for efficient storage and fast loading. Each of these files contains three named HDF5 datasets: "dim_0" holds the row/index names as an array of strings, "dim_1" holds the column names as an array of strings, and "data" holds the matrix contents as a 2D array of floats. In Python, these files can be read with:

    import h5py
    import numpy as np
    import pandas as pd

    def read_hdf5(filename):
        # The context manager closes the file even if reading fails.
        with h5py.File(filename, 'r') as src:
            # Row and column names are stored as byte strings, so decode them.
            dim_0 = [x.decode('utf8') for x in src['dim_0']]
            dim_1 = [x.decode('utf8') for x in src['dim_1']]
            # Copy the matrix into memory before the file is closed.
            data = np.array(src['data'])
        return pd.DataFrame(index=dim_0, columns=dim_1, data=data)
Files (not every dataset will have every type of file listed below):
AllFeaturePredictions.hdf5: Matrix of cell lines by perturbations, with values indicating the predicted viability using a model with all feature types.
ENAdditionScore.csv: A matrix of perturbations by number of features. Values indicate an elastic net model performance (Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5) using only the top X features, where X is the column header.
FeatureDropScore.csv: Perturbations and predictive performance for a model using all single gene expression features EXCEPT those that had greater than 0.1 feature importance in a model trained with all single gene expression features.
Features.hdf5: A very large matrix of all cell lines by all used CCLE cell features. Continuous features were z-scored. Cell lines missing mutation or expression data were dropped. Remaining NA values were imputed to zero. Feature types are indicated by the column name suffixes:
- _Exp: expression
- _Hot: hotspot mutation
- _Dam: damaging mutation
- _OtherMut: other mutation
- _CN: copy number
- _GSEA: ssGSEA score for an MSigDB gene set
- _MethTSS: methylation of transcription start sites
- _MethCpG: methylation of CpG islands
- _Fusion: gene fusions
- _Cell: cell tissue properties
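The suffix convention and the preprocessing described for Features.hdf5 can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the column names such as "TP53_Exp" are hypothetical, both helper functions are invented for this example, and the use of the population standard deviation is an assumption, since the description does not say which variant was used.

```python
import math
from collections import defaultdict


def group_by_suffix(columns):
    """Map each feature-type suffix (e.g. 'Exp') to the columns carrying it."""
    groups = defaultdict(list)
    for name in columns:
        # Split on the last underscore so names containing underscores still work.
        _, _, suffix = name.rpartition("_")
        groups[suffix].append(name)
    return dict(groups)


def zscore_then_impute(values):
    """Z-score the observed values, then set missing entries (None) to 0.0."""
    observed = [v for v in values if v is not None]
    if not observed:
        return [0.0] * len(values)
    mean = sum(observed) / len(observed)
    # Assumption: population standard deviation (divide by n).
    std = math.sqrt(sum((v - mean) ** 2 for v in observed) / len(observed))
    if std == 0:
        return [0.0] * len(values)
    return [0.0 if v is None else (v - mean) / std for v in values]


# Hypothetical column names following the suffix convention.
columns = ["TP53_Exp", "KRAS_Hot", "MYC_CN", "EGFR_Exp"]
print(group_by_suffix(columns))

# One hypothetical continuous feature with a missing value.
print(zscore_then_impute([1.0, None, 3.0]))  # [-1.0, 0.0, 1.0]
```

If imputation follows z-scoring, as the order of the description suggests, filling with zero is equivalent to filling with the feature's mean.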
NormLRT.csv: the normLRT score for the given perturbation
RFAdditionScore.csv: similar to ENAdditionScore, but using a random forest model.
Summary.csv: A dataframe containing predictive model results. Columns:
- model: Specifies the collection of features used (Expression, Mutation, Exp+CN, etc.)
- gene: The perturbation (column in Target.hdf5) examined. For the PRISM and GDSC17 datasets, this is actually a compound.
- overall_pearson: Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5
- feature: The Nth most important feature, found by retraining the model with all cell lines (N = 0-9)
- feature_importance: The feature importance as assessed by sklearn's RandomForestRegressor
Target.hdf5: A matrix of cell lines by perturbations, with entries indicating post-perturbation viability scores. Note that the scales of the viability effects are different for different datasets. See manuscript methods for details.
PerturbationInfo.csv: Additional drug annotations for the PRISM and GDSC17 datasets
ApproximateCFE.hdf5: A set of Cancer Functional Event cell features based on CCLE data, adapted from Iorio et al. 2016 (10.1016/j.cell.2016.06.017)
DepMapSampleInfo.csv: sample info from DepMap_public_19Q4 data, reproduced here as a convenience.
GeneRelationships.csv: A list of genes and their related (partner) genes, with the type of relationship (self, protein-protein interaction, CORUM complex membership, paralog).
OncoKB_oncogenes.csv: A list of genes that have non-expression-based alterations listed as likely oncogenic or oncogenic by OncoKB as of 9 May 2018.
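Several of the files above report a "Pearson correlation of concatenated out-of-sample predictions". The sketch below shows one way such a score could be computed with k-fold cross-validation. It is not the authors' code: ordinary least squares stands in for the elastic net and random forest models, and the function name, fold count, and synthetic data are all assumptions made for illustration.

```python
import numpy as np


def out_of_sample_pearson(X, y, n_folds=5, seed=0):
    """Pearson r between y and predictions made only on held-out folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, n_folds)
    predictions = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)
        # Stand-in model (assumption): least squares with an intercept column.
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        # Each sample is predicted exactly once, by a model that never saw it.
        predictions[test_idx] = A_test @ coef
    # Correlate the concatenated held-out predictions with the true targets.
    return np.corrcoef(predictions, y)[0, 1]


# Synthetic stand-in for one perturbation: 100 cell lines, 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(out_of_sample_pearson(X, y))
```

Concatenating the held-out predictions before correlating gives a single score per perturbation, rather than an average of per-fold correlations that can be noisy when folds are small.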
Authors
- DepMap, Broad
This DepMap Release contains new cell models and data from Whole Genome/Exome Sequencing (Copy Number and Mutation), RNA Sequencing (Expression and Fusions), and Genome-wide CRISPR knockout screens. Also included are updated metadata and mapping files for information about cell models and data relationships, respectively. Each release may contain improvements to the pipelines that generate this data, so you may notice changes from the last release. For more information, please see README.txt.
Authors
- DepMap, Broad
These files contain a summary of the predictability of CRISPRGeneEffect in DepMap 24Q2 with variable numbers of cell lines provided to the predictive model. Subsets of the CRISPR gene effect matrix were supplied to a random forest model, and out-of-sample performance was recorded. The prediction method is similar to that described in https://doi.org/10.1101/2020.02.21.959627, with equivalent code available at https://github.com/broadinstitute/cds-ensemble, commit c7dfcee. The cell lines used in each subset are specified in subsampled_cell_lines.txt.
Authors
- DepMap, Broad
This data release contains the two most recent PRISM Repurposing screens: Repurposing-1M and Repurposing-300. All Repurposing-1M [REP1M] and Repurposing-300 [REP300] compounds (1514 = 1280 REP1M + 234 REP300) were screened in the PRISM assay at a dose of 2.5 μM with a 5-day treatment against 906 cancer cell lines (859 of them passed quality checks (QC) for all tested compounds with two high-quality replicates). Two PRISM cell line collections were used in the assay: PR500A, which includes only adherent cell lines, and PR500B, which has adherent and suspension cell lines. Together, these PRISM cell line collections form the PR1000 cell line collection. All compounds were run in triplicate, and each plate contained positive (Bortezomib, 20 μM) and negative (DMSO) controls. The screen can be considered an extension of the PRISM Repurposing Primary Screen, with PR500A mainly covering the existing cell line panel, while PR500B extends the cell line collection with new subtypes and lineages. For the assay details, please refer to Corsello et al., 2020 (doi.org/10.1038/s43018-019-0018-6) and https://www.theprismlab.org.
Authors
- DepMap, Broad
- Kocak, Mustafa
This DepMap Release contains new cell models and data from Whole Genome/Exome Sequencing (Copy Number and Mutation), RNA Sequencing (Expression and Fusions), and Genome-wide CRISPR knockout screens. Also included are updated metadata and mapping files for information about cell models and data relationships, respectively. Each release may contain improvements to the pipelines that generate this data, so you may notice changes from the last release. For more information, please see README.txt.
Authors
- DepMap, Broad