Scholar Data

Expression vs genomics for predicting dependencies

This dataset supports the preprint "Gene expression has more power for predicting in vitro cancer cell vulnerabilities than genomics" by Dempster et al. To generate the figure panels seen in the preprint using these data, use FigurePanelGeneration.ipynb. This study includes five datasets (citations and details in the manuscript):
Achilles: The Broad Institute's DepMap public 19Q4 CRISPR knockout screens, processed with CERES.
Score: The Wellcome Sanger Institute's Project Score CRISPR knockout screens, processed with CERES.
RNAi: The DEMETER2-processed combined dataset, which includes RNAi data from the Achilles, DRIVE, and Marcotte breast screens.
PRISM: The PRISM pooled in vitro repurposing primary screen of compounds.
GDSC17: In vitro cancer drug screens performed by Sanger.
The files of most interest to a biologist are the Summary.csv files. If you are interested in trying machine learning, the files Features.hdf5 and Target.hdf5 contain the data arranged in a form convenient for standard supervised machine learning algorithms.
Some large files are in the binary format hdf5 for efficiency in storage space and read time. Each of these files contains three named hdf5 datasets: "dim_0" holds the row/index names as an array of strings, "dim_1" holds the column names as an array of strings, and "data" holds the matrix contents as a 2D array of floats. In Python, these files can be read in with:

    import h5py
    import numpy as np
    import pandas as pd

    def read_hdf5(filename):
        # Open read-only; the file is closed on exit even if reading fails.
        with h5py.File(filename, 'r') as src:
            # Row and column labels are stored as byte strings.
            dim_0 = [x.decode('utf8') for x in src['dim_0']]
            dim_1 = [x.decode('utf8') for x in src['dim_1']]
            data = np.array(src['data'])
        return pd.DataFrame(index=dim_0, columns=dim_1, data=data)
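Once a feature matrix and a target matrix are available as pandas DataFrames, a supervised learning setup needs them aligned on cell lines. The following is a minimal sketch of that step, not the authors' pipeline. The helper name prepare_xy, the cell line and column names, and the presence of missing values in the target are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def prepare_xy(features: pd.DataFrame, target: pd.DataFrame, perturbation: str):
    """Return (X, y) restricted to cell lines with a measured target value."""
    # Keep only cell lines (rows) present in both matrices.
    shared = features.index.intersection(target.index)
    # Drop cell lines where this perturbation has no value (assumed possible).
    y = target.loc[shared, perturbation].dropna()
    X = features.loc[y.index]
    return X, y


# Tiny synthetic stand-ins for Features.hdf5 and Target.hdf5 (values are made up).
features = pd.DataFrame(
    [[0.5, 1.0], [-1.2, 0.0], [0.7, 1.0], [0.1, 0.0]],
    index=["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
    columns=["GENE1_Exp", "GENE2_Hot"],
)
target = pd.DataFrame(
    {"GENE3": [-0.9, np.nan, -0.1]},
    index=["LINE_A", "LINE_B", "LINE_C"],
)

X, y = prepare_xy(features, target, "GENE3")
```

X and y then share the same row order and can be passed to any estimator that accepts a 2D feature array and a 1D target.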
##################################################################
Files (not every dataset will have every type of file listed below):
##################################################################
AllFeaturePredictions.hdf5: Matrix of cell lines by perturbations, with values indicating the predicted viability using a model with all feature types.

ENAdditionScore.csv: A matrix of perturbations by number of features. Values indicate an elastic net model performance (Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5) using only the top X features, where X is the column header.
FeatureDropScore.csv: Perturbations and predictive performance for a model using all single gene expression features EXCEPT those that had greater than 0.1 feature importance in a model trained with all single gene expression features.
Features.hdf5: A very large matrix of all cell lines by all used CCLE cell features. Continuous features were z-scored. Cell lines missing mutation or expression data were dropped. Remaining NA values were imputed to zero. Feature types are indicated by the column name suffixes:
    _Exp: expression
    _Hot: hotspot mutation
    _Dam: damaging mutation
    _OtherMut: other mutation
    _CN: copy number
    _GSEA: ssGSEA score for an MSigDB gene set
    _MethTSS: methylation of transcription start sites
    _MethCpG: methylation of CpG islands
    _Fusion: gene fusions
    _Cell: cell tissue properties
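Because the feature type is encoded in the column suffix, columns can be grouped by type with plain string handling. This is an illustrative sketch only: the function name group_by_feature_type and the example column names are hypothetical, and the exact naming of columns in Features.hdf5 (beyond the suffixes listed above) is not specified here.

```python
from collections import defaultdict

# Suffixes as listed in the Features.hdf5 description (without the underscore).
KNOWN_SUFFIXES = [
    "Exp", "Hot", "Dam", "OtherMut", "CN",
    "GSEA", "MethTSS", "MethCpG", "Fusion", "Cell",
]


def group_by_feature_type(columns):
    """Map each feature type suffix to the list of column names carrying it."""
    groups = defaultdict(list)
    for name in columns:
        # Split on the LAST underscore so names containing underscores still work.
        _, sep, suffix = name.rpartition("_")
        key = suffix if sep and suffix in KNOWN_SUFFIXES else "unknown"
        groups[key].append(name)
    return dict(groups)


# Hypothetical column names, for illustration only.
example_columns = ["SOX10_Exp", "KRAS_Hot", "TP53_Dam", "MYC_CN", "HALLMARK_HYPOXIA_GSEA"]
groups = group_by_feature_type(example_columns)
```

Selecting, say, only expression features then becomes a matter of indexing the DataFrame with groups["Exp"].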
NormLRT.csv: the normLRT score for the given perturbation
RFAdditionScore.csv: similar to ENAdditionScore, but using a random forest model.
Summary.csv: A dataframe containing predictive model results. Columns:
    model: Specifies the collection of features used (Expression, Mutation, Exp+CN, etc.).
    gene: The perturbation (column in Target.hdf5) examined. For the PRISM and GDSC17 datasets this is a compound rather than a gene.
    overall_pearson: Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5.
    feature: The Nth most important feature, found by retraining the model with all cell lines (N = 0-9).
    feature_importance: The feature importance as assessed by sklearn's RandomForestRegressor.
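A common first question for a table with these columns is which feature collection predicts each perturbation best. The sketch below works on a small made-up table using only the model, gene, and overall_pearson columns. The function name best_model_per_gene and all values are illustrative assumptions, and the duplicate-dropping step is a precaution in case a (model, gene) pair appears on several rows, one per top feature.

```python
import pandas as pd

# Made-up rows mimicking the documented columns; not real results.
summary = pd.DataFrame({
    "model": ["Expression", "Expression", "Mutation", "Mutation", "Expression", "Mutation"],
    "gene": ["GENE_A", "GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_B"],
    "overall_pearson": [0.60, 0.60, 0.25, 0.25, 0.10, 0.40],
})


def best_model_per_gene(summary: pd.DataFrame) -> pd.DataFrame:
    """Return, for each perturbation, the model with the highest overall_pearson."""
    # Collapse repeated (model, gene) rows to a single score per pair.
    scores = summary[["model", "gene", "overall_pearson"]].drop_duplicates()
    # idxmax gives the row label of the best score within each perturbation.
    best_idx = scores.groupby("gene")["overall_pearson"].idxmax()
    return scores.loc[best_idx].set_index("gene")


best = best_model_per_gene(summary)
```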
Target.hdf5: A matrix of cell lines by perturbations, with entries indicating post-perturbation viability scores. Note that the scales of the viability effects are different for different datasets. See manuscript methods for details.
PerturbationInfo.csv: Additional drug annotations for the PRISM and GDSC17 datasets
ApproximateCFE.hdf5: A set of Cancer Functional Event cell features based on CCLE data, adapted from Iorio et al. 2016 (10.1016/j.cell.2016.06.017)
DepMapSampleInfo.csv: sample info from DepMap_public_19Q4 data, reproduced here as a convenience.
GeneRelationships.csv: A list of genes and their related (partner) genes, with the type of relationship (self, protein-protein interaction, CORUM complex membership, paralog).
OncoKB_oncogenes.csv: A list of genes that have non-expression-based alterations listed as likely oncogenic or oncogenic by OncoKB as of 9 May 2018.
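Several of the files above report performance as the Pearson correlation of concatenated out-of-sample predictions with the observed values. The sketch below illustrates that metric with k-fold cross-validation: every cell line is predicted exactly once by a model that did not see it, the held-out predictions are concatenated, and a single correlation is computed. A closed-form ridge regression stands in for the elastic net and random forest models named above, and the fold count, penalty, and synthetic data are assumptions chosen for illustration.

```python
import numpy as np


def out_of_sample_pearson(X, y, n_folds=5, ridge=1.0, seed=0):
    """Pearson r between observed y and concatenated held-out predictions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, n_folds)
    predictions = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)
        X_train, y_train = X[train_idx], y[train_idx]
        # Center with training statistics only, so no test information leaks in.
        x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
        A = X_train - x_mean
        # Ridge solution: (A^T A + ridge * I)^-1 A^T (y - mean).
        coef = np.linalg.solve(
            A.T @ A + ridge * np.eye(X.shape[1]), A.T @ (y_train - y_mean)
        )
        # Store predictions at the held-out positions (the "concatenation").
        predictions[test_idx] = (X[test_idx] - x_mean) @ coef + y_mean
    return float(np.corrcoef(predictions, y)[0, 1])


# Synthetic data with a known linear signal plus small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
r = out_of_sample_pearson(X, y)
```

Computing one correlation over all held-out predictions, rather than averaging per-fold correlations, avoids unstable estimates from small folds.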

Authors

DepMap, Broad

0 Citations, 0 Mentions, 85% FAIR, 0.1 Dataset Index

10.6084/m9.figshare.25843450 (2024)

DepMap 24Q4 Public

This DepMap release contains new cell models and data from whole genome/exome sequencing (copy number and mutation), RNA sequencing (expression and fusions), and genome-wide CRISPR knockout screens. Also included are updated metadata and mapping files with information about cell models and data relationships, respectively. Each release may contain improvements to the pipelines that generate these data, so you may notice changes from the last release. For more information, please see README.txt.

Authors

DepMap, Broad

29 Citations, 0 Mentions, 85% FAIR, 10.4 Dataset Index

10.25452/figshare.plus.27993248.v1 (2024)

DepMap 24Q4 Public

Authors

DepMap, Broad

1 Citation, 0 Mentions, 85% FAIR, 0.1 Dataset Index

10.25452/figshare.plus.27993248 (2024)

DepMap Predictability with Subsampling

These files contain a summary of the predictability of CRISPRGeneEffect in DepMap 24Q2 with variable numbers of cell lines provided to the predictive model. Subsets of the CRISPR gene effect matrix were supplied to a random forest model and out-of-sample performance was recorded. The prediction method is similar to that described in https://doi.org/10.1101/2020.02.21.959627, with equivalent code available at https://github.com/broadinstitute/cds-ensemble, commit c7dfcee. The cell lines used in each subset are specified in subsampled_cell_lines.txt.
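The subsampling idea can be sketched as drawing random subsets of cell lines at several sizes and recording which lines fall in each subset. This is only an illustration: the function name subsample_cell_lines, the subset sizes, the cell line names, and the choice of independent (non-nested) draws are assumptions, and the actual subsets are those listed in subsampled_cell_lines.txt.

```python
import random


def subsample_cell_lines(cell_lines, sizes, seed=0):
    """Draw one random subset of cell lines, without replacement, per size."""
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    return {n: sorted(rng.sample(cell_lines, n)) for n in sizes}


# Hypothetical cell line identifiers.
cell_lines = [f"LINE_{i:03d}" for i in range(20)]
subsets = subsample_cell_lines(cell_lines, sizes=[5, 10, 15])
```

Each subset would then be used to select rows of the gene effect matrix before fitting and evaluating the model.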

Authors

DepMap, Broad

0 Citations, 0 Mentions, 15% FAIR, 0.2 Dataset Index

10.6084/m9.figshare.26955886.v1 (2024)

DepMap Predictability with Subsampling

Authors

DepMap, Broad

0 Citations, 0 Mentions, 85% FAIR, 0.1 Dataset Index

10.6084/m9.figshare.26955886 (2024)

Repurposing Public 24Q2

This data release contains the two most recent PRISM Repurposing screens: Repurposing-1M and Repurposing-300. All Repurposing-1M [REP1M] and Repurposing-300 [REP300] compounds (1514 = 1280 REP1M + 234 REP300) were screened in the PRISM assay at a dose of 2.5 μM with a 5-day treatment against 906 cancer cell lines, of which 859 passed quality control (QC) for all tested compounds with two high-quality replicates. Two PRISM cell line collections were used in the assay: PR500A, which includes only adherent cell lines, and PR500B, which includes both adherent and suspension cell lines. Together, these collections form the PR1000 cell line collection. All compounds were run in triplicate, and each plate contained positive (Bortezomib, 20 μM) and negative (DMSO) controls. The screen can be considered an extension of the PRISM Repurposing Primary Screen, with PR500A mainly covering the existing cell line panel, while PR500B extends the cell line collection with new subtypes and lineages. For the assay details, please refer to Corsello et al., 2020 (doi.org/10.1038/s43018-019-0018-6) and https://www.theprismlab.org.

Authors

DepMap, Broad; Kocak, Mustafa

1 Citation, 0 Mentions, 85% FAIR, 0.6 Dataset Index

10.6084/m9.figshare.25917643.v1 (2024)

Repurposing Public 24Q2

Authors

DepMap, Broad; Kocak, Mustafa

0 Citations, 0 Mentions, 85% FAIR, 0.1 Dataset Index

10.6084/m9.figshare.25917643 (2024)

DepMap 24Q2 Public

Authors

DepMap, Broad

31 Citations, 1 Mention, 85% FAIR, 12.7 Dataset Index

10.25452/figshare.plus.25880521.v1 (2024)

DepMap 24Q2 Public

Authors

DepMap, Broad

0 Citations, 0 Mentions, 85% FAIR, 0.9 Dataset Index

10.25452/figshare.plus.25880521 (2024)

Expression vs genomics for predicting dependencies

Authors

DepMap, Broad

0 Citations, 0 Mentions, 85% FAIR, 0.1 Dataset Index

10.6084/m9.figshare.25843450.v1 (2024)

Automated Author Profile
DepMap, Broad

Datasets

Expression vs genomics for predicting dependencies

DepMap 24Q4 Public

DepMap 24Q4 Public

DepMap Predictability with Subsampling

DepMap Predictability with Subsampling

Repurposing Public 24Q2

Repurposing Public 24Q2

DepMap 24Q2 Public

DepMap 24Q2 Public

Expression vs genomics for predicting dependencies
