diff --git a/README.md b/README.md
index dd32568..21466bf 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,5 @@
+[![scorecard-score](https://github.com/recursionpharma/octo-guard-badges/blob/trunk/badges/repo/rxrx-datasets/maturity_score.svg?raw=true)](https://infosec-docs.prod.rxrx.io/octoguard/scorecards/rxrx-datasets)
+[![scorecard-status](https://github.com/recursionpharma/octo-guard-badges/blob/trunk/badges/repo/rxrx-datasets/scorecard_status.svg?raw=true)](https://infosec-docs.prod.rxrx.io/octoguard/scorecards/rxrx-datasets)
 # RxRx Datasets
 This serves as a master repository for information about datasets released
@@ -7,3 +9,4 @@ for public research by [Recursion Pharmaceuticals](recursionpharma.com).
 - [RxRx2 -- Morphological Imaging Dataset of immune perturbations](/rxrx2)
 - [RxRx19a -- Morphological Imaging Dataset of SARS-CoV-2 viral infection](/rxrx19a)
 - [RxRx19b -- Morphological Imaging of the COVID-19-associated cytokine storm](/rxrx19b)
+- [RxRx3 -- Phenomics Map of Biology](/rxrx3)
diff --git a/rxrx1/README.md b/rxrx1/README.md
index 859187c..209d26b 100644
--- a/rxrx1/README.md
+++ b/rxrx1/README.md
@@ -1,23 +1,65 @@
-# RxRx1 Data Dictionary
-
-For more information about the dataset please visit [RxRx.ai](https://www.rxrx.ai/).
-
- * `recursion_dataset_license.pdf`: the license under which the dataset is released.
- * `images/*`: the image data in 8-bit png format. The image paths, such as `U2OS-01/Plate1/B02_s2_w3.png`, can be read as:
-    * Experiment Name: Cell type and batch number (U2OS batch 1)
-    * Plate number (1)
-    * Well location on plate (column B, row 2)
-    * Site (2)
-
- * `rxrx1.csv`: The metadata for the dataset with the following columns:
-    * `site_id`: unique identifier of a given site
-    * `well_id`: unique identifier of a given well
-    * `cell_type`: the cell type
-    * `dataset`: the split that this site belongs; train or test
-    * `experiment`: the experiment name, same as explained above
-    * `plate`: plate number within the experiment
-    * `well`: location on the plate
-    * `site`: indication of the location in the well where image was taken (1 or 2)
-    * `well_type`: indicates if the well is a `treatment`, `negative_control`, or `positive_control`
-    * `sirna`: the siRNA (ThermoFisher ID) that was introduced into the well
-    * `sirna_id`: the siRNAs mapped to integers for ease of classification tasks
\ No newline at end of file
+# RxRx1
+
+RxRx1 is the first dataset released by [Recursion][recursion] in the [RxRx.ai][rxrx] series
+and was the topic of the NeurIPS 2019 CellSignal competition. It contains 125,510 images
+of 6-channel fluorescent cellular microscopy, taken from four kinds of cells perturbed by
+1,138 siRNAs. The goal of the competition was to train models that could identify which
+siRNA was used in a given image taken from an experimental batch not seen in the training data.
+For more information about RxRx1 please visit [RxRx.ai][rxrx1].
+
+RxRx1 is part of a larger set of Recursion datasets that can be found at [RxRx.ai][rxrx] and on
+[GitHub][github]. For questions about this dataset and others please email
+[info@rxrx.ai](mailto:info@rxrx.ai).
+
+
+## Metadata
+
+The metadata can be found in `metadata.csv` and downloaded [from here][download]. The schema of the metadata
+is as follows:
+
+| Attribute | Description |
+|----------------|---------------------------------------------------------------------------------|
+| site_id | Unique identifier of a given site |
+| well_id | Unique identifier of a given well |
+| cell_type | Cell type tested |
+| dataset | The split that this site belongs to; `train` or `test` |
+| experiment | The experiment name, as explained under Images below |
+| plate | Plate number within the experiment |
+| well | Location on the plate |
+| site | Indication of the location in the well where image was taken (1 or 2) |
+| well_type | Indicates if the well is a `treatment`, `negative_control`, or `positive_control` |
+| sirna | The siRNA (ThermoFisher ID) that was introduced into the well |
+| sirna_id | The siRNAs mapped to integers for ease of classification tasks |
+
+## Images
+
+The images are found in `images/*` and can be downloaded [from here][download] (*n.b.* this is 47GB).
+The images are 512x512 8-bit `png` files. The image paths, such as `HUVEC-1/Plate1/M23_s2_w3.png`,
+can be read as:
+
+- Experiment Name: Cell type and experiment number (HUVEC experiment 1)
+- Plate Number: 1
+- Well location on plate: column M, row 23
+- Site: 2
+- Channel: 3
+
+All six channels (`w1` - `w6`) make up a single image of a given site.
+
+Physical resolution: 0.65 micron/pixel.
+
+## Changelog
+- June 2019: original release for CellSignal; train images only
+- December 2019: updated to include test images after completion of CellSignal competition
+- August 2020: file organization updated and license changed to CC-BY-NC-SA
+
+## License
+Creative Commons License
+
+This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
+
+
+[download]: https://rxrx.ai/rxrx1#Download
+[github]: https://github.com/recursionpharma/rxrx-datasets
+[recursion]: http://recursionpharma.com
+[rxrx]: http://rxrx.ai
+[rxrx1]: https://rxrx.ai/rxrx1
diff --git a/rxrx19a/README.md b/rxrx19a/README.md
index 1eddeec..6347a5b 100644
--- a/rxrx19a/README.md
+++ b/rxrx19a/README.md
@@ -9,7 +9,7 @@ and [Functional immune mapping with deep-learning enabled phenomics applied to immunomodulatory and COVID-19 drug discovery][paper2].
-RxRx19a is part of a larger set of Recursion datasets that can be found at [RxRx.ai][rxrx] and on [GitHub][github].
+RxRx19a is part of a larger set of [Recursion][recursion] datasets that can be found at [RxRx.ai][rxrx] and on [GitHub][github].
 For questions about this dataset and others please email [info@rxrx.ai](mailto:info@rxrx.ai).
 ## Metadata
@@ -41,8 +41,18 @@ Well location on plate (column AA, row 2)
 Site (2)
 Channel (3)
-All five channels (`w1` - `w5`) make up an single image of a given `site`.
+All five channels (`w1` - `w5`) make up a single image of a given `site`. Each channel images a single
+cellular stain:
+| channel | stain |
+|----------|-----------------------------------------|
+| `w1` | Hoechst 33342 (nucleus) |
+| `w2` | Concanavalin A (membrane glycoproteins) |
+| `w3` | Phalloidin (Actin) |
+| `w4` | Syto14 (RNA) |
+| `w5` | Wheat germ agglutinin (Golgi) |
+
+Physical resolution: 0.65 micron/pixel.
 ## Deep Learning Embeddings
diff --git a/rxrx19b/README.md b/rxrx19b/README.md
index 1670237..844764f 100644
--- a/rxrx19b/README.md
+++ b/rxrx19b/README.md
@@ -5,7 +5,7 @@ from a high-dimensional human cellular assay for COVID-19 associated disease. Rx
 cytokine storm. For more information about RxRx19b please visit [RxRx.ai][rxrx19b] and the associated preprint,
 [Functional immune mapping with deep-learning enabled phenomics applied to immunomodulatory and COVID-19 drug discovery][paper2].
-RxRx19b is part of a larger set of Recursion datasets that can be found at [RxRx.ai][rxrx] and on [GitHub][github].
+RxRx19b is part of a larger set of [Recursion][recursion] datasets that can be found at [RxRx.ai][rxrx] and on [GitHub][github].
 For questions about this dataset and others please email [info@rxrx.ai](mailto:info@rxrx.ai).
 ## Metadata
@@ -21,7 +21,7 @@ The metadata can be found in `metadata.csv` and downloaded [from here][download]
 | plate | Plate number within the experiment |
 | well | Location on the plate |
 | site | Indication of the location in the well where image was taken (always 1 in RxRx19b) |
-| disease_condition | The disease condition tested in the well (healthy cytokine cocktail, severe cytokine storm cocktail, or no cytokines) |
+| disease_condition | The disease condition tested in the well (`healthy`, healthy cytokine cocktail; `storm-severe`, severe cytokine storm cocktail; or blank, no cytokines) |
 | treatment | Compound tested in the well (if any) |
 | treatment_conc | Compound concentration tested (in uM) |
 | SMILES | Formula of tested compound (as CXSMILES/ChemAxon Extended SMILES) |
@@ -40,6 +40,7 @@ Channel (3)
 All six channels (`w1` - `w6`) make up an single image of a given `site`.
+Physical resolution: 0.65 micron/pixel.
 ## Deep Learning Embeddings
diff --git a/rxrx2/README.md b/rxrx2/README.md
index 8dd64a8..111775a 100644
--- a/rxrx2/README.md
+++ b/rxrx2/README.md
@@ -2,12 +2,11 @@
 For more information about RxRx2 please visit [RxRx.ai][rxrx2] and read the asscociated paper, [Functional immune mapping with deep-learning enabled phenomics applied to immunomodulatory and COVID-19 drug discovery][paper].
-RxRx2 was produced by [Recursion][recursion] and is part of a larger set of datasets than can be found at [RxRx.ai][rxrx].
-
+RxRx2 is part of a larger set of [Recursion][recursion] datasets that can be found at [RxRx.ai][rxrx] and on [GitHub][github].
 For questions about this dataset and others please email [info@rxrx.ai](mailto:info@rxrx.ai).
 ## Metadata
-The metadata can be found in `rxrx2-metadata.csv` and downloaded [from here][download]. The schema of the metadata is as follows:
+The metadata can be found in `metadata.csv` and downloaded [from here][download]. The schema of the metadata is as follows:
 | Attribute | Description |
 |-------------------|-----------------------------------------------------------------------------|
@@ -24,7 +23,7 @@ The metadata can be found in `rxrx2-metadata.csv` and downloaded [from here][dow
 ## Images
-The images are found in `images/*` and can be downloaded [from here][download] (*n.b.* this is 195GB).
+The images are found in `images/*` and can be downloaded [from here][download] (*n.b.* this is 185GB).
 The image data are 1024x1024 8-bit `png` files. The image paths, such as `HUVEC-1/Plate1/AA02_s2_w3.png`, can be read as:
 Experiment Name: Cell type and experiment number (HUVEC experiment 1)
@@ -35,24 +34,29 @@ Channel (3)
 All six channels (`w1` - `w6`) make up an single image of a given `site`.
+Physical resolution: 0.65 micron/pixel.
 ## Deep Learning Embeddings
-The deep learning embeddings can be found in `rxrx2-embeddings.csv` and downloaded [from here][download] (*n.b.* this is 78MB).
+The deep learning embeddings can be found in `embeddings.csv` and downloaded [from here][download] (*n.b.* this is 76MB).
 Each row in the csv has a `site_id` as described in the metadata schema. The remaining 1024 columns is the embedding for that respective site.
+## Changelog
+- August 2020: initial release
+
 ## License
-Creative Commons License
+Creative Commons License
-This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
+This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
+[github]: https://github.com/recursionpharma/rxrx-datasets/
 [rxrx]: http://rxrx.ai
 [rxrx2]: https://rxrx.ai/rxrx2
 [paper]: https://www.biorxiv.org/content/10.1101/2020.08.02.233064v1
diff --git a/rxrx3/README.md b/rxrx3/README.md
new file mode 100644
index 0000000..f9e4d3b
--- /dev/null
+++ b/rxrx3/README.md
@@ -0,0 +1,84 @@
+# RxRx3
+
+At Recursion, we build maps of biology and chemistry to explore uncharted areas of disease biology, unravel its complexity, and industrialize drug discovery. Just as a map helps to navigate the physical world, our maps are designed to help us understand as much as we can about the connectedness of human biology so we can navigate the path to new medicines more efficiently.
+
+Our maps are built using image-based high-dimensional data generated in-house. We conduct up to 2.2 million experiments every week in our highly automated labs, where we use deep learning models to embed high-dimensional representations of billions of images of human cells that have been manipulated by CRISPR/Cas9-mediated gene knockouts, compounds, or other reagents. This allows us to create representations that can be compared and contrasted to predict trillions of relationships across biology and chemistry, even without physically testing all of the possible combinations. Recursion's Maps and associated applications help navigate complex biology and chemistry by revealing relationships across genes and chemical compounds.
+
+RxRx3 is a publicly available map of biology that represents a small subset – less than 1% – of Recursion’s total dataset. MolRec™️ is a simple demo of an application that can be built on this type of map.
+
+Please use the following format to cite this dataset as a whole:
+
+We used the RxRx3 dataset (Fay et al. (2023). RxRx3: Phenomics Map of Biology. bioRxiv 2023.02.07.527350), available from [RxRx.ai][rxrx].
+
+For more information about RxRx3 please visit [RxRx.ai/rxrx3][rxrx3].
+
+RxRx3 is part of a larger set of [Recursion][recursion] datasets that can be found at [RxRx.ai][rxrx] and on [GitHub][github]. For questions about this dataset and others please email [info@rxrx.ai](mailto:info@rxrx.ai).
+
+## Compound and Gene Identifiers
+
+The RxRx3 release contains 17,063 genes, as well as 1,674 known chemical entities at 8 doses each. 16,328 of these genes are anonymized in the dataset, enabling people to explore and learn from this massive dataset while protecting Recursion’s business interests. Recursion may de-anonymize genes in this dataset in the future. If you'd like to understand more about how to get access to unblinded genes please email [info@rxrx.ai](mailto:info@rxrx.ai).
+
+## Metadata
+
+The metadata can be found in `metadata.csv` and downloaded [from here][download]. The schema of the metadata is as follows:
+
+| Attribute | Description |
+|-------------------|-----------------------------------------------------------------------------------------------------------------------|
+| well_id | Experiment Name - Plate - Well (compound-004_1_AA04 or gene-088_9_Z43) |
+| experiment_name | Experiment Name: Experiment number (compound-004 or gene-088) |
+| plate | Plate number in the experiment (1-48) |
+| address | Well location on the plate - "A01" to "AF48". |
+| gene | Unblinded or anonymized gene name, or a control |
+| treatment | Compound synonym or gene-name - guide-number (Narlaprevir or RPLP2_guide_4) |
+| SMILES | Canonical SMILES or blank for non-compounds |
+| concentration | Compound concentration tested (in uM) |
+| perturbation_type | CRISPR or COMPOUND |
+| cell_type | HUVEC |
+
+
+### Metadata Example
+
+To help understand the metadata, we have included some samples that show some of the more complex parts of the format, to allow parser testing and validation.
+
+    well_id,experiment_name,plate,address,gene,treatment,SMILES,concentration,perturbation_type,cell_type
+    gene-079_8_H29,gene-079,8,H29,RPLP2,RPLP2_guide_4,,,CRISPR,HUVEC
+    gene-045_4_AD27,gene-045,4,AD27,RXRX3-43938,RXRX3-43938_guide_6,,,CRISPR,HUVEC
+    gene-060_9_P28,gene-060,9,P28,EMPTY_control,EMPTY_control,,,CRISPR,HUVEC
+    compound-001_19_D20,compound-001,19,D20,,Dequalinium,"CC1=[N+](CCCCCCCCCC[N+]2=C(C)C=C(N)C3=CC=CC=C23)C2=CC=CC=C2C(N)=C1 |c:1,13,21,29,31,35,t:16,19,23,27|",0.25,COMPOUND,HUVEC
+    compound-001_11_U08,compound-001,11,U08,,EMPTY_control,,,COMPOUND,HUVEC
+    compound-004_43_B08,compound-004,43,B08,,CRISPR_control,,,COMPOUND,HUVEC
+
+## Images
+
+The images are found in `images/*` and can be downloaded [from here][download] (this is ~ 83 TB).
+The image data are 2048x2048 16-bit `png` files. These can be downloaded by experiment plate. We provide a tar file for each experiment and plate, with each image from the experiment plate in the tar file. The image file names, such as `AA02_s1_w3.png`, can be read as:
+
+Well location on plate (column AA, row 2)
+Site (1)
+Channel (3)
+
+All six channels (`w1` - `w6`) make up a single image of a given `site`. Note that there is only one site for every well address.
+
+Physical resolution: 0.65 micron/pixel.
+
+## Deep Learning Embeddings
+
+The deep learning embeddings are provided as `embeddings.tar` and can be downloaded [from here][download] (this is ~ 1.82 GB).
+
+The embeddings path is similar to the images path structure, for example `gene-001/Plate1/embeddings.parquet`.
+
+Each row in the parquet file has a `well_id` as described in the metadata schema. The remaining 128 columns are the embedding for that respective well.
+
+
+## Changelog
+- January 2023: initial release
+
+## License
+
+This work is licensed under the Recursion Non-Commercial End User License Agreement.
+
+[github]: https://github.com/recursionpharma/rxrx-datasets/
+[rxrx]: https://rxrx.ai
+[rxrx3]: https://rxrx.ai/rxrx3
+[recursion]: https://recursion.com
+[download]: https://rxrx3.rxrx.ai/downloads
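
Review note: the rxrx3 README says its sample metadata rows are there for parser testing, so here is a minimal Python sketch of that check. It is not part of the diff and is not Recursion's tooling. The names `read_metadata`, `parse_image_name`, `well_id_matches`, and the regular expression are hypothetical, chosen for illustration; the rows are copied from the README's Metadata Example, and the assumed file-name pattern is inferred only from the single example `AA02_s1_w3.png`.

```python
import csv
import io
import re

# Two sample rows copied from the "Metadata Example" section of rxrx3/README.md.
# The compound row has a quoted SMILES field that itself contains commas.
SAMPLE = '''well_id,experiment_name,plate,address,gene,treatment,SMILES,concentration,perturbation_type,cell_type
gene-079_8_H29,gene-079,8,H29,RPLP2,RPLP2_guide_4,,,CRISPR,HUVEC
compound-001_19_D20,compound-001,19,D20,,Dequalinium,"CC1=[N+](CCCCCCCCCC[N+]2=C(C)C=C(N)C3=CC=CC=C23)C2=CC=CC=C2C(N)=C1 |c:1,13,21,29,31,35,t:16,19,23,27|",0.25,COMPOUND,HUVEC
'''

# Assumed pattern for image file names such as AA02_s1_w3.png:
# well address, then site number, then channel number.
IMAGE_NAME = re.compile(
    r"^(?P<address>[A-Z]{1,2}\d{2})_s(?P<site>\d+)_w(?P<channel>\d+)\.png$"
)


def read_metadata(text):
    """Parse metadata CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_image_name(name):
    """Split an image file name into well address, site, and channel."""
    match = IMAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized image file name: {name!r}")
    return {
        "address": match["address"],
        "site": int(match["site"]),
        "channel": int(match["channel"]),
    }


def well_id_matches(row):
    """Check that well_id is experiment_name, plate, and address joined by '_'."""
    expected = f'{row["experiment_name"]}_{row["plate"]}_{row["address"]}'
    return row["well_id"] == expected


if __name__ == "__main__":
    rows = read_metadata(SAMPLE)
    for row in rows:
        print(row["well_id"], row["perturbation_type"], well_id_matches(row))
    print(parse_image_name("AA02_s1_w3.png"))
```

A standard CSV reader is used instead of splitting on commas because the CXSMILES value in the compound row contains commas inside a quoted field; a naive split would shift every later column.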