**[Full text of the proposal](https://www.synapse.org/Portal/filehandle?ownerId=syn5659209&ownerType=ENTITY&xsrfToken=1EA1466FCA55F7EAE33833333900F1BC&fileName=Idea1.pdf&preview=false&wikiId=414654)**
The authors wish to thank the reviewers for the insightful comments.
### Anonymous Review 1 and Authors' Response
_**Impact:** The proposed model of differentially private regression is an area that is rapidly developing in machine learning, and the measured data has the potential to attract machine learning researchers to problems in oncology. The demonstration of a shared predictor without sharing data would have large impact in the way medical studies are done._
_**Feasibility:** I like the idea of obtaining more information from an already collected set of samples. This reduces the overall study risk, and the proposed measurement of new features (gene expression) and new labels (drug sensitivity) nicely benefits from already existing data. In this sense, the study seems very feasible._
_**Overall evaluation:** On 300 biobanked samples of patients with Acute Myeloid Leukemia (AML), measure:_
_- drug sensitivity to 525 small molecule inhibitors_
_- RNA sequencing to obtain gene expression._
_The proposed measurements complement other measurements on the same sample, e.g. exome sequencing, and clinical data. Public data is also available on AML on other samples._
_The goal is to demonstrate that a differentially private predictor can be used for drug sensitivity._
_A couple of issues:_
_- The proposal did not provide evidence that gene expression is predictive of drug sensitivity for AML in the non-private setting. I am unfamiliar with the literature, and was wondering whether this is the task with the best signal to noise ratio. Since privacy preserving computation may potentially involve a loss in predictive performance, choosing the task carefully seems prudent._
**Response:** This is a very good observation. Some risk-taking is necessary, since the data cannot be tested before they are collected. We strongly recommend taking the risk because of the high expected impact upon success. A feasible contingency plan is to run the challenge with existing public cell-line data, where the possibility of success has already been shown (Honkela et al., 2016); the task would then be to improve the predictions as much as possible.
Some background for choosing AML as the case study: While past efforts in predicting treatment response and outcome for AML patients have primarily focused on cytogenetics, results from recent large-scale genomic studies have shown AML to be a complex disease with several hundred genes potentially impacted by mutation. The use of genomic profiles to predict drug sensitivity may therefore be hampered by the diverse mutational spectrum of AML. However, the mutated genes may have redundant roles by affecting the same signalling pathways, which may be more easily identified by common gene expression patterns. Thus, AML patients with different mutational backgrounds may share common gene expression profiles, reducing the overall heterogeneity between patients and potentially enabling better prediction of drug response and outcome. Gene expression signatures have only recently been applied to AML patient prognostication (Ng et al., 2016; Li et al., 2013). However, there are few studies using gene expression to predict AML drug sensitivity, possibly due to a lack of functional drug sensitivity data matched with gene expression profiles. New data sets, however, are becoming available, which show the utility of gene expression profiles in predicting drug response in AML (Kontro et al., 2016).
_- It is unclear how the project aims to conduct challenges while maintaining privacy of the data. Such an architecture for secure multi-party computation is not widely available, and potential participants would not be able to attempt the challenge without the significant undertaking of constructing the infrastructure._
**Response:** Again, a good point. There was not enough space to explain all the details in the proposal, and some details will need more planning. Here is a brief outline:
- The challenge will be run over a few iterations so that participants can learn from one another as they develop their solutions.
- The participants will be given direct access only to mock data (simulated and public data) to test their systems.
- Access to the real data will be provided only through provably differentially private interfaces; we will provide some standard interfaces but the participants may also submit new ones which will be peer-reviewed before allowing their use.
- The final predictions will be evaluated under suitable metrics (using ones from previous DREAM challenges) under varying levels of privacy. Only the final assessment results will be released to prevent leaking private data. (Controlling the amount of leaked information from repeated releases from multiple mechanisms under different levels of privacy would be essentially impossible.)
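
To make the idea of a differentially private interface more concrete, here is a minimal sketch of what one such interface could look like. It is an illustration only: the proposal does not specify the mechanisms to be used, and the class name `PrivateMeanInterface`, its parameters, and the choice of the Laplace mechanism with a simple additive privacy budget are all assumptions made for this example. It is not a hardened implementation (for instance, it ignores floating-point issues in noise sampling).

```python
import random


class PrivateMeanInterface:
    """Hypothetical interface that releases noisy means via the Laplace mechanism.

    Assumptions (not from the proposal): values are clipped to [lower, upper],
    the number of records is public, and the privacy budget is tracked with
    basic sequential composition (epsilons of successive releases add up).
    """

    def __init__(self, data, lower, upper, total_epsilon, seed=None):
        if upper <= lower:
            raise ValueError("upper must be greater than lower")
        # Clipping bounds the influence of any single record on the mean.
        self._data = [min(max(float(x), lower), upper) for x in data]
        self._lower = lower
        self._upper = upper
        self._remaining = float(total_epsilon)
        self._rng = random.Random(seed)

    @property
    def remaining_epsilon(self):
        """Privacy budget still available for further releases."""
        return self._remaining

    def _laplace(self, scale):
        # The difference of two i.i.d. exponential variables with rate
        # 1/scale follows a Laplace distribution with that scale.
        rate = 1.0 / scale
        return self._rng.expovariate(rate) - self._rng.expovariate(rate)

    def mean(self, epsilon):
        """Release an epsilon-differentially private estimate of the mean."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise RuntimeError("privacy budget exhausted")
        self._remaining -= epsilon
        n = len(self._data)
        # Replacing one record changes the clipped mean by at most this much.
        sensitivity = (self._upper - self._lower) / n
        true_mean = sum(self._data) / n
        return true_mean + self._laplace(sensitivity / epsilon)


if __name__ == "__main__":
    # Mock data only; the values below are made up for the example.
    interface = PrivateMeanInterface(
        [0.2, 0.9, 0.4, 0.7], lower=0.0, upper=1.0, total_epsilon=1.0, seed=1
    )
    print(interface.mean(0.5))
    print(interface.remaining_epsilon)
```

Each call spends part of a fixed budget, and once the budget is used up the interface refuses further queries. Even in this single-mechanism case the privacy cost accumulates with every release, which gives some intuition for why only the final assessment results would be published.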