Privacy

Listed below are tools developed by the iDASH privacy research team.

 

Click on the name in the left-hand column to download or access the tool.

Name Description

Count Perturbation Allowing User Preferences

 

This tool explores parameter settings of a count query perturbation mechanism that lets users express preferences for the scale and location of the perturbation. The mechanism extends existing study design count query tools by offering differential privacy while accepting user input. The tool is intended both as a reference implementation of the mechanism and as a way for system designers to explore the options the mechanism offers.

 

Platform: Web service

 

Tutorial: http://ptg.ucsd.edu/cq

 

Citation: Vinterbo SA, Sarwate AD, and Boxwala AA. “Protecting count queries in study design,” J Am Med Inform Assoc. 2012 Sep 1;19(5):750-7. Epub 2012 Apr 17.
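
For readers unfamiliar with count query perturbation, the following minimal sketch shows the standard Laplace mechanism for a differentially private count. It is not the mechanism implemented by the Count Perturbation tool, which additionally accounts for user preferences on scale and location; the function names and the rounding and clamping step are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two independent Exp(1) variables is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def perturbed_count(true_count, epsilon, rng=None):
    """Return a noisy, non-negative integer count (hypothetical helper)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    # Adding or removing one record changes a count by at most 1,
    # so the Laplace scale is 1 / epsilon.
    noisy = true_count + laplace_noise(1.0 / epsilon, rng)
    # Rounding and clamping only post-process the noisy value.
    return max(0, round(noisy))


if __name__ == "__main__":
    rng = random.Random(0)
    print([perturbed_count(120, epsilon=0.5, rng=rng) for _ in range(5)])
```

A smaller epsilon gives stronger privacy and noisier counts; a system designer would typically explore this trade-off before fixing a value.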

Differentially Private Data Queries

 

DPDQ is software that aims to be a lightweight, scalable, and easily deployable networked system for making information in datasets available in a secure and differentially private manner. It also serves as a platform for a growing number of differentially private methods.

 

Platform: Python

 

Datasets that can be used: Any dataset that can be stored in a relational database.

 

Tutorial: http://ptg.ucsd.edu/~staal/dpdq/dpdq.html

 

Citation (web): http://ptg.ucsd.edu/~staal/dpdq/ 

Differentially private genome data dissemination through top-down specialization

 

We developed a novel approach for disseminating genomic data while satisfying differential privacy. The proposed algorithm splits raw genome sequences into blocks, subdivides the blocks in a top-down fashion, and finally adds noise to counts to protect privacy.

Platform: Windows and Matlab

Datasets that can be used: Case-control datasets from http://www.humangenomeprivacy.org/2015/competition-tasks.html

Citation: Wang S, Mohammed N, Chen R. Differentially private genome data dissemination through top-down specialization. BMC Medical Informatics and Decision Making 2014;14 Suppl 1:S2. doi: 10.1186/1472-6947-14-S1-S2.
 

Differentially Private Logistic Regression

 

This is an R package that implements differentially private logistic regression using the differentially private empirical risk minimization technique of Chaudhuri et al. (Chaudhuri K, Monteleoni C, and Sarwate A. Differentially Private Empirical Risk Minimization. JMLR, 2011, 12, 1069-1109). It is designed for researchers who work with human subjects data and wish to publish their logistic regression parameters with associated privacy guarantees.

 

Platform: R

 

Datasets that can be used: Any dataset that can be imported into R and is suitable for logistic regression

 

Tutorial: http://cran.r-project.org/web/packages/PrivateLR/PrivateLR.pdf

Differentially Private Projected Histograms

(Download package from within R using instructions on page)

 

This is an R package that pre-processes data for any predictive modeling method so that the result of the modeling is differentially private. It is based on the publication listed below. Once this pre-processing has been done, any subsequent analysis is automatically differentially private. This allows the creation of, for example, multinomial logistic regression models, for which no specialized differentially private implementation is currently available. It is designed for researchers who work with human subjects data and wish to publish their logistic regression parameters with associated privacy guarantees.

 

Platform: R

 

Datasets that can be used: Any dataset that can be imported into R and is suitable for predictive modeling

 

Tutorial: http://ptg.ucsd.edu/~staal/R/pph.html

 

Citation: Vinterbo SA. “Differentially private projected histograms: Construction and use for prediction,” In Proceedings of ECML-PKDD 2012, September 24-28, Bristol UK. Volume 7524 of Lecture Notes in Artificial Intelligence series (LNAI), pages 19-34. Springer Verlag, 2012. ISBN 978-3-642-33485-6.

PRECISE

 


 

PRECISE is a privacy-preserving, cloud-assisted quality improvement service for healthcare based on homomorphic encryption and garbled circuits.

 

Platform: Windows/Linux

 

Datasets that can be used: Clinical data such as MIMIC II

 

Citation: “PRECISE: PRivacy-prEserving Cloud-assisted quality Improvement Service in hEalthcare,” presented at the Translational Bioinformatics Conference. 2014 Oct 24-27. Qingdao, China.

Privacy Preserving SVM (PPSVM)

 

 

Privacy-preserving Support Vector Machine (PPSVM) is an innovative tool for collaborative research across distributed data repositories to build a global Support Vector Machine without sharing data. The system aggregates local kernel matrices computed from vertically segmented data to build accurate models with better privacy protection.

 

Platform: Web service. 

 

Datasets that can be used: http://privacy.ucsd.edu:8080/ppsvm/testdata.html

 

Tutorial: http://privacy.ucsd.edu:8080/ppsvm/

 

Citation: Que J, Jiang X, and Ohno-Machado L. “A Collaborative Framework for Distributed Privacy-Preserving Support Vector Machine Learning,” AMIA Annual Symposium, 1350-1359, 2012.
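
The kernel aggregation idea behind PPSVM can be illustrated with a small sketch. It assumes a linear kernel and two hypothetical sites that hold different features for the same records; the toy data and function names are made up for illustration, and the actual PPSVM service may use other kernels and adds protections that are not shown here.

```python
def linear_kernel(rows):
    """Gram matrix of dot products between all pairs of rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def aggregate_kernels(local_kernels):
    """Element-wise sum of equally sized local kernel matrices."""
    n = len(local_kernels[0])
    return [
        [sum(k[i][j] for k in local_kernels) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: three records whose features are split across two sites.
site_a = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]  # features 1 and 2
site_b = [[0.5], [2.0], [1.0]]                  # feature 3

# Each site shares only its local kernel matrix, never its raw feature values.
global_kernel = aggregate_kernels([linear_kernel(site_a), linear_kernel(site_b)])

# For comparison only: the kernel of the pooled data, which is never assembled in practice.
pooled = [a + b for a, b in zip(site_a, site_b)]
pooled_kernel = linear_kernel(pooled)

if __name__ == "__main__":
    print(global_kernel == pooled_kernel)
```

Because a dot product over all features is the sum of the dot products over each site's features, the aggregated matrix matches the pooled one, and a kernel SVM can then be trained on it.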

Spectral Swapping

 

Spectral Swapping implements a data perturbation technique based on permuting the left-singular vectors of the data. Randomly permuting the columns (attribute values) of a data matrix is a long-standing method for making it difficult to link row records to other data, because it destroys multivariate patterns in the data table. Unfortunately, it also renders the data unsuitable for multivariate analysis. By performing the random permutation on the left-singular vectors instead, randomization is achieved while preserving correlations. This tool is based on Lasko et al. (Lasko TA, Vinterbo SA. Spectral Anonymization of Data. IEEE Transactions on Knowledge and Data Engineering, pp. 437-446, March 2010) and is intended for researchers who want to randomize data using techniques from this publication.

 

Platform: Linux / Macintosh / Windows. Python installation is required.

 

Datasets that can be used: Any data matrix that can be imported into Python

 

Tutorial: http://ptg.ucsd.edu/~staal/ss/
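
The idea behind Spectral Swapping can be sketched in a few lines of Python. This is a minimal illustration that assumes NumPy is installed; it is not the tool's own code, and the function name `spectral_swap` and the centering step are assumptions.

```python
import numpy as np


def spectral_swap(data, seed=None):
    """Permute each left-singular vector of the centered data independently."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    # Center the data so the singular vectors describe variation around the mean.
    u, s, vt = np.linalg.svd(data - mean, full_matrices=False)
    # Shuffle the entries of every column of u with its own random permutation.
    u_perm = np.column_stack(
        [rng.permutation(u[:, j]) for j in range(u.shape[1])]
    )
    # Rebuild the data matrix from the permuted vectors and restore the mean.
    return (u_perm * s) @ vt + mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.normal(size=(200, 3))
    original[:, 1] += 0.8 * original[:, 0]  # introduce a correlation
    swapped = spectral_swap(original, seed=1)
    print(np.corrcoef(original, rowvar=False).round(2))
    print(np.corrcoef(swapped, rowvar=False).round(2))
```

In this sketch the column means and the total variance are kept exactly, while pairwise correlations are kept only approximately, because the permuted vectors are no longer exactly orthogonal.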

WIDGET (Web Interface for Dynamic Genome-privacy EvaluaTion)

 

We built a companion web service using JavaScript and R Shiny Server technology to dynamically illustrate the performance of the teams participating in the human genome privacy challenge: https://humangenomeprivacy.ucsd-dbmi.org. Users can assess the teams' models at a finer granularity than what is reported in the challenge report.

 

Platform: Linux / Macintosh / Windows

 

Datasets that can be used: Personal Genome Project (PGP) Data

 

Tutorial: https://humangenomeprivacy.ucsd-dbmi.org/

WITNESS

 

Information accessibility plays an important role in biomedical research. Stakeholders such as patients, researchers, and data owners can all benefit from the use of healthcare data. Combining visualizations, secure communication, and an interactive user interface offered by HTML5 and RStudio Shiny Server, we developed a Web InTerface for iNtEractive Survival Studies (WITNESS), which enables practical web-based data analysis through encrypted communication channels without sharing patient-level data. In WITNESS, users can query a subset of the data to obtain the corresponding statistics (e.g., mean and standard deviation) and perform survival analysis. The system lets users conduct analyses without tedious data transmission and environment setup procedures. Using WITNESS on simulated data modeled after a large study on cardiovascular health, we demonstrated the feasibility of facilitating data analysis through an interactive interface without transmitting patient-level data to a central analysis server. A demo of WITNESS can be found at https://witness.ucsd-dbmi.org. A virtual machine (VM) for WITNESS is available at https://witness.ucsd-dbmi.org/download.html for users who wish to set up their own WITNESS web service.

 

Platform: Web service.

 

Datasets that can be used: A synthesized data set provided by the Center for American Indian Health Research, College of Public Health, University of Oklahoma Health Sciences Center

 

Tutorial:  https://witness.ucsd-dbmi.org/about.html

 

Citation: Wang, S.*, Pedreiro, M., Wei, W.*, Jiang, X., Wang, W., Zhang, Y., Lee, E., Ohno-Machado, L. “A Web InTerface for iNtEractive Survival Studies (WITNESS): Enabling Shared Access for Survival Data Analysis.” AMIA, 2014. Submitted.