SHARE: Statistical Health Information Release with Differential Privacy

Recent studies and advisory reports to the government have pointed out that information sharing with appropriate privacy protection to enable health research is one of the most critical challenges of our time. Current de-identification approaches or microdata (i.e., original records) release are subject to various re-identification and disclosure risks and do not provide sufficient protection for our patients. A complementary approach is to release statistical macrodata (i.e., derived statistics), which can also be used to construct synthetic data that mimic the original data. Differential privacy has emerged in recent years as one of the strongest provable privacy guarantees for statistical data release. However, it remains a challenge to efficiently and effectively release a dataset (consistent with a set of statistics) that ensures differential privacy while guaranteeing data utility for targeted applications. Applying differential privacy to health data presents additional new challenges due to the high dimensionality, high correlation, and cross-institution distribution in health datasets that support cross-sectional, longitudinal, and cross-institutional studies. The absence of practical data sharing software with rigorous privacy guarantees has made data providers uncomfortable with sharing data. Lack of datasets has severely hindered medical and informatics research in general. We propose to collaborate with the NCBC iDASH (integrating Data for Analysis, Anonymization, and Sharing) center to develop a Statistical Health informAtion RElease (SHARE) toolkit with differential privacy.
The specific aims are: 1) develop and evaluate novel methods for releasing statistical health data with differential privacy to address the high dimensionality, self-correlation, and cross-institution distribution of data, 2) use the SHARE toolkit for clinical dataset construction and use case studies with the Emory Analytical Information Warehouse (AIW) and the UCSD Clinical Data Warehouse for Research (CDWR), and demonstrate its utility for cohort discovery queries and a hospital readmission study, and 3) deploy the SHARE toolkit at Emory and iDASH as well as at the Atlanta Clinical & Translational Science Institute (ACTSI) and the UCSD Clinical and Translational Research Institute (CTRI). The techniques and software tools envisioned by SHARE will facilitate health information sharing for health research and have a direct impact on predictive health and translational medicine as well as informatics practice.
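
As a rough illustration of what a differentially private statistical release can look like, the sketch below adds Laplace noise to the counts of a one-dimensional histogram. It is a textbook Laplace mechanism, not the SHARE toolkit or any of its methods. The function names (`laplace_noise`, `dp_histogram`), the example diagnosis codes, and the choice of epsilon are all hypothetical. The sketch assumes that neighboring datasets differ by adding or removing one record, so each count has sensitivity 1, and that the list of bins is fixed in advance rather than derived from the data.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with
    rate 1/scale follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(values, bins, epsilon, rng=None):
    """Return noisy counts for each bin under epsilon-differential privacy.

    Assumptions (illustrative only): one record falls in at most one
    bin, so adding or removing a record changes a single count by 1
    (L1 sensitivity 1), and `bins` is public and chosen independently
    of the data.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    counts = Counter(values)
    scale = 1.0 / epsilon  # sensitivity / epsilon
    return {b: counts.get(b, 0) + laplace_noise(scale, rng) for b in bins}


if __name__ == "__main__":
    # Hypothetical diagnosis codes, one per record.
    records = ["I10", "E11", "I10", "J45", "I10", "E11"]
    public_bins = ["E11", "I10", "J45", "N18"]
    print(dp_histogram(records, public_bins, epsilon=0.5))
```

A smaller epsilon gives stronger privacy but noisier counts. This simple mechanism does not address the high dimensionality, correlation, or cross-institution distribution described above, which is where the difficulty for health data lies.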

SHARE: Statistical Health Information Release with Differential Privacy is funded by the National Institute of General Medical Sciences, part of the National Institutes of Health (R01GM114612).


Principal Investigator

Li Xiong

Li Xiong is an Associate Professor in the Department of Mathematics and Computer Science and the Department of Biomedical Informatics at Emory University, where she directs the Assured Information Management and Sharing (AIMS) research group. She holds a PhD from Georgia Institute of Technology, an MS from Johns Hopkins University, and a BS from University of Science and Technology of China, all in Computer Science. She also worked as a software engineer in the IT industry for several years prior to pursuing her doctorate. Her areas of research are data privacy and security, distributed and spatio-temporal data management, and health informatics. She has published about 80 papers in peer-reviewed journals and conferences, with two best paper awards. She is a recent recipient of the Career Enhancement Fellowship from the Woodrow Wilson Foundation. Her research has been supported by the National Science Foundation (NSF), the Air Force Office of Scientific Research (AFOSR), the Patient-Centered Outcomes Research Institute (PCORI), the National Institutes of Health (NIH), and industry awards including the Cisco Research Award and the IBM Faculty Innovation Award.