Data Collections

Notice: Funding for the iDASH program has ended, so we must sunset the data within iDASH. Data in all iDASH communities will no longer be accessible after 07/30/2017. If you have any questions, please don't hesitate to contact us. Data owners who need help finding a new repository for their data, or who are interested in a fee-based hosting service, can also contact us.

 

Non-PHI/PII Data Collections

The table below lists short descriptions of the data sets available through the iDASH repository.

Notes:

  • Some data collections require a MIDAS account and acceptance of a community agreement—indicated as “Requires login and community agreement.”
  • Some data collections require a MIDAS account and the community owner to grant access—indicated as “PRIVATE.”

 

Community Name

Type/Category

Data Set Description

AbsCNseq R package

Data Description:

This dataset consists of de-identified samples of sequencing data. absCNseq is a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next generation sequencing data. The algorithm is also available for download.
Persistent URL: http://dx.doi.org/10.15147/J2WC7H

 

File Structure: 

A single file absCNseq.RData contains the entire dataset.

 

Data Format:

The data is formatted for analysis with the absCNseq R package.

BREAST - RIDER Breast MRI


Requires login and community agreement

DICOM images

Data Description:

The RIDER Breast MRI community is composed of de-identified phantom MRI images with repeat dynamic contrast-enhanced MRI from five patients being treated at M.D. Anderson Cancer Center. Data were downloaded from The Cancer Imaging Archive (TCIA).
Persistent URL: http://dx.doi.org/10.15147/J2RP47

 

File Structure:

The DICOM files are separated into folders by study subject and then by exam. An additional MS Excel file with relevant ontology-based radiology findings data is also provided in a separate "Radiology_Findings" folder.

 

Data Format:

The data is formatted into DICOM images. A DICOM viewer is required to access the files. Image annotations can be viewed using the open-source DICOM viewer OsiriX Imaging Software.

A web-based DICOM metadata extraction tool is available by selecting the DICOM file and clicking "View" in the left side menu bar.
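For readers working with a downloaded copy, the folder layout described above (study subject, then exam) can be indexed with a short script. The following is a minimal sketch, not part of the iDASH tooling: the function names are hypothetical, and it assumes the files are standard DICOM Part 10 files, which carry a 128-byte preamble followed by the four bytes "DICM".

```python
from collections import defaultdict
from pathlib import Path


def is_dicom(path: Path) -> bool:
    """Return True if the file carries the DICOM Part 10 signature.

    Part 10 files start with a 128-byte preamble followed by b"DICM".
    """
    with path.open("rb") as handle:
        header = handle.read(132)
    return len(header) == 132 and header[128:132] == b"DICM"


def group_by_subject_and_exam(root: Path) -> dict:
    """Map (subject, exam) folder names to the DICOM files beneath them.

    Assumes the layout root/<subject>/<exam>/.../<file>. Files that sit
    higher in the tree (e.g. the findings spreadsheet) are skipped.
    """
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        # Need at least subject/exam/file, and the file must look like DICOM.
        if len(parts) < 3 or not is_dicom(path):
            continue
        groups[(parts[0], parts[1])].append(path)
    return dict(groups)
```

Checking the signature rather than the file extension is a deliberate choice here, since DICOM files are often stored without a ".dcm" suffix.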

BREAST-MRI

Requires login and community agreement

DICOM images

Data Description:

The dataset is composed of 148 breast MRI studies derived from 88 patients including cases considered high-risk, ductal carcinoma in-situ, fibroids and lobular carcinomas.
Persistent URL: http://dx.doi.org/10.15147/J2MW29

 

File Structure:

The compressed files are separated by study subject. An additional MS Excel file with relevant ontology-based radiology findings data is also provided in a separate "Radiology_Findings" folder.

 

Data Format:

The compressed files include data formatted into DICOM images. A DICOM viewer is required to open the files. Image annotations can be viewed using the open-source DICOM viewer OsiriX Imaging Software.

Clinical Data Requests for Research Text

Data Description:

Analysis of 62 data request logs created by researchers at the University of California San Diego between September 2011 and July 2013.
Persistent URL: http://dx.doi.org/10.15147/J2Z59T

 

File Structure:

The data request logs are separated by project, each in its own text file.

 

Data Format:

The data is formatted into MS Word for Windows text format.

Clinical Notes and Reports Text

Data Description:

This dataset consists of medical transcription samples including clinical notes and radiology reports. Examples of the transcriptions include admission and discharge notes, surgical transcriptions, outpatient clinical encounter, emergency visit notes, echocardiogram, CT scan, MRI, nuclear medicine, radiographs, ultrasound and radiological procedures reports. Data were obtained from MedicalTranscriptionSamples.com.
Persistent URL: http://dx.doi.org/10.15147/J2H59S

 

File Structure:

The files are separated by type of transcription.

 

Data Format:

The data is formatted into text files.

CT Colonography

DICOM images

Data Description:

The CT Colonoscopy project is clinically validating widespread use of computerized tomographic colonography (CTC) in screening a population for the detection of colorectal neoplasia. The study addresses aspects of central importance to the clinical application of CTC in several inter-related yet independent parts that will be conducted in parallel. 

  • In Part I, the clinical performance of the CTC examination will be prospectively compared in a blinded fashion to colonoscopy.
  • In Part II, optimization of the CT technique will be performed in view of new technological advances in CT technology.
  • In Part III, lesion detection will be optimized by studying the morphologic features of critical lesion types and in the development of a database for computer-assisted diagnosis.
  • In Part IV, patient preferences and cost-effectiveness implications of observed performance outcomes will be evaluated using a predictive model.

Persistent URL: http://dx.doi.org/10.15147/J2CC7V

 

File Structure:

The compressed files are separated by study subject. An additional MS Excel file with relevant ontology-based radiology findings data is also provided in a separate "Radiology_Findings" folder.

 

Data Format:

The compressed files include data formatted into DICOM images. A DICOM viewer is required to open the files. Image annotations can be viewed using the open-source DICOM viewer OsiriX Imaging Software.

DMITRI Study Data Set


PRIVATE

Heterogeneous patient-centric data

Data Description:

DMITRI1 is a collection of multimodal data for 17 Type 1 Diabetes Mellitus (T1DM) patients. Data sets include ~72 hours of continuous monitoring by sensors and medical devices (continuous glucose monitor or CGM, insulin pump, heart rate monitor, activity monitor, sensors for temperature and galvanic skin response, sleep monitor), as well as medical histories (including demographics, comorbidities, etc), dozens of clinical lab results (including HbA1c, c-peptide, etc), and scores for several questionnaires (Diabetes Numeracy Test, Diabetes Distress Scale, etc). Subjects in this pilot study were involved in a variety of real-world physical exercise and classroom activities, with annotated schedules for the subjects. Additional data including photographic nutrition diaries and SNP genotyping will also be made available in due time.
Persistent URL: http://dx.doi.org/10.15147/J27P4K

 

File Structure:

In the DMITRI1 community on iDASH, access the folder "DMITRI Study Data Set." Therein, find several files with data from all subjects, as well as two folders with data for specific subjects (DMITRI_101-118 and DMITRI 108). Regarding files with all subjects’ data, time-series data from most devices from all subjects are available in "DMITRI_Device_TimeSeries5min_20130429.xls", integrated and time-synchronized. Medical history, clinical lab, and questionnaire score data are found in the file "DMITRI_Clinical_Master_20130429.xls". Please refer to the respective data dictionaries (DMITRIPilotStudy_DataDictionary_Device_20130429.xls and DMITRIPilotStudy_DataDictionary_REDCap.csv) for more information. Within DMITRI_101-118, separate folders contain subject-specific data, with more information available in the READme txt file within that folder.

 

Data Formats:

Raw data from the devices underwent extensive preprocessing including retrieval and export from proprietary software programs, calculation of representative values for 5-minute intervals, and time-synchronization. This results in a single large table with all time-series device data, "DMITRI_Device_TimeSeries5min_20130429.xls". Medical history, clinical lab results, and questionnaire data were collected using REDCap, and exported into one table, "DMITRI_Clinical_Master_20130429.xls". Study data in various degrees of "rawness" are found in the subject-specific folders in DMITRI_101-118 – see the READme file therein for more details.
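The two preprocessing steps named above, reducing each device's readings to one value per 5-minute interval and aligning devices on a shared time axis, can be illustrated as follows. This is only a sketch under stated assumptions: the study does not say which representative value was used, so the mean is an assumption, and the function names and device labels are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def bin_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute interval."""
    return ts - timedelta(minutes=ts.minute % 5, seconds=ts.second,
                          microseconds=ts.microsecond)


def to_five_minute_series(readings):
    """Collapse (timestamp, value) pairs to one value per 5-minute bin.

    The mean is an assumed choice of representative value.
    """
    bins = defaultdict(list)
    for ts, value in readings:
        bins[bin_start(ts)].append(value)
    return {start: mean(values) for start, values in sorted(bins.items())}


def synchronize(series_by_device):
    """Join per-device binned series into rows keyed by bin start.

    A device with no reading in a bin gets None in that row.
    """
    binned = {name: to_five_minute_series(readings)
              for name, readings in series_by_device.items()}
    all_bins = sorted({start for series in binned.values() for start in series})
    return [
        {"time": start,
         **{name: series.get(start) for name, series in binned.items()}}
        for start in all_bins
    ]
```

Each returned row corresponds to one 5-minute interval, which mirrors the shape of a single wide table holding all devices' time series.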

iDASH Webinars PDF and MP4

Data Description:

iDASH External and Journal Club webinars are available for download. iDASH Internal webinars require a login to view.
Persistent URL: http://dx.doi.org/10.15147/J2BC7J

 

File Structure:

Webinars are organized in folders by date.

 

Data Format:

The webinar recordings are in MP4 format and the slides are in PDF.

Informed Consent Templates PDF

Data Description:

This dataset consists of informed consent templates used in the iCONS (informed CONsent for clinical record and Sample use in research) project. 
 

 

File Structure:

The templates are separated into different text files.

 

Data Format:

The data is formatted into MS Word for Windows and PDF text formats.

Kawasaki Disease Biomarker XLS

Data Description: 

The dataset consists of 88 protein analytes in inflammatory and cardiovascular pathways measured by Luminex antibody-coated bead in two cohorts of sex- and age-matched acute Kawasaki disease and febrile control subjects, consisting of 28 (Cohort 1) and 44 (Cohort 2) subjects, respectively. Demographic and baseline laboratory data for these subjects is also included. This dataset can be downloaded by making a request to the author (atremoulet@ucsd.edu).

 

File Structure: 

A single XLS file contains the entire dataset.

 

Data Format: 

The data is formatted as an MS Excel file.

KD-NLP Python

Data Description:

Kawasaki Disease - Natural Language Processing (KD-NLP) was developed at the Department of Biomedical Informatics, UC San Diego.

The objective of the project is to create an effective NLP tool for early and rapid detection of subjects with a high suspicion for KD from text in clinical notes in the electronic medical record. The KD-NLP screening tool is intended to identify patients with at least three clinical signs of KD in whom the diagnosis of KD should be considered.

 

File Structure:

The source code files are separated in the folder ./KD-NLP.

The KD annotation guidelines are under the folder ./AnnotationGuidelines.

Tools used in this project can be found under the folder ./tools.

 

Data Format:

Most of the source code was written in Python. Please read the README for installation instructions. The LICENSE file contains copyright information for this software.

Laboratory Data XML

Data Description:

The dataset was generated by incorporating data from multiple institutions within the United States and Brazil. It is composed of nearly 1,000,000 laboratory test results derived from about 13,000 patients, including physiologic, serologic and biochemical assays.
Persistent URL: http://dx.doi.org/10.15147/J23W2N

 

File Structure:

A single XML file contains the entire dataset.


Data Format:

The data is formatted as an XML file.

Lung Image Database Consortium (LIDC)


Requires login and community agreement

Images

Data Description:

The LIDC image collection consists of diagnostic lung cancer screening chest CT scans with marked-up annotated lesions. It is a resource for development, training, and evaluation of computer-assisted diagnostic (CAD) methods and tools for lung cancer detection.
Persistent URL: http://dx.doi.org/10.15147/J20594

 

File Structure:

The compressed files are separated by study subject. An additional MS Excel file with relevant ontology-based radiology findings data is also provided in a separate "Radiology_Findings" folder.

 

Data Format:

The compressed files include data formatted into DICOM images. A DICOM viewer is required to open the files. Image annotations can be viewed using the open-source DICOM viewer OsiriX Imaging Software.

Observational Cohort Event Analysis and Notification System (OCEANS) CSV

Data Description:

This dataset consists of medical data for 20,000 subjects, including creatinine values, diabetes status, coronary artery disease status and demographic information. This dataset was generated to evaluate OCEANS. The aim is to develop novel statistical methods to detect adverse event signals in observational cohort data from electronic health records, clinical registries, and administrative databases. This toolbox was developed to enable rapid integration and dissemination of statistical modules to support retrospective and prospective static or automated real-time medical product surveillance.
Persistent URL: http://dx.doi.org/10.15147/J2VC76

 

File Structure:

A single CSV file contains the entire dataset.

 

Data Format:

The data is formatted into CSV.

 

Pain Prediction Data CSV and text

Data Description:

This dataset consists of de-identified, unsynchronized time series observations and sensor readings, including blood pressure measurements, heart rate and laboratory test results, derived from 53 patients followed at UCSD Medical Center Intensive Care Units. All data was collected from the UCSD electronic health record. The aim of this project is to develop a novel approach for assessing pain in patients using a principal component analysis (PCA)-based local detector. Our algorithm produces a single index to indicate the increase in pain level based on unsynchronized, sparse and noisy time series data collected from electronic flowsheets. The algorithm is available for download in a separate folder “Public/Tool”.
Persistent URL: http://dx.doi.org/10.15147/J2QP4X

 

File Structure:

The files are organized by study subject. There is a medication, attribute and pain data file for each subject.

 

Data Format:

The data is formatted into CSV and text files. 

Physical activity sensor data

GPS, accelerometer data

Data Description:

This dataset contains motion sensor data of 16 physical activities (walking, jogging, stair climbing, etc.) collected on 16 adults using an iPod touch device (Apple Inc.). The data sampling rate was 30 Hz. The collection time for an activity varied from 20 seconds to 17 minutes. The total size of the dataset is 31.5 MB. 
Persistent URL: http://dx.doi.org/10.15147/J2KW20

 

File Structure:

The root directory "iDash_activity_dataset" contains 16 subdirectories, one for each of the 16 participants. Each subdirectory contains 16 CSV-formatted data files, one for each of the 16 activities. The file name (e.g., 400m_brisk_walk_pocket.csv) indicates the activity type and the position of the device during data collection. There are two possible positions: "pocket" (front pants pocket) and "arm" (armband).
Some participants were unable to finish all activities, so a file might be empty. This is indicated by a "(blank)" suffix in the file name, e.g., Sit_and_walk_test_pocket_(blank).csv.   
Non-empty files may contain signal noise, particularly in the beginning and the end of the data files. This is because at the beginning of data collection, the researchers had to manually start data recording and place the device in the correct position; and at the end of an activity the researchers had to manually collect the device and stop data recording. Therefore, trimming may be needed for accurate classification.  
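The naming convention and the trimming advice above can be sketched in a few lines. The helper names below are hypothetical, and the 5-second trim is an arbitrary placeholder rather than a value recommended by the dataset authors. Only the 30 Hz sampling rate, the two position labels, and the "(blank)" suffix come from the description.

```python
from pathlib import Path

POSITIONS = ("pocket", "arm")
BLANK_SUFFIX = "_(blank)"
SAMPLE_RATE_HZ = 30  # sampling rate stated in the data description


def parse_activity_filename(name: str):
    """Split a data file name into (activity, position, is_blank)."""
    stem = Path(name).stem
    is_blank = stem.endswith(BLANK_SUFFIX)
    if is_blank:
        stem = stem[: -len(BLANK_SUFFIX)]
    # The position is the last underscore-separated token of the stem.
    activity, _, position = stem.rpartition("_")
    if position not in POSITIONS or not activity:
        raise ValueError(f"unexpected file name: {name!r}")
    return activity, position, is_blank


def trim_edges(rows, seconds: float = 5.0):
    """Drop `seconds` worth of samples from each end of a recording.

    The default is a placeholder; recordings too short to trim
    come back empty.
    """
    n = int(seconds * SAMPLE_RATE_HZ)
    return rows[n: len(rows) - n] if len(rows) > 2 * n else []
```

A suitable trim length likely depends on the activity, given that recordings range from 20 seconds to 17 minutes, so inspecting the signal before choosing a value is advisable.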

 

Data Format:

Each data file follows the same format. A row is a sample vector of 12 numbers taken at the same time instant. The 12 numbers are: the 3-axis linear acceleration, 3-axis gravity, 3-axis rotation rate vector, roll, pitch, and yaw.
  
More precisely, the data were captured using the iOS API as follows. Denote the 12-dimensional sample as S=[AccelerationX, AccelerationY, AccelerationZ, GravityX, GravityY, GravityZ, RotationRateX, RotationRateY, RotationRateZ, Roll, Pitch, Yaw]. AccelerationX, AccelerationY, and AccelerationZ were captured using the iOS library API property CMDeviceMotion.userAcceleration (.x, .y, and .z, respectively); GravityX, GravityY, and GravityZ were taken from CMDeviceMotion.gravity (.x, .y, and .z, respectively); RotationRateX, RotationRateY, and RotationRateZ were captured using the API property CMDeviceMotion.rotationRate (.x, .y, and .z, respectively); Roll, Pitch, and Yaw were read from CMDeviceMotion.attitude (.roll, .pitch, and .yaw, respectively). For more information on these sensor values, please consult the iOS API documentation.
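As an illustration of the 12-value row format, the sketch below reads rows into named fields in the documented order. It assumes comma-separated numeric values. Whether the files carry a header row is not stated, so a non-numeric first row is skipped as a precaution. The class and function names are hypothetical.

```python
import csv
from typing import NamedTuple


class MotionSample(NamedTuple):
    """One row of a data file, in the documented column order."""
    acceleration_x: float
    acceleration_y: float
    acceleration_z: float
    gravity_x: float
    gravity_y: float
    gravity_z: float
    rotation_rate_x: float
    rotation_rate_y: float
    rotation_rate_z: float
    roll: float
    pitch: float
    yaw: float


def read_samples(lines):
    """Yield one MotionSample per CSV row from an iterable of text lines."""
    for index, row in enumerate(csv.reader(lines)):
        if not row:
            continue  # skip empty lines
        if len(row) != len(MotionSample._fields):
            raise ValueError(f"row {index}: expected 12 values, got {len(row)}")
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            if index == 0:
                continue  # assumed header row
            raise
        yield MotionSample(*values)
```

Passing an open file object, for example `read_samples(open(path, newline=""))`, yields samples whose fields can be accessed by name, such as `sample.gravity_z`.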

Prostate Diagnostic Imaging

DICOM Images

Data Description:

This community consists of de-identified prostate MRI images acquired on a 1.5 T device with combined surface and endorectal coil, including dynamic contrast-enhanced acquisitions obtained prior to, during and after I.V. administration of Gadolinium-DTPA (pentetic acid).
Persistent URL: http://dx.doi.org/10.15147/J23019

 

File Structure:

The compressed files are separated by study subject. An additional MS Excel file with relevant radiology and pathology findings data is also provided in a separate "Radiology_Findings" folder.

 

Data Format:

The compressed files include data formatted into DICOM images. A DICOM viewer is required to open the files. 

Radiology Teaching Files

JPEG Images

Data Description:

This dataset consists of several radiology teaching cases including abdominal, chest and cardiac studies.
Persistent URL: http://dx.doi.org/10.15147/J2G59G

 

File Structure:

The JPEG files are separated into folders by case. An additional MS Excel file with the diagnosis data is also provided in a separate "Radiology_Findings" folder.

 

Data Format:

The data is formatted into JPEG images. 

 

RIDER Lung CT


Requires login and community agreement

CT images

Data Description:

RIDER Lung CT consists of 46 de-identified lung CT studies comprising 15,716 images derived from 32 patients being treated at the Memorial Sloan-Kettering Cancer Center.
Persistent URL: http://dx.doi.org/10.15147/J26P48

 

File Structure:

The compressed files are separated by study subject. An additional MS Excel file with relevant ontology-based radiology findings data is also provided in a separate "Radiology_Findings" folder.

 

Data Format:

The compressed files include data formatted into DICOM images. A DICOM viewer is required to open the files. Image annotations can be viewed using the open-source DICOM viewer OsiriX Imaging Software.

Trends in BMI publication CSV

Data Description: 

To understand information dissemination and citation trends in biomedical informatics, we analyzed the evolution of the number of articles per major biomedical informatics topic, download/online view frequencies, and citation patterns (using Web of Science) for articles published from 2009 to 2012 in JAMIA. Manual assignment of topics by two experts and one arbitrator is also included in this data set. For more details, please refer to our recent JAMIA publication [http://jamia.bmj.com/content/20/e2/e198.long].

 

File Structure:

A single MS Excel XLSX file contains the entire dataset.

 

Data Format:

The data is formatted into XLSX format.