Data integrity is the foundation of data engineering. As a data engineer in a human brain imaging laboratory with a mission to end Alzheimer disease, I design and build the data infrastructure and pipelines that transfer and transform large, complex datasets into an accessible format for leading researchers across the world. The database comprises over 50,000 raw imaging scans and thousands of processed datasets for each scan type, dating back decades. I work alongside cross-functional teams to enrich the data and make sure we take full advantage of what is collected from the human volunteers who make this research possible.
Most of the data collection, processing, and storage is performed on-premises with internal data centers, API platforms, and high-performance computing environments. Figure 1 below illustrates my general workflow as a data engineer in this lab!
Figure 1. My general workflow as a data engineer in the Alzheimer disease research laboratory at Washington University School of Medicine.
Figure 2. Video snapshot of the data transformation R script I designed for the formal biannual data distribution to researchers worldwide.
Varying protocols and techniques are common in decade-long research studies, and accurate data transformation for these studies is critical. New to the world of human brain imaging data, I dove into the complex database, asked questions, and gained not only a grasp of its infrastructure but a deep understanding of the data and its history.
With this knowledge, I designed an automated data transformation workflow using R, increasing workflow efficiency by 75% (Figure 2). This R script automatically cleans the messy data and joins a multitude of files into the format required for the data warehouse.
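To give a concrete feel for that kind of clean-and-join step, here is a minimal sketch in R. The folder, file, and column names below are illustrative assumptions, not the lab's actual schema or the script shown in Figure 2.

# Minimal sketch: read raw CSV exports, standardize labels, join a
# processed summary, and write a warehouse-ready file.
# (All paths, columns, and keys here are hypothetical.)
library(dplyr)
library(readr)
library(purrr)
library(stringr)

# Read every raw export in a folder into one table
raw_files <- list.files("raw_exports", pattern = "\\.csv$", full.names = TRUE)
combined <- raw_files %>%
  map(read_csv, show_col_types = FALSE) %>%
  bind_rows()

# Standardize messy labels and drop incomplete rows
cleaned <- combined %>%
  mutate(
    subject_label = str_trim(toupper(subject_label)),
    scan_date     = as.Date(scan_date)
  ) %>%
  filter(!is.na(subject_label), !is.na(scan_date))

# Join processed-scan summaries onto the cleaned visits by shared keys
processed <- read_csv("processed_summary.csv", show_col_types = FALSE)
warehouse_ready <- cleaned %>%
  left_join(processed, by = c("subject_label", "scan_date")) %>%
  arrange(subject_label, scan_date)

write_csv(warehouse_ready, "warehouse_upload.csv")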
Figure 3. Original database model.
The human brain imaging data is stored on an online archiving platform with the following data hierarchy: IRB project -> participant -> scan visit (Figure 3). An IRB project is defined by its unique Institutional Review Board (IRB) number. With such a large database, maintaining each of these projects individually was difficult. In addition, the same participant could be identified by a variety of different labels across these projects. This siloed data structure led to lost or misplaced participant data and made data sharing complex, especially when targeting specific participants' data.
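A toy example in R shows how that label mismatch bites; the project numbers and participant labels here are made up purely for illustration.

# Hypothetical example: the same volunteer stored under different labels
# in two separate IRB projects.
library(dplyr)
library(tibble)

project_a <- tibble(
  irb_project    = "IRB-0001",
  participant_id = c("SUBJ_014", "SUBJ_027"),
  scan_visit     = c("v01", "v01")
)

project_b <- tibble(
  irb_project    = "IRB-0002",
  participant_id = c("P-14", "P-33"),  # "P-14" is the same person as "SUBJ_014"
  scan_visit     = c("baseline", "baseline")
)

# A naive combine-and-count treats these as four different people,
# which is how participant data gets misplaced in a siloed structure.
bind_rows(project_a, project_b) %>%
  count(participant_id)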
When an opportunity arose to redesign the data infrastructure in our API platform, I put together a cross-functional team to tackle the challenge of designing and implementing an efficient database model that would improve workflows for our clinical coordinators, data processing team, and data scientists.
Figure 4 illustrates a simplified version of the new database model. We defined a centralized hub for all imaging data, keyed by a unified subject label, so any participant's data can be located easily. When subsets of data are required, mini hubs are created containing only the specified data.
Figure 4. Simplified illustration of the improved database model.
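The idea can be sketched in a few lines of R; the hub table, unified labels, and dates below are hypothetical stand-ins, not the real hub.

# Hypothetical centralized hub keyed by a single unified subject label
library(dplyr)
library(tibble)

central_hub <- tibble(
  unified_subject = c("SUB001", "SUB001", "SUB002", "SUB003"),
  irb_project     = c("IRB-0001", "IRB-0002", "IRB-0001", "IRB-0003"),
  scan_type       = c("MRI", "PET", "MRI", "PET"),
  scan_date       = as.Date(c("2015-03-02", "2018-06-14", "2016-01-20", "2019-09-05"))
)

# One lookup returns every scan for a participant, regardless of IRB project
filter(central_hub, unified_subject == "SUB001")

# A "mini hub" is simply a targeted subset carved out for a specific request
pet_mini_hub <- central_hub %>%
  filter(scan_type == "PET", scan_date >= as.Date("2018-01-01"))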
With a centralized hub, there is no longer a need to search the many different projects in the API platform for data transfer, maintenance, and processing. The removal of varying participant labels ensures no data is lost. And the creation of mini hubs for specific data makes sharing with researchers worldwide easier and faster.
© 2021 Aylin Dincer