A Standardized Mega-Biobank Phenomic Library From The Million Veteran Program


International Classification of Diseases (ICD) codes are often used as a proxy for clinical phenotypes. However, their intended purpose is billing, determining treatment, and statistics collection, so they may not always be the best representation of a phenotype. Initiatives such as PheKB support rule-based phenotype algorithms, as well as emerging statistical approaches to phenotyping that may combine ICD codes with other sources of information. These phenotype algorithms are not yet standardized, and they lack advanced database capabilities and built-in visualization tools.
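To make the idea of a rule-based phenotype algorithm concrete, here is a minimal sketch in Python. It is not taken from PheKB or from the MVP library. The rule (matching codes on at least two distinct dates), the threshold, and the patient data are all hypothetical; the only real-world fact assumed is that ICD-10-CM codes beginning with E11 denote type 2 diabetes.

```python
# Illustrative only: ICD-10-CM codes starting with "E11" denote type 2 diabetes.
T2D_PREFIX = "E11"
MIN_DATES = 2  # hypothetical threshold, not taken from any published algorithm


def has_phenotype(encounters, prefix=T2D_PREFIX, min_dates=MIN_DATES):
    """Return True if matching codes appear on at least `min_dates` distinct dates.

    `encounters` is a list of (date_string, icd_code) pairs.
    """
    dates = {date for date, code in encounters if code.startswith(prefix)}
    return len(dates) >= min_dates


# Made-up patients for demonstration.
patients = {
    "A": [("2019-01-03", "E11.9"), ("2019-06-10", "E11.65")],
    "B": [("2019-02-14", "E11.9")],
    "C": [("2019-03-01", "I10"), ("2019-03-01", "E11.9"), ("2019-03-01", "E11.9")],
}

cases = sorted(pid for pid, enc in patients.items() if has_phenotype(enc))
print(cases)  # ['A']
```

Patient C has two matching codes but only on a single date, so the rule excludes them. Two groups could reasonably disagree on details like this, which is why unstandardized definitions are hard to compare.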

Recognizing this gap, Kelly Cho, Edmon Begoli and colleagues, on behalf of the VA Million Veteran Program (MVP), developed an infrastructure supporting a phenomics library using data from the Veterans Health Administration (VHA) electronic health record (EHR). They published this work in the AMIA Joint Summits on Translational Science proceedings.

The MVP is a mega-biobank cohort of US Veterans, launched in 2011, that combines data from surveys, EHRs, genomics, and biospecimens. With over 800,000 participants as of Fall 2019, it is still the largest enrolled and ongoing mega-cohort biobank within the US.


The search, visualization, and sharing objectives were met mainly by building the infrastructure on CKAN, an open-source tool for making open data websites. CKAN supports the management and publication of data collections, and it is used by governments, research institutions, and other organizations that need to share large amounts of data.
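As a rough illustration of what searching a CKAN-backed catalog looks like, the sketch below builds a request URL for CKAN's Action API `package_search` endpoint and reads dataset titles from a response. The base URL is a placeholder, the sample response is invented, and nothing here reflects how the MVP library is actually deployed or whether it is publicly reachable.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://ckan.example.org"  # hypothetical placeholder, not the MVP site


def search_url(query, rows=10, base_url=BASE_URL):
    """Build a URL for CKAN's package_search action (free-text dataset search)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    return [dataset["title"] for dataset in payload["result"]["results"]]


url = search_url("heart failure", rows=5)

# Invented response, shaped like a CKAN Action API reply.
sample = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "heart-failure", "title": "Heart failure (example record)"}],
    },
})
titles = dataset_titles(sample)
print(url)
print(titles)
```

The example deliberately avoids making a network call; in practice the URL would be fetched with an HTTP client and the response body passed to `dataset_titles`.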

The database was built using Observational Medical Outcomes Partnership (OMOP) fields from the outset, supplemented by a limited number of locally defined fields. This should facilitate interoperability, although the mapping to OMOP may not always be straightforward.
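The split between OMOP fields and locally defined fields can be pictured with a small sketch. The article does not list which OMOP fields the authors used, so the target names below are borrowed from the OMOP Common Data Model's COHORT_DEFINITION table purely as an assumption. The source keys and the example record are hypothetical.

```python
# Hypothetical mapping from local metadata keys to OMOP-style column names.
# Target names follow the OMOP CDM COHORT_DEFINITION table; the choice of
# table is an assumption made for illustration, not the authors' schema.
FIELD_MAP = {
    "id": "cohort_definition_id",
    "name": "cohort_definition_name",
    "description": "cohort_definition_description",
    "algorithm": "cohort_definition_syntax",
}


def to_omop_style(record, field_map=FIELD_MAP):
    """Split a record into OMOP-mapped fields and leftover local fields."""
    mapped, local = {}, {}
    for key, value in record.items():
        if key in field_map:
            mapped[field_map[key]] = value
        else:
            local[key] = value  # no OMOP counterpart: keep as a local field
    return mapped, local


# Invented phenotype record.
record = {
    "id": 1,
    "name": "Example phenotype",
    "description": "Placeholder description",
    "algorithm": "at least 2 matching codes on distinct dates",
    "validation_status": "draft",
}

mapped, local = to_omop_style(record)
print(mapped)
print(local)  # {'validation_status': 'draft'}
```

Fields that end up in `local` are the ones that do not map cleanly, which is where interoperability work tends to concentrate.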

The authors note that standardized phenotype definitions may allow automated phenotype cataloging using article scraping techniques. Standardization also facilitates visualization. Because CKAN allows individual phenotype records to be cited, potential confusion over alternatively defined phenotypes can be eliminated.

“The linkage of large longitudinal VA EHR and other biomarker and omics data is one of the strengths of MVP mega-biobank. Such comprehensive data coverage and the scale of the large population in VA and MVP provide unprecedented opportunities for new discoveries in both biomedical research and infrastructure development for scalable solutions. Developing an optimal data management (cataloging, storing, searching, sharing, and archiving) structure of the EHR-based phenomic library is a critical factor in expediting research towards translational science,” concluded the authors.


  1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7233040/