Integrating genomic data information in EHR, a structured approach

Cecilia Mascia, Paolo Uva, Gianluigi Zanetti
14th Annual Meeting of the Bioinformatics Italian Society - 2017
* Motivations *
Next generation sequencing (NGS) techniques represent a great opportunity to enhance patient care and treatment: detecting an individual's genomic variations makes it possible to personalize diagnosis and therapy. Moreover, Whole-Exome and Whole-Genome Sequencing (WES and WGS) have become increasingly common in recent years, thanks mainly to progress in high-throughput technologies that has occurred in parallel with a reduction in the cost of DNA sequencing. Genomic data are therefore going to play an increasingly important role in healthcare in the near future.

Electronic Health Records (EHRs) are meant to be the central store that holds and provides all the information related to each healthcare event that occurs during an individual's life. Genomic information should be integrated into EHRs too but, as a matter of fact, modern EHR systems are not capable of handling such data, for two main reasons. The first concerns the data themselves, which can be very large (up to hundreds of gigabytes) and have an irregular degree of nesting (like laboratory test outputs). The second concerns how the data are generated and manipulated: the analysis process includes several steps that depend on a multitude of software parameters and reference databases, all of which must be tracked.

This work addresses the issue of integrating structured genomic data into the EHR by presenting a new model based on the OpenEHR archetype approach. In particular, the model should: (a) allow efficient re-use of the genomic data, saved in a structured and computer-parsable manner; (b) be capable of keeping track of the precise steps performed and of the external references used during the analysis.

* Methods *
The first phase of the work was a preliminary analysis of the context, followed by a review of the currently available options.
Typically, NGS processes follow a common path in which the input is a patient’s sample for DNA extraction and the output is a list of differences observed between the individual's sequence and a reference sequence. More than three million variants can be detected in a single human genome, so a subsequent phase of computer-aided annotation is needed to filter and prioritize the potentially disease-causing variations. This is the point at which the model should be applied to structure the data: just before reaching the clinical side, the information should be reorganized and recorded in the patient’s EHR in a way that preserves the history of the data and of the protocol. At present, some relevant activities on the integration of genomic data and EHRs are ongoing, but the requirement to keep track of data provenance remains a critical point.

The chosen OpenEHR approach to building the data model is based on multi-level modeling, known as the “archetype methodology”. The modeling process follows these steps, possibly recursively: (a) defining the data to be represented; (b) browsing the available model repositories for re-usable archetypes and, possibly, mapping existing archetype nodes to clinical concept attributes; (c) specializing existing archetypes or, if necessary, creating new ones with one of the available modeling tools. Here, the preferred tool for archetype creation was LinkEHR Studio, from the Universitat Politècnica de València and VeraTech for Health.

* Results *
The context analysis identified the Variant Call Format (VCF), which contains information about the variants detected at specific positions in the genome, as the starting point for the modeling phase. First, the minimal set of attributes included in the model matches the mandatory fields of the VCF specification.
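As an illustration of the minimal attribute set, the sketch below reads the eight fixed columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) into a small record. It is written in Python for convenience and is not part of the model described here: the names `VariantRecord` and `parse_vcf_line` are hypothetical, and the sample line is illustrative data only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The eight fixed, tab-separated columns that every VCF data line must carry.
VCF_MANDATORY_FIELDS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


@dataclass
class VariantRecord:
    """Hypothetical container for the mandatory VCF fields of one variant."""
    chrom: str
    pos: int
    id: str
    ref: str
    alt: List[str]
    qual: Optional[float]
    filter: str
    info: Dict[str, str]


def parse_vcf_line(line: str) -> VariantRecord:
    """Parse one VCF data line (not a '#' header line) into a VariantRecord."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(VCF_MANDATORY_FIELDS):
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, variant_id, ref, alt, qual, flt, info = columns[:8]

    # INFO is a ';'-separated list of key=value pairs; flags have no value.
    info_items: Dict[str, str] = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_items[key] = value

    return VariantRecord(
        chrom=chrom,
        pos=int(pos),
        id=variant_id,
        ref=ref,
        alt=alt.split(","),                        # several alternate alleles are comma-separated
        qual=None if qual == "." else float(qual),  # '.' marks a missing value
        filter=flt,
        info=info_items,
    )


# Illustrative data line, not taken from the work described here.
EXAMPLE_LINE = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
record = parse_vcf_line(EXAMPLE_LINE)
```

Keeping INFO as a plain key/value mapping is a simplification; a real mapping to archetype nodes would need the types declared in the VCF header.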
Then, the model was refined by including other context items that emerged from consultation with a domain expert. Given the list of attributes that describe a sequence variation, a search for existing, re-usable archetypes was carried out by browsing the official OpenEHR repository, the Clinical Knowledge Manager (CKM). This search revealed an objective lack of models related to the genomic context. Some new archetypes therefore needed to be developed on the basis of the data analysis: in particular, an entry describing the genetic test, obtained by specializing an existing observation archetype, and some clusters used to specify further details. Existing archetypes, such as those for the specimen or the device, can instead be used without particular changes. The model was then positively validated against a real use case concerning the identification of causative genes for rare diseases.

Future work will concern submitting the model to the OpenEHR community for formal review. If accepted, the model would be published in the CKM and made available for clinical use.
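The two requirements stated in the motivations, structured re-usable data and traceable analysis steps, can be pictured with a plain data structure. The Python sketch below is only an analogy and does not reproduce the OpenEHR archetypes developed in this work: `GeneticTestEntry`, `AnalysisStep`, and every tool name, version, parameter and resource in it are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class AnalysisStep:
    """One step of the analysis protocol, with what is needed to reproduce it."""
    name: str
    tool: str                                                  # hypothetical tool name
    version: str
    parameters: Dict[str, str] = field(default_factory=dict)
    reference_resources: List[str] = field(default_factory=list)  # external references/databases


@dataclass
class GeneticTestEntry:
    """Hypothetical stand-in for a genetic test entry: data plus its provenance."""
    specimen_id: str
    variants: List[Dict[str, str]] = field(default_factory=list)
    protocol: List[AnalysisStep] = field(default_factory=list)

    def add_step(self, step: AnalysisStep) -> None:
        # Steps are appended in execution order so the history is preserved.
        self.protocol.append(step)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, giving a parsable document.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; none of them come from the work described here.
entry = GeneticTestEntry(specimen_id="specimen-001")
entry.variants.append({"chrom": "20", "pos": "14370", "ref": "G", "alt": "A"})
entry.add_step(AnalysisStep(
    name="alignment",
    tool="example-aligner",
    version="0.0.0",
    parameters={"threads": "4"},
    reference_resources=["example-reference-genome"],
))
entry.add_step(AnalysisStep(
    name="annotation",
    tool="example-annotator",
    version="0.0.0",
    reference_resources=["example-variant-database"],
))
document = entry.to_json()
```

Storing the protocol next to the variants means that anyone re-using the data can see which tool versions and reference resources produced them.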

BibTeX references

  author       = {Mascia, C. and Uva, P. and Zanetti, G.},
  title        = {Integrating genomic data information in EHR, a structured approach},
  booktitle    = {14th Annual Meeting of the Bioinformatics Italian Society},
  year         = {2017},
  keywords     = {Genomic data, OpenEHR, Structured data, Electronic Health Record, Variant Calling},
  url          = {},
