Identify Individuals from Their Genomic Sequences

Marc Whitlow — Fri, 01 Mar 2013 03:52:00 +0000

In an article recently published in Science,¹ Dr. Yaniv Erlich demonstrated that individuals can be identified from their publicly available genomic sequences combined with their age and state of residence. This was accomplished using the close correlation between male surnames and short tandem repeats (STRs) on the Y-chromosome. Genetic genealogy sites such as the Sorenson Molecular Genealogy Foundation and ysearch.org gave Dr. Erlich’s group the tools needed to match Y-chromosome STRs with surnames; see the methodology below. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) does not prevent this personal information from being made publicly available as what is commonly called “anonymous” or “de-identified” data. It may come as a surprise that you can be identified from your genomic sequence and two pieces of your anonymous data.

Public Policy in the Age of Big Data

Public policy needs to find a way of protecting people’s health records while still giving the biomedical research community access to medical records that may lead to new diagnostic and therapeutic products. What Dr. Erlich’s work has shown is that restricting the information attached to public genomic data does not ensure the anonymity of an individual’s genome. Further restrictions would only lower the value of the data to the biomedical research community. The solution to this problem is to shift the burden of protecting the anonymity of the data from the repository to the user.

The identity and intent of the users of publicly available genomic data need to be recorded with the data repository.
The users of the data need to be identified using protocols that are the best practices in the security industry, beyond the standard email confirmation. Out-of-network identity confirmation or the use of devices that remain in the control of the user should be employed.
The user authentication should be done every time a user requests information from a public medical repository.
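
The three requirements above could be sketched roughly as follows. This is a minimal illustration and not the design of any existing repository: `AccessLog`, `request_data`, the `identity_confirmed` flag, and the sample identifiers are all hypothetical names introduced here. The flag stands in for whatever out-of-network or device-based confirmation the repository actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessLog:
    """Hypothetical audit trail kept by the repository."""
    entries: list = field(default_factory=list)


def request_data(log, user_id, intent, identity_confirmed, dataset_id):
    """Check identity on every request and record who asked and why."""
    # identity_confirmed is an assumed result of a strong identity check
    # performed for this specific request, not a cached login.
    if not identity_confirmed:
        raise PermissionError("identity was not confirmed for this request")
    if not intent.strip():
        raise ValueError("a stated purpose is required")
    log.entries.append({
        "user": user_id,
        "intent": intent,
        "dataset": dataset_id,
        "time": datetime.now(timezone.utc).isoformat(),
    })
    return dataset_id


# Invented example request.
log = AccessLog()
request_data(log, "researcher-17", "STR mutation-rate study", True, "genome-0042")
```

The point of the sketch is that the check and the log entry happen inside the same call, so no data leaves the repository without a confirmed identity and a stated purpose on record.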

Potential Benefits to the Individual Supplying Their Genomic Sequences

One of the guiding principles of research on humans is that there must be a potential benefit to the individuals participating in the research. Researchers using public genomic data should try to identify potential benefits to the individuals whose information they are using. This information could be returned to the individual through the genomic repository.

Marketing based on an Individual’s Genome

Prescription drug marketing is already regulated by the Food and Drug Administration (FDA). What if a company wanted to market to people who have a particular genetic signature? Should it be able to use public or private genomic data to identify these individuals and market to them directly, or to their physicians? I would think that some individuals would want this and others would find it invasive. There needs to be a way for individuals to opt in, through the repository, to contact by various organizations. Once again we see the repository acting as an intermediary between the individual and the organizations analyzing their genome.

Expanding Genomic Repository Responsibilities

What I’m suggesting is that genomic repositories become hubs that manage more than the depositing individual’s data. Instead of allowing anyone access to the data, repositories should ensure that they know who is requesting data and for what purpose. Finally, the repositories should distribute information about the potential benefits of new technologies to the individuals whose information they have been entrusted with.

Identifying Individuals Methodology

Dr. Erlich’s group started from the public sequence data of an individual. From the supporting metadata they were able to obtain:

Year of birth
Sex
State of residence

They developed a short tandem repeat (STR) profiler² that allowed them to identify the Y-chromosome STRs of male individuals, and fed this information into a surname search on the genetic genealogy sites.
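
The surname search can be pictured as matching a Y-STR profile against a table of genealogy records and taking the most common surname among the close matches. The sketch below is only an illustration of that idea under my own assumptions. It is not lobSTR and not the search used by the genealogy sites, the repeat counts are invented, and `infer_surname`, `max_mismatches`, and `min_shared` are hypothetical names and thresholds.

```python
from collections import Counter

# Invented genealogy records: (surname, {marker name: repeat count}).
RECORDS = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}),
    ("Baker", {"DYS19": 14, "DYS390": 23, "DYS391": 11, "DYS393": 14}),
]


def infer_surname(query, records, max_mismatches=1, min_shared=3):
    """Return the most common surname among close matches, or None."""
    hits = Counter()
    for surname, profile in records:
        shared = query.keys() & profile.keys()
        if len(shared) < min_shared:
            continue  # too few markers in common to compare
        differing = sum(1 for marker in shared if query[marker] != profile[marker])
        if differing <= max_mismatches:
            hits[surname] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Invented query profile, as an STR profiler might report it.
query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
best_guess = infer_surname(query, RECORDS)
```

Allowing a small number of mismatches reflects the fact that STR repeat counts can drift between related male lines, so a surname can still be suggested when no record matches exactly.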

With the most probable surname, year of birth, and state of residence in hand, they used publicly available web services such as USearch.com and PeopleFinders.com to locate the best potential matches, and then did follow-up studies using public resources to confirm their results.
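
To see why three seemingly harmless attributes are enough, consider filtering a directory on all of them at once. The following is a minimal sketch with invented entries. `Person`, `narrow_candidates`, and `DIRECTORY` are hypothetical and do not represent the interface of any people-search service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    surname: str
    birth_year: int
    state: str


def narrow_candidates(people, surname, birth_year, state):
    """Keep only the records that match all three attributes."""
    return [
        p for p in people
        if p.surname.lower() == surname.lower()
        and p.birth_year == birth_year
        and p.state.upper() == state.upper()
    ]


# Invented directory entries, for illustration only.
DIRECTORY = [
    Person("A. Smith", "Smith", 1950, "UT"),
    Person("B. Smith", "Smith", 1972, "UT"),
    Person("C. Smith", "Smith", 1950, "CA"),
    Person("D. Jones", "Jones", 1950, "UT"),
]

matches = narrow_candidates(DIRECTORY, "Smith", 1950, "UT")
```

Each attribute alone leaves several candidates, but the combination leaves a single entry in this toy directory, which is the effect the group exploited.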

Of the 1000 Y-chromosome STR data sets Dr. Erlich’s group examined, they were able to “completely” identify almost 5% of the individuals.

References

1. Gymrek M., McGuire A.L., Golan D., Halperin E., Erlich Y. Identifying Personal Genomes by Surname Inference. Science 339, 321-324 (2013).
2. Gymrek M., Golan D., Rosset S., Erlich Y. lobSTR: A Short Tandem Repeat Profiler for Personal Genomes. Genome Research 22, 1154-1162 (2012).

Complexity of Biology in the Light of Stefano Rossetto’s Consolamini

Marc Whitlow — Tue, 07 Feb 2012 03:49:53 +0000

On Friday night I had the pleasure of listening to the first modern performance of Stefano Rossetto’s Consolamini, consolamini popule meus, a Christmas motet in 50 parts, conducted by Davitt Moroney. The most striking feature of this piece of music is its complexity. With 50 voices it is virtually impossible to separate out any single phrase. Prior to the performance, Davitt Moroney had warned us that it would be impossible to separate out a single phrase, and at the same time assured us that every phrase was present. Yet we know that the composer arranged these voices to have a powerful effect on the audience. At times the only reminder of an individual voice was a brief “s” from one singer rising above the massive harmony of the piece. To an untrained listener such as myself, these brief indications of the presence of individuals seemed to appear at random during the performance.

The parallels to our understanding of biological processes are enlightening and therapeutic, though certainly not the intended message of the composer Stefano Rossetto. Biology is a complex process with many “voices” contributing to an observed behavior, such as disease progression. Modern biology has become exceedingly good at separating out the individual voices. But, just as the constructive and destructive harmonization of individual voices leads to an inspiring experience, a multitude of separate biological processes combine to create a limb bud or a disease condition. The challenge for us is to incorporate more of the complexity and dynamic behavior of biology into our thinking, models, diagnoses, and treatments of disease.

If you have the opportunity to experience a live performance of a 40 to 50 voice Renaissance motet, I would highly recommend it.

Colabrativ, Inc. » Observations