Identify Individuals from Their Genomic Sequences

In a recently published article in Science [1], Dr. Yaniv Erlich demonstrated that individuals can be identified from their publicly available genomic sequences combined with their age and state of residence. This was accomplished using the close correlation between male surnames and the short tandem repeats (STRs) on the Y chromosome, both of which are typically passed from father to son. Genetic genealogy sites such as the Sorenson Molecular Genealogy Foundation provided Dr. Erlich’s group the tools needed to match Y-chromosome STRs with last names; see the methodology below. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) does not preclude this personal information from being publicly available in what is commonly referred to as “anonymous” or “de-identified” data. It may come as a surprise that you can be identified from your genomic sequence and two pieces of your anonymous data.

Public Policy in the Age of Big Data

Public policy needs to find a way of protecting people’s health records while still allowing the biomedical research community access to medical records for the potential identification of new diagnostic and therapeutic products. Dr. Erlich’s work has shown that restricting the information available with public genomic data does not ensure the anonymity of an individual’s genome. Further restrictions would only lower the value of the data provided to the biomedical research community. The solution to this problem is to shift the burden of protecting the anonymity of the data from the repository to the user.

  • The identity and intent of the users of publicly available genomic data need to be recorded with the data repository.
  • The users of the data need to be identified using security-industry best practices that go beyond the standard email confirmation. Out-of-band identity confirmation or the use of devices that remain in the control of the user should be employed.
  • The user authentication should be done every time a user requests information from a public medical repository.
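
The three requirements above can be sketched as a small access gate. This is only an illustration under stated assumptions: `GenomicRepository`, `AccessRecord` and the `verify_identity` callback are hypothetical names, and the callback stands in for whatever out-of-band or device-based check a real repository would use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AccessRecord:
    """One logged data request: who asked, why, for what, and when."""
    user_id: str
    intent: str
    dataset_id: str
    timestamp: datetime


@dataclass
class GenomicRepository:
    # Hypothetical hook: returns True when (user_id, credential) is confirmed
    # by a strong identity check, e.g. a code from a device the user controls.
    verify_identity: Callable[[str, str], bool]
    datasets: Dict[str, str] = field(default_factory=dict)
    access_log: List[AccessRecord] = field(default_factory=list)

    def request(self, user_id: str, credential: str,
                intent: str, dataset_id: str) -> str:
        # The user must state an intent for every request.
        if not intent.strip():
            raise PermissionError("a stated intent is required")
        # Authenticate on every request rather than once per session.
        if not self.verify_identity(user_id, credential):
            raise PermissionError("identity could not be confirmed")
        if dataset_id not in self.datasets:
            raise KeyError(dataset_id)
        # Record identity and intent with the repository before releasing data.
        self.access_log.append(
            AccessRecord(user_id, intent, dataset_id, datetime.now(timezone.utc))
        )
        return self.datasets[dataset_id]
```

In this sketch only successful requests are logged; a real repository would probably also want to record the denied ones.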

Potential Benefits to the Individual Supplying Their Genomic Sequences

One of the guiding principles of research on humans is that there must be a potential benefit to the individuals participating in the research. Researchers using public genomic data should try to identify potential benefits to the individuals whose information they are using. This information could be returned to the individual through the genomic repository.

Marketing Based on an Individual’s Genome

Prescription drug marketing is already regulated by the Food and Drug Administration (FDA). What if a company wanted to market to people who have a particular genetic signature? Should they be able to use public or private genomic data to identify these individuals, and market to them directly, or to their physicians? I would think that some individuals would want this and others would find it invasive. There needs to be a way of allowing individuals to opt in to contact by various organizations through the repository. Once again, we see the repository acting as an intermediary between the individual and the organizations analyzing their genome.

Expanding Genomic Repository Responsibilities

What I’m suggesting is that genomic repositories become hubs that manage more than the depositing individual’s data. Instead of allowing anyone access to the data, repositories should ensure that they know who is requesting data and for what purpose. Finally, the repositories should distribute information about the potential benefits of new technologies to the individuals whose information they have been entrusted with.

Methodology for Identifying Individuals

Dr. Erlich’s group started from the public sequence data of an individual. From the supporting metadata they were able to obtain:

  1. Year of birth
  2. Sex
  3. State of residence

They developed a short tandem repeat (STR) profiler [2] that allowed them to identify the Y-chromosome STRs of male individuals, and fed this information into surname searches on the genetic genealogy sites.
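
To make the surname search concrete, here is a minimal sketch of matching a Y-STR profile against surname-labelled reference profiles. It is not the profiler from [2] or the search the genealogy sites actually run: the scoring rule (fraction of shared markers with identical repeat counts), the function names and the `min_compared` threshold are all assumptions made for illustration, and any repeat counts passed in would be made-up values.

```python
from typing import Dict, List, Tuple

# A Y-STR profile maps a marker name (e.g. "DYS19") to its repeat count.
Profile = Dict[str, int]


def shared_markers(query: Profile, reference: Profile) -> Tuple[int, int]:
    """Return (matching markers, compared markers) for two profiles."""
    common = set(query) & set(reference)
    matches = sum(1 for marker in common if query[marker] == reference[marker])
    return matches, len(common)


def rank_surnames(query: Profile,
                  records: List[Tuple[str, Profile]],
                  min_compared: int = 3) -> List[Tuple[str, float]]:
    """Rank surnames by the fraction of compared markers that match.

    Records sharing fewer than `min_compared` markers with the query are
    skipped, since a match on one or two markers says very little.
    """
    scored = []
    for surname, reference in records:
        matches, compared = shared_markers(query, reference)
        if compared >= min_compared:
            scored.append((surname, matches / compared))
    # Best score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored
```

The top-ranked surname is only the most probable candidate; it still has to be combined with the other metadata before anyone can be identified.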

Having the most probable surname, year of birth and state of residence, they used publicly available web services to locate the best potential matches, and did follow-up studies using public resources to confirm their results.
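
The narrowing step can be sketched as a simple filter over public records. The `PublicRecord` type and `candidate_matches` function are hypothetical, as is the idea that the records sit in one list; the point is only that each added attribute shrinks the candidate pool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PublicRecord:
    """A hypothetical entry from a public people-search resource."""
    given_name: str
    surname: str
    birth_year: int
    state: str


def candidate_matches(records: List[PublicRecord], surname: str,
                      birth_year: int, state: str) -> List[PublicRecord]:
    """Keep only records that agree on surname, birth year and state."""
    return [
        record for record in records
        if record.surname.lower() == surname.lower()
        and record.birth_year == birth_year
        and record.state.upper() == state.upper()
    ]
```

A surname alone may leave thousands of candidates. When adding the year of birth and the state leaves only one or a few, follow-up confirmation becomes practical.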

Of the 1000 Y-chromosome STR data sets Dr. Erlich’s group examined, they were able to “completely” identify almost 5% of the individuals.


  1. Identifying Personal Genomes by Surname Inference. Gymrek M., McGuire A.L., Golan D., Halperin E., Erlich Y. Science 339, 321-324 (2013).
  2. lobSTR: A Short Tandem Repeat Profiler for Personal Genomes. Gymrek M., Golan D., Rosset S., Erlich Y. Genome Research 22, 1154-1162 (2012).