Obama’s “Precision Medicine” Initiative and Privacy

Jay Stanley, Senior Policy Analyst, ACLU Speech, Privacy, and Technology Project

September 18, 2015

Yesterday we got a lot of new detail on the likely shape of President Obama’s proposed “Precision Medicine Initiative” (or PMI). This project envisions creating a research database of genetic data on one million volunteers, an idea that promises potentially huge medical benefits, but also raises significant privacy questions.

The initiative was first announced in the president’s address last January, and then launched with fanfare a few weeks later in a White House gathering of genetics experts, patients, academics, and government officials. To their credit, White House officials have recognized the privacy challenges, and have sought input from privacy experts from the beginning (the ACLU was invited to the program launch, and to several meetings since to discuss the privacy issues involved). In July, the White House produced a document that outlined some basic “privacy and trust” principles to which the program is supposed to adhere.

But many of the crucial details have still been lacking, making it hard to know how to evaluate what has been essentially just a vision. The privacy document is quite good and covers all the bases, but seems to defer any hard choices between using data in every possible way for research, and protecting privacy. For example, in meetings I attended top health officials waxed enthusiastic about the potential advantages of crowdsourcing genetic databases (leaving them wide open so that a thousand people can explore them), which they said in their experience inevitably yields discoveries that don’t come to light if data access is granted to only a small cadre of certified professional “researchers.” Certainly that comports with everything we know about the advantages of open-source and crowdsourced data. At the same time, the document envisions strict rules and limits around how people’s data will be used and shared. It’s not clear to me how those two visions will co-exist.

It’s been hard to make judgments without more details, which is why I’ve been eagerly awaiting the recommendations of an advisory group that was tasked with developing recommendations for some of the nuts and bolts of how this initiative would actually work. Yesterday, in a public meeting broadcast live via conference call, the advisory group formally presented its recommendations to NIH Director Francis Collins. At the end of the meeting, Collins announced that he was accepting the recommendations, clearing the way for implementation according to the report’s blueprint.

These are very complex issues and I haven’t had time to properly digest them, but here are some significant features of the program outlined in the report (related material was also presented at the meeting):

Volunteers whose information is entered into the database (which the program likes to call “participants”) are envisioned as being drawn mainly from health providers such as Kaiser Permanente. Any other individual will also be able to directly volunteer to be included.
The data obtained from each participant will include not just their genome, but also Electronic Health Record data: essentially their complete medical records, including such things as narrative documents, EKG and EEG waveform data, scan imagery, and “mobile health” data from wearable sensors. It will also include a bio-sample of their actual tissue (most likely a blood sample) and the results of a baseline physical exam. Basically, all available medical data that could prove useful. One reason that the system will seek to draw from established health providers like Kaiser is that they already have all this information on their patients in one place.
The million-person Cohort is envisioned as being longitudinal: it will feature an ongoing relationship with participants, including continuous information collection.
Information and findings will also be fed back to participants: both aggregate scientific findings, and also findings of individual relevance.
The database would be open for exploration by any researchers, anyone from academic professionals to high school students.
Any new data that results from (for example) running a new algorithm on the Cohort would have to be shared back with the project and available to others. This is good; this database will belong to the public and its fruits should likewise belong to the public.
The report details a governance system that includes significant input from program participants. Also good.

Perhaps most significantly for privacy, the report recommends that the program “should create and use de-identified data for research whenever feasible to do so.” At the same time, it also wants participants to be “re-contactable.” In its key paragraphs on privacy, the report recognizes the complexities involved:

A national cohort that includes a highly interactive approach to communicating with and soliciting input from study participants will necessarily have to operate in two data management modes, while respecting participant preferences and terms of consent. The “fully identified” mode of operations will be needed for messaging, study appointment reminders, phone interactions, etc….

Aggregate data assembled for analysis will need to be de-identified by removal of standard classes of personal identifiers such as those specified by HIPAA Limited Data Set and Safe Harbor provisions. These are imperfect privacy standards, however, and the clinical and research-generated data are expected to be rich in features that make each individual’s contribution unique. Uniqueness is not synonymous with re-identification (which requires, in addition, a naming source), but the proliferation of data mining methods and potential naming sources (voter lists, public registries, social media postings, ancestry web sites, etc.) means that technology alone will be insufficient to address issues of data privacy for the PMI cohort. Expert testimony presented at [a program workshop] brought forth the view that de-identification should not be thought of as a guarantee of anonymity, but rather simply “another disincentive to attempting re-identification of individuals.” Acceptable use policies with substantial enforceable sanctions will need to be developed or adapted from other similar research efforts to complement the technical approaches to deidentification of data.

In short, it may be possible to re-identify participants from medical records in the database, but those who attempt to do so will be subject to unspecified “penalties.”

Ultimately, the report thus punts on the hardest details for now with a recommendation that the program “engage data privacy experts to create an effective combination of technology and policy to minimize risks to re-identification of de-identified data.” On yesterday’s call, as in prior meetings, I have certainly been favorably impressed with the thoughtfulness and thoroughness with which White House and NIH staff have approached the policy issues raised by this project, including the privacy implications as well as a number of other knotty issues it raises. That said, strictly from a privacy point of view, there remain some significant questions for those contemplating volunteering for this program. It does not look as though this will be an airtight, privacy-protective system where subjects’ data will be technologically guaranteed private. And of course as with any large data store in today’s world the cybersecurity questions are considerable. A fair amount of trust will have to be placed by participants in those who run this program.

Of course, many people will be inspired to volunteer for this program out of a desire to help researchers fight diseases that have already affected them or people they love, or out of an abstract desire to contribute to humanity. Those are motivations we can all honor. Scientists say there’s real potential for this kind of database to revolutionize many areas of medicine. The exploitation of medical data for good is not like using big data to try to spot terrorists, a misguided effort where the (unlikely) benefits are vastly eclipsed by the privacy downsides. In a chart included in the report yesterday, the authors estimate, for example, that with a population of a million people there will be 6,400 cases of Parkinson’s within 5 years, 18,000 cases of lupus, 32,600 cases of breast cancer, and similar numbers for many other conditions. That will allow a lot of exploration of genetic and environmental causes of disease. Such possibilities are something that we privacy advocates do not fail to take into consideration when judging uses of data.

And not everyone feels they need airtight privacy, even for their medical records and the sensitive information they so often contain. Some people are already making their genomes public.

But it’s also important for people to have a clear understanding of what the privacy risks might be, both so that those risks can be ameliorated where possible, and also so that individuals can make a fully informed decision about whether they want to participate. We want volunteers to go in with their eyes wide open. The proposal outlined yesterday, and the project overall as it unfolds, will have to be studied and analyzed closely by privacy advocates.
