The Massachusetts Supreme Judicial Court (SJC) recently decided an interesting case at the intersection of privacy and public records, raising questions about how the analysis of large datasets can affect privacy interests.1 Although this was a public records case, the court’s analysis will likely bear on private entities’ data protection obligations under regimes like those established by California’s CCPA, the EU’s GDPR, and any federal privacy legislation that may be forthcoming.
Background
The Massachusetts Department of Public Health (DPH) maintains a database of vital records for the Commonwealth (births, marriages, divorces, and deaths) and makes information from that database available to the public. Anyone can walk into the DPH research room in Boston, enter a name on a computer terminal, and obtain that person’s vital records. The researcher can write the information down, but can’t print it out or get it in electronic form.
The information in these public records is, of course, public. And there’s no question that it’s inefficient to make someone interested in several records—or all the records—go through the laborious process of entering name after name and copying down the results.
So, a press organization put in a public records request for an electronic copy of the entire database.
Even though the information at issue was almost by definition “public,” the case was not a slam dunk. Instead, the SJC—relying on concerns about the ability to analyze changes in data over time, as well as concerns about the potential privacy afforded by “security through obscurity”—sent the case back to the trial court, rather than granting the organization’s request.2
The Perils of Data Analysis
While recognizing that the contents of any one version of the public vital records database will be public, the SJC identified several situations where comparison of different versions of the database could reveal sensitive information. “A side-by-side comparison of the same person’s data at different points in time might reveal, for example, the [names of the] biological parents … of an individual who has since been adopted, the name of a putative father whose nonpaternity has since been established, and the previous name and sex of an individual who has since completed sex reassignment surgery.” 482 Mass. at 437.
The court found that the trial court had erred in refusing to consider the potential revelations of sensitive information arising from later analysis of multiple iterations of the database: “Today’s current information may be tomorrow’s record protected [by statute] from public view … or tomorrow’s indication that an entry has been removed [from public view pursuant to statute. The courts must not] ignore that an index from the present is entwined with indices from the future.” Id. The SJC directed the trial court to make factual findings about how effectively such potential comparisons of multiple iterations of the vital statistics database could reveal sensitive information and whether that risk warrants withholding the electronic database from public disclosure. Id. at 437-38.
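The kind of side-by-side comparison the court describes can be illustrated with a short sketch. This is not a depiction of the DPH database or of any analysis actually at issue in the case. The record identifiers, field names, and values below are all hypothetical, and the sketch assumes that each record carries a stable identifier across versions of the index. It shows only that someone holding two electronic copies can mechanically list what changed or disappeared between them.

```python
from typing import Dict, List, Tuple

# Hypothetical structures: a snapshot maps a stable record id to its fields.
Record = Dict[str, str]
Snapshot = Dict[str, Record]
Finding = Tuple[str, str, str, str]  # (record id, field, before, after)


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[Finding]:
    """List every field that changed, and every record that vanished,
    between an earlier and a later copy of an index."""
    findings: List[Finding] = []
    for record_id, old_rec in old.items():
        new_rec = new.get(record_id)
        if new_rec is None:
            # The absence of a record in the later copy is itself informative.
            findings.append((record_id, "<entire record>", "present", "removed"))
            continue
        for field in sorted(old_rec.keys() | new_rec.keys()):
            before = old_rec.get(field, "")
            after = new_rec.get(field, "")
            if before != after:
                findings.append((record_id, field, before, after))
    return findings


# Entirely fictional data, for illustration only.
earlier: Snapshot = {
    "B-1001": {"name": "Child A", "parent": "Parent X"},
    "B-1002": {"name": "Child B", "parent": "Parent Z"},
    "B-1003": {"name": "Child C", "parent": "Parent W"},
}
later: Snapshot = {
    "B-1001": {"name": "Child A", "parent": "Parent Y"},  # entry amended
    "B-1003": {"name": "Child C", "parent": "Parent W"},  # unchanged
}

findings = diff_snapshots(earlier, later)
for finding in findings:
    print(finding)
```

Neither copy is sensitive on its face. The amended and removed entries surface only when the two are compared, which is the risk the SJC asked the trial court to evaluate.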
Concerns about the potential discovery of sensitive information from non-sensitive datasets are not new.3 The SJC’s recent decision, however, appears to be one of a relatively small number of decisions that require a trial court, before otherwise public information is released, to undertake a detailed factual review of how analysis of multiple datasets might lead to the disclosure of information that is supposed to be kept confidential, even if each dataset is, on its own, publicly available.4
Although it arose under the public records law, the decision is noteworthy for private entities subject to data protection laws. Many entities are considering deidentification protocols as a way to enhance privacy and business flexibility by rendering personal information not “personal.” The SJC’s decision highlights the concern that even seemingly benign datasets, which do not themselves appear to contain any confidential information, might nonetheless permit the reconstruction or revelation of deeply personal information that could be traced back to specific individuals, based either on other information in the dataset or on publicly available outside sources. The CCPA, for example, defines the personal information it covers as including information that “is capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.” Cal. Civ. Code § 1798.140(o).
Similarly, the GDPR defines “personal data” to include information pertaining not just to an “identified” data subject, but also to an “identifiable” one. GDPR, Art. 4(1). The GDPR also specifically refers to “pseudonymization” as a way to process personal data so that it can no longer “be attributed to a specific data subject without the use of additional information.” GDPR, Art. 4(5). Lurking in both the CCPA and the GDPR, therefore, are potentially significant issues regarding what personal, private information can be derived from seemingly anonymous, deidentified, or already-public data.
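To make the linkage concern concrete, the following sketch joins a “deidentified” dataset to a public roster using fields that appear in both. It is illustrative only: the field names, the choice of ZIP code, birth date, and sex as the linking fields, and all of the data are assumptions made for this example, not a description of any dataset discussed above or of any statutory test.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed linking fields; real datasets may expose different ones.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link_records(
    deidentified: List[Dict[str, str]],
    public: List[Dict[str, str]],
) -> List[Tuple[str, str]]:
    """Return (name, diagnosis) pairs for rows that match exactly one
    person in the public roster on every quasi-identifier."""
    index: Dict[tuple, List[str]] = defaultdict(list)
    for person in public:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        index[key].append(person["name"])

    matches: List[Tuple[str, str]] = []
    for row in deidentified:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match singles out one individual
            matches.append((candidates[0], row["diagnosis"]))
    return matches


# Entirely fictional data, for illustration only.
deidentified_rows = [
    {"zip": "00001", "birth_date": "1980-01-01", "sex": "F", "diagnosis": "condition-1"},
    {"zip": "00002", "birth_date": "1975-06-15", "sex": "M", "diagnosis": "condition-2"},
    {"zip": "00003", "birth_date": "1990-03-09", "sex": "F", "diagnosis": "condition-3"},
]
public_roster = [
    {"name": "Person A", "zip": "00001", "birth_date": "1980-01-01", "sex": "F"},
    {"name": "Person B", "zip": "00002", "birth_date": "1975-06-15", "sex": "M"},
    {"name": "Person C", "zip": "00002", "birth_date": "1975-06-15", "sex": "M"},
]

reidentified = link_records(deidentified_rows, public_roster)
for name, diagnosis in reidentified:
    print(name, diagnosis)
```

In this toy example only the first row is tied to a single named person. The second matches two people and the third matches no one. How often real data yields unique matches is a factual question of the kind the SJC left to the trial court.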
As implementation of the CCPA and the GDPR continues, we can expect a growing focus on whether and how seemingly innocuous, anonymous, deidentified information can be analyzed in connection with other information, whether that is public information or information (public or not) that someone requesting the data might reasonably have access to. This same issue may well arise in any federal-level privacy legislation. The range of issues raised by anonymization, pseudonymization, and deidentification, therefore, warrants ongoing attention.
1 Boston Globe Media Partners, LLC v. Department of Public Health, 482 Mass. 427 (2019).
2 The case involved a number of other interesting issues relating to public records access and personal privacy. In this note, though, we focus only on the issues of anonymization and potential reidentification.
3 Nearly ten years ago, Georgetown Law professor Paul Ohm suggested that analytical techniques had already rendered—or would, in due course, render—most such efforts unsuccessful. See Paul Ohm, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, 57 U.C.L.A. L. REV. 1701, 1705 (2010) (noting highly publicized reidentifications of certain medical information, and Netflix viewing information).
4 The issue arose with respect to state bar records in Sander v. State Bar of California, 58 Cal. 4th 300, 326, 314 P.3d 488, 507 (2013). And, in the specific context of personal health information, HIPAA has long contemplated that anonymization techniques can provide adequate protection of privacy. See 45 C.F.R. § 164.514.