A group of Danish researchers, led by Aarhus University graduate student Emil O. W. Kirkegaard, recently publicly released a dataset of nearly 70,000 users of the online dating site OkCupid, including usernames, age, gender, location, what kind of relationship (or sex) they’re interested in, personality traits, and answers to thousands of profiling questions used by the site.
When asked whether the researchers attempted to anonymize the dataset, Kirkegaard replied bluntly: “No. Data is already public.” This sentiment is repeated in the accompanying draft paper, “The OKCupid dataset: A very large public dataset of dating site users,” posted to the online peer-review forums of Open Differential Psychology, an open-access online journal also run by Kirkegaard:
Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.
To those concerned about privacy, research ethics, and the growing practice of publicly releasing large data sets, this logic of “but the data is already public” is an all-too-familiar refrain used to gloss over thorny ethical concerns.
In response to this problematic data release, CIPR director Michael Zimmer published an editorial in Wired: “OkCupid Study Reveals the Perils Of Big-Data Science” (Wired, May 14, 2016). He states, in part:
The OkCupid data release reminds us that the ethical, research, and regulatory communities must work together to find consensus and minimize harm. We must address the conceptual muddles present in big data research. We must reframe the inherent ethical dilemmas in these projects. We must expand educational and outreach efforts. And we must continue to develop policy guidance focused on the unique challenges of big data studies. That is the only way we can ensure innovative research, like the kind Kirkegaard hopes to pursue, can take place while protecting the rights of people and the ethical integrity of research broadly.
Zimmer also appeared on the WUWM Milwaukee Public Radio show Lake Effect to discuss “Big Data Research Creates Ethical Concerns,” noting that:
“So when a researcher like this says, ‘Well this stuff was already public,’ what he kind of really means is like, ‘This stuff was visible to other users who happen to also create a profile,’ and those aren’t the same thing,” says Zimmer. “Psychologically I think it’s important for users when they sign up for this thing to have this assumption, or these set of expectations, that I know this data is kind of public but it’s meant for this community… Doing this kind of research sometimes violates that assumption.”