03-Ethics (Privacy and Dual Use)#


Goals of this lecture#

  • Case studies 2-3: data collection and privacy violations.

  • Refresher: Why ethics?

  • Ethical issues: the fundamentals.

  • What legal regulations exist?

  • What should individual researchers do?

  • Trade-offs between different ethical considerations.

Case Studies#

Case Study 2: Addiction apps sharing sensitive data#

Opioid addiction apps sharing data#

Telehealth apps growing in popularity:

As a result of the COVID-19 pandemic and efforts to reduce transmission in the U.S., telehealth services and apps offering opioid addiction treatment have surged in popularity.

But these apps may be accessing and sharing sensitive data:

Despite the vast reach and sensitive nature of these services, the research found that the majority of the apps accessed unique identifiers about the user’s device and, in some cases, shared that data with third parties.

Source: TechCrunch

What information is accessed, and how?#

Information about the device:

Five of the apps also access the devices’ phone number…and two access a user’s list of installed apps.

Information about a user’s location:

Seven of the apps request permission to make Bluetooth connections…can be used to track users in real-world locations.

Source: TechCrunch

Case Study 3: A not-for-profit sharing mental health information#

The need for mental health support#

Crisis Text Line (CTL) is a non-profit suicide prevention hotline.

  • Connects individuals (e.g., those suffering from depression) with volunteers.

  • In the process, records and collects data about these conversations.

    • Has called this “the largest mental health data set in the world”.

  • This data can be used to help identify especially urgent cases.

Source: Scientific American

Sharing data with third-party companies#

Recent reporting found that Crisis Text Line was sharing anonymized data with a third-party company, Loris.ai.

  • Loris uses data to improve customer management software.

  • In return, Loris shares revenue with CTL.

Technically, CTL users “consent” to this disclosure.

  • But it is unclear whether this consent is meaningful: the disclosure form is 50+ pages long, and users read it while in acute distress.

Source: Scientific American

Data sharing has serious consequences#

In many cases, sharing this data can cause harm.

In 2018, U.S. border authorities refused entry to several Canadians who had survived suicide attempts, based on information in a police database.

Source: Scientific American

Privacy and mental health more generally#

Of 132 studies surveyed:

The researchers in 85 percent of the studies didn’t address, either in study design, or in reporting results, how the technologies could be used in negative ways.

Source: Scientific American

Refresher: why ethics for CSS?#

Reason 1: CSS has impact#

In CSS, our work can have real-world impact.

  • Is this impact positive?

  • Does this work have the potential to cause harm?

  • How does the potential harm compare to current harms?

Reason 2: CSS involves hard decisions#

We also face many difficult questions.

Hard question(s) of the day:

  1. Does this work violate privacy or have harmful potential for dual use?

  2. What kind of regulations and enforcement mechanisms are (or should be) in place to protect against this?

  3. What can/should I do as an individual researcher?

The data-driven world#

  • Increasingly, our activities leave behind a digital trace:

Digital trace data are “records of activity (trace data) undertaken through an online information system (thus, digital).”

  • The availability of data is part of the promise of “Big Data”.

  • But it also raises a number of conceptual and ethical issues.

Source: Hollingshead et al. (2021)

Data linkage#

Data linkage refers to the unification of corresponding datasets to enable a richer informational account of the unit of analysis.

For example, the CDC links records across different data sources.


Source: Hollingshead et al. (2021)
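To make this concrete, here is a minimal sketch of data linkage with pandas. The datasets, key, and column names are made up for illustration; they are not drawn from the CDC example above.

```python
import pandas as pd

# Hypothetical example: two separate data sources describing the same people.
hospital_records = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "diagnosis": ["asthma", "diabetes", "hypertension"],
})
survey_responses = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "household_income": [42000, 88000, 61000],
})

# Linking the two datasets on a shared key yields a richer informational
# account of each unit of analysis (here, each patient).
linked = hospital_records.merge(survey_responses, on="patient_id")
print(linked)
```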

De-anonymization#

Data linkage is a really powerful tool in our toolkit, but it also raises issues, such as de-anonymization.

De-anonymization is a data mining strategy in which anonymous data is cross-referenced with other data sources to re-identify the anonymous data source.

Sometimes, purportedly anonymized datasets can be de-anonymized using data linkage. This happened famously in 2008 with the Netflix Prize dataset.
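As a toy illustration of such a “linkage attack”, the sketch below cross-references a supposedly anonymized table with a public auxiliary table on shared quasi-identifiers. All names, columns, and values are invented.

```python
import pandas as pd

# "Anonymized" dataset: direct identifiers removed, but quasi-identifiers remain.
anonymized = pd.DataFrame({
    "zip_code": ["92093", "92101", "92122"],
    "birth_year": [1990, 1985, 1972],
    "sex": ["F", "M", "F"],
    "sensitive_history": ["...", "...", "..."],
})

# Public auxiliary dataset (e.g., a voter roll or public profile) with names.
auxiliary = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "zip_code": ["92093", "92101", "92122"],
    "birth_year": [1990, 1985, 1972],
    "sex": ["F", "M", "F"],
})

# Joining on the quasi-identifiers re-attaches identities to the "anonymous" rows.
reidentified = anonymized.merge(auxiliary, on=["zip_code", "birth_year", "sex"])
print(reidentified[["name", "sensitive_history"]])
```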


Differential privacy as a potential solution?#

Differential privacy is an approach to ensuring data security and de-identification.

A dataset/algorithm is “differentially private” if by looking at the output, one cannot tell whether any given individual’s data was included or not.

Differential privacy tries to mathematically formalize the notion of privacy to overcome the data linkage problem.

These “linkage attacks” motivate the need for a robust definition of privacy – one that is immune to attacks using auxiliary knowledge, including knowledge that a data curator cannot predict the availability of.

Counterpoint: differential privacy tends to decrease accuracy, as it involves injecting noise into a dataset.

  • We will revisit this later in the lecture.
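As a rough sketch of how noise injection works in practice, here is a simple count released via the Laplace mechanism. The query, data, and epsilon value are illustrative only, not a production implementation.

```python
import numpy as np

def dp_count(data, predicate, epsilon=1.0, rng=None):
    """A differentially private count using the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person's
    record changes the count by at most 1), so Laplace noise with scale
    1/epsilon satisfies epsilon-differential privacy.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative query on made-up survey responses:
# "How many respondents reported the sensitive attribute?"
responses = [True, False, True, True, False, False, True]
print(dp_count(responses, predicate=lambda x: x, epsilon=0.5))
```

Smaller values of epsilon mean more noise (stronger privacy) but a less accurate count, which is exactly the trade-off noted above.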

Relevant research: Latanya Sweeney’s work on de-anonymization and data linkage.

Check-in#

Reflection: how many digital accounts do you have (social media, Amazon, Netflix, etc.)? Do you think someone could deduce your identity by linking these accounts together?

Data archiving#

The recent replicability crisis in science has prompted a movement towards “open science”.

  • This includes making data publicly available.

  • In fact, many scientific journals now require that an anonymized dataset be made available to other researchers.

While good for transparency, this also raises ethical concerns around consent.

More generally, the widespread practice of data sharing by companies means that our data may be made accessible to third parties.

Check-in#

Can you think of examples of how your digital data might be used for multiple purposes you didn’t consent to?

Ownership#

Who owns the data in your “digital trace”?

Consider the following kinds of data:

  • Location data.

  • Purchasing history.

  • Social media contacts and activity.

  • Chat messages.

  • Netflix activity.

Reflection: Who “owns” these data? Who should own them?

Not a new issue!#

Henrietta Lacks was an African-American woman whose cancer cells were used for the HeLa line, an “immortal cell line” used for scientific research.

  • The “HeLa line” has facilitated many discoveries, including the polio vaccine.

  • But Henrietta Lacks was not told about this use, nor did she consent.

This controversy inspired the Common Rule in medicine, which requires that doctors inform patients whether their medical data (tissues, etc.) will be used for research.


Data Privacy: Researcher considerations#

What can (and should) we do as individual researchers?


“Ten simple rules”#

Zook et al. (2017) describe “ten simple rules” to follow for responsible Big Data research.

  • Here, we will cover the rules most relevant to bias and privacy.

As we read each rule, reflect:

  • How can I use this rule in my own research?

  • Is this rule precise enough to be useful?

Rule 1: Data are people and can do harm.#

One of the most fundamental rules of responsible big data research is the steadfast recognition that most data represent or impact people.

This also connects to our lecture on bias:

Harm can also result when seemingly innocuous datasets about population-wide effects are used to shape the lives of individuals or stigmatize groups, often without procedural recourse.

Recommendation:

Start with the assumption that data are people (until proven otherwise), and use it to guide your analysis.

Rule 2: Privacy ≠ binary value#

Privacy is not just a binary value:

Looking at a single Instagram photo by an individual has different ethical implications than looking at someone’s full history of all social media posts.

Sharing is okay in some contexts and not in others:

Likewise, distributing health records is a necessary part of receiving health care, but this same sharing brings new ethical concerns when it goes beyond providers to marketers.

Recommendation:

Situate and contextualize your data to anticipate privacy breaches and minimize harm.

Rule 3: Guard against reidentification#

Many examples of “anonymized” data being de-anonymized:

When datasets thought to be anonymized are combined with other variables, it may result in unexpected reidentification, much like a chemical reaction resulting from the addition of a final ingredient.

“Big data” allows for combining many different features to identify a person:

Surprising to many, unlabeled network graphs—such as location and movement, DNA profiles, call records from mobile phone data, and even high-resolution satellite images of the earth—can be used to reidentify people.

Recommendation:

Identify possible vectors of reidentification in your data. Work to minimize them in your published results to the greatest extent possible.
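One concrete (and simplified) way to look for reidentification vectors is to count how many records share each combination of quasi-identifiers, in the spirit of k-anonymity. The columns and data below are hypothetical.

```python
import pandas as pd

def smallest_group_size(df, quasi_identifiers):
    """Smallest number of records sharing a quasi-identifier combination.

    A value of 1 means at least one person is uniquely identifiable from
    these columns alone (the dataset is not even 2-anonymous).
    """
    return df.groupby(quasi_identifiers).size().min()

# Hypothetical survey data with direct identifiers already removed.
survey = pd.DataFrame({
    "zip_code": ["92093", "92093", "92101", "92122"],
    "birth_year": [1990, 1990, 1985, 1972],
    "sex": ["F", "F", "M", "F"],
    "response": [3, 5, 2, 4],
})

print(smallest_group_size(survey, ["zip_code", "birth_year", "sex"]))  # 1 -> risky
```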

Rule 4: Ethical data sharing#

Data sharing is an important part of research:

For example, in rare genetic disease research, biological samples are shared in the hope of finding cures, making dissemination a condition of participation.

But we must be careful about not violating the terms of our consent:

However, we caution that even when broad consent was obtained upfront, researchers should consider the best interests of the human participant, proactively considering the likelihood of privacy breaches and reidentification issues.

Recommendation:

Share data as specified in research protocols, but proactively address concerns of potential harm from informally collected big data.

Rule 8: Design for auditability#

An important part of responsible data practices is building systems that can be easily audited:

Responsible internal auditing processes flow easily into audit systems and also keep track of factors that might contribute to problematic outcomes.

One benefit of this is forcing researchers to debate the tough questions:

Designing for auditability also brings direct benefits to researchers by providing a mechanism for double-checking work and forcing oneself to be explicit about decisions, increasing understandability and replicability.

Recommendation:

Plan for and welcome audits of your big data practices.
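One lightweight way to support auditability, sketched below with made-up names and values, is to log each analysis decision together with its parameters and a hash of the input data, so the exact choices can be reconstructed later.

```python
import hashlib
import json
import logging

logging.basicConfig(filename="analysis_audit.log", level=logging.INFO)

def log_analysis_step(step_name, parameters, data_bytes):
    """Record what was done, with which settings, on which version of the data."""
    record = {
        "step": step_name,
        "parameters": parameters,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }
    logging.info(json.dumps(record))

# Example: document a filtering decision so it can be audited later.
raw = b"user_id,text\n1,hello\n2,goodbye\n"
log_analysis_step("drop_short_posts", {"min_tokens": 5}, raw)
```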

Data statements#

If you compile and release a dataset, you can attach a data statement.

A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

This is one way to make your data more transparent.

Source: Bender & Friedman (2018)

What does this involve?#

Motivation for creation of dataset:

  • Why was it created?

  • What other tasks could the dataset be used for?

Dataset composition:

  • What is each “instance” (i.e., “row”) in the dataset?

  • How many instances are there?

Dataset maintenance:

  • Who is responsible for maintaining the dataset?

  • Who should be contacted with questions?

Legal considerations:

  • Were individuals informed about data collection? Did they consent?

  • Could this dataset expose people to harm or legal action?

  • Does this dataset comply with the GDPR legal standards?
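These questions can be collected into a simple, machine-readable template. The sketch below is not the official data statement schema; the field names paraphrase the categories above, and the values are made up for illustration.

```python
# A lightweight, illustrative data statement template (not an official schema).
data_statement = {
    "motivation": {
        "why_created": "Study of public discourse around telehealth apps.",
        "other_possible_uses": ["topic modeling", "sentiment analysis"],
    },
    "composition": {
        "instance_definition": "One row = one public post.",
        "num_instances": 120_000,
    },
    "maintenance": {
        "maintainer": "Research lab X",
        "contact": "lab-contact@example.edu",
    },
    "legal": {
        "participants_informed": False,
        "participants_consented": False,
        "potential_harms": "Re-identification of posters.",
        "gdpr_compliant": "Under review",
    },
}

print(data_statement["legal"]["potential_harms"])
```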

Trade-offs: privacy and fairness in conflict#

In these lectures, we’ve discussed two pillars of ethical CSS:

  1. Fairness.

  2. Privacy.

Both of these are challenging on their own. Yet the problem is made more complex when we consider both criteria at once.

Differential privacy can have disparate impact#

Accuracy of DP models drops much more for the underrepresented classes and subgroups.

Why disparate impact?#

Differential privacy works by effectively adding noise to observations in a dataset.

  • This typically decreases accuracy across the board.

  • However, because some groups are systematically under-represented, accuracy is most impacted for these groups.

Fundamentally, this connects back to issues of biased data.
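A toy simulation makes the intuition visible: the same amount of Laplace noise translates into a much larger relative error for a small group than for a large one. The group sizes below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.5

# Made-up group sizes: one well-represented group, one small group.
group_counts = {"majority group": 10_000, "minority group": 50}

for group, true_count in group_counts.items():
    noisy_count = true_count + rng.laplace(scale=1.0 / epsilon)
    relative_error = abs(noisy_count - true_count) / true_count
    print(f"{group}: true={true_count}, noisy={noisy_count:.1f}, "
          f"relative error={relative_error:.2%}")
```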

Algorithmic fairness can create privacy risks#

One approach to algorithmic fairness is forcing a model to produce identical behavior across groups.

  • E.g., demographic parity.

But to accomplish this, the model must weight some data points more than others. This can lead to data leakage:

We show that fairness comes at the cost of privacy, and this cost is not distributed equally: the information leakage of fair models increases significantly on the unprivileged subgroups, which are the ones for whom we need fair learning.

Source: Chang & Shokri, 2021
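For reference, checking the fairness criterion mentioned above (demographic parity) is itself straightforward: compare positive prediction rates across groups. The sketch below uses made-up predictions and group labels.

```python
import numpy as np

# Hypothetical model predictions (1 = positive outcome) and group membership.
predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
groups      = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

# Demographic parity asks: is the positive prediction rate the same per group?
for g in np.unique(groups):
    rate = predictions[groups == g].mean()
    print(f"group {g}: positive prediction rate = {rate:.2f}")
```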

Conclusion#

  • Work in Computational Social Science has the potential for real-world impact.

  • As researchers, we must be conscious of the potential for harmful impact, and try to mitigate or avoid this harm.

  • Two pillars of ethical research are fairness and privacy.