03-Ethics (Privacy and Dual Use)#
Goals of this lecture#
Case studies 2-3: data collection and privacy violations.
Refresher: Why ethics?
Ethical issues: the fundamentals.
What legal regulations exist?
What should individual researchers do?
Trade-offs between different ethical considerations.
Case Studies#
Case Study 2: An addiction app sharing sensitive data#
Opioid addiction apps sharing data#
Telehealth apps growing in popularity:
As a result of the COVID-19 pandemic and efforts to reduce transmission in the U.S., telehealth services and apps offering opioid addiction treatment have surged in popularity.
But these apps may be accessing and sharing sensitive data:
Despite the vast reach and sensitive nature of these services, the research found that the majority of the apps accessed unique identifiers about the user’s device and, in some cases, shared that data with third parties.
What information is accessed, and how?#
Information about the device:
Five of the apps also access the devices’ phone number…and two access a user’s list of installed apps.
Information about a user’s location:
Seven of the apps request permission to make Bluetooth connections…can be used to track users in real-world locations.
Is this legal?#
Some say this practice violates laws about confidentiality of patient information, but:
While 42 CFR Part 2 recognizes the very sensitive nature of substance use disorder treatment, it doesn’t mention apps at all. Existing privacy laws are totally not up to speed.
In practice, a big problem is a lack of dedicated privacy staff:
Another likely reason for these practices is a lack of security and data privacy staff, according to Jonathan Stoltman, director at Opioid Policy Institute, which contributed to the research.
Case Study 3: A not-for-profit sharing mental health information#
The need for mental health support#
Crisis Text Line (CTL) is a non-profit suicide prevention hotline.
Connects individuals (e.g., those suffering from depression) with volunteers.
In the process, records and collects data about these conversations.
Has called this “the largest mental health data set in the world”.
This data can be used to help identify especially urgent cases.
Data sharing has serious consequences#
In many cases, sharing this data can cause harm.
In 2018, U.S. border authorities refused entry to several Canadians who had survived suicide attempts, based on information in a police database.
Privacy and mental health more generally#
Of 132 studies surveyed:
The researchers in 85 percent of the studies didn’t address, either in study design, or in reporting results, how the technologies could be used in negative ways.
Refresher: why ethics for CSS?#
Reason 1: CSS has impact#
In CSS, our work can have real-world impact.
Is this impact positive?
Does this work have the potential to cause harm?
How does the potential harm compare to current harms?
Reason 2: CSS involves hard decisions#
We also face many difficult questions.
Hard question(s) of the day:
Does this work violate privacy or have harmful potential for dual use?
What kind of regulations and enforcement mechanisms are (or should be) in place to protect against this?
What can/should I do as an individual researcher?
The data-driven world#
Increasingly, our activities leave behind a digital trace:
Digital trace data are “records of activity (trace data) undertaken through an online information system (thus, digital).”
The availability of data is part of the promise of “Big Data”.
But it also raises a number of conceptual and ethical issues.
Source: Hollingshead et al. (2021)
Data linkage#
Data linkage refers to the unification of corresponding datasets to enable a richer informational account of the unit of analysis.
For example, the CDC provides links between different data sources:
Source: Hollingshead et al. (2021)
De-anonymization#
Data linkage is a really powerful tool in our toolkit, but it also raises issues, such as de-anonymization.
De-anonymization is a data mining strategy in which anonymous data is cross-referenced with other data sources to re-identify the anonymous data source.
Sometimes, purportedly anonymized datasets can be de-anonymized using data linkage, as sketched below. This happened famously in 2008 with the Netflix Prize dataset.
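To make the idea concrete, here is a minimal sketch of a linkage attack in Python. The datasets, column names, and values are all hypothetical; the point is only that a join on shared quasi-identifiers (ZIP code, birth year, gender) can re-attach names to a “de-identified” release.

```python
import pandas as pd

# "Anonymized" release: names removed, but quasi-identifiers retained.
health_release = pd.DataFrame({
    "zip": ["92093", "92122", "92109"],
    "birth_year": [1990, 1985, 1978],
    "gender": ["F", "M", "F"],
    "diagnosis": ["depression", "opioid use disorder", "anxiety"],
})

# Public auxiliary dataset (e.g., a voter roll) that still contains names.
voter_roll = pd.DataFrame({
    "name": ["A. Jones", "B. Smith", "C. Lee"],
    "zip": ["92093", "92122", "92109"],
    "birth_year": [1990, 1985, 1978],
    "gender": ["F", "M", "F"],
})

# Joining on the quasi-identifiers re-identifies every record whose
# combination of zip, birth year, and gender is unique.
reidentified = health_release.merge(voter_roll, on=["zip", "birth_year", "gender"])
print(reidentified[["name", "diagnosis"]])
```

Real linkage attacks follow the same structure at much larger scale, which is why removing names alone is rarely sufficient.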
Differential privacy as potential solution?#
Differential privacy is an approach to protecting individual privacy when releasing statistics, models, or datasets derived from personal data.
A dataset/algorithm is “differentially private” if by looking at the output, one cannot tell whether any given individual’s data was included or not.
Differential privacy tries to mathematically formalize the notion of privacy to overcome the data linkage problem.
These “linkage attacks” motivate the need for a robust definition of privacy – one that is immune to attacks using auxiliary knowledge, including knowledge that a data curator cannot predict the availability of.
Counterpoint: differential privacy tends to decrease accuracy, as it involves injecting noise into the data or into the statistics computed from it (a minimal sketch follows below).
We will revisit this later in the lecture.
Relevant research: Latanya Sweeney’s work on de-anonymization and data linkage.
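Informally, a randomized algorithm M is ε-differentially private if, for any two datasets D and D′ that differ in one person’s record and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. A classic way to achieve this for numeric queries is the Laplace mechanism. Below is a minimal sketch for a counting query (all numbers are illustrative); note how shrinking ε buys more privacy at the cost of noisier answers, which is exactly the accuracy trade-off mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(records, epsilon=1.0, sensitivity=1.0):
    """Return a noisy count satisfying epsilon-differential privacy.

    For a counting query, adding or removing one person changes the answer
    by at most 1, so the sensitivity is 1 and the Laplace noise has scale
    sensitivity / epsilon.
    """
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(records) + noise

records = [f"user_{i}" for i in range(1000)]
print("true count:   ", len(records))
print("epsilon = 1.0:", round(dp_count(records, epsilon=1.0), 1))  # modest noise
print("epsilon = 0.1:", round(dp_count(records, epsilon=0.1), 1))  # more privacy, more noise
```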
Check-in#
Reflection: how many digital accounts do you have (social media, Amazon, Netflix, etc.)? Do you think someone could deduce your identity by linking these accounts together?
Data archiving#
The recent replicability crisis in science has prompted a movement towards “open science”.
This includes making data publicly available.
In fact, many scientific journals now require that an anonymized dataset be made available to other researchers.
While good for transparency, this also raises ethical concerns around consent.
More generally, the widespread practice of data sharing by companies means that our data may be made accessible to third parties.
Data sharing and consent#
Although participants/users may consent to some uses of their data, public availability means that other actors can use that data too:
Informational flows open the possibility that a person’s information will be used for different purposes than for those originally intended.
When it comes to companies sharing data, a big problem is “legalese” in their Terms of Service:
It is entirely plausible that for some users, a thorough reading of the ToS would result in the revocation of consent. Nevertheless, the expression of “consent” at one point in time effectively means that a digital user becomes datafied for diverse purposes.
Check-in#
Can you think of examples of how your digital data might be used for multiple purposes you didn’t consent to?
Ownership#
Who owns the data in your “digital trace”?
Consider the following kinds of data:
Location data.
Purchasing history.
Social media contacts and activity.
Chat messages.
Netflix activity.
Reflection: Who “owns” these data? Who should own it?
Not a new issue!#
Henrietta Lacks was an African-American woman whose cancer cells were used for the HeLa line, an “immortal cell line” used for scientific research.
The “HeLa line” has facilitated many discoveries, including the polio vaccine.
But Henrietta Lacks was not told about this use, nor did she consent.
This and similar controversies helped motivate protections like the Common Rule in medicine, which requires informed consent for research involving human participants and their identifiable data or tissues.
Data Privacy: Legal considerations#
Before we consider what we can/should do as individuals, it’s important to understand the landscape of legal regulations surrounding data ethics and privacy.
What regulations exist in the USA?#
Within the USA, there is no single law regulating all kinds of data.
HIPAA (Health Insurance Portability and Accountability Act): regulates how health information can be shared.
FERPA (Family Educational Rights and Privacy Act): regulates how educational information (e.g., student records) can be shared.
ECPA (Electronic Communications Privacy Act): regulates how electronic communications can be accessed and shared.
Some states, however, have more comprehensive laws.
Privacy regulations in the state of California#
The California Privacy Rights Act (CPRA) includes stipulations about data privacy.
Private right of action: individuals can sue companies for privacy violations.
Global opt-out.
remove oneself from data sharing by device or browser, instead of being forced to opt out on each site individually
Right to correct inaccurate personal information.
Right to enhanced transparency.
Regulations in the EU#
General Data Protection Regulation (GDPR): very extensive regulations about data privacy in the European Union.
Regulations include (but not limited to):
Data protection principles: limit the purpose and extent of data (i.e., data minimization).
Accountability: organizations must implement appropriate technical and organizational measures for data protection.
Many organizations must appoint a Data Protection Officer (DPO).
Consent: must be freely given, and can be withdrawn by data subjects at any time.
Privacy rights: data subjects given a set of rights.
Includes right to be informed, right to erasure, right to data portability, and more.
What do privacy advocates want?#
Advocate for various rights:
Rights over data collection: access to data that companies have collected about you (and request deletion, etc.).
Opt-in consent: opt in, not out, of data collection.
Data minimization: only collect data absolutely necessary for product.
Non-discrimination: service should not be altered for people who opt out.
This also requires some enforcement mechanism.
Data Privacy: Researcher considerations#
What can (and should) we do as individual researchers?
“Ten simple rules”#
Zook et al. (2017) describe “ten simple rules” to follow for responsible Big Data research.
Here, we will cover the rules most relevant to bias and privacy.
As we read each rule, reflect:
How can I use this rule in my own research?
Is this rule precise enough to be useful?
Rule 1: Data are people and can do harm.#
One of the most fundamental rules of responsible big data research is the steadfast recognition that most data represent or impact people.
This also connects to our lecture on bias:
Harm can also result when seemingly innocuous datasets about population-wide effects are used to shape the lives of individuals or stigmatize groups, often without procedural recourse.
Recommendation:
Start with the assumption that data are people (until proven otherwise), and use it to guide your analysis.
Rule 2: Privacy ≠ binary value#
Privacy is not just a binary value:
Looking at a single Instagram photo by an individual has different ethical implications than looking at someone’s full history of all social media posts.
Sharing is okay in some contexts and not in others:
Likewise, distributing health records is a necessary part of receiving health care, but this same sharing brings new ethical concerns when it goes beyond providers to marketers.
Recommendation:
Situate and contextualize your data to anticipate privacy breaches and minimize harm.
Rule 3: Guard against reidentification#
Many examples of “anonymized” data being de-anonymized:
When datasets thought to be anonymized are combined with other variables, it may result in unexpected reidentification, much like a chemical reaction resulting from the addition of a final ingredient.
“Big data” allows for combining many different features to identify a person:
Surprising to many, unlabeled network graphs—such as location and movement, DNA profiles, call records from mobile phone data, and even high-resolution satellite images of the earth—can be used to reidentify people.
Recommendation:
Identify possible vectors of reidentification in your data. Work to minimize them in your published results to the greatest extent possible.
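One concrete way to look for such vectors before release is to count how many records share each combination of quasi-identifiers, i.e., a simple k-anonymity check. The sketch below uses hypothetical column names; records whose combination is unique (k = 1) are the easiest to re-identify.

```python
import pandas as pd

df = pd.DataFrame({
    "zip": ["92093", "92093", "92122", "92109"],
    "age": [24, 24, 37, 52],
    "gender": ["F", "F", "M", "F"],
    "outcome": [1, 0, 1, 0],
})

quasi_identifiers = ["zip", "age", "gender"]

# k for each combination of quasi-identifiers: how many people share it.
group_sizes = df.groupby(quasi_identifiers).size().rename("k")

print(f"dataset is {group_sizes.min()}-anonymous")
print("high-risk combinations (k = 1):")
print(group_sizes[group_sizes == 1])
```

Coarsening the quasi-identifiers (e.g., reporting age bands or truncated ZIP codes) raises k, at some cost in analytic precision.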
Rule 4: Ethical data sharing#
Data sharing is an important part of research:
For example, in rare genetic disease research, biological samples are shared in the hope of finding cures, making dissemination a condition of participation.
But we must be careful about not violating the terms of our consent:
However, we caution that even when broad consent was obtained upfront, researchers should consider the best interests of the human participant, proactively considering the likelihood of privacy breaches and reidentification issues.
Recommendation:
Share data as specified in research protocols, but proactively address concerns of potential harm from informally collected big data.
Rule 8: Design for auditability#
An important part of responsible data practices is building systems that can be easily audited:
Responsible internal auditing processes flow easily into audit systems and also keep track of factors that might contribute to problematic outcomes.
One benefit of this is forcing researchers to debate the tough questions:
Designing for auditability also brings direct benefits to researchers by providing a mechanism for double-checking work and forcing oneself to be explicit about decisions, increasing understandability and replicability.
Recommendation:
Plan for and welcome audits of your big data practices.
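As one small example of what designing for auditability can mean in practice, the sketch below appends every processing step, its parameters, and a hash of its input to an append-only log file. The file name and step names are hypothetical; the point is that later audits (or your future self) can reconstruct exactly what was done and with what settings.

```python
import hashlib
import json
import time

AUDIT_LOG = "audit_log.jsonl"  # hypothetical append-only audit trail

def log_step(step_name, params, input_bytes):
    """Append one audit record describing a processing step."""
    record = {
        "timestamp": time.time(),
        "step": step_name,
        "params": params,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

raw = b"user_id,post_text\n123,hello\n"  # stand-in for the real input data
log_step("filter_non_english", {"min_confidence": 0.9}, raw)
log_step("remove_usernames", {"pattern": r"@\w+"}, raw)
```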
Data statements#
If you compile and release a dataset, you can attach a data statement.
A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.
This is one way to make your data more transparent.
What does this involve?#
Motivation for creation of dataset:
Why was it created?
What other tasks could the dataset be used for?
Dataset composition:
What does each “instance” (i.e., “row”) in the dataset represent?
How many instances are there?
Dataset maintenance:
Who is responsible for maintaining the dataset?
Who should be contacted with questions?
Legal considerations:
Were individuals informed about data collection? Did they consent?
Could this dataset expose people to harm or legal action?
Does this dataset comply with the GDPR legal standards?
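One lightweight way to operationalize this is to ship a machine-readable version of the statement alongside the dataset itself. The sketch below writes the questions above into a JSON file; every field value is a hypothetical placeholder.

```python
import json

data_statement = {
    "motivation": {
        "why_created": "Study of public discourse about telehealth apps.",
        "other_possible_uses": "Topic modeling of health-related posts.",
    },
    "composition": {
        "instance": "One public social media post (text plus timestamp).",
        "num_instances": 50000,
    },
    "maintenance": {
        "maintainer": "Lab data steward",
        "contact": "data-steward@example.edu",
    },
    "legal": {
        "individuals_informed": False,
        "consent_obtained": False,
        "potential_harms": "Re-identification of authors of sensitive posts.",
        "gdpr_notes": "Public posts only; deletion requests honored.",
    },
}

with open("data_statement.json", "w") as f:
    json.dump(data_statement, f, indent=2)
```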
Trade-offs: privacy and fairness in conflict#
In these lectures, we’ve discussed two pillars of ethical CSS:
Fairness.
Privacy.
Both of these are challenging on their own. Yet the problem is made more complex when we consider both criteria at once.
Differential privacy can have disparate impact#
As discussed before, differential privacy (DP) is an approach to minimizing the risk of de-anonymization.
Yet recent research (Bagdasaryan et al., 2019) has found that in many cases, implementing DP leads to disparate impact.
Models trained with DP have systematically lower accuracy for under-represented groups.
Accuracy of DP models drops much more for the underrepresented classes and subgroups
Why disparate impact?#
Differential privacy works by injecting noise, either into the data itself or into computations over it (e.g., model updates during training).
This typically decreases accuracy across the board.
However, because some groups are systematically under-represented, accuracy is most impacted for these groups.
Fundamentally, this connects back to issues of biased data.
Datasets often under-represent certain groups, which can lead to harmful impacts down the road.
One solution is creating more balanced datasets, as in the Gender Shades project (Buolamwini & Gebru, 2018); a toy illustration of why noise hurts small groups more appears below.
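The setup in Bagdasaryan et al. involves DP-SGD (clipping and noising gradients during training), but the intuition can be seen in a much simpler toy example: the same amount of Laplace noise that protects privacy produces a proportionally larger error for a small, under-represented group than for a large one. The group sizes and rates below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.5
true_rate = 0.3  # hypothetical true rate of some outcome in both groups

for name, n in {"majority group": 10_000, "minority group": 100}.items():
    true_count = int(n * true_rate)
    # Same noise scale for both groups, as required for the privacy guarantee.
    noisy_count = true_count + rng.laplace(scale=1.0 / epsilon)
    estimate = noisy_count / n
    rel_error = abs(estimate - true_rate) / true_rate
    print(f"{name:>14}: estimate = {estimate:.3f}, relative error = {rel_error:.1%}")
```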
Algorithmic fairness can create privacy risks#
One approach to algorithmic fairness is forcing a model to produce identical behavior across groups.
E.g., demographic parity: equal rates of positive predictions across groups (a measurement sketch follows below).
But to accomplish this, the model must weight some data points more than others. This can lead to data leakage:
We show that fairness comes at the cost of privacy, and this cost is not distributed equally: the information leakage of fair models increases significantly on the unprivileged subgroups, which are the ones for whom we need fair learning.
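As a minimal sketch, the demographic parity gap can be computed directly from model predictions; the groups and predictions below are hypothetical. Driving this gap toward zero typically requires reweighting or constraining training on the smaller group, which is where the extra leakage described in the quote above arises.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical groups (80% / 20%) and hypothetical model predictions.
group = rng.choice(["A", "B"], size=1_000, p=[0.8, 0.2])
pred = rng.binomial(1, np.where(group == "A", 0.6, 0.4))

rates = {g: pred[group == g].mean() for g in ["A", "B"]}
print("positive prediction rates:", rates)

# Demographic parity gap: difference in positive prediction rates.
print("demographic parity gap:", round(abs(rates["A"] - rates["B"]), 2))
```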
Conclusion#
Work in Computational Social Science has the potential for real-world impact.
As researchers, we must be conscious of the potential for harmful impact––and try to mitigate or avoid this harm.
Two pillars of ethical research are fairness and privacy.