SUMMARY – DATA SCIENCE AND ETHICS
Introduction
Why should we care about ethics when it comes to data science?
- It is expected by society (especially Gen Z, which cares about social justice and ethics)
- There are huge potential risks
o For humans: physical and mental well-being, privacy and discrimination
o For businesses: reputational and financial risks
- But, there are also many potential benefits to caring about ethics
o It can improve the accuracy and fairness of the data and the model, but can also be a
marketing instrument
- Digitalization in general, and the use of AI in particular, will be part of the future
Data scientists and business students are not inherently unethical, but they are not trained to think about ethics
Data science ethics
= about what is right and wrong when doing data science
Data science can be used for good intentions (reduce crime, improve medical diagnoses, increase
profitability), but also for bad ones (data leaks, discrimination)
Responsible AI = the development and application of AI that is aligned with moral values in society - about
what is right and wrong when you’re developing or using AI
Utilitarianism vs. deontology
Utilitarianism: consequentialism, focuses on the result of the act – you choose the action that results in the
highest net benefit (to you or to a group of people) because the result justifies the act
An action is moral if its consequences are – which means the theory can end up justifying immoral acts, as long as they produce a net benefit
Deontology: there are some things that you cannot do, the action should be moral as well, the ends do not
justify the means
Aristotle: “Moral behavior can be found at the mean between two extremes – excess (using it all without
any concerns for the ethical consequences) and deficiency (not using it at all)”
This can be applied to data science as well: using all available data for any possible application (without
concern for privacy, discrimination, or transparency) vs. using no data at all
Data science equilibrium
We have to find a balance between the ethical concerns and the utility of data science
The bigger the ethical concerns, the stronger the data science ethics practices needed to keep the balance
Link of the trolley problem to data science: AI is used in many self-driving cars, which may face similar life-or-death trade-offs
Data, algorithms and models
Data = facts or information, especially when examined and used to find out things or to make decisions
Algorithm = a set of rules that must be followed when solving a particular problem
Prediction or AI model = the decision-making formula, which has been learnt from data by a prediction/AI
algorithm
Personal data = data relating to an identifiable person
Behavioral data = data that provides evidence of actions that you took (e.g. Facebook likes, location data)
Sensitive data = data related to race, ethnicity, political opinion, religion, sexual orientation
FAT Flow
= a framework for data science ethics, consists of three dimensions
1) The stages in the data science process
2) Evaluation criteria
Fair, accountable, and transparent
3) Role of the human
Data subject, data scientist, manager, and model subject
FAT
Fair = treating people equally without favoritism or discrimination
1) Privacy – fair to the data subject’s privacy rights
Privacy = a state in which one is not observed or disturbed by others (= a human right)
2) Discrimination – not discriminating against sensitive groups
Accountable = required or expected to justify actions or decisions – responsible
Responsible: having an obligation to do something or having control over or care for someone as part of
one’s job or role. This obligation has three components:
1) Implement appropriate and effective measures to ensure that principles are complied with
2) Demonstrate the compliance of the measures upon request (to regulators for example)
3) Recognize potential negative consequences
Transparent = easy to perceive or detect
1) Transparency of the process
o Is crucial for fairness and accountability
2) Explainable AI – explain the thought process of the AI model
Different roles of the data science process
Data subject: the person whose data is being used
Data scientist: the person who is performing the data science
Manager: the person who manages and signs off on a data science project
Model subject: the person to whom the model is being applied
Overview of fairness and transparency concepts covered in the course
If the sample is biased, then the AI model
will be biased as well
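A minimal Python sketch of this effect (not from the course; the group names, sample sizes, and flipped-label setup are illustrative assumptions): the minority group B is undersampled during training, so the model learns group A's pattern and fails on B.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, flip):
    # One feature x; true label y = (x > 0), with the relationship flipped for group B
    x = rng.normal(size=(n, 1))
    y = (x[:, 0] > 0).astype(int)
    return x, (1 - y) if flip else y

# Biased training sample: 2000 people from group A, only 20 from group B
xa, ya = make_group(2000, flip=False)
xb, yb = make_group(20, flip=True)
model = LogisticRegression().fit(np.vstack([xa, xb]), np.concatenate([ya, yb]))

# Balanced, unseen test data: accuracy is near-perfect for A, near-zero for B
for name, flip in [("A", False), ("B", True)]:
    x, y = make_group(1000, flip)
    print(name, round(model.score(x, y), 2))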
Ethics guidelines for trustworthy AI
Trustworthy AI has three components: it should be lawful, ethical, and robust
AI act
Takes a risk-based approach: the bigger the ethical concerns raised by an AI application, the higher the
risk category it falls into and the stricter the requirements
Ethical Data Gathering
Questions that need to be considered:
- For fairness
o Fair to the data subject and model subject: is the privacy of the data subject and model
subject respected, when gathering their data?
o Fair to the model subject: is a sufficient sample included for all sensitive groups? (see the sketch after this list)
- For transparency
o Transparent to the data subject and model subject: what data is used, for what purposes,
and for how long?
o Transparent to the model subject: if A/B testing is performed, is the user aware of this and
did they give informed consent?
o Transparent to the data scientist: how is the data gathered? Was specific over- or undersampling
of certain groups considered?
o Transparent to the manager: how is the data gathered?
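A minimal Python sketch of the sample-size check flagged above, assuming a pandas DataFrame with a hypothetical sensitive column "gender" and an arbitrary minimum count:

import pandas as pd

# Toy data; in practice this would be the gathered dataset
df = pd.DataFrame({"gender": ["F", "M", "M", "F", "M", "X", "M", "F"]})

MIN_COUNT = 3  # hypothetical threshold, to be chosen per application
counts = df["gender"].value_counts()
shares = df["gender"].value_counts(normalize=True)

for group, n in counts.items():
    status = "OK" if n >= MIN_COUNT else "undersampled -> consider oversampling"
    print(f"{group}: n={n} ({shares[group]:.0%}) {status}")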
Privacy and GDPR
Privacy
- A lot of personal data is out there (e.g. location data revealing how much time we spend at the doctor's, …)
o Regulated: companies cannot use such data unless it serves a permitted purpose, such as improving their services
- A lot of personal data can be predicted (based on social media behavior, your sexual orientation can
be predicted, for example)
o The danger: very sensitive information such as pregnancy or sexual orientation can be
predicted without ever collecting that personal data directly
- Once personal data is shared online, it’s hard to make it private again
Solutions: awareness, regulations, or technology (e.g. facial recognition)
Privacy is a human right – it is about the protection of personal data
Cambridge Analytica
Was a political consulting firm that used Facebook information of over 80 million users without their
permission, which was then used for targeted political advertising
Information used: page likes, birthday, city, …
The data was obtained through an app – people were paid to take a test and provide their Facebook
information, but unwittingly, data on the users' Facebook friends was also sent along when uploading
This was against Facebook's policy – Facebook removed the app and suspended Cambridge Analytica,
but the damage had already been done
GDPR
= General Data Protection Regulation
- Privacy and data protection of European citizens, also applicable to non-European companies if they
collect data on EU citizens
- Applicable since 2018
- Fines of up to €20 million or 4% of the company's annual worldwide turnover, whichever is higher
Key concepts
1) Personal data – any information relating to an individual (whether it concerns their private, professional, or public life)
2) Anonymisation – data cannot be traced back to an individual (the data subject is not re-identifiable)
Not mentioned in the GDPR
3) Pseudonymisation – the processing of personal data in such a way that it can no longer be attributed to a
data subject without the use of additional information, e.g. through encryption
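A minimal Python sketch of pseudonymisation via keyed hashing (one possible technique, not prescribed by the GDPR; the key value and identifier are made up): without the separately stored secret key – the "additional information" – the pseudonym cannot be attributed back to the data subject.

import hmac
import hashlib

SECRET_KEY = b"store-me-separately-and-securely"  # the additional information

def pseudonymise(identifier: str) -> str:
    # The same input always maps to the same pseudonym, so records stay linkable
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

print(pseudonymise("jane.doe@example.com"))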
When does GDPR allow processing of personal data?
1) Unambiguous consent of the data subject
2) To fulfill a contract to which the data subject is party
3) Compliance with a legal obligation
4) Protection of vital interests of the data subjects
5) Performance of a task carried out in the public interest
6) Legitimate interest (subject to a balancing act between the data subject's rights and the interests of
the controller)
a. The company can process personal data in order to carry out tasks related to its business
activities
Unambiguous consent is a very complex notion – it may be better to have short terms & conditions that
people actually read and therefore accept intentionally, rather than a very long document that no one reads
and everyone simply accepts
Processing of personal data
The controller of the data shall be responsible for, and be able to demonstrate, compliance with the
principles relating to the processing of personal data
Encryption and hashing
Encryption
= to encode a message or information in such a way that only authorized persons can access it
Historically
Shift cipher (move each letter a fixed number of places down the alphabet – see the sketch after this list)
Shave the head of the messenger, write a message, let the hair grow back, and send the messenger
Enigma
o Electro-mechanical machine used by Germans in WW II
o The state of the machine is defined by settings of rotors and plugs
o Each typed letter changes the state of the machine and outputs some other letter
o Only if two machines (the sender's and the receiver's) start in the same state will they output
the same letters – and there are 10^6 possible starting states
The receiver types the encoded letters on his machine and the original
message comes out
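A minimal Python sketch of the shift cipher from the list above (the example text is made up): every letter is moved a fixed number of places down the alphabet, and shifting back decrypts.

def shift_cipher(text: str, shift: int) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave spaces and punctuation unchanged
    return "".join(out)

secret = shift_cipher("Attack at dawn", 3)  # "Dwwdfn dw gdzq"
print(shift_cipher(secret, -3))             # shifting back recovers the message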
Symmetric encryption
The same key is used for encrypting and decrypting the message
Weakness: most messages start or end with predictable words ("Hello", "Greetings", …); once you
find the key that decrypts these words, you can decrypt the whole message
DES - Data Encryption Standard
o One of the first major standards in symmetric-key encryption – 56-bit key
o Flaw: the key is too small – a brute-force attack will find it
AES - Advanced Encryption Standard
o 128-, 192-, or 256-bit keys – more secure
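A minimal Python sketch of symmetric encryption, assuming the third-party "cryptography" package is installed; its Fernet recipe uses 128-bit AES under the hood. The same key both encrypts and decrypts, so whoever holds it can do both.

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # must somehow be shared secretly with the receiver
cipher = Fernet(key)

token = cipher.encrypt(b"Greetings, this is the secret message")
print(cipher.decrypt(token))  # the same key recovers the plaintext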
Challenges
o How to share keys: they can be intercepted or overheard when sent over an insecure channel
o How to manage keys: if multiple users need to communicate with one another, many keys
have to be shared before communicating (n users need n(n-1)/2 pairwise keys – 100 users already need 4,950)