Summary

Summary for Introduction to Data Science Exam (XB_0018)

Course
Data Science (XB_0018)

Institution
Vrije Universiteit Amsterdam (VU)

Book
Python Data Science Handbook

Introduction to Data Science summary for the Data Science Minor at the VU. Information about Data Science, Introduction to Data Science, Artificial Intelligence, Computer Science, Machine learning basics, Linear Regression, Feature Engineering, Tree Based Methods, Model Validation, Neural networ...

Whole book summarized? Yes
Uploaded on 30 May 2023
Number of pages 33
Written in 2022/2023
Type Summary

data science
artificial intelligence
computer science
machine learning basics
linear regression
feature engineering
model validation
neural networks
fairness and bias
clas
introduction to data science

Seller: simonvanrens (member for 4 years, 11 documents sold)

Intro to Data Science 1

Data science vs statistics
Statistics:
- includes operations research / economic theory / standardization / quality control
- focuses on detecting and preventing anomalies & optimal price setting
- summarizes data into few key metrics to enable manual processing and analysis

Data science:
- handles large amounts of data describing individuals
- focuses on visualizing trends and variance

Data science augments subject matter expertise
- DS doesn't replace experts → it augments their expertise with knowledge derived from data
- Data scientists help others with data hacking skills

AI requires large datasets, whereas statistics and linear regression (within data science) work on both small and large datasets

DS in history
- Dr. Snow studied a cholera outbreak in 1854
- He linked cholera to drinking water by having the address of every patient recorded

Average (mean) = sum of all values / number of values
Median = middle value when you sort by value
Mode = most common value (often used for non-comparable entities, e.g. the most common password)
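
As a small illustration of these three aggregates, here is a sketch using Python's standard `statistics` module. The house prices and passwords are made-up example data, not taken from the course.

```python
from statistics import mean, median, mode

# Hypothetical example data (assumed for illustration only).
house_prices = [200, 250, 250, 300, 1000]  # in thousands of euros
passwords = ["123456", "qwerty", "123456", "letmein"]

average_price = mean(house_prices)    # sum of all values / number of values
median_price = median(house_prices)   # middle value after sorting
common_price = mode(house_prices)     # most common value
common_password = mode(passwords)     # mode also works for non-comparable values

print(average_price, median_price, common_price, common_password)
```

Note how the single expensive house pulls the average (400) well above the median (250), which is one reason to look at both.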

Correlation expresses whether the values of 2 variables are related
If the variables are related, the value of one can be used to predict the other

Correlation can be:
- negative: the variables move in opposite directions (house price and distance to city center)
- positive: the variables move in the same direction (house size and tax value)
- near-zero: there is no (linear) relation
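
The notes do not say which correlation measure is meant. A common choice is the Pearson correlation coefficient, sketched below under that assumption. The function name `pearson` and all numbers are hypothetical examples.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the 1/n factors cancel, so they are omitted).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data for the two examples in the notes.
distance_km = [1, 2, 5, 10, 20]
price_k = [500, 450, 400, 300, 200]
size_m2 = [50, 80, 100, 120, 200]
tax_value_k = [150, 240, 310, 350, 600]

print(pearson(distance_km, price_k))   # negative: price drops with distance
print(pearson(size_m2, tax_value_k))   # positive: tax value grows with size
```

The result always lies between -1 and 1. Values close to 0 indicate no linear relation.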

Intro to Data Science 2

Data science phases and problems
- Low end / starting point: data quality issues & data handling issues
- Data ready for analysis: human analysis & data science algorithms
- Deeper insights: patterns, models & heuristics
- Informed decisions: AI-driven & human-driven

Different data science problems:
- Classification
- Prediction
- Clustering
- Decision
- Recommendation
From a technical perspective these problems are very similar, as one type of problem can be converted into another
From a user perspective the problems are different, with different risks and fairness requirements

Clustering:
Divide the objects of a dataset into groups of similar objects
Groups are not predefined; the algorithm discovers the groups
Possible applications: dating, recommendation and preprocessing
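
The notes do not name a clustering algorithm. One well-known option is k-means, sketched here in one dimension as an assumption. The function `k_means_1d`, the ages and the starting centers are hypothetical. Note that k-means still needs the number of groups up front; only the group contents are discovered.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then move the centers."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Index of the center closest to this value.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups, centers

# Made-up ages of users of a hypothetical dating app.
ages = [21, 22, 23, 45, 47, 49]
groups, centers = k_means_1d(ages, centers=[20.0, 50.0])
print(groups)   # two age groups found by the algorithm
print(centers)  # the mean age of each group
```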

Decision and Classification
Classification problem: put the correct label from a finite set of labels on a data point
Potential labels for houses: monument, energy-efficient and value brackets
Decision problems are often binary classification problems: in/out, brake/no-brake, hire/fire

False negative: the algorithm says no, the truth is yes
False positive: the algorithm says yes, the truth is no
True negative: the algorithm says no, the truth is no
True positive: the algorithm says yes, the truth is yes
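
The four outcomes can be counted by comparing the algorithm's answers with the true answers. The following is a minimal sketch; the function name `confusion_counts` and the example lists are assumptions for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives/negatives for binary (True/False) labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1      # algorithm says yes, is yes
        elif guess and not truth:
            counts["FP"] += 1      # algorithm says yes, is no
        elif not guess and truth:
            counts["FN"] += 1      # algorithm says no, is yes
        else:
            counts["TN"] += 1      # algorithm says no, is no
    return counts

# Made-up example: the truth versus what the algorithm said.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, True, False, False, False, True]
counts = confusion_counts(actual, predicted)
print(counts)
```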

Precision and Recall
Precision is important if the cost of a false positive is high
For instance: hiring for popular positions, or high-risk treatment in non-urgent cases
Recall is important if the cost of a false negative is high
For instance: medical screening
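
The notes do not spell out the formulas. The standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN). A short sketch with made-up counts for a hypothetical screening test:

```python
def precision(tp, fp):
    """Of everything the algorithm labelled 'yes', which fraction really was 'yes'?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything that really was 'yes', which fraction did the algorithm find?"""
    return tp / (tp + fn)

# Hypothetical counts (assumed for illustration).
tp, fp, fn = 8, 2, 8

screening_precision = precision(tp, fp)
screening_recall = recall(tp, fn)
print(screening_precision, screening_recall)
```

With these counts, most "yes" answers are correct (high precision), but half of the real cases are missed (low recall), which would be a problem in medical screening.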

Prediction

In a prediction problem you must find a missing value based on the available data
To build a prediction algorithm, you need example data with the correct answer → supervised learning
In prediction, you evaluate models based on some total / average error
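
The notes leave open which error is used. Two common choices, assumed here, are the mean absolute error (MAE) and the mean squared error (MSE). The prices below are made-up example data.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between truth and prediction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of the squared differences; punishes large misses more heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical house prices (in thousands of euros) and a model's predictions.
actual_prices = [300, 250, 400]
predicted_prices = [310, 240, 370]

mae = mean_absolute_error(actual_prices, predicted_prices)
mse = mean_squared_error(actual_prices, predicted_prices)
print(mae, mse)
```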

Recommendation
In recommendation you have a large number of items and must select a few top results
You could predict the relevance of all items, sort them and select the top
Results should be diverse, dynamic and perhaps surprising / inspiring, not just accurate → customers care about conversion
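
The "predict, sort, select top" approach can be sketched as follows. The relevance scores are hypothetical; in practice they would come from a prediction model. This sketch covers only the accuracy part and does nothing for diversity or surprise.

```python
def recommend(relevance_by_item, k):
    """Return the k items with the highest predicted relevance."""
    ranked = sorted(relevance_by_item, key=relevance_by_item.get, reverse=True)
    return ranked[:k]

# Made-up predicted relevance scores for one customer.
predicted_relevance = {
    "thriller": 0.91,
    "cookbook": 0.15,
    "sci-fi": 0.78,
    "biography": 0.40,
}

top_two = recommend(predicted_relevance, k=2)
print(top_two)
```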

Types of value
Categorical → Ordinal → Numerical

Categorical (least structured):
- Column can take multiple values with no further structure:
  - Yes/no
  - Red/blue/green
- Supported aggregates:
  - Mode

Ordinal:
- Column can take multiple values which are comparable and thus sortable:
  - Very low / average / high
  - B-, B, A, AA, AAA
- Supported aggregates:
  - Mode
  - Median, percentile, min/max

Numerical (most structured):
- Column can take multiple values that can be compared, averaged and subtracted (aka numbers):
  - 0%-100%
  - 0.0 – 10.0
- Supported aggregates:
  - Mode (if few values)
  - Median, percentile, min/max
  - Normal average, geometric average, average without outliers

Letter grades (A–F): ordinal, convertible to numerical
Three-valued logic (yes, no, maybe): ordinal
Colors of the rainbow: ordinal, but often treated as categorical
MBTI personality profiles: categorical
1-10 + unknown: numerical, ignoring unknown
Project status (successful, challenged, failed): ordinal
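
To illustrate which aggregates an ordinal column supports, here is a sketch using the rating scale B-, B, A, AA, AAA from the list above. The observed ratings are made up. The values are mapped to their rank so they can be sorted; an average is deliberately not computed, because the distance between two ratings is undefined.

```python
from statistics import median_low, mode

# The order of the scale, from lowest to highest.
SCALE = ["B-", "B", "A", "AA", "AAA"]

# Hypothetical observed ratings.
ratings = ["B", "AAA", "A", "B", "AA"]

# Translate each rating to its position on the scale so it becomes sortable.
ranks = [SCALE.index(r) for r in ratings]

most_common = mode(ratings)                  # mode needs no ordering at all
median_rating = SCALE[median_low(ranks)]     # median needs only the ordering
lowest, highest = SCALE[min(ranks)], SCALE[max(ranks)]

print(most_common, median_rating, lowest, highest)
```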

What is a normal distribution?

The normal distribution is a very special probability distribution with a clear center and a wide base
Like a perfect circle, it does not often occur in nature

Many theorems / algorithms assume that data is normally distributed.
- The following rule is often assumed:
  - Outcomes more than 2 standard deviations away from the mean occur in less than 5% of cases
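
The "less than 5%" figure can be checked for an exactly normal distribution with `statistics.NormalDist` from the Python standard library. This is a sketch for the ideal case only; real data that is merely close to normal may deviate.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Probability of landing more than 2 standard deviations above the mean,
# doubled because the distribution is symmetric around the mean.
p_outside_2_std = 2 * (1 - standard_normal.cdf(2))

print(f"{p_outside_2_std:.4f}")  # about 0.0455, i.e. just under 5%
```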

Why are many distributions similar to normal?
The central limit theorem
The central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.
In many board games it is important to understand that the sum of 2 dice is more likely to be 6 or 7 than 2 or 12. This happens because you are adding 2 independent random variables
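
The dice claim can be verified by enumerating all 36 equally likely outcomes of two fair six-sided dice, as in this short sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how often each total occurs over all 36 combinations of two dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Exact probability of each total.
probabilities = {total: Fraction(n, 36) for total, n in counts.items()}

for total in sorted(probabilities):
    print(total, probabilities[total])
```

A single die is uniform, but the sum of two already peaks in the middle (7 is six times as likely as 2 or 12). Adding more dice brings the shape closer and closer to a normal distribution.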

Other distributions:
Uniform distribution: all values are equally likely (occurs when data is generated, or when you have not yet discovered the structure)

Poisson distribution: counts how often a random event occurs in a fixed interval; the related waiting times between events cannot be below 0 (occurs when: predicting the next event)

Intro to Data Science 3
Why data visualization?
To help data scientists with the data science:
- Understand what data is in the set
- Show data quality issues
- Support the search for patterns / features

To help communicate the data science outcomes to others:
- Good data visualizations do not just show the right answer: they convince the audience of an important message
- Data visualizations are therefore important to get your advice implemented and your work valued

Not the same goals, but overlap: being transparent in what you did and how you found
patterns will make you more convincing

Data visualization is needed to create convincing reports
- Client organizations have multiple people who all need to be convinced. This includes people not present at the meeting
- Business decisions are often so drastic that decision makers need to be sure. They need multiple arguments before they are convinced.
