Summary

Summary Data Science Research Methods (JBM025)

Institution
Technische Universiteit Eindhoven (TUE)

Summary of the course Data Science Research Methods (JBM025) from the major Data Science in Eindhoven and Tilburg. This course has two parts. The first part focuses on the scientific method and design of experiments (DOE). The second part focuses on econometrics and builds upon what is discussed ...

Uploaded on 26 June 2022
Number of pages: 26
Written in 2021/2022
Type: Summary

Seller: NienkeUr

DATA SCIENCE RESEARCH METHODS
CONTENTS

The scientific method
  Six Sigma

Sample size determination
  Minimal sample sizes
    Normal distribution
    Binomial distribution
    When σ or p is not known
  Power analysis
    Normal distribution
    Binomial distribution

Analysis of variance (ANOVA)
  ANOVA table

ANOVA – power and multiple comparisons
  ANOVA power
  Multiple comparisons
    Fisher Least Significance Difference (LSD)
    Tukey’s Honest Significant Difference (HSD)

Two-factor designs and blocking

Full factorial designs
  DOE: how to determine whether an individual factor is of importance
  Blocking with 2 factors

Fractional factorial designs
  Fractional experiments
  Fractional factorials

Response Surface Optimisation
  Improvement Efficiently: finding near-optimal factor settings
    Box/Simplex method
    Steepest ascent/descent method
  Quadratic models
  Response surface designs
    Central Composite Design (CCD)
    Box-Behnken Design
  Deriving optimal settings
    Optimums
    Optimisation scheme

Econometrics for data scientists
  Random variables
  Regressions
    Bivariate and multivariate regressions
    Ordinary least squares (OLS)
    Instrumental variable estimation

Causality and selection
  Causality
  Selection and selection bias
  Regression and randomized experiments
  Potential problems with experiments

Selection on observables and matching
  Matching
  3 methods of matching
    Exact matching
    Matching based on closeness of observables
    Propensity score matching
  OLS estimator as matching estimator
  Flexible OLS as matching estimator

Differences-in-differences estimation
  Some important details
  Generalization

Regression Discontinuity design (RDD)
  Sharp regression discontinuity design
    Main idea and interpretation
    Estimation of the treatment effect in Sharp RDD
      Approach 1
      Approach 2
  Fuzzy regression discontinuity design
    Estimation of the fuzzy RD
    Alternative to this estimation
  Specification testing

THE SCIENTIFIC METHOD

Key concepts
• Scientific method
• Experiment
• Factor
• Independent variable
• Six Sigma

What should you be able to do?
• Link elements of Six Sigma to the scientific method
• Translate a case study in terms of independent variables (factors) and dependent variables
• Be able to distinguish, in a specific data science context, which of the three basic goals is relevant

Key insights
• It is important to identify which of the three different data science goals is relevant in a given context
• The scientific method is an iterative process
• If you do not plan an experiment well in advance, no statistical analysis may be able to yield the hoped-for results
• Experiments may involve several factors, each of which may have more than 2 levels
• The scientific method is also very useful in industry
• The Six Sigma approach in industry has incorporated several aspects of the scientific method

Data science has three goals:
1. Description
2. Prediction
3. Explanation

Business has similar distinctions regarding analytics:
1. Descriptive analytics provide insight into the past
2. Predictive analytics provide understanding of the future
3. Prescriptive analytics advise on the possible outcomes

Basic elements of the (iterative) scientific method
1. Formulate a question
2. Perform background research
3. Formulate the hypothesis (answer)
4. Determine the logical consequences of the hypothesis
5. Collect observations (experiment)
6. Test the truth of the hypothesis by analysing observations (statistics)
7. Report the results
8. If the hypothesis is not confirmed, go back to 2

Steps in experimentation
1. Plan the experiment
2. Design the experiment
3. Perform the experiment
4. Analyse the resulting data
5. Confirm the results
6. Evaluate the conclusion

There are a number of valid reasons for the iterative approach:
1. New insights were obtained after analysing the experiment
2. New questions arose from the experiment
3. The hypotheses were built upon wrong assumptions
The iterative nature means that, if a hypothesis is refuted by the experiment, you should start over: form a new hypothesis and run a new experiment to test it. This iteration should be repeated until it is no longer necessary.

SIX SIGMA

Six Sigma: A disciplined, data-driven methodology for process improvement.
It is a combination of quality management tools and the statistical method.
DMAIC: The circular problem-solving approach of Six Sigma.
Its steps correspond to steps in experimentation of the scientific method:
Define (1, 2) – Measure (3) – Analyse (4) – Improve ( ) – Control ( )

DMAIC also uses the principles of the scientific method:
1. The DMAIC cycle uses the same iterative discovery cycle
2. It puts emphasis on doing well-defined experiments to discover new insights
3. It’s data driven and puts emphasis on quantification
4. It looks for causal relationships
5. It puts emphasis on proper verification and validation of results

SAMPLE SIZE DETERMINATION
How much data do I need to collect?

Key concepts
• p-value
• hypothesis tests
• width of the confidence interval
• power
• minimal sample size

What should you be able to do?
• Compute the minimal sample size in terms of CI width when you are given the formula (normal, binomial)
• Compute the minimal sample size in terms of power when you are given the formula (normal, binomial)
• Compute minimal sample sizes when given a simple confidence or power formula for a distribution

Key insights
• The absolute error parameter is the half-width of the CI in the case of symmetric CIs
• CI width in binomial and normal distributions leads to the minimal sample size
• Minimal sample size determination in binomial cases requires extra information on the success probability $p$

There are three basic ways of hypothesis testing:
1. Is the test statistic in the critical region (yes/no): this does not provide a lot of information
2. P-values: these allow people to choose their own $\alpha$ value
3. Confidence intervals: these give insight into how uncertain we are about the prediction
$(\hat{\theta} - c,\ \hat{\theta} + c)$ is a $100(1-\alpha)\%$ CI when $P(\hat{\theta} - c < \theta < \hat{\theta} + c) = 1 - \alpha$

Type I error (false positives)
$\alpha$: the probability of rejecting $H_0$ when $H_0$ is true. $1 - \alpha$ is the true negative rate (not rejecting $H_0$ when it is true).
Type II error (false negatives)
$\beta$: the probability of not rejecting $H_0$ when $H_0$ is false.
Power (true positives)
$1 - \beta$: the probability of rejecting $H_0$ when $H_0$ is false.
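
To make the link between the significance level, the Type II error and power concrete, the sketch below computes the power of a two-sided z-test on a normal mean with known standard deviation. The function name `z_test_power` and its parameters (`mu0` for the mean under the null hypothesis, `mu1` for an assumed true mean) are illustrative choices, not part of the course material, and the two-sided z-test setting is an assumption made for the example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a two-sided z-test when the true mean is mu1 (hypothetical helper)."""
    std_normal = NormalDist()
    # Critical value z_{alpha/2}: reject H0 when |T| exceeds it.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, T ~ N(shift, 1) with this shift.
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    # P(T > z_crit) + P(T < -z_crit) under the alternative.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Power = 1 - beta, so the Type II error follows directly.
power = z_test_power(mu0=0.0, mu1=0.5, sigma=1.0, n=25)
beta = 1 - power
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

When `mu1` equals `mu0` the function returns the significance level itself, which matches the definition of the Type I error.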

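Z-tests (Normal distribution)
$X_i \sim N(\mu, \sigma^2)$ + independence
$H_0: \mu = \mu_0$
$H_a: \mu \neq \mu_0$
Significance level $\alpha$
Test statistic: $T = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
Decision rule: reject $H_0$ if $|T| > z_{\alpha/2}$, where $T \sim N(0, 1)$ under $H_0$ (equivalently, $\bar{X} - \mu_0 \sim N(0, \sigma^2/n)$)
$100(1-\alpha)\%$ CI for $\mu$: $\left(\bar{x} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right)$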
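
As a minimal sketch of the z-test and its confidence interval, assuming the standard deviation is known and the observations are independent and normal, the function below returns the test statistic, the reject/do-not-reject decision and the CI. The name `z_test` and the example data are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test for H0: mu = mu0 with known sigma (hypothetical helper)."""
    n = len(sample)
    x_bar = fmean(sample)
    standard_error = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standardised test statistic T = (x_bar - mu0) / (sigma / sqrt(n)).
    t = (x_bar - mu0) / standard_error
    reject = abs(t) > z_crit
    # 100(1 - alpha)% CI: x_bar +/- z_{alpha/2} * sigma / sqrt(n).
    ci = (x_bar - z_crit * standard_error, x_bar + z_crit * standard_error)
    return t, reject, ci


# Illustrative data only, not taken from the course.
t, reject, ci = z_test([10.0, 12.0, 11.0, 13.0], mu0=8.0, sigma=2.0)
print(t, reject, ci)
```

The decision rule and the CI agree: the null hypothesis is rejected exactly when the hypothesised mean falls outside the interval.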

MINIMAL SAMPLE SIZES

The formula to calculate the minimal sample size can be derived from the confidence interval.
The formula for the half-width returns the error ($E$); this can then be rewritten to calculate $n$.
The formulas to calculate the sample size have a similar form: $n \geq \left\lceil \left(\dfrac{z_{\alpha/2}}{E}\right)^2 \sigma^2 \right\rceil$

If the deviation is not absolute but relative to the expected value $\sigma$ (e.g. $p$ of the response time), then $E = p \times \sigma$
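
The general form of the sample size formula can be sketched in a few lines. The function name `min_sample_size` and the example numbers are assumptions for illustration; the calculation itself is the rounded-up expression $(z_{\alpha/2}/E)^2 \sigma^2$.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(sigma, error, alpha=0.05):
    """Smallest n with CI half-width at most `error`, for known sigma (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # n >= ceil((z_{alpha/2} / E)^2 * sigma^2)
    return ceil((z / error) ** 2 * sigma ** 2)


# Illustrative numbers: sigma = 10, allowed error E = 2, 95% confidence.
print(min_sample_size(sigma=10.0, error=2.0))  # prints 97
```

Halving the allowed error roughly quadruples the required sample size, because $E$ enters the formula squared.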

NORMAL DISTRIBUTION

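One-sample (if $\sigma$ is known)
CI: $\bar{x} \pm z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$
Error: $E \geq z_{\alpha/2} \times \dfrac{\sigma}{\sqrt{n}}$
Sample size: $n \geq \left(\dfrac{z_{\alpha/2}}{E}\right)^2 \sigma^2$

Two-sample (if the $\sigma$s are known, and $n_1 = n_2 = n$)
CI: $\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2 + \sigma_2^2}{n}}$
Sample size: $E \geq z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2 + \sigma_2^2}{n}} \Rightarrow n \geq \left(\dfrac{z_{\alpha/2}}{E}\right)^2 (\sigma_1^2 + \sigma_2^2)$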
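
For the two-sample case with equal group sizes, a minimal sketch of the per-group sample size follows. The function names and the example values are illustrative assumptions; the second helper only checks that the resulting half-width stays within the allowed error.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_sample_size_two_sample(sigma1, sigma2, error, alpha=0.05):
    """Per-group n for a difference of means, known sigmas, n1 = n2 = n (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # n >= (z_{alpha/2} / E)^2 * (sigma1^2 + sigma2^2), rounded up to an integer.
    return ceil((z / error) ** 2 * (sigma1 ** 2 + sigma2 ** 2))


def half_width_two_sample(sigma1, sigma2, n, alpha=0.05):
    """CI half-width z_{alpha/2} * sqrt((sigma1^2 + sigma2^2) / n) (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt((sigma1 ** 2 + sigma2 ** 2) / n)


# Illustrative numbers: both sigmas equal to 10, allowed error E = 2.
n = min_sample_size_two_sample(10.0, 10.0, error=2.0)
print(n, half_width_two_sample(10.0, 10.0, n))
```

With equal standard deviations, each group needs about twice as many observations as the one-sample case for the same error, since the variance of the difference is the sum of both variances.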
