Summary on the course Data Science Research Methods (JBM025) from the major Data Science in Eindhoven and Tilburg. This course has two parts. The first part focusses on the scientific method and design of experiments (DOE). The second part focusses on econometrics and builds upon what is discussed ...
The scientific method
  Six Sigma
Sample size determination
  Minimal sample sizes
  Normal distribution
  Binomial distribution
  When σ or p is not known
  Power analysis
  Normal distribution
  Binomial distribution
Analysis of variance (ANOVA)
  ANOVA table
ANOVA – power and multiple comparisons
  ANOVA power
  Multiple comparisons
  Fisher Least Significance Difference (LSD)
  Tukey’s Honest Significant Difference (HSD)
Two-factor designs and blocking
Full factorial designs
  DOE: how to determine whether an individual factor is of importance
  Blocking with 2 factors
Fractional Factorial designs
  Fractional experiments
  Fractional factorials
Response Surface Optimisation
  Improvement Efficiently: finding near-optimal factor settings
  Box/Simplex method
  Steepest ascent/descent method
  Quadratic models
  Response surface designs
  Central Composite Design (CCD)
  Box-Behnken Design
Deriving optimal settings
  Optimums
  Optimisation scheme
Econometrics for data scientists
  Random variables
  Regressions
  Bivariate and multivariate regressions
  Ordinary least squares (OLS)
  Instrumental variable estimation
Causality and selection
  Causality
  Selection and selection bias
  Regression and randomized experiments
  Potential problems with experiments
Selection on observables and matching
  Matching
  3 methods of matching
  Exact matching
  Matching based on closeness of observables
  Propensity score matching
  OLS estimator as matching estimator
  Flexible OLS as matching estimator
Differences-in-differences estimation
  Some important details
  Generalization
Regression Discontinuity design (RDD)
  Sharp regression discontinuity design
  Main idea and interpretation
  Estimation of the treatment effect in Sharp RDD
  Approach 1
  Approach 2
  Fuzzy regression discontinuity design
  Estimation the fuzzy RD
  Alternative to this estimation
  Specification testing
THE SCIENTIFIC METHOD
Key concepts
Scientific method
Experiment
Factor
Independent variable
Six Sigma
What should you be able to do?
Link elements of Six Sigma to the scientific method
Translate a case study in terms of independent variables (factors) and dependent variables
Be able to distinguish, in a specific data science context, which of the three basic goals is relevant
Key insights
It is important to identify which of the three data science goals is relevant in a given context
The scientific method is an iterative process
If you do not plan an experiment well in advance, no statistical analysis may be able to yield the hoped-for results
Experiments may involve several factors, each of which may have more than 2 levels
The scientific method is also very useful in industry
The Six Sigma approach in industry has incorporated several aspects of the scientific method.
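To make the insight about several factors with more than 2 levels concrete, the sketch below enumerates every run of an experiment in which all combinations of factor levels are tried. The factor names and levels are hypothetical and chosen only for illustration; `full_factorial` is an illustrative helper, not a function from the course.

```python
from itertools import product

# Hypothetical factors and their levels (illustration only).
factors = {
    "temperature": [150, 175, 200],      # 3 levels
    "pressure": ["low", "high"],         # 2 levels
    "catalyst": ["A", "B", "C", "D"],    # 4 levels
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


runs = full_factorial(factors)
print(len(runs))   # 3 * 2 * 4 = 24 runs
print(runs[0])     # first combination of levels
```

The number of runs is the product of the numbers of levels, which is why planning the experiment in advance matters: adding a factor or a level multiplies the amount of data to collect.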
Data science has three goals:
1. Description
2. Prediction
3. Explanation
Business has similar distinctions regarding analytics:
1. Descriptive analytics provide insight into the past
2. Predictive analytics provide understanding of the future
3. Prescriptive analytics advise on the possible outcomes
Basic elements of the (iterative) scientific method
1. Formulate a question
2. Perform background research
3. Formulate the hypothesis (answer)
4. Determine the logical consequences of the hypothesis
5. Collect observations (experiment)
6. Test the truth of the hypothesis by analysing observations (statistics)
7. Report the results
8. If the hypothesis is not confirmed, go back to 2
Steps in experimentation
1. Plan the experiment
2. Design the experiment
3. Perform the experiment
4. Analyse the resulting data
5. Confirm the results
6. Evaluate the conclusion
There are a number of valid reasons for the iterative approach:
1. New insights were obtained after analysing the experiment
2. New questions arose from the experiment
3. The hypotheses were built upon wrong assumptions
The iterative nature means that, if a hypothesis is refuted by the experiment, you should start over, form a new
hypothesis, and test it. This iteration is repeated until it is no longer necessary.
SIX SIGMA
Six Sigma: A disciplined, data-driven methodology for process improvement.
It is a combination of quality management tools and the statistical method.
DMAIC: The circular problem-solving approach of Six Sigma.
Its steps correspond to steps in experimentation of the scientific method:
Define (1, 2) – Measure (3) – Analyse (4) – Improve ( ) – Control ( )
DMAIC also uses the principles of the scientific method:
1. The DMAIC cycle uses the same iterative discovery cycle
2. It puts emphasis on doing well-defined experiments to discover new insights
3. It’s data driven and puts emphasis on quantification
4. It looks for causal relationships
5. It puts emphasis on proper verification and validation of results
SAMPLE SIZE DETERMINATION
How much data do I need to collect?
Key concepts
p-value
Hypothesis tests
Width of the confidence interval
Power
Minimal sample size
What should you be able to do?
Compute the minimal sample size in terms of CI width when you are given the formula (normal, binomial)
Compute the minimal sample size in terms of power when you are given the formula (normal, binomial)
Compute minimal sample sizes when given a simple confidence or power formula for a distribution
Key insights
The absolute error parameter is the half-width of the CI in case of symmetric CIs
Requiring a given CI width for binomial and normal distributions leads to the minimal sample size
Minimal sample size determination in binomial cases requires extra information on the success probability 𝑝
There are three basic ways of hypothesis testing:
1. Is the test statistic in the critical region (yes/no)? This does not provide a lot of information
2. P-values: allow people to choose their own 𝛼 value
3. Confidence intervals: give insight into how uncertain we are about the prediction
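The three approaches can be compared on the same data. The sketch below assumes a two-sided z-test for a mean with known standard deviation, which is a simplification chosen for illustration; the function name `z_test_three_ways` and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_three_ways(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0, reported in three ways.

    Assumes a normal population with known sigma (illustrative setting).
    """
    std_normal = NormalDist()
    se = sigma / sqrt(n)                        # standard error of the mean
    z = (xbar - mu0) / se                       # test statistic
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}

    # 1. Critical region: a yes/no answer only.
    in_critical_region = abs(z) > z_crit
    # 2. P-value: the reader can compare it with any alpha.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # 3. Confidence interval: estimate plus/minus the half-width c.
    c = z_crit * se
    ci = (xbar - c, xbar + c)
    return in_critical_region, p_value, ci


rejected, p_value, ci = z_test_three_ways(xbar=10.4, mu0=10.0, sigma=1.0, n=25)
print(rejected, round(p_value, 4), ci)
```

All three agree on the decision: the statistic falls in the critical region exactly when the p-value is below 𝛼 and when the hypothesised value lies outside the confidence interval. They differ only in how much information they report.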
$(\hat{\theta} - c,\ \hat{\theta} + c)$ is a $100(1 - \alpha)\%$ CI when $P(\hat{\theta} - c < \theta < \hat{\theta} + c) = 1 - \alpha$
Type I error (false positive)
$\alpha$: the probability of rejecting $H_0$ when $H_0$ is true. $1 - \alpha$ is the probability of a true negative (not rejecting $H_0$ when it is true)
Type II error (false negative)
$\beta$: the probability of not rejecting $H_0$ when $H_0$ is false
Power (true positive)
$1 - \beta$: the probability of rejecting $H_0$ when $H_0$ is false
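As a small illustration of how the type I error, type II error and power relate, the sketch below computes the power of a one-sided z-test analytically. It assumes a normal population with known standard deviation and a true mean that lies `delta` above the value under the null hypothesis; the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of the z-test of H0: mu = mu0 against H1: mu > mu0.

    delta is the true shift (mu - mu0); sigma is assumed known.
    Under H1 the test statistic is normal with mean delta * sqrt(n) / sigma
    and unit variance, so power = 1 - Phi(z_alpha - delta * sqrt(n) / sigma).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # one-sided critical value
    shift = delta * sqrt(n) / sigma
    return 1 - std_normal.cdf(z_alpha - shift)


power = power_one_sided_z(delta=0.5, sigma=1.0, n=25)
beta = 1 - power  # probability of a type II error
print(round(power, 3), round(beta, 3))  # power is roughly 0.80 here
```

With `delta=0` the function returns `alpha`, since rejecting a true null hypothesis is exactly the type I error. Power grows with the sample size and with the size of the true shift.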
The formula to calculate the minimal sample size can be derived from the confidence interval.
The formula for the half-width gives the error ($E$), which can then be rewritten to solve for $n$.
The formulas to calculate the sample size have a similar form: $n \geq \left\lceil \left( \frac{z_{\alpha/2}}{E} \right)^2 \sigma^2 \right\rceil$
If the deviation is not absolute but relative to the expected value 𝜎 (e.g. p of the response time), then 𝐸 = 𝑝 × 𝜎
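The general form of the sample size formula can be evaluated directly. The sketch below assumes a normal population with known standard deviation and an absolute error `error` (the CI half-width); the function name `min_sample_size` and the example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(sigma, error, alpha=0.05):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return ceil((z / error) ** 2 * sigma ** 2)


# Example: sigma = 2, desired half-width E = 0.5, 95% confidence.
n = min_sample_size(sigma=2.0, error=0.5)
print(n)  # 62
```

Halving the allowed error quadruples the required sample size, because $E$ enters the formula squared.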
NORMAL DISTRIBUTION
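One-sample (if $\sigma$ is known)
CI: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Error: $E \geq z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$
Sample size: $n \geq \left( \frac{z_{\alpha/2}}{E} \right)^2 \sigma^2$
Two-sample (if the $\sigma$s are known, and $n_1 = n_2 = n$)
CI: $\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{n}}$
Error: $E \geq z_{\alpha/2} \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{n}}$
Sample size: $n \geq \left( \frac{z_{\alpha/2}}{E} \right)^2 (\sigma_1^2 + \sigma_2^2)$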
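The two-sample case with equal group sizes can be sketched in the same way. The code assumes both standard deviations are known; the function names and the example values are hypothetical, and the result is rounded up because the sample size must be a whole number.

```python
from math import ceil, sqrt
from statistics import NormalDist


def half_width_two_sample(sigma1, sigma2, n, alpha=0.05):
    """Half-width of the CI for mu1 - mu2 with n observations per group."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt((sigma1 ** 2 + sigma2 ** 2) / n)


def min_sample_size_two_sample(sigma1, sigma2, error, alpha=0.05):
    """Smallest n per group such that the half-width is at most error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z / error) ** 2 * (sigma1 ** 2 + sigma2 ** 2))


# Example: sigma1 = 2, sigma2 = 3, desired half-width E = 1, 95% confidence.
n_per_group = min_sample_size_two_sample(2.0, 3.0, error=1.0)
print(n_per_group)  # 50
print(half_width_two_sample(2.0, 3.0, n_per_group))  # just below 1
```

Checking the half-width at `n_per_group` and at `n_per_group - 1` is a quick way to confirm that the rounded value is indeed the minimum.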
This summary is offered on the Stuvia marketplace by seller NienkeUr.