Summary of the course Data Science Research Methods (JBM025) from the major Data Science in Eindhoven and Tilburg. This course has two parts. The first part focusses on the scientific method and design of experiments (DOE). The second part focusses on econometrics and builds upon what is discussed ...
The scientific method
    Six Sigma
Sample size determination
    Minimal sample sizes
    Normal distribution
    Binomial distribution
    When 𝝈 or 𝒑 is not known
    Power analysis
    Normal distribution
    Binomial distribution
Analysis of variance (ANOVA)
    ANOVA table
ANOVA – power and multiple comparisons
    ANOVA power
    Multiple comparisons
    Fisher Least Significance Difference (LSD)
    Tukey’s Honest Significant Difference (HSD)
Two-factor designs and blocking
Full factorial designs
    DOE: how to determine whether an individual factor is of importance
    Blocking with 2 factors
Fractional factorial designs
    Fractional experiments
    Fractional factorials
Response Surface Optimisation
    Improvement Efficiently: finding near-optimal factor settings
    Box/Simplex method
    Steepest ascent/descent method
    Quadratic models
    Response surface designs
    Central Composite Design (CCD)
    Box-Behnken Design
    Deriving optimal settings
    Optimums
    Optimisation scheme
Econometrics for data scientists
    Random variables
    Regressions
    Bivariate and multivariate regressions
    Ordinary least squares (OLS)
    Instrumental variable estimation
Causality and selection
    Causality
    Selection and selection bias
    Regression and randomized experiments
    Potential problems with experiments
Selection on observables and matching
    Matching
    3 methods of matching
    Exact matching
    Matching based on closeness of observables
    Propensity score matching
    OLS estimator as matching estimator
    Flexible OLS as matching estimator
Differences-in-differences estimation
    Some important details
    Generalization
Regression Discontinuity design (RDD)
    Sharp regression discontinuity design
    Main idea and interpretation
    Estimation of the treatment effect in Sharp RDD
    Approach 1
    Approach 2
    Fuzzy regression discontinuity design
    Estimation of the fuzzy RD
    Alternative to this estimation
    Specification testing
THE SCIENTIFIC METHOD
Key concepts
Scientific method
Experiment
Factor
Independent variable
Six Sigma
What should you be able to do?
Link elements of Six Sigma to the scientific method
Translate a case study into independent variables (factors) and dependent variables
Distinguish, in a specific data science context, which of the three basic goals is relevant
Key insights
It is important to identify which of the three data science goals is relevant in a given context
The scientific method is an iterative process
If you do not plan an experiment well in advance, no statistical analysis may be able to yield the hoped-for results
Experiments may involve several factors, each of which may have more than 2 levels
The scientific method is also very useful in industry
The Six Sigma approach in industry has incorporated several aspects of the scientific method.
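To make the remark about several factors with more than two levels concrete, the sketch below enumerates every combination of factor levels for a small experiment. The factor names and levels are hypothetical and chosen only for illustration; they do not come from the course material.

```python
from itertools import product

# Hypothetical factors (independent variables) and their levels.
factors = {
    "temperature": ["low", "medium", "high"],  # 3 levels
    "catalyst": ["A", "B"],                    # 2 levels
    "pressure": [1, 2, 3, 4],                  # 4 levels
}

# Each combination of one level per factor is one experimental condition.
runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]

print(len(runs))  # number of conditions: 3 * 2 * 4
print(runs[0])    # first condition in the enumeration
```

The number of conditions grows multiplicatively with the number of levels per factor, which is one reason why planning the experiment in advance matters.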
Data science has three goals:
1. Description
2. Prediction
3. Explanation
Business has similar distinctions regarding analytics:
1. Descriptive analytics provide insight into the past
2. Predictive analytics provide understanding of the future
3. Prescriptive analytics advise on the possible outcomes
Basic elements of the (iterative) scientific method
1. Formulate a question
2. Perform background research
3. Formulate the hypothesis (answer)
4. Determine the logical consequences of the hypothesis
5. Collect observations (experiment)
6. Test the truth of the hypothesis by analysing observations (statistics)
7. Report the results
8. If the hypothesis is not confirmed, go back to 2
Steps in experimentation
1. Plan the experiment
2. Design the experiment
3. Perform the experiment
4. Analyse the resulting data
5. Confirm the results
6. Evaluate the conclusion
There are a number of valid reasons for the iterative approach:
1. New insights were obtained after analysing the experiment
2. New questions arose from the experiment
3. The hypotheses were built upon wrong assumptions
The iterative nature means that, if a hypothesis is refuted by the experiment, you should start over: form
a new hypothesis and test it with a new experiment. This iteration should be repeated until it is no longer necessary.
SIX SIGMA
Six Sigma A disciplined, data-driven methodology for process improvement.
It is a combination of quality management tools and the statistical method
DMAIC The circular problem-solving approach of Six Sigma.
Its steps correspond to steps in experimentation of the scientific method:
Define (𝟏, 𝟐) – Measure (𝟑) – Analyse (𝟒) – Improve ( ) – Control ( )
DMAIC also uses the principles of the scientific method:
1. The DMAIC cycle uses the same iterative discovery cycle
2. It puts emphasis on doing well-defined experiments to discover new insights
3. It’s data driven and puts emphasis on quantification
4. It looks for causal relationships
5. It puts emphasis on proper verification and validation of results
SAMPLE SIZE DETERMINATION
How much data do I need to collect?
Key concepts
p-value
hypothesis tests
width confidence interval
power
minimal sample size
What should you be able to do?
Compute the minimal sample size in terms of CI width when you are given the formula (normal, binomial)
Compute the minimal sample size in terms of power when you are given the formula (normal, binomial)
Compute minimal sample sizes when given a simple confidence or power formula for a distribution
Key insights
The absolute error parameter is the half-width of the CI in case of symmetric CIs
For binomial and normal distributions, the required CI width determines the minimal sample size
Minimal sample size determination in binomial cases requires extra information on the success probability $p$
There are three basic ways of hypothesis testing:
1. Is the test statistic in the critical region (yes/no)? This does not provide a lot of information.
2. P-values: these allow people to choose their own $\alpha$ value.
3. Confidence intervals: these give insight into how uncertain we are about the estimate.
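The three approaches listed above can be compared on one example. The sketch below assumes a two-sided z-test of $H_0: \mu = \mu_0$ with known $\sigma$; the function name `z_test_three_ways` and the input numbers are hypothetical and only serve as illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_three_ways(x_bar, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0 (sigma known), reported three ways."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)                      # standard error of the mean
    z = (x_bar - mu0) / se                    # test statistic
    z_crit = std_normal.inv_cdf(1 - alpha / 2)

    # 1. Critical region: a plain yes/no answer.
    in_critical_region = abs(z) > z_crit
    # 2. P-value: the reader can compare it with any alpha.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # 3. Confidence interval: x_bar - c < mu < x_bar + c.
    c = z_crit * se
    ci = (x_bar - c, x_bar + c)
    return in_critical_region, p_value, ci


# Made-up numbers, for illustration only.
reject, p_value, ci = z_test_three_ways(x_bar=103.0, mu0=100.0, sigma=10.0, n=50)
print(reject, round(p_value, 4), ci)
```

All three views lead to the same decision at a given $\alpha$: the statistic lies in the critical region exactly when the p-value is below $\alpha$ and $\mu_0$ lies outside the confidence interval.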
$(\hat{\theta} - c,\ \hat{\theta} + c)$ is a $100(1 - \alpha)\%$ CI when $P(\hat{\theta} - c < \theta < \hat{\theta} + c) = 1 - \alpha$
Type I error (false positive)
$\alpha$: the probability of rejecting $H_0$ when $H_0$ is true. $1 - \alpha$ is the true-negative probability (not rejecting $H_0$ when it is true)
Type II error (false negative)
$\beta$: the probability of not rejecting $H_0$ when $H_0$ is false.
Power (true positive)
$1 - \beta$: the probability of rejecting $H_0$ when $H_0$ is false
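As a worked illustration of these definitions, the sketch below computes the power of a two-sided z-test with known $\sigma$. This test setting is an assumption made for the example, and the function name `z_test_power` and the numbers are hypothetical. When the true mean equals the hypothesised mean, the rejection probability is the Type I error rate $\alpha$.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the true mean, the test statistic is N(shift, 1).
    shift = (mu_true - mu0) * sqrt(n) / sigma
    upper_tail = 1 - std_normal.cdf(z_crit - shift)  # P(Z > z_crit)
    lower_tail = std_normal.cdf(-z_crit - shift)     # P(Z < -z_crit)
    return upper_tail + lower_tail


# Made-up numbers, for illustration only.
power = z_test_power(mu0=100.0, mu_true=105.0, sigma=10.0, n=25)
beta = 1 - power  # Type II error probability
print(round(power, 3), round(beta, 3))
```

Increasing $n$ increases the shift of the test statistic and therefore the power, which is why a power requirement can also be used to set a minimal sample size.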
The formula for the minimal sample size can be derived from the confidence interval.
The half-width of the CI equals the error $E$; solving that expression for $n$ gives the sample size.
The formulas to calculate the sample size have a similar form: $n \geq \left\lceil \left(\frac{z_{\alpha/2}}{E}\right)^2 \sigma^2 \right\rceil$
If the deviation is not absolute but relative to the expected value 𝜎 (e.g. a fraction 𝑝 of the response time), then 𝐸 = 𝑝 × 𝜎
NORMAL DISTRIBUTION
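One-sample (if $\sigma$ is known)
CI: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Error: $E \geq z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$
Sample size: $n \geq \left(\frac{z_{\alpha/2}}{E}\right)^2 \sigma^2$
Two-sample (if the $\sigma$s are known, and $n_1 = n_2 = n$)
CI: $\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{n}}$
Error: $E \geq z_{\alpha/2} \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{n}}$
Sample size: $n \geq \left(\frac{z_{\alpha/2}}{E}\right)^2 (\sigma_1^2 + \sigma_2^2)$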
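The sample size formulas for the normal distribution can be evaluated with a few lines of Python. This is a minimal sketch: the function names `min_n_one_sample` and `min_n_two_sample` and the input values are hypothetical, and the result is rounded up to the next whole observation.

```python
from math import ceil
from statistics import NormalDist


def min_n_one_sample(sigma, error, alpha=0.05):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= error (sigma known)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z / error) ** 2 * sigma ** 2)


def min_n_two_sample(sigma1, sigma2, error, alpha=0.05):
    """Smallest n per group (n1 = n2 = n) for a difference of two means."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z / error) ** 2 * (sigma1 ** 2 + sigma2 ** 2))


# Made-up inputs: sigma = 10, half-width E = 2, 95% confidence.
print(min_n_one_sample(sigma=10.0, error=2.0))               # 97
print(min_n_two_sample(sigma1=10.0, sigma2=10.0, error=2.0)) # 193
```

Halving the allowed error $E$ roughly quadruples the required sample size, because $E$ enters the formula squared.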