Tilburg University
Study Program: Master Data Science and Society
Academic Year 2021/2022, Semester 2, Block 3 (January to March 2022)
Course: Statistics and Methodology (880259-M-6)
Lecturer: L.V.D.E. Vogelsmeier
Lecture 1: Statistical Inference, Modeling and Prediction
Introduction to statistical inference
Statistical Reasoning
• consideration of uncertainty
• systematize the way we account for uncertainty when making data-based decisions
→ avoid biasing ourselves toward “the result I wish to find”
Probability Distributions
• Probability distributions quantify how likely it is to observe each possible value of some
probabilistic entity; they can be thought of as “re-scaled frequency distributions”
• they show the proportion of observations that fall into a certain bin, not the absolute number /
frequency of observations
• probability distributions with a higher standard deviation are wider and flatter (see the sketch below)
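A minimal Python sketch of these two points (assuming numpy, scipy, and matplotlib are installed): a density-scaled histogram shows re-scaled frequencies rather than counts, and normal curves with larger standard deviations are wider and flatter.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=0, scale=1, size=1_000)

# density=True re-scales the histogram from absolute counts to a density (total area = 1)
plt.hist(sample, bins=30, density=True, alpha=0.5, label="re-scaled frequencies")

# Normal densities with a larger standard deviation are wider and flatter
x = np.linspace(-6, 6, 200)
for sd in (0.5, 1, 2):
    plt.plot(x, stats.norm.pdf(x, loc=0, scale=sd), label=f"sd = {sd}")

plt.legend()
plt.show()
```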
Statistical Testing
• When we conduct statistical tests, we weight the estimated effect by the precision of the
estimate.
• Wald Test (a type of t-test): T = (Estimate − Null-Hypothesized Value) / Variability
o if no effect is hypothesized, the null-hypothesized value is 0
o in general, the larger the test statistic (in absolute value), the stronger the evidence against the null hypothesis (see the sketch below)
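A small sketch of the Wald test as written above; the estimate, null value, and standard error are made-up numbers for illustration.

```python
# Hypothetical numbers for illustration only
estimate = 0.42      # estimated effect (e.g., a regression coefficient)
null_value = 0.0     # "nil-null": no effect hypothesized
std_error = 0.15     # variability (precision) of the estimate

# Wald / t statistic: the estimated effect weighted by the precision of the estimate
t_stat = (estimate - null_value) / std_error
print(round(t_stat, 2))  # 2.8
```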
Sampling Distribution of the test statistic
• probability distribution of a statistic
• The sampling distribution quantifies the possible values of the test statistic over infinite
repeated sampling.
• The area of a region under the curve represents the probability of observing a test statistic
within the corresponding interval.
• To quantify how exceptional our estimated test statistic is, we compare the estimated value
to a sampling distribution of t-statistics assuming no effect (null hypothesis)
o null hypothesis = no effect → “nil-null”
• If our estimated statistic would be very unusual in a population where the null hypothesis is
true, we reject the null and claim a “statistically significant” effect (see the simulation sketch below)
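A simulation sketch of this idea (all numbers are illustrative): draw many samples from a population where the null hypothesis is true, compute the t-statistic each time, and check how unusual the observed statistic from the Wald-test sketch above would be in that null sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n, n_reps = 50, 10_000

# Repeatedly sample from a population where the null hypothesis (mean = 0) is true
# and compute the one-sample t-statistic each time
t_null = np.empty(n_reps)
for r in range(n_reps):
    x = rng.normal(loc=0, scale=1, size=n)
    t_null[r] = x.mean() / (x.std(ddof=1) / np.sqrt(n))

t_hat = 2.8  # illustrative observed statistic from the Wald-test sketch above

# Proportion of null t-statistics at least as extreme as t_hat (two-sided)
print(np.mean(np.abs(t_null) >= abs(t_hat)))
```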
Interpreting P-Values
• All that we can say is that there is, for example, a 0.032 probability (the p-value) of observing a test statistic at
least as large as 𝑡̂ if the null hypothesis is true; the p-value is not the probability that the null hypothesis is true.
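The same tail probability can be computed analytically from the t-distribution under the null; the values of 𝑡̂ and the degrees of freedom below are illustrative, chosen so that the two-sided p-value roughly matches the 0.032 mentioned above.

```python
from scipy import stats

t_hat, df = 2.18, 98                      # illustrative values
p_value = 2 * stats.t.sf(abs(t_hat), df)  # P(|T| >= |t_hat|) if H0 is true (two-sided)
print(round(p_value, 3))                  # ≈ 0.032
```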
Introduction to statistical modeling
• For simple questions, we can use statistical testing to account for uncertainty. In most real-
world cases, however, we want to take a modeling perspective so that we can also control for confounding variables.
• When modeling, we can make inferences about the model parameters, or we can predict
outcomes for new cases (see the sketch below).
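A sketch contrasting the two uses of a model, using statsmodels on made-up data (variable names and effect sizes are hypothetical): the model summary gives parameter inference, and predict() gives predictions for new cases.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=3)
n = 200

# Hypothetical data: the outcome depends on a predictor and a confounder
df = pd.DataFrame({"confounder": rng.normal(size=n)})
df["predictor"] = 0.5 * df["confounder"] + rng.normal(size=n)
df["outcome"] = 0.3 * df["predictor"] + 0.8 * df["confounder"] + rng.normal(size=n)

# Modeling perspective: control for the confounder by including it in the model
fit = smf.ols("outcome ~ predictor + confounder", data=df).fit()

# Inference: estimates, Wald t-statistics and p-values for the model parameters
print(fit.summary())

# Prediction: outcomes for new, unseen cases
new_cases = pd.DataFrame({"predictor": [1.0, -0.5], "confounder": [0.0, 1.0]})
print(fit.predict(new_cases))
```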
Lecture 2: Research Cycle, Research Design and Exploratory Data Analysis
Discuss research/data science cycle
• CRISP-DM: the Cross-Industry Standard Process for Data Mining was developed to
standardize the process of data mining in industry applications
• The Data Science Cycle combines the classical Research Cycle and CRISP-DM; the
grey-colored activities (in the accompanying diagram) are mandatory
Discuss research design in data science
• In data science, we rarely design experiments/empirical studies
• Research design is still crucial in data science for designing an appropriate analysis.
o You must know how to operationalize the question in a statistically rigorous way.
▪ Make sure you understand exactly what is being asked
▪ Convert each aspect of the question into something quantifiable
▪ If possible, code the research question into a set of hypotheses.
o You must be able to choose/build a statistical model, statistical test, or machine
learning algorithm that can answer your well-operationalized research question.
▪ Once you have a well-operationalized research question, you need to
convert that question into some type of model or test.
o You must understand what types of data/data sources you’ll need.
Introduce EDA (Exploratory Data Analysis)
• interactively analyze/explore your data
• More of a mindset than a specific set of techniques or steps: a data-driven approach for
exploring the data, not for testing hypotheses
• diverse selection of tools to use
o Statistical graphics: Histograms, Boxplots, Scatterplots, Traceplots
o Summary statistics: measures of central tendency & dispersion, order statistics
o Data Screening/Cleaning: missing data, outliers, invalid values (see the sketch below)
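A brief pandas/matplotlib sketch of these EDA tools on made-up data (column names, values, and thresholds are illustrative only):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data standing in for a real dataset
rng = np.random.default_rng(seed=4)
df = pd.DataFrame({
    "age": rng.normal(40, 12, size=300).round(),
    "group": rng.choice(["A", "B"], size=300),
})
df.loc[rng.choice(300, size=5, replace=False), "age"] = np.nan  # some missing values

# Summary statistics: central tendency, dispersion, order statistics
print(df.describe())

# Statistical graphics: histogram and boxplot
df["age"].plot(kind="hist", bins=30); plt.show()
df.boxplot(column="age", by="group"); plt.show()

# Data screening/cleaning: missing data, invalid values, outliers
print(df.isna().sum())                          # missing values per column
print(df[(df["age"] < 0) | (df["age"] > 120)])  # implausible ages
```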
Interfacing EDA & CDA (Confirmatory Data Analysis)
• CDA: there is usually a clear hypothesis to test; we have some prior knowledge that we
want to test, e.g., by using hypothesis testing
• unsupervised learning models are usually closer to EDA because we want to find patterns
• Either can stand alone, but they play together better
o When the data are well-understood, we can proceed directly to CDA.
o If we don’t care about testing hypotheses, we can focus on EDA.
• EDA can be used to generate hypotheses for CDA.
• EDA can be used to sanity-check (plausibility-check) hypotheses (see the sketch below)
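A minimal sketch of that interplay on made-up data: EDA (a correlation/scatterplot) suggests a pattern, which is then formalized as a hypothesis and tested; in practice, the confirmatory test should ideally be run on new or held-out data rather than the same data that generated the hypothesis.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data for illustration
rng = np.random.default_rng(seed=5)
df = pd.DataFrame({"hours_studied": rng.normal(10, 3, size=120)})
df["exam_score"] = 50 + 2 * df["hours_studied"] + rng.normal(0, 8, size=120)

# EDA: the correlation matrix (or a scatterplot) suggests a positive relationship
print(df.corr())

# CDA: turn the pattern into a hypothesis (H0: correlation = 0) and test it
r, p_value = stats.pearsonr(df["hours_studied"], df["exam_score"])
print(r, p_value)
```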