Doorstroomminor: Statistics 1, all the information you need!
Course
Methodologie 1
Institution
Hogeschool Van Amsterdam (HvA)
Book
Statistics
The original notes begin in English and continue in Dutch. A summary of both the extra Coursera videos and the lecture notes for Statistics 1.
*Introductory video ‘Basic statistics’
Descriptive statistics: methods of summarizing the information for analysis, by means of
averages, graphs, percentages, or a correlation coefficient.
Inferential statistics: drawing conclusions about a population on the basis of a limited
number of cases. You make a statement about a whole population, based on a selected
number of people.
*1.01 Cases, variables and levels of measurement
Variables = features of something or someone, for example characteristics such as weight,
hair color, age, or number of goals. Every characteristic of a case can be a variable, on
the condition that it varies. If it does not vary, for example if all the teams come from
Spain, then it is a constant.
Cases = something or someone, for example the people you are studying. Most of the time
these are persons, teams, or companies.
Because there are many different kinds of variables, it is important to distinguish levels
of measurement. The first two levels apply to categorical variables.
- The simplest one is the nominal level. There is no order (it is not possible to argue
that one category is better or worse), but there is a difference, for example nationality.
- Ordinal level: the categories differ and there is an order (you can see who the winner
is and who came second), but you do not know how much better number one was than
number two.
There are also quantitative variables (represented by numerical values):
- Interval level: the categories differ, there is an order, and the intervals between
values are equal, for example the age of the player. You can say that one player is older
than another (18 is older than 16), and also that this difference equals the difference
between 14 and 16: both are two years.
- Ratio level: similar to the interval level, but with a meaningful zero point. For example,
someone’s height: a height of 0 means there is no height. The notes contrast this with age:
an age of 0 does not mean that there is no age, which is why age is placed at the interval
level here.
Quantitative variables can be divided into discrete and continuous variables:
- Discrete: a set of separate numbers (whole numbers), for example the number of goals.
- Continuous: the variable can take any value in an interval, for example a height can be
170 centimeters, but also 170.87 centimeters.
*1.02 Data matrix and frequency table
How to order the variables and cases.
If you are interested in the cases ‘footballers’ and want to know their age, hair color,
height, etc., you can use a data matrix (an overview). Each entry, for example the age of
player 4, is called an ‘observation’.
If you want to share the data with others, you often summarize the data matrix, for example
with a frequency table. It is a list of all possible values, with the number of observations
of each value. You can turn the counts into percentages (part / whole x 100 = percentage).
Sometimes there are cumulative percentages: the percentage of a category added to the
percentages of all categories before it (percentage of category 1 + percentage of category 2
= cumulative percentage at category 2). A frequency table works well for a categorical
variable, but for a quantitative variable it does not give a good overview. In that case,
you can build new ordinal categories (for example classes of 10 kg each: less than 60 kg,
60-69.9 kg, 70-79.9 kg). You have then recoded the variable.
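As an illustration, the recoding and the frequency table can be sketched in Python. This is a minimal sketch under stated assumptions: the weights are made up, the helper names `recode_weight` and `frequency_table` are hypothetical, and the class "80 kg or more" is an assumed extra class that the notes do not list.

```python
from collections import Counter

# Hypothetical weights in kg; the data are made up for illustration.
weights = [58.2, 63.5, 71.0, 66.4, 75.3, 68.9, 59.9, 72.1, 64.0, 77.7]

# Class labels in their natural (ordinal) order.
# "80 kg or more" is an assumed extra class, not taken from the notes.
CLASSES = ["less than 60 kg", "60-69.9 kg", "70-79.9 kg", "80 kg or more"]


def recode_weight(kg):
    """Recode a quantitative weight into one of the ordinal classes."""
    if kg < 60:
        return CLASSES[0]
    if kg < 70:
        return CLASSES[1]
    if kg < 80:
        return CLASSES[2]
    return CLASSES[3]


def frequency_table(values, order):
    """Return rows of (category, count, percentage, cumulative percentage)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0
    for category in order:
        count = counts[category]
        running += count
        # part / whole x 100 = percentage
        rows.append((category, count, 100 * count / total, 100 * running / total))
    return rows


recoded = [recode_weight(w) for w in weights]
table = frequency_table(recoded, CLASSES)
for category, count, pct, cum_pct in table:
    print(f"{category:<16} {count:>3} {pct:>6.1f}% {cum_pct:>6.1f}%")
```

The cumulative column only makes sense because the classes are listed in their ordinal order; for a nominal variable such as hair color there is no natural order to accumulate over.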
*1.03 Graphs and shapes of distributions
If you want to summarize nominal or ordinal data (categorical data):
Pie chart = you can see immediately how much of the total is taken by each category, but
the exact numbers are hard to read.
Bar graph = you can easily read the numbers.
A bar graph has advantages over a pie chart as the number of categories of a variable
increases (the more categories there are, the better a bar graph works). Both can be used
with all categorical data.
But how can you summarize quantitative data? You can use a dot plot. Draw a horizontal line
and label the possible values at regular intervals (for weight, for example,
165-170-175-180). For each observation you place a dot above the line. A dot plot is only
useful when you have a small sample.
When there are many observations, you can make a histogram. It is similar to a bar graph,
but in a histogram the bars touch each other. When, for example, the weights are precise
values like 93.8, you can use intervals of five kilograms. An interval can run from 47.5 to
52.5 kilograms; 50 kilograms is displayed on the axis because it is the middle of the
interval.
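The grouping into five-kilogram intervals can be sketched as follows. This is a minimal sketch, not a plotting routine: the weights are made up, `bin_midpoint` is a hypothetical helper, and it assumes that a value exactly on a boundary (such as 52.5) belongs to the higher interval, a convention the notes do not specify.

```python
from collections import Counter


def bin_midpoint(value, width=5.0):
    """Return the midpoint m of the interval [m - width/2, m + width/2) containing value."""
    # Shifting by half a width before flooring puts the interval boundaries
    # at ..., 47.5, 52.5, 57.5, ... when width is 5.
    return width * ((value + width / 2) // width)


# Hypothetical weights in kg.
weights = [48.1, 51.9, 52.5, 93.8, 91.0, 49.9, 56.0]

# Count how many observations fall in each interval (keyed by midpoint).
histogram = Counter(bin_midpoint(w) for w in weights)

# Text version of the histogram: one star per observation.
for midpoint in sorted(histogram):
    stars = "*" * histogram[midpoint]
    print(f"{midpoint - 2.5:5.1f} to {midpoint + 2.5:5.1f} kg: {stars}")
```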
A histogram can have different shapes:
- A bell shape (unimodal): one peak, approximately symmetric.
- Skewed to the left/right: it is not symmetric, because one tail is longer than the other.
- Two peaks (bimodal): suppose you are interested in the ages of the people in the
canteen. If the football players come with their parents, there is one peak for the
children and another peak for the parents.
*1.04 Mode, median and mean
- Mode: the most common outcome. It can be used if a variable is measured on a nominal or
ordinal level. The mode is the category itself, not its percentage. It is possible to have
two modes (bimodal distribution).
- Median: the middle value of the observations when arranged from the smallest to the
largest. This is easy if you have an odd number of cases. If there is an even number of
cases, you take the average of the two middle values ((value 1 + value 2) / 2).
- Mean: the sum of all the values divided by the number of observations (N). It is also
called the balance point of the data.
Nominal: you cannot use the median or the mean.
Categorical: you can use the mode.
Quantitative: you can use the median or the mean.
Outlier: if one value is much larger or smaller than all the other values, go for the
median. If that is not the case, go for the mean.
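The three measures of center can be tried out with Python's standard `statistics` module. The data below are made up for illustration; the last example shows why the median is preferred when there is an outlier.

```python
import statistics

# Nominal variable: only the mode is meaningful.
nationalities = ["Spain", "Spain", "Italy", "France", "Spain"]
mode_nationality = statistics.mode(nationalities)   # the category itself

# Quantitative variable with an odd and an even number of cases.
goals_odd = [0, 1, 1, 2, 3]
goals_even = [0, 1, 1, 2, 3, 5]
median_odd = statistics.median(goals_odd)    # the single middle value
median_even = statistics.median(goals_even)  # average of the two middle values
mean_goals = statistics.mean(goals_odd)      # sum of the values divided by N

# One extreme value pulls the mean upwards but leaves the median in place.
salaries = [30, 32, 35, 36, 400]
mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(mode_nationality, median_odd, median_even, mean_goals)
print(mean_salary, median_salary)
```

With the outlier of 400, the mean lies above four of the five salaries, while the median still sits in the middle of the typical values.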
*1.05 Range, interquartile range and box plot
To analyze the data, you also need information about the variability of the data.
- Range: the difference between the highest and the lowest value (highest value - lowest
value = range). Most of the time it does not give a good impression of the variability of
the data, because it only takes the two extreme values into account.
- Interquartile range (IQR): it leaves out the extreme values. It divides the ordered
values into four parts. There is the first quartile (Q1), the second quartile (Q2) and the
third quartile (Q3). 50% of the values lie below Q2 and 50% lie above it; Q2 is the same
as the median. The interquartile range = Q3 - Q1. First you look for the median (Q2). For
Q1 you look for the middle value of the observations to the left of the median. For Q3 you
use the same strategy on the right side.
- A value is an outlier if it is lower than Q1 - 1.5 x IQR or higher than Q3 + 1.5 x IQR.
- The graph that describes the center, variability and outliers is the box plot. The box
contains the middle 50% of the distribution, which is the IQR.
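The quartile strategy and the outlier rule above can be sketched as follows. This is a minimal sketch with made-up ages and hypothetical helper names. It assumes that, for an odd number of observations, the median itself is left out of both halves; statistical software may use other quartile conventions and give slightly different values. It needs at least two observations.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves strategy."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations left of the median
    upper = ordered[half + n % 2:]   # observations right of the median
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


def find_outliers(values):
    """Return the values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages of people in the canteen.
ages = [18, 19, 20, 21, 22, 23, 24, 25, 60]

value_range = max(ages) - min(ages)
q1, q2, q3 = quartiles(ages)
print("range:", value_range)
print("Q1, Q2, Q3:", q1, q2, q3, "IQR:", q3 - q1)
print("outliers:", find_outliers(ages))
```

Here the single age of 60 dominates the range, while the IQR is unaffected by it.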
*1.06 Variance and standard deviation
- Variance: first you calculate the mean. Then you subtract the mean from every
observation. Each of these deviations is squared. Adding up all the squared deviations
gives the sum of squares. Finally you divide the sum of squares by the number of
observations minus one (n - 1). The larger the variance, the larger the variability, so
the more the observations are spread around the mean.
- Standard deviation: the drawback of the variance is that the result is in squared units.
That is because the positive and negative deviations were squared earlier so that they do
not cancel each other out. The standard deviation is the square root of the variance. It
can be read as roughly the average distance of an observation from the mean.
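The steps for the variance and the standard deviation can be written out in Python. This is a minimal sketch with made-up data; `sample_variance` is a hypothetical helper that follows the steps one by one, and the standard library functions `statistics.variance` and `statistics.stdev` (which also divide by n - 1) are used only as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Variance with n - 1 in the denominator, following the steps in the notes."""
    n = len(values)
    mean = sum(values) / n                              # step 1: the mean
    deviations = [x - mean for x in values]             # step 2: subtract the mean
    sum_of_squares = sum(d ** 2 for d in deviations)    # steps 3 and 4: square and add
    return sum_of_squares / (n - 1)                     # step 5: divide by n - 1


# Hypothetical number of goals per player.
goals = [2, 4, 4, 5, 5]

variance = sample_variance(goals)
standard_deviation = math.sqrt(variance)   # back to the original units

print("variance:", variance)
print("standard deviation:", round(standard_deviation, 3))
print("library check:", statistics.variance(goals), statistics.stdev(goals))
```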
*1.07 Z-scores (standard deviation & normal distribution)
To know whether a specific observation is common (close to the mean) or exceptional, you
can express it as the number of standard deviations it is removed from the mean: the
z-score. It tells you how extreme the observation is. Recoding original scores into
z-scores is called standardization.
The formula is z = (observation - mean) / standard deviation. The first number you fill in
is the observation you are interested in. If you do this for all observations, the z-scores
add up to 0, because the mean is the balance point. If the histogram is bell-shaped, about
68% of all the observations fall between z-scores -1 and 1 (mean - 1s to mean + 1s), about
95% between z-scores -2 and 2 (mean - 2s to mean + 2s), and about 99.7% between -3 and 3
(mean - 3s to mean + 3s). Observations more than 3 standard deviations below or above the
mean are exceptional. If the histogram is skewed to the right, there are more extreme high
values and therefore large positive z-scores.
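Standardization can be sketched in a few lines of Python. The heights are made up and `z_scores` is a hypothetical helper; it assumes the sample standard deviation (dividing by n - 1), in line with the variance definition in 1.06.

```python
import statistics


def z_scores(values):
    """Recode each observation as (observation - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in values]


# Hypothetical heights in centimeters.
heights = [170, 175, 180, 185, 190]

z = z_scores(heights)
for height, score in zip(heights, z):
    print(f"{height} cm -> z = {score:+.2f}")

# The z-scores add up to (approximately) zero because the mean is the balance point.
print("sum of z-scores:", round(sum(z), 10))
```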
Agresti chapter 1 ‘Statistics: The Art and Science of Learning from Data’
1.1 Using Data to Answer Statistical Questions
The information we gather with experiments and surveys is called data. Statistics is a way of
thinking about data and translating data into knowledge and understanding of the world
around us. There are three main components of statistics for answering a statistical
question:
- Design: stating the goal/question and planning how to obtain the data (what do you want
to investigate and how will you collect the data).
- Description: summarizing and analyzing the data that are obtained.
- Inference: making decisions based on the data to answer the statistical question
(drawing a conclusion about the whole population from a sample).
To infer means to arrive at a decision on the basis of known evidence.
Statistical inference = the obtained data are the evidence.
Inference is probabilistic (you make a ‘probable’ statement).
1.2 Sample versus Population
The thing you measure in a study is called the subject. The population is the set of all the
subjects. Usually you have data for only some of the subjects who belong to that population;
these subjects are called a sample.
Subject: primary schools
Population: all Dutch primary schools
Sample: primary schools in Amsterdam
Descriptive statistics and inferential statistics.
Descriptive statistics are methods for summarizing the collected data, for example a bar
graph. A summary is easier to comprehend than the entire set of data. Descriptive statistics
are also useful when data are available for the entire population. Inferential statistics
are used when data are available for a sample only, but we want to make a decision or
prediction about the entire population. An important aspect of statistical inference is
reporting the precision of a prediction. The term parameter is used for a numerical summary
of the population.
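The difference between a parameter and a summary of a sample can be illustrated with a small simulation. Everything here is an assumption for the sake of the example: the population of 1,000 schools and their pupil counts are randomly generated, and the sample is drawn at random, which the notes do not require.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: number of pupils at each of 1,000 made-up schools.
population = [rng.randint(100, 500) for _ in range(1000)]

# In practice we usually only have data for a sample of the subjects.
sample = rng.sample(population, 50)

parameter = statistics.mean(population)    # numerical summary of the population
sample_mean = statistics.mean(sample)      # numerical summary of the sample

print(f"population mean (parameter): {parameter:.1f}")
print(f"sample mean:                 {sample_mean:.1f}")
```

The sample mean is used to make a prediction about the parameter; how close the two are depends on the sample, which is why inference has to report the precision of a prediction.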
Seller
Makkelijkleren