Summary Statistics for premaster


A summary of the statistics course, which is given in the premaster as well as in the first bachelor year.

December 12, 2020 • 52 pages • 2020/2021

Statistics summary
Module 1. Introduction.
Why learn statistics and R?
When there is a research question, you check whether you can formulate a hypothesis to answer the
question. You have to observe the world to collect data, then analyze the data, and after that
generalize the results.

Theory = a hypothesized general principle or set of principles that explain known findings about a
topic and from which new hypotheses can be generated.

Hypothesis = a prediction, typically derived from a theory or from observation.

Falsification = the act of disproving a theory or hypothesis.

Measurement = identify variables. You have to conceptualize the question. After that you have to
operationalize it: turn it into something that can be observed. There are several scales of measurement:

- Categorical scales (entities are divided into distinct categories; the values have no
measurable quantity):
  - Binary variable: there are only 2 categories, e.g. dead or alive (you can’t say one is better
  than the other).
  - Nominal variable: there are more than 2 categories, e.g. whether someone is an
  omnivore, vegetarian or vegan.
  - Ordinal variable: the same as a nominal variable, but the categories have a logical
  order, e.g. whether people got a fail, a pass, a merit or a distinction in their exam.
- Continuous scales (entities get a distinct score; it can be measured in numbers where the
values in between are meaningful):
  - Interval variable: equal intervals on the variable represent equal differences in the
  property being measured, e.g. the difference between 6 and 8 is equivalent to the
  difference between 13 and 15.
  - Ratio variable: the same as an interval variable, but the ratios of the scores on the
  scale must also make sense, so the scale has a true zero, e.g. a score of 16 on an anxiety
  scale means that the person is, in reality, twice as anxious as someone scoring 8.
  Temperature in degrees Celsius is not a ratio variable because 0 degrees does not mean
  “no temperature” (it is not a true zero).

Examples:

Temperature in degrees Celsius: interval scale. The numerical value is genuinely meaningful. The
differences between the numbers are interpretable, but the variable does not have a “natural” zero
value.

Gender: nominal scale. It serves as “labels” only to identify an object. It usually deals with non-
numeric variables where numbers have no value (e.g., averaging “male = 1” and “female = 2” does
not make any sense).

The order in which runners cross the finish line in a marathon race: ordinal scale. An ordinal scale variable
can be used to identify a natural, meaningful way to order the different possibilities, but you cannot
do anything else (e.g., you can say that the person who finished first was faster than the person who
finished second, but you cannot say how much faster the first person was).

The amount of time Bob took to solve a calculus problem: ratio scale. The numerical value is
genuinely meaningful. The differences between the numbers are interpretable, and zero really means
zero; therefore, it is fine to multiply and divide the values.

When you have your measurements, you have to think about how reliable they are.
Reliability = the ability of the measure to produce the same results under the same conditions
(consistency).

- Test-retest reliability: the ability of a measure to produce consistent results when the same
entities are tested at two different points in time.
- Inter-rater reliability: consistency across people. Do different raters produce the same answer?
- Parallel forms reliability: do different measures that are supposed to measure the same thing
actually give the same results? (e.g. two different eye trackers)
- Internal consistency reliability: do items that are supposed to measure the same thing
actually measure it? (e.g. multiple questions measuring IQ)

Example

Inter-rater reliability: Tom and Edith are two judges measuring 50 speed skaters’ times in a short track
speed skating competition. If the results of the two judges are very similar, the results show
excellent inter-rater reliability.

Test-retest reliability: Sue is a clinical psychologist and uses the Beck Depression Inventory (BDI) to
measure her client’s level of depression. If the scores of the BDI are consistent over multiple occasions,
the BDI has test-retest reliability.

Internal consistency reliability: The following two items show good internal consistency reliability to
measure the level of satisfaction with life: “In most ways my life is close to my ideal” and “I am
satisfied with my life.” However, if the following item were part of the same measure, “I can control
my facial expressions”, it would have low internal consistency reliability to measure satisfaction with
life.

Parallel forms reliability: An experimenter developed a large set of word memory questions (i.e., lists of
words). He split these questions in half and administered each half to a randomly selected half of a
target sample. If the results of the two sets of questions show a high correlation, this would be one
indicator that the tests have good parallel forms reliability.

The role of the variable:




Common types of research:

- Correlational research: observing what naturally goes on in the world, without directly
interfering with it.
- Cross-sectional research: data come from people at different age points, with different
people representing each age point. Could be quasi-experimental, case study, naturalistic
observations.

- Experimental research: one or more variables are systematically manipulated to see their effect
(alone or in combination) on an outcome variable. It relies on randomization (random assignment,
random sampling). Statements can often be made about cause and effect, but you must be
careful about:
  - Confounds = unmeasured variables that could be related to the variables of
  interest.
  - Artefacts = something that might threaten the external validity or construct validity
  of your results (e.g. movement noise in an EEG signal).

There are 5 types of validity:

- Internal validity: the extent to which you are able to draw the correct conclusions about the
causal relationship between variables.
- External validity: the generalizability of your findings. To what extent do you expect to see
the same pattern of results in “real life” as you saw in your study?
- Construct validity: whether you’re actually measuring what you want to be measuring.
- Face validity: whether or not a measure “looks like” it’s doing what it’s supposed to.
- Ecological validity: the entire set-up of the study should closely approximate the real-world
scenario that’s being investigated (which is not always possible: you cannot run a deadly experiment on people).

Example

Case (fictional): A researcher in psychology is conducting an experimental study about the effect of a
new type of cognitive behavioral therapy to treat depression among young Dutch women between
20 and 25 years old.

External validity: The extent to which we can generalize the results of this study to the real
population, namely young Dutch women between 20 and 25 years old with depression issues. When, for
example, only undergraduate psychology students with minor mental issues are used as participants,
your sample has a high risk of not being representative of the population. Such an experiment carries a risk of
lacking external validity.

Internal validity: The extent to which we can draw the correct conclusions about the causal
relationship between the variables in the study. In this case we would like to study the causal
relationship between therapy and a change in depression levels among our population. We could find
a statistically significant result because the variables are correlated. However, a change in the state of
depression could also be caused by other (confounding) variables, such as longer sleeping hours,
lower stress in daily life and new social relationships. These variables should be controlled for to
justify internal validity.

Construct validity: Construct validity is to show that the test is actually measuring what you want to
be measuring. If you are trying to examine the effect of cognitive therapy on depression, the research
should have the tools and the methods in order to somehow measure the concept of “depression”.

Ecological validity: The entire set up of the study should closely approximate the real world scenario
that is being investigated. In the above mentioned case, this should mean that the therapy session
that is considered to be the independent variable, should approximate “real world” therapy sessions.
This also applies to the environment where the participants are questioned about the change in
depression levels.

Module 2. Intro to R/RStudio (additional, not very important)
A small collection of numbers is called a vector. Any object that contains data is called a data
structure, and numeric vectors are the simplest type of data structure in R. Numeric vectors can
be used in arithmetic expressions. When given two vectors of the same length, R simply performs the
specified arithmetic operation element by element. If the vectors are of different lengths, R
“recycles” the shorter vector until it is the same length as the longer vector.
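The element-by-element rule and recycling described above can be sketched as follows. The vector names and values are made up for illustration.

```r
# Two numeric vectors of the same length: arithmetic works element by element.
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)
sum_ab <- a + b          # 11 22 33 44

# Vectors of different lengths: the shorter one is recycled,
# so this is the same as a + c(0, 100, 0, 100).
short <- c(0, 100)
recycled <- a + short    # 1 102 3 104

print(sum_ab)
print(recycled)
```

If the longer length is not a multiple of the shorter one, R still recycles but gives a warning.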

Variables are used to store information. They also provide a way of labelling information. There are
different classes of variables:

- Numeric variables store numbers.
- Character variables store text. The quote marks are used to tell R that this is text; they aren’t
part of the data itself.
- Logical variables store truth values (TRUE or FALSE).

Sometimes variables can have the “special” value NA. It means “missing data” and can be numeric,
character or logical.

Vectors are variables that store multiple pieces of information. Create vectors using c(); extract
specific elements using [].
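A minimal sketch of the variable classes, NA, c() and [] mentioned above. All names and values are invented examples.

```r
# The three classes of variables.
age <- 23                 # numeric
name <- "Anna"            # character (the quotes are not part of the data)
is_student <- TRUE        # logical

print(class(age))
print(class(name))
print(class(is_student))

# A vector stores multiple pieces of information; NA marks a missing value.
scores <- c(7, 8, NA, 6)

# Extract specific elements with [].
second_score <- scores[2]         # 8
third_missing <- is.na(scores[3]) # TRUE
print(second_score)
print(third_missing)
```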

CSV is a standard “universal” format. The raw data is just a plain text file. CSV stands for “comma
separated values”. CSV files are often opened in spreadsheets, and they produce tabular data in R. A CSV
file is imported as a data frame. Data frames are the way R stores a typical dataset. A data frame is very similar
to a data set in SPSS. It is a collection of variables bundled together, organized into a “case by
variable” matrix. Each row is a “case” and each column is a named variable.
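As a sketch of the import step: the example below first writes a tiny CSV file to a temporary location (so that it can run anywhere) and then reads it with read.csv(). The file contents and column names are assumptions for illustration; in practice you would pass the path of your own file.

```r
# Create a small CSV file in a temporary location (illustrative data only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,gender",
             "1,21,male",
             "2,25,female",
             "3,19,female"), csv_path)

# Import the CSV file; the result is a data frame.
expt <- read.csv(csv_path)

str(expt)            # one row per case, one named column per variable
print(nrow(expt))    # number of cases
print(names(expt))   # variable names
```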

NULL is a “special” value in R that means: this value does not exist, or, it has no value. It is different to
NA, which means “the variable exists (and in principle has a value) but the value is
missing/unknown”.
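The difference between NULL and NA shows up most clearly in lengths, as in this small sketch with made-up values:

```r
missing_value <- NA    # exists, but the value is unknown
no_value <- NULL       # does not exist at all

len_na <- length(missing_value)   # 1: NA still occupies a position
len_null <- length(no_value)      # 0: NULL has no content

# Inside a vector, NA keeps its place while NULL simply disappears.
with_na <- c(1, NA, 3)      # length 3
with_null <- c(1, NULL, 3)  # length 2

print(c(len_na, len_null, length(with_na), length(with_null)))
```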

expt$age[1] → expt$age is a vector and we’re requesting the first element of it.

expt[1, 2] → expt is a data frame, and we’re requesting the value found in the first row and the
second column.

expt[1, "age"] → expt is a data frame, and we’re requesting the value found in the first row and the
column named age.

expt[4, ] → row 4.

expt[c(1, 4, 7), ] → rows 1, 4 and 7.

Get a subset of cases AND variables:

expt[c(1, 4, 7), c("age", "gender")] → age and gender for rows 1, 4 and 7.

Pull out all cases that satisfy some criterion:

subset(expt, gender == "male") → all males.
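The indexing commands can be tried on a small data frame. The data frame below is a hypothetical stand-in for expt with invented values.

```r
# Hypothetical data frame with 7 cases.
expt <- data.frame(
  id = 1:7,
  age = c(21, 25, 19, 30, 22, 27, 24),
  gender = c("male", "female", "female", "male", "female", "male", "female")
)

first_age <- expt$age[1]          # first element of the age vector
cell <- expt[1, 2]                # row 1, column 2 (age)
cell_by_name <- expt[1, "age"]    # row 1, column named age
row4 <- expt[4, ]                 # the whole of row 4
rows147 <- expt[c(1, 4, 7), ]     # rows 1, 4 and 7
cases_and_vars <- expt[c(1, 4, 7), c("age", "gender")]
males <- subset(expt, gender == "male")

print(cases_and_vars)
print(males)
```

Note the comma in expt[4, ]: without it, expt[4] would select the fourth column instead of the fourth row.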

Factors “look” like character vectors, but they’re much richer than that. R needs to know if a variable
is nominal scale. A “factor” is a nominal scale variable. Use the as.factor() command to convert.
Sometimes R automatically creates factors for you.
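A short sketch of converting a character vector to a factor with as.factor(); the diet values are an invented example of a nominal variable.

```r
# A character vector holding a nominal variable.
diet <- c("vegan", "omnivore", "vegetarian", "omnivore")

# Convert it so that R knows it is a nominal scale variable.
diet_f <- as.factor(diet)

print(class(diet))      # "character"
print(class(diet_f))    # "factor"
print(levels(diet_f))   # the distinct categories
print(table(diet_f))    # how often each category occurs
```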

Lists are bundles of variables, but they aren’t organized into “case by variables”. Matrices are (like
data frames) organized into rows and columns but all of the values must be the same type. They’re
handy for numerical computations.

A package is a collection of functions and data sets that someone has contributed to the R
ecosystem. Installed means that the package files are stored on your computer. Your version of R is
able to load the package. Loaded means that R has opened the package files (sort of), and now
“knows” what they contain. You can use the functions/data stored in the package. So, a package
must be installed before you can load it and a package must be loaded before you can use it.
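The install, load, use order can be sketched with the tools package, which ships with R, so the example needs no download. The install.packages() line is commented out and uses "ggplot2" only as an example name of a contributed package.

```r
# Installing copies the package files to your computer (needed only once).
# install.packages("ggplot2")   # example only; requires an internet connection

# Check that a package is installed.
is_installed <- "tools" %in% rownames(installed.packages())

# Loading makes the functions of the package available in this session.
library(tools)
is_loaded <- "package:tools" %in% search()

# Now a function from the package can be used.
extension <- file_ext("data.csv")

print(c(is_installed, is_loaded))
print(extension)
```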

Vectors come in two different forms: atomic vectors and lists. An atomic vector contains exactly one
data type, whereas a list may contain multiple data types. Logical vectors can contain TRUE, FALSE or
NA.
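To see what “exactly one data type” means in practice, the sketch below mixes types in c() and in list(). The values are arbitrary.

```r
# c() creates an atomic vector, so mixed input is converted to one type.
mixed_vector <- c(1, "a", TRUE)
print(class(mixed_vector))     # everything has become character

# A list keeps each element's own type.
mixed_list <- list(1, "a", TRUE)
element_classes <- sapply(mixed_list, class)
print(element_classes)

# A logical vector may contain TRUE, FALSE and NA.
logical_vector <- c(TRUE, FALSE, NA)
print(class(logical_vector))
```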

Missing values:

is.na() tells us which elements are NA.

! gives us the negation of a logical expression.

!is.na() can be read as “is not NA”.

== tests equality between two objects.
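A small sketch of these operators on a vector with missing values (invented numbers):

```r
x <- c(4, NA, 7, NA, 10)

na_flags <- is.na(x)        # FALSE TRUE FALSE TRUE FALSE
not_na <- !is.na(x)         # TRUE FALSE TRUE FALSE TRUE

# Keep only the values that are not missing.
complete <- x[!is.na(x)]    # 4 7 10

# Count the missing values (TRUE counts as 1).
n_missing <- sum(is.na(x))  # 2

# Comparing NA with == gives NA, which is why is.na() is needed.
equals_seven <- x == 7      # FALSE NA TRUE NA FALSE

print(complete)
print(n_missing)
print(equals_seven)
```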

Subset values:

x[c(3, 5, 7)] → elements 3, 5 and 7 of vector x.

x[c(-2, -10)] → all elements of x except elements 2 and 10.
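Both forms of subsetting, sketched on an example vector of ten values:

```r
x <- seq(10, 100, by = 10)     # 10 20 30 ... 100

picked <- x[c(3, 5, 7)]        # positive indices select: 30 50 70
dropped <- x[c(-2, -10)]       # negative indices exclude elements 2 and 10

print(picked)
print(dropped)
```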

Matrices and data frames:

Matrices can only contain a single class of data, while data frames can consist of many different
classes of data.

The dim() function allows you to get OR set the dimensions attribute of an R object.

A matrix is simply an atomic vector with a dimension attribute. A more direct method of creating the
same matrix uses the matrix() function. One way to label the rows is to add a column to the matrix that
contains the names. You first make a new variable with the names (e.g. patients) and then use
cbind(patients, matrix). Because a matrix can hold only one class of data, this turns all the numbers into text.

The data.frame() function allows us to store our character vector of names right alongside our matrix of
numbers. The data.frame() function takes any number of arguments and returns a single object of
class data.frame.
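The steps above can be sketched as follows. The numbers and the four patient names are made up for illustration.

```r
# Turn a vector into a matrix by setting its dimensions attribute.
my_vector <- 1:20
dim(my_vector) <- c(4, 5)        # 4 rows, 5 columns

# The more direct way: matrix().
my_matrix <- matrix(1:20, nrow = 4, ncol = 5)
same <- identical(my_vector, my_matrix)

# Adding a column of names with cbind() turns everything into text,
# because a matrix can only hold one class of data.
patients <- c("Bill", "Gina", "Kelly", "Sean")
with_names <- cbind(patients, my_matrix)
print(class(with_names[1, 2]))   # "character"

# A data frame keeps the names as text and the numbers as numbers.
my_data <- data.frame(patients, my_matrix)
print(my_data)
print(class(my_data))            # "data.frame"
```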

Logic:

There are two logical values in R, also called Boolean values: TRUE and FALSE.

!= means “not equal to”.

At some point you may need to examine relationships between multiple logical expressions. This is
where the AND and OR operators come in.

AND operator &: if the left and right operands of & are both TRUE, the entire expression is TRUE.
Applied to vectors, & works element by element.

&& is the AND operator for single (length-one) values. Older versions of R used only the first element
of a longer vector; since R 4.3.0 a longer vector gives an error.

| means OR, element by element across whole vectors.

|| means OR for single (length-one) values, in the same way as &&.

All AND operators are evaluated before OR operators.

isTRUE() takes one argument. If that argument evaluates to TRUE, the function will return TRUE.
Otherwise, the function will return FALSE. identical() will return TRUE if two objects are identical.
xor() takes two arguments. It stands for exclusive OR: it returns TRUE if exactly one of the arguments
is TRUE and the other one is FALSE.
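A short sketch of the logical operators and functions described above, using arbitrary values:

```r
not_equal <- 5 != 7                                   # TRUE

# & and | work element by element on vectors.
and_vec <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)   # TRUE FALSE FALSE
or_vec <- c(TRUE, FALSE) | c(FALSE, FALSE)               # TRUE FALSE

# && and || work on single values.
and_single <- TRUE && FALSE                           # FALSE
or_single <- TRUE || FALSE                            # TRUE

# AND is evaluated before OR: this is TRUE | (TRUE & FALSE).
precedence <- TRUE | TRUE & FALSE                     # TRUE

is_true <- isTRUE(3 > 2)                              # TRUE
same <- identical(c(1, 2), c(1, 2))                   # TRUE
one_of_two <- xor(TRUE, FALSE)                        # TRUE
both_true <- xor(TRUE, TRUE)                          # FALSE

print(and_vec)
print(or_vec)
print(c(precedence, one_of_two, both_true))
```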

Module 3. Descriptive statistics.
What are descriptive statistics? A way to characterize some data we have collected (our sample)
without attempting to go beyond that data (to understand a population). They only tell us what our
data actually show. Includes a few properties of the data such as

- Where are most of the data concentrated (measures of central tendency)?
- How spread out are the data (measures of dispersion)?
- What is the shape of the distribution (features of the distribution)?

In statistics we fit models to our data (we use a statistical model to represent what is happening in the
real world). The mean is a hypothetical value (it doesn’t have to be a value that actually exists in the
dataset). As such, the mean is a simple statistical model.

The mean is the sum of all scores divided by the number of scores. The mean is also the value
from which the scores deviate least in terms of squared deviations (it has the least error).



The mean is a model of what happens in the real world: it does
tell us a typical score, but it is not a perfect representation of the data. How can we assess how well the
mean represents reality? Think about how good the fit is. You want to be able to calculate the “error”.
The simplest form of error is called a deviation. A deviation is the difference between the mean and
an actual data point. Deviations can be calculated by taking each score and subtracting the mean
from it.



How do you know how much total error there is? We could just take the errors between the mean
and the data and add them up. However, positive and negative errors cancel out: deviations from the
mean always sum to zero. Because of this, we have to do something different. We take the sum of the
squared errors: SS. SS can never be negative!
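The calculation can be sketched in R. The five scores below are an assumed example, chosen because they give a mean of 2.6 and a sum of squares of 5.2; they are not taken from these notes.

```r
# Assumed example scores (not from the notes).
scores <- c(1, 2, 3, 3, 4)

m <- mean(scores)                 # 2.6

# Deviation = score minus mean.
deviations <- scores - m          # -1.6 -0.6 0.4 0.4 1.4

# Positive and negative deviations cancel out (up to rounding error).
total_error <- sum(deviations)

# Squaring removes the signs; the sum of squared errors is SS.
ss <- sum(deviations^2)           # 5.2

print(m)
print(deviations)
print(round(total_error, 10))
print(ss)
```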

What does it mean when the mean is 2.6 and you have an error (SS) of 5.2?
5.2 is larger than the mean. The sum of squares is a good measure of overall variability, but it
depends on the number of scores. We therefore calculate the average variability by dividing SS by the
number of scores minus one (N - 1). This value is called the variance: s^2.



The problem with the variance is that it is measured in units
squared. This isn’t a very meaningful metric, so we take the square root. This is the standard
deviation (s).
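A sketch of the step from sum of squares to variance to standard deviation, checked against R’s built-in var() and sd(). The scores are an assumed example with mean 2.6 and SS 5.2.

```r
# Assumed example scores (not from the notes).
scores <- c(1, 2, 3, 3, 4)
n <- length(scores)

ss <- sum((scores - mean(scores))^2)   # sum of squares: 5.2

variance <- ss / (n - 1)               # 5.2 / 4 = 1.3, in units squared
std_dev <- sqrt(variance)              # back in the original units

print(variance)
print(std_dev)

# The built-in functions also divide by N - 1.
print(var(scores))
print(sd(scores))
```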



The sum of squares, variance and STD (standard deviation) represent the same thing:

- The “fit” of the mean to the data.
- The variability in the data.
- How well the mean represents the observed data.
- All sources of error.

You can have variables with the same mean but a different STD. The STD tells you how well the mean
describes our dataset.
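For instance, the two invented variables below share the same mean, but the mean describes the first one much better than the second:

```r
narrow <- c(4, 5, 6)     # scores close to the mean
wide <- c(0, 5, 10)      # scores far from the mean

print(mean(narrow))      # 5
print(mean(wide))        # 5

print(sd(narrow))        # 1
print(sd(wide))          # 5
```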




Calculations in R:

What is the mean of 3, 4, 5 and NA?

- Pragmatic answer: ignore the missing data and calculate the average of 3, 4 and 5.
- Cautious answer: we don’t know the missing value, so we don’t know the mean either. Mean =
NA.

By default, R gives the cautious answer (NA). To get the pragmatic answer, pass the argument
na.rm = TRUE (“NA remove”) to mean(). Then R will ignore the NA value.

Example: mean(age, na.rm = TRUE).
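Both answers for the question above, sketched in R:

```r
values <- c(3, 4, 5, NA)

# Cautious answer: the default result is NA.
cautious <- mean(values)

# Pragmatic answer: remove the NA before averaging.
pragmatic <- mean(values, na.rm = TRUE)   # (3 + 4 + 5) / 3 = 4

print(cautious)
print(pragmatic)
```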

Calculating a trimmed mean: sometimes the mean isn’t a compelling measure of central tendency
(for example when there is one specific outlier), but we’d prefer not to resort to the median because
the sample size is so small. The trim argument removes a certain fraction of the data from each end
of the sorted values. mean(score, trim = 0.1) gives the 10% trimmed mean, a more robust measure of
central tendency than the mean, because the most extreme values are removed. You want to be very
careful about which data you decide to remove from the sample, and you want a justification for doing so.
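A sketch of the trimmed mean on ten invented scores with one outlier. With trim = 0.1 and ten values, R drops one value from each end of the sorted data.

```r
# Invented scores with one clear outlier (100).
score <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

plain_mean <- mean(score)                 # 14.5, pulled up by the outlier
trimmed_mean <- mean(score, trim = 0.1)   # mean of 2..9 = 5.5
middle <- median(score)                   # 5.5

print(plain_mean)
print(trimmed_mean)
print(middle)
```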

If your data are normally distributed, there is no skewness. Then the mean, median and mode are more
or less the same, and they tell us where the highest frequency of data in our sample is. The
median is the middle value of the distribution and the mode is the value with the highest frequency
(the one that occurs most often). If we have data that are skewed right or left, then our measures of
central tendency (mean, median, mode) differ from each other. The mean is pulled away from
the peak of the distribution more easily than the mode or median.
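As an illustration with invented right-skewed data: the mean is pulled toward the long tail, while the median and mode stay near the bulk of the data. Base R has no function for the statistical mode, so the sketch takes the most frequent value from table().

```r
# Invented right-skewed data: most values are small, two are large.
x <- c(1, 2, 2, 2, 3, 3, 4, 10, 25)

x_mean <- mean(x)       # about 5.78
x_median <- median(x)   # 3

# Mode: the value that occurs most often.
counts <- table(x)
x_mode <- as.numeric(names(counts)[which.max(counts)])   # 2

print(x_mean)
print(x_median)
print(x_mode)
```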

Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller robinvanheesch1. Stuvia facilitates payment to the seller.