Statistics Versatest
Chapter 1: Introduction
A variable can be qualitative (categorical) or quantitative (numerical).
Two types of categorical variables:
1. Nominal variables: the categories have no logical order; any ordering is arbitrary. For example:
single, married, divorced.
2. Ordinal variables: the categories have a logical order. For example: socio-economic status (low, medium, high).
Categorical variables with only two categories, like diabetes (yes/no), are called binary or
dichotomous variables.
For a quantitative (numerical) variable, the numerical value has a natural meaning. The variable can
be discrete: only certain numerical values are possible, for example the number of children in a
household. Or continuous: all numbers within a range are possible, for example height or birth weight.
Chapter 3: Probabilities
The classical definition of probability: if all possible outcomes are equally likely, the probability of a
certain event A equals the number of outcomes in favor of A divided by the total number of possible
outcomes. In a formula:
P(A) = number of outcomes in favor of A / total number of possible outcomes
If we do not know whether the outcomes are equally likely, we cannot use the classical definition and
will have to use the empirical definition of probability:
P(A) = lim (n → ∞) (number of times A is observed) / n
In words: the probability that event A will happen is equal to the limit of the ratio of the number of
times event A was observed to n (the number of experiments), for n going to infinity.
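As an illustration of the empirical definition, the following Python sketch (standard library only; the die experiment is a made-up example) estimates the probability of throwing a six with a fair die. The observed fraction approaches the classical value 1/6 as n grows:

import random

random.seed(1)
for n in (100, 10_000, 1_000_000):
    # count how often the event 'a six is thrown' occurs in n throws
    sixes = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
    print(n, sixes / n)  # empirical estimate of P(six); true value 1/6 = 0.1667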
The complement of an event A, written as A^c, is defined as everything but event A. The complement
rule gives P(A^c) = 1 - P(A).
A conditional probability is the probability of an event, given a certain condition. For the probability
of event A happening, given that event B occurred, we use the notation P(A | B).
If P(A | B) = P(A), we say A and B are independent.
If A and B are independent of each other, the probability that both A and B will happen is P(A ∩ B) =
P(A) × P(B). This is called the product rule.
If A and B are not independent of each other, the probability that A and B will both happen is P(A ∩ B)
= P(A) × P(B | A).
P(A ∪ B) means the probability that the union of A and B will happen: event A happens, event B
happens, or both happen at the same time. For calculations, we will use the rule of addition: P(A ∪ B)
= P(A) + P(B) - P(A ∩ B).
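A minimal numerical sketch of the product rule and the rule of addition, assuming two independent events A = 'first die shows a six' and B = 'second die shows a six':

p_a, p_b = 1 / 6, 1 / 6

p_a_and_b = p_a * p_b             # product rule (valid because A and B are independent)
p_a_or_b = p_a + p_b - p_a_and_b  # rule of addition

print(p_a_and_b)  # P(A ∩ B) = 1/36 ≈ 0.028
print(p_a_or_b)   # P(A ∪ B) = 11/36 ≈ 0.306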
Qualities of a diagnostic test are given by its sensitivity and its specificity. Both are conditional
probabilities: sensitivity is the probability of a positive test result given that the disease is present,
and specificity is the probability of a negative test result given that the disease is absent.
For calculating the positive predictive value of a diagnostic test, we make use of Bayes' rule. This
formula can be used to calculate a predictive value as a function of the qualities of a diagnostic test
and the prevalence of the disease.
We already saw that P(A ∩ B) = P(A) × P(B | A). Interchanging the roles of A and B, we can also write
P(A ∩ B) = P(B ∩ A) = P(B) × P(A | B). It follows that P(A) × P(B | A) = P(B) × P(A | B), and dividing both
sides by P(A) gives us Bayes' rule: P(B | A) = P(A | B) × P(B) / P(A).
Applied to tuberculosis (TB+/TB-) and a positive test result (M+):
P(TB+ | M+) = P(M+ | TB+) × P(TB+) / P(M+)
= P(M+ | TB+) × P(TB+) / (P(M+ | TB+) × P(TB+) + P(M+ | TB-) × P(TB-))
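A minimal sketch of this calculation in Python; the sensitivity, specificity, and prevalence below are made-up illustration values, not figures from the text:

sensitivity = 0.90  # P(M+ | TB+)
specificity = 0.95  # P(M- | TB-), so P(M+ | TB-) = 1 - specificity
prevalence = 0.01   # P(TB+)

# total probability of a positive test result: P(M+)
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_pos  # Bayes' rule: P(TB+ | M+)
print(ppv)  # ≈ 0.154: at low prevalence, most positive results are false positives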
Chapter 4: Probability distributions
The Bernoulli experiment: there is a binary outcome with a fixed probability of an 'event'. If we have
an experiment with n Bernoulli trials, all independent of each other with equal probability of the
event φ, and we are interested in the number of events X (sometimes called 'the number of
successes'), we have a binomial distribution. In mathematical notation: X ~ B(n, φ).
The cumulative probability of observing k events is defined as the probability of observing at most k
events, so the probability of observing k events or less.
In general, the number of combinations (n over k) can be calculated as n! / ((n - k)! × k!).
Example: in a family of four children, each child independently has probability 0.25 of having disease
S. Calculate the probability that exactly three of the four children will have disease S:
P(X = 3) = (4 over 3) × 0.25^3 × 0.75^1.
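The same calculation in Python (standard library only), also showing the cumulative probability P(X ≤ 3):

import math

n, k, p = 4, 3, 0.25
p_exactly_3 = math.comb(n, k) * p**k * (1 - p)**(n - k)
p_at_most_3 = sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

print(p_exactly_3)  # 4 × 0.25^3 × 0.75 ≈ 0.0469
print(p_at_most_3)  # P(X ≤ 3) = 1 - P(X = 4) ≈ 0.9961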
The normal distribution, also called the Gaussian distribution, plays an important role. Some variables
have a symmetric, bell-shaped distribution. All normal distributions share some common features:
1. the distribution is symmetric around the mean, so median and mean coincide;
2. about 68% of all observations are located between one standard deviation below the mean and
one standard deviation above the mean;
3. about 95% of all observations are located between two standard deviations below the mean and
two standard deviations above the mean;
4. about 99.7% of all observations are located between three standard deviations below the mean
and three standard deviations above the mean.
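These percentages can be checked numerically with the standard normal cumulative distribution function; the sketch below assumes scipy is available:

from scipy.stats import norm

for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)  # P(-k < Z < k) for the standard normal
    print(k, round(area, 4))           # 0.6827, 0.9545, 0.9973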
For finding probabilities in normal distributions, tables have been published. These tables are based
on the so-called standard normal distribution: the normal distribution with mean 0 and standard
deviation equal to 1. This distribution is also named the z-distribution.
All areas under the curve of a normally distributed variable X, with arbitrary mean μ and standard
deviation σ, can be transformed to an area in the z-distribution by using the so-called
standardization formula.
If we want to determine an area (for example, the red area in a figure), we can make use of a table if
we transform the graph into a standard normal graph. First, we shift the graph to a graph with mean
0, by subtracting the mean of the original distribution. The second step is dividing the result by the
standard deviation of the original distribution: Z = (X - μ) / σ.
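A minimal sketch of the standardization step, using made-up values (height with mean μ = 175 cm and standard deviation σ = 10 cm); scipy is assumed available:

from scipy.stats import norm

mu, sigma = 175.0, 10.0
x = 190.0

z = (x - mu) / sigma           # standardization: Z = (X - μ) / σ
print(norm.cdf(z))             # P(X < 190) = P(Z < 1.5) ≈ 0.9332
print(norm.cdf(x, mu, sigma))  # same area computed without standardizing by hand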