Summary of the book, The Analysis of Biological Data
Chapters 1-4, 6
Authors: Whitlock and Schluter
Second edition
Chapter 1 Statistics and samples
Estimation is the process of inferring an unknown quantity of a population using sample
data.
A parameter is a quantity describing a population, whereas an estimate or statistic is a
related quantity calculated from a sample.
Statistics is also about hypothesis testing. A statistical hypothesis is a specific claim
regarding a population parameter. Hypothesis testing uses data to evaluate evidence for or
against statistical hypotheses.
A population is all the individual units of interest, whereas a sample is a subset of units
taken from the population, usually a much smaller set of individuals.
An estimate calculated from a sample almost never matches the population parameter
exactly. This chance difference from the truth is called sampling error. The spread of estimates
resulting from sampling error indicates the precision of an estimate. The lower the sampling
error, the higher the precision. Larger samples are less affected by chance and so, all else
being equal, larger samples will have lower sampling error and higher precision than smaller
samples.
Sampling error is the chance difference between an estimate and the population parameter
being estimated.
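The link between sample size and precision can be checked with a small simulation. The sketch below is only an illustration: the population (10,000 values with mean near 50 and standard deviation near 10), the sample sizes, the seed, and the helper name sample_means are all assumptions made for this example, not values from the book.

```python
import random
import statistics

def sample_means(population, n, repeats, rng):
    """Draw `repeats` random samples of size n and return the mean of each."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 measurements (arbitrary units).
population = [rng.gauss(50, 10) for _ in range(10_000)]
parameter = statistics.mean(population)  # the population mean is the parameter

# Each list holds 500 estimates of the mean, one per random sample.
small = sample_means(population, 10, 500, rng)
large = sample_means(population, 100, 500, rng)

# The spread of the estimates reflects sampling error: less spread, more precision.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"parameter:                  {parameter:.2f}")
print(f"spread of estimates, n=10:  {spread_small:.2f}")
print(f"spread of estimates, n=100: {spread_large:.2f}")
```

Under these assumptions the estimates from samples of 100 should cluster more tightly around the parameter than those from samples of 10.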
Ideally, our estimate is accurate (or unbiased), meaning that the average of estimates that
we might obtain is centered on the true population value. If a sample is not properly taken,
measurements made on it might systematically underestimate (or overestimate) the
population parameter. This is a second kind of error called bias.
Bias is a systematic discrepancy between the estimates we would obtain, if we could
sample a population again and again, and the true population characteristic.
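Bias differs from sampling error in that it does not average out over repeated samples. The sketch below contrasts a proper random sample with a flawed scheme; the population, the "only the smallest 70% can be caught" rule, the sample size, and the seed are hypothetical choices made for this illustration.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 measurements (arbitrary units).
population = [rng.gauss(50, 10) for _ in range(10_000)]
parameter = statistics.mean(population)

# Hypothetical flawed scheme: only the smallest 70% of individuals can be caught.
catchable = sorted(population)[:7_000]

# Repeat each sampling scheme 500 times, estimating the mean each time.
random_est = [statistics.mean(rng.sample(population, 50)) for _ in range(500)]
biased_est = [statistics.mean(rng.sample(catchable, 50)) for _ in range(500)]

# Average discrepancy between the estimates and the true value.
random_offset = statistics.mean(random_est) - parameter
biased_offset = statistics.mean(biased_est) - parameter

print(f"average offset, random sampling: {random_offset:+.2f}")
print(f"average offset, flawed sampling: {biased_offset:+.2f}")
```

With random sampling the individual estimates still vary, but their average sits close to the parameter. With the flawed scheme the average stays below it no matter how many times the sampling is repeated.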
A sample of convenience is a collection of individuals that are easily available to the
researcher.
Volunteer bias is a bias resulting from a systematic difference between the pool of
volunteers (the volunteer sample) and the population to which they belong.
Compared with the rest of the population, volunteers might be:
- More health conscious and more proactive;
- Low-income (if volunteers are paid);
- More ill, particularly if the therapy involves risk, because individuals who are dying
anyway might try anything;
- More likely to have time on their hands (e.g., retirees and the unemployed are more
likely to answer telephone surveys);
- More angry, because people who are upset are sometimes more likely to speak up; or
- Less prudish, because people with liberal opinions about sex are more likely to speak
to surveyors about sex.
Categorical data are qualitative characteristics of individuals that do not have magnitude on
a numerical scale. Examples include:
- Survival (alive or dead),
- Sex chromosome genotype (e.g., XX, XY, XO, XXY, or XYY),
- Method of disease transmission (e.g., water, air, animal vector, or direct contact),
- Predominant language spoken (e.g., English, Mandarin, Spanish, Indonesian, etc.),
- Life stage (e.g., egg, larva, juvenile, subadult, or adult),
- Snakebite severity score (e.g., minimal severity, moderate severity, or very severe),
- Size class (e.g., small, medium, or large)
A categorical variable is nominal if the different categories have no inherent order.
Nominal means “name.”
The values of an ordinal categorical variable can be ordered.
A variable is numerical when measurements of individuals are quantitative and have
magnitude. These variables are numbers. Measurements that are counts, dimensions,
angles, rates, and percentages are numerical. (Numerical data are quantitative
measurements that have magnitude on a numerical scale.)
Often when association between two variables is investigated, a goal is to assess how well
one of the variables, deemed the explanatory variable, predicts or affects the other
variable, called the response variable. When conducting an experiment, the treatment
variable (the one manipulated by the researcher) is the explanatory variable, and the
measured effect of the treatment is the response variable. For example, the administered
dose of a toxin in a toxicology experiment would be the explanatory variable, and organism
survival would be the response variable.
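The toxicology example can be laid out as data to make the two roles concrete. This is a minimal sketch with made-up records: the doses, their units, and the outcomes are hypothetical and serve only to show which variable is which.

```python
from collections import defaultdict

# Hypothetical records: (dose in mg, survived?).
# Dose is the explanatory variable; survival is the response variable.
records = [
    (0, True), (0, True), (0, True), (0, False),
    (5, True), (5, True), (5, False), (5, False),
    (10, True), (10, False), (10, False), (10, False),
]

# Group the response values by each level of the explanatory variable.
outcomes = defaultdict(list)
for dose, survived in records:
    outcomes[dose].append(survived)

# Proportion surviving at each dose (True counts as 1, False as 0).
survival_by_dose = {dose: sum(s) / len(s) for dose, s in sorted(outcomes.items())}

for dose, proportion in survival_by_dose.items():
    print(f"dose {dose:>2} mg: proportion surviving = {proportion:.2f}")
```

Summarizing the response at each level of the explanatory variable is one simple way to look at how well the first predicts the second.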
The frequency distribution describes the number of times each value of a variable occurs in
a sample.
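A frequency distribution for a categorical variable is just a tally of each category. The sketch below uses Python's collections.Counter on a hypothetical sample of 12 life-stage records; the data are invented for illustration.

```python
from collections import Counter

# Hypothetical sample: life stage recorded for 12 individuals.
life_stages = [
    "egg", "larva", "adult", "larva", "juvenile", "adult",
    "larva", "egg", "adult", "adult", "subadult", "larva",
]

# Count the number of times each value occurs in the sample.
frequency = Counter(life_stages)

# Print in the natural order of the life stages rather than by count.
for stage in ["egg", "larva", "juvenile", "subadult", "adult"]:
    print(f"{stage:<9}{frequency[stage]}")
```

The counts always sum to the sample size, which is a quick check that no observation was dropped.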