MRM I NOTES
WEEK 1: DATA
What is “data”?
- Various properties (variables) measured on a set of things or people (units)
- Data has a fixed structure
o Each column = one property (variable) of the units
o Each row = one unit, the thing we’re studying
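A minimal sketch of this row/column structure in Python. The dataset, the variable names, and the values are all made up for illustration and do not come from the course.

```python
# Hypothetical dataset: each dict is one row (a unit),
# each key is one column (a variable).
students = [
    {"name": "Ana", "height_cm": 158.0, "diet": "vegan"},
    {"name": "Ben", "height_cm": 172.5, "diet": "omnivore"},
    {"name": "Cas", "height_cm": 185.0, "diet": "vegetarian"},
]

# One row = all variables measured on one unit.
first_unit = students[0]

# One column = one variable across all units.
heights = [row["height_cm"] for row in students]

print(first_unit)
print(heights)
```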
What is a “case” or “unit”?
- Experimental or observational entity being measured (students, cats etc.)
- Each case or unit has variables
Types of measurement
- Categorical measurements shown with distinct categories
o Binary variable
Two categories (dead/alive, black/white).
Offers the least amount of information
Need a minimum of 300 units for a good sample:
The less information a measurement carries, the larger the sample needs to be
o Nominal variable
Has several categories (omnivore, vegetarian or vegan)
o Ordinal variable
Assesses value on a scale (bad, ok, good, great)
- Numerical measurements: shown with numbers
o Discrete data:
Whole-number counts (number of defects, delayed flights)
Cannot be negative, must be countable
o Continuous data:
A numerical value (body temp, height, weight)
Offers the most information
Need a minimum of 30 units for a good sample
Information amounts:
- We can always downshift to a lower level of information (e.g. from discrete to nominal),
which discards some of the information
- We cannot shift upwards: we cannot add information
- Downshifting information is non-reversible. Example, from body length to categories:
o Body length less than 160 cm we convert to category “small”
o Body length between 160 cm and 180 cm we convert to category “med”
o Body length greater than 180 cm we convert to category “tall”
The lower the amount of information, the larger the sample needs to be
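The body-length example can be sketched in Python as below. The function name `height_category` and the sample heights are hypothetical. The notes do not say where exactly 160 cm and 180 cm belong, so placing both in “med” is an assumption.

```python
def height_category(height_cm: float) -> str:
    """Downshift a continuous body length to one of three ordered categories."""
    if height_cm < 160:
        return "small"
    if height_cm <= 180:  # assumption: exactly 160 and 180 count as "med"
        return "med"
    return "tall"

heights_cm = [155.2, 160.0, 171.8, 180.0, 192.4]  # hypothetical measurements
categories = [height_category(h) for h in heights_cm]
print(categories)

# Different heights (160.0, 171.8, 180.0) all become "med", so the original
# numbers cannot be recovered from the categories: the step is non-reversible.
```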
Complementary research methods:
- Research involving numbers = quantitative methods
- Research analyzing language = qualitative methods
Research:
- Start with a question you want to answer, based on observation, anecdotal evidence, etc.
- Generate theory, generate hypothesis
- Identify variables
- Gather data, measure variables
- Analyze data, fit a model to the data
- Theory is supported or not
Falsification: proving a theory wrong
Variables: things that change based on circumstances
- Independent variable: the “causing” variable (predictor), shown as “x”
- Dependent variable: the “effect” variable (outcome)
- Categorical variable: things that belong to various categories
- Binary variables: things that fall into two categories
- Nominal variable: a categorical variable with more than two categories and no inherent
order (cat/brown, cat/black, cat/white, cat/orange); the categories can be coded with
numbers, but the numbers are only labels
- Ordinal variable: a categorical variable whose categories have a meaningful order
(awful, bad, ok, good, great)
- Continuous variable: gives a score for each thing being measured, can take on any
value (negative, decimal, etc.)
- Interval variable: equal intervals on the scale represent equal differences in the
property being measured
- Ratio variable: ratios of values along the scale also have meaning: there is a true
“zero”, so a score of “4” is twice as much as a score of “2”
- Discrete variable: whole numbers only (counts), no negative numbers
Parameters:
- Different from variables
- Constants believed to be fundamental truths
- The mean and median represent the center of the distribution
Level of measurement:
- The relationship between what is being measured and the numbers that represent what
is being measured
Validity: whether an instrument is measuring what it is supposed to measure
- Criterion validity: how well one measure predicts an outcome for another measure
(a high GRE score predicts how well someone does in school)
- Concurrent validity: data are recorded simultaneously using the new instrument and
existing criteria, then compared
- Predictive validity: data from a new instrument are used to predict observations at a
later point in time
- Content validity: the degree to which individual items represent the construct being
measured
Reliability:
- Whether an instrument measures consistently across different situations
- Test-retest reliability: test the same group twice; a reliable instrument produces
similar results at different points in time
Correlational research:
- We observe what naturally goes on without directly interfering with it
- Includes cross-sectional research: a snapshot of many variables at a single point in time
Experimental research: we manipulate one variable to see its effect on another
Assessing data
Is the sample representative?
- Does the sample represent the total population?
- Can the sample findings be generalized to an entire population?
- E.g. only sampling students in Amsterdam for a country-wide study is not representative
Is the data valid?
- Does data reflect what it should reflect?
- Can it be used to answer the research question?
- “Face validity check”: checking data for obvious errors and mistakes
- Were there other problems / irregularities during measurement?
Is there a measurement error?
- Discrepancy between the actual (real life) value we are trying to measure and the
number we use to represent that value
- Example: you (in reality) weigh 80 kg. According to your bathroom scale, you weigh
83 kg. The measurement error is 3 kg.
- Two types of measurement error:
o Systematic: problem with the system; the results are consistent but all “off” in
the same direction, e.g. the measuring tool isn’t calibrated
o Random: problems unrelated to the system; the results are all over the place,
e.g. the measuring tool is imprecise or unreliable
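A small sketch of how the two error types differ, built on the bathroom-scale example. The readings and the helper `describe_error` are invented for illustration. The average error stands in for the systematic part and the standard deviation of the readings for the random part, which is a simplification.

```python
from statistics import mean, stdev

true_weight_kg = 80.0  # the real value, as in the bathroom-scale example

# Hypothetical repeated readings from two scales (invented for illustration).
miscalibrated_scale = [83.0, 83.1, 82.9, 83.0]  # consistent, but shifted upward
imprecise_scale = [77.0, 84.0, 79.0, 80.0]      # scattered around the true value

def describe_error(readings, true_value):
    """Return (average error, spread of the readings)."""
    errors = [r - true_value for r in readings]
    return mean(errors), stdev(readings)

bias_a, spread_a = describe_error(miscalibrated_scale, true_weight_kg)
bias_b, spread_b = describe_error(imprecise_scale, true_weight_kg)

print(f"miscalibrated: average error {bias_a:.2f} kg, spread {spread_a:.2f} kg")
print(f"imprecise:     average error {bias_b:.2f} kg, spread {spread_b:.2f} kg")
```

With these made-up readings the first scale shows a large average error and a small spread (mostly systematic error), and the second shows the reverse (mostly random error).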
Operational definition:
- Describes the procedure used to measure a variable
Describing data: location
Median:
- The “middle score” when data is ordered
- In an ordered list of 11 data points, the 6th is the median
- For an even number of results, use the mean of the two central numbers
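The two median rules above can be sketched as a short Python function. The data are hypothetical, and the function assumes a non-empty list.

```python
def median(values):
    """Middle score of the ordered data; mean of the two central scores if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [7, 3, 9, 1, 5, 11, 2, 8, 4, 10, 6]  # 11 hypothetical data points
even = [4, 1, 3, 2]                         # 4 hypothetical data points

print(median(odd))   # the 6th value of the ordered list
print(median(even))  # mean of the two central values
```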
Mean:
- The sum of the data (sigma) divided by the number of datapoints (n)
- Symbolized by “x bar”
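As a quick illustration of the definition, a minimal Python sketch with made-up scores:

```python
def mean(values):
    """x-bar: the sum of the data divided by the number of data points (n)."""
    return sum(values) / len(values)

scores = [2, 4, 6, 8]  # hypothetical data
x_bar = mean(scores)   # (2 + 4 + 6 + 8) / 4
print(x_bar)
```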
If the median is lower than the mean…
- You have major outliers in the high end of the distribution
- E.g. one Bill Gates among otherwise ordinary people when measuring personal net worth
- The distribution is positively skewed (long tail to the right)
If the median is higher than mean …
- You have major outliers in the low end of the distribution
- E.g. one person with a large gambling debt among otherwise ordinary people when
measuring personal net worth
- The distribution is negatively skewed (long tail to the left)
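The median-versus-mean comparison can be tried out on made-up net-worth figures. The numbers and the helper `skew_direction` are hypothetical. The comparison is only the rule of thumb from these notes, not a formal skewness statistic.

```python
from statistics import mean, median

# Hypothetical net worths in thousands (invented for illustration).
typical = [40, 55, 60, 70, 85]
with_billionaire = typical + [100_000_000]  # one extreme high outlier
with_debtor = typical + [-5_000]            # one extreme low outlier

def skew_direction(values):
    """Rule of thumb from the notes: compare the median with the mean."""
    m, med = mean(values), median(values)
    if med < m:
        return "positive"
    if med > m:
        return "negative"
    return "none"

print(skew_direction(with_billionaire))
print(skew_direction(with_debtor))
```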
Describing data: distribution
Range:
- The smallest value subtracted from the largest
- E.g. highest value = 100, smallest = 20: range = 100 - 20 = 80
- Very sensitive to outliers
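A short sketch of the range and its sensitivity to outliers. The scores are hypothetical, chosen so the largest is 100 and the smallest is 20 as in the example.

```python
scores = [20, 45, 50, 55, 100]  # hypothetical data: largest 100, smallest 20
data_range = max(scores) - min(scores)

# Adding a single extreme value changes the range drastically.
with_outlier = scores + [1000]
range_with_outlier = max(with_outlier) - min(with_outlier)

print(data_range)
print(range_with_outlier)
```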
Interquartile range:
- The spread of the middle 50% of the data
- Upper quartile minus lower quartile
- Upper quartile: the median of the upper half of the data
- Lower quartile: the median of the lower half of the data
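A minimal sketch of the interquartile range using the “median of each half” definition above. As an extra assumption, the middle value is left out of both halves when the number of data points is odd; other conventions exist and can give slightly different quartiles. The function names and data are hypothetical, and at least two data points are needed.

```python
from statistics import median

def quartiles(values):
    """Return (lower quartile, upper quartile) as the medians of each half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    # For odd n, the middle value is excluded from both halves (an assumption).
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(upper_half)

def interquartile_range(values):
    """Upper quartile minus lower quartile."""
    lower_q, upper_q = quartiles(values)
    return upper_q - lower_q

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical data
print(quartiles(data))
print(interquartile_range(data))
```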