DV = outcome metric: ratio/interval = continuous
IV = explanatory non-metric: ordinal/nominal = categorical
L1: Chapter 2 Book p. 31-88 (see images)
1. Univariate profiling: the starting point for understanding the nature of any variable is to
characterize the shape of its distribution. → Histogram.
2. Bivariate profiling: relationships between two variables. → Scatterplot. This is a
graph of data points based on two metric variables.
Points forming a straight line indicate a linear relationship (correlation). A curved set of
points can denote a nonlinear relationship, and a random pattern may indicate no relationship.
Bivariate profiling, examining group differences: use a boxplot, a pictorial representation
of the data distribution of a metric variable for each group of a nonmetric variable.
Boxplot: the box holds the middle 50% of the values, from the 25th to the 75th percentile.
The line inside the box is the median. A longer box means a bigger spread (higher standard
deviation). Whiskers: lines extending from each box.
Outliers = observations 1.0 to 1.5 quartiles (25%-37.5%) away from the box.
Extreme values = observations more than 1.5 quartiles away from the box.
Both are depicted by symbols outside the whiskers.
3. Multivariate profiling: to compare observations characterized on a multivariate profile,
we can use multivariate graphical displays. There are three types:
1. Direct portrayal of the data values.
2. Mathematical transformation of the original data into a mathematical relationship that
can be plotted, e.g. Andrews' Fourier transformation.
3. Iconic representation, most commonly a face (e.g. Chernoff faces).
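The boxplot quantities described above can be sketched in a few lines of Python. This is only an illustration: the function names and the income figures are made up, and reading "quartiles away from the box" as multiples of the box length is an assumption about what the notes mean.

```python
import statistics

def boxplot_summary(values):
    """Quartiles and box length (interquartile range) of a metric variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "box_length": q3 - q1}

def classify(value, summary, inner=1.0, outer=1.5):
    """Label a value by its distance from the box, measured in box lengths.

    Assumption: "quartiles away from the box" is read as multiples of the
    box length; the thresholds 1.0 and 1.5 are the ones given in the notes.
    """
    box = summary["box_length"]
    if box == 0:
        raise ValueError("box length is zero; distances are undefined")
    if summary["q1"] <= value <= summary["q3"]:
        return "inside box"
    edge = summary["q1"] if value < summary["q1"] else summary["q3"]
    distance = abs(value - edge) / box
    if distance > outer:
        return "extreme value"
    if distance >= inner:
        return "outlier"
    return "within whiskers"

# Hypothetical incomes (in thousands) for one group of a nonmetric variable
incomes = [30, 35, 40, 45, 50, 55, 60]
summary = boxplot_summary(incomes)
labels = {v: classify(v, summary) for v in (45, 60, 70, 80)}
```

Statistical packages may use different cut-offs for their boxplot symbols, so the thresholds are kept as parameters.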
MISSING DATA, we need to worry because:
1. Practical impact: the reduction of sample size n.
2. Substantive impact: results based on data with a nonrandom missing data process could
be biased.
4 step process for identifying missing data and applying remedies:
STEP 1: Determine the type of missing data
Ignorable missing data: the missing data is expected and part of the research design, so
the technique used already allows for it.
• When taking a sample, the part of the population that is not included in the sample is missing.
• Skip patterns built into the survey: respondents skip questions depending on certain answers.
• Censored data: respondents cannot give complete information (because of time or death).
Not ignorable missing data:
• Known missing data processes: many are known to the researcher (e.g. failure to complete
the survey), and some remedies can be used.
• Unknown missing data processes are less easily identified (e.g. refusal to answer because
of the sensitive nature of a question). When the missing data occur at random, remedies may
be available.
STEP 2: Determine the extent of missing data
Missing data under 10% for an individual case can generally be ignored, unless the data are
missing in a nonrandom fashion.
Before step 3, the researcher should consider deleting individual cases and/or variables.
Variables with as little as 15% missing data are candidates for deletion, but higher levels
(20-30%) can often be remedied.
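A minimal sketch of how the extent of missing data could be tabulated, per variable and per case. The survey rows, the variable names and the helper functions are hypothetical. Missing values are assumed to be coded as `None`.

```python
def missing_share_by_variable(rows):
    """Share of missing (None) values for each variable across all cases."""
    n = len(rows)
    return {var: sum(r[var] is None for r in rows) / n for var in rows[0]}

def missing_share_by_case(rows):
    """Share of missing (None) values for each case across all variables."""
    return [sum(v is None for v in r.values()) / len(r) for r in rows]

# Hypothetical survey data: each dict is one case (respondent)
survey = [
    {"age": 34, "income": 52, "satisfaction": 4},
    {"age": 41, "income": None, "satisfaction": 5},
    {"age": None, "income": None, "satisfaction": 3},
    {"age": 29, "income": 47, "satisfaction": 4},
    {"age": 55, "income": 61, "satisfaction": 2},
]

by_variable = missing_share_by_variable(survey)
by_case = missing_share_by_case(survey)

# Cases at or above the 10% level from the notes deserve a closer look
cases_to_review = [i for i, share in enumerate(by_case) if share >= 0.10]
```

Here `by_variable` shows that income is missing for 2 of the 5 cases, and `cases_to_review` lists the positions of the cases that exceed the 10% level.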
STEP 3: Diagnose the randomness of the missing data process
• Missing data are termed missing at random (MAR) if the missing values of Y depend
on X, but not on Y itself. → the observed values of Y are not generalizable to the population.
• A higher level of randomness is termed missing completely at random (MCAR). The
cases with missing data are indistinguishable from cases with complete data.
Diagnostic tests for levels of randomness:
1. Form 2 groups: cases with missing data for Y and cases with valid values for Y. → Test
whether significant differences exist between the two groups on other variables.
2. Second approach: an overall test of randomness to determine whether the data can be
classified as MCAR.
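The first diagnostic (two groups, then a test for differences) could look roughly like this. The choice of Welch's t statistic is an assumption, since the notes do not name a specific test. The data and function names are invented for illustration.

```python
import math
import statistics

def split_by_missing(rows, target, other):
    """Values of `other`, split by whether `target` is missing (None)."""
    missing = [r[other] for r in rows if r[target] is None and r[other] is not None]
    valid = [r[other] for r in rows if r[target] is not None and r[other] is not None]
    return missing, valid

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

# Hypothetical cases: income (Y) is missing mainly for older respondents (X = age)
cases = [
    {"age": 25, "income": 40}, {"age": 30, "income": 45},
    {"age": 28, "income": 42}, {"age": 35, "income": 50},
    {"age": 60, "income": None}, {"age": 65, "income": None},
    {"age": 58, "income": None},
]

age_when_missing, age_when_valid = split_by_missing(cases, "income", "age")
t_statistic = welch_t(age_when_missing, age_when_valid)
```

A large absolute t value, as in this constructed example, suggests that missingness on Y is related to X, so the data would not qualify as MCAR.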
STEP 4: Select the imputation method
Imputation is the process of estimating the missing value based on valid values of other
variables and/or cases in the sample.
Imputation methods (advantages, disadvantages, best used when):
Complete data: use only observations with complete data
  Advantages: simplest; default for programs
  Disadvantages: reduction in n; affected by nonrandom processes
  Best used when: large n; strong relationships; low missing data
All available data
  Advantages: maximizes use of data; results in largest n
  Disadvantages: varying sample n; out-of-range values
  Best used when: low missing data; moderate relationships
Case substitution: replace entire observation
  Advantages: realistic values
  Disadvantages: must have similar additional cases
  Best used when: additional cases available
Hot and cold deck imputation (hot: other observation in the list; cold: external case input)
  Advantages: replaces missing data with actual values from similar cases
  Disadvantages: must define suitably similar cases
  Best used when: established replacement value is known
Mean substitution: replace missing values with the mean value
  Advantages: easy; all cases complete
  Disadvantages: reduces variance; lowers correlation
  Best used when: low missing data; strong relationships
Regression imputation: predict missing data from its relationship to other variables
  Advantages: employs relationships; values are connected
  Disadvantages: reduces generalizability; needs strong relationships
  Best used when: moderate/high missing data; strong relationships
Model-based methods: involve the missing data in the model
  Advantages: accommodates nonrandom and random data processes; best representation
  Disadvantages: complex model; requires specialized software; not readily available
  Best used when: only method that can accommodate nonrandom missing data
Under 10% - any imputation.
10%-20% - all-available, hot deck, case substitution and regression for MCAR. Model-based for MAR.
20%+ - regression for MCAR and model-based for MAR.
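To make the "reduces variance" disadvantage of mean substitution concrete, here is a small sketch with invented numbers. The function name is hypothetical and missing values are assumed to be coded as `None`.

```python
import statistics

def mean_substitution(values):
    """Replace missing entries (None) with the mean of the valid entries."""
    valid = [v for v in values if v is not None]
    fill = statistics.mean(valid)
    return [fill if v is None else v for v in values]

# Hypothetical variable with two missing values
observed = [10, 12, None, 15, None, 18, 20]
imputed = mean_substitution(observed)

valid_only = [v for v in observed if v is not None]
variance_before = statistics.variance(valid_only)  # valid cases only
variance_after = statistics.variance(imputed)      # after filling in the mean
```

Every imputed value sits exactly at the mean, so the sum of squared deviations stays the same while n grows, and the variance shrinks.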
Outliers:
-Practical: 20 persons have incomes between 30k and 60k, average 48k. When 1 person with an
income of 1 million is added, the average jumps to over 90k (about 93k). The researcher must
assess whether the value is retained or eliminated due to its undue influence on results.
-Substantive: an outlier must be viewed in light of how representative it is of the
population. If there is a group of millionaires, it can be retained, but if he is the only
one, it may be deleted.
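The arithmetic behind the income example can be checked directly. The variable names are only illustrative.

```python
# Figures from the example: 20 incomes averaging 48k, plus one income of 1 million
n_people = 20
average_income = 48_000
outlier_income = 1_000_000

total_without = n_people * average_income
average_with = (total_without + outlier_income) / (n_people + 1)
increase_factor = average_with / average_income
```

With these figures a single observation almost doubles the average (to roughly 93k), which is the "undue influence" the researcher has to judge.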
Why do outliers occur?
1. Procedural error: data entry error, mistake in coding
2. Extraordinary event: a unique real observation, researcher decides if it fits objective
3. Extraordinary observations with no explanation: researcher decides (most likely to be deleted)
4. Unique in their combination: not high or low on any single variable, but unique in the
combination of values across variables.
Detecting outliers:
- Univariate detection: convert the values to standard scores (z-scores), look at which
observations fall at the outer ranges of the distribution, then decide. Outlier: standard
score of 2.5 or higher.
- Bivariate detection: pairs of variables can be assessed jointly through a scatterplot. The
downside is the number of graphs: with 5 variables we already have 10 graphs.
- Multivariate detection: more than 2 variables. Bivariate detection becomes inadequate
because it needs a lot of graphs and each graph is limited to two dimensions. A high
Mahalanobis D2 value marks an observation farther removed from the general distribution.
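The three detection approaches above can be sketched as follows. Treating D2 as the squared Mahalanobis distance is an assumption about the notes. The two-variable version is used only to keep the arithmetic readable, and all data and function names are invented.

```python
import math
import statistics

def univariate_outliers(values, threshold=2.5):
    """Values whose absolute standard score reaches the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs((v - mean) / sd) >= threshold]

def scatterplots_needed(n_variables):
    """Number of distinct variable pairs, i.e. scatterplots to inspect."""
    return math.comb(n_variables, 2)

def mahalanobis_d2(point, data):
    """Squared Mahalanobis distance of a 2-variable point from the data centroid."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx, syy = statistics.variance(xs), statistics.variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (len(data) - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    dx, dy = point[0] - mx, point[1] - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Univariate: 20 ordinary incomes (in thousands) plus one of 1 million
incomes = list(range(30, 50)) + [1000]
flagged = univariate_outliers(incomes)

# Multivariate: x and y rise together; (1, 5) is ordinary on each variable
# separately but unusual as a combination
pairs = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 3.8), (5, 5.0)]
d2_typical = mahalanobis_d2((5, 5), pairs)
d2_unusual = mahalanobis_d2((1, 5), pairs)
```

The point (1, 5) gets a far larger D2 than (5, 5) even though neither coordinate is extreme on its own, which is the "unique in their combination" case.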
Four important statistical assumptions
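1. Normality: the shape of the distribution. Departures from normality are described by
kurtosis (peakedness or flatness) and skewness (balance of the distribution).
2. Homoscedasticity: equal spread of the dependent variable across values of the predictors.
Its violation, heteroscedasticity, is often the result of nonnormality of one or more variables.
3. Linearity: nonlinear relationships can often be remedied by transforming the variables so
that the relationship becomes linear.
4. Absence of correlated errors: when correlated errors are present but go undetected,
serious biases can occur.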
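Skewness and kurtosis, the two shape descriptors named under normality, can be computed from the moments of the data. This sketch uses the simple population-moment formulas, which is an assumption. Statistical packages often report bias-corrected versions, so their numbers may differ slightly.

```python
import statistics

def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """0 for a symmetric distribution, positive for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """0 for a normal distribution, negative for flatter, positive for more peaked."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical samples
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
kurt_symmetric = excess_kurtosis(symmetric)
```

The evenly spread sample has zero skewness and negative excess kurtosis (flatter than normal), while the sample with one large value is positively skewed.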
Lecture 1
Multivariate Analysis: ‘Broadly speaking, it refers to all statistical methods that
simultaneously analyze multiple measurements on each individual or object under
investigation’
Nonmetric measurement scales:
Nominal: unique definition (brand name, gender, student ANR)
  Statistics: %, mode, Chi square test
Ordinal: indicate 'order', sequence (level of education)
  Statistics: percentiles, median, rank correlation
Metric measurement scales:
Interval: arbitrary origin (attribute scores, price index)
  Statistics: arithmetic average, range, standard deviation, product-moment correlation
Ratio: unique origin, zero point (age, cost, number of customers)
  Statistics: geometric average, coefficient of variation
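As a rough illustration of which summary statistic goes with which scale, the snippet below computes one statistic per scale with Python's `statistics` module. The data values are invented.

```python
import statistics

brands = ["A", "B", "A", "C", "A"]  # nominal: labels only
education = [1, 2, 2, 3, 4]         # ordinal: ordered level codes
attitude = [3, 4, 5, 4, 4]          # interval: attribute scores
costs = [10.0, 20.0, 40.0]          # ratio: true zero point

nominal_mode = statistics.mode(brands)
ordinal_median = statistics.median(education)
interval_mean = statistics.mean(attitude)
ratio_geometric_mean = statistics.geometric_mean(costs)
coefficient_of_variation = statistics.stdev(costs) / statistics.mean(costs)
```

Statistics listed for a lower scale remain usable on higher scales (a median of ratio data is fine), but not the other way round: an arithmetic average of brand names has no meaning.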
Reliability: Is the measure ‘consistent’, correctly registered?
Validity: Does the measure capture the concept it is supposed to measure?