Methodology Marketing & Strategic Management
Key terms Chapter 1 - Introduction
Big Data The explosion in secondary data typified by increases in the volume, variety and
velocity of the data being made available from a myriad set of sources (e.g., social media,
customer-level data, sensor data, etc.).
Bivariate partial correlation Simple (two-variable) correlation between two sets of residuals
(unexplained variances) that remain after the association of other independent variables is
removed.
Bootstrapping An approach to validating a multivariate model by drawing a large number of
subsamples and estimating models for each subsample. Estimates from all the subsamples
are then combined, providing not only the “best” estimated coefficients (e.g., means of each
estimated coefficient across all the subsample models), but their expected variability and thus
their likelihood of differing from zero; that is, are the estimated coefficients statistically different
from zero or not? This approach does not rely on statistical assumptions about the population
to assess statistical significance, but instead makes its assessment based solely on the
sample data.
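The resampling idea above can be sketched in a few lines of Python. This is a minimal illustration, using the sample mean as the estimated coefficient; the function name and the data values are hypothetical, not from the text.

```python
import random
import statistics

def bootstrap_mean(sample, n_boot=1000, seed=42):
    """Draw many subsamples with replacement, estimate the statistic on each,
    then summarize the bootstrap distribution of estimates. Variability comes
    from the sample data alone, not from population-level assumptions."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # each subsample is the same size as the original, drawn with replacement
        resample = [rng.choice(sample) for _ in sample]
        estimates.append(statistics.mean(resample))
    # "best" estimate = mean across subsamples; its spread approximates the
    # standard error, which can be used to judge difference from zero
    return statistics.mean(estimates), statistics.stdev(estimates)

data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
est, se = bootstrap_mean(data)
```

The same pattern extends to regression coefficients by refitting the model on each subsample instead of taking a mean.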
Causal inference Methods that move beyond statistical inference to the stronger statement
of “cause and effect” in non-experimental situations.
Composite measure Fundamental element of multivariate measurement by the combination
of two or more indicators. See summated scales.
Cross-validation Method of validation where the original sample is divided into a number of
smaller sub-samples (validation samples) and the validation fit is the “average” fit across
all of the sub-samples.
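As a minimal sketch of this procedure, consider a mean-only "model" in Python: each fold is held out in turn as the validation sample, the model is estimated on the rest, and the validation fit is averaged across folds. The function name and sample values are hypothetical.

```python
import statistics

def cross_validate_mean(sample, k=4):
    """k-fold cross-validation of a mean-only model: hold out each fold,
    estimate the mean on the remaining data, score the holdout by mean
    squared error, then average the fit across all sub-samples."""
    folds = [sample[i::k] for i in range(k)]  # k roughly equal sub-samples
    errors = []
    for i, holdout in enumerate(folds):
        train = [v for j, f in enumerate(folds) if j != i for v in f]
        pred = statistics.mean(train)  # "model" estimated without the holdout
        errors.append(statistics.mean((v - pred) ** 2 for v in holdout))
    return statistics.mean(errors)  # the "average" validation fit

cv_mse = cross_validate_mean([2.0, 2.5, 3.0, 2.2, 2.8, 2.4, 2.6, 2.1])
```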
Data mining models Models based on algorithms (e.g., neural networks, decision trees,
support vector machine) that are widely used in many Big Data applications. Their emphasis
is on predictive accuracy rather than statistical inference and explanation as seen in
statistical/data models such as multiple regression.
Data models See statistical models.
Dependence technique Classification of statistical techniques distinguished by having a
variable or set of variables identified as the dependent variable(s) and the remaining variables
as independent. The objective is prediction of the dependent variable(s) by the independent
variable(s). An example is regression analysis.
Dependent variable Presumed effect of, or response to, a change in the independent
variable(s).
Dimensional reduction The reduction of multicollinearity among variables by forming
composite measures of multicollinear variables through such methods as exploratory factor
analysis.
Directed acyclic graph (DAG) Graphical portrayal of causal relationships used in causal
inference analysis to identify all “threats” to causal inference. Similar in some ways to path
diagrams used in structural equation modeling.
Dummy variable Nonmetrically measured variable transformed into a metric variable by
assigning a 1 or a 0 to a subject, depending on whether it possesses a particular characteristic.
Effect size Estimate of the degree to which the phenomenon being studied (e.g., correlation
or difference in means) exists in the population.
Estimation sample Portion of original sample used for model estimation in conjunction with
validation sample.
General linear model (GLM) Fundamental linear dependence model which can be used to
estimate many model types (e.g., multiple regression, ANOVA/MANOVA, discriminant
analysis) with the assumption of a normally distributed dependent measure.
Generalized linear model (GLZ or GLiM) Similar in form to the general linear model, but able
to accommodate non-normal dependent measures such as binary variables (logistic
regression model). Uses maximum likelihood estimation rather than ordinary least squares.
Holdout sample See validation sample.
Independent variable Presumed cause of any change in the dependent variable.
Indicator Single variable used in conjunction with one or more other variables to form a
composite measure.
Interdependence technique Classification of statistical techniques in which the variables are
not divided into dependent and independent sets; rather, all variables are analyzed as a single
set (e.g., exploratory factor analysis).
Measurement error Inaccuracies of measuring the “true” variable values due to the fallibility
of the measurement instrument (i.e., inappropriate response scales), data entry errors, or
respondent errors.
Metric data Also called quantitative data, interval data, or ratio data, these measurements
identify or describe subjects (or objects) not only on the possession of an attribute but also by
the amount or degree to which the subject may be characterized by the attribute. For example,
a person’s age and weight are metric data.
Multicollinearity Extent to which a variable can be explained by the other variables in the
analysis. As multicollinearity increases, it complicates the interpretation of the variate because
it is more difficult to ascertain the effect of any single variable, owing to their interrelationships.
Multivariate analysis Analysis of multiple variables in a single relationship or set of
relationships.
Multivariate measurement Use of two or more variables as indicators of a single composite
measure. For example, a personality test may provide the answers to a series of individual
questions (indicators), which are then combined to form a single score (summated scale)
representing the personality trait.
Nonmetric data Also called qualitative data, these are attributes, characteristics, or
categorical properties that identify or describe a subject or object. They differ from metric data
by indicating the presence of an attribute, but not the amount. Examples are occupation
(physician, attorney, professor) or buyer status (buyer, non-buyer). Also called nominal data
or ordinal data.
Overfitting Estimation of model parameters that over-represent the characteristics of the
sample at the expense of generalizability to the population at large.
Power Probability of correctly rejecting the null hypothesis when it is false; that is, correctly
finding a hypothesized relationship when it exists. Determined as a function of (1) the statistical
significance level set by the researcher for a Type I error, (2) the sample size used in the
analysis, and (3) the effect size being examined.
Practical significance Means of assessing multivariate analysis results based on their
substantive findings rather than their statistical significance. Whereas statistical significance
determines whether the result is attributable to chance, practical significance assesses
whether the result is useful (i.e., substantial enough to warrant action) in achieving the
research objectives.
Reliability Extent to which a variable or set of variables is consistent in what it is intended to
measure. If multiple measurements are taken, the reliable measures will all be consistent in
their values. It differs from validity in that it relates not to what should be measured, but instead
to how it is measured.
Specification error Omitting a key variable from the analysis, thus affecting the estimated
effects of included variables.
Statistical models The form of analysis where a specific model is proposed (e.g., dependent
and independent variables to be analyzed by the general linear model), the model is then
estimated and a statistical inference is made as to its generalizability to the population through
statistical tests. It operates in the opposite fashion from data mining models, which generally have
little model specification and no statistical inference.
Summated scales Method of combining several variables that measure the same concept
into a single variable in an attempt to increase the reliability of the measurement through
multivariate measurement. In most instances, the separate variables are summed and then
their total or average score is used in the analysis.
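The averaging step can be shown in one short Python function; the function name and the item responses below are hypothetical, not from the text.

```python
def summated_scale(item_responses):
    """Combine several indicators of the same concept into one composite
    score per respondent by averaging the item values."""
    return [sum(row) / len(row) for row in item_responses]

# three respondents answering a 4-item scale (1-5 agreement ratings)
composites = summated_scale([[5, 4, 5, 4], [2, 3, 2, 3], [4, 4, 4, 4]])
```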
Treatment Independent variable the researcher manipulates to see the effect (if any) on the
dependent variable(s), such as in an experiment (e.g., testing the appeal of color versus black-
and-white advertisements).
Type I error Probability of incorrectly rejecting the null hypothesis—in most cases, it means
saying a difference or correlation exists when it actually does not. Also termed alpha (α).
Typical levels are five or one percent, termed the .05 or .01 level, respectively.
Type II error Probability of incorrectly failing to reject the null hypothesis—in simple terms, the
chance of not finding a correlation or mean difference when it does exist. Also termed beta
(β), it is inversely related to Type I error. The value of 1 minus the Type II error (1 - β) is
defined as power.
Univariate analysis of variance (ANOVA) Statistical technique used to determine, on the
basis of one dependent measure, whether samples are from populations with equal means.
Validation sample Portion of the sample “held out” from estimation and then used for an
independent assessment of model fit on data that was not used in estimation.
Validity Extent to which a measure or set of measures correctly represents the concept of
study—the degree to which it is free from any systematic or nonrandom error. Validity is
concerned with how well the concept is defined by the measure(s), whereas reliability relates
to the consistency of the measure(s).
Variate Linear combination of variables formed in the multivariate technique by deriving
empirical weights applied to a set of variables specified by the researcher.
Key terms Chapter 2 - Examining your data
All-available approach Imputation method for missing data that computes values based on
all-available valid observations, also known as the pairwise approach.
Binning Process of categorizing a metric variable into a small number of categories/bins and
thus converting the variable into a nonmetric form.
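A minimal sketch of binning in Python, using the standard library's `bisect` to assign each value to a category; the function name and cutpoints are hypothetical.

```python
import bisect

def to_bins(values, cutpoints):
    """Convert a metric variable to nonmetric form by assigning each value
    a bin index determined by the sorted cutpoints."""
    return [bisect.bisect_left(cutpoints, v) for v in values]

# ages binned as 0 = under 30, 1 = 30-49, 2 = 50 and over
bins = to_bins([25, 31, 64, 47], [30, 50])
```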
Boxplot Method of representing the distribution of a variable. A box represents the major
portion of the distribution, and the extensions—called whiskers—reach to the extreme points
of the distribution. This method is useful in making comparisons of one or more metric
variables across groups formed by a nonmetric variable.
Cardinality The number of distinct data values for a variable.
Censored data Observations that are incomplete in a systematic and known way. One
example occurs in the study of causes of death in a sample in which some individuals are still
living. Censored data are an example of ignorable missing data.
Centering A variable transformation in which a specific value (e.g., the variable mean) is
subtracted from each observation’s value, thus improving comparability among variables.
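Centering on the mean is a one-line transformation; this Python sketch uses a hypothetical function name and data.

```python
def center(values):
    """Subtract the variable mean from each observation, giving the
    transformed variable a mean of zero."""
    m = sum(values) / len(values)
    return [v - m for v in values]

c = center([4.0, 6.0, 8.0])  # mean is 6, so values become deviations from 6
```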
Cold deck imputation Imputation method for missing data that derives the imputed value from
an external source (e.g., prior studies, other samples).
Comparison group See reference category.
Complete case approach Approach for handling missing data that computes values based
on data from complete cases, that is, cases with no missing data. Also known as the listwise
deletion approach.
Curse of dimensionality The problems associated with including a very large number of
variables in the analysis. Among the notable problems are the distance measures becoming
less useful along with higher potential for irrelevant variables and differing scales of
measurement for the variables.
Data management All of the activities associated with assembling a dataset for analysis. With
the arrival of larger and more diverse datasets from Big Data, researchers may now find they
spend a vast majority of their time on this task rather than on analysis.
Data quality Generally referring to the accuracy of the information in a dataset, recent efforts
have identified eight dimensions that are much broader in scope and reflect the usefulness in
many aspects of analysis and application: completeness, availability and accessibility,
currency, accuracy, validity, usability and interpretability, reliability and credibility, and
consistency.
Data transformations A variable may have an undesirable characteristic, such as non-
normality, that detracts from its use in a multivariate technique. A transformation, such as
taking the logarithm or square root of the variable, creates a transformed variable that is more
suited to portraying the relationship. Transformations may be applied to either the dependent
or independent variables, or both. The need and specific type of transformation may be based
on theoretical reasons (e.g., transforming a known nonlinear relationship), empirical reasons
(e.g., problems identified through graphical or statistical means) or for
interpretation purposes (e.g., standardization).
dCor A newer measure of association that is distance-based and more sensitive to nonlinear
patterns in the data.
Dichotomization Dividing cases into two classes based on being above or below a specified
value.
Dummy variable Special metric variable used to represent a single category of a nonmetric
variable. To account for L levels of a nonmetric variable, L - 1 dummy variables are needed.
For example, gender is measured as male or female and could be represented by two dummy
variables (X1 and X2). When the respondent is male, X1 = 1 and X2 = 0. Likewise, when the
respondent is female, X1 = 0 and X2 = 1. However, when X1 = 1, we know that X2
must equal 0. Thus, we need only one variable, either X1 or X2, to represent the variable
gender. If a nonmetric variable has three levels, only two dummy variables are needed. We
always have one dummy variable less than the number of levels for the nonmetric variable.
The omitted category is termed the reference category.
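The L - 1 rule can be sketched in Python: one 0/1 column per non-reference level, with the reference (omitted) category coded all zeros. The helper name `indicator_dummies` is hypothetical.

```python
def indicator_dummies(values, reference):
    """Code a nonmetric variable as L - 1 dummy variables: one 0/1 column
    per non-reference level; the reference category receives all zeros."""
    levels = [lv for lv in sorted(set(values)) if lv != reference]
    return [[1 if v == lv else 0 for lv in levels] for v in values]

# gender has two levels, so a single dummy variable suffices
rows = indicator_dummies(["male", "female", "female"], reference="female")
```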
Effects coding Method for specifying the reference category for a set of dummy variables
where the reference category receives a value of minus one (-1) across the set of dummy
variables. With this type of coding, the dummy variable coefficients represent group deviations
from the mean of all groups, which is in contrast to indicator coding.
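The -1 coding of the reference category can be illustrated with a short Python sketch; the function name and category labels are hypothetical.

```python
def effects_code(values, reference):
    """Effects coding: one column per non-reference level, with the
    reference category coded -1 on every column, so coefficients read as
    deviations from the mean of all groups."""
    levels = [lv for lv in sorted(set(values)) if lv != reference]
    rows = []
    for v in values:
        if v == reference:
            rows.append([-1] * len(levels))  # reference gets -1 across the set
        else:
            rows.append([1 if v == lv else 0 for lv in levels])
    return rows

coded = effects_code(["a", "b", "c", "a"], reference="c")
```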
Elasticity Measure of the ratio of percentage change in Y for a percentage change in X.
Obtained by using a log-log transformation of both dependent and independent variables.
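This can be sketched as an ordinary least squares slope on log-transformed data; the slope of ln(Y) on ln(X) is the elasticity. The function name and data are hypothetical (the data are generated with Y = X², so the elasticity is exactly 2).

```python
import math

def log_log_elasticity(x, y):
    """OLS slope of ln(y) on ln(x): the % change in Y for a 1% change in X."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

xs = [1.0, 2.0, 3.0, 4.0]
ys = [v ** 2 for v in xs]  # constant elasticity of 2 by construction
e = log_log_elasticity(xs, ys)
```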
EM Imputation method applicable when MAR (missing at random) missing data processes are
encountered; it employs maximum likelihood estimation in the calculation of imputed values.
Extreme groups approach Transformation method where observations are sorted into
groups (e.g., high, medium and low) and then the middle group discarded in the analysis.
Heat map Form of scatterplot of nonmetric variables where frequency within each cell is color-
coded to depict relationships.
Heteroscedasticity See homoscedasticity.
Histogram Graphical display of the distribution of a single variable. By forming frequency
counts in categories, the shape of the variable’s distribution can be shown. Used to make a
visual comparison to the normal distribution.
Hoeffding’s D New measure of association/correlation that is based on distance measures
between the variables and thus more likely to incorporate nonlinear components.
Homoscedasticity When the variance of the error terms (e) appears constant over a range
of predictor variables, the data are said to be homoscedastic. The assumption of equal
variance of the population error E (where E is estimated from e) is critical to the proper
application of many multivariate techniques. When the error terms have increasing or
modulating variance, the data are said to be heteroscedastic. Analysis of residuals best
illustrates this point.
Hot deck imputation Imputation method in which the imputed value is taken from an existing
observation deemed similar.
Ignorable missing data Missing data process that is explicitly identifiable and/or is under the
control of the researcher. Ignorable missing data do not require a remedy because the missing
data are explicitly handled in the technique used.
Imputation Process of estimating the missing data of an observation based on valid values
of the other variables. The objective is to employ known relationships that can be identified in
the valid values of the sample to assist in representing or even estimating the replacements
for missing values.
Indicator coding Method for specifying the reference category for a set of dummy variables
where the reference category receives a value of zero across the set of dummy variables. The
dummy variable coefficients represent the category differences from the reference category.
Also see effects coding.
Ipsatizing Method of transformation for a set of variables on the same scale similar to
centering, except that the variable used for centering all of the variables is the mean value for
the observation (e.g., person-centered).
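The contrast with ordinary centering can be shown in Python: each row (observation) is centered on its own mean rather than each column on the variable mean. The function name and scores are hypothetical.

```python
def ipsatize(rows):
    """Person-centering: subtract each observation's own mean across a set
    of same-scale variables from that observation's values."""
    out = []
    for row in rows:
        m = sum(row) / len(row)  # this respondent's mean, not the variable mean
        out.append([v - m for v in row])
    return out

scores = ipsatize([[5.0, 3.0, 4.0], [2.0, 2.0, 2.0]])
```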
Kurtosis Measure of the peakedness or flatness of a distribution when compared with a
normal distribution. A positive value indicates a relatively peaked distribution, and a negative
value indicates a relatively flat distribution.
Linearity Used to express the concept that the model possesses the properties of additivity
and homogeneity. In a simple sense, linear models predict values that fall in a straight line by
having a constant unit change (slope) of the dependent variable for a constant unit change of
the independent variable.
Listwise deletion See complete case approach.
Mean substitution Imputation method where the mean value of all valid values is used as the
imputed value for missing data.
MIC (mutual information correlation) New form of association/correlation that can represent
any form of dependence (e.g., circular patterns) and is not limited to just linear relationships.