Lecture LRM
Variables can be split into categorical and continuous, and within these types there are different
levels of measurement:
- Categorical (entities are divided into distinct categories):
- Binary variable: There are only two categories (e.g., dead or alive).
- Nominal variable: There are more than two categories (e.g., whether someone is an omnivore,
vegetarian, vegan, or fruitarian).
- Ordinal variable: The same as a nominal variable but the categories have a logical order (e.g.,
whether people got a fail, a pass, a merit or a distinction in their exam).
- Continuous (entities get a distinct score):
- Interval variable: Equal intervals on the variable represent equal differences in the property being
measured (e.g., the difference between 6 and 8 is equivalent to the difference between 13 and 15).
- Ratio variable: The same as an interval variable, but the ratios of scores on the scale must also
make sense (e.g., a score of 16 on an anxiety scale means that the person is, in reality, twice as
anxious as someone scoring 8). For this to be true, the scale must have a meaningful zero point.
- The mean is the sum of all scores divided by the number of scores. The value of the mean can be
influenced quite heavily by extreme scores.
- The median is the middle score when the scores are placed in ascending order. It is not as
influenced by extreme scores as the mean.
- The mode is the score that occurs most frequently.
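A minimal sketch of the three measures above, using Python's standard `statistics` module. The scores are made up for illustration; the last one is deliberately extreme to show how it pulls the mean but not the median.

```python
import statistics

# Hypothetical scores; 40 is an extreme score (outlier).
scores = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(scores)      # sum of all scores / number of scores
median = statistics.median(scores)  # middle score after sorting
mode = statistics.mode(scores)      # score that occurs most frequently

# The same data without the extreme score, for comparison.
mean_without_outlier = statistics.mean(scores[:-1])

print("mean:", mean, "median:", median, "mode:", mode)
print("mean without the extreme score:", round(mean_without_outlier, 2))
```

The mean drops sharply once the extreme score is left out, while the median of the full data stays close to the bulk of the scores.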
The variance and standard deviation tell us about the shape of the distribution of scores. If the mean
represents the data well then most of the scores will cluster close to the mean and the resulting
standard deviation is small relative to the mean. When the mean is a worse representation of the
data, the scores cluster more widely around the mean and the standard deviation is larger. Figure
1.11 shows two distributions that have the same mean (50) but different standard deviations. One
has a large standard deviation relative to the mean (SD = 25) and this results in a flatter distribution
that is more spread out, whereas the other has a small standard deviation relative to the mean (SD =
15) resulting in a pointier distribution in which scores close to the mean are very frequent but scores
further from the mean become increasingly infrequent. The message is that as the standard
deviation gets larger, the distribution gets fatter. This can make distributions look platykurtic or
leptokurtic when, in fact, they are not.
- The deviance or error is the distance of each score from the mean.
- The sum of squared errors is the total amount of error in the mean. The errors/deviances are
squared before adding them up.
- The variance is the average squared distance of scores from the mean. It is the sum of squares divided by
the number of scores (or by the number of scores minus 1 when it is estimated from a sample). It tells us
about how widely dispersed scores are around the mean.
- The standard deviation is the square root of the variance. It is the variance converted back to the
original units of measurement of the scores used to compute it. Large standard deviations relative to
the mean suggest data are widely spread around the mean, whereas small standard deviations
suggest data are closely packed around the mean.
- The range is the distance between the highest and lowest score.
- The interquartile range is the range of the middle 50% of the scores.
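The measures of spread listed above can be sketched step by step as follows. The scores are hypothetical. Two things are assumptions rather than part of the notes: the sample version of the variance (dividing by N - 1) is shown next to the divide-by-N version, and the quartiles use the default method of `statistics.quantiles` (other software may use a slightly different quartile convention).

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
n = len(scores)
mean = sum(scores) / n

# Deviance (error): distance of each score from the mean.
deviances = [x - mean for x in scores]

# Sum of squared errors: square each deviance, then add them up.
sum_of_squares = sum(d ** 2 for d in deviances)

# Variance: sum of squares divided by N, or by N - 1 for a sample estimate.
variance_n = sum_of_squares / n
variance_sample = sum_of_squares / (n - 1)

# Standard deviation: square root of the variance (back in original units).
sd_n = math.sqrt(variance_n)
sd_sample = math.sqrt(variance_sample)

# Range: highest score minus lowest score.
score_range = max(scores) - min(scores)

# Interquartile range: upper quartile minus lower quartile.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print("sum of squares:", sum_of_squares)
print("variance (N):", variance_n, "variance (N - 1):", round(variance_sample, 3))
print("SD (N):", sd_n, "SD (N - 1):", round(sd_sample, 3))
print("range:", score_range, "IQR:", iqr)
```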
The standard error of the mean is the standard deviation of sample means. As such, it is a measure
of how representative of the population a sample mean is likely to be. A large standard error (relative
to the sample mean) means that there is a lot of variability between the means of different samples
and so the sample mean we have might not be representative of the population mean. A small
standard error indicates that most sample means are similar to the population mean (i.e., our sample
mean is likely to accurately reflect the population mean).
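In practice we usually have only one sample, so the standard error is estimated from it as the sample standard deviation divided by the square root of the sample size. A minimal sketch with hypothetical scores (the function name `standard_error` is just for illustration):

```python
import math
import statistics

def standard_error(sample):
    """Estimate the standard error of the mean as s / sqrt(N)."""
    s = statistics.stdev(sample)  # sample standard deviation (N - 1)
    return s / math.sqrt(len(sample))

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
se = standard_error(sample)

print("sample mean:", statistics.mean(sample))
print("standard error:", round(se, 3))
```

Because N is in the denominator, larger samples give a smaller standard error for the same spread of scores.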
Article lecture + Q&A Discrete Choice modelling
Important things to consider when you look at the reliability of a scientific journal for your
research + answered for the article lecture
The research aim is to study differences in entrepreneurial intention between northern and southern Europe.
Database: 2004 GEM data at the individual level. It is a survey, so the information is provided
by the people themselves. It is micro data about individuals.
Data restrictions: regional restrictions (only northern and southern Europe) and only data on individuals who are
already active. They must also be people with entrepreneurial intentions within 3 years.
(Note: 3 years in this research because with, e.g., 10 years the answer would be too vague: "I don't know, maybe
I will be an entrepreneur".)
Dependent variable: intention to become an entrepreneur within 3 years
Quantitative analysis method: discrete choice (yes/no: 0 = no, 1 = yes)
Does the model in the journal article meet the conditions?: Not entirely, some things are missing: they do not
look at the residuals and error terms in the model.
How is the overall model evaluated?
Odds ratio, how is that interpreted?
Why is there a separate regression?: First they run the model with all data, then separately for only north and south,
for the overview. But then it is more difficult to compare, because there are now 2 different datasets.
Which model is the best? (model fit): The model is significant, but what is the relevance of the various
models? So this is about model relevance. Look at the goodness of fit:
Cox and Snell pseudo R-squared (the higher the better)
Nagelkerke R-squared (the higher the better)
Percentage correct (classification table): the higher the better
For the test: argue why which model is the best: argue with the pseudo r squared, significance and
relevance
In this journal article, it’s either model 4 or model 5.
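As a rough sketch of how the goodness-of-fit measures above are computed: both pseudo R-squared values come from the log-likelihood of the empty (intercept-only) model and of the fitted model, and the percentage correct comes from the classification table. The log-likelihoods, sample size, and outcomes below are hypothetical, not taken from the article, and the function names are just for illustration.

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """Cox and Snell pseudo R-squared from two log-likelihoods."""
    return 1 - math.exp(2 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R-squared: Cox and Snell rescaled so its maximum is 1."""
    max_cox_snell = 1 - math.exp(2 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_cox_snell

def percentage_correct(observed, predicted):
    """Share of cases where the predicted category equals the observed one."""
    hits = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100 * hits / len(observed)

# Hypothetical values for one model.
ll_null, ll_model, n = -100.0, -80.0, 200
print("Cox and Snell:", round(cox_snell_r2(ll_null, ll_model, n), 3))
print("Nagelkerke:", round(nagelkerke_r2(ll_null, ll_model, n), 3))

# Hypothetical observed vs. predicted outcomes (0 = no, 1 = yes).
observed = [1, 0, 1, 0]
predicted = [1, 0, 0, 0]
print("Percentage correct:", percentage_correct(observed, predicted))
```

When comparing models, the same three numbers would be computed per model and the higher values argued for, alongside significance.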
Notes
Importance of comparative data: comparing data requires harmonized data, because it
is impossible to have 1 dataset that contains all the data. Plus, if you collect data yourself only for your own
region/country, it is less valuable: you can do less with the data.
Significance: is the model useful at all or not?
Relevance: is it relevant for this research?
Lecture slide: Result analyses 1 (2)
Odds ratio:
Check which Exp(B) is different from 1
Significance of the odds ratio: whether you can reject the null hypothesis that the odds ratio is 1 (that
there is no effect, i.e. B = 0). So if p < 0.05 it is significant
The odds ratio for Mediterranean is below 1. You always compare to 1, the value of the reference category
(Scandinavia). So the intentions in Mediterranean countries are lower than in Scandinavia. The odds are
1 – 0.752 = (…) % lower than in Scandinavia
The categories (age, gender) are dummies
When looking at age, model 5: when age goes up by 1 year, the odds of entrepreneurial intentions go
down by 3.9% ((1 – 0.961) x 100%)
1 – (..) because you have a reference: 1 is your reference, and you calculate the difference from your
reference. So, for example, if your odds ratio is 2.5, compare it to the reference of 1 (as always with odds ratios),
and then subtract 1 to see what the difference is (that many times higher than the reference category).
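The "compare to 1" rule above can be written as a small helper. This is only a sketch: the function name is made up, and the three odds ratios are the ones mentioned in these notes (0.752, 0.961 and the example value 2.5).

```python
def percent_change_in_odds(odds_ratio):
    """Difference from the reference value 1, expressed as a percentage.

    Negative: odds are lower than in the reference category (or per unit increase).
    Positive: odds are higher.
    """
    return (odds_ratio - 1) * 100

examples = [
    ("Mediterranean vs. Scandinavia", 0.752),
    ("age, per extra year", 0.961),
    ("example odds ratio", 2.5),
]

for label, odds_ratio in examples:
    change = percent_change_in_odds(odds_ratio)
    print(f"{label}: Exp(B) = {odds_ratio}, change in odds = {change:+.1f}%")
```

Note that the percentage refers to the odds, not directly to the probability.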
4 steps:
1. Testing model assumptions (the list of conditions the model has to meet: are we allowed to use it
at all?)
2. Is the overall model significant? (Is it better than an empty model? Think of the classification table.)
3. What is the model relevance? (How relevant is this model compared to others, for example?)
4. What are the effects that we’re interested in? (In this case: chances of entrepreneurship
compared to education level.)
Q&A Part
Natural logarithm (ln): a mathematical function that has to do with powers (exponents). It is used to arrive at
the right number and to obtain a linear relationship.
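A small sketch of why the natural logarithm gives a linear relationship: in a discrete choice (logistic) model the predictors are linear in the natural log of the odds, so adding B on that scale multiplies the odds by Exp(B). The value 0.961 is the Exp(B) for age from these notes; the starting probability and the function names are hypothetical.

```python
import math

def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

def probability(log_odds_value):
    """Inverse of log_odds: turn a log-odds value back into a probability."""
    return 1 / (1 + math.exp(-log_odds_value))

exp_b = 0.961        # Exp(B) for age, from the notes
b = math.log(exp_b)  # the corresponding B on the log-odds scale

p_start = 0.20                               # hypothetical starting probability
p_next = probability(log_odds(p_start) + b)  # same person, one year older

odds_start = p_start / (1 - p_start)
odds_next = p_next / (1 - p_next)

print("B =", round(b, 4))
print("ratio of the odds:", round(odds_next / odds_start, 3))
```

The ratio of the odds equals Exp(B) regardless of the starting probability, which is why effects are reported as odds ratios.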