Masterproject – Basis Lecture 1
This course
- 3 lectures: two on quantitative analyses and one on writing the discussion and abstract.
- 2 consultation sessions: for students who encounter problems; 4 teachers are available, and you can register for a 10-minute consultation in week 9 via the Excel sheet on Teams.
- 6 tutorial groups: discuss your analysis with a teacher and fellow students.
- Deadline thesis: June 18.
- Presentation thesis: July 2.
Data-analysis Organisation Plan (assignment 1; 1 A4)
- Data cleaning is very important. Provide information about:
- 1. Data cleaning (How will you clean your data, what are you actually going to do).
- 2. Outliers (How are you going to handle your outliers, how will you find them and what are you
going to do with them).
- 3. Sample (Give information about your sample, what is it going to look like; which sample are you
going to select, maybe you will select subsamples).
- 4. Scale creation (How are you going to create your variables, or your scales).
- It’s helpful to have your data available while you write down what you are going to do, because it is easier to think about outliers, for instance, when you can actually check for them.
Data cleaning
- Examine the frequencies of all items of the variables you are going to use. Do this at the item level, not the variable level: make a frequency table of every item involved. This already gives you more information about the quality of your data and about the possible next steps you will have to take.
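The course works with SPSS (see the Field book), but purely as an illustration, an item-level frequency check could look like the following pandas sketch; the data and item names are made up.

```python
import pandas as pd
import numpy as np

# Item-level frequency tables on toy data; item names are hypothetical.
df = pd.DataFrame({
    "item1": [1, 2, 2, 5, 9],       # 9 is impossible on a 5-point scale
    "item2": [3, 3, 0, 4, 5],       # 0 is impossible
    "item3": [1, 1, 1, 2, np.nan],  # already contains a missing value
})

for item in ["item1", "item2", "item3"]:
    print(item)
    # dropna=False so missing values are counted in the table as well
    print(df[item].value_counts(dropna=False).sort_index(), "\n")
```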
Are there any impossible values?
- You know which values your items can have; most of the time these are values like 1–5 on a 5-point scale. On a 5-point scale, values of 0, 6 or 9 are impossible, so you have to recode them; perhaps they are missing-value codes. Make sure that all items only contain possible scores and exclude the impossible values.
- You can recode them into the most probable value, and in case of doubt recode them as ‘missing value’. Make sure that only possible values remain.
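A minimal sketch of that recoding step, again with made-up items: anything outside the possible range 1–5 becomes a missing value.

```python
import pandas as pd
import numpy as np

# Recode impossible scores on hypothetical 5-point items to missing,
# so only the values 1-5 remain.
df = pd.DataFrame({"item1": [1, 2, 9, 5, 0], "item2": [3, 6, 4, 4, 5]})
valid = [1, 2, 3, 4, 5]

for item in ["item1", "item2"]:
    df[item] = df[item].where(df[item].isin(valid), np.nan)

print(df)  # the impossible values (0, 6, 9) are now NaN
```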
Besides impossible values, you also have to deal with unreliable values.
- In research among adolescents, even more so than among adults, respondents may produce unreliable values. Maybe they are not motivated to fill out a questionnaire; many questionnaires are administered in school settings. Young people can be bored and may give unreliable answers because they find it boring or want to be rebellious (e.g. extreme scores because they are making fun of the questions).
- How are you going to detect unreliable values and deal with them?
- First check whether there are systematic response tendencies, which you do by looking at your dataset. With a lot of information it is not easy to spot response tendencies, but you may find participants who score 1 on 50 questions, or 5 on 50 questions. The lowest, highest and middle scores are the ones most often used in systematic response tendencies. If the same score is given on many questions, it may indicate that these questions were answered unreliably.
- Another good way to find unreliable answers is to check for inconsistencies (within or between variables). The two checks go well together: sometimes you find systematic response tendencies by finding these inconsistencies. You can check within variables (a scale with positive and negative items: if a participant gives exactly the same answer to the positive and the negative item, you can be fairly sure these answers are unreliable and you have to deal with them), but also between variables (e.g. ‘Did you drink alcohol during the last month?’ YES, followed by ‘How many glasses last month?’ 0; then you know that one of these answers is unreliable). A small sketch of both checks follows below.
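Purely as an illustration (made-up variable names, and assuming the reverse-worded item has not been recoded yet), the straight-lining check and the two inconsistency checks can be sketched like this:

```python
import pandas as pd

# Toy data; all variable names are hypothetical.
df = pd.DataFrame({
    "item1": [1, 4, 3], "item2": [1, 2, 4], "item3": [1, 5, 3], "item4": [1, 3, 2],
    "pos_item": [5, 2, 4],            # e.g. "I like school"
    "neg_item": [5, 4, 2],            # reverse-worded counterpart
    "drank_last_month": ["yes", "no", "yes"],
    "glasses_last_month": [0, 0, 6],
})
items = ["item1", "item2", "item3", "item4"]

# Systematic response tendency: the same answer on every item (respondent 0)
straight_liners = df[df[items].nunique(axis=1) == 1]

# Within a scale: identical answers on a positive and a reverse-worded item
within = df[df["pos_item"] == df["neg_item"]]

# Between variables: says YES to drinking but reports 0 glasses
between = df[(df["drank_last_month"] == "yes") & (df["glasses_last_month"] == 0)]

print(straight_liners.index.tolist(), within.index.tolist(), between.index.tolist())
```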
- How to deal with these unreliable answer patterns and inconsistencies? You may recode into the most probable value. In the example above, they probably have not been drinking: they answered ‘yes’ because drinking is cool, but when asked about the number of glasses they report zero. In case of doubt, recode it as a missing value.
- This is information you have to write about in your data analysis plan: How did you handle
unreliable data? How many recodes/excludes do you have in your dataset?
Outliers
- You can find them, you have to check for them.
- There are two types of outliers:
- Outliers on one variable: almost all students are around the same score, while a few score more than ten times as high; these are real outliers. E.g. the number of alcoholic drinks: there are always adolescents who report 150 or 1000 a week. How will you handle these extreme scores; are they reliable or not?
- Outliers found in the relationship between two variables: e.g. you study the relationship between alcohol use (e.g. number of glasses) and problematic alcohol use (e.g. symptoms of alcohol addiction). You expect a relationship: the more someone drinks, the higher the chance of symptoms of alcohol addiction. But there may be one, two or three people who do not fit this pattern, for instance because they do not drink that many glasses but still have a relatively high score on problematic alcohol use. So you have to ask yourself what is going on here and whether this is reliable or unreliable.
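One possible, non-prescribed way to flag such cases is to look at how far each observation lies from a simple regression line; the sketch below uses simulated data and made-up variable names.

```python
import numpy as np
import pandas as pd

# Flag cases that do not fit the expected relationship between two variables
# by looking at residuals from a fitted straight line (simulated data).
rng = np.random.default_rng(1)
glasses = rng.integers(0, 20, size=100).astype(float)
symptoms = 0.4 * glasses + rng.normal(0, 1, size=100)
glasses[5], symptoms[5] = 1, 9          # few glasses, yet a high problem score
df = pd.DataFrame({"glasses": glasses, "symptoms": symptoms})

slope, intercept = np.polyfit(df["glasses"], df["symptoms"], deg=1)
resid = df["symptoms"] - (slope * df["glasses"] + intercept)
print(df[np.abs(resid) > 3 * resid.std()])   # cases far from the regression line
```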
- There are many statistics for finding these outliers; there are YouTube videos, and the SPSS book by Field contains information on finding them. Nevertheless, determining whether or not an observation is an outlier, and especially how to handle outliers, is a very subjective exercise. So handle ‘outliers’ with care. Never run the statistics that show there are outliers and then exclude them by definition: you need a good reason to exclude these outliers, so think about it. What is going on? Is it acceptable? Can it be reliable? Or is this person being rebellious? You do have to exclude outliers for open-ended questions, because there the range of possible values is large: if values on open-ended questions are clearly impossible, they can be excluded.
- So when do you exclude outliers? In any case for open questions (because there the range of possible values is large; if values on open-ended questions are clearly impossible or unreliable, you can exclude these cases). A possible solution for unreliable high scores (for example the number of glasses of alcohol) is to decide what is still reliable/normal, e.g. accept M + 2SD or M + 3SD as normal, and then recode the ‘outliers’ into the maximum score that you consider reliable (a sketch follows below).
- This is information you have to write about in your data analysis plan: How many outliers, on which
variables, and what did you do with them in your sample?
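A small sketch of the ‘recode to the maximum reliable score’ solution mentioned above, with made-up numbers of glasses; whether M + 2SD or M + 3SD is the right cutoff remains a judgment call.

```python
import pandas as pd

# Recode extreme scores on an open question (hypothetical 'glasses per week')
# to the highest value still considered normal, here M + 3SD.
glasses = pd.Series([0, 2, 4, 5, 3, 6, 1, 7, 2, 3,
                     4, 0, 5, 6, 2, 3, 1, 4, 5, 150])   # 150 is an unlikely report

cutoff = glasses.mean() + 3 * glasses.std()             # M + 2SD is also defensible
glasses_recoded = glasses.clip(upper=cutoff)
print(round(cutoff, 1), glasses_recoded.tolist())
```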
Sample
- 1. What does your sample look like? You can provide the most basic information about your sample in your data-analysis plan: the number of respondents, gender, age, educational level, etc., and the frequencies of the different levels. A short paragraph on this information is enough.
- 2. Which sample to select? This is important because it may depend on different factors in your research. Sometimes you need to select a subsample. E.g. if you are interested in the onset of a certain behavior, like drinking or smoking, and you use longitudinal data, you have to make a selection of participants at the first measurement (T1) and exclude those who already drink/smoke. So you include the non-drinkers/non-smokers at T1 and look at their score at T2: have they become a drinker or a smoker, or are they still abstaining? So onset of new behaviors is an important reason to choose a subsample instead of your complete sample. Another reason is to avoid extremely skewed distributions (zero-inflated distributions), particularly in your dependent variable. If your dependent variable is skewed, with a lot of zero scores, you have to think twice about your sample. Take game addiction: lots of adolescents score 0 on lots of symptoms, and a 0 score can mean no symptoms of game addiction, or that they are non-gamers. Then you have to exclude the non-gamers and select the gamers only (see the sketch below).
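Both kinds of selection are simple filters on the dataset; a sketch with hypothetical variable names:

```python
import pandas as pd

# Two common subsample selections on toy data (hypothetical variable names).
df = pd.DataFrame({
    "drinks_T1": [0, 0, 3, 0, 5],
    "drinks_T2": [0, 2, 4, 1, 6],
    "is_gamer":  [1, 0, 1, 1, 0],
})

# Onset of drinking: keep only non-drinkers at T1, then look at their T2 scores.
onset_sample = df[df["drinks_T1"] == 0]

# Zero-inflated dependent variable: keep only gamers when studying game
# addiction, so a 0 score means "no symptoms" rather than "does not game".
gamer_sample = df[df["is_gamer"] == 1]

print(len(onset_sample), len(gamer_sample))
```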
- 3. What about the sample in longitudinal research? Many of you use longitudinal data; how do you deal with participants who did not take part in all measurement waves? As a rule of thumb you select the participants who took part in all waves, so determine how large the group is that filled out the questionnaires in all waves. However, you may lose a lot of participants this way. For instance, you have 2000 participants at T1, and at T2 a year later only 1000 students also filled out the second questionnaire; then you have a dropout of 50%. You have to report dropout per wave (N’s, % of T1) in your data-analysis plan, and describe how you handle the risk of selective dropout. If you have lost participants, it is really important that you can answer the question whether the dropout is selective: are there certain characteristics that make it more likely that these persons dropped out of the sample? Selective dropout is particularly problematic if it is precisely those adolescents who are vulnerable to developing e.g. game addiction who drop out, and your study is on game addiction; then the selective dropout is problematic for the reliability, and therefore the external validity, of your data. You have to get more insight into who exactly dropped out of the study. To gain that insight you can conduct an attrition analysis (are there differences between those who dropped out and those who participated in the follow-up wave?): a logistic regression analysis with a dichotomous dependent variable indicating whether a participant also has a score on the second measurement wave or is a dropout (dropout yes/no). You can then see, for instance, whether the number of game-addiction symptoms differs between those who dropped out and those who remained in the dataset. That gives you information about what is going on and about the extent to which this is problematic for your data and your results (a sketch follows below).
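A minimal sketch of such an attrition analysis on simulated data, using a logistic regression from statsmodels as a stand-in for the SPSS procedure; the variable names are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Attrition analysis sketch: logistic regression predicting dropout
# (1 = no T2 data) from T1 scores. Data and variable names are made up.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "symptoms_T1": rng.poisson(2, n).astype(float),
    "age_T1": rng.integers(12, 18, n).astype(float),
})
# dropout is simulated to be slightly more likely for high-symptom adolescents
p_drop = 1 / (1 + np.exp(-(-1.5 + 0.3 * df["symptoms_T1"])))
df["dropout"] = rng.binomial(1, p_drop)

X = sm.add_constant(df[["symptoms_T1", "age_T1"]])
result = sm.Logit(df["dropout"], X).fit(disp=0)
# a clearly non-zero symptoms_T1 coefficient would point to selective dropout
print(result.summary())
```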
- You write about this in your data-analysis plan: What was the original sample size, which selections did you make and why, and what is the size of your final sample? What did the attrition analysis show, and what does it mean for your data?
- You cannot solve the problem of selective dropout, but it is important to keep it in mind when writing your discussion section: you have to reflect on the problem there.
Scale Creation
- 4. How to create your scales? You have to write about scale creation. Most of us will work with scales that consist of multiple items, and there are different ways to create these scales. First of all, it is important to check whether you are working with existing, validated scales that are often used in the literature, or with rather new scales that are not fully validated yet. If you are working with a validated scale, it is preferable to use that exact same scale. This also matters when you do a reliability analysis with Cronbach’s alpha and see that, for instance, one item is problematic. If it’s a validated scale and is used in international