VVA lecture 1
Even if you know the causal relationship -> if you cannot measure the inputs, it can still be a
random process.
Random process or systematic process
What is special about certain data -> do the numbers deviate? Is this random?
Statistics also includes, for example, the mean
Government
Program
- sqrt without a capital letter (R is case-sensitive)
- Select the sum and click Run -> answer
- Enter also gives the answer
- x <- 5: the program then shows x = 5 on the right-hand side
- If you then type x + 2 you get 7 as the answer
- The same holds for y
- Sequence of numbers in a row: myvariable <- c(72, -90, 69)
- Typing myvariable*3 then gives every number in the sequence times 3
- NaN -> not a number (e.g. the square root of a negative number cannot be computed)
- mean = average
- myvariable2 <- myvariable*2, then Enter; typing myvariable2 afterwards shows its value
- length and sum also work
- [2] is the second number in the sequence, e.g. myvariable[2]
- myvariable > 0 -> Enter gives TRUE and FALSE
- myvariable[myvariable > 0] -> Enter gives the numbers for which the condition holds
- == means is equal to, so two = signs
- != means not equal to
- Save the Excel file as CSV before importing
- Import data (top right) -> text file ->
- mean(dataname$column) -> after the name of the imported data put a dollar sign,
and then the column whose mean you want
- dataname[2, 5] -> gives the value you reach by going 2 down and 5 to the right
- To get all values in that column use dataname[, 5]
- table(dataname$gender) then gives e.g. the number of men and women
- Females only -> dataname$gender == 'F' (the condition you want to be true)
-> this gives a vector of TRUE and FALSE
- New dataframe: female <- dataname[dataname$gender == 'F', ]
- femalespeeds <- dataname$speed[dataname$gender == 'F'] no comma, because it is a vector and not a dataframe
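The steps in the list above can be tried out roughly as follows. The vector is the one from the notes; the dataframe dataname with columns gender and speed is a made-up stand-in for the imported CSV file, whose real contents are not given here.

```r
# Vector from the notes
myvariable <- c(72, -90, 69)

myvariable * 3        # every element times 3
mean(myvariable)      # average of the three numbers
length(myvariable)    # number of elements
sum(myvariable)       # total
myvariable[2]         # second element
myvariable > 0        # TRUE/FALSE per element

# Keep only the elements for which the condition is TRUE
positives <- myvariable[myvariable > 0]

# Hypothetical stand-in for the imported data (values are made up)
dataname <- data.frame(
  gender = c("F", "M", "F", "M"),
  speed  = c(10, 12, 11, 14)
)

table(dataname$gender)                                  # counts per gender
female <- dataname[dataname$gender == "F", ]            # dataframe: comma needed
femalespeeds <- dataname$speed[dataname$gender == "F"]  # vector: no comma
mean(femalespeeds)
```

The comma matters only when indexing a dataframe, where rows and columns both have to be addressed; a single column is a plain vector and takes one index.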
Tutorial 1
A hash sign followed by, e.g., "answer to ..": this is not evaluated but treated as a comment
A question mark followed by a function name -> shows the help page for that function
• str(): Prints the structure of the dataframe in a compact
way. Each variable name is given (preceded by a $
sign), followed by an indication of the variable type, and
then an example of the contents. The label 'Factor' can
be taken as a synonym for 'Categorical'. The label 'int'
refers to integers: these are numbers without decimals,
and the label 'num' refers to numbers with decimals.
• summary(): Prints for each variable in the data frame a
short overview of the contents. For the categorical
variables, it gives a list of how frequently each category
occurs (up to the first 6 categories, alphabetically
ordered). For the numerical variables, the 5-number
summary and the mean are given.
• head(): Prints the top 6 rows of the dataframe.
• Size:
◦ dim(G) - returns a vector with the number of rows in
the first element, and the number of columns as
the second element (the dimensions of the object)
◦ nrow(G) - returns the number of rows
◦ ncol(G) - returns the number of columns
• Names:
◦ names(G) - returns the column names (synonym of
colnames() for dataframes)
◦ rownames(G) - returns the row names.
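The inspection commands listed above can be tried on a small dataframe. The G below is a hypothetical miniature with made-up values, not the real data set used in the tutorial; only the column names are taken from the notes.

```r
# Hypothetical miniature of G (made-up values)
G <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = factor(c("Europe", "Americas", "Europe", "Asia")),
  year      = c(1962L, 1962L, 1967L, 1962L),
  lifeExp   = c(70.5, 65.2, 71.0, 50.3)
)

str(G)       # compact structure: one line per variable, with its type
summary(G)   # per-variable overview
head(G)      # top rows (at most 6)

dim(G)       # rows first, then columns
nrow(G)      # number of rows
ncol(G)      # number of columns

names(G)     # column names
rownames(G)  # row names (default: "1", "2", ...)
```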
The different variables make up different columns in the
dataframe. You can select a column from a dataframe by
using the $ symbol. The command G$lifeExp means: column
lifeExp from dataframe G. So to copy column lifeExp into a
new variable, the following notation can be used.
lifeExp <- G$lifeExp
The new object created (lifeExp) is not a dataframe
anymore, but a vector with the data for one variable and
consequently also values of one type (numerical data in this
case). The lifeExp variable also shows up in the
Environment tab in the upper-right pane (under the section
'Values').
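A quick way to check the difference between the dataframe and the extracted column is sketched below. G is again a hypothetical miniature with made-up values.

```r
# Hypothetical miniature of G (made-up values)
G <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(70.5, 65.2, 71.0)
)

lifeExp <- G$lifeExp

is.data.frame(G)        # TRUE: G is a dataframe
is.data.frame(lifeExp)  # FALSE: the extracted column is not
is.numeric(lifeExp)     # TRUE: all values are of one (numerical) type
length(lifeExp)         # 3: one value per row of G
```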
Dataframes have rows and columns. If you want to extract
specific information from it, you need to specify which rows
and columns you want in between square brackets. Row
numbers come first, followed by column numbers, separated
by a comma. If you don't specify the row number or the
column number all rows or all columns are returned. If you
want multiple rows or columns, you can combine them with
the c() command or use the : command if you want
consecutive rows.
# First element in the first column
G[1,1]
# First element in the 3rd column
G[1,3]
# First row
G[1,]
# First column
G[,1]
# First three elements in the 4th column
G[1:3,4]
# Elements from the second row, first and fifth column
G[2,c(1,5)]
The command unique() returns the unique entries in a
variable. This can be very useful to find out the
details of large data sets. For the gapminder data it can, for
example, help to find out for how many countries we have
data. To determine the length of a vector, and thus the
number of unique entries, you can furthermore use the
command length().
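Counting countries then comes down to combining the two commands. A minimal sketch, assuming a hypothetical miniature of G with made-up values in which each country appears once per year:

```r
# Hypothetical miniature of G (made-up values)
G <- data.frame(
  country = c("A", "A", "B", "B", "C"),
  year    = c(1962, 1967, 1962, 1967, 1962)
)

countries   <- unique(G$country)   # each country once
n_countries <- length(countries)   # how many different countries

length(G$country)                  # number of rows, repeats included
```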
You can also use values of one variable to make selections
from the dataset. For this the logical operators like == can be
used. The following command selects e.g. all rows in G
which apply to Europe, and subsequently uses the result to
make a subset from the vector country (which is stored in a
new vector countryEurope).
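The command itself is missing from these notes. A plausible form, assuming the columns are called continent and country and the continent is spelled "Europe", is sketched here on a hypothetical miniature of G with made-up values.

```r
# Hypothetical miniature of G (made-up values)
G <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("Europe", "Americas", "Europe", "Asia")
)

# TRUE for every row that belongs to Europe
inEurope <- G$continent == "Europe"

# Use the TRUE/FALSE vector to make a subset of the vector country
countryEurope <- G$country[inEurope]
countryEurope
```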
Make a vector with pop data for the year 1962 and the
continent Americas.
Solution
You can do this in a few steps:
1) Select all data for 1962
Save this in a new dataframe G1962.
G1962 <- G[G$year == 1962, ]
The selection between the square brackets means: 1) select
all rows from G for which G$year is 1962 and 2) (after the ,)
use all columns.
2) Select all rows for which the continent is Americas.
The syntax is the same as in the first step, but now uses
G1962 to start with.
G1962_Americas <- G1962[G1962$continent == "Americas", ]
3) Select the column pop
For the third step, select from the dataframe that contains
only data from Americas in 1962 (created in step 2).
G1962_Americas_pop <- G1962_Americas$pop
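The three steps can probably also be collapsed into one by combining both conditions with the & operator. This is a sketch on a hypothetical miniature of G with made-up values, not the solution given in the tutorial; the name pop_onestep is introduced here only for illustration.

```r
# Hypothetical miniature of G (made-up values)
G <- data.frame(
  continent = c("Americas", "Europe", "Americas", "Americas"),
  year      = c(1962, 1962, 1967, 1962),
  pop       = c(100, 200, 300, 400)
)

# Three steps, as in the solution above
G1962              <- G[G$year == 1962, ]
G1962_Americas     <- G1962[G1962$continent == "Americas", ]
G1962_Americas_pop <- G1962_Americas$pop

# One step: both conditions must be TRUE for a row to be kept
pop_onestep <- G$pop[G$year == 1962 & G$continent == "Americas"]

identical(pop_onestep, G1962_Americas_pop)
```

The stepwise version leaves intermediate dataframes behind that can be inspected; the one-step version is shorter but harder to check when a condition is wrong.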
Row on the left
Column on the right
[row, column]