General
● install packages
install.packages("…")
library(...)
● Checking class of variable
class()
● Changing class of variables
character_var <- as.character(numeric_var)
→ a character variable stores text, such as letters, words, or any other character
strings
numeric_var <- as.numeric(character_var)
→ a numeric variable stores quantitative values, continuous or discrete, such as
integers or real numbers
date_var <- as.Date("2022-01-01")
→ converts a string written as yyyy-mm-dd into a Date
factor_variable <- as.factor(char_vector)
→ a factor represents categorical data in R
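A small sketch of the conversions above on made-up values (the values and the extra names `mixed` and `converted` are illustrative only):

```r
# Illustrative values only; none of these come from a real dataset.
numeric_var <- c(1.5, 2, 3)
character_var <- as.character(numeric_var)   # numbers become text
class(character_var)                         # "character"

# Text that is not a number becomes NA (R also raises a warning).
mixed <- c("10", "abc")
converted <- suppressWarnings(as.numeric(mixed))

date_var <- as.Date("2022-01-01")            # string in yyyy-mm-dd format
class(date_var)                              # "Date"

char_vector <- c("low", "high", "low")
factor_variable <- as.factor(char_vector)
levels(factor_variable)                      # levels are sorted alphabetically
```

Checking `class()` before and after a conversion is a quick way to confirm that it did what was intended.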
● checking NAs
is.na(df)
any(is.na(x))
● removing NAs
unique()
→ removes duplicate values only; NAs are kept
na.omit()
cleaned_vec <- vec[complete.cases(vec)]
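As a rough illustration of the NA checks and removals above, using a made-up vector and data frame:

```r
vec <- c(4, NA, 7, NA, 9)                 # made-up example vector
any(is.na(vec))                           # is there at least one NA?
n_missing <- sum(is.na(vec))              # how many NAs are there?
cleaned_vec <- vec[complete.cases(vec)]   # keep only the non-missing values

df <- data.frame(id = 1:3, score = c(10, NA, 30))
cleaned_df <- na.omit(df)                 # drops every row that contains an NA
nrow(cleaned_df)
```

Note that `na.omit()` on a data frame removes the whole row, even if only one column is missing.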
● providing output
summary()
● merging datasets
package = dplyr
inner_join()
result <- inner_join(df1, df2, by = "ID")
left_join() or right_join()
merge()
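The sketch below shows how the join types differ, using base `merge()` so it runs without dplyr. The two small data frames are invented for illustration; the comments name the dplyr function each call roughly corresponds to.

```r
# Two made-up tables that share the key column ID.
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Ann", "Bob", "Cy"))
df2 <- data.frame(ID = c(2, 3, 4), Score = c(80, 95, 70))

inner <- merge(df1, df2, by = "ID")                 # like inner_join(): only IDs in both
left  <- merge(df1, df2, by = "ID", all.x = TRUE)   # like left_join(): every row of df1
right <- merge(df1, df2, by = "ID", all.y = TRUE)   # like right_join(): every row of df2
full  <- merge(df1, df2, by = "ID", all = TRUE)     # every ID from either table
```

Rows without a match in the other table get NA in the columns that come from that table.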
● interpretation R-squared (R2)
R-squared is the proportion of the variability in the dependent variable that is
explained by the independent variables. It ranges from 0 to 1.
If R-squared is close to 1, the model fits the data well: a large proportion of the
variability in the dependent variable is explained by the independent variables.
If R-squared is close to 0, the model does not fit the data well: the independent
variables explain little of the variability in the dependent variable.
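To make the definition concrete, here is a minimal sketch on invented data that reads R-squared from the model summary and recomputes it by hand as 1 minus the residual sum of squares over the total sum of squares:

```r
# Made-up data: y is roughly linear in x.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.0)
model <- lm(y ~ x)

r2 <- summary(model)$r.squared          # R-squared as reported by R

# The same number by hand: 1 - SS_residual / SS_total.
ss_res <- sum(residuals(model)^2)       # variability left unexplained
ss_tot <- sum((y - mean(y))^2)          # total variability in y
r2_manual <- 1 - ss_res / ss_tot
```

Because these points lie almost on a straight line, the value comes out very close to 1.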
● interpretation intercepts/coefficients
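The intercept is the predicted value of the dependent variable when all independent
variables are 0.
The sign of a coefficient (positive or negative) indicates the direction of the effect.
If the coefficient is positive (e.g., +29.4), a one-unit increase in the independent
variable is associated with an increase (here 29.4 units) in the dependent variable,
holding the other variables constant. If it is negative, a one-unit increase is
associated with a decrease.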
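A small sketch on constructed data may help: the two outcome variables below are built from known intercepts and slopes, so the fitted coefficients can be compared with the values that went in (all numbers are invented).

```r
x <- 1:5
y_up   <- 3 + 2 * x        # built with intercept 3 and slope +2
y_down <- 10 - 1.5 * x     # built with intercept 10 and slope -1.5

coef_up   <- coef(lm(y_up ~ x))     # positive slope: y rises as x rises
coef_down <- coef(lm(y_down ~ x))   # negative slope: y falls as x rises
```

`coef()` returns a named vector, with `(Intercept)` first and one entry per independent variable after it.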
● select and filter data
package = dplyr
selected_data <- select(data, ID, Name)
filtered_data <- filter(data, Age > 25, Score >= 90)
== → keep only rows equal to a value, for example a particular date
!= → exclude rows equal to a value
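The following sketch reproduces the same selections in base R, so it runs without dplyr; the data frame is made up, and the comments show the dplyr call each line corresponds to.

```r
# Made-up data with the column names used above.
data <- data.frame(
  ID    = 1:4,
  Name  = c("Ann", "Bob", "Cy", "Dee"),
  Age   = c(24, 31, 40, 28),
  Score = c(95, 92, 85, 90)
)

selected_data <- data[, c("ID", "Name")]                    # select(data, ID, Name)
filtered_data <- data[data$Age > 25 & data$Score >= 90, ]   # filter(data, Age > 25, Score >= 90)
only_bob      <- data[data$Name == "Bob", ]                 # == keeps the matching rows
without_bob   <- data[data$Name != "Bob", ]                 # != drops the matching rows
```

Conditions separated by commas in `filter()` must all hold, which is why they are combined with `&` here.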
● Family
Model 1: Ordinary Least Squares Models
● Load dataset
airports <- read.delim(file.choose(), sep = ",", header = FALSE)
routes <- read.delim(file.choose(), sep = ",", header = FALSE)
airunuts <- read.delim(file.choose(), sep = ",", header = TRUE)
● Changing names of columns
names(airports) <- c("id", "name", "city", "country", "iata", "guko", "lat", "lon",
"altitude", "timezone", "dst", "timezonename", "type", "source")
names(airports)
names(routes) <- c("Airline", "AirlineID", "SourceAirport", "SourceairportID",
"Destairport", "DestairportID", "Codeshare", "Stops", "Equipment")
names(routes) <- tolower(names(routes))
● Package edges&nodes
library(igraph)
● Making graph from edgelist
package = igraph
edgelist <- routes[c("sourceairport", "destairport")]
route1 <- graph_from_data_frame(edgelist, directed = TRUE, vertices = NULL)
● Calculating mean & median of indegree
mean(degree(route1, mode = "in"))
median(degree(route1, mode = "in"))
● Calculating standard deviation of indegree
sd(degree(route1,mode = "in"))
● Calculating indegree and sorting it
sort(degree(route1, mode = "in"), decreasing = TRUE)[1:12]
● Making histogram
hist(degree(route1), col = "navyblue", breaks = 50)
● Create edgelist with large airports only (degree > 411, the top 12)
sa <- data.frame(table(edgelist$sourceairport))
ds <- data.frame(table(edgelist$destairport))
edgelist2 <- merge(edgelist, sa, by.x = "sourceairport", by.y = "Var1", all = TRUE)
edgelist2 <- merge(edgelist2, ds, by.x = "destairport", by.y = "Var1", all = TRUE)
edgelist3 <- edgelist2[edgelist2$Freq.x > 411 & edgelist2$Freq.y > 411, ]
→ merge() moves the key column to the front, so pick the two columns by name to
keep the source → destination direction
route2 <- graph_from_data_frame(edgelist3[c("sourceairport", "destairport")])
plot(route2)
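The filtering steps can be traced on a toy edgelist. This is only a sketch: the airport codes are hypothetical, and the threshold of 1 stands in for the 411 used above.

```r
# Toy edgelist with hypothetical airport codes; repeated rows act like
# the same route flown by several airlines.
edgelist <- data.frame(
  sourceairport = c("A", "A", "A", "B", "B", "C"),
  destairport   = c("B", "B", "C", "A", "A", "A")
)

sa <- data.frame(table(edgelist$sourceairport))   # departures per airport (Var1, Freq)
ds <- data.frame(table(edgelist$destairport))     # arrivals per airport (Var1, Freq)

m <- merge(edgelist, sa, by.x = "sourceairport", by.y = "Var1")
m <- merge(m, ds, by.x = "destairport", by.y = "Var1")
col_order <- names(m)   # the key column of the last merge now comes first

# Keep routes whose source and destination both exceed the toy threshold,
# selecting the columns by name so the direction stays source -> destination.
big <- m[m$Freq.x > 1 & m$Freq.y > 1, c("sourceairport", "destairport")]
```

After the second merge, `Freq.x` holds the source-airport count and `Freq.y` the destination-airport count, because `merge()` adds suffixes to columns that share a name.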
● Differences transitivity, betweenness & closeness
Transitivity: The share of your airport's neighbors that are connected to each
other. If two neighboring airports are not connected, travelers between them are
more likely to pass through your airport, although they might also take an
alternative, equally long route.
Betweenness: Expresses how many travelers are forced to go through your airport
if they want to take the shortest route.
Closeness: Expresses how few steps your airport is from the other airports.
Being close to the others makes it a more attractive airport.
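To see the three measures side by side, here is a minimal sketch on a tiny made-up network. It assumes the igraph package is installed and, for simplicity, uses an undirected graph rather than the directed route graph: hub H is linked to A, B and C, and A and B are also linked to each other.

```r
library(igraph)

# Hypothetical mini-network: H is the hub; A and B also connect directly.
edges <- data.frame(
  from = c("H", "H", "H", "A"),
  to   = c("A", "B", "C", "B")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Local transitivity: share of a vertex's neighbor pairs that are linked.
tr <- transitivity(g, type = "local", isolates = "zero")
names(tr) <- V(g)$name

bt <- betweenness(g)   # number of shortest paths running through each vertex
cl <- closeness(g)     # 1 / (sum of distances to all other vertices)
```

In this toy graph only one of H's three neighbor pairs (A and B) is connected, so its transitivity is low, while every shortest path to C has to run through H, which gives H the only non-zero betweenness.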
● Making histograms of centrality measures
hist(transitivity(route1, type = "local", isolates = "zero"), col = "coral", breaks = 50)
hist(betweenness(route1, directed = TRUE), col = "darkorchid2", breaks = 50)
hist(closeness(route1, normalized = FALSE, mode = "in"), col = "deepskyblue", breaks = 50)
round(sort(betweenness(route1, directed = TRUE), decreasing = TRUE)[1:12], 0)
round(sort(closeness(route1, mode = "in"), decreasing = TRUE)[1:12], 2)
● run a linear model
model0 <- lm(y ~ x, data = namedataset, na.action = na.exclude)
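As a hedged illustration of what `na.action = na.exclude` does, the sketch below fits the model on an invented data frame with one missing outcome. The row with the NA is left out of the fit, but residuals and fitted values keep one entry per original row.

```r
# Made-up data; the third y value is missing.
namedataset <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, NA, 8.2, 9.8)
)

model0 <- lm(y ~ x, data = namedataset, na.action = na.exclude)

n_used <- nobs(model0)        # rows actually used in the fit
res    <- residuals(model0)   # padded with NA for the excluded row
```

Keeping the original length makes it easy to attach residuals back to the data, for example with `namedataset$res <- residuals(model0)`.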