Chapter 1 Introduction
1.2 R syntax
Here a brief overview of the main components of the R language.
Note: this is not a complete guide on the language nor on programming in general, but rather a quick-start introduction with the most used operations and structures, which assumes some very basic programming skills. For all the rest, Google is your friend (it really is).
1.2.1 Basic operations
R can be used as a fully featured calculator, typically directly from the console (see bottom panel in RStudio).
Some examples:
## [1] 4
## [1] 4 8 12
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
As any other programming language, it allows for variables declaration. Variables hold a value which can be assigned, modified, and used for other computations.
## [1] 4
The arrow sign <-
is the traditional sign for the assignment operation,
but also =
works. Check this examples to see why <-
is recommended.
Conditional statements and loops have straightforward syntax (similar to C/C++ , but more compact).
if (b == 4 && a > 0) { # if-else statement
out <- ""
for (i in 1:b) { # for loop
out <- paste(out, class(b)) # concatenate two or more strings
}
print(out) # print string to output
} else {
stop("error") # stop program and exit
}
## [1] " numeric numeric numeric numeric"
While <-
is only used for variable assignment, i.e. filling up the variable
allocated space with some value, the =
is both used for assignment and
function argument passing (as introduced in Functions).
1.3 Functions
Functions are called with the ()
operator and present named arguments.
# get Normal samples with 0 (default) mean and 10 standard deviation
array <- rnorm(10, sd = 10) # sd is an argument,
# the mean is left to the default 0
array
## [1] 2.2122111 15.5501461 3.6440240 -7.2013880 -9.2220237 3.5074603
## [7] -12.8923085 -4.4643633 0.8044537 13.8015212
You can explicitly call the arguments to which you want to pass the parameters (e.g. sd of the Gaussian). In this case it is necessary, because the second position argument is the mean, but we want it to be the default.
1.3.1 Definition
To define a custom function
# implements power operation
#'
#' @param x a real number
#' @param n a natural number, defaults to 2
#' @return n-th power of x
pow <- function(x, n = 2) {
ans <- 1
if (n > 0) {
for (i in 1:n) {
ans <- ans * x
}
} else if (n == 0) {
# do nothing, ans is already set to 1
} else {
stop("error: n must be non-negative")
}
return(ans)
}
print(paste("3^5 = ", pow(3, 5),
", 4^2 = ", pow(4),
", 5^0 = ", pow(5, 0)))
## [1] "3^5 = 243 , 4^2 = 16 , 5^0 = 1"
The return()
call can be omitted (but it’s better not to).
1.4 Data types
Every variable in R is an object, which holds a value (or collection of values) and some other attributes, properties or methods.
1.4.1 Base types
There are 5 basic data types:
- numeric (real numbers)
- integer
- logical (aka boolean)
- character
- complex (complex numbers)
1.4.1.1 Numeric
When you write numbers in R, they are going to default to numeric type values
a <- 3.4 # decimal
b <- -.30 # signed decimal
c <- 42 # also without dot, it's numeric
print(paste0("data types | a:", class(a),
", b:", class(b),
", c:", class(c)))
## [1] "data types | a:numeric, b:numeric, c:numeric"
1.4.1.2 Integer
Integer numbers can be enforced typing an L
next to the digits.
Casting is implicit when the result of an operation involving integers
is not an integer
int_num <- 50L # 50 stored as integer
non_casted_num <- int_num - 2L # result is exactly 48, still int
casted_num <- int_num / 7L # implicit cast to numeric type to store decimals
print(paste(class(int_num), class(non_casted_num),
class(casted_num), sep = ", "))
## [1] "integer, integer, numeric"
1.4.1.3 Logical
The logical
type can only hold TRUE
or FALSE
(in capital letters)
bool_a <- FALSE
bool_b <- T # T and F are short for TRUE and FALSE
bool_a | !bool_b # logical or between F and not-T
## [1] FALSE
You can test the value of a boolean also using 0,1
## [1] TRUE
and a sum between logical values, treats FALSE = 0
and TRUE = 1
,
which is useful when counting true values in logical arrays
## [1] 2
1.4.1.4 Character
A character is any number of characters enclosed in quotes ' '
or double
quotes " "
.
char_a <- "," # single character
char_b <- "bird" # string
char_c <- 'word' # single quotes
full_char <- paste(char_b, char_a, char_c) # concatenate chars
class(full_char) # still a character
## [1] "character"
1.4.1.6 Special values
NA
: “not available”, missing valueInf
: infinityNaN
: “not-a-number”, undefined value
## [1] TRUE
Every operation involving missing values, will output NA
## [1] NA
## [1] "1/0 = Inf , 0/0 = NaN"
These special values are not necessarily unwanted, but they require
extra care. E.g. Inf
can appear also in case of numerical overflow.
## [1] Inf
1.4.1.7 Conversion
Variables types can be converted with as.<typename>()
-like functions, as long
as conversion makes sense. Some examples:
v <- TRUE
w <- "0"
x <- 3.2
y <- 2L
z <- "F"
cat(paste(
paste(x, as.integer(x), sep = " => "), # from numeric to integer
paste(y, as.numeric(y), sep = " => "), # from integer to numeric
paste(y, as.character(y), sep = " => "), # from integer to character
paste(w, as.numeric(w), sep = " => "), # from number-char to numeric
paste(v, as.numeric(v), sep = " => "), # from logical to numeric
sep = "\n"
))
## 3.2 => 3
## 2 => 2
## 2 => 2
## 0 => 0
## TRUE => 1
## Warning: NAs introduced by coercion
## [1] NA
1.4.2 Vectors and matrices
1.4.2.1 Vectors
Vectors are build with the c()
function.
A vector holds values of the same type.
## [1] 4 3 9 5 8
Vector operations and aggregation of values is as intuitive as it can be.
## [1] 29
## [1] 4.8
## [1] 9 8 5 4 3
Conversion is still possible and it’s made element by element.
## [1] "4" "3" "9" "5" "8"
Range vectors (unit-stepped intervals) are built with start:end
syntax.
Note: the type of range vectors is integer
, not numeric
.
## [1] "integer"
They are particularly useful in loops statements:
vec3 <- c() # declare an empty vector
# iterate all the indices along vec1
for (i in 1:length(vec1)) {
vec3[i] <- vec1[i] * i # access with [idx]
}
vec3
## [1] 4 6 27 20 40
Vector elements are selected with square
brackets []
. Putting vectors inside brackets performs
slicing
## [1] 4 3 9
## [1] 4 9
## [1] 5 8
## [1] 4 9 8
To find an element in a vector and get its index/indices,
the which()
function can be used
## [1] 2
## [1] 1 2
And finally, to filter only values that satisfy a
certain condition, we can combine which
with
splicing.
## [1] 9 5 8
## [1] 9 5 8
1.4.2.2 Matrices
Matrices are built with matrix()
## [,1] [,2] [,3] [,4]
## [1,] 1 7 13 19
## [2,] 2 8 14 20
## [3,] 3 9 15 21
## [4,] 4 10 16 22
## [5,] 5 11 17 23
## [6,] 6 12 18 24
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
## [5,] 17 18 19 20
## [6,] 21 22 23 24
## [1] 6 4
## [1] 6 4
## [1] 6
All indexing operations available on vectors, are also available on matrices
## [1] 1
## [1] 9 10 11 12
## [,1] [,2]
## [1,] 1 2
## [2,] 5 6
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 5 9 13 17 21
## [2,] 2 6 10 14 18 22
## [3,] 3 7 11 15 19 23
## [4,] 4 8 12 16 20 24
Operations with matrix and vectors can be both element-wise
and matrix operations (e.g. scalar product).
Note that a vector built with c()
is a column vector by default.
Some examples:
diagonal_mat <- diag(nrow = 4) # 4x4 identity matrix
# element by element
diagonal_mat * 1:2 # note: 1:2 is repeated to match the matrix dimensions
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 2 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 2
## [,1]
## [1,] 2
## [2,] 4
## [3,] 6
## [4,] 8
## [,1]
## [1,] 20
1.4.2.3 Arrays
Arrays are multi-dimensional vectors (generalization of a matrix with more than two dimensions). They work pretty much like matrices.
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 9 11 13 15
## [2,] 10 12 14 16
##
## , , 3
##
## [,1] [,2] [,3] [,4]
## [1,] 17 19 21 23
## [2,] 18 20 22 24
## [1] 18
## [,1] [,2] [,3]
## [1,] 3 11 19
## [2,] 4 12 20
## [1] 2 3
1.4.3 Lists and dataframes
Lists are containers that can hold different data types. Each entry, which can even be another list, has a position in the list and can also be named.
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] TRUE
##
## $x
## [1] "a" "b" "c"
## [1] "a" "b" "c"
## [1] "a" "b" "c"
Dataframes are collections of columns that have the same length. Contrarily to matrices, columns in dataframes can be of different types. They are the most common way of representing structured data and most of the dataset will be stored in dataframes.
df1 <- data.frame(x = 1, y = 1:10,
char = sample(c("a", "b"), 10, replace = TRUE))
df1 # x was set to just one value and gets repeated ('recycled')
## x y char
## 1 1 1 a
## 2 1 2 b
## 3 1 3 b
## 4 1 4 b
## 5 1 5 b
## 6 1 6 b
## 7 1 7 b
## 8 1 8 a
## 9 1 9 a
## 10 1 10 b
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 1 1 1 1 1 1 1 1 1
## [1] "a" "b" "b" "b" "b" "b" "b" "a" "a" "b"
## x y char
## 2 1 2 b
## 3 1 3 b
## 4 1 4 b
The dplyr
library provides another dataframe object (called tibble) which has all
the effective features of Base R data.frame
and none of the
deprecated functionalities. It’s simply a newer version of dataframes (therefore
recommended over the old one).
## # A tibble: 15 × 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 1
## 2 2 1 2
## 3 3 1 3
## 4 4 1 4
## 5 5 1 5
## 6 6 1 6
## 7 7 1 7
## 8 8 1 8
## 9 9 1 9
## 10 10 1 10
## 11 11 1 11
## 12 12 1 12
## 13 13 1 13
## 14 14 1 14
## 15 15 1 15
For more information on tibble and its advantages with respect to traditional
dataframes, type vignette("tibble")
in an R console.
Notice that you can convert datasets to tibble with as_tibble()
, while
with as.data.frame()
you will get a Base R dataframe.
1.5 Data manipulation
Now that we know what a dataframe is and how it is generated, we can focus on data manipulation.
The dplyr
library provides an intuitive way of working with datasets.
For instance, let’s consider the mtcars
dataset.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mtcars$modelname <- rownames(mtcars) # name column with models
mtcars <- as_tibble(mtcars) # convert to tibble
mtcars # display the raw data
## # A tibble: 32 × 12
## mpg cyl disp hp drat wt qsec vs am gear carb modelname
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 Mazda RX4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 Mazda RX4 …
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 Datsun 710
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 Hornet 4 D…
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 Hornet Spo…
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 Valiant
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 Duster 360
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 Merc 240D
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 Merc 230
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 Merc 280
## # ℹ 22 more rows
Let’s say we want to get the cars with more than 100 hp, and we are just interested in the car model name and we want the data to be sorted in alphabetic order.
mtcars %>% # send the data into the transformation pipe
dplyr::filter(hp > 100) %>% # filter rows with hp > 100
dplyr::select(modelname) %>% # filter columns (select only modelname col)
dplyr::arrange(modelname) # display in alphabetic order
## # A tibble: 23 × 1
## modelname
## <chr>
## 1 AMC Javelin
## 2 Cadillac Fleetwood
## 3 Camaro Z28
## 4 Chrysler Imperial
## 5 Dodge Challenger
## 6 Duster 360
## 7 Ferrari Dino
## 8 Ford Pantera L
## 9 Hornet 4 Drive
## 10 Hornet Sportabout
## # ℹ 13 more rows
There are many other dplyr functions for data transformation. This useful cheatsheet summarizes most of them for quick access.
1.6 Plotting
In base R plots are built with several calls to functions, each of which edit the current canvas. For instance, to plot some points and a line:
# generate synthetic data
n_points <- 20
x <- 1:n_points
y <- 3 * x + 2 * rnorm(n_points)
plot(x, y)
abline(a = 0, b = 3, col = "red")
However, the ggplot
library is now the new standard plotting library.
In ggplot, a plot is decomposed in three main components: data, coordinate
system and visual marks, called geoms.
The plot is built by stacking up layers of visualization objects.
Data is in form of dataframes and the columns are selected in the aesthetics
arguments.
The same plot shown before can be drawn with ggplot in the following way.
library(ggplot2)
gg_df <- tibble(x = x, y = y)
ggplot(gg_df) +
geom_point(mapping = aes(x, y)) +
geom_abline(mapping = aes(intercept = 0, slope = 3), color = "red")
This is just a brief example. More will be seen in the next lessons. Check out this cheatsheet for quick look-up on ggplot functions.
1.7 Examples: plot and data manipulation
Combining altogether, here a data visualization workflow on the Gapminder dataset.
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
A factor, which we haven’t seen yet, is just a data-type characterizing a
discrete categorical variable; the levels of a factor describe how many
distinct categories it can take value from (e.g. the variable continent
takes
values from the set {Africa, Americas, Asia, Europe, Oceania}
).
Let’s say we want to compare the GDP per capita of some different countries (Italy, Japan, Brasil and Ethiopia), plotted against time (year by year).
# transform the dataset according to what is necessary
wanted_countries <- c("Italy", "Japan", "Brazil", "Ethiopia")
gapminder %>%
dplyr::filter(country %in% wanted_countries) %>%
# now feed the filtered data to ggplot (using the pipe op)
ggplot() +
geom_line(aes(year, gdpPercap, color = country))
If we want to add some information about the same measure over the whole continent, showing for instance the boundaries of GDP among all countries in the same continent of the four selected countries, this is more or less what we can do
# give all the data to ggplot, we'll filter later
plt <- gapminder %>%
ggplot() +
geom_line(data = . %>%
dplyr::filter(country %in% wanted_countries),
aes(year, gdpPercap, color = country)) +
# now group by continent and get the upper/lower bounds
geom_ribbon(data = . %>%
# min(NA) = NA, make sure NAs are excluded
dplyr::filter(!is.na(gdpPercap)) %>%
# gather all entries for each continent separately
dplyr::group_by(continent, year) %>%
# compute aggregated quantity (min/max)
dplyr::summarize(minGdp = min(gdpPercap),
maxGdp = max(gdpPercap), across()) %>%
dplyr::filter(country %in% wanted_countries),
aes(ymin = minGdp, ymax = maxGdp,
x = year, color = country, fill = country),
alpha = 0.1, linetype = "dashed", size = 0.2)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `across()`.
## Caused by warning:
## ! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
## ℹ Please supply `.cols` instead.
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'continent', 'year'. You can override using
## the `.groups` argument.
But, since it looks a bit confusing, we might want four separate plots.
wanted_continents <- gapminder %>%
dplyr::filter(country %in% wanted_countries) %>%
# extract one column from a dataframe (different from select)
dplyr::pull(continent) %>%
unique()
gapminder %>%
dplyr::filter(continent %in% wanted_continents) %>%
ggplot() +
geom_line(data = . %>%
dplyr::filter(country %in% wanted_countries),
aes(year, gdpPercap, color = country)) +
# now group by continent and get the upper/lower bounds
geom_ribbon(data = . %>%
# min(NA) = NA, make sure NAs are excluded
dplyr::filter(!is.na(gdpPercap)) %>%
# gather all entries for each continent separately
dplyr::group_by(continent, year) %>%
# compute aggregated quantity (min/max)
dplyr::summarize(minGdp = min(gdpPercap),
maxGdp = max(gdpPercap), across()) %>%
dplyr::filter(country %in% wanted_countries),
aes(ymin = minGdp, ymax = maxGdp,
x = year, color = country, fill = country),
alpha = 0.1, linetype = "dashed", size = 0.2) +
facet_wrap(vars(continent)) +
labs(title = paste("Country GDP per capita compared with",
"continent lower and upper bounds"),
x = "year", y = "GDP per capita (PPP dollars)")
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'continent', 'year'. You can override using
## the `.groups` argument.
1.8 Probability
Base R provide functions to handle almost any probability distribution. These functions are usually divided into four categories:
- density function
- distribution function
- quantile function
- random function (sampling)
n <- 10
normal_samples <- rnorm(n = n, mean = 0, sd = 1) # sample 10 Gaussian samples
normal_samples
## [1] -1.69323775 -0.16895647 0.43849260 -0.11836184 0.37104864 0.04193511
## [7] -1.90674237 0.90455160 1.59088381 0.09928534
## [1] 0.0004355325 0.0379637081 0.1178796350 0.0423127430 0.1058555560
## [6] 0.0586629112 0.0001934955 0.2189433766 0.3669144512 0.0655267619
## [1] 0.04520511 0.43291544 0.66948538 0.45289048 0.64469935 0.51672479
## [7] 0.02827698 0.81714851 0.94418214 0.53954414
## [1] -1.644854 1.644854
1.9 Extras
1.9.1 File system and helper
R language provides several tools for management of files and function help. Here some useful console commands. Note that most of them are also available on RStudio through the graphic interface (buttons).
R saves all the variables and you can display them with ls()
.
rm(list = ls()) # clear up the space removing all variables stored so far
# let's add some variables
x <- 1:10
y <- x[x %% 2 == 0]
ls() # check variables in the environment
## [1] "x" "y"
The working directory is the folder located on your computer from which R navigates the filesystem.
## [1] "/Users/runner/work/statistical-models-r/statistical-models-r"
setwd("./tmp") # set the working directory to an arbitrary (existing) folder
# save the current environment
save.image("./01_test.RData")
# check that it's on the working directory
dir()
RStudio typically save the environment automatically, but sometimes (if not every time you close R) you should clear the environment variables, because loading many variables when opening RStudio might fill up too much memory.
You can also read function helpers simply by typing ?function_name
. This
will open a formatted page with information about a specific R function or object.
1.9.2 Packages
Packages can be installed via command line using
install.packages("package_name")
, or through RStudio graphical interface.
# the following function call is commented because package installation should
# not be included in a script (but you can find it commented, showing that the
# script requires a package as dependency)
# install.packages("tidyverse")
And then you can load the package with library
.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
1.9.3 Arrow sign
The difference between <-
and =
is not just programming style preference.
Here an example where using =
rather than <-
makes a difference:
## Error in eval(substitute(expr), e): argument is missing, with no default
# 'b<-a^2' is the value passed to the expr argument of within()
within(data.frame(a = rnorm(2)), b <- a^2)
## a b
## 1 1.492388 2.227221
## 2 -1.137960 1.294953
Although this event might never occur in one’s programming experience, it’s
safer (and more elegant) to use <-
when assigning variable.
Besides, ->
is also valid and it is used (more intuitive) when assigning pipes
result to variables.
library(dplyr)
x <- starwars %>%
dplyr::mutate(bmi = mass / ((height / 100) ^ 2)) %>%
dplyr::filter(!is.na(bmi)) %>%
dplyr::group_by(species) %>%
dplyr::summarise(bmi = mean(bmi)) %>%
dplyr::arrange(desc(bmi))
x
## # A tibble: 32 × 2
## species bmi
## <chr> <dbl>
## 1 Hutt 443.
## 2 Vulptereen 50.9
## 3 Yoda's species 39.0
## 4 Kaleesh 34.1
## 5 Droid 32.7
## 6 Dug 31.9
## 7 Trandoshan 31.3
## 8 Sullustan 26.6
## 9 Zabrak 26.1
## 10 Besalisk 26.0
## # ℹ 22 more rows
1.10 Exercises
- LogSumExp trick
Try to implement the log-sum-exp trick in a function that takes as argument
three numeric
variables and computed the log of the sum of the exponentials
in a numerically stable way.
See this
Wiki paragraph
if you don’t know the trick yet.
log_sum_exp3 <- function(a, b, c) {
# delete this function and re-write it using the trick
# in order to make it work
return(log(exp(a) + exp(b) + exp(c)))
}
# test - this result is obviously wrong: edit the function above
log_sum_exp3(a = -1000, b = -1001, c = -999) # should give -998.5924
## [1] -Inf