Getting started with R

Tips, Tricks & Troubleshooting

Published: 05 April 2024

1 Introduction

This document was created as an aid for students who struggle with using their computers and, consequently, with R. While we cannot focus on, for example, teaching folder structures in our tutorials, we do think that we should provide documentation to which we can refer. This document is purely supplementary. It is based on the struggles we have observed in the tutorials over the past years, especially those that are not directly related to data analysis.

In this document you will find tips on organising files on your computer, which we think is beneficial for your study. Of course you can do it in whatever way you like, but if you do not know where to start, this might be just what you need. Further, we describe how to set up R and RStudio, how to write proper R code, what kinds of data structures exist in R, how to download and install R packages, how to load them in new R sessions (and why that is important), some tips on storing and loading your data, and finally some general troubleshooting information.


2 Organising files on your computer

Before we get into working with R specifically, in this section we present some general recommendations for organising your files.

Over the past years we have noticed that the way people use their computers has changed. This comes naturally, as nowadays basically all devices have search functions that make it easy for users to find their files. Also, because of the increased use of smartphones and tablets, computers have become more application-based than they used to be: you can now open, for example, Microsoft Word, and the program lets you open the most recently used documents with just one click.

However, this easy-to-use approach can become an issue when dealing with a large number of files, when you are collaborating with other people, or when you need to know where your files are stored. Imagine you are writing your Bachelor or Master thesis: in order not to lose track of your data and report drafts, it is useful to create a folder structure that allows you to easily find all files without having to use a search function.

Therefore, we formulate the following advice:

  1. Create folders for your projects and your study, rather than simply keeping your files on the Desktop or in the Downloads folder.

Figure 1: Folders - Example

  2. In these folders, create subfolders for different parts of the project. For example, when you have multiple drafts of your report, move the older drafts into a separate folder to keep an overview.

Figure 2: Subfolders - Example

  3. Move files on your computer by using the File Explorer (Windows) or the Finder (Mac). If you don’t know how to open it, on Windows you can simultaneously press the Windows key + E, on Mac press CMD + OPTION + SPACEBAR.

  4. Do NOT open any data files using Excel or similar. Depending on the language settings of your device, this might mess up the data. For example, if your device is set to German, Spanish or any other language that uses commas (,) as decimal separators instead of periods (.), you will have to write additional R code to properly read the data. To prevent this you can set your device language to international English.

  5. On Mac there are issues when downloading files from Canvas using the Safari browser. Use Google Chrome or Mozilla Firefox instead. By default, most browsers save your downloaded files in the Downloads folder on your device.

NOTE: To choose where a file is downloaded to, either right-click on the link and select save link as, or specify in the browser settings that you want to be asked where each download is saved. For example, in Google Chrome go to Settings → Downloads and enable Ask where to save each file before downloading:

Figure 3: Downloads - Example
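
The additional R code mentioned in the advice on decimal separators can be as small as choosing a different reading function. The following is a minimal sketch, assuming a hypothetical file in which fields are separated by semicolons and decimals are written with commas; the inline text stands in for such a file:

```r
# Hypothetical file content: semicolon-separated, decimal commas
raw_text <- "id;score\n1;3,5\n2;4,25"

# read.csv2() from base R expects ';' as separator and ',' as decimal mark
scores <- read.csv2(text = raw_text)

str(scores)          # score is read as a number, not as text
mean(scores$score)   # 3.875
```

With a real file you would pass the file name instead of the text argument.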

3 Setting up R

3.1 Download & installation

The first step is to download and install both R and RStudio for your operating system. But why do you have to install two things (R AND RStudio)? Isn’t one enough? The answer is simple.

R is a programming language. It comes with a simple interface, but that interface is rather impractical to use and lacks useful features such as code completion. RStudio is a program that provides a more convenient interface for various programming languages, among them R. RStudio has useful features such as code completion, file management, an environment overview, R-projects (more on this later) and many more.

In other words: RStudio makes your life easier when you have to work with R. Therefore, we ask you to install both, R AND RStudio.

After downloading R and RStudio, you have to install them in the correct order to ensure that everything works properly:

  1. First install R
  2. Only once that is done, install RStudio. This ensures that RStudio can find the correct version of R.

To download and install R and RStudio, follow the guide on this website.

Note: If you already have an older version of R or RStudio installed, it is advised to update them regularly. If you want to start fresh, uninstall both before you install the newest versions.

CAUTION: If you do this, you might have to re-install packages that you already installed before.

3.2 Set up your R-project

When working with R, we always advise using RStudio. Therefore, do NOT open R itself, but open RStudio instead. If you do this for the first time, it should look like this:

Figure 4: RStudio - 1st time

As you can see, in what is labelled the Console RStudio tells me that my currently installed R version is R version 4.3.1. This was the most recent R version at the time of creating this document, so by now there might be a newer one. Please always install the newest R version.

3.2.1 Create a project

At BMS we work with R-projects. Using R-projects, you can ensure that R will always be able to find all (data) files that you need for your current project. To create a Project follow the steps outlined below (as shown in the video):

  1. File → New Project.

  2. Save it in an existing directory (as shown in the video) or a new directory (in case you do not have a folder yet).

This project directory (i.e. folder) is your working directory (R thinks all your files are in here) for this specific R-project. Every time you want to continue working on this project, you can do so by simply opening the R-project file (with the .Rproj extension).

Once you have opened the R-project (when you create it, RStudio opens it for you), you see in the bottom right pane the (data) files and subfolders that are within your project. Put all the data that you need for this project in the project folder or in a subfolder (in the video that subfolder is called “data”).
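
To check which folder R currently treats as the working directory, a few base R functions are enough. This is a small sketch; the path data/data.csv is a hypothetical example and does not need to exist on your computer:

```r
# Show the current working directory (inside an R-project: the project folder)
getwd()

# List the files and subfolders in the working directory
list.files()

# Build a path relative to the project folder; "data/data.csv" is a
# hypothetical example file, so file.exists() may well return FALSE here
data_path <- file.path("data", "data.csv")
data_path
file.exists(data_path)
```

If getwd() does not show your project folder, you most likely opened RStudio directly instead of opening the .Rproj file.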

3.2.2 Create a script

Once you have created your R-project, you should create a script. Scripts allow you to save the code you have written. If you wrote in the console only, all your code would be lost once you close or restart RStudio. Therefore, always create a script! To create a script, follow the steps outlined below (and again, as shown in the video):

  1. File → New File → R script

  2. Save your script.

Using scripts you can easily repeat what you have done before or share your work with others.

Save it every now and then (either using the “save icon” as shown in the video, or the key combination “CTRL+S / CMD+S”).

4 Writing R Code

4.1 Simple calculations

To get started, you can use R as a calculator. You can use it for simple calculations such as:

1 + 1
[1] 2

or:

2.36 * 15.13
[1] 35.7068

R also knows the order of operations in mathematics:

1 + 2.36 * 15.13
[1] 36.7068

You can of course override this with parentheses, just like in math class:

(1 + 2.36) * 15.13
[1] 50.8368
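
Beyond the four basic operations, a few more operators and functions are handy when using R as a calculator. The selection below is only an illustration, not a complete list:

```r
2^3        # exponent: 8
7 / 2      # division: 3.5
7 %/% 2    # integer division: 3
7 %% 2     # remainder after division: 1
sqrt(16)   # square root: 4
-2^2       # the exponent is evaluated before the minus sign: -4
```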

4.2 Assignment operator

R is more than just a powerful calculator. To assign a value to a symbolic variable, use the assignment operator <-. For example:

x <- 2

This creates an object x that has the value 2, which also shows up in your R environment:

Figure 5: Object in environment - Example
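
Once an object exists, you can use its name in further calculations. A short sketch of how assignment behaves (the object y is introduced here only for illustration):

```r
x <- 2        # create the object x
y <- x * 5    # use x in a calculation and store the result in y
y             # typing the name prints the value: 10

x <- x + 1    # assigning to an existing name overwrites the old value
x             # 3
y             # still 10; y is not recomputed automatically
```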

4.3 Functions in R

Functions return values (e.g. 10 or another R object). There are built-in functions like mean(), but also functions that come from packages (more on this later), like mutate() from the package dplyr.

4.3.1 Function arguments

Suppose we have a vector of numbers, called x:

x <- c(0, 5, 10, 15, 20, NA)

We can use the mean() function to compute the mean (the average value) of that vector:

mean(x)
[1] NA

However, as you see in this example, it does not always return a number. That is because there is a missing value in the data (NA), and by default the mean of a vector containing NA is NA. For cases like these, the function has arguments such as na.rm. If we set na.rm = TRUE, the missing value is removed and we can compute the mean over the other values:

mean(x, na.rm = TRUE)
[1] 10

Find the arguments of a function using R’s internal help function: ?mean.
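
The same idea applies to other functions. The sketch below uses only base R functions and shows arguments being passed by name:

```r
x <- c(0, 5, 10, 15, 20, NA)

# na.rm is available in several summary functions
sum(x, na.rm = TRUE)        # 50
median(x, na.rm = TRUE)     # 10

# round() has an argument 'digits' that sets the number of decimals
round(3.14159, digits = 2)  # 3.14

# args() lists the arguments of a function without opening the help page
args(round)
```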

4.4 Data structures

In R you will end up working with different data structures. In this guide we show an overview of some common ones.

4.4.1 Vectors

There are vectors with numbers only:

x_num <- c(0, 5, 10, 15, 20, NA)

There are vectors with characters only:

x_char <- c("Hello", "World", "!")

And of course there are also mixed vectors:

x_mix <- c("Hello", "World", "!", 5, 10, 15)

R automatically treats mixed vectors as character vectors, so the numbers in x_mix are stored as text.
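
You can verify how R stores a vector with class(). A brief sketch using the vectors from above:

```r
x_num <- c(0, 5, 10, 15, 20, NA)
x_mix <- c("Hello", "World", "!", 5, 10, 15)

class(x_num)   # "numeric"
class(x_mix)   # "character"

x_mix[4]       # "5" (with quotes): the number is stored as text

# Text that looks like a number can be converted back explicitly
as.numeric(x_mix[4:6])   # 5 10 15
```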

4.4.2 Matrices

Further, there are matrices. They are basically like an Excel table (just without the titles for the columns or rows):

char_matrix <- matrix(x_char, 
                      nrow = 1,
                      ncol = 3)

char_matrix
     [,1]    [,2]    [,3]
[1,] "Hello" "World" "!" 

Some people call this a row vector.

num_matrix <- matrix(x_num, 
                     nrow = 6,
                     ncol = 1)

num_matrix
     [,1]
[1,]    0
[2,]    5
[3,]   10
[4,]   15
[5,]   20
[6,]   NA

Some people call this a column vector.

mix_matrix <- matrix(x_mix,
                     nrow = 2,
                     ncol = 3)

mix_matrix
     [,1]    [,2] [,3]
[1,] "Hello" "!"  "10"
[2,] "World" "5"  "15"

This is called a 2 x 3 matrix.

Note: While people may use the term vector for matrices with 1 row or column, they are technically NOT the same.
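
The difference between a vector and a one-column matrix can be checked directly. A small sketch, reusing x_num and num_matrix from above:

```r
x_num <- c(0, 5, 10, 15, 20, NA)
num_matrix <- matrix(x_num, nrow = 6, ncol = 1)

dim(num_matrix)        # 6 1: six rows, one column
dim(x_num)             # NULL: a plain vector has no dimensions
num_matrix[3, 1]       # element in row 3, column 1: 10

is.matrix(x_num)       # FALSE
is.matrix(num_matrix)  # TRUE
is.vector(num_matrix)  # FALSE
```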

4.4.3 Data frames

We can convert matrices to data frames:

df_num <- data.frame(mix_matrix)

df_num
     X1 X2 X3
1 Hello  ! 10
2 World  5 15

And assign variable names:

names(df_num) <- c("Characters", 
                   "Mix", 
                   "Numbers")

df_num
  Characters Mix Numbers
1      Hello   !      10
2      World   5      15

In data frames you can have different variable types, the most common ones are:

Variable type    Meaning
dbl              double (numeric); 12.44, 23, NaN, Inf
int              integer (numeric); 2L, 1134L
num              numeric; 12.44, NaN, Inf, 1134L
fct              factor (categorical)
lgl              logical (categorical); TRUE, FALSE
chr              character
lbl              labelled
Missing Values   all types; NA

Always check the variable types (later you will see how to do that using the glimpse() function).
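
Data frames are more often created directly from vectors than from matrices, and each column then keeps its own type. The sketch below uses made-up example data and base R only:

```r
# Hypothetical example data
df_example <- data.frame(
  name   = c("Anna", "Ben", "Cara"),
  age    = c(21L, 23L, 22L),
  score  = c(7.5, 6.0, 8.25),
  passed = c(TRUE, TRUE, TRUE)
)

str(df_example)             # compact overview of columns and their types
df_example$score            # access a single column with $
mean(df_example$score)      # 7.25
sapply(df_example, class)   # the type of every column
```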

4.4.4 Lists

Last, there are lists. Lists are objects that can “hold” a collection of other objects. They can hold data frames, matrices, vectors, functions, other lists, and so on.

list_1 <- list(df_num, char_matrix, x_num)

list_1
[[1]]
  Characters Mix Numbers
1      Hello   !      10
2      World   5      15

[[2]]
     [,1]    [,2]    [,3]
[1,] "Hello" "World" "!" 

[[3]]
[1]  0  5 10 15 20 NA
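
To get a single element back out of a list, use double square brackets or, if the elements are named, the $ sign. A short sketch with a hypothetical named list called list_2:

```r
x_num  <- c(0, 5, 10, 15, 20, NA)
x_char <- c("Hello", "World", "!")

# Elements of a list can be given names
list_2 <- list(numbers = x_num, words = x_char)

list_2[[1]]            # first element, selected by position
list_2$words           # element selected by name
list_2[["words"]][2]   # second entry of that element: "World"
length(list_2)         # number of elements in the list: 2
```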

4.5 Coding tips

Take a look at the code shown in Figure 6. While it does work, it easily becomes cramped. Further, when coming back to this code after some time, it requires quite some effort to understand what the code is doing, as it is not annotated.

Figure 6: Non-annotated and dense code - Example

In contrast, the code shown in Figure 7 is more readable, and it is easier and faster to recognize which piece of code does what. Although writing code this way takes slightly longer, you help your future self and your collaborators a lot when you use spaces and annotations properly.

Figure 7: Annotated and spaced code - Example
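
As a rough text illustration of the same contrast (this is not the code from the figures, and the scores are made up), compare a dense version with a spaced and annotated version of the same computation:

```r
# Dense and not annotated
s<-c(7.5,6,8.5,NA);m<-mean(s,na.rm=TRUE);r<-round(m,1)

# Spaced and annotated: same result, easier to read
scores <- c(7.5, 6, 8.5, NA)               # hypothetical exam scores

mean_score <- mean(scores, na.rm = TRUE)   # ignore the missing value

mean_rounded <- round(mean_score, 1)       # round to one decimal

mean_rounded
```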

Sometimes you might produce code that does not work; this is totally normal. Instead of rewriting your code from scratch, go back and fix the mistake! Think of it like writing a report: in the end, you submit only the final version, not all the intermediate, partially erroneous versions. For example, your code should NOT look like this:

# Make a 2x2 matrix
m <- matrix(z, ncol = 2, nrwo = 2)
m <- matrix(z, ncol = 2, nrow = 2)

Do you spot the mistake in the first line of code (there is a typo)? There is no reason to keep that line of code in your script. Instead, only keep the correct code, namely like this:

# Make a 2x2 matrix
m <- matrix(z, ncol = 2, nrow = 2)

4.6 Copying code from other sources

In many tutorials we will provide you with code, or you can use code provided in the R manual. While it is easiest to simply copy and paste the code, this can sometimes cause issues. For example, there is a distinction between a - (minus) sign and a – (dash) sign. While the difference is hardly visible to you, R does not recognize the dash and would therefore not be able to perform the computation. If you type the code yourself, you prevent that.

However, when typing code yourself, you might end up making typing errors. This will also lead to error messages. Here it is CRUCIAL to remain calm, and first check whether you have written everything correctly, whether all parentheses are where they need to be, and so on. You can find more information on common errors and how to fix them in Section 7.
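
The dash problem can be reproduced safely. In the sketch below the code is stored as text, so the script itself keeps running; "\u2013" is the escape code for an en dash, used here as an example of a dash that copying might introduce:

```r
# "\u2013" is an en dash, a character that can sneak in when copying code
copied <- "5 \u2013 3"
typed  <- "5 - 3"

eval(parse(text = typed))   # 2

# tryCatch() captures the error instead of stopping the script
result <- tryCatch(
  eval(parse(text = copied)),
  error = function(e) e
)
inherits(result, "error")   # TRUE: R cannot read the en dash as a minus
```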

5 R packages

5.1 Installing packages

Installing packages is simple. For example, when you want to install the package tidyverse (which you will use a lot) you just have to type and run the code below:

install.packages("tidyverse")

In general, you can think of a package like a program on your computer or an app on your phone. You install it once, and every now and then it might be smart to update it. But you definitely do not need to install it again every time! Therefore, after installing the package, you can remove or comment out the code (put the # symbol in front of the code):

# install.packages("tidyverse")
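
If you are not sure whether a package is already installed, you can check before installing. This is one possible sketch using the base R function requireNamespace(); it only reports the status and does not install anything by itself:

```r
pkg <- "tidyverse"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  message("Package '", pkg, "' is not installed yet. ",
          "Run install.packages(\"", pkg, "\") once in the console.")
}
```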

5.2 Loading packages

Whenever you restart your phone or computer, it starts without opening all your apps. Of course there are settings to change that, but this is the default behavior. The same goes for R and RStudio. When you start a new session, R will start empty. Therefore, it is crucial to always specify in your scripts what packages you want to work with. You do that in the first lines of the script. The code for loading the tidyverse packages is:

library(tidyverse)

This will give you this message in the console (if the tidyverse was not loaded before):

Figure 8: Message after loading the tidyverse

Note: R does not load all installed packages automatically, as this might take a long time. Over time you end up with many different packages that you installed on your computer. Loading all of them would take too long (and potentially crash your computer). Therefore, R starts empty and you need to tell R what you want to work with yourself. Think of it like clothes: In the morning you pick the clothes you want to wear that day, rather than wearing all different clothes that you have at once.

Tip: Write the code for loading the packages at the top of your script, as you only need to load them once per session. For example:

Figure 9: Load packages at the top of your script
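
In text form, the top of a script could look like the sketch below. To keep the example runnable without installing anything, it loads the tools package, which ships with R; in the tutorials this line would typically be library(tidyverse) instead:

```r
# ---- Packages (top of the script) ----
library(tools)   # stands in for library(tidyverse) in this sketch

# ---- Analysis ----
"package:tools" %in% search()   # TRUE: the package is attached for this session
file_ext("data/data.csv")       # "csv"; file_ext() comes from the tools package
```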

5.3 The tidyverse

“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” - Tidyverse.org (2023)

The tidyverse is a powerful and very useful collection of R-packages. It includes many useful functions for data wrangling, data reshaping, plotting and many more. Further, with the tidyverse you can create more readable code than with base R.

The tidyverse also introduces the pipe operator %>%, which takes whatever is before (e.g. your data) and feeds it to what comes after (e.g., mutate). In other words, the function mutate is applied to your data.

See for example some code written in base R:

mtcars$ratio <- mtcars$disp/mtcars$cyl            # create new variable
summary(lm(ratio ~ wt, data = mtcars))$r.squared  # run lm and get R2
[1] 0.7706461

Now take a look at code written using the tidyverse. Here we write code with the idea of performing one action per line, such that the code remains readable. Further, it is useful to annotate what you do in each line, such that you can easily find things again whenever you need to update your code. Despite looking substantially different from the previous code, it produces the exact same output:

mtcars %>%                      # data set on cars
  mutate(ratio = disp/cyl) %>%  # compute ratio variable
  lm(ratio ~ wt, data = .) %>%  # run linear model
  summary() %>%                 # make summary of linear model
  pluck("r.squared")            # select R squared from the summary
[1] 0.7706461
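
To see what the pipe does in isolation, compare a normal function call with its piped version. A minimal sketch, assuming the dplyr package (part of the tidyverse) is installed:

```r
library(dplyr)   # makes the pipe operator %>% available

x <- c(0, 5, 10, 15, 20, NA)

# Both lines do exactly the same: x becomes the first argument of mean()
mean(x, na.rm = TRUE)       # 10
x %>% mean(na.rm = TRUE)    # 10

# With several steps, the piped version reads from left to right
nested <- round(sqrt(mean(x, na.rm = TRUE)), 2)
piped  <- x %>% mean(na.rm = TRUE) %>% sqrt() %>% round(2)
identical(nested, piped)    # TRUE
```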

Next to being more readable, the tidyverse comes with many useful functions for data wrangling:

  1. mutate() - Create new columns (variables) by using information from other columns

  2. filter() - Take a subset of rows

  3. arrange() - Sort results

  4. select() - Take a subset of columns

  5. count() / n() - Count discrete values

  6. group_by() & summarise() - Create summary statistics on grouped data

  7. left_join() & right_join() & inner_join() & full_join() - Join two data frames into one based on one or multiple variables

  8. semi_join() & anti_join() - Filter for cases that are / are not in a second data frame
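
Several of the functions listed above can be combined in one pipeline. The following sketch uses the built-in mtcars data set and assumes the dplyr package is installed; the object name car_summary is chosen only for this example:

```r
library(dplyr)

car_summary <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%            # keep cars with 4 or 6 cylinders
  select(cyl, disp, wt) %>%               # keep three columns
  mutate(ratio = disp / cyl) %>%          # new column from existing ones
  group_by(cyl) %>%                       # one group per cylinder count
  summarise(n = n(),                      # number of cars per group
            mean_ratio = mean(ratio)) %>% # mean ratio per group
  arrange(desc(n))                        # largest group first

car_summary
```

The result has one row per cylinder group, with the number of cars and the mean ratio in each group.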

The tidyverse also has functions that you can use to reshape your data, for example from wide to long format or from long to wide format:

  1. pivot_longer() - function to reshape data from wide to long format

  2. pivot_wider() - function to reshape data from long to wide format

Find more information about the tidyverse on the dplyr and tidyr cheat sheets!
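
As a small illustration of reshaping, the sketch below converts a made-up data set from wide to long format and back again. It assumes the tidyr package (part of the tidyverse) is installed; the column names id, t1, t2, time and score are hypothetical:

```r
library(tidyr)

# Hypothetical wide data: one row per person, one column per time point
wide <- data.frame(id = c(1, 2), t1 = c(5, 7), t2 = c(6, 9))

# Wide to long: one row per person and time point
long <- pivot_longer(wide,
                     cols = c(t1, t2),
                     names_to = "time",
                     values_to = "score")
long

# Long to wide: back to one row per person
back <- pivot_wider(long,
                    names_from = time,
                    values_from = score)
back
```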

6 Load your data

In case you saved your data set in a subfolder, specify that. Suppose in your project folder, you have another folder titled data, and in that folder you can find your data set. The data set is in a .csv format and it is called data.csv:

Figure 10: Data set in subfolder - Example

To load it into R, simply use the following code:

data <- read_csv("data/data.csv")

You might not always have, want or need a subfolder. In case your data set is simply stored in your project folder, you would use the following code:

data <- read_csv("data.csv")

In both cases it is wise to inspect your data. This you can do (1) to check whether it is loaded correctly and (2) to get a first idea of the variables and variable types in your data.

data %>% glimpse()

You can also now see the data in your Environment. When you click on the little blue icon, it will look somewhat like this (of course it differs per data set, but what matters is that it shows up in the environment):

Figure 11: Data set in environment - Example

Note: When loading the data, some information (about the data) shows up in the Console. This is normal, and there is nothing to worry about.
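
If you want to try the whole workflow without a real data set, the sketch below creates a small made-up data set in a temporary folder and reads it back in. It assumes the readr and dplyr packages (both part of the tidyverse) are installed; the folder name my_project and the column names are hypothetical:

```r
library(readr)
library(dplyr)

# Create a temporary "project" folder with a data subfolder
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"),
           recursive = TRUE, showWarnings = FALSE)

# Write a small made-up data set to data/data.csv inside that folder
csv_path <- file.path(project_dir, "data", "data.csv")
write_csv(data.frame(id = 1:3, score = c(2.5, 3.0, 4.5)), csv_path)

# Read it back and inspect it; show_col_types = FALSE hides the console message
data <- read_csv(csv_path, show_col_types = FALSE)
glimpse(data)
```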

6.1 Common packages and functions for loading data

Data sets come in all sorts of file formats. For most of them, R or R-packages have useful functions. See the most commonly used ones below:

## included in tidyverse
library(readr)                  
# if delimited by ,
read_csv()
# if delimited by ; instead of ,
read_csv2()                        

## haven package
library(haven)
# for SPSS (.sav & .por) files
read_spss()
# for STATA files
read_stata()  
# for SAS files
read_sas()

## readxl package
library(readxl)
# for .xlsx (MS Excel) files
read_xlsx()
# for older .xls (MS Excel) files
read_xls()

## foreign package
library(foreign)
# old alternative for SPSS (.sav) files 
read.spss( , to.data.frame = TRUE)  

## base R
# if delimited by ,
read.csv()
# if delimited by ; instead of ,
read.csv2() 

7 Troubleshooting & useful tips

Working with R is not always without any issues. Find some tips & tricks below.

7.1 Common problems

Obviously, when we work with R we sometimes run into (a variety of) problems. Let’s look at the most common ones and how to approach them:

  • First, check whether you ran all the code that you need to run.

If you did run everything, but you still run into errors, try the following approaches:

  • The error says “could not find function”? Check whether you loaded (and installed) all the necessary packages

  • R is case-sensitive: writing Mean() instead of mean() results in an error

  • Typing errors: for example, read_csv("daat.csv") instead of read_csv("data.csv")

  • Your files are not stored in the correct directory: R says that the file “does not exist” or “No such file or directory”

  • Missing values in your data: some functions like mean() need the argument na.rm = TRUE, otherwise the output will be NA

  • The data is in the wrong format for the function: double-check with the documentation, e.g. by typing ?mean in the console

In case you run into more complex problems, you will often be able to solve them yourself by searching the internet. For many issues you will see that someone, some time ago, ran into a similar issue and asked about it in online forums such as Stack Overflow. In most cases, you can find (more or less) efficient fixes in the replies.

However, if you cannot manage to solve your problem or if you feel new to R, please contact your teacher.

7.2 Keeping R & RStudio up-to-date

R and RStudio, like other apps and programs, receive regular updates. While you do not need to have the newest versions of everything, it is advised to update them at least at the start of each new academic year. For that, simply uninstall R and RStudio, and install them again as described in Section 3.
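
To find out which version of R you are currently running, you can ask R itself. A small sketch using base R only; the comparison with 4.3.1 refers to the version mentioned earlier in this document:

```r
# The installed R version as text, e.g. "R version 4.3.1 ..."
R.version.string

# TRUE if your R is at least the version mentioned in this document
getRversion() >= "4.3.1"

# Installed packages can be updated from within R; this line is commented out
# because it downloads from the internet and can take a while
# update.packages(ask = FALSE)
```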