1 Script usage in R and RStudio

1.1 Why should we write scripts (and do that in R)?

Very much like keeping a laboratory notebook, it is critical to record all the steps that gave rise to a given analysis, figure, etc.:

  • what are the input data?
  • how did you transform/filter/summarize them?
  • what parameters did you use to get a particular figure or table?

Many biological datasets can be analysed with Excel, machine-specific proprietary software, etc., all of which are operated through a GUI; yet moving to R turns out to be very handy should you need to:

  • repeat the same analysis multiple times
  • redo an analysis you did a long time ago
  • find where an error comes from (and fix it)
  • share what you did (with a co-worker, a student, etc.)

Scripts consist of a list of commands, organized into a file, to be executed in a given order to get the output(s) you want, starting with input data.

In addition to the commands themselves, scripts also contain comments, which are preceded by a specific character (#) that instructs R not to run the rest of the line. Comments can be anything that is not an instruction to R but is useful to document: a description of the how/why of a particular command, a note to your future self about a possible improvement, … There is no such thing as a too heavily commented script!
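As a minimal sketch of what commented code can look like (the variable names and values below are made up for illustration):

```r
# Hypothetical example: average of three made-up measurements.
# Every line (or end of line) starting with '#' is a comment and is ignored by R.

measurements <- c(4.2, 5.1, 3.9)   # made-up values, for illustration only

# Why: replicates are summarised by their mean before any plotting.
mean_value <- mean(measurements)

# Note to future self: also report the standard deviation.
print(mean_value)
```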

When writing scripts, a good principle to keep in mind is the following:

  • anything produced by the script could be trashed and recreated from the script(s).
  • no other material than the script(s) should be needed to understand the analysis.

R is a free, open-source programming/scripting language designed for data analysis and statistics (although it can do anything a programming language can do).

R has built-in functionality (base R), which is very handy and “good enough” for mathematical operations and (basic) graphical representations.
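A few lines of base R, with no additional package loaded, could look like the sketch below (the values are arbitrary):

```r
# Basic arithmetic
total <- 2 + 3 * 4        # multiplication is evaluated first
root  <- sqrt(16)

# Operations apply to whole vectors at once
x <- c(1, 2, 3, 4, 5)     # arbitrary values
y <- x^2

# Quick numerical summary (min, quartiles, mean, max)
summary(y)

# Basic scatter plot with base R graphics
plot(x, y, main = "y = x^2", xlab = "x", ylab = "y")
```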

Should more specific/complex tasks be performed (nicer plots, identification of differentially expressed (DE) genes), it is often neither needed nor recommended to code the required functions from scratch, as many of them have already been developed and made available to the user community in the form of “R packages”.

The official R packages (~20,000 to date) can be found on the Comprehensive R Archive Network (CRAN, https://cran.r-project.org/).

In addition to CRAN, the Bioconductor initiative (https://www.bioconductor.org/), which curates, gathers and organizes bioinformatics-related packages (and beyond), is worth mentioning. Since anyone can write a package and make it available (e.g. through GitLab, GitHub, Bitbucket, …), many other packages can be found, which you install at your own risk!
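As a hedged sketch, installing and loading packages typically looks like the lines below. The installation commands are commented out because they need an internet connection and only have to be run once; DESeq2 is merely an example of a Bioconductor package, and is_installed is a helper name introduced here for illustration.

```r
# Install once per machine (internet connection required), hence commented out:
# install.packages("ggplot2")        # a package from CRAN
# install.packages("BiocManager")    # helper used to install Bioconductor packages
# BiocManager::install("DESeq2")     # an example Bioconductor package

# Hypothetical helper: check whether a package is installed, without loading it
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed("stats")    # "stats" ships with R itself

# Load a package at the start of each session or script
library(stats)
```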

R scripts usually have a “.R” extension. Irrespective of the language, scripts should be opened (and modified) with dedicated software: ideally a script editor/IDE, at the very least a text editor, and never MS Word or equivalents.

1.2 RStudio, the script editor (and beyond) for R

RStudio is an IDE (integrated development environment) which is typically used to manage and execute R code.

It is free, open-source, and can be installed on macOS, Windows and Linux.

It also has multiple features that are very useful to more advanced users (project management, version control, creation of packages, books, Shiny apps, …), which are not covered here.

The RStudio interface is structured into 4 panes:

  • upper left: Source pane (script editor)

This is where scripts (and Rmd files, etc.) are opened and can be modified.

  • lower left: R Console pane

This is where all the code gets run; the commands that were run are recorded in the History (which is cumbersome to look through).

  • upper right: Environment pane (as tabs)

This is where your workspace (all the variables, objects, etc.) and history can be seen (plus more tabs if you are an R developer).

  • lower right: Files/Plots/Packages/Help pane (as tabs)

This is where you can visualise the plots you generated, check the manuals for the different functions, see what packages are installed/loaded, and access a file navigator.

Should you need to execute a simple command that you don’t want to record (e.g. a simple mathematical operation), typing it directly into the console can suffice.

Should you need to do anything you want to be able to come back to afterwards, the proper way to go is to work from a script in the upper-left pane. From there, you can run the whole script at once (using Source), or one or a few lines (using Run or cmd/ctrl + enter). Everything that is not commented out (that is, not preceded by a hash sign, #) will be executed and will appear in the console.
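Running a whole script can also be done by calling the source() function yourself. The sketch below assumes a tiny two-line script written to a temporary file, which stands in for a real .R file:

```r
# Write a tiny script to a temporary file (stand-in for a real .R file)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "# this line is a comment and is not executed",
  "result <- 6 * 7"
), script_path)

# Run the whole script at once
source(script_path)

# Objects created by the script are now available in the workspace
print(result)
```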

Script run

1.3 Practice on an existing script

The goal of the practical is to load (and execute, and comment as you wish) an existing R script that allows you to:

  • set up your work environment
  • load a dataset
  • subset it using base R functions
  • plot it using ggplot2, an R package dedicated to plotting
  • save the plot and the subsetted dataset.
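To give a rough idea of what such a workflow can look like, here is a minimal sketch. It is not the script of the practical: the iris dataset (shipped with R), the temporary folder and the file names are all assumptions chosen for illustration.

```r
# 1. Set up the work environment (a temporary folder stands in for Rpractical)
work_dir <- file.path(tempdir(), "Rpractical")
dir.create(work_dir, showWarnings = FALSE)

# 2. Load a dataset (iris ships with R; the practical's dataset may differ)
dataset <- iris

# 3. Subset it with base R: keep one species and two columns
setosa <- dataset[dataset$Species == "setosa",
                  c("Sepal.Length", "Sepal.Width")]

# 4. Plot with ggplot2 and save the plot (skipped if ggplot2 is not installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(setosa, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point()
  ggsave(file.path(work_dir, "setosa.pdf"), plot = p, width = 5, height = 4)
}

# 5. Save the subsetted dataset
csv_path <- file.path(work_dir, "setosa.csv")
write.csv(setosa, csv_path, row.names = FALSE)
```

Deleting the output folder and re-running the script recreates both files, which is the point of working from a script.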

To get started:

  • Create a folder called Rpractical (wherever you want, preferably in your Documents folder)
  • Download this script
  • Move/copy the script to the Rpractical folder
  • Open the script in RStudio