---
title: "Preliminary practice"
output: 
  learnr::tutorial:
    progressive: true
    allow_skip: true
runtime: shiny_prerendered
description: > 
    Train before the Hypothesis test module with a few exercises
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(learnr)
library(ggplot2)

url_file = "https://raw.githubusercontent.com/vguillemot/prelim/master/BirthWeight.txt"
dat1 <- read.table(url_file, header = TRUE)
# assign("dat1", dat1, env = globalenv())
```

## Instructions

This document is interactive! You don't need to start RStudio to run the commands. Instead, enter them in interactive displays such as *the one just below*. Try any command (e.g. `1+1`), press **Run code** and see what happens.

```{r test1, exercise = TRUE}

```

In this document, we give you the solution: click on the "Solution" button to reveal a (proposed) solution. In the next interactive display, compute the square of 2 ($2^2$) and put the result in an object called `x`.

```{r test2, exercise = TRUE}

```

```{r test2-solution}
x <- 2^2
```

Sometimes we also give you a hint before the solution: click on the "Hints" button to reveal the hint, then on "Next hint" to reveal the solution. In the next interactive display, compute the square root of 2 ($\sqrt{2}$) and put the result in an object called `y`.

```{r test3, exercise = TRUE}

```

```{r test3-hint}
?sqrt
```

```{r test3-solution}
y <- sqrt(2)
```

_NB: at any time, you can clear your work by clicking "Start Over" (in the left panel, below the section titles)._


## 1. Import and handle data

### Do you remember this?

We want to import the file _BirthWeight.txt_ (in the folder data) which is tab delimited:

```{r load_showed, echo = TRUE, eval = FALSE}
dat1 <- read.table("BirthWeight.txt", header = TRUE)
```


We then check whether it was correctly imported using the function _head()_ (NB: you might also want to use the function _View()_ in a more interactive fashion).

```{r head}
head(dat1)
```


__Tip 1:__ _read.table()_ can handle various field delimiters such as ";" or ",". They may be specified as follows: `read.table("my_file.txt", sep = ",")`

__Tip 2:__ if your files are genuine .csv files, you may import them using _read.csv()_ [when fields are delimited by ","] or _read.csv2()_ [when fields are delimited by ";"]

__Tip 3:__ you may also directly import .xlsx (or .xls) files by using _ad hoc_ functions such as _read.xls()_ in the package _gdata_ 
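
As a minimal sketch of Tips 1 and 2, the snippet below reads the same small table with three different calls. To stay self-contained it reads from character strings through the `text` argument instead of from files; the objects `toy_comma` and `toy_semicolon` and their values are made up for illustration and are not part of _BirthWeight.txt_.

```r
# Two made-up tables held in character strings (hypothetical data)
toy_comma     <- "id,weight\n1,3200\n2,2950"
toy_semicolon <- "id;weight\n1;3200\n2;2950"

# read.table() with an explicit field delimiter
from_table <- read.table(text = toy_comma, header = TRUE, sep = ",")

# read.csv() assumes "," and a header line; read.csv2() assumes ";"
from_csv  <- read.csv(text = toy_comma)
from_csv2 <- read.csv2(text = toy_semicolon)

# The three data frames should hold the same content
all.equal(from_table, from_csv)
all.equal(from_csv, from_csv2)
```

With real files, the `text = ...` argument would simply be replaced by the path to the file.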


### R commands


What is the class of `dat1`?

```{r class, exercise = TRUE}
dat1
```

```{r class-hint}
?class
```

```{r class-solution}
class(dat1)
```

What is the structure of `dat1`?

```{r structure, exercise = TRUE}
dat1
```

```{r structure-hint}
?str
```

```{r structure-solution}
str(dat1)
```

What are the variable names (i.e. the columns) in `dat1`?

```{r colnames, exercise = TRUE}
dat1
```

```{r colnames-hint}
?colnames
```

```{r colnames-solution}
colnames(dat1)
```

Call the variable _bw_ with the `$` syntax.

```{r calldollar, exercise = TRUE}
dat1
```

```{r calldollar-solution}
dat1$bw
```

Call the variable _bw_ with the `["NAME_OF_THE_VARIABLE"]` syntax.

```{r callsqarebrackets, exercise = TRUE}
dat1
```

```{r callsqarebrackets-solution}
dat1[, "bw"]
```

Select the values of _bw_ greater than or equal to 2000, and put the result in an object called `sel_dat1`.

```{r select, exercise = TRUE}
dat1
```

```{r select-solution}
sel_dat1 <- dat1$bw[dat1$bw >= 2000]
```

Finally, subset the data frame so that it contains only the rows where $bw \geq 2000$ and $bpd \geq 90$, exclude the 4th column (ID number), and put the result in an object called `sub_dat1`.

```{r subset, exercise = TRUE}
dat1
```

```{r subset-solution}
sub_dat1 <- dat1[dat1$bw >= 2000 & dat1$bpd >= 90, c("bw", "bpd", "ad")]
```
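
To recap the indexing syntax used in this section, here is a small sketch on a made-up data frame (the object `toy` and its values are hypothetical and unrelated to `dat1`). It shows how a logical condition selects rows, and two ways of choosing columns: by name, or by dropping a position with a negative index.

```r
# Made-up data frame (hypothetical values), used only to illustrate the syntax
toy <- data.frame(id     = 1:4,
                  height = c(150, 172, 181, 165),
                  weight = c(48, 70, 85, 59))

# A logical condition returns one TRUE/FALSE value per row
tall <- toy$height >= 170

# Rows where both conditions hold, keeping only two named columns
tall_and_heavy <- toy[tall & toy$weight >= 80, c("height", "weight")]
tall_and_heavy

# A negative index drops a column by position (here the 1st one, id)
tall_no_id <- toy[tall, -1]
tall_no_id
```

Selecting columns by name is usually more robust than selecting by position, because it keeps working if the column order changes.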


## 2. Very basic commands in statistics


### Do you remember this kind of statement?

```{r fig_quantile, echo = FALSE, eval = TRUE}
threshold = qnorm(0.75)

df_gauss <- data.frame(x = seq(-4, 4, 0.01), density = dnorm(seq(-4, 4, 0.01)))
df_75p100 <- rbind(df_gauss[df_gauss$x < threshold, ], data.frame(x = c(threshold, threshold), density = c(dnorm(threshold), 0)))

ggplot() +
  geom_line(data = df_gauss, aes(x = x, y = density)) +
  geom_polygon(data = df_75p100, aes(x = x, y = density), fill = "steelblue", alpha = 0.35) +
  geom_segment(aes(x = threshold, xend = threshold, y = 0.075, yend = 0), arrow = arrow(type = "closed", angle = 20), col = "steelblue4") +
  geom_text(aes(x = threshold, y = 0.09, label = round(threshold, 3)), col = "steelblue4") +
  geom_text(aes(x = -0.5, y = 0.17, label = "Area in blue \n = 0.75"), col = "steelblue4") +
  ggtitle("In a Gaussian distribution (mean = 0, sd = 1), \n the quantile of order 0.75 is 0.674") +
  theme_classic()

```
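
The statement in the figure can be checked numerically. The sketch below assumes only base R: `qnorm()` returns the quantile for a given area, `pnorm()` goes the other way, and a quick simulation (with an arbitrary seed and sample size) gives roughly the same proportion.

```r
# Quantile of order 0.75 of the Gaussian distribution (mean = 0, sd = 1)
q75 <- qnorm(0.75)
round(q75, 3)

# pnorm() is the reverse operation: area under the curve to the left of q75
pnorm(q75)

# Empirical check: proportion of simulated values below q75
# (seed and sample size are arbitrary choices)
set.seed(1)
simulated_share <- mean(rnorm(1e5) < q75)
simulated_share
```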


### Commands in R

Calculate the quantile of order 0.975 from a Gaussian distribution (mean = 0, standard deviation = 1).

```{r gaussian_quantile, exercise = TRUE}
0.975
```

```{r gaussian_quantile-hint}
?qnorm
```

```{r gaussian_quantile-solution}
qnorm(0.975)
```


Now, could you calculate the quantile of order 0.975 from a Gaussian distribution (mean = 3, standard deviation = 5)?

```{r gaussian_quantile2, exercise = TRUE}
0.975
```

```{r gaussian_quantile2-hint}
?qnorm
```

```{r gaussian_quantile2-solution}
qnorm(0.975, mean = 3, sd = 5)
```


Make a histogram of the distribution of _bpd_ (in dataset _dat1_).

```{r hist_basic, exercise = TRUE}
dat1
```

```{r hist_basic-hint}
?hist
```

```{r hist_basic-solution}
hist(dat1$bpd)
```

Compute the average, the median, the standard deviation as well as the 25% and 75% empirical quartiles for the distribution of bpd:

```{r basic_stats, exercise = TRUE}
dat1$bpd
```


```{r basic_stats-solution}
# average
mean(dat1$bpd)
# median
median(dat1$bpd)
# standard deviation
sd(dat1$bpd)
# 25% and 75% empirical quartiles
quantile(dat1$bpd, c(0.25, 0.75))
```
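
As a side note, the sketch below applies the same functions to a short made-up vector (`toy_values` is hypothetical), small enough to check the results by hand. It also shows `summary()`, which returns several of these statistics in a single call.

```r
# Made-up values (hypothetical), small enough to verify by hand
toy_values <- c(2, 4, 4, 5, 7, 9, 11)

toy_mean      <- mean(toy_values)                     # sum of values / number of values
toy_median    <- median(toy_values)                   # middle value once sorted
toy_sd        <- sd(toy_values)                       # uses the n - 1 denominator
toy_quartiles <- quantile(toy_values, c(0.25, 0.75))  # default interpolation (type 7)

# Minimum, quartiles, median, mean and maximum in one call
summary(toy_values)
```

Note that `summary()` does not report the standard deviation, so `sd()` is still needed for that.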

Transform the _ad_ variable, which is continuous, into two classes, small ($\leq 100$) and large ($> 100$), and create a factor object called `ad_categories`.


```{r continous_to_classes, exercise = TRUE}
dat1$ad
```

```{r continous_to_classes-hint}
?cut
```

```{r continous_to_classes-solution}
ad_categories <- cut(dat1$ad, c(-Inf, 100, Inf), labels = c("small", "large"))
```
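
One detail of `cut()` worth keeping in mind is which side of each interval is closed. The sketch below uses a made-up vector (`toy_values` is hypothetical) containing a value exactly on the break point, and compares the default behaviour with `right = FALSE`.

```r
# Made-up values (hypothetical), including one exactly on the break point
toy_values <- c(95, 100, 100.5, 120)

# Default (right = TRUE): intervals are (-Inf, 100] and (100, Inf],
# so 100 falls in the first class
closed_right <- cut(toy_values, c(-Inf, 100, Inf), labels = c("small", "large"))
closed_right

# right = FALSE: intervals are [-Inf, 100) and [100, Inf),
# so 100 falls in the second class
closed_left <- cut(toy_values, c(-Inf, 100, Inf), labels = c("small", "large"),
                   right = FALSE)
closed_left
```

The default therefore matches the classes requested here, where a value of exactly 100 counts as small.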

```{r add_to_dat1-setup}
ad_categories <- cut(dat1$ad, c(-Inf, 100, Inf), labels = c("small", "large"))
```


Add the factor `ad_categories` to the existing data frame, as a new column (also) called `ad_categories`.

```{r add_to_dat1, exercise = TRUE}
ad_categories
```

```{r add_to_dat1-hint}
"Remember the syntax df$new_column <- new_object"
```

```{r add_to_dat1-solution}
dat1$ad_categories <- factor(ad_categories)
```

Finally, make a boxplot of _bw_ as a function of the latter classes (i.e., small and large).


```{r basic_boxplot-setup}
ad_categories <- cut(dat1$ad, c(-Inf, 100, Inf), labels = c("small", "large"))
dat1$ad_categories <- ad_categories
```


```{r basic_boxplot, exercise = TRUE}
dat1$bw
```

```{r basic_boxplot-hint}
?boxplot
```

```{r basic_boxplot-solution}
boxplot(bw ~ ad_categories, data = dat1)
```

### To go further

Optional (1): you might want to make a fancier histogram (of the distribution of bpd) using the ggplot2 package (!). Tweak the code below to make it fit our dataset and objects:


```{r ggplot_histogram, exercise = TRUE}
ggplot(data = your_data_frame, aes(x = your_variable_of_interest)) + # creates a ggplot object
  geom_histogram(binwidth = 5, fill = "steelblue", col = "steelblue4") +  # add the histogram
  ggtitle("Distribution of bpd values") # add a title
```

```{r ggplot_histogram-solution}
ggplot(data = dat1, aes(x = bpd)) +
  geom_histogram(binwidth = 5, fill = "steelblue", col = "steelblue4") +
  ggtitle("Distribution of bpd values")
```


Optional (2): you might want to make a fancier boxplot (of bw as a function of the classes "small" and "large" ad) using the ggplot2 package (!):

```{r ggplot_boxplot-setup}
ad_categories <- cut(dat1$ad, c(-Inf, 100, Inf), labels = c("small", "large"))
dat1$ad_categories <- ad_categories
```

```{r ggplot_boxplot, exercise = TRUE}
ggplot(data = your_data_frame, aes(x = your_categories, y = your_variable_of_interest, fill = your_categories)) + # creates a ggplot object
  geom_boxplot(outlier.shape = NA) +  # creates boxplots
  geom_jitter(height = 0, width = 0.1) + # add dots on the top of it
  ggtitle("Distribution of Birth weight as a function of classes of abdominal diameter") + # add a title
  xlab("Abdominal diameter") + # add an x-axis label
  ylab("Weight at birth") + # add a y-axis label
  theme_classic() # use a simple background 
```

```{r ggplot_boxplot-solution}
ggplot(data = dat1, aes(x = ad_categories, y = bw, fill = ad_categories)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(height = 0, width = 0.1) +
  ggtitle("Distribution of Birth weight as a function of classes of abdominal diameter") +
  xlab("Abdominal diameter") +
  ylab("Weight at birth") +
  theme_classic()
```