---
title: "Preliminary practice"
output:
  learnr::tutorial:
    progressive: true
    allow_skip: true
runtime: shiny_prerendered
description: >
  Train before the Hypothesis test module with a few exercises
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(learnr)
library(ggplot2)
# Load the birth weight data once so that every exercise can use `dat1`
url_file <- "https://raw.githubusercontent.com/vguillemot/prelim/master/BirthWeight.txt"
dat1 <- read.table(url_file, header = TRUE)
# assign("dat1", dat1, env = globalenv())
```

## Instructions

This document is interactive! You don't need to start RStudio to run the commands; instead, you just need to enter your commands in interactive displays such as *this one just below*. Try any command (e.g. `1+1`), press **Run code** and see what happens.

```{r test1, exercise = TRUE}

```

In this document, we give you the solution: click on the "Solution" button to reveal a (suggested) solution.

In the next interactive display, compute the square of 2 ($2^2$) and put the result in an object called `x`.

```{r test2, exercise = TRUE}

```

```{r test2-solution}
x <- 2^2
```

Sometimes, we also give you a hint, and then the solution: click on the "Hints" button to reveal the hint, and on the "Next hint" button to reveal the solution.

In the next interactive display, compute the square root of 2 ($\sqrt{2}$) and put the result in an object called `y`.

```{r test3, exercise = TRUE}

```

```{r test3-hint}
?sqrt
```

```{r test3-solution}
y <- sqrt(2)
```

_NB: at any time, you can clear your work by clicking "Start Over" (in the left panel, below the section titles)._

## 1. Import and handle data

### Do you remember this?

We want to import the file _BirthWeight.txt_ (in the folder data), which is tab-delimited:

```{r load_showed, echo = TRUE, eval = FALSE}
dat1 <- read.table("BirthWeight.txt", header = TRUE)
```

We then check whether it was correctly imported using the function _head()_ (NB: you might also want to use the function _View()_ for a more interactive display).
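The same import-then-check pattern can be tried on a tiny inline table, without any file on disk. This is only a sketch: the object names `toy_txt` and `toy` and the three rows of values are made up for illustration; only the column names follow the exercises in this tutorial.

```r
# Hypothetical toy data: a header line and three tab-delimited rows
toy_txt <- "bw\tbpd\tad\n2350\t88\t92\n2450\t91\t98\n3300\t94\t110"

# `text =` makes read.table() parse a character string instead of a file
toy <- read.table(text = toy_txt, header = TRUE, sep = "\t")

head(toy, 2)   # show the first two rows
dim(toy)       # number of rows and number of columns
```

If the header or the delimiter were wrong, `dim()` and the column names shown by `head()` would not match what you expect from the file, which is why this check is worth doing right after each import.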
```{r head}
head(dat1)
```

__Tip 1:__ _read.table()_ can handle various field delimiters such as ";" or ",". They may be specified as follows:

    read.table("my_file.txt", sep = ",")

__Tip 2:__ if your files are genuine .csv files, you may import them using _read.csv()_ [when fields are delimited by ","] or _read.csv2()_ [when fields are delimited by ";"].

__Tip 3:__ you may also directly import .xlsx (or .xls) files by using _ad hoc_ functions such as _read.xls()_ in the package _gdata_.

### R commands

What is the class of `dat1`?

```{r class, exercise = TRUE}
dat1
```

```{r class-hint}
?class
```

```{r class-solution}
class(dat1)
```

What is the structure of `dat1`?

```{r structure, exercise = TRUE}
dat1
```

```{r structure-hint}
?str
```

```{r structure-solution}
str(dat1)
```

What are the variable names (i.e. the column names) in `dat1`?

```{r colnames, exercise = TRUE}
dat1
```

```{r colnames-hint}
?colnames
```

```{r colnames-solution}
colnames(dat1)
```

Access the variable _bw_ with the `$` syntax.

```{r calldollar, exercise = TRUE}
dat1
```

```{r calldollar-solution}
dat1$bw
```

Access the variable _bw_ with the `["NAME_OF_THE_VARIABLE"]` syntax.

```{r callsqarebrackets, exercise = TRUE}
dat1
```

```{r callsqarebrackets-solution}
dat1[, "bw"]
```

Select the values of _bw_ greater than or equal to 2000, and put the result in an object called `sel_dat1`.

```{r select, exercise = TRUE}
dat1
```

```{r select-solution}
sel_dat1 <- dat1$bw[dat1$bw >= 2000]
```

Finally, subset the data frame so that it contains only the rows with $bw \geq 2000$ and $bpd \geq 90$, exclude the 4th column (ID number), and put the result in an object called `sub_dat1`.

```{r subset, exercise = TRUE}
dat1
```

```{r subset-solution}
sub_dat1 <- dat1[dat1$bw >= 2000 & dat1$bpd >= 90, c("bw", "bpd", "ad")]
```

## 2. Very basic commands in statistics

### Do you remember this kind of statement?
```{r fig_quantile, echo = FALSE, eval = TRUE}
# Quantile of order 0.75 of the standard Gaussian distribution
threshold <- qnorm(0.75)
# Density curve on a fine grid
df_gauss <- data.frame(x = seq(-4, 4, 0.01), density = dnorm(seq(-4, 4, 0.01)))
# Polygon covering the area under the curve to the left of the threshold
df_75p100 <- rbind(
  df_gauss[df_gauss$x < threshold, ],
  data.frame(x = c(threshold, threshold), density = c(dnorm(threshold), 0))
)
ggplot() +
  geom_line(data = df_gauss, aes(x = x, y = density)) +
  geom_polygon(data = df_75p100, aes(x = x, y = density), fill = "steelblue", alpha = 0.35) +
  geom_segment(aes(x = threshold, xend = threshold, y = 0.075, yend = 0),
               arrow = arrow(type = "closed", angle = 20), col = "steelblue4") +
  geom_text(aes(x = threshold, y = 0.09, label = round(threshold, 3)), col = "steelblue4") +
  geom_text(aes(x = -0.5, y = 0.17, label = "Area in blue \n = 0.75"), col = "steelblue4") +
  ggtitle("In a Gaussian distribution (mean = 0, sd = 1), \n the quantile of order 0.75 is 0.674") +
  theme_classic()
```

### Commands in R

Calculate the quantile of order 0.975 of a Gaussian distribution (mean = 0, standard deviation = 1).

```{r gaussian_quantile, exercise = TRUE}
0.975
```

```{r gaussian_quantile-hint}
?qnorm
```

```{r gaussian_quantile-solution}
qnorm(0.975)
```

Now, could you calculate the quantile of order 0.975 of a Gaussian distribution with mean = 3 and standard deviation = 5?

```{r gaussian_quantile2, exercise = TRUE}
0.975
```

```{r gaussian_quantile2-hint}
?qnorm
```

```{r gaussian_quantile2-solution}
qnorm(0.975, mean = 3, sd = 5)
```

Make a histogram of the distribution of _bpd_ (in dataset _dat1_).
```{r hist_basic, exercise = TRUE}
dat1
```

```{r hist_basic-hint}
?hist
```

```{r hist_basic-solution}
hist(dat1$bpd)
```

Compute the average, the median, the standard deviation, as well as the 25% and 75% empirical quantiles (first and third quartiles) of the distribution of _bpd_:

```{r basic_stats, exercise = TRUE}
dat1$bpd
```

```{r basic_stats-solution}
# average
mean(dat1$bpd)
# median
median(dat1$bpd)
# standard deviation
sd(dat1$bpd)
# first and third quartiles
quantile(dat1$bpd, c(0.25, 0.75))
```

Transform the _ad_ variable, which is continuous, into two classes, small ($\leq 100$) and large ($> 100$), and create a factor object called `ad_categories`.

```{r continous_to_classes, exercise = TRUE}
dat1$ad
```

```{r continous_to_classes-hint}
?cut
```

```{r continous_to_classes-solution}
ad_categories <- cut(dat1$ad, c(-Inf, 100, Inf), labels = c("small", "large"))
```

```{r add_to_dat1-setup}
ad_categories <- cut(dat1$ad, c(-Inf, 100, Inf), labels = c("small", "large"))
```

Add the factor `ad_categories` to the existing data frame, as a new column (also) called `ad_categories`.

```{r add_to_dat1, exercise = TRUE}
ad_categories
```

```{r add_to_dat1-hint}
"Remember the syntax df$new_column <- new_object"
```

```{r add_to_dat1-solution}
dat1$ad_categories <- factor(ad_categories)
```

Finally, make a boxplot of _bw_ as a function of these classes (i.e., small and large).

```{r basic_boxplot-setup}
ad_categories <- cut(dat1$ad, c(-Inf, 100, Inf), labels = c("small", "large"))
dat1$ad_categories <- ad_categories
```

```{r basic_boxplot, exercise = TRUE}
dat1$bw
```

```{r basic_boxplot-hint}
?boxplot
```

```{r basic_boxplot-solution}
boxplot(bw ~ ad_categories, data = dat1)
```

### To go further

Optional (1): you might want to make a fancier histogram (of the distribution of _bpd_) using the ggplot2 package.
Tweak the code below to make it fit our dataset and objects:

```{r ggplot_histogram, exercise = TRUE}
ggplot(data = your_data_frame, aes(x = your_variable_of_interest)) + # create a ggplot object
  geom_histogram(binwidth = 5, fill = "steelblue", col = "steelblue4") + # add the histogram
  ggtitle("Distribution of bpd values") # add a title
```

```{r ggplot_histogram-solution}
ggplot(data = dat1, aes(x = bpd)) +
  geom_histogram(binwidth = 5, fill = "steelblue", col = "steelblue4") +
  ggtitle("Distribution of bpd values")
```

Optional (2): you might want to make a fancier boxplot (of _bw_ as a function of the classes "small" and "large" of _ad_) using the ggplot2 package:

```{r ggplot_boxplot-setup}
ad_categories <- cut(dat1$ad, c(-Inf, 100, Inf), labels = c("small", "large"))
dat1$ad_categories <- ad_categories
```

```{r ggplot_boxplot, exercise = TRUE}
ggplot(data = your_data_frame, aes(x = your_categories, y = your_variable_of_interest, fill = your_categories)) + # create a ggplot object
  geom_boxplot(outlier.shape = NA) + # draw the boxplots
  geom_jitter(height = 0, width = 0.1) + # add the data points on top
  ggtitle("Distribution of Birth weight as a function of classes of abdominal diameter") + # add a title
  xlab("Abdominal diameter") + # add an x-axis label
  ylab("Weight at birth") + # add a y-axis label
  theme_classic() # use a simple background
```

```{r ggplot_boxplot-solution}
ggplot(data = dat1, aes(x = ad_categories, y = bw, fill = ad_categories)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(height = 0, width = 0.1) +
  ggtitle("Distribution of Birth weight as a function of classes of abdominal diameter") +
  xlab("Abdominal diameter") +
  ylab("Weight at birth") +
  theme_classic()
```
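Optional (3): to go back to the Gaussian quantiles of section 2, the short sketch below checks two facts with base R only. First, `pnorm()` undoes `qnorm()`: the area to the left of the quantile of order 0.75 is 0.75. Second, a quantile of a Gaussian distribution with mean 3 and standard deviation 5 is a shifted and rescaled standard Gaussian quantile. The object name `q75` is only chosen for this illustration.

```r
# Quantile of order 0.75 of the standard Gaussian distribution
q75 <- qnorm(0.75)
round(q75, 3)   # 0.674, the value annotated on the figure in section 2

# pnorm() returns the area to the left of a value, so it inverts qnorm()
pnorm(q75)      # 0.75

# Location-scale property: quantile of N(3, 5) = 3 + 5 * quantile of N(0, 1)
all.equal(qnorm(0.975, mean = 3, sd = 5), 3 + 5 * qnorm(0.975))
```

The second check explains why the `mean` and `sd` arguments of `qnorm()` are optional: any Gaussian quantile can be recovered from the standard one.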