About Me Blog
A tutorial on tidy cross-validation with R Analyzing NetHack data, part 1: What kills the players Analyzing NetHack data, part 2: What players kill the most Building a shiny app to explore historical newspapers: a step-by-step guide Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 1 Classification of historical newspapers content: a tutorial combining R, bash and Vowpal Wabbit, part 2 Curly-Curly, the successor of Bang-Bang Dealing with heteroskedasticity; regression with robust standard errors using R Easy time-series prediction with R: a tutorial with air traffic data from Lux Airport Exporting editable plots from R to Powerpoint: making ggplot2 purrr with officer Fast food, causality and R packages, part 1 Fast food, causality and R packages, part 2 For posterity: install {xml2} on GNU/Linux distros Forecasting my weight with R From webscraping data to releasing it as an R package to share with the world: a full tutorial with data from NetHack Get text from pdfs or images using OCR: a tutorial with {tesseract} and {magick} Getting data from pdfs using the pdftools package Getting the data from the Luxembourguish elections out of Excel Going from a human readable Excel file to a machine-readable csv with {tidyxl} Historical newspaper scraping with {tesseract} and R How Luxembourguish residents spend their time: a small {flexdashboard} demo using the Time use survey data Imputing missing values in parallel using {furrr} Intermittent demand, Croston and Die Hard Looking into 19th century ads from a Luxembourguish newspaper with R Making sense of the METS and ALTO XML standards Manipulate dates easily with {lubridate} Manipulating strings with the {stringr} package Maps with pie charts on top of each administrative division: an example with Luxembourg's elections data Missing data imputation and instrumental variables regression: the tidy approach Modern R with the tidyverse is available on Leanpub Objects types and some useful R functions for beginners Pivoting data frames just got easier thanks to `pivot_wide()` and `pivot_long()` R or Python? Why not both? Using Anaconda Python within R with {reticulate} Searching for the optimal hyper-parameters of an ARIMA model in parallel: the tidy gridsearch approach Some fun with {gganimate} Split-apply-combine for Maximum Likelihood Estimation of a linear model Statistical matching, or when one single data source is not enough The best way to visit Luxembourguish castles is doing data science + combinatorial optimization The never-ending editor war (?) The year of the GNU+Linux desktop is upon us: using user ratings of Steam Play compatibility to play around with regex and the tidyverse Using Data Science to read 10 years of Luxembourguish newspapers from the 19th century Using a genetic algorithm for the hyperparameter optimization of a SARIMA model Using cosine similarity to find matching documents: a tutorial using Seneca's letters to his friend Lucilius Using linear models with binary dependent variables, a simulation study Using the tidyverse for more than data manipulation: estimating pi with Monte Carlo methods What hyper-parameters are, and what to do with them; an illustration with ridge regression {disk.frame} is epic {pmice}, an experimental package for missing data imputation in parallel using {mice} and {furrr} Building formulae Functional peace of mind Get basic summary statistics for all the variables in a data frame Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash Importing 30GB of data into R with sparklyr Introducing brotools It's lists all the way down It's lists all the way down, part 2: We need to go deeper Keep trying that api call with purrr::possibly() Lesser known dplyr 0.7* tricks Lesser known dplyr tricks Lesser known purrr tricks Make ggplot2 purrr Mapping a list of functions to a list of datasets with a list of columns as arguments Predicting job search by training a random forest on an unbalanced dataset Teaching the tidyverse to beginners Why I find tidyeval useful tidyr::spread() and dplyr::rename_at() in action Easy peasy STATA-like marginal effects with R Functional programming and unit testing for data munging with R available on Leanpub How to use jailbreakr My free book has a cover! Work on lists of datasets instead of individual datasets by using functional programming Method of Simulated Moments with R New website! Nonlinear Gmm with R - Example with a logistic regression Simulated Maximum Likelihood with R Bootstrapping standard errors for difference-in-differences estimation with R Careful with tryCatch Data frame columns as arguments to dplyr functions Export R output to a file I've started writing a 'book': Functional programming and unit testing for data munging with R Introduction to programming econometrics with R Merge a list of datasets together Object Oriented Programming with R: An example with a Cournot duopoly R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest? Read a lot of datasets at once with R Unit testing with R Update to Introduction to programming econometrics with R Using R as a Computer Algebra System with Ryacas

Teaching the tidyverse to beginners

End October I tweeted this:

and it generated some discussion. Some people believe that this is the right approach, and some others think that one should first present base R and then show how the tidyverse complements it. This year, I taught three classes; a 12-hour class to colleagues that work with me, a 15-hour class to master’s students and 3 hours again to some of my colleagues. Each time, I decided to focus on the tidyverse(almost) entirely, and must say that I am not disappointed with the results!

The 12 hour class was divided in two 6 hours days. It was a bit intense, especially the last 3 hours that took place Friday afternoon. The crowd was composed of some economists that had experience with STATA, some others that were mostly using Excel and finally some colleagues from the IT department that sometimes need to dig into some data themselves. Apart from 2 people, all the other never had any experience with R.

We went from 0 to being able to do the plot below after the end of the first day (so 6 hours in). Keep in mind that practically none of them even had opened RStudio before. I show the code so you can see the progress made in just a few hours:

bwages = Bwages %>%
  mutate(educ_level = case_when(educ == 1 ~ "Primary School",
                                educ == 2 ~ "High School",
                                educ == 3 ~ "Some university",
                                educ == 4 ~ "Master's degree",
                                educ == 5 ~ "Doctoral degree"))

ggplot(bwages) +
  geom_smooth(aes(exper, wage, colour = educ_level)) +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank())
## `geom_smooth()` using method = 'loess'

Of course some of them needed some help here and there, and I also gave them hints (for example I told them about case_when() and try to use it inside mutate() instead of nested ifs) but it was mostly due to lack of experience and because they hadn’t had the time to fully digest R’s syntax which was for most people involved completely new.

On the second day I showed purrr::map() and purrr::reduce() and overall it went quite well too. I even showed list-columns, and this is where I started losing some of them; I did not insist too much on it though, only wanted to show them the flexibility of data.frame objects. Some of them were quite impressed by list-columns! Then I started showing (for and while) loops and writing functions. I even showed them tidyeval and again, it went relatively well. Once they had the opportunity to play a bit around with it, I think it clicked (plus they have lots of code examples to go back too).

At the end, people seemed to have enjoyed the course, but told me that Friday was heavy; indeed it was, but I feel that it was mostly because 12 hours spread on 2 days is not the best format for this type of material, but we all had time constraints.

The 15 hour Master’s course was spread over 4 days, and covered basically the same. I just used the last 3 hours to show the students some basic functions for model estimation (linear, count, logit/probit and survival models). Again, the students were quite impressed by how easily they could get descriptive statistics by first grouping by some variables. Through their questions, I even got to show them scoped versions of dplyr verbs, such as select_if() and summarise_at(). I was expecting to lose them there, but actually most of them got these scoped versions quite fast. These students already had some experience with R though, but none with the tidyverse.

Finally the 3 hour course was perhaps the most interesting; I only had 100% total beginners. Some just knew R by name and had never heard/seen/opened RStudio (with the exception of one person)! I did not show them any loops, function definitions and no plots. I only showed them how RStudio looked and worked, what were (and how to install) packages (as well as the CRAN Task Views) and then how to import data with rio and do descriptive statistics only with dplyr. They were really interested and quite impressed by rio (“what do you mean I can use the same code for importing any dataset, in any format?”) but also by the simplicity of dplyr.

In all the courses, I did show the $ primitive to refer to columns inside a data.frame. First I showed them lists which is where I introduced $. Then it was easy to explain to them why it was the same for a column inside a data.frame; a data.frame is simply a list! This is also the distinction I made from the previous years; I simply mentioned (and showed really quickly) matrices and focused almost entirely on lists. Most participants, if not all, had learned to program statistics by thinking about linear algebra and matrices. Nothing wrong with that, but I feel that R really shines when you focus on lists and on how to work with them.

Overall as the teacher, I think that focusing on the tidyverse might be a very good strategy. I might have to do some adjustments here and there for the future courses, but my hunch is that the difficulties that some participants had were not necessarily due to the tidyverse but simply to lack of time to digest what was shown, as well as a total lack of experience with R. I do not think that these participants would have better understood a more traditional, base, matrix-oriented course.