About Me Blog
A tutorial on tidy cross-validation with R Analyzing NetHack data, part 1: What kills the players Analyzing NetHack data, part 2: What players kill the most Dealing with heteroskedasticity; regression with robust standard errors using R Easy time-series prediction with R: a tutorial with air traffic data from Lux Airport Exporting editable plots from R to Powerpoint: making ggplot2 purrr with officer Forecasting my weight with R From webscraping data to releasing it as an R package to share with the world: a full tutorial with data from NetHack Getting data from pdfs using the pdftools package Getting the data from the Luxembourguish elections out of Excel Going from a human readable Excel file to a machine-readable csv with {tidyxl} How Luxembourguish residents spend their time: a small {flexdashboard} demo using the Time use survey data Imputing missing values in parallel using {furrr} Looking into 19th century ads from a Luxembourguish newspaper with R Making sense of the METS and ALTO XML standards Manipulate dates easily with {lubridate} Maps with pie charts on top of each administrative division: an example with Luxembourg's elections data Missing data imputation and instrumental variables regression: the tidy approach Objects types and some useful R functions for beginners R or Python? Why not both? Using Anaconda Python within R with {reticulate} Searching for the optimal hyper-parameters of an ARIMA model in parallel: the tidy gridsearch approach Some fun with {gganimate} The best way to visit Luxembourguish castles is doing data science + combinatorial optimization The year of the GNU+Linux desktop is upon us: using user ratings of Steam Play compatibility to play around with regex and the tidyverse Using a genetic algorithm for the hyperparameter optimization of a SARIMA model Using the tidyverse for more than data manipulation: estimating pi with Monte Carlo methods What hyper-parameters are, and what to do with them; an illustration with ridge regression {pmice}, an experimental package for missing data imputation in parallel using {mice} and {furrr} Building formulae Functional peace of mind Get basic summary statistics for all the variables in a data frame Getting {sparklyr}, {h2o}, {rsparkling} to work together and some fun with bash Importing 30GB of data into R with sparklyr Introducing brotools It's lists all the way down It's lists all the way down, part 2: We need to go deeper Keep trying that api call with purrr::possibly() Lesser known dplyr 0.7* tricks Lesser known dplyr tricks Lesser known purrr tricks Make ggplot2 purrr Mapping a list of functions to a list of datasets with a list of columns as arguments Predicting job search by training a random forest on an unbalanced dataset Teaching the tidyverse to beginners Why I find tidyeval useful tidyr::spread() and dplyr::rename_at() in action Easy peasy STATA-like marginal effects with R Functional programming and unit testing for data munging with R available on Leanpub How to use jailbreakr My free book has a cover! Work on lists of datasets instead of individual datasets by using functional programming Method of Simulated Moments with R New website! Nonlinear Gmm with R - Example with a logistic regression Simulated Maximum Likelihood with R Bootstrapping standard errors for difference-in-differences estimation with R Careful with tryCatch Data frame columns as arguments to dplyr functions Export R output to a file I've started writing a 'book': Functional programming and unit testing for data munging with R Introduction to programming econometrics with R Merge a list of datasets together Object Oriented Programming with R: An example with a Cournot duopoly R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest? Read a lot of datasets at once with R Unit testing with R Update to Introduction to programming econometrics with R Using R as a Computer Algebra System with Ryacas

Get basic summary statistics for all the variables in a data frame

I have added a new function to my {brotools} package, called describe(), which takes a data frame as an argument, and returns another data frame with descriptive statistics. It is very much inspired by the {skmir} package but also by assist::describe() (click on the packages to be redirected to the respective Github repos) but I wanted to write my own for two reasons: first, as an exercice, and second I really only needed the function skim_to_wide() from {skimr}. So instead of installing a whole package for a single function, I decided to write my own (since I use {brotools} daily).

Below you can see it in action:

## # A tibble: 10 x 13
##    variable  type   nobs  mean    sd mode     min   max   q25 median   q75
##    <chr>     <chr> <int> <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
##  1 birth_ye… Nume…    87  87.6 155.  19         8   896  35       52  72  
##  2 height    Nume…    87 174.   34.8 172       66   264 167      180 191  
##  3 mass      Nume…    87  97.3 169.  77        15  1358  55.6     79  84.5
##  4 eye_color Char…    87  NA    NA   blue      NA    NA  NA       NA  NA  
##  5 gender    Char…    87  NA    NA   male      NA    NA  NA       NA  NA  
##  6 hair_col… Char…    87  NA    NA   blond     NA    NA  NA       NA  NA  
##  7 homeworld Char…    87  NA    NA   Tatoo…    NA    NA  NA       NA  NA  
##  8 name      Char…    87  NA    NA   Luke …    NA    NA  NA       NA  NA  
##  9 skin_col… Char…    87  NA    NA   fair      NA    NA  NA       NA  NA  
## 10 species   Char…    87  NA    NA   Human     NA    NA  NA       NA  NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>

As you can see, the object that is returned by describe() is a tibble.

For now, this function does not handle dates, but it’s in the pipeline.

You can also only describe certain columns:

brotools::describe(starwars, height, mass, name)
## # A tibble: 3 x 13
##   variable type    nobs  mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>  <int> <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 height   Numer…    87 174.   34.8 172        66   264 167      180 191  
## 2 mass     Numer…    87  97.3 169.  77         15  1358  55.6     79  84.5
## 3 name     Chara…    87  NA    NA   Luke S…    NA    NA  NA       NA  NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>

If you want to try it out, you can install {brotools} from Github:


If you found this blog post useful, you might want to follow me on twitter for blog post updates.