
Get basic summary statistics for all the variables in a data frame

I have added a new function to my {brotools} package, called describe(), which takes a data frame as an argument and returns another data frame with descriptive statistics. It is very much inspired by the {skimr} package, and also by assist::describe(), but I wanted to write my own for two reasons: first, as an exercise, and second, because the only function I really needed from {skimr} was skim_to_wide(). So instead of installing a whole package for a single function, I decided to write my own (since I use {brotools} daily).

Below you can see it in action:

library(dplyr)
data(starwars)
brotools::describe(starwars)
## # A tibble: 13 x 12
##    variable   type     mean    sd mode        min   max   q25 median   q75
##    <chr>      <chr>   <dbl> <dbl> <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl>
##  1 birth_year Numeric  87.6 155   <NA>       8.00   896  35.0   52.0  72.0
##  2 height     Numeric 174    34.8 <NA>      66.0    264 167    180   191  
##  3 mass       Numeric  97.3 169   <NA>      15.0   1358  55.6   79.0  84.5
##  4 eye_color  Charac…  NA    NA   blue      NA       NA  NA     NA    NA  
##  5 gender     Charac…  NA    NA   male      NA       NA  NA     NA    NA  
##  6 hair_color Charac…  NA    NA   blond     NA       NA  NA     NA    NA  
##  7 homeworld  Charac…  NA    NA   Tatooine  NA       NA  NA     NA    NA  
##  8 name       Charac…  NA    NA   Luke Sky… NA       NA  NA     NA    NA  
##  9 skin_color Charac…  NA    NA   fair      NA       NA  NA     NA    NA  
## 10 species    Charac…  NA    NA   Human     NA       NA  NA     NA    NA  
## 11 films      List     NA    NA   <NA>      NA       NA  NA     NA    NA  
## 12 starships  List     NA    NA   <NA>      NA       NA  NA     NA    NA  
## 13 vehicles   List     NA    NA   <NA>      NA       NA  NA     NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>

As you can see, the object that is returned by describe() is a tibble.
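
To give a rough idea of what a function like this has to compute, here is a minimal base R sketch. It is not the {brotools} implementation: the name describe_sketch() is hypothetical, it returns a plain data frame rather than a tibble, and it only covers a subset of the columns shown above (no mode, no quantiles).

```r
# Hypothetical helper, NOT the {brotools} code: builds one row of summary
# statistics per column of a data frame, using base R only.
describe_sketch <- function(df) {
  rows <- lapply(names(df), function(col) {
    x <- df[[col]]
    is_num <- is.numeric(x)
    # Numeric statistics are only computed for numeric columns, NA otherwise.
    stat <- function(f) if (is_num) f(x, na.rm = TRUE) else NA_real_
    data.frame(
      variable  = col,
      # Assumption: non-numeric columns are labelled with their first class.
      type      = if (is_num) "Numeric" else class(x)[1],
      mean      = stat(mean),
      sd        = stat(sd),
      min       = stat(min),
      max       = stat(max),
      median    = stat(median),
      n_missing = sum(is.na(x)),
      # Assumption: NA counts as one distinct value here.
      n_unique  = length(unique(x)),
      stringsAsFactors = FALSE
    )
  })
  # Stack the one-row data frames into a single result.
  do.call(rbind, rows)
}
```

Calling describe_sketch(iris) returns one row per column of the built-in iris dataset, with NA in the numeric statistics for the Species factor.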

For now, this function does not handle dates, but it’s in the pipeline.

You can also describe only certain columns:

brotools::describe(starwars, height, mass, name)
## # A tibble: 3 x 12
##   variable type      mean    sd mode          min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>       <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 height   Numeric  174    34.8 <NA>         66.0   264 167    180   191  
## 2 mass     Numeric   97.3 169   <NA>         15.0  1358  55.6   79.0  84.5
## 3 name     Charact…  NA    NA   Luke Skywa…  NA      NA  NA     NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
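
Passing bare column names like this relies on capturing the arguments before they are evaluated. The sketch below shows one way to do that in base R. The helper pick_columns() is hypothetical and is only meant to illustrate the idea; it makes no claim about how describe() handles its arguments internally.

```r
# Hypothetical helper: keeps only the columns named (unquoted) in `...`,
# or every column when no names are given.
pick_columns <- function(df, ...) {
  # Capture the unevaluated arguments and turn each one into a string.
  args <- as.list(substitute(list(...)))[-1]
  cols <- vapply(args, deparse, character(1))
  if (length(cols) == 0) return(df)
  # Fail early on names that are not columns of the data frame.
  unknown <- setdiff(cols, names(df))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "))
  }
  df[, cols, drop = FALSE]
}
```

For example, pick_columns(mtcars, mpg, cyl) returns a data frame with just those two columns, which could then be handed to any summary function.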

If you want to try it out, you can install {brotools} from GitHub:

devtools::install_github("b-rodrigues/brotools")

If you found this blog post useful, you might want to follow me on Twitter for blog post updates.