
Synthetic micro-datasets: a promising middle ground between data privacy and data analysis

Intro: the need for microdata, and the risk of disclosure

Survey and administrative data are essential for scientific research; however, accessing such datasets can be very tricky, or even impossible. In my previous job I was responsible for getting access to such “scientific micro-datasets” from institutions like Eurostat. In general, getting access to these microdata sets was only a question of filling out some forms and signing NDAs. But this was true only because my previous employer was an accredited research entity. Companies from the private sector, or unaffiliated individual researchers, cannot get access to these microdata sets. This is because the institutions that produce such datasets absolutely do not want any type of personal information to be disclosed.

For instance, with the labour force survey, a National Statistical Institute (NSI) collects information about wages, family structure, educational attainment and much more. If, say, a politician answered the survey and his answers leaked to the public, that would be disastrous for the NSI. This is why access is restricted to accredited research institutions. You may be asking yourself, “how could the politician’s answers leak? The data is anonymized!” Indeed it is, but in some cases that may not be enough to ensure that information does not get disclosed. Suppose that the dataset contains enough information to let you identify said politician with certainty: assume that this politician is a 43 year old man, has two children, holds a PhD in theology and lives in Strassen, one of Luxembourg City’s nicest neighbourhoods. It would be quite easy to find him in the dataset and then find out his wage.
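
To make the risk concrete, here is a minimal sketch of what such a re-identification would look like with {dplyr}. Everything below (the dataset, the column names, the values) is made up for illustration:

library(dplyr)

# a tiny, entirely made-up "anonymized" extract, for illustration only
anonymized_lfs <- tribble(
  ~sex, ~age, ~n_children, ~education, ~commune,   ~wage,
  "M",  43,   2,           "PhD",      "Strassen", 9500,
  "F",  37,   1,           "Master",   "Esch",     4200,
  "M",  43,   0,           "PhD",      "Strassen", 5100
)

# the quasi-identifiers single out exactly one row,
# so that row's wage is disclosed even without a name
anonymized_lfs %>%
  filter(sex == "M", age == 43, n_children == 2,
         education == "PhD", commune == "Strassen")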

To avoid this, researchers are required to perform output checking, which means going through the set of outputs (summary tables, graphs, tables with regression coefficients…) and making sure that it is not possible to identify individuals. For instance, in Luxembourg there are two companies in the tobacco industry. Luxembourg’s NSI cannot release the total turnover of the industry, because company A could subtract its own turnover from the total and find out its competitor’s turnover. Now these are all hypothetical examples, and one might argue that the risk of leakage is quite low, especially if NSIs make sure to lower the precision of the variables, for example by providing age categories instead of the exact age, or by capping wages that exceed a certain fixed amount.

In any case, for now most NSIs don’t release microdata to the public, and this poses some challenges for research. First of all, it would be great if the data was freely accessible even for accredited researchers: they could go straight to data analysis and look at the structure of the data before applying for access, instead of risking to apply for access to data that turns out to be useless. It would also be great for the public at large to be able to freely access such data, for educational purposes at the very least. Finally, it would increase competition between research institutions and the private sector when it comes to conducting studies using such data: free access to the microdata would level the playing field. Now, some NSIs do release microdata to the public, see for example Eustat, the NSI of the Basque Country, an autonomous region of Spain. It is not clear to me whether they also have more detailed data that is only accessible to researchers, but the data they offer is already quite interesting.
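
Coming back to the precision-lowering techniques mentioned above, here is a minimal sketch of how age could be coarsened into 5-year bands and wages top-coded with {dplyr}. The band width and the 10000€ threshold are arbitrary choices for illustration, not any NSI’s actual rules:

library(dplyr)

lower_precision <- function(df){
  df %>%
    mutate(
      # replace the exact age by a 5-year age band
      age_band = cut(age, breaks = seq(15, 85, by = 5), right = FALSE),
      # cap wages that exceed an arbitrary threshold
      wage = pmin(wage, 10000)
    ) %>%
    select(-age)
}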

A middle ground between only releasing data to researchers and making it completely freely accessible is to create a synthetic dataset, which does not contain any of the original records, but which still allows one to perform meaningful analyses.

I’m not yet very familiar with the details of the procedure, but in this blog post I’ll use Eustat’s microdata to generate a synthetic dataset. I’ll then perform the same analysis on both the original dataset and the synthetic dataset. The dataset I’ll be using can be found here, and is called Population with relation to activity (PRA):

The Survey on the Population in Relation to Activity operation is a continuous source of information on the characteristics and dynamics of the labour force of the Basque Country. It records the relation to productive activity of the population resident in family households, as well as the changes produced in labour situations; it produces indicators of conjunctural variations in the evolution of the active population; it also estimates the degree of participation of the population in economically non-productive activities. It offers information on the province and capital level.

I’ll then compare the results of the analyses performed on the two datasets, which will hopefully be very similar. To create the synthetic dataset, I’ll be using the {synthpop} package. You can read the detailed vignette here (pdf warning). First, let me perform some cleaning steps. There are four datasets included in the archive. Let’s load them:

library(tidyverse)
library(tidymodels)
library(readxl)
library(synthpop)

# list the csv files included in the archive
list_data <- Sys.glob("MICRO*.csv")

# read them all in (read_csv2() expects ';' as separator) and stack them
dataset <- map(list_data, read_csv2) %>%
  bind_rows()

head(dataset)

The columns are labeled in Spanish, so I copy-pasted the labels into Google Translate and pasted the translations back into my script. I saved the English names into the english_col_names.rds object for posterity. These steps are detailed in the next lines:

dictionary <- read_xlsx("Microdatos_PRA_2019/diseño_registro_microdatos_pra.xlsx", sheet="Valores",
                        col_names = FALSE)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
col_names <- dictionary %>%
  filter(!is.na(...1)) %>%
  dplyr::select(1:2)

# copy to clipboard, paste to google translate
# couldn't be bothered to use an api and google cloud or whatever
#clipr::write_clip(col_names$`...2`)

#english <- clipr::read_clip()

english <- readRDS("english_col_names.rds")

col_names$english <- english

colnames(dataset) <- col_names$english

dataset <- janitor::clean_names(dataset)

I now create a function that will perform the cleaning steps:

basic_cleaning <- function(dataset){
  dataset %>%
    # keep only the age groups of interest
    dplyr::filter(age %in% c("05", "06", "07", "08", "09", "10", "11")) %>%
    # drop rows where the dependent variable is missing
    dplyr::filter(!is.na(job_search)) %>%
    # keep the variables used in the analysis
    dplyr::select(territory, capital, sex, place_of_birth, age, nationality,
                  level_of_studies_completed, job_search, main_occupation,
                  type_of_contract, hours) %>%
    # convert every column except hours to a factor
    dplyr::mutate_at(.vars = vars(-hours), .funs = as.factor)
}

Putting on my econometrician’s hat

Let’s now suppose that I’m only interested in running a logistic regression:

pra <- basic_cleaning(dataset)

head(pra)
## # A tibble: 6 x 11
##   territory capital sex   place_of_birth age   nationality level_of_studie…
##   <fct>     <fct>   <fct> <fct>          <fct> <fct>       <fct>           
## 1 48        9       6     1              09    1           1               
## 2 48        9       1     1              09    1           2               
## 3 48        1       1     1              11    1           3               
## 4 48        1       6     1              10    1           3               
## 5 48        9       6     1              07    1           3               
## 6 48        9       1     1              09    1           1               
## # … with 4 more variables: job_search <fct>, main_occupation <fct>,
## #   type_of_contract <fct>, hours <dbl>
logit_model <- glm(job_search ~ ., data = pra, family = binomial())

# Create a tidy dataset with the results of the regression
tidy_logit_model <- tidy(logit_model, conf.int = TRUE) %>%
  mutate(dataset = "true")

Let’s now take a look at the coefficients, by plotting their values along with their confidence intervals:

ggplot(tidy_logit_model, aes(x = term, y = estimate)) +
  geom_point(colour = "#82518c") +
  geom_hline(yintercept = 0, colour = "red") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), colour = "#657b83") +
  brotools::theme_blog() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Ok, so now, how would the results change if I run the same analysis on the synthetic dataset? First, I need to generate this synthetic dataset:

my_seed <- 1234

synthetic_data <- syn(pra, seed = my_seed)
## Synthesis
## -----------
##  territory capital sex place_of_birth age nationality level_of_studies_completed job_search main_occupation type_of_contract
##  hours

The synthetic data is generated by a single call to the syn() function included in the {synthpop} package. Let’s take a look at the generated object:

synthetic_data
## Call:
## ($call) syn(data = pra, seed = my_seed)
## 
## Number of synthesised data sets: 
## ($m)  1 
## 
## First rows of synthesised data set: 
## ($syn)
##   territory capital sex place_of_birth age nationality
## 1        48       9   1              1  06           1
## 2        01       9   6              3  09           1
## 3        48       3   1              1  08           1
## 4        48       9   6              1  11           1
## 5        20       2   6              1  09           1
## 6        48       1   6              1  11           1
##   level_of_studies_completed job_search main_occupation type_of_contract hours
## 1                          3          N               2                1    40
## 2                          1          S               9                6    10
## 3                          1          N               6             <NA>    32
## 4                          2          N               4                1    32
## 5                          3          N               5             <NA>    40
## 6                          1          S               7             <NA>    NA
## ...
## 
## Synthesising methods: 
## ($method)
##                  territory                    capital 
##                   "sample"                     "cart" 
##                        sex             place_of_birth 
##                     "cart"                     "cart" 
##                        age                nationality 
##                     "cart"                     "cart" 
## level_of_studies_completed                 job_search 
##                     "cart"                     "cart" 
##            main_occupation           type_of_contract 
##                     "cart"                     "cart" 
##                      hours 
##                     "cart" 
## 
## Order of synthesis: 
## ($visit.sequence)
##                  territory                    capital 
##                          1                          2 
##                        sex             place_of_birth 
##                          3                          4 
##                        age                nationality 
##                          5                          6 
## level_of_studies_completed                 job_search 
##                          7                          8 
##            main_occupation           type_of_contract 
##                          9                         10 
##                      hours 
##                         11 
## 
## Matrix of predictors: 
## ($predictor.matrix)
##                            territory capital sex place_of_birth age nationality
## territory                          0       0   0              0   0           0
## capital                            1       0   0              0   0           0
## sex                                1       1   0              0   0           0
## place_of_birth                     1       1   1              0   0           0
## age                                1       1   1              1   0           0
## nationality                        1       1   1              1   1           0
## level_of_studies_completed         1       1   1              1   1           1
## job_search                         1       1   1              1   1           1
## main_occupation                    1       1   1              1   1           1
## type_of_contract                   1       1   1              1   1           1
## hours                              1       1   1              1   1           1
##                            level_of_studies_completed job_search
## territory                                           0          0
## capital                                             0          0
## sex                                                 0          0
## place_of_birth                                      0          0
## age                                                 0          0
## nationality                                         0          0
## level_of_studies_completed                          0          0
## job_search                                          1          0
## main_occupation                                     1          1
## type_of_contract                                    1          1
## hours                                               1          1
##                            main_occupation type_of_contract hours
## territory                                0                0     0
## capital                                  0                0     0
## sex                                      0                0     0
## place_of_birth                           0                0     0
## age                                      0                0     0
## nationality                              0                0     0
## level_of_studies_completed               0                0     0
## job_search                               0                0     0
## main_occupation                          0                0     0
## type_of_contract                         1                0     0
## hours                                    1                1     0

As you can see, synthetic_data is a list with several elements. The synthetic data itself is inside the syn element. Let’s extract it, and perform the same analysis as before:

syn_pra <- synthetic_data$syn

head(syn_pra)
##   territory capital sex place_of_birth age nationality
## 1        48       9   1              1  06           1
## 2        01       9   6              3  09           1
## 3        48       3   1              1  08           1
## 4        48       9   6              1  11           1
## 5        20       2   6              1  09           1
## 6        48       1   6              1  11           1
##   level_of_studies_completed job_search main_occupation type_of_contract hours
## 1                          3          N               2                1    40
## 2                          1          S               9                6    10
## 3                          1          N               6             <NA>    32
## 4                          2          N               4                1    32
## 5                          3          N               5             <NA>    40
## 6                          1          S               7             <NA>    NA
syn_pra <- basic_cleaning(syn_pra)

logit_model_syn <- glm(job_search ~ ., data = syn_pra, family = binomial())

tidy_logit_syn <- tidy(logit_model_syn, conf.int = TRUE) %>%
  mutate(dataset = "syn")

ggplot(tidy_logit_syn, aes(x = term, y = estimate)) +
  geom_point(colour = "#82518c") +
  geom_hline(yintercept = 0, colour = "red") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), colour = "#657b83") +
  brotools::theme_blog() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

To ease the comparison between the coefficients of the two models, let’s put them on a single graph:

coeff_models <- bind_rows(list(tidy_logit_model, tidy_logit_syn))

ggplot(coeff_models, aes(x = term, y = estimate, colour = dataset)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
  brotools::theme_blog() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

This is quite interesting: overall, there is substantial overlap between the coefficients estimated on the synthetic data and those estimated on the real data! There are some differences though; for instance, main_occupation6 is significant with the synthetic data, but not with the real data. It is also possible to generate more than one synthetic dataset, which would very likely reduce this noise.
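
Here is a sketch of what that could look like (I have not run it here): syn() has an m argument to generate several replicates in one call, and {synthpop} provides glm.synds() to fit the same model on each replicate and pool the estimates:

# generate 5 synthetic replicates instead of a single one (can be slow)
synthetic_data_m <- syn(pra, seed = my_seed, m = 5)

# fit the logistic regression on each replicate and pool the estimates
pooled_logit <- glm.synds(job_search ~ ., family = "binomial",
                          data = synthetic_data_m)

summary(pooled_logit)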

Putting on my data scientist hat

Now let’s suppose that I am only interested in prediction. For this, I am going to split the dataset into a training and a testing set, then run a logistic regression and a random forest, and assess the models’ performance with 10-fold cross-validation. I’ll do this on both the real and the synthetic data. To perform the analysis, I’ll be using the {tidymodels} framework; I won’t explain the code that follows line by line, because I’ll very likely write a blog post focusing on {tidymodels} soon.

So, let’s write a function that does exactly what I explained above:

training_and_evaluating <- function(dataset){

  # split the data into a training (80%) and a testing (20%) set
  pra_split <- initial_split(dataset, prop = 0.8)

  pra_train <- training(pra_split)
  pra_test <- testing(pra_split)

  # 10-fold cross-validation on the training set
  pra_cv_splits <- vfold_cv(pra_train, v = 10)

  # preprocessing recipe: impute missing values with k nearest neighbours
  preprocess <- recipe(job_search ~ ., data = pra_train) %>%
    step_knnimpute(all_predictors())

  # logistic regression, estimated with glm
  logit_pra <- logistic_reg() %>%
    set_engine("glm")

  # fit the logistic regression on each of the 10 folds
  fitted_logit <- fit_resamples(preprocess,
                                model = logit_pra,
                                resamples = pra_cv_splits,
                                control = control_resamples(save_pred = TRUE))

  # average the metrics (accuracy, AUC) over the folds
  metric_logit <- fitted_logit$.metrics %>%
    bind_rows() %>%
    group_by(.metric) %>%
    summarise_at(.vars = vars(.estimate), .funs = lst(mean, sd)) %>%
    mutate(model = "logit")

  # random forest, estimated with {ranger}
  rf_pra <- rand_forest(mode = "classification") %>%
    set_engine(engine = "ranger")

  fitted_forest <- fit_resamples(preprocess,
                                 model = rf_pra,
                                 resamples = pra_cv_splits,
                                 control = control_resamples(save_pred = TRUE))

  metric_forest <- fitted_forest$.metrics %>%
    bind_rows() %>%
    group_by(.metric) %>%
    summarise_at(.vars = vars(.estimate), .funs = lst(mean, sd)) %>%
    mutate(model = "forest")

  # return a single table with the metrics of both models
  bind_rows(list(metric_logit, metric_forest))
}

Now I can run this function on both the real and the synthetic data, and look at the performance of the logistic regression and of the random forest:

true_data_performance <- training_and_evaluating(pra)

syn_data_performance <- training_and_evaluating(syn_pra)
true_data_performance
## # A tibble: 4 x 4
##   .metric   mean      sd model 
##   <chr>    <dbl>   <dbl> <chr> 
## 1 accuracy 0.882 0.00816 logit 
## 2 roc_auc  0.708 0.0172  logit 
## 3 accuracy 0.907 0.00619 forest
## 4 roc_auc  0.879 0.0123  forest
syn_data_performance
## # A tibble: 4 x 4
##   .metric   mean      sd model 
##   <chr>    <dbl>   <dbl> <chr> 
## 1 accuracy 0.882 0.00758 logit 
## 2 roc_auc  0.691 0.0182  logit 
## 3 accuracy 0.899 0.00615 forest
## 4 roc_auc  0.857 0.0124  forest

The performance is pretty much the same!
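
If you want to go further than comparing models by hand, {synthpop} also provides a compare() method which, if I read the documentation correctly, plots the distribution of each synthesised variable against the observed one. A minimal call would look something like this:

# side-by-side relative frequency plots, observed versus synthetic
compare(synthetic_data, pra, vars = c("sex", "age", "job_search"))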

Generating synthetic data is a very promising approach that I will certainly be using more. I think that such an approach can also be very interesting in the private sector (not only to ease access to microdata for researchers), especially within large companies. For instance, it can happen that the data owners from, say, an insurance company are not very keen on sharing sensitive client information with their data scientists. By generating a synthetic dataset and sharing it with their data science team, the data owners avoid any chance of disclosure of sensitive information, while at the same time allowing their data scientists to develop interesting analyses or applications on the data!

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and watch my youtube channel. If you want to support my blog and channel, you could buy me an espresso or paypal.me, or buy my ebook on Leanpub.
