class: center, middle, inverse, title-slide .title[ # inexact: an RStudio addin to supervise fuzzy joins ] .author[ ###
Andrés Cruz (PUC Chile) ] .institute[ ###
arcruz@uc.cl
@arcruz0
] .date[ ###
Toronto Data Workshop on Reproducibility, 2021 ] --- class: center, middle ## Learning how to join data sets ### *(in two steps)* --- ## 1) Amazement .pull-left[ ```r data_a ``` ``` ## country var_a ## 1 Brasil a ## 2 Chile b ## 3 Perú c ## 4 Venezuela d ``` ] .pull-right[ ```r data_b1 ``` ``` ## country var_b1 ## 1 Perú C ## 2 Chile B ## 3 Venezuela D ## 4 Brasil A ``` ] -- ```r dplyr::left_join(data_a, data_b1, by = "country") ``` ``` ## country var_a var_b1 ## 1 Brasil a A ## 2 Chile b B ## 3 Perú c C ## 4 Venezuela d D ``` --- ## 2) Reality check .pull-left-narrower[ ```r data_a ``` ``` ## country var_a ## 1 Brasil a ## 2 Chile b ## 3 Perú c ## 4 Venezuela d ``` ] .pull-right-wider[ ```r data_b2 ``` ``` ## country var_b2 ## 1 Peru C ## 2 Chile B ## 3 Venezuela (Bolivarian Republic of) D ## 4 Brazil A ``` ] -- ```r dplyr::left_join(data_a, data_b2, by = "country") ``` ``` ## country var_a var_b2 ## 1 Brasil a <NA> ## 2 Chile b B ## 3 Perú c <NA> ## 4 Venezuela d <NA> ``` --- ## Non-standardization is everywhere! <img src="data:image/png;base64,#imgs/aisen.png" width="90%" /> .right[ Source: [Ley Chile (2021)](https://www.leychile.cl/). ] --- ## Solutions -- - Fire up Excel! -- - Use code to edit the values "by hand". For instance, employing `dplyr::recode()`. -- - Create an *ad hoc* solution using regular expressions. -- - Use approximate string matching! --- ## Approximate string matching (I) -- - The `stringdist` ([van der Loo, 2014](https://cran.r-project.org/web/packages/stringdist/index.html)) R package provides tools to calculate distances between strings -- + Intuitively, it makes sense that the "Brasil" string is very close to the "Brazil" string, and not-so-close to the "Peru" string. -- - We can use this idea to help us with the join! --- layout: true ## Approximate string matching (II) --- ```r stringdist::stringdistmatrix( data_a$country, data_b2$country, method = "osa", useNames = T ) ``` ``` ## Peru Chile Venezuela (Bolivarian Republic of) Brazil ## Brasil 6 5 30 1 ## Chile 5 0 32 5 ## Perú 1 5 32 6 ## Venezuela 7 8 25 7 ``` --- ```r stringdist::stringdistmatrix( data_a$country, data_b2$country, method = "osa", useNames = T ) ``` ``` ## Peru Chile Venezuela (Bolivarian Republic of) Brazil ## Brasil 6 5 30 1 *## Chile 5 0 32 5 ## Perú 1 5 32 6 ## Venezuela 7 8 25 7 ``` --- ```r stringdist::stringdistmatrix( data_a$country, data_b2$country, method = "osa", useNames = T ) ``` ``` ## Peru Chile Venezuela (Bolivarian Republic of) Brazil *## Brasil 6 5 30 1 ## Chile 5 0 32 5 *## Perú 1 5 32 6 ## Venezuela 7 8 25 7 ``` --- ```r stringdist::stringdistmatrix( data_a$country, data_b2$country, method = "osa", useNames = T ) ``` ``` ## Peru Chile Venezuela (Bolivarian Republic of) Brazil ## Brasil 6 5 30 1 ## Chile 5 0 32 5 ## Perú 1 5 32 6 *## Venezuela 7 8 25 7 ``` --- layout: false ## `fuzzyjoin` - The `fuzzyjoin` package ([Robinson et al., 2020](https://cran.r-project.org/web/packages/fuzzyjoin/index.html)) makes using approximate string matching in joins very straightforward: ```r fuzzyjoin::stringdist_left_join( data_a, data_b2, by = "country", method = "osa", distance_col = "dist", max_dist = 2 # this is the default threshold ) ``` ``` ## country.x var_a country.y var_b2 dist ## 1 Brasil a Brazil A 1 ## 2 Chile b Chile B 0 ## 3 Perú c Peru C 1 ## 4 Venezuela d <NA> <NA> NA ``` -- - But what can we do about "Venezuela" here? **Human supervision** would help a lot! --- ## `inexact` -- - `inexact` is an RStudio addin (GUI) to reproducibly supervise fuzzy joins. -- - The package does not implement any algorithms (that is [`stringdist`](https://cran.r-project.org/web/packages/stringdist/index.html)'s job), it only provides a graphical way to supervise them. -- - It serves as a "coding assistant", without ever modifying the R global environment by itself. -- - `inexact` works within a programming environment (R), unlike other fuzzy join supervising tools (the [`reconcile-csv` OpenRefine extension](https://okfnlabs.org/reconcile-csv/); the [Fuzzy Lookup Excel add-in](https://www.microsoft.com/en-us/download/details.aspx?id=15011)). --- ## Our data again .pull-left-narrower[ ```r data_a ``` ``` ## country var_a ## 1 Brasil a ## 2 Chile b ## 3 Perú c ## 4 Venezuela d ``` ] .pull-right-wider[ ```r data_b2 ``` ``` ## country var_b2 ## 1 Peru C ## 2 Chile B ## 3 Venezuela (Bolivarian Republic of) D ## 4 Brazil A ``` ] --- class: center, middle <video width="80%" height="85%" controls> <source src="data:image/png;base64,#imgs/inexact_mp4.mp4" type="video/mp4"> </video> --- ## Code output (A) ```r # You added custom matches: inexact::inexact_join( x = data_a, y = data_b2, by = 'country', method = 'osa', mode = 'left', custom_match = c( 'Venezuela' = 'Venezuela (Bolivarian Republic of)' ) ) ``` ``` ## country var_a var_b2 ## 1 Brasil a A ## 2 Chile b B ## 3 Perú c C ## 4 Venezuela d D ``` --- ## Code output (B) ```r # You added custom matches: inexact::inexact_join( x = data_a, y = data_b2, by = 'country', method = 'osa', mode = 'left', custom_match = c( 'Venezuela' = 'Venezuela (Bolivarian Republic of)' ), match_cols = T ) ``` ``` ## country var_a var_b2 .match .dist ## 1 Brasil a A Brazil 1 ## 2 Chile b B Chile 0 ## 3 Perú c C Peru 1 ## 4 Venezuela d D Venezuela (Bolivarian Republic of) 25 ## .custom_match ## 1 FALSE ## 2 FALSE ## 3 FALSE ## 4 TRUE ``` --- ## Concluding remarks -- - Graphical User Interfaces (GUIs) are convenient and time-saving. But they can also be **helpful in reproducible workflows**, when serving as **coding assistants**. -- - An inspiration for this package was `questionr` ([Barnier, Briatte & Larmar, 2020](https://juba.github.io/questionr/)), which offers a similar workflow for recoding values, reordering factors, etc. Go check it out! -- - A long to-do list for `inexact`: more testing, better documentation, better interface, increased performances, new functions (e.g., join by more than one ID variable). -- - Feedback is very welcome! --- class: center, middle, inverse # **Thank you!** <br> <h3><a href='http://github.com/arcruz0/inexact/'><i class='fa fa-link fa-fw'></i>http://github.com/arcruz0/inexact/</a></h3> <br> <h3><a href='mailto:arcruz@uc.cl'><i class='fa fa-envelope'></i> arcruz@uc.cl</a> <a href='http://twitter.com/arcruz0'><i class='fa fa-twitter fa-fw'></i>@arcruz0</a></h3>