data_a
## country var_a## 1 Brasil a## 2 Chile b## 3 Perú c## 4 Venezuela d
data_b1
## country var_b1## 1 Perú C## 2 Chile B## 3 Venezuela D## 4 Brasil A
data_a
## country var_a## 1 Brasil a## 2 Chile b## 3 Perú c## 4 Venezuela d
data_b1
## country var_b1## 1 Perú C## 2 Chile B## 3 Venezuela D## 4 Brasil A
dplyr::left_join(data_a, data_b1, by = "country")
## country var_a var_b1## 1 Brasil a A## 2 Chile b B## 3 Perú c C## 4 Venezuela d D
data_a
## country var_a## 1 Brasil a## 2 Chile b## 3 Perú c## 4 Venezuela d
data_b2
## country var_b2## 1 Peru C## 2 Chile B## 3 Venezuela (Bolivarian Republic of) D## 4 Brazil A
data_a
## country var_a## 1 Brasil a## 2 Chile b## 3 Perú c## 4 Venezuela d
data_b2
## country var_b2## 1 Peru C## 2 Chile B## 3 Venezuela (Bolivarian Republic of) D## 4 Brazil A
dplyr::left_join(data_a, data_b2, by = "country")
## country var_a var_b2## 1 Brasil a <NA>## 2 Chile b B## 3 Perú c <NA>## 4 Venezuela d <NA>
Fire up Excel!
Use code to edit the values "by hand". For instance, employing dplyr::recode()
.
Fire up Excel!
Use code to edit the values "by hand". For instance, employing dplyr::recode()
.
Create an ad hoc solution using regular expressions.
Fire up Excel!
Use code to edit the values "by hand". For instance, employing dplyr::recode()
.
Create an ad hoc solution using regular expressions.
Use approximate string matching!
stringdist
(van der Loo, 2014) R package provides tools to calculate distances between stringsThe stringdist
(van der Loo, 2014) R package provides tools to calculate distances between strings
The stringdist
(van der Loo, 2014) R package provides tools to calculate distances between strings
stringdist::stringdistmatrix( data_a$country, data_b2$country, method = "osa", useNames = T )
## Peru Chile Venezuela (Bolivarian Republic of) Brazil## Brasil 6 5 30 1## Chile 5 0 32 5## Perú 1 5 32 6## Venezuela 7 8 25 7
stringdist::stringdistmatrix( data_a$country, data_b2$country, method = "osa", useNames = T )
## Peru Chile Venezuela (Bolivarian Republic of) Brazil## Brasil 6 5 30 1## Chile 5 0 32 5## Perú 1 5 32 6## Venezuela 7 8 25 7
stringdist::stringdistmatrix( data_a$country, data_b2$country, method = "osa", useNames = T )
## Peru Chile Venezuela (Bolivarian Republic of) Brazil## Brasil 6 5 30 1## Chile 5 0 32 5## Perú 1 5 32 6## Venezuela 7 8 25 7
stringdist::stringdistmatrix( data_a$country, data_b2$country, method = "osa", useNames = T )
## Peru Chile Venezuela (Bolivarian Republic of) Brazil## Brasil 6 5 30 1## Chile 5 0 32 5## Perú 1 5 32 6## Venezuela 7 8 25 7
fuzzyjoin
fuzzyjoin
package (Robinson et al., 2020) makes using approximate string matching in joins very straightforward:fuzzyjoin::stringdist_left_join( data_a, data_b2, by = "country", method = "osa", distance_col = "dist", max_dist = 2 # this is the default threshold)
## country.x var_a country.y var_b2 dist## 1 Brasil a Brazil A 1## 2 Chile b Chile B 0## 3 Perú c Peru C 1## 4 Venezuela d <NA> <NA> NA
fuzzyjoin
fuzzyjoin
package (Robinson et al., 2020) makes using approximate string matching in joins very straightforward:fuzzyjoin::stringdist_left_join( data_a, data_b2, by = "country", method = "osa", distance_col = "dist", max_dist = 2 # this is the default threshold)
## country.x var_a country.y var_b2 dist## 1 Brasil a Brazil A 1## 2 Chile b Chile B 0## 3 Perú c Peru C 1## 4 Venezuela d <NA> <NA> NA
inexact
inexact
inexact
is an RStudio addin (GUI) to reproducibly supervise fuzzy joins.inexact
inexact
is an RStudio addin (GUI) to reproducibly supervise fuzzy joins.
The package does not implement any algorithms (that is stringdist
's job), it only provides a graphical way to supervise them.
inexact
inexact
is an RStudio addin (GUI) to reproducibly supervise fuzzy joins.
The package does not implement any algorithms (that is stringdist
's job), it only provides a graphical way to supervise them.
It serves as a "coding assistant", without ever modifying the R global environment by itself.
inexact
inexact
is an RStudio addin (GUI) to reproducibly supervise fuzzy joins.
The package does not implement any algorithms (that is stringdist
's job), it only provides a graphical way to supervise them.
It serves as a "coding assistant", without ever modifying the R global environment by itself.
inexact
works within a programming environment (R), unlike other fuzzy join supervising tools (the reconcile-csv
OpenRefine extension; the Fuzzy Lookup Excel add-in).
data_a
## country var_a## 1 Brasil a## 2 Chile b## 3 Perú c## 4 Venezuela d
data_b2
## country var_b2## 1 Peru C## 2 Chile B## 3 Venezuela (Bolivarian Republic of) D## 4 Brazil A
# You added custom matches:inexact::inexact_join( x = data_a, y = data_b2, by = 'country', method = 'osa', mode = 'left', custom_match = c( 'Venezuela' = 'Venezuela (Bolivarian Republic of)' ))
## country var_a var_b2## 1 Brasil a A## 2 Chile b B## 3 Perú c C## 4 Venezuela d D
# You added custom matches:inexact::inexact_join( x = data_a, y = data_b2, by = 'country', method = 'osa', mode = 'left', custom_match = c( 'Venezuela' = 'Venezuela (Bolivarian Republic of)' ), match_cols = T)
## country var_a var_b2 .match .dist## 1 Brasil a A Brazil 1## 2 Chile b B Chile 0## 3 Perú c C Peru 1## 4 Venezuela d D Venezuela (Bolivarian Republic of) 25## .custom_match## 1 FALSE## 2 FALSE## 3 FALSE## 4 TRUE
Graphical User Interfaces (GUIs) are convenient and time-saving. But they can also be helpful in reproducible workflows, when serving as coding assistants.
An inspiration for this package was questionr
(Barnier, Briatte & Larmar, 2020), which offers a similar workflow for recoding values, reordering factors, etc. Go check it out!
Graphical User Interfaces (GUIs) are convenient and time-saving. But they can also be helpful in reproducible workflows, when serving as coding assistants.
An inspiration for this package was questionr
(Barnier, Briatte & Larmar, 2020), which offers a similar workflow for recoding values, reordering factors, etc. Go check it out!
A long to-do list for inexact
: more testing, better documentation, better interface, increased performances, new functions (e.g., join by more than one ID variable).
Graphical User Interfaces (GUIs) are convenient and time-saving. But they can also be helpful in reproducible workflows, when serving as coding assistants.
An inspiration for this package was questionr
(Barnier, Briatte & Larmar, 2020), which offers a similar workflow for recoding values, reordering factors, etc. Go check it out!
A long to-do list for inexact
: more testing, better documentation, better interface, increased performances, new functions (e.g., join by more than one ID variable).
Feedback is very welcome!
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |