+ - 0:00:00
Notes for current slide
Notes for next slide

inexact: an RStudio addin to supervise fuzzy joins


Andrés Cruz (PUC Chile)


Toronto Data Workshop on Reproducibility, 2021

1 / 19

Learning how to join data sets

(in two steps)

2 / 19

1) Amazement

data_a
## country var_a
## 1 Brasil a
## 2 Chile b
## 3 Perú c
## 4 Venezuela d
data_b1
## country var_b1
## 1 Perú C
## 2 Chile B
## 3 Venezuela D
## 4 Brasil A
3 / 19

1) Amazement

data_a
## country var_a
## 1 Brasil a
## 2 Chile b
## 3 Perú c
## 4 Venezuela d
data_b1
## country var_b1
## 1 Perú C
## 2 Chile B
## 3 Venezuela D
## 4 Brasil A
dplyr::left_join(data_a, data_b1, by = "country")
## country var_a var_b1
## 1 Brasil a A
## 2 Chile b B
## 3 Perú c C
## 4 Venezuela d D
3 / 19

2) Reality check

data_a
## country var_a
## 1 Brasil a
## 2 Chile b
## 3 Perú c
## 4 Venezuela d
data_b2
## country var_b2
## 1 Peru C
## 2 Chile B
## 3 Venezuela (Bolivarian Republic of) D
## 4 Brazil A
4 / 19

2) Reality check

data_a
## country var_a
## 1 Brasil a
## 2 Chile b
## 3 Perú c
## 4 Venezuela d
data_b2
## country var_b2
## 1 Peru C
## 2 Chile B
## 3 Venezuela (Bolivarian Republic of) D
## 4 Brazil A
dplyr::left_join(data_a, data_b2, by = "country")
## country var_a var_b2
## 1 Brasil a <NA>
## 2 Chile b B
## 3 Perú c <NA>
## 4 Venezuela d <NA>
4 / 19

Non-standardization is everywhere!

5 / 19

Solutions

6 / 19

Solutions

  • Fire up Excel!
6 / 19

Solutions

  • Fire up Excel!

  • Use code to edit the values "by hand". For instance, employing dplyr::recode().

6 / 19

Solutions

  • Fire up Excel!

  • Use code to edit the values "by hand". For instance, employing dplyr::recode().

  • Create an ad hoc solution using regular expressions.

6 / 19

Solutions

  • Fire up Excel!

  • Use code to edit the values "by hand". For instance, employing dplyr::recode().

  • Create an ad hoc solution using regular expressions.

  • Use approximate string matching!

6 / 19

Approximate string matching (I)

7 / 19

Approximate string matching (I)

  • The stringdist (van der Loo, 2014) R package provides tools to calculate distances between strings
7 / 19

Approximate string matching (I)

  • The stringdist (van der Loo, 2014) R package provides tools to calculate distances between strings

    • Intuitively, it makes sense that the "Brasil" string is very close to the "Brazil" string, and not-so-close to the "Peru" string.
7 / 19

Approximate string matching (I)

  • The stringdist (van der Loo, 2014) R package provides tools to calculate distances between strings

    • Intuitively, it makes sense that the "Brasil" string is very close to the "Brazil" string, and not-so-close to the "Peru" string.
  • We can use this idea to help us with the join!
7 / 19

Approximate string matching (II)

stringdist::stringdistmatrix(
data_a$country, data_b2$country, method = "osa", useNames = T
)
## Peru Chile Venezuela (Bolivarian Republic of) Brazil
## Brasil 6 5 30 1
## Chile 5 0 32 5
## Perú 1 5 32 6
## Venezuela 7 8 25 7
8 / 19

Approximate string matching (II)

stringdist::stringdistmatrix(
data_a$country, data_b2$country, method = "osa", useNames = T
)
## Peru Chile Venezuela (Bolivarian Republic of) Brazil
## Brasil 6 5 30 1
## Chile 5 0 32 5
## Perú 1 5 32 6
## Venezuela 7 8 25 7
9 / 19

Approximate string matching (II)

stringdist::stringdistmatrix(
data_a$country, data_b2$country, method = "osa", useNames = T
)
## Peru Chile Venezuela (Bolivarian Republic of) Brazil
## Brasil 6 5 30 1
## Chile 5 0 32 5
## Perú 1 5 32 6
## Venezuela 7 8 25 7
10 / 19

Approximate string matching (II)

stringdist::stringdistmatrix(
data_a$country, data_b2$country, method = "osa", useNames = T
)
## Peru Chile Venezuela (Bolivarian Republic of) Brazil
## Brasil 6 5 30 1
## Chile 5 0 32 5
## Perú 1 5 32 6
## Venezuela 7 8 25 7
11 / 19

fuzzyjoin

  • The fuzzyjoin package (Robinson et al., 2020) makes using approximate string matching in joins very straightforward:
fuzzyjoin::stringdist_left_join(
data_a, data_b2, by = "country",
method = "osa", distance_col = "dist",
max_dist = 2 # this is the default threshold
)
## country.x var_a country.y var_b2 dist
## 1 Brasil a Brazil A 1
## 2 Chile b Chile B 0
## 3 Perú c Peru C 1
## 4 Venezuela d <NA> <NA> NA
12 / 19

fuzzyjoin

  • The fuzzyjoin package (Robinson et al., 2020) makes using approximate string matching in joins very straightforward:
fuzzyjoin::stringdist_left_join(
data_a, data_b2, by = "country",
method = "osa", distance_col = "dist",
max_dist = 2 # this is the default threshold
)
## country.x var_a country.y var_b2 dist
## 1 Brasil a Brazil A 1
## 2 Chile b Chile B 0
## 3 Perú c Peru C 1
## 4 Venezuela d <NA> <NA> NA
  • But what can we do about "Venezuela" here? Human supervision would help a lot!
12 / 19

inexact

13 / 19

inexact

  • inexact is an RStudio addin (GUI) to reproducibly supervise fuzzy joins.
13 / 19

inexact

  • inexact is an RStudio addin (GUI) to reproducibly supervise fuzzy joins.

  • The package does not implement any algorithms (that is stringdist's job), it only provides a graphical way to supervise them.

13 / 19

inexact

  • inexact is an RStudio addin (GUI) to reproducibly supervise fuzzy joins.

  • The package does not implement any algorithms (that is stringdist's job), it only provides a graphical way to supervise them.

  • It serves as a "coding assistant", without ever modifying the R global environment by itself.

13 / 19

inexact

  • inexact is an RStudio addin (GUI) to reproducibly supervise fuzzy joins.

  • The package does not implement any algorithms (that is stringdist's job), it only provides a graphical way to supervise them.

  • It serves as a "coding assistant", without ever modifying the R global environment by itself.

  • inexact works within a programming environment (R), unlike other fuzzy join supervising tools (the reconcile-csv OpenRefine extension; the Fuzzy Lookup Excel add-in).

13 / 19

Our data again

data_a
## country var_a
## 1 Brasil a
## 2 Chile b
## 3 Perú c
## 4 Venezuela d
data_b2
## country var_b2
## 1 Peru C
## 2 Chile B
## 3 Venezuela (Bolivarian Republic of) D
## 4 Brazil A
14 / 19
15 / 19

Code output (A)

# You added custom matches:
inexact::inexact_join(
x = data_a,
y = data_b2,
by = 'country',
method = 'osa',
mode = 'left',
custom_match = c(
'Venezuela' = 'Venezuela (Bolivarian Republic of)'
)
)
## country var_a var_b2
## 1 Brasil a A
## 2 Chile b B
## 3 Perú c C
## 4 Venezuela d D
16 / 19

Code output (B)

# You added custom matches:
inexact::inexact_join(
x = data_a,
y = data_b2,
by = 'country',
method = 'osa',
mode = 'left',
custom_match = c(
'Venezuela' = 'Venezuela (Bolivarian Republic of)'
),
match_cols = T
)
## country var_a var_b2 .match .dist
## 1 Brasil a A Brazil 1
## 2 Chile b B Chile 0
## 3 Perú c C Peru 1
## 4 Venezuela d D Venezuela (Bolivarian Republic of) 25
## .custom_match
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 TRUE
17 / 19

Concluding remarks

18 / 19

Concluding remarks

  • Graphical User Interfaces (GUIs) are convenient and time-saving. But they can also be helpful in reproducible workflows, when serving as coding assistants.
18 / 19

Concluding remarks

  • Graphical User Interfaces (GUIs) are convenient and time-saving. But they can also be helpful in reproducible workflows, when serving as coding assistants.

  • An inspiration for this package was questionr (Barnier, Briatte & Larmar, 2020), which offers a similar workflow for recoding values, reordering factors, etc. Go check it out!

18 / 19

Concluding remarks

  • Graphical User Interfaces (GUIs) are convenient and time-saving. But they can also be helpful in reproducible workflows, when serving as coding assistants.

  • An inspiration for this package was questionr (Barnier, Briatte & Larmar, 2020), which offers a similar workflow for recoding values, reordering factors, etc. Go check it out!

  • A long to-do list for inexact: more testing, better documentation, better interface, increased performances, new functions (e.g., join by more than one ID variable).

18 / 19

Concluding remarks

  • Graphical User Interfaces (GUIs) are convenient and time-saving. But they can also be helpful in reproducible workflows, when serving as coding assistants.

  • An inspiration for this package was questionr (Barnier, Briatte & Larmar, 2020), which offers a similar workflow for recoding values, reordering factors, etc. Go check it out!

  • A long to-do list for inexact: more testing, better documentation, better interface, increased performances, new functions (e.g., join by more than one ID variable).

  • Feedback is very welcome!

18 / 19

Learning how to join data sets

(in two steps)

2 / 19
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow