# inexact: an RStudio addin for supervising <em>fuzzy</em> data joins
### <br>Andrés Cruz (PUC Chile / IMFD)
### <a href='mailto:arcruz@uc.cl'><i class='fa fa-paper-plane fa-fw'></i> arcruz@uc.cl</a>  <a href='http://twitter.com/arcruz0'><i class='fa fa-twitter fa-fw'></i>@arcruz0</a>
### <br>
<h3>
LatinR 2019, September 26
</h3>

---

## Outline

1. Motivation
1. *Fuzzy* joins of datasets
1. Reproducible supervision with `inexact`
1. To do

---

## Joining datasets

```r
datos_a
```

```
##        pais var_a
## 1    Brasil     1
## 2     Chile     2
## 3      Perú     3
## 4 Venezuela     4
```

```r
datos_b1
```

```
##        pais var_b1
## 1    Brasil     11
## 2     Chile     12
## 3      Perú     13
## 4 Venezuela     14
```

---

```r
merge(datos_a, datos_b1, by = "pais", all.x = TRUE)
```

```
##        pais var_a var_b1
## 1    Brasil     1     11
## 2     Chile     2     12
## 3      Perú     3     13
## 4 Venezuela     4     14
```
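
The slides print `datos_a` and `datos_b1` without showing how they were built. A minimal sketch, assuming both are plain data frames with integer columns (the construction and the name `unidos` are assumptions; only the printed values come from the slides):

```r
# Assumed construction of the example data; only the printed values
# are taken from the slides.
datos_a <- data.frame(
  pais  = c("Brasil", "Chile", "Perú", "Venezuela"),
  var_a = 1:4,
  stringsAsFactors = FALSE
)
datos_b1 <- data.frame(
  pais   = c("Brasil", "Chile", "Perú", "Venezuela"),
  var_b1 = 11:14,
  stringsAsFactors = FALSE
)

# Every key in datos_a has an identical key in datos_b1,
# so an exact left join leaves no missing values.
unidos <- merge(datos_a, datos_b1, by = "pais", all.x = TRUE)
anyNA(unidos$var_b1)
```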

---

```r
dplyr::left_join(datos_a, datos_b1, by = "pais")
```

```
##        pais var_a var_b1
## 1    Brasil     1     11
## 2     Chile     2     12
## 3      Perú     3     13
## 4 Venezuela     4     14
```

---

## Joining datasets with *inexact* identifier columns

```r
datos_a
```

```
##        pais var_a
## 1    Brasil     1
## 2     Chile     2
## 3      Perú     3
## 4 Venezuela     4
```

```r
datos_b2
```

```
##                                 pais var_b2
## 1                             Brazil     11
## 2                              Chile     12
## 3                               Peru     13
## 4 Venezuela (Bolivarian Republic of)     14
```

---

```r
dplyr::left_join(datos_a, datos_b2, by = "pais")
```

```
##        pais var_a var_b2
## 1    Brasil     1     NA
## 2     Chile     2     12
## 3      Perú     3     NA
## 4 Venezuela     4     NA
```
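
Before reaching for a fuzzy join, it can help to list which keys failed to match exactly. A small base R sketch, assuming the example data are plain data frames (the construction and the name `sin_pareo` are illustrative):

```r
# Assumed construction of the example data shown in the slides.
datos_a <- data.frame(
  pais  = c("Brasil", "Chile", "Perú", "Venezuela"),
  var_a = 1:4,
  stringsAsFactors = FALSE
)
datos_b2 <- data.frame(
  pais   = c("Brazil", "Chile", "Peru", "Venezuela (Bolivarian Republic of)"),
  var_b2 = 11:14,
  stringsAsFactors = FALSE
)

# Keys in datos_a that have no identical key in datos_b2.
# These are the rows that end up with NA after the exact join.
sin_pareo <- setdiff(datos_a$pais, datos_b2$pais)
sin_pareo
```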

---

.right[
Source: [Biblioteca del Congreso Nacional (2019)](https://www.bcn.cl/historiapolitica/resenas_parlamentarias/wiki/Jos%C3%A9_Manuel_Edwards_Silva).
]

---

## Solutions

- Edit the data by hand in Excel/Sheets/Calc.

- Explore and edit the data in R, for example with `dplyr::recode()`.

- Write an *ad hoc* algorithm using regular expressions.

- Implement a *fuzzy* join of the data.
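
The second option, recoding by hand in R, might look like the sketch below. It uses a named vector in base R rather than `dplyr::recode()` to stay dependency-free; the lookup `equivalencias` and the data construction are assumptions made for illustration.

```r
# Assumed construction of the example data shown in the slides.
datos_b2 <- data.frame(
  pais   = c("Brazil", "Chile", "Peru", "Venezuela (Bolivarian Republic of)"),
  var_b2 = 11:14,
  stringsAsFactors = FALSE
)

# Hand-written lookup: names are the keys as they appear in datos_b2,
# values are the keys used in datos_a.
equivalencias <- c(
  "Brazil" = "Brasil",
  "Peru"   = "Perú",
  "Venezuela (Bolivarian Republic of)" = "Venezuela"
)

# Replace only the keys listed in the lookup; leave the rest untouched.
a_recodificar <- datos_b2$pais %in% names(equivalencias)
datos_b2$pais[a_recodificar] <-
  unname(equivalencias[datos_b2$pais[a_recodificar]])

datos_b2$pais
```

This is reproducible, but every mismatch has to be spotted and typed by hand, which does not scale to large identifier columns.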

---

## *Fuzzy* data joins

- With the `stringdist` package ([van der Loo, 2014](https://cran.r-project.org/web/packages/stringdist/index.html)), we can compute a matrix of distances between the two identifier variables.

- Then, for each row, we could choose the match with the smallest distance (perhaps subject to a maximum threshold).
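
The two steps above can be sketched in base R. Note the substitution: this uses `utils::adist()` (Levenshtein distance) instead of `stringdist` with OSA, so the sketch has no dependencies. The helper `pareo_mas_cercano()` and the threshold of 2 are hypothetical choices for illustration.

```r
claves_a <- c("Brasil", "Chile", "Perú", "Venezuela")
claves_b <- c("Brazil", "Chile", "Peru", "Venezuela (Bolivarian Republic of)")

# Distance matrix: rows follow claves_a, columns follow claves_b.
# utils::adist() computes Levenshtein distance (not OSA).
distancias <- utils::adist(claves_a, claves_b)

# Hypothetical helper: closest candidate per row, or NA when even the
# closest candidate is farther away than `umbral`.
pareo_mas_cercano <- function(d, candidatos, umbral) {
  idx     <- apply(d, 1, which.min)
  minimos <- apply(d, 1, min)
  ifelse(minimos <= umbral, candidatos[idx], NA_character_)
}

pareos <- pareo_mas_cercano(distancias, claves_b, umbral = 2)
data.frame(clave_a = claves_a, pareo = pareos)
```

With this threshold, "Venezuela" gets no match at all: its closest candidates are unrelated short names, and the correct long name is far away in edit distance. Cases like this are why the automatic choice needs supervision.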

---

```r
stringdist::stringdistmatrix(
  datos_a$pais, datos_b2$pais, method = "osa", useNames = TRUE
)
```

```
##           Brazil Chile Peru Venezuela (Bolivarian Republic of)
## Brasil         1     5    6                                 30
## Chile          5     0    5                                 32
## Perú           6     5    1                                 32
## Venezuela      7     8    7                                 25
```

---

```r
stringdist::stringdistmatrix(
  datos_a$pais, datos_b2$pais, method = "osa", useNames = TRUE
)
```

```
##           Brazil Chile Peru Venezuela (Bolivarian Republic of)
## Brasil         1     5    6                                 30
*## Chile          5     0    5                                 32
## Perú           6     5    1                                 32
## Venezuela      7     8    7                                 25
```

---

```r
stringdist::stringdistmatrix(
  datos_a$pais, datos_b2$pais, method = "osa", useNames = TRUE
)
```

```
##           Brazil Chile Peru Venezuela (Bolivarian Republic of)
*## Brasil         1     5    6                                 30
## Chile          5     0    5                                 32
*## Perú           6     5    1                                 32
## Venezuela      7     8    7                                 25
```

---

```r
stringdist::stringdistmatrix(
  datos_a$pais, datos_b2$pais, method = "osa", useNames = TRUE
)
```

```
##           Brazil Chile Peru Venezuela (Bolivarian Republic of)
## Brasil         1     5    6                                 30
## Chile          5     0    5                                 32
## Perú           6     5    1                                 32
*## Venezuela      7     8    7                                 25
```

---

## Supervising the algorithm

- We can use the `fuzzyjoin` package ([Robinson et al., 2019](https://cran.r-project.org/web/packages/fuzzyjoin/index.html)), adjust its results (set a threshold?), inspect them visually, and then make whatever corrections seem appropriate.

    + Cons: the process is tedious and error-prone, and it is not the package's main purpose.

- An alternative outside the R ecosystem is `OpenRefine` ([OpenRefine, 2019](http://openrefine.org/)) together with the `reconcile-csv` extension ([Bauer, 2013](http://okfnlabs.org/reconcile-csv/)), which does provide a graphical interface.

    + Cons: it sits outside a programming ecosystem, offers only one matching algorithm, makes it hard to chain several joins, and has interface issues.
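
The `fuzzyjoin` route could start from something like the sketch below. The threshold `max_dist = 2` and the name `candidatos` are assumptions chosen for illustration, and the call is guarded so it is skipped when the package is not installed.

```r
# Assumed construction of the example data shown in the slides.
datos_a <- data.frame(
  pais  = c("Brasil", "Chile", "Perú", "Venezuela"),
  var_a = 1:4,
  stringsAsFactors = FALSE
)
datos_b2 <- data.frame(
  pais   = c("Brazil", "Chile", "Peru", "Venezuela (Bolivarian Republic of)"),
  var_b2 = 11:14,
  stringsAsFactors = FALSE
)

if (requireNamespace("fuzzyjoin", quietly = TRUE)) {
  candidatos <- fuzzyjoin::stringdist_left_join(
    datos_a, datos_b2,
    by           = "pais",
    method       = "osa",
    max_dist     = 2,       # assumed threshold, for illustration only
    distance_col = ".dist"
  )

  # Rows where var_b2 is NA found no candidate within the threshold
  # and still need a manual decision.
  print(candidatos[is.na(candidatos$var_b2), ])
}
```

The remaining work (reviewing each pair and patching the failures by hand) is the tedious part that the next slides address.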

---

## `inexact`

- `inexact` is an RStudio addin (GUI) for supervising *fuzzy* data joins in a reproducible way.

- The heavy lifting is done behind the scenes by `stringdist` ([van der Loo, 2014](https://cran.r-project.org/web/packages/stringdist/index.html)) and `fuzzyjoin` ([Robinson et al., 2019](https://cran.r-project.org/web/packages/fuzzyjoin/index.html)); `inexact` handles the human supervision.

- The graphical interface only helps generate code; it does not modify anything by itself. This follows the example of other packages that implement addins, such as `questionr` ([Barnier, Briatte & Larmarange, 2018](https://CRAN.R-project.org/package=questionr)).

- A preliminary version is [available on GitHub](https://github.com/arcruz0/inexact). To install it:

```r
remotes::install_github("arcruz0/inexact")
```

---

## The example data

```r
datos_a
```

```
##        pais var_a
## 1    Brasil     1
## 2     Chile     2
## 3      Perú     3
## 4 Venezuela     4
```

```r
datos_b2
```

---

## `inexact`: initial panel

---

## `inexact`: supervision (I)

---

## `inexact`: supervision (II)

---

## `inexact`: supervision (III)

---

## `inexact`: final code

---

## `inexact`: result (I)

```r
inexact::inexact_join(
  x  = datos_a,
  y  = datos_b2,
  by = 'pais',
  method = 'osa',
  mode = 'left',
  custom_match = c(
   'Venezuela' = 'Venezuela (Bolivarian Republic of)'
  )
)
```

```
## # A tibble: 4 x 3
##   pais      var_a var_b2
##   <chr>     <int>  <int>
## 1 Brasil        1     11
## 2 Chile         2     12
## 3 Perú          3     13
## 4 Venezuela     4     14
```
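
To make the role of `custom_match` concrete, here is a base R sketch of the idea: take the nearest candidate automatically, then let manual decisions override it. This is not `inexact`'s implementation. The helper `unir_supervisado()` is hypothetical, and it uses `utils::adist()` (Levenshtein) rather than OSA.

```r
# Assumed construction of the example data shown in the slides.
datos_a <- data.frame(
  pais  = c("Brasil", "Chile", "Perú", "Venezuela"),
  var_a = 1:4,
  stringsAsFactors = FALSE
)
datos_b2 <- data.frame(
  pais   = c("Brazil", "Chile", "Peru", "Venezuela (Bolivarian Republic of)"),
  var_b2 = 11:14,
  stringsAsFactors = FALSE
)

# Hypothetical helper, not inexact's implementation.
unir_supervisado <- function(x, y, by, custom_match = character(0)) {
  # Automatic step: nearest candidate in y for every key in x.
  d     <- utils::adist(x[[by]], y[[by]])
  pareo <- y[[by]][apply(d, 1, which.min)]

  # Supervised step: manual decisions win over the automatic match.
  manual <- x[[by]] %in% names(custom_match)
  pareo[manual] <- unname(custom_match[x[[by]][manual]])

  # Attach the columns of y for the chosen match, dropping its key.
  fila_y <- match(pareo, y[[by]])
  cbind(x, y[fila_y, setdiff(names(y), by), drop = FALSE])
}

resultado <- unir_supervisado(
  datos_a, datos_b2, by = "pais",
  custom_match = c("Venezuela" = "Venezuela (Bolivarian Republic of)")
)
resultado
```

Because the manual decision is written as a named vector in the call, the correction lives in the script rather than in someone's memory of a spreadsheet edit.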

---

## `inexact`: result (II)

```r
inexact::inexact_join(
  x  = datos_a,
  y  = datos_b2,
  by = 'pais',
  method = 'osa',
  mode = 'left',
  custom_match = c(
   'Venezuela' = 'Venezuela (Bolivarian Republic of)'
  ),
  match_cols = TRUE
)
```

```
## # A tibble: 4 x 5
##   pais      var_a .pais_match                        .dist var_b2
##   <chr>     <int> <chr>                              <dbl>  <int>
## 1 Brasil        1 Brazil                                 1     11
## 2 Chile         2 Chile                                  0     12
## 3 Perú          3 Peru                                   1     13
## 4 Venezuela     4 Venezuela (Bolivarian Republic of)    -1     14
```
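
The extra columns make the join auditable. The sketch below labels each row by how it was matched, using values copied from the output above. Reading `.dist == -1` as the marker for a custom match is an inference from that output, not documented behavior, and the column `tipo` is a name introduced here.

```r
# Values copied from the output above (var_a and var_b2 omitted).
resultado <- data.frame(
  pais        = c("Brasil", "Chile", "Perú", "Venezuela"),
  .pais_match = c("Brazil", "Chile", "Peru",
                  "Venezuela (Bolivarian Republic of)"),
  .dist       = c(1, 0, 1, -1),
  stringsAsFactors = FALSE
)

# Assumption: a negative distance flags a match supplied by hand.
resultado$tipo <- ifelse(
  resultado$.dist < 0, "custom",
  ifelse(resultado$.dist == 0, "exact", "fuzzy")
)

# Rows worth a second look: everything that was not an exact match.
resultado[resultado$tipo != "exact", c("pais", ".pais_match", "tipo")]
```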

---

## To do

- Testing with more datasets.

- Better documentation, interface, and performance.

- Planned features:

    + Joins on more than one identifier variable (different possible combinations of *fuzzy* matches).

--
    + Supervision for *clustering*.

---

# **Thank you!**

<br>

<h3><a href='http://github.com/arcruz0/inexact/'><i class='fa fa-link fa-fw'></i>http://github.com/arcruz0/inexact/</a></h3>

<br>

<h3><a href='mailto:arcruz@uc.cl'><i class='fa fa-paper-plane fa-fw'></i> arcruz@uc.cl</a>&nbsp; <a href='http://twitter.com/arcruz0'><i class='fa fa-twitter fa-fw'></i>@arcruz0</a></h3>