Tables I

Column

LEFT TABLE: vab_pc

RiGHT TABLE: life_exp

Column

LEFT TABLE: vab_pc

LEFT TABLE: life_exp

INNER JOIN

Column


THE inner_join() FUNCTION (dplyr) RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE ARE MATCHING VALUES IN THE RIGHT TABLE. THEREFORE, THE inner_join() RETURNS A TABLE WHICH COMBINE BOTH TABLES BASED ON THE JOIN PREDICATE. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT AND THE RIGHT TABLE ALL COMBINATIONS OF THE MATCHES ARE RETURNED. ALTERNATIVELY WE CAN USE THE merge() FUNCTION (BASE R) INDICATING all=FALSE

Columna

Dyplyr



inner_join(vab_pc, life_exp, by= c("country", "continent"))


# A tibble: 4 x 4
  country   continent gdpPercap lifeExp
  <fct>     <fct>     <chr>     <chr>  
1 Australia Oceania   34,435.37 81.23  
2 Canada    Americas  36,319.24 80.65  
3 Japan     Asia      31,656.07 82.60  
4 Spain     Europe    28,821.06 80.94  


inner_join(life_exp, vab_pc, by= c("country", "continent"))


# A tibble: 4 x 4
  country   continent lifeExp gdpPercap
  <fct>     <fct>     <chr>   <chr>    
1 Australia Oceania   81.23   34,435.37
2 Canada    Americas  80.65   36,319.24
3 Japan     Asia      82.60   31,656.07
4 Spain     Europe    80.94   28,821.06

Columna

Base R




merge(vab_pc, life_exp, all = FALSE)


    country continent gdpPercap lifeExp
1 Australia   Oceania 34,435.37   81.23
2    Canada  Americas 36,319.24   80.65
3     Japan      Asia 31,656.07   82.60
4     Spain    Europe 28,821.06   80.94




merge(life_exp, vab_pc, all = FALSE)


    country continent lifeExp gdpPercap
1 Australia   Oceania   81.23 34,435.37
2    Canada  Americas   80.65 36,319.24
3     Japan      Asia   82.60 31,656.07
4     Spain    Europe   80.94 28,821.06

LEFT JOIN

Columna


THE left_join() FUNCTION (dplyr) RETURNS ALL RECORDS FROM THE LEFT TABLE AND THE MATCHED RECORDS FROM THE RIGHT TABLE. THE RESULT IS NA FROM THE RIGHT SIDE IF THERE IS NO MATCH. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT TABLE AND THE RIGHT TABLE ALL COMBINATIONS OF THE MATCHES ARE RETURNED. IN BASE R WE CAN USE THE FUNCTION merge() INDICATING all.x = TRUE.

Columna

Dyplyr



vab_pc %>% left_join(life_exp, by = c("country", "continent"))


# A tibble: 6 x 4
  country   continent gdpPercap lifeExp
  <fct>     <fct>     <chr>     <chr>  
1 Australia Oceania   34,435.37 81.23  
2 Canada    Americas  36,319.24 80.65  
3 India     Asia      2,452.21  <NA>   
4 Japan     Asia      31,656.07 82.60  
5 Peru      Americas  7,408.91  <NA>   
6 Spain     Europe    28,821.06 80.94  


life_exp %>% left_join(vab_pc, by = c("country", "continent"))


# A tibble: 7 x 4
  country   continent lifeExp gdpPercap
  <fct>     <fct>     <chr>   <chr>    
1 Australia Oceania   81.23   34,435.37
2 Brazil    Americas  72.39   <NA>     
3 Canada    Americas  80.65   36,319.24
4 Japan     Asia      82.60   31,656.07
5 Mongolia  Asia      66.80   <NA>     
6 Singapore Asia      79.97   <NA>     
7 Spain     Europe    80.94   28,821.06

Columna

Base R




merge(vab_pc, life_exp, all.x = TRUE)


    country continent gdpPercap lifeExp
1 Australia   Oceania 34,435.37   81.23
2    Canada  Americas 36,319.24   80.65
3     India      Asia  2,452.21    <NA>
4     Japan      Asia 31,656.07   82.60
5      Peru  Americas  7,408.91    <NA>
6     Spain    Europe 28,821.06   80.94




merge(life_exp, vab_pc, all.x = TRUE)


    country continent lifeExp gdpPercap
1 Australia   Oceania   81.23 34,435.37
2    Brazil  Americas   72.39      <NA>
3    Canada  Americas   80.65 36,319.24
4     Japan      Asia   82.60 31,656.07
5  Mongolia      Asia   66.80      <NA>
6 Singapore      Asia   79.97      <NA>
7     Spain    Europe   80.94 28,821.06

RIGHT JOIN

Columna


THE right_join() FUNCTION (dplyr) RETURNS ALL RECORDS FROM THE RIGHT TABLE AND THE MATCHED RECORDS FROM THE LEFT TABLE. THE RESULT IS NA FROM THE LEFT SIDE IF THERE IS NO MATCH. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT TABLE AND THE RIGHT TABLE, ALL COMBINATIONS OF THE MATCHES ARE RETURNED. IN BASE R WE CAN USE THE FUNCTION merge() INDICATING all.y = TRUE.

Columna

Dyplyr



vab_pc %>% right_join(life_exp, by = c("country", "continent"))


# A tibble: 7 x 4
  country   continent gdpPercap lifeExp
  <fct>     <fct>     <chr>     <chr>  
1 Australia Oceania   34,435.37 81.23  
2 Brazil    Americas  <NA>      72.39  
3 Canada    Americas  36,319.24 80.65  
4 Japan     Asia      31,656.07 82.60  
5 Mongolia  Asia      <NA>      66.80  
6 Singapore Asia      <NA>      79.97  
7 Spain     Europe    28,821.06 80.94  


life_exp %>% right_join(vab_pc, by = c("country", "continent"))


# A tibble: 6 x 4
  country   continent lifeExp gdpPercap
  <fct>     <fct>     <chr>   <chr>    
1 Australia Oceania   81.23   34,435.37
2 Canada    Americas  80.65   36,319.24
3 India     Asia      <NA>    2,452.21 
4 Japan     Asia      82.60   31,656.07
5 Peru      Americas  <NA>    7,408.91 
6 Spain     Europe    80.94   28,821.06

Columna

Base R




merge(vab_pc, life_exp, all.y = TRUE)


    country continent gdpPercap lifeExp
1 Australia   Oceania 34,435.37   81.23
2    Brazil  Americas      <NA>   72.39
3    Canada  Americas 36,319.24   80.65
4     Japan      Asia 31,656.07   82.60
5  Mongolia      Asia      <NA>   66.80
6 Singapore      Asia      <NA>   79.97
7     Spain    Europe 28,821.06   80.94




merge(life_exp, vab_pc, all.y = TRUE)


    country continent lifeExp gdpPercap
1 Australia   Oceania   81.23 34,435.37
2    Canada  Americas   80.65 36,319.24
3     India      Asia    <NA>  2,452.21
4     Japan      Asia   82.60 31,656.07
5      Peru  Americas    <NA>  7,408.91
6     Spain    Europe   80.94 28,821.06

FULL JOIN

Columna


THE full_join() FUNCTION (dplyr) RETURNS ALL ROWS AND ALL COLUMNS FROM BOTH THE LEFT AND THE RIGHT TABLE. WHERE THERE ARE NOT MATCHING VALUES RETURNS NA FOR THE ONES MISSING. IN BASE R WE CAN USE THE FUNCTION merge() INDICATING all = TRUE.


Columna

Dyplyr



vab_pc %>% full_join(life_exp, by = c("country", "continent"))


# A tibble: 9 x 4
  country   continent gdpPercap lifeExp
  <fct>     <fct>     <chr>     <chr>  
1 Australia Oceania   34,435.37 81.23  
2 Canada    Americas  36,319.24 80.65  
3 India     Asia      2,452.21  <NA>   
4 Japan     Asia      31,656.07 82.60  
5 Peru      Americas  7,408.91  <NA>   
6 Spain     Europe    28,821.06 80.94  
7 Brazil    Americas  <NA>      72.39  
8 Mongolia  Asia      <NA>      66.80  
9 Singapore Asia      <NA>      79.97  


life_exp %>% full_join(vab_pc, by = c("country", "continent"))


# A tibble: 9 x 4
  country   continent lifeExp gdpPercap
  <fct>     <fct>     <chr>   <chr>    
1 Australia Oceania   81.23   34,435.37
2 Brazil    Americas  72.39   <NA>     
3 Canada    Americas  80.65   36,319.24
4 Japan     Asia      82.60   31,656.07
5 Mongolia  Asia      66.80   <NA>     
6 Singapore Asia      79.97   <NA>     
7 Spain     Europe    80.94   28,821.06
8 India     Asia      <NA>    2,452.21 
9 Peru      Americas  <NA>    7,408.91 

Columna

Base R




merge(vab_pc, life_exp, all = TRUE)


    country continent gdpPercap lifeExp
1 Australia   Oceania 34,435.37   81.23
2    Brazil  Americas      <NA>   72.39
3    Canada  Americas 36,319.24   80.65
4     India      Asia  2,452.21    <NA>
5     Japan      Asia 31,656.07   82.60
6  Mongolia      Asia      <NA>   66.80
7      Peru  Americas  7,408.91    <NA>
8 Singapore      Asia      <NA>   79.97
9     Spain    Europe 28,821.06   80.94




merge(life_exp, vab_pc, all = TRUE)


    country continent lifeExp gdpPercap
1 Australia   Oceania   81.23 34,435.37
2    Brazil  Americas   72.39      <NA>
3    Canada  Americas   80.65 36,319.24
4     India      Asia    <NA>  2,452.21
5     Japan      Asia   82.60 31,656.07
6  Mongolia      Asia   66.80      <NA>
7      Peru  Americas    <NA>  7,408.91
8 Singapore      Asia   79.97      <NA>
9     Spain    Europe   80.94 28,821.06

BIND COLS

Columna


THE FUNCTION bind_cols() (dplyr) AND THE FUNCTION cbind() (BASE R) JOINS TWO TABLES HORIZONTALLY AND DO NOT NEED TO SPECIFY A COMMON KEY. THEY HORIZONTALLY CONCATENATES THE LEFT AND RIGHT TABLE, BUT THE TABLES MUST HAVE THE SAME NUMBER OF ROWS AND BE SORTED IN THE SAME ORDER AS THEY ARE MATCHED BY POSITION (NOTE THAT IN THIS CASE BOTH TABLES MUST HAVE THE SAME OBSERVATIONS)

 


Columna

Dplyr

bind_cols(vab_pc_b, life_exp_b)

# A tibble: 6 x 6
  country   continent gdpPercap country1  continent1 gdpPercap1
  <fct>     <fct>         <dbl> <fct>     <fct>           <dbl>
1 Australia Oceania      34435. Australia Oceania        34435.
2 Canada    Americas     36319. Canada    Americas       36319.
3 India     Asia          2452. India     Asia            2452.
4 Japan     Asia         31656. Japan     Asia           31656.
5 Peru      Americas      7409. Peru      Americas        7409.
6 Spain     Europe       28821. Spain     Europe         28821.

bind_cols(life_exp_b, vab_pc_b)

# A tibble: 6 x 6
  country   continent gdpPercap country1  continent1 gdpPercap1
  <fct>     <fct>         <dbl> <fct>     <fct>           <dbl>
1 Australia Oceania      34435. Australia Oceania        34435.
2 Canada    Americas     36319. Canada    Americas       36319.
3 India     Asia          2452. India     Asia            2452.
4 Japan     Asia         31656. Japan     Asia           31656.
5 Peru      Americas      7409. Peru      Americas        7409.
6 Spain     Europe       28821. Spain     Europe         28821.

Base R

cbind(vab_pc_b, life_exp_b)

    country continent gdpPercap   country continent gdpPercap
1 Australia   Oceania 34435.367 Australia   Oceania 34435.367
2    Canada  Americas 36319.235    Canada  Americas 36319.235
3     India      Asia  2452.210     India      Asia  2452.210
4     Japan      Asia 31656.068     Japan      Asia 31656.068
5      Peru  Americas  7408.906      Peru  Americas  7408.906
6     Spain    Europe 28821.064     Spain    Europe 28821.064

cbind(life_exp_b, vab_pc_b)

    country continent gdpPercap   country continent gdpPercap
1 Australia   Oceania 34435.367 Australia   Oceania 34435.367
2    Canada  Americas 36319.235    Canada  Americas 36319.235
3     India      Asia  2452.210     India      Asia  2452.210
4     Japan      Asia 31656.068     Japan      Asia 31656.068
5      Peru  Americas  7408.906      Peru  Americas  7408.906
6     Spain    Europe 28821.064     Spain    Europe 28821.064

SEMI JOIN

Columna


THE semi_join() FUNCTION RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE IS A CORRESPONDING MATCHING VALUE IN THE RIGHT TABLE, KEEPING JUST COLUMNS FROM THE FIRST ONE. INSTEAD OF RETURNING A TABLE COMBINING BOTH TABLES (AS IT DOES THE inner_join() FUNCTION, IT ONLY CONTAINS DATA FROM THE LEFT TABLE. THEREFORE, THE semi_join() FUNCTION WILL NEVER DUPLICATE ROWS OF THE LEFT TABLE.


Columna

Dyplyr


semi_join(vab_pc, life_exp)

# A tibble: 4 x 3
  country   continent gdpPercap
  <fct>     <fct>     <chr>    
1 Australia Oceania   34,435.37
2 Canada    Americas  36,319.24
3 Japan     Asia      31,656.07
4 Spain     Europe    28,821.06

semi_join(life_exp, vab_pc)

# A tibble: 4 x 3
  country   continent lifeExp
  <fct>     <fct>     <chr>  
1 Australia Oceania   81.23  
2 Canada    Americas  80.65  
3 Japan     Asia      82.60  
4 Spain     Europe    80.94  

semi_join(matches, by= c("country", "continent")

# A tibble: 4 x 3
  country   continent gdpPercap
  <fct>     <fct>     <chr>    
1 Australia Oceania   34,435.37
2 Canada    Americas  36,319.24
3 Japan     Asia      31,656.07
4 Spain     Europe    28,821.06

vab_pc %>% filter(country %in% life_exp$country)

# A tibble: 4 x 3
  country   continent gdpPercap
  <fct>     <fct>     <chr>    
1 Australia Oceania   34,435.37
2 Canada    Americas  36,319.24
3 Japan     Asia      31,656.07
4 Spain     Europe    28,821.06

ANTI JOIN

Columna


THE anti_join() FUNCTION RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE ARE NOT MATCHING VALUES IN THE RIGHT TABLE, KEEPING JUST COLUMNS FROM THE LEFT TABLE. THE anti_join() CAN BE VERY USEFUL AS IT ALLOWS TO DETECT THE OBSERVATIONS IN A DATASET (LEFT TABLE) THAT ARE NOT INCLUDED IN A SECOND TABLE (RIGHT TABLE)


Columna

Dyplyr


anti_join(vab_pc, life_exp)

# A tibble: 2 x 3
  country continent gdpPercap
  <fct>   <fct>     <chr>    
1 India   Asia      2,452.21 
2 Peru    Americas  7,408.91 

anti_join(life_exp, vab_pc)

# A tibble: 3 x 3
  country   continent lifeExp
  <fct>     <fct>     <chr>  
1 Brazil    Americas  72.39  
2 Mongolia  Asia      66.80  
3 Singapore Asia      79.97  

anti_join(life_exp, vab_pc, by = c("country", "continent"))

# A tibble: 2 x 3
  country continent gdpPercap
  <fct>   <fct>     <chr>    
1 India   Asia      2,452.21 
2 Peru    Americas  7,408.91 

vab_pc %>% filter(country %nin% life_exp$country)

# A tibble: 2 x 3
  country continent gdpPercap
  <fct>   <fct>     <chr>    
1 India   Asia      2,452.21 
2 Peru    Americas  7,408.91 

Tables II

Columna

LEFT TABLE: A

RiGHT TABLE: B

Column

LEFT TABLE: A

LEFT TABLE: B

INTERSECT

Columna


THE intersect() FUNCTION TAKES THE ROWS THAT APPEARS IN BOTH TABLES AND RETURNS A DATAFRAME WITH THOSE ROWS THAT ARE BOTH IN THE LEFT AND IN THE RIGHT TABLE. THIS FUNCTION IS SIMILAR TO THE inner_join() BUT IN THIS CASE BOTH DATASETS MUST HAVE THE SAME COLUMNS. BEWARE THAT THE intersect() FUNCTION REMOVES DUPLICATES.


Columna

Dyplyr


intersect(A,B)

# A tibble: 4 x 4
  country   continent gdpPercap lifeExp
  <fct>     <fct>     <chr>     <chr>  
1 Australia Oceania   34,435.37 81.23  
2 Canada    Americas  36,319.24 80.65  
3 Japan     Asia      31,656.07 82.60  
4 Spain     Europe    28,821.06 80.94  

intersect(B,A)

# A tibble: 4 x 4
  country   continent gdpPercap lifeExp
  <fct>     <fct>     <chr>     <chr>  
1 Australia Oceania   34,435.37 81.23  
2 Canada    Americas  36,319.24 80.65  
3 Japan     Asia      31,656.07 82.60  
4 Spain     Europe    28,821.06 80.94  

UNION

Columna


THE union() FUNCTION COMBINES ALL ROWS FROM THE LEFT AND THE RIGHT TABLE BUT REMOVES DUPLICATE RECORDS.


Columna

Dyplyr


union(A,B)

# A tibble: 9 x 4
  country   continent gdpPercap lifeExp
  <fct>     <fct>     <chr>     <chr>  
1 Australia Oceania   34,435.37 81.23  
2 Canada    Americas  36,319.24 80.65  
3 India     Asia      2,452.21  64.70  
4 Japan     Asia      31,656.07 82.60  
5 Peru      Americas  7,408.91  71.42  
6 Spain     Europe    28,821.06 80.94  
7 Brazil    Americas  9,065.80  72.39  
8 Mongolia  Asia      3,095.77  66.80  
9 Singapore Asia      47,143.18 79.97  

union(B,A)

# A tibble: 9 x 4
  country   continent gdpPercap lifeExp
  <fct>     <fct>     <chr>     <chr>  
1 Australia Oceania   34,435.37 81.23  
2 Brazil    Americas  9,065.80  72.39  
3 Canada    Americas  36,319.24 80.65  
4 Japan     Asia      31,656.07 82.60  
5 Mongolia  Asia      3,095.77  66.80  
6 Singapore Asia      47,143.18 79.97  
7 Spain     Europe    28,821.06 80.94  
8 India     Asia      2,452.21  64.70  
9 Peru      Americas  7,408.91  71.42  

UNION ALL | BIND ROWS

Columna


THE union_all() OR THE bind_rows() (dplyr) FUNCTIONS, OR THE FUNCTION rbind() IN BASE R, COMBINE ALL ROWS FROM THE LEFT AND THE RIGHT TABLE WITHOUT REMOVING THE DUPLICATE RECORDS.


Columna

dyplyr

union_all(A,B)

# A tibble: 13 x 4
   country   continent gdpPercap lifeExp
   <fct>     <fct>     <chr>     <chr>  
 1 Australia Oceania   34,435.37 81.23  
 2 Canada    Americas  36,319.24 80.65  
 3 India     Asia      2,452.21  64.70  
 4 Japan     Asia      31,656.07 82.60  
 5 Peru      Americas  7,408.91  71.42  
 6 Spain     Europe    28,821.06 80.94  
 7 Australia Oceania   34,435.37 81.23  
 8 Brazil    Americas  9,065.80  72.39  
 9 Canada    Americas  36,319.24 80.65  
10 Japan     Asia      31,656.07 82.60  
11 Mongolia  Asia      3,095.77  66.80  
12 Singapore Asia      47,143.18 79.97  
13 Spain     Europe    28,821.06 80.94  

union_all(B,A)

# A tibble: 13 x 4
   country   continent gdpPercap lifeExp
   <fct>     <fct>     <chr>     <chr>  
 1 Australia Oceania   34,435.37 81.23  
 2 Brazil    Americas  9,065.80  72.39  
 3 Canada    Americas  36,319.24 80.65  
 4 Japan     Asia      31,656.07 82.60  
 5 Mongolia  Asia      3,095.77  66.80  
 6 Singapore Asia      47,143.18 79.97  
 7 Spain     Europe    28,821.06 80.94  
 8 Australia Oceania   34,435.37 81.23  
 9 Canada    Americas  36,319.24 80.65  
10 India     Asia      2,452.21  64.70  
11 Japan     Asia      31,656.07 82.60  
12 Peru      Americas  7,408.91  71.42  
13 Spain     Europe    28,821.06 80.94  

bind_rows(A, B)

# A tibble: 13 x 4
   country   continent gdpPercap lifeExp
   <fct>     <fct>     <chr>     <chr>  
 1 Australia Oceania   34,435.37 81.23  
 2 Canada    Americas  36,319.24 80.65  
 3 India     Asia      2,452.21  64.70  
 4 Japan     Asia      31,656.07 82.60  
 5 Peru      Americas  7,408.91  71.42  
 6 Spain     Europe    28,821.06 80.94  
 7 Australia Oceania   34,435.37 81.23  
 8 Brazil    Americas  9,065.80  72.39  
 9 Canada    Americas  36,319.24 80.65  
10 Japan     Asia      31,656.07 82.60  
11 Mongolia  Asia      3,095.77  66.80  
12 Singapore Asia      47,143.18 79.97  
13 Spain     Europe    28,821.06 80.94  

Columna

base R

rbind(A,B)

# A tibble: 13 x 4
   country   continent gdpPercap lifeExp
   <fct>     <fct>     <chr>     <chr>  
 1 Australia Oceania   34,435.37 81.23  
 2 Canada    Americas  36,319.24 80.65  
 3 India     Asia      2,452.21  64.70  
 4 Japan     Asia      31,656.07 82.60  
 5 Peru      Americas  7,408.91  71.42  
 6 Spain     Europe    28,821.06 80.94  
 7 Australia Oceania   34,435.37 81.23  
 8 Brazil    Americas  9,065.80  72.39  
 9 Canada    Americas  36,319.24 80.65  
10 Japan     Asia      31,656.07 82.60  
11 Mongolia  Asia      3,095.77  66.80  
12 Singapore Asia      47,143.18 79.97  
13 Spain     Europe    28,821.06 80.94  

rbind(B,A)

# A tibble: 13 x 4
   country   continent gdpPercap lifeExp
   <fct>     <fct>     <chr>     <chr>  
 1 Australia Oceania   34,435.37 81.23  
 2 Brazil    Americas  9,065.80  72.39  
 3 Canada    Americas  36,319.24 80.65  
 4 Japan     Asia      31,656.07 82.60  
 5 Mongolia  Asia      3,095.77  66.80  
 6 Singapore Asia      47,143.18 79.97  
 7 Spain     Europe    28,821.06 80.94  
 8 Australia Oceania   34,435.37 81.23  
 9 Canada    Americas  36,319.24 80.65  
10 India     Asia      2,452.21  64.70  
11 Japan     Asia      31,656.07 82.60  
12 Peru      Americas  7,408.91  71.42  
13 Spain     Europe    28,821.06 80.94  

SETDIFF

Columna


THE setdiff() FUNCTION (dplyr) RETURNS THE ROWS THAT APPEAR IN THE LEFT TABLE BUT NOT IN THE RIGHT TABLE. BEWARE THAT THE setdiff() FUNCTION REMOVES DUPLICATES.


Columna

Dyplyr


setdiff(A,B)

# A tibble: 2 x 4
  country continent gdpPercap lifeExp
  <fct>   <fct>     <chr>     <chr>  
1 India   Asia      2,452.21  64.70  
2 Peru    Americas  7,408.91  71.42  

setdiff(B,A)

# A tibble: 3 x 4
  country   continent gdpPercap lifeExp
  <fct>     <fct>     <chr>     <chr>  
1 Brazil    Americas  9,065.80  72.39  
2 Mongolia  Asia      3,095.77  66.80  
3 Singapore Asia      47,143.18 79.97  

SELF JOIN

Columna


Dataset: gapminder

Columna




gapminder %>% filter(year == c(1952,2007)) %>% select(country, continent, year, lifeExp) %>% spread(key= year, lifeExp) %>% mutate(crecimiento = ((2007-1952)/1952)*100) %>% head(n=10)


# A tibble: 20 x 5
   country                continent `1952` `2007` crecimiento
   <fct>                  <fct>      <dbl>  <dbl>       <dbl>
 1 Afghanistan            Asia        28.8   43.8       52.2 
 2 Albania                Europe      55.2   76.4       38.4 
 3 Algeria                Africa      43.1   72.3       67.8 
 4 Angola                 Africa      30.0   42.7       42.4 
 5 Argentina              Americas    62.5   75.3       20.5 
 6 Australia              Oceania     69.1   81.2       17.5 
 7 Austria                Europe      66.8   79.8       19.5 
 8 Bahrain                Asia        50.9   75.6       48.5 
 9 Bangladesh             Asia        37.5   64.1       70.9 
10 Belgium                Europe      68     79.4       16.8 
11 Benin                  Africa      38.2   56.7       48.4 
12 Bolivia                Americas    40.4   65.6       62.2 
13 Bosnia and Herzegovina Europe      53.8   74.9       39.1 
14 Botswana               Africa      47.6   50.7        6.52
15 Brazil                 Americas    50.9   72.4       42.2 
16 Bulgaria               Europe      59.6   73.0       22.5 
17 Burkina Faso           Africa      32.0   52.3       63.5 
18 Burundi                Africa      39.0   49.6       27.0 
19 Cambodia               Asia        39.4   59.7       51.5 
20 Cameroon               Africa      38.5   50.4       30.9 
---
title: "Data Wrangling: Joining Data"
output: 
  flexdashboard::flex_dashboard:
    storyboard: true
    orientation: columns
    source_code: embed
    vertical_layout: fill
    theme: simplex
---

```{r setup, include=FALSE}


# PACKAGES / LIBRARIES:

library(flexdashboard)
library(tidyverse)
library(gapminder)
library(ggthemes)
library(ggthemr)
library(DT)



# MAIN DATASET:

gapminder <- gapminder::gapminder


```


Tables I
====================================================================================


Column 
-----------------------------------------------------------------------

### LEFT TABLE: `vab_pc`

```{r}

vab_pc <- gapminder %>%
  filter(year == 2007) %>%
  filter(country %in% c("Spain", "Peru", "India", "Australia", "Japan", "Canada")) %>%
  select(country, continent, gdpPercap)

vab_pc$gdpPercap <- formatC(vab_pc$gdpPercap, format = "f", digits = 2, big.mark = ",")

datatable(vab_pc, rownames = TRUE, list(
  columnDefs = list(list(className = 'dt-center', targets = 0:3))))


```


### RiGHT TABLE: `life_exp`

```{r}

life_exp <- gapminder %>%
  filter(year == 2007) %>%
  filter(country %in% c("Spain", "Australia", "Singapore", "Canada", "Japan", "Mongolia", "Brazil")) %>%
  select(country, continent, lifeExp)

life_exp$lifeExp <- formatC(life_exp$lifeExp, format = "f", digits = 2)

datatable(life_exp, rownames = TRUE, list(
  columnDefs = list(list(className = 'dt-center', targets = 0:3))))


```



Column 
-----------------------------------------------------------------------


### LEFT TABLE: `vab_pc`

```{r}

library(png)
library(grid)

vabpc <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\vab_pc.png")
grid.raster(vabpc)


```



### LEFT TABLE: `life_exp`

```{r}


lifeexp <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\life_exp.png")
grid.raster(lifeexp)


```





INNER JOIN
====================================================================================

Column 
-----------------------------------------------------------------------

\

### THE `inner_join()` FUNCTION (dplyr) RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE ARE MATCHING VALUES IN THE RIGHT TABLE. THEREFORE, THE `inner_join()` RETURNS A TABLE WHICH COMBINE BOTH TABLES BASED ON THE JOIN PREDICATE. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT AND THE RIGHT TABLE ALL COMBINATIONS OF THE MATCHES ARE RETURNED. ALTERNATIVELY WE CAN USE THE `merge()` FUNCTION (BASE R) INDICATING `all=FALSE`


```{r}

inner_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\inner_join.png")
grid.raster(inner_join)


```


Columna {data-width=200}
-----------------------------------------------------------------------

### Dyplyr {data-commentary-width=400}

*** 

\

`inner_join(vab_pc, life_exp, by= c("country", "continent"))`

\

```{r}

inner_join(vab_pc, life_exp, by= c("country", "continent"))

```

*** 


\

`inner_join(life_exp, vab_pc, by= c("country", "continent"))`

\

```{r}

inner_join(life_exp, vab_pc,  by= c("country", "continent"))

```



Columna {data-width=200}
-----------------------------------------------------------------------

### Base R {data-commentary-width=100}

*** 

\

\

`merge(vab_pc, life_exp, all = FALSE)`

\

```{r}

merge(vab_pc, life_exp, all = FALSE)

```


*** 

\

\

\

`merge(life_exp, vab_pc, all = FALSE)`

\

```{r}

merge(life_exp, vab_pc, all = FALSE)

```





LEFT JOIN
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### THE `left_join()` FUNCTION (dplyr) RETURNS ALL RECORDS FROM THE LEFT TABLE AND THE MATCHED RECORDS FROM THE RIGHT TABLE. THE RESULT IS `NA` FROM THE RIGHT SIDE IF THERE IS NO MATCH. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT TABLE AND THE RIGHT TABLE ALL COMBINATIONS OF THE MATCHES ARE RETURNED. IN BASE R WE CAN USE THE FUNCTION `merge()` INDICATING `all.x = TRUE`. 



```{r}

left_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\left_join.png")
grid.raster(left_join)

```

Columna {data-width=200}
-----------------------------------------------------------------------


### Dyplyr {data-commentary-width=400}

*** 

\

`vab_pc %>% 
    left_join(life_exp, by = c("country", "continent"))`

\

```{r}

vab_pc %>%
  left_join(life_exp, by = c("country", "continent"))

```


*** 

\

`life_exp %>% 
    left_join(vab_pc, by = c("country", "continent"))`

\

```{r}

life_exp %>%
  left_join(vab_pc, by = c("country", "continent"))

```

Columna {data-width=200}
-----------------------------------------------------------------------

### Base R {data-commentary-width=100}

*** 

\

\

`merge(vab_pc, life_exp, all.x = TRUE)`

\

```{r}

merge(vab_pc, life_exp, all.x = TRUE)

```


*** 

\

\

\

`merge(life_exp, vab_pc, all.x = TRUE)`

\

```{r}

merge(life_exp, vab_pc, all.x = TRUE)

```




RIGHT JOIN
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### THE `right_join()` FUNCTION (dplyr) RETURNS ALL RECORDS FROM THE RIGHT TABLE AND THE MATCHED RECORDS FROM THE LEFT TABLE. THE RESULT IS `NA` FROM THE LEFT SIDE IF THERE IS NO MATCH. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT TABLE AND THE RIGHT TABLE, ALL COMBINATIONS OF THE MATCHES ARE RETURNED. IN BASE R WE CAN USE THE FUNCTION `merge()` INDICATING `all.y = TRUE`. 


```{r}


right_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\right_join.png")
grid.raster(right_join)


```



Columna {data-width=200}
-----------------------------------------------------------------------


### Dyplyr {data-commentary-width=400}

*** 

\

`vab_pc %>% 
    right_join(life_exp, by = c("country", "continent"))`

\

```{r}

vab_pc %>%
  right_join(life_exp, by = c("country", "continent"))

```


*** 

\

`life_exp %>% 
    right_join(vab_pc, by = c("country", "continent"))`

\

```{r}

life_exp %>%
  right_join(vab_pc, by = c("country", "continent"))

```

Columna {data-width=200}
-----------------------------------------------------------------------

### Base R {data-commentary-width=100}

*** 

\

\

`merge(vab_pc, life_exp, all.y = TRUE)`

\

```{r}

merge(vab_pc, life_exp, all.y = TRUE)

```


*** 

\

\

\

`merge(life_exp, vab_pc, all.y = TRUE)`

\

```{r}

merge(life_exp, vab_pc, all.y = TRUE)

```



FULL JOIN
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### THE `full_join()` FUNCTION (dplyr) RETURNS ALL ROWS AND ALL COLUMNS FROM BOTH THE LEFT AND THE RIGHT TABLE. WHERE THERE ARE NOT MATCHING VALUES RETURNS `NA` FOR THE ONES MISSING. IN BASE R WE CAN USE THE FUNCTION `merge()` INDICATING `all = TRUE`. 

\

```{r}


full_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\full_join.png")
grid.raster(full_join)


```



Columna {data-width=200}
-----------------------------------------------------------------------


### Dyplyr {data-commentary-width=400}

*** 

\

`vab_pc %>% 
    full_join(life_exp, by = c("country", "continent"))`

\

```{r}

vab_pc %>%
  full_join(life_exp, by = c("country", "continent"))

```


*** 

\

`life_exp %>% 
    full_join(vab_pc, by = c("country", "continent"))`

\

```{r}

life_exp %>%
  full_join(vab_pc, by = c("country", "continent"))

```


Columna {data-width=200}
-----------------------------------------------------------------------

### Base R {data-commentary-width=100}

*** 

\

\

`merge(vab_pc, life_exp, all = TRUE)`

\

```{r}

merge(vab_pc, life_exp, all = TRUE)

```


*** 

\

\

\

`merge(life_exp, vab_pc, all = TRUE)`

\

```{r}

merge(life_exp, vab_pc, all = TRUE)

```




BIND COLS
====================================================================================

Columna {data-width=700}
-----------------------------------------------------------------

\

### THE FUNCTION `bind_cols()` (dplyr) AND THE FUNCTION `cbind()` (BASE R) JOINS TWO TABLES HORIZONTALLY AND DO NOT NEED TO SPECIFY A COMMON KEY. THEY HORIZONTALLY CONCATENATES THE LEFT AND RIGHT TABLE, BUT THE TABLES MUST HAVE THE SAME NUMBER OF ROWS AND BE SORTED IN THE SAME ORDER AS THEY ARE MATCHED BY POSITION (**NOTE THAT IN THIS CASE BOTH TABLES MUST HAVE THE SAME OBSERVATIONS**)

\  

\

```{r}


cbind_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\cbind_join.png")
grid.raster(cbind_join)


```



Columna {data-width=300}
-----------------------------------------------------------------------


### Dplyr


`bind_cols(vab_pc_b, life_exp_b)`

```{r}

vab_pc_b <- gapminder %>%
  filter(year == 2007) %>%
  filter(country %in% c("Spain", "Peru", "India", "Australia", "Japan", "Canada")) %>%
  select(country, continent, gdpPercap)

life_exp_b <- gapminder %>%
  filter(year == 2007) %>%
  filter(country %in% c("Spain", "Peru", "India", "Australia", "Japan", "Canada")) %>%
  select(country, continent, gdpPercap)

bind_cols(vab_pc_b, life_exp_b)

```


`bind_cols(life_exp_b, vab_pc_b)`


```{r}

bind_cols(list(life_exp_b, vab_pc_b))

```

### Base R 

`cbind(vab_pc_b, life_exp_b)`


```{r}

cbind(vab_pc_b, life_exp_b)

```


`cbind(life_exp_b, vab_pc_b)`


```{r}

cbind(life_exp_b, vab_pc_b)

```



SEMI JOIN
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### THE `semi_join()` FUNCTION RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE IS A CORRESPONDING MATCHING VALUE IN THE RIGHT TABLE, KEEPING JUST COLUMNS FROM THE FIRST ONE. INSTEAD OF RETURNING A TABLE COMBINING BOTH TABLES (AS IT DOES THE `inner_join()` FUNCTION, IT ONLY CONTAINS DATA FROM THE LEFT TABLE. THEREFORE, THE `semi_join()` FUNCTION WILL NEVER DUPLICATE ROWS OF THE LEFT TABLE.  

\

```{r}


semi_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\semi_join.png")
grid.raster(semi_join)


```





Columna {data-width=400}
-----------------------------------------------------------------------


### Dyplyr 

*** 

`semi_join(vab_pc, life_exp)`


```{r}

semi_join(vab_pc, life_exp)

```



*** 

`semi_join(life_exp, vab_pc)`


```{r}

semi_join(life_exp, vab_pc)

```


*** 

`semi_join(matches, by= c("country", "continent")`

```{r}

vab_pc %>%
  semi_join(life_exp, by= c("country", "continent"))

```


*** 

`vab_pc %>%
  filter(country %in% life_exp$country)`

```{r}

vab_pc %>%
  filter(country %in% life_exp$country)

```



ANTI JOIN
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### THE `anti_join()` FUNCTION RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE ARE NOT MATCHING VALUES IN THE RIGHT TABLE, KEEPING JUST COLUMNS FROM THE LEFT TABLE. THE `anti_join()` CAN BE VERY USEFUL AS IT ALLOWS TO DETECT THE OBSERVATIONS IN A DATASET (LEFT TABLE) THAT ARE NOT INCLUDED IN A SECOND TABLE (RIGHT TABLE)

\

```{r}


anti_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\anti_join.png")
grid.raster(anti_join)


```





Columna {data-width=400}
-----------------------------------------------------------------------


### Dyplyr {data-commentary-width=400}

*** 


`anti_join(vab_pc, life_exp)`


```{r}

anti_join(vab_pc, life_exp)

```



*** 


`anti_join(life_exp, vab_pc)`


```{r}

anti_join(life_exp, vab_pc)

```



*** 


`anti_join(life_exp, vab_pc, by = c("country", "continent"))`


```{r}

anti_join(vab_pc, life_exp, by = c("country", "continent"))

```


*** 


`vab_pc %>%
  filter(country %nin% life_exp$country)`


```{r}

library(Hmisc)

vab_pc %>%
  filter(country %nin% life_exp$country)

```



Tables II
====================================================================================


Columna 
-----------------------------------------------------------------------

### LEFT TABLE: `A`

```{r}

gapminder <-gapminder::gapminder

A <- gapminder %>%
  filter(year == 2007) %>%
  filter(country %in% c("Spain", "Peru", "India", "Australia", "Japan", "Canada")) %>%
  select(country, continent, gdpPercap, lifeExp)

A$gdpPercap <- formatC(A$gdpPercap, format = "f", digits = 2, big.mark = ",")
A$lifeExp <- formatC(A$lifeExp, format = "f", digits = 2)

datatable(A, rownames = TRUE, list(
  columnDefs = list(list(className = 'dt-center', targets = 0:4))))

```


### RiGHT TABLE: `B`

```{r}

B <- gapminder %>%
  filter(year == 2007) %>%
  filter(country %in% c("Spain", "Australia", "Singapore", "Canada", "Japan", "Mongolia", "Brazil")) %>%
  select(country, continent, gdpPercap, lifeExp)

B$gdpPercap <- formatC(B$gdpPercap, format = "f", digits = 2, big.mark = ",")
B$lifeExp <- formatC(B$lifeExp, format = "f", digits = 2)


datatable(B, rownames = TRUE, list(
  columnDefs = list(list(className = 'dt-center', targets = 0:4))))

```



Column 
-----------------------------------------------------------------------


### LEFT TABLE: `A`

```{r}


A2 <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\A.png")
grid.raster(A2)


```



### LEFT TABLE: `B`

```{r}


B2 <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\B.png")
grid.raster(B2)


```



INTERSECT
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### THE `intersect()` FUNCTION TAKES THE ROWS THAT APPEARS IN BOTH  TABLES AND RETURNS A DATAFRAME WITH THOSE ROWS THAT ARE BOTH IN THE LEFT AND IN THE RIGHT TABLE. THIS FUNCTION IS SIMILAR TO THE `inner_join()` BUT IN THIS CASE **BOTH DATASETS MUST HAVE THE SAME COLUMNS**. BEWARE THAT THE `intersect()` FUNCTION REMOVES DUPLICATES. 

\

```{r}


intersect_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\intersect_join.png")
grid.raster(intersect_join)


```





Columna {data-width=400}
-----------------------------------------------------------------------


### Dyplyr {data-commentary-width=400}

*** 


`intersect(A,B)`


```{r}

intersect(A,B)

```


*** 


`intersect(B,A)`


```{r}

intersect(B,A)

```





UNION
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### THE `union()` FUNCTION COMBINES ALL ROWS FROM THE LEFT AND THE RIGHT TABLE BUT **REMOVES DUPLICATE RECORDS**.  

\

```{r}


union_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\union_join.png")
grid.raster(union_join)


```





Columna {data-width=400}
-----------------------------------------------------------------------


### Dyplyr {data-commentary-width=400}

*** 


`union(A,B)`


```{r}

union(A,B)

```


*** 


`union(B,A)`


```{r}

union(B,A)

```



UNION ALL | BIND ROWS
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### THE `union_all()` OR THE `bind_rows()` (dplyr) FUNCTIONS, OR THE FUNCTION `rbind()` IN BASE R, COMBINE ALL ROWS FROM THE LEFT AND THE RIGHT TABLE **WITHOUT REMOVING THE DUPLICATE RECORDS**. 

\

```{r}


unionall_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\unionall_join.png")
grid.raster(unionall_join)


```



Columna {data-width=200}
-----------------------------------------------------------------------

### dyplyr 

`union_all(A,B)`


```{r}

union_all(A,B)

```


`union_all(B,A)`


```{r}

union_all(B,A)

```


`bind_rows(A, B)`

```{r}

bind_rows(A, B)

```



Columna {data-width=200}
-----------------------------------------------------------------------


### base R


`rbind(A,B)`

```{r}

rbind(A,B)

```


`rbind(B,A)`

```{r}

rbind(B, A)

```





SETDIFF
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### THE `setdiff()` FUNCTION (dplyr) RETURNS THE ROWS THAT APPEAR IN THE LEFT TABLE BUT NOT IN THE RIGHT TABLE.  BEWARE THAT THE `setdiff()` FUNCTION REMOVES DUPLICATES. 

\

```{r}


except_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\except_join.png")
grid.raster(except_join)


```





Columna {data-width=400}
-----------------------------------------------------------------------


### Dyplyr {data-commentary-width=400}

*** 


`setdiff(A,B)`


```{r}

setdiff(A,B)

```


*** 

`setdiff(B,A)`


```{r}

setdiff(B,A)

```




SELF JOIN
====================================================================================

Columna {data-width=600}
-----------------------------------------------------------------

\

### Dataset: `gapminder` 


```{r}

datatable(gapminder, rownames = TRUE, list(
  columnDefs = list(list(className = 'dt-center', targets = 0:6))))

```


Columna {data-width=400}
-----------------------------------------------------------------

\

\

***

`gapminder %>%
  filter(year == c(1952,2007)) %>%
  select(country, continent, year, lifeExp) %>%
  spread(key= year, lifeExp) %>%
  mutate(crecimiento = ((`2007`-`1952`)/`1952`)*100) %>%
  head(n=10)`
  
\

```{r}

gapminder %>%
  filter(year == c(1952,2007)) %>%
  select(country, continent, year, lifeExp) %>%
  spread(key= year, lifeExp) %>%
  mutate(crecimiento = ((`2007`-`1952`)/`1952`)*100) %>%
  head(n=20)

```