vab_pc
life_exp
vab_pc
life_exp
inner_join()
FUNCTION (dplyr) RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE ARE MATCHING VALUES IN THE RIGHT TABLE. THEREFORE, THE inner_join()
RETURNS A TABLE WHICH COMBINE BOTH TABLES BASED ON THE JOIN PREDICATE. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT AND THE RIGHT TABLE ALL COMBINATIONS OF THE MATCHES ARE RETURNED. ALTERNATIVELY WE CAN USE THE merge()
FUNCTION (BASE R) INDICATING all=FALSE
inner_join(vab_pc, life_exp, by= c("country", "continent"))
# A tibble: 4 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 Japan Asia 31,656.07 82.60
4 Spain Europe 28,821.06 80.94
inner_join(life_exp, vab_pc, by= c("country", "continent"))
# A tibble: 4 x 4
country continent lifeExp gdpPercap
<fct> <fct> <chr> <chr>
1 Australia Oceania 81.23 34,435.37
2 Canada Americas 80.65 36,319.24
3 Japan Asia 82.60 31,656.07
4 Spain Europe 80.94 28,821.06
merge(vab_pc, life_exp, all = FALSE)
country continent gdpPercap lifeExp
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 Japan Asia 31,656.07 82.60
4 Spain Europe 28,821.06 80.94
merge(life_exp, vab_pc, all = FALSE)
country continent lifeExp gdpPercap
1 Australia Oceania 81.23 34,435.37
2 Canada Americas 80.65 36,319.24
3 Japan Asia 82.60 31,656.07
4 Spain Europe 80.94 28,821.06
left_join()
FUNCTION (dplyr) RETURNS ALL RECORDS FROM THE LEFT TABLE AND THE MATCHED RECORDS FROM THE RIGHT TABLE. THE RESULT IS NA
FROM THE RIGHT SIDE IF THERE IS NO MATCH. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT TABLE AND THE RIGHT TABLE ALL COMBINATIONS OF THE MATCHES ARE RETURNED. IN BASE R WE CAN USE THE FUNCTION merge()
INDICATING all.x = TRUE
.
vab_pc %>% left_join(life_exp, by = c("country", "continent"))
# A tibble: 6 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 India Asia 2,452.21 <NA>
4 Japan Asia 31,656.07 82.60
5 Peru Americas 7,408.91 <NA>
6 Spain Europe 28,821.06 80.94
life_exp %>% left_join(vab_pc, by = c("country", "continent"))
# A tibble: 7 x 4
country continent lifeExp gdpPercap
<fct> <fct> <chr> <chr>
1 Australia Oceania 81.23 34,435.37
2 Brazil Americas 72.39 <NA>
3 Canada Americas 80.65 36,319.24
4 Japan Asia 82.60 31,656.07
5 Mongolia Asia 66.80 <NA>
6 Singapore Asia 79.97 <NA>
7 Spain Europe 80.94 28,821.06
merge(vab_pc, life_exp, all.x = TRUE)
country continent gdpPercap lifeExp
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 India Asia 2,452.21 <NA>
4 Japan Asia 31,656.07 82.60
5 Peru Americas 7,408.91 <NA>
6 Spain Europe 28,821.06 80.94
merge(life_exp, vab_pc, all.x = TRUE)
country continent lifeExp gdpPercap
1 Australia Oceania 81.23 34,435.37
2 Brazil Americas 72.39 <NA>
3 Canada Americas 80.65 36,319.24
4 Japan Asia 82.60 31,656.07
5 Mongolia Asia 66.80 <NA>
6 Singapore Asia 79.97 <NA>
7 Spain Europe 80.94 28,821.06
right_join()
FUNCTION (dplyr) RETURNS ALL RECORDS FROM THE RIGHT TABLE AND THE MATCHED RECORDS FROM THE LEFT TABLE. THE RESULT IS NA
FROM THE LEFT SIDE IF THERE IS NO MATCH. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT TABLE AND THE RIGHT TABLE, ALL COMBINATIONS OF THE MATCHES ARE RETURNED. IN BASE R WE CAN USE THE FUNCTION merge()
INDICATING all.y = TRUE
.
vab_pc %>% right_join(life_exp, by = c("country", "continent"))
# A tibble: 7 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Brazil Americas <NA> 72.39
3 Canada Americas 36,319.24 80.65
4 Japan Asia 31,656.07 82.60
5 Mongolia Asia <NA> 66.80
6 Singapore Asia <NA> 79.97
7 Spain Europe 28,821.06 80.94
life_exp %>% right_join(vab_pc, by = c("country", "continent"))
# A tibble: 6 x 4
country continent lifeExp gdpPercap
<fct> <fct> <chr> <chr>
1 Australia Oceania 81.23 34,435.37
2 Canada Americas 80.65 36,319.24
3 India Asia <NA> 2,452.21
4 Japan Asia 82.60 31,656.07
5 Peru Americas <NA> 7,408.91
6 Spain Europe 80.94 28,821.06
merge(vab_pc, life_exp, all.y = TRUE)
country continent gdpPercap lifeExp
1 Australia Oceania 34,435.37 81.23
2 Brazil Americas <NA> 72.39
3 Canada Americas 36,319.24 80.65
4 Japan Asia 31,656.07 82.60
5 Mongolia Asia <NA> 66.80
6 Singapore Asia <NA> 79.97
7 Spain Europe 28,821.06 80.94
merge(life_exp, vab_pc, all.y = TRUE)
country continent lifeExp gdpPercap
1 Australia Oceania 81.23 34,435.37
2 Canada Americas 80.65 36,319.24
3 India Asia <NA> 2,452.21
4 Japan Asia 82.60 31,656.07
5 Peru Americas <NA> 7,408.91
6 Spain Europe 80.94 28,821.06
full_join()
FUNCTION (dplyr) RETURNS ALL ROWS AND ALL COLUMNS FROM BOTH THE LEFT AND THE RIGHT TABLE. WHERE THERE ARE NOT MATCHING VALUES RETURNS NA
FOR THE ONES MISSING. IN BASE R WE CAN USE THE FUNCTION merge()
INDICATING all = TRUE
.
vab_pc %>% full_join(life_exp, by = c("country", "continent"))
# A tibble: 9 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 India Asia 2,452.21 <NA>
4 Japan Asia 31,656.07 82.60
5 Peru Americas 7,408.91 <NA>
6 Spain Europe 28,821.06 80.94
7 Brazil Americas <NA> 72.39
8 Mongolia Asia <NA> 66.80
9 Singapore Asia <NA> 79.97
life_exp %>% full_join(vab_pc, by = c("country", "continent"))
# A tibble: 9 x 4
country continent lifeExp gdpPercap
<fct> <fct> <chr> <chr>
1 Australia Oceania 81.23 34,435.37
2 Brazil Americas 72.39 <NA>
3 Canada Americas 80.65 36,319.24
4 Japan Asia 82.60 31,656.07
5 Mongolia Asia 66.80 <NA>
6 Singapore Asia 79.97 <NA>
7 Spain Europe 80.94 28,821.06
8 India Asia <NA> 2,452.21
9 Peru Americas <NA> 7,408.91
merge(vab_pc, life_exp, all = TRUE)
country continent gdpPercap lifeExp
1 Australia Oceania 34,435.37 81.23
2 Brazil Americas <NA> 72.39
3 Canada Americas 36,319.24 80.65
4 India Asia 2,452.21 <NA>
5 Japan Asia 31,656.07 82.60
6 Mongolia Asia <NA> 66.80
7 Peru Americas 7,408.91 <NA>
8 Singapore Asia <NA> 79.97
9 Spain Europe 28,821.06 80.94
merge(life_exp, vab_pc, all = TRUE)
country continent lifeExp gdpPercap
1 Australia Oceania 81.23 34,435.37
2 Brazil Americas 72.39 <NA>
3 Canada Americas 80.65 36,319.24
4 India Asia <NA> 2,452.21
5 Japan Asia 82.60 31,656.07
6 Mongolia Asia 66.80 <NA>
7 Peru Americas <NA> 7,408.91
8 Singapore Asia 79.97 <NA>
9 Spain Europe 80.94 28,821.06
bind_cols()
(dplyr) AND THE FUNCTION cbind()
(BASE R) JOINS TWO TABLES HORIZONTALLY AND DO NOT NEED TO SPECIFY A COMMON KEY. THEY HORIZONTALLY CONCATENATES THE LEFT AND RIGHT TABLE, BUT THE TABLES MUST HAVE THE SAME NUMBER OF ROWS AND BE SORTED IN THE SAME ORDER AS THEY ARE MATCHED BY POSITION (NOTE THAT IN THIS CASE BOTH TABLES MUST HAVE THE SAME OBSERVATIONS)Â
bind_cols(vab_pc_b, life_exp_b)
# A tibble: 6 x 6
country continent gdpPercap country1 continent1 gdpPercap1
<fct> <fct> <dbl> <fct> <fct> <dbl>
1 Australia Oceania 34435. Australia Oceania 34435.
2 Canada Americas 36319. Canada Americas 36319.
3 India Asia 2452. India Asia 2452.
4 Japan Asia 31656. Japan Asia 31656.
5 Peru Americas 7409. Peru Americas 7409.
6 Spain Europe 28821. Spain Europe 28821.
bind_cols(life_exp_b, vab_pc_b)
# A tibble: 6 x 6
country continent gdpPercap country1 continent1 gdpPercap1
<fct> <fct> <dbl> <fct> <fct> <dbl>
1 Australia Oceania 34435. Australia Oceania 34435.
2 Canada Americas 36319. Canada Americas 36319.
3 India Asia 2452. India Asia 2452.
4 Japan Asia 31656. Japan Asia 31656.
5 Peru Americas 7409. Peru Americas 7409.
6 Spain Europe 28821. Spain Europe 28821.
cbind(vab_pc_b, life_exp_b)
country continent gdpPercap country continent gdpPercap
1 Australia Oceania 34435.367 Australia Oceania 34435.367
2 Canada Americas 36319.235 Canada Americas 36319.235
3 India Asia 2452.210 India Asia 2452.210
4 Japan Asia 31656.068 Japan Asia 31656.068
5 Peru Americas 7408.906 Peru Americas 7408.906
6 Spain Europe 28821.064 Spain Europe 28821.064
cbind(life_exp_b, vab_pc_b)
country continent gdpPercap country continent gdpPercap
1 Australia Oceania 34435.367 Australia Oceania 34435.367
2 Canada Americas 36319.235 Canada Americas 36319.235
3 India Asia 2452.210 India Asia 2452.210
4 Japan Asia 31656.068 Japan Asia 31656.068
5 Peru Americas 7408.906 Peru Americas 7408.906
6 Spain Europe 28821.064 Spain Europe 28821.064
semi_join()
FUNCTION RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE IS A CORRESPONDING MATCHING VALUE IN THE RIGHT TABLE, KEEPING JUST COLUMNS FROM THE FIRST ONE. INSTEAD OF RETURNING A TABLE COMBINING BOTH TABLES (AS IT DOES THE inner_join()
FUNCTION, IT ONLY CONTAINS DATA FROM THE LEFT TABLE. THEREFORE, THE semi_join()
FUNCTION WILL NEVER DUPLICATE ROWS OF THE LEFT TABLE.
semi_join(vab_pc, life_exp)
# A tibble: 4 x 3
country continent gdpPercap
<fct> <fct> <chr>
1 Australia Oceania 34,435.37
2 Canada Americas 36,319.24
3 Japan Asia 31,656.07
4 Spain Europe 28,821.06
semi_join(life_exp, vab_pc)
# A tibble: 4 x 3
country continent lifeExp
<fct> <fct> <chr>
1 Australia Oceania 81.23
2 Canada Americas 80.65
3 Japan Asia 82.60
4 Spain Europe 80.94
semi_join(matches, by= c("country", "continent")
# A tibble: 4 x 3
country continent gdpPercap
<fct> <fct> <chr>
1 Australia Oceania 34,435.37
2 Canada Americas 36,319.24
3 Japan Asia 31,656.07
4 Spain Europe 28,821.06
vab_pc %>% filter(country %in% life_exp$country)
# A tibble: 4 x 3
country continent gdpPercap
<fct> <fct> <chr>
1 Australia Oceania 34,435.37
2 Canada Americas 36,319.24
3 Japan Asia 31,656.07
4 Spain Europe 28,821.06
anti_join()
FUNCTION RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE ARE NOT MATCHING VALUES IN THE RIGHT TABLE, KEEPING JUST COLUMNS FROM THE LEFT TABLE. THE anti_join()
CAN BE VERY USEFUL AS IT ALLOWS TO DETECT THE OBSERVATIONS IN A DATASET (LEFT TABLE) THAT ARE NOT INCLUDED IN A SECOND TABLE (RIGHT TABLE)
anti_join(vab_pc, life_exp)
# A tibble: 2 x 3
country continent gdpPercap
<fct> <fct> <chr>
1 India Asia 2,452.21
2 Peru Americas 7,408.91
anti_join(life_exp, vab_pc)
# A tibble: 3 x 3
country continent lifeExp
<fct> <fct> <chr>
1 Brazil Americas 72.39
2 Mongolia Asia 66.80
3 Singapore Asia 79.97
anti_join(life_exp, vab_pc, by = c("country", "continent"))
# A tibble: 2 x 3
country continent gdpPercap
<fct> <fct> <chr>
1 India Asia 2,452.21
2 Peru Americas 7,408.91
vab_pc %>% filter(country %nin% life_exp$country)
# A tibble: 2 x 3
country continent gdpPercap
<fct> <fct> <chr>
1 India Asia 2,452.21
2 Peru Americas 7,408.91
A
B
A
B
intersect()
FUNCTION TAKES THE ROWS THAT APPEARS IN BOTH TABLES AND RETURNS A DATAFRAME WITH THOSE ROWS THAT ARE BOTH IN THE LEFT AND IN THE RIGHT TABLE. THIS FUNCTION IS SIMILAR TO THE inner_join()
BUT IN THIS CASE BOTH DATASETS MUST HAVE THE SAME COLUMNS. BEWARE THAT THE intersect()
FUNCTION REMOVES DUPLICATES.
intersect(A,B)
# A tibble: 4 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 Japan Asia 31,656.07 82.60
4 Spain Europe 28,821.06 80.94
intersect(B,A)
# A tibble: 4 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 Japan Asia 31,656.07 82.60
4 Spain Europe 28,821.06 80.94
union()
FUNCTION COMBINES ALL ROWS FROM THE LEFT AND THE RIGHT TABLE BUT REMOVES DUPLICATE RECORDS.
union(A,B)
# A tibble: 9 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 India Asia 2,452.21 64.70
4 Japan Asia 31,656.07 82.60
5 Peru Americas 7,408.91 71.42
6 Spain Europe 28,821.06 80.94
7 Brazil Americas 9,065.80 72.39
8 Mongolia Asia 3,095.77 66.80
9 Singapore Asia 47,143.18 79.97
union(B,A)
# A tibble: 9 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Brazil Americas 9,065.80 72.39
3 Canada Americas 36,319.24 80.65
4 Japan Asia 31,656.07 82.60
5 Mongolia Asia 3,095.77 66.80
6 Singapore Asia 47,143.18 79.97
7 Spain Europe 28,821.06 80.94
8 India Asia 2,452.21 64.70
9 Peru Americas 7,408.91 71.42
union_all()
OR THE bind_rows()
(dplyr) FUNCTIONS, OR THE FUNCTION rbind()
IN BASE R, COMBINE ALL ROWS FROM THE LEFT AND THE RIGHT TABLE WITHOUT REMOVING THE DUPLICATE RECORDS.
union_all(A,B)
# A tibble: 13 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 India Asia 2,452.21 64.70
4 Japan Asia 31,656.07 82.60
5 Peru Americas 7,408.91 71.42
6 Spain Europe 28,821.06 80.94
7 Australia Oceania 34,435.37 81.23
8 Brazil Americas 9,065.80 72.39
9 Canada Americas 36,319.24 80.65
10 Japan Asia 31,656.07 82.60
11 Mongolia Asia 3,095.77 66.80
12 Singapore Asia 47,143.18 79.97
13 Spain Europe 28,821.06 80.94
union_all(B,A)
# A tibble: 13 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Brazil Americas 9,065.80 72.39
3 Canada Americas 36,319.24 80.65
4 Japan Asia 31,656.07 82.60
5 Mongolia Asia 3,095.77 66.80
6 Singapore Asia 47,143.18 79.97
7 Spain Europe 28,821.06 80.94
8 Australia Oceania 34,435.37 81.23
9 Canada Americas 36,319.24 80.65
10 India Asia 2,452.21 64.70
11 Japan Asia 31,656.07 82.60
12 Peru Americas 7,408.91 71.42
13 Spain Europe 28,821.06 80.94
bind_rows(A, B)
# A tibble: 13 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 India Asia 2,452.21 64.70
4 Japan Asia 31,656.07 82.60
5 Peru Americas 7,408.91 71.42
6 Spain Europe 28,821.06 80.94
7 Australia Oceania 34,435.37 81.23
8 Brazil Americas 9,065.80 72.39
9 Canada Americas 36,319.24 80.65
10 Japan Asia 31,656.07 82.60
11 Mongolia Asia 3,095.77 66.80
12 Singapore Asia 47,143.18 79.97
13 Spain Europe 28,821.06 80.94
rbind(A,B)
# A tibble: 13 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Canada Americas 36,319.24 80.65
3 India Asia 2,452.21 64.70
4 Japan Asia 31,656.07 82.60
5 Peru Americas 7,408.91 71.42
6 Spain Europe 28,821.06 80.94
7 Australia Oceania 34,435.37 81.23
8 Brazil Americas 9,065.80 72.39
9 Canada Americas 36,319.24 80.65
10 Japan Asia 31,656.07 82.60
11 Mongolia Asia 3,095.77 66.80
12 Singapore Asia 47,143.18 79.97
13 Spain Europe 28,821.06 80.94
rbind(B,A)
# A tibble: 13 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Australia Oceania 34,435.37 81.23
2 Brazil Americas 9,065.80 72.39
3 Canada Americas 36,319.24 80.65
4 Japan Asia 31,656.07 82.60
5 Mongolia Asia 3,095.77 66.80
6 Singapore Asia 47,143.18 79.97
7 Spain Europe 28,821.06 80.94
8 Australia Oceania 34,435.37 81.23
9 Canada Americas 36,319.24 80.65
10 India Asia 2,452.21 64.70
11 Japan Asia 31,656.07 82.60
12 Peru Americas 7,408.91 71.42
13 Spain Europe 28,821.06 80.94
setdiff()
FUNCTION (dplyr) RETURNS THE ROWS THAT APPEAR IN THE LEFT TABLE BUT NOT IN THE RIGHT TABLE. BEWARE THAT THE setdiff()
FUNCTION REMOVES DUPLICATES.
setdiff(A,B)
# A tibble: 2 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 India Asia 2,452.21 64.70
2 Peru Americas 7,408.91 71.42
setdiff(B,A)
# A tibble: 3 x 4
country continent gdpPercap lifeExp
<fct> <fct> <chr> <chr>
1 Brazil Americas 9,065.80 72.39
2 Mongolia Asia 3,095.77 66.80
3 Singapore Asia 47,143.18 79.97
gapminder
gapminder %>% filter(year == c(1952,2007)) %>% select(country, continent, year, lifeExp) %>% spread(key= year, lifeExp) %>% mutate(crecimiento = ((
2007-
1952)/
1952)*100) %>% head(n=10)
# A tibble: 20 x 5
country continent `1952` `2007` crecimiento
<fct> <fct> <dbl> <dbl> <dbl>
1 Afghanistan Asia 28.8 43.8 52.2
2 Albania Europe 55.2 76.4 38.4
3 Algeria Africa 43.1 72.3 67.8
4 Angola Africa 30.0 42.7 42.4
5 Argentina Americas 62.5 75.3 20.5
6 Australia Oceania 69.1 81.2 17.5
7 Austria Europe 66.8 79.8 19.5
8 Bahrain Asia 50.9 75.6 48.5
9 Bangladesh Asia 37.5 64.1 70.9
10 Belgium Europe 68 79.4 16.8
11 Benin Africa 38.2 56.7 48.4
12 Bolivia Americas 40.4 65.6 62.2
13 Bosnia and Herzegovina Europe 53.8 74.9 39.1
14 Botswana Africa 47.6 50.7 6.52
15 Brazil Americas 50.9 72.4 42.2
16 Bulgaria Europe 59.6 73.0 22.5
17 Burkina Faso Africa 32.0 52.3 63.5
18 Burundi Africa 39.0 49.6 27.0
19 Cambodia Asia 39.4 59.7 51.5
20 Cameroon Africa 38.5 50.4 30.9
---
title: "Data Wrangling: Joining Data"
output:
flexdashboard::flex_dashboard:
storyboard: true
orientation: columns
source_code: embed
vertical_layout: fill
theme: simplex
---
```{r setup, include=FALSE}
# PACKAGES / LIBRARIES:
library(flexdashboard)
library(tidyverse)
library(gapminder)
library(ggthemes)
library(ggthemr)
library(DT)
# MAIN DATASET:
gapminder <- gapminder::gapminder
```
Tables I
====================================================================================
Column
-----------------------------------------------------------------------
### LEFT TABLE: `vab_pc`
```{r}
vab_pc <- gapminder %>%
filter(year == 2007) %>%
filter(country %in% c("Spain", "Peru", "India", "Australia", "Japan", "Canada")) %>%
select(country, continent, gdpPercap)
vab_pc$gdpPercap <- formatC(vab_pc$gdpPercap, format = "f", digits = 2, big.mark = ",")
datatable(vab_pc, rownames = TRUE, list(
columnDefs = list(list(className = 'dt-center', targets = 0:3))))
```
### RiGHT TABLE: `life_exp`
```{r}
life_exp <- gapminder %>%
filter(year == 2007) %>%
filter(country %in% c("Spain", "Australia", "Singapore", "Canada", "Japan", "Mongolia", "Brazil")) %>%
select(country, continent, lifeExp)
life_exp$lifeExp <- formatC(life_exp$lifeExp, format = "f", digits = 2)
datatable(life_exp, rownames = TRUE, list(
columnDefs = list(list(className = 'dt-center', targets = 0:3))))
```
Column
-----------------------------------------------------------------------
### LEFT TABLE: `vab_pc`
```{r}
library(png)
library(grid)
vabpc <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\vab_pc.png")
grid.raster(vabpc)
```
### LEFT TABLE: `life_exp`
```{r}
lifeexp <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\life_exp.png")
grid.raster(lifeexp)
```
INNER JOIN
====================================================================================
Column
-----------------------------------------------------------------------
\
### THE `inner_join()` FUNCTION (dplyr) RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE ARE MATCHING VALUES IN THE RIGHT TABLE. THEREFORE, THE `inner_join()` RETURNS A TABLE WHICH COMBINE BOTH TABLES BASED ON THE JOIN PREDICATE. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT AND THE RIGHT TABLE ALL COMBINATIONS OF THE MATCHES ARE RETURNED. ALTERNATIVELY WE CAN USE THE `merge()` FUNCTION (BASE R) INDICATING `all=FALSE`
```{r}
inner_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\inner_join.png")
grid.raster(inner_join)
```
Columna {data-width=200}
-----------------------------------------------------------------------
### Dyplyr {data-commentary-width=400}
***
\
`inner_join(vab_pc, life_exp, by= c("country", "continent"))`
\
```{r}
inner_join(vab_pc, life_exp, by= c("country", "continent"))
```
***
\
`inner_join(life_exp, vab_pc, by= c("country", "continent"))`
\
```{r}
inner_join(life_exp, vab_pc, by= c("country", "continent"))
```
Columna {data-width=200}
-----------------------------------------------------------------------
### Base R {data-commentary-width=100}
***
\
\
`merge(vab_pc, life_exp, all = FALSE)`
\
```{r}
merge(vab_pc, life_exp, all = FALSE)
```
***
\
\
\
`merge(life_exp, vab_pc, all = FALSE)`
\
```{r}
merge(life_exp, vab_pc, all = FALSE)
```
LEFT JOIN
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### THE `left_join()` FUNCTION (dplyr) RETURNS ALL RECORDS FROM THE LEFT TABLE AND THE MATCHED RECORDS FROM THE RIGHT TABLE. THE RESULT IS `NA` FROM THE RIGHT SIDE IF THERE IS NO MATCH. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT TABLE AND THE RIGHT TABLE ALL COMBINATIONS OF THE MATCHES ARE RETURNED. IN BASE R WE CAN USE THE FUNCTION `merge()` INDICATING `all.x = TRUE`.
```{r}
left_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\left_join.png")
grid.raster(left_join)
```
Columna {data-width=200}
-----------------------------------------------------------------------
### Dyplyr {data-commentary-width=400}
***
\
`vab_pc %>%
left_join(life_exp, by = c("country", "continent"))`
\
```{r}
vab_pc %>%
left_join(life_exp, by = c("country", "continent"))
```
***
\
`life_exp %>%
left_join(vab_pc, by = c("country", "continent"))`
\
```{r}
life_exp %>%
left_join(vab_pc, by = c("country", "continent"))
```
Columna {data-width=200}
-----------------------------------------------------------------------
### Base R {data-commentary-width=100}
***
\
\
`merge(vab_pc, life_exp, all.x = TRUE)`
\
```{r}
merge(vab_pc, life_exp, all.x = TRUE)
```
***
\
\
\
`merge(life_exp, vab_pc, all.x = TRUE)`
\
```{r}
merge(life_exp, vab_pc, all.x = TRUE)
```
RIGHT JOIN
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### THE `right_join()` FUNCTION (dplyr) RETURNS ALL RECORDS FROM THE RIGHT TABLE AND THE MATCHED RECORDS FROM THE LEFT TABLE. THE RESULT IS `NA` FROM THE LEFT SIDE IF THERE IS NO MATCH. IF THERE ARE MULTIPLE MATCHES BETWEEN THE LEFT TABLE AND THE RIGHT TABLE, ALL COMBINATIONS OF THE MATCHES ARE RETURNED. IN BASE R WE CAN USE THE FUNCTION `merge()` INDICATING `all.y = TRUE`.
```{r}
right_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\right_join.png")
grid.raster(right_join)
```
Columna {data-width=200}
-----------------------------------------------------------------------
### Dyplyr {data-commentary-width=400}
***
\
`vab_pc %>%
right_join(life_exp, by = c("country", "continent"))`
\
```{r}
vab_pc %>%
right_join(life_exp, by = c("country", "continent"))
```
***
\
`life_exp %>%
right_join(vab_pc, by = c("country", "continent"))`
\
```{r}
life_exp %>%
right_join(vab_pc, by = c("country", "continent"))
```
Columna {data-width=200}
-----------------------------------------------------------------------
### Base R {data-commentary-width=100}
***
\
\
`merge(vab_pc, life_exp, all.y = TRUE)`
\
```{r}
merge(vab_pc, life_exp, all.y = TRUE)
```
***
\
\
\
`merge(life_exp, vab_pc, all.y = TRUE)`
\
```{r}
merge(life_exp, vab_pc, all.y = TRUE)
```
FULL JOIN
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### THE `full_join()` FUNCTION (dplyr) RETURNS ALL ROWS AND ALL COLUMNS FROM BOTH THE LEFT AND THE RIGHT TABLE. WHERE THERE ARE NOT MATCHING VALUES RETURNS `NA` FOR THE ONES MISSING. IN BASE R WE CAN USE THE FUNCTION `merge()` INDICATING `all = TRUE`.
\
```{r}
full_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\full_join.png")
grid.raster(full_join)
```
Columna {data-width=200}
-----------------------------------------------------------------------
### Dyplyr {data-commentary-width=400}
***
\
`vab_pc %>%
full_join(life_exp, by = c("country", "continent"))`
\
```{r}
vab_pc %>%
full_join(life_exp, by = c("country", "continent"))
```
***
\
`life_exp %>%
full_join(vab_pc, by = c("country", "continent"))`
\
```{r}
life_exp %>%
full_join(vab_pc, by = c("country", "continent"))
```
Columna {data-width=200}
-----------------------------------------------------------------------
### Base R {data-commentary-width=100}
***
\
\
`merge(vab_pc, life_exp, all = TRUE)`
\
```{r}
merge(vab_pc, life_exp, all = TRUE)
```
***
\
\
\
`merge(life_exp, vab_pc, all = TRUE)`
\
```{r}
merge(life_exp, vab_pc, all = TRUE)
```
BIND COLS
====================================================================================
Columna {data-width=700}
-----------------------------------------------------------------
\
### THE FUNCTION `bind_cols()` (dplyr) AND THE FUNCTION `cbind()` (BASE R) JOINS TWO TABLES HORIZONTALLY AND DO NOT NEED TO SPECIFY A COMMON KEY. THEY HORIZONTALLY CONCATENATES THE LEFT AND RIGHT TABLE, BUT THE TABLES MUST HAVE THE SAME NUMBER OF ROWS AND BE SORTED IN THE SAME ORDER AS THEY ARE MATCHED BY POSITION (**NOTE THAT IN THIS CASE BOTH TABLES MUST HAVE THE SAME OBSERVATIONS**)
\
\
```{r}
cbind_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\cbind_join.png")
grid.raster(cbind_join)
```
Columna {data-width=300}
-----------------------------------------------------------------------
### Dplyr
`bind_cols(vab_pc_b, life_exp_b)`
```{r}
vab_pc_b <- gapminder %>%
filter(year == 2007) %>%
filter(country %in% c("Spain", "Peru", "India", "Australia", "Japan", "Canada")) %>%
select(country, continent, gdpPercap)
life_exp_b <- gapminder %>%
filter(year == 2007) %>%
filter(country %in% c("Spain", "Peru", "India", "Australia", "Japan", "Canada")) %>%
select(country, continent, gdpPercap)
bind_cols(vab_pc_b, life_exp_b)
```
`bind_cols(life_exp_b, vab_pc_b)`
```{r}
bind_cols(list(life_exp_b, vab_pc_b))
```
### Base R
`cbind(vab_pc_b, life_exp_b)`
```{r}
cbind(vab_pc_b, life_exp_b)
```
`cbind(life_exp_b, vab_pc_b)`
```{r}
cbind(life_exp_b, vab_pc_b)
```
SEMI JOIN
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### THE `semi_join()` FUNCTION RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE IS A CORRESPONDING MATCHING VALUE IN THE RIGHT TABLE, KEEPING JUST COLUMNS FROM THE FIRST ONE. INSTEAD OF RETURNING A TABLE COMBINING BOTH TABLES (AS IT DOES THE `inner_join()` FUNCTION, IT ONLY CONTAINS DATA FROM THE LEFT TABLE. THEREFORE, THE `semi_join()` FUNCTION WILL NEVER DUPLICATE ROWS OF THE LEFT TABLE.
\
```{r}
semi_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\semi_join.png")
grid.raster(semi_join)
```
Columna {data-width=400}
-----------------------------------------------------------------------
### Dyplyr
***
`semi_join(vab_pc, life_exp)`
```{r}
semi_join(vab_pc, life_exp)
```
***
`semi_join(life_exp, vab_pc)`
```{r}
semi_join(life_exp, vab_pc)
```
***
`semi_join(matches, by= c("country", "continent")`
```{r}
vab_pc %>%
semi_join(life_exp, by= c("country", "continent"))
```
***
`vab_pc %>%
filter(country %in% life_exp$country)`
```{r}
vab_pc %>%
filter(country %in% life_exp$country)
```
ANTI JOIN
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### THE `anti_join()` FUNCTION RETURNS ALL ROWS FROM THE LEFT TABLE WHERE THERE ARE NOT MATCHING VALUES IN THE RIGHT TABLE, KEEPING JUST COLUMNS FROM THE LEFT TABLE. THE `anti_join()` CAN BE VERY USEFUL AS IT ALLOWS TO DETECT THE OBSERVATIONS IN A DATASET (LEFT TABLE) THAT ARE NOT INCLUDED IN A SECOND TABLE (RIGHT TABLE)
\
```{r}
anti_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\anti_join.png")
grid.raster(anti_join)
```
Columna {data-width=400}
-----------------------------------------------------------------------
### Dyplyr {data-commentary-width=400}
***
`anti_join(vab_pc, life_exp)`
```{r}
anti_join(vab_pc, life_exp)
```
***
`anti_join(life_exp, vab_pc)`
```{r}
anti_join(life_exp, vab_pc)
```
***
`anti_join(life_exp, vab_pc, by = c("country", "continent"))`
```{r}
anti_join(vab_pc, life_exp, by = c("country", "continent"))
```
***
`vab_pc %>%
filter(country %nin% life_exp$country)`
```{r}
library(Hmisc)
vab_pc %>%
filter(country %nin% life_exp$country)
```
Tables II
====================================================================================
Columna
-----------------------------------------------------------------------
### LEFT TABLE: `A`
```{r}
gapminder <-gapminder::gapminder
A <- gapminder %>%
filter(year == 2007) %>%
filter(country %in% c("Spain", "Peru", "India", "Australia", "Japan", "Canada")) %>%
select(country, continent, gdpPercap, lifeExp)
A$gdpPercap <- formatC(A$gdpPercap, format = "f", digits = 2, big.mark = ",")
A$lifeExp <- formatC(A$lifeExp, format = "f", digits = 2)
datatable(A, rownames = TRUE, list(
columnDefs = list(list(className = 'dt-center', targets = 0:4))))
```
### RiGHT TABLE: `B`
```{r}
B <- gapminder %>%
filter(year == 2007) %>%
filter(country %in% c("Spain", "Australia", "Singapore", "Canada", "Japan", "Mongolia", "Brazil")) %>%
select(country, continent, gdpPercap, lifeExp)
B$gdpPercap <- formatC(B$gdpPercap, format = "f", digits = 2, big.mark = ",")
B$lifeExp <- formatC(B$lifeExp, format = "f", digits = 2)
datatable(B, rownames = TRUE, list(
columnDefs = list(list(className = 'dt-center', targets = 0:4))))
```
Column
-----------------------------------------------------------------------
### LEFT TABLE: `A`
```{r}
A2 <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\A.png")
grid.raster(A2)
```
### LEFT TABLE: `B`
```{r}
B2 <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\B.png")
grid.raster(B2)
```
INTERSECT
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### THE `intersect()` FUNCTION TAKES THE ROWS THAT APPEARS IN BOTH TABLES AND RETURNS A DATAFRAME WITH THOSE ROWS THAT ARE BOTH IN THE LEFT AND IN THE RIGHT TABLE. THIS FUNCTION IS SIMILAR TO THE `inner_join()` BUT IN THIS CASE **BOTH DATASETS MUST HAVE THE SAME COLUMNS**. BEWARE THAT THE `intersect()` FUNCTION REMOVES DUPLICATES.
\
```{r}
intersect_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\intersect_join.png")
grid.raster(intersect_join)
```
Columna {data-width=400}
-----------------------------------------------------------------------
### Dyplyr {data-commentary-width=400}
***
`intersect(A,B)`
```{r}
intersect(A,B)
```
***
`intersect(B,A)`
```{r}
intersect(B,A)
```
UNION
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### THE `union()` FUNCTION COMBINES ALL ROWS FROM THE LEFT AND THE RIGHT TABLE BUT **REMOVES DUPLICATE RECORDS**.
\
```{r}
union_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\union_join.png")
grid.raster(union_join)
```
Columna {data-width=400}
-----------------------------------------------------------------------
### Dyplyr {data-commentary-width=400}
***
`union(A,B)`
```{r}
union(A,B)
```
***
`union(B,A)`
```{r}
union(B,A)
```
UNION ALL | BIND ROWS
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### THE `union_all()` OR THE `bind_rows()` (dplyr) FUNCTIONS, OR THE FUNCTION `rbind()` IN BASE R, COMBINE ALL ROWS FROM THE LEFT AND THE RIGHT TABLE **WITHOUT REMOVING THE DUPLICATE RECORDS**.
\
```{r}
unionall_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\unionall_join.png")
grid.raster(unionall_join)
```
Columna {data-width=200}
-----------------------------------------------------------------------
### dyplyr
`union_all(A,B)`
```{r}
union_all(A,B)
```
`union_all(B,A)`
```{r}
union_all(B,A)
```
`bind_rows(A, B)`
```{r}
bind_rows(A, B)
```
Columna {data-width=200}
-----------------------------------------------------------------------
### base R
`rbind(A,B)`
```{r}
rbind(A,B)
```
`rbind(B,A)`
```{r}
rbind(B, A)
```
SETDIFF
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### THE `setdiff()` FUNCTION (dplyr) RETURNS THE ROWS THAT APPEAR IN THE LEFT TABLE BUT NOT IN THE RIGHT TABLE. BEWARE THAT THE `setdiff()` FUNCTION REMOVES DUPLICATES.
\
```{r}
except_join <- readPNG("C:\\Users\\Usuario\\Desktop\\r_que_r\\r_que_r\\content\\datasets\\joins_sql\\except_join.png")
grid.raster(except_join)
```
Columna {data-width=400}
-----------------------------------------------------------------------
### Dyplyr {data-commentary-width=400}
***
`setdiff(A,B)`
```{r}
setdiff(A,B)
```
***
`setdiff(B,A)`
```{r}
setdiff(B,A)
```
SELF JOIN
====================================================================================
Columna {data-width=600}
-----------------------------------------------------------------
\
### Dataset: `gapminder`
```{r}
datatable(gapminder, rownames = TRUE, list(
columnDefs = list(list(className = 'dt-center', targets = 0:6))))
```
Columna {data-width=400}
-----------------------------------------------------------------
\
\
***
`gapminder %>%
filter(year == c(1952,2007)) %>%
select(country, continent, year, lifeExp) %>%
spread(key= year, lifeExp) %>%
mutate(crecimiento = ((`2007`-`1952`)/`1952`)*100) %>%
head(n=10)`
\
```{r}
gapminder %>%
filter(year == c(1952,2007)) %>%
select(country, continent, year, lifeExp) %>%
spread(key= year, lifeExp) %>%
mutate(crecimiento = ((`2007`-`1952`)/`1952`)*100) %>%
head(n=20)
```