Transforming and summarising data

class: center, middle, inverse, title-slide

# Transforming and summarising data
### 16/11/2021

---

# Plotting using ggplot2

.pull-left[

```r
library(ggplot2)
ggplot(data = mpg, 
       mapping = aes(x = displ,
                     y = hwy,
                     colour = factor(cyl))) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       colour = "Cylinders")
```

```
## `geom_smooth()` using formula 'y ~ x'
```
]
.pull-right[
![](Week-6-data-wrangling_files/figure-html/scatter-cyls-smo-1.png)
]

---
class: middle, center, inverse
# Importing your data

---
# Fear of Crime Dataset

[Ellis & Renouf (2018)](https://doi.org/10.1080/14789949.2017.1410562) - the relationship between fear of crime and various personality measures.

Their data is openly available, stored as text in a *comma-separated-values* format (*.csv*).

Once again, we can use the import button or some code (with `read_csv()`) to load this data in and automatically format it into a *tibble*.

```r
library(readr)
FearofCrime <- read_csv("data/FearofCrime.CSV")
```

---
# Fear of Crime Dataset

Ellis & Renouf (2018) collected data online using Qualtrics.

The file contains one column for each question that the participants answered, for a total of 169(!) columns.

Each row is a single participant's answers, and their demographic information.

```r
FearofCrime
```

```
## # A tibble: 301 x 169
##    ResponseID  ResponseSet Name  ExternalDataRef~ Status StartDate EndDate Finished
##    <chr>       <chr>       <chr> <lgl>             <dbl> <chr>     <chr>      <dbl>
##  1 R_ai4tgG1G~ Default Re~ Anon~ NA                    0 19/10/14~ 19/10/~        1
##  2 R_d5OiATV0~ Default Re~ Anon~ NA                    0 20/10/14~ 20/10/~        1
##  3 R_aaBVZUe9~ Default Re~ Anon~ NA                    0 20/10/14~ 20/10/~        1
##  4 R_6nxInLKQ~ Default Re~ Anon~ NA                    0 20/10/14~ 20/10/~        1
##  5 R_6SCYbhOP~ Default Re~ Anon~ NA                    0 20/10/14~ 20/10/~        1
##  6 R_5pCxWA6q~ Default Re~ Anon~ NA                    0 20/10/14~ 20/10/~        1
##  7 R_d1nji6V7~ Default Re~ Anon~ NA                    0 20/10/14~ 20/10/~        1
##  8 R_9v6ZgUhK~ Default Re~ Anon~ NA                    0 20/10/14~ 20/10/~        1
##  9 R_5Bg7VjBh~ Default Re~ Anon~ NA                    0 20/10/14~ 20/10/~        1
## 10 R_9Sv17lQG~ Default Re~ Anon~ NA                    0 20/10/14~ 20/10/~        1
## # ... with 291 more rows, and 161 more variables: ...
```

---
# Prison population

Last week, we looked at some data regarding the UK's prison population.

The data is contained in an [Excel spreadsheet](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/676248/prison-population-data-tool-31-december-2017.xlsx), downloaded from [data.gov.uk](https://data.gov.uk).

```r
library(readxl)
prison_pop <- read_excel("data/prison-population-data-tool-31-december-2017.xlsx",
                         sheet = "PT Data")
```

We use the `read_excel()` function to read Excel files.

Note how the file name and location come first, and then I name the *sheet* I want to read.

Excel spreadsheets often have multiple sheets with different information.
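
If you are not sure which sheets a workbook contains, `excel_sheets()` from `readxl` will list them. Here is a minimal sketch; it assumes the example workbook that ships with `readxl` (a stand-in, not the prison data) so that it runs without downloading anything.

```r
library(readxl)

# Path to an example workbook bundled with readxl (a stand-in for the prison file)
example_path <- readxl_example("datasets.xlsx")

# List the names of every sheet in the workbook
sheet_names <- excel_sheets(example_path)
sheet_names

# Read one sheet by name, in the same way as sheet = "PT Data"
mtcars_sheet <- read_excel(example_path, sheet = "mtcars")
mtcars_sheet
```
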

---
background-image: url(images/05/import-from-excel.png)
background-size: contain
class: inverse

---
background-image: url(images/05/import-from-excel-change.png)
background-size: contain
class: inverse

---
background-image: url(images/05/import-from-excel-changed.png)
background-size: contain
class: inverse

---
# Prison population

Once the data is imported, we have a `tibble`.

We can immediately see there are 6 columns with 22409 rows.

```r
prison_pop
```

```
## # A tibble: 22,409 x 6
##    View           Date   Establishment Sex    `Age / Custody / National~ Population
##    <chr>          <chr>  <chr>         <chr>  <chr>                           <dbl>
##  1 a Establishme~ 2015-~ Altcourse     Male   Adults (21+)                      922
##  2 a Establishme~ 2015-~ Altcourse     Male   Juveniles and Young Adult~        169
##  3 a Establishme~ 2015-~ Ashfield      Male   Adults (21+)                      389
##  4 a Establishme~ 2015-~ Askham Grange Female Adults (21+)                       NA
##  5 a Establishme~ 2015-~ Askham Grange Female Juveniles and Young Adult~         NA
##  6 a Establishme~ 2015-~ Aylesbury     Male   Adults (21+)                      113
##  7 a Establishme~ 2015-~ Aylesbury     Male   Juveniles and Young Adult~        268
##  8 a Establishme~ 2015-~ Bedford       Male   Adults (21+)                      459
##  9 a Establishme~ 2015-~ Bedford       Male   Juveniles and Young Adult~         30
## 10 a Establishme~ 2015-~ Belmarsh      Male   Adults (21+)                      794
## # ... with 22,399 more rows
```

We need to do more work to make this file useable...

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 50% 75%
class: middle, center, inverse
# dplyr and data transformation

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%
# Data transformation

With datasets like those we've loaded, there are often organisational issues.

For example, there could be many columns or rows we don't need, or the data would make more sense if it were sorted.

This is where `dplyr` comes in!

|Function |Effect|
|------------|----|
| select()   |Include or exclude variables (columns)|
| arrange()  |Change the order of observations (rows)|
| filter()   |Include or exclude observations (rows)|
| mutate()   |Create new variables (columns)|
| group_by() |Create groups of observations|
| summarise()|Aggregate or summarise groups of observations (rows)|

---
class: inverse, middle, center
# Selecting columns

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%
# Selecting columns

.large[
Sometimes only some columns are of interest.

The Fear of Crime dataset has 169 columns. Only some of them are useful; here are the first ten.
]

```r
names(FearofCrime)[1:10]
```

```
##  [1] "ResponseID"                                                                                             
##  [2] "ResponseSet"                                                                                            
##  [3] "Name"                                                                                                   
##  [4] "ExternalDataReference"                                                                                  
##  [5] "Status"                                                                                                 
##  [6] "StartDate"                                                                                              
##  [7] "EndDate"                                                                                                
##  [8] "Finished"                                                                                               
##  [9] "Consent Form / This study includes a range of questionnaires collecting / demographic and individual..."
## [10] "sex"
```

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Selecting columns

.pull-left[
.large[
We give `select()` the name of the data frame first, followed by the name of each column we want to keep.

Suppose that, first of all, we were only interested in the age and sex of our participants.
]
]
.pull-right[

```r
library(dplyr)
select(FearofCrime, age, sex)
```

```
## # A tibble: 301 x 2
##      age   sex
##    <dbl> <dbl>
##  1    26     2
##  2    66     2
##  3    41     1
##  4    46     1
##  5    53     2
##  6    33     1
##  7    41     2
##  8    39     1
##  9    38     2
## 10    19     2
## # ... with 291 more rows
```
]

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Selecting columns

.pull-left[
.large[
The HEXACO-PI-R is a personality questionnaire that aims to measure six factors - Honesty-Humility, Emotionality, Extraversion, Agreeableness, Conscientiousness, and Openness to Experience.

The Fear of Crime dataset has the participants' answers to the 60 questions of the HEXACO-PI-R in 60 columns.
]
]
.pull-right[

```r
select(FearofCrime, hexaco1,
       hexaco2, hexaco3)
```

```
## # A tibble: 8 x 3
##   hexaco1 hexaco2 hexaco3
##     <dbl>   <dbl>   <dbl>
## 1       4       5       2
## 2       2       4       2
## 3       1       5       2
## 4       1       5       2
## 5       2       4       4
## 6       2       4       2
## 7       1       5       4
## 8       2       4       3
```
]

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%
# Selecting columns

Typing these out one by one would be ... *laborious*.

Fortunately, there are some shorthands.

The colon (*:*) operator can be used to say "everything between these columns (inclusive)".

```r
select(FearofCrime, hexaco1:hexaco5)
```

```
## # A tibble: 301 x 5
##    hexaco1 hexaco2 hexaco3 hexaco4 hexaco5
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1       4       5       2       4       1
##  2       2       4       2       4       4
##  3       1       5       2       3       2
##  4       1       5       2       4       1
##  5       2       4       4       5       5
##  6       2       4       2       2       2
##  7       1       5       4       4       4
##  8       2       4       3       2       2
##  9       1       2       4       2       5
## 10       4       4       2       3       2
## # ... with 291 more rows
```
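
Another shorthand is a *selection helper* such as `starts_with()`, which keeps every column whose name begins with a given prefix. A minimal sketch, using a small made-up tibble (`toy_survey` is hypothetical, not the real dataset):

```r
library(dplyr)

# Made-up data for illustration only
toy_survey <- tibble(
  age     = c(26, 66, 41),
  sex     = c(2, 2, 1),
  hexaco1 = c(4, 2, 1),
  hexaco2 = c(5, 4, 5),
  hexaco3 = c(2, 2, 2)
)

# Keep every column whose name starts with "hexaco"
hexaco_only <- select(toy_survey, starts_with("hexaco"))
hexaco_only
```
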

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Selecting columns

Note that you can also tell `select()` to *remove* columns using the minus (*-*) sign.

```r
select(FearofCrime, -ResponseSet, -Name, -Status, -ExternalDataReference)
```

```
## # A tibble: 301 x 165
##    ResponseID  StartDate  EndDate Finished `Consent Form / Thi~   sex   age hexaco1
##    <chr>       <chr>      <chr>      <dbl>                <dbl> <dbl> <dbl>   <dbl>
##  1 R_ai4tgG1G~ 19/10/14 ~ 19/10/~        1                    1     2    26       4
##  2 R_d5OiATV0~ 20/10/14 ~ 20/10/~        1                    1     2    66       2
##  3 R_aaBVZUe9~ 20/10/14 ~ 20/10/~        1                    1     1    41       1
##  4 R_6nxInLKQ~ 20/10/14 ~ 20/10/~        1                    1     1    46       1
##  5 R_6SCYbhOP~ 20/10/14 ~ 20/10/~        1                    1     2    53       2
##  6 R_5pCxWA6q~ 20/10/14 ~ 20/10/~        1                    1     1    33       2
##  7 R_d1nji6V7~ 20/10/14 ~ 20/10/~        1                    1     2    41       1
##  8 R_9v6ZgUhK~ 20/10/14 ~ 20/10/~        1                    1     1    39       2
##  9 R_5Bg7VjBh~ 20/10/14 ~ 20/10/~        1                    1     2    38       1
## 10 R_9Sv17lQG~ 20/10/14 ~ 20/10/~        1                    1     2    19       4
## # ... with 291 more rows, and 157 more variables: hexaco2 <dbl>, hexaco3 <dbl>,
## #   hexaco4 <dbl>, hexaco5 <dbl>, hexaco6 <dbl>, hexaco7 <dbl>, hexaco8 <dbl>,
## #   hexaco9 <dbl>, hexaco10 <dbl>, hexaco11 <dbl>, hexaco12 <dbl>, hexaco13 <dbl>,
## #   hexaco14 <dbl>, hexaco15 <dbl>, hexaco16 <dbl>, hexaco17 <dbl>,
## #   hexaco18 <dbl>, hexaco19 <dbl>, hexaco20 <dbl>, hexaco21 <dbl>,
## #   hexaco22 <dbl>, hexaco23 <dbl>, hexaco24 <dbl>, hexaco25 <dbl>,
## #   hexaco26 <dbl>, hexaco27 <dbl>, hexaco28 <dbl>, hexaco29 <dbl>, ...
```

---
class: inverse, center, middle
# Creating new columns

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Creating new columns

Here is a version of the Fear of Crime data where participants' overall scores on the various personality measures have been calculated.

```r
crime
```

```
## # A tibble: 301 x 15
##    Participant   sex     age victim_crime     H     E     X     A     C     O    SA
##    <chr>         <chr> <dbl> <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 R_01TjXgC191~ male     55 yes            3.7   3     3.4   3.9   3.2   3.6  1.15
##  2 R_0dN5YeULcy~ fema~    20 no             2.5   3.1   2.5   2.4   2.2   3.1  2.05
##  3 R_0DPiPYWhnc~ male     57 yes            2.6   3.1   3.3   3.1   4.3   2.8  2   
##  4 R_0f7bSsH6Up~ male     19 no             3.5   1.8   3.3   3.4   2.1   2.7  1.55
##  5 R_0rov2RoSkP~ fema~    20 no             3.3   3.4   3.9   3.2   2.8   3.9  1.3 
##  6 R_0wioqGERxE~ fema~    20 no             2.6   2.6   3     2.6   2.9   3.4  2.55
##  7 R_0wRO8lNe0k~ male     34 yes            3.2   2.5   3.2   2.8   4     3.2  1.85
##  8 R_116nEdFsGD~ fema~    19 no             2.9   4     3.9   4.2   3.7   1.9  1.1 
##  9 R_11ZmBd5VEk~ fema~    19 yes            3.4   3.4   3.3   3.4   3.2   3.2  2.2 
## 10 R_12i26Qzosm~ male     20 no             2.4   2.1   1.8   2.2   3.4   2.9  2.15
## # ... with 291 more rows, and 4 more variables: TA <dbl>, OHQ <dbl>, FoC <dbl>,
## #   Foc2 <dbl>
```
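
Scores like these are themselves new columns built from the raw answers. Here is a rough sketch of the idea using `mutate()`, with made-up item names and values. The real HEXACO scoring uses specific sets of items and reverse-coded items, which are not reproduced here.

```r
library(dplyr)

# Made-up answers to three hypothetical questionnaire items
toy_items <- tibble(
  Participant = c("p1", "p2", "p3"),
  item1 = c(4, 2, 1),
  item2 = c(5, 4, 5),
  item3 = c(3, 3, 3)
)

# mutate() adds a new column computed from existing columns
toy_scored <- mutate(toy_items,
                     score = (item1 + item2 + item3) / 3)
toy_scored
```
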

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Creating new columns

```r
crime_sub <- select(crime,
                    age, SA, TA, sex)
mutate(crime_sub, age_group = ifelse(age > 40,
                                     "Over 40",
                                     "40 or under"))
```

```
## # A tibble: 301 x 5
##      age    SA    TA sex    age_group  
##    <dbl> <dbl> <dbl> <chr>  <chr>      
##  1    55  1.15  1.55 male   Over 40    
##  2    20  2.05  2.95 female 40 or under
##  3    57  2     2.6  male   Over 40    
##  4    19  1.55  2.1  male   40 or under
##  5    20  1.3   1.8  female 40 or under
##  6    20  2.55  1.5  female 40 or under
##  7    34  1.85  1.75 male   40 or under
##  8    19  1.1   2    female 40 or under
##  9    19  2.2   2.9  female 40 or under
## 10    20  2.15  2.4  male   40 or under
## # ... with 291 more rows
```

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Arranging rows

Having calculated each person's *state anxiety* score, perhaps we'd now like to check who has the lowest and highest scores (note: this can be a good way to check for extreme values!).

.pull-left[

```r
arrange(crime_sub, SA)
```

```
## # A tibble: 301 x 4
##      age    SA    TA sex   
##    <dbl> <dbl> <dbl> <chr> 
##  1    20  1     1.05 male  
##  2    53  1     1.55 female
##  3    49  1     1.65 male  
##  4    19  1.05  1.5  female
##  5    19  1.1   2    female
##  6    19  1.1   1.4  male  
##  7    29  1.1   1.5  female
##  8    19  1.1   1.3  female
##  9    20  1.1   1.8  female
## 10    21  1.1   2.1  male  
## # ... with 291 more rows
```
]
.pull-right[

```r
arrange(crime_sub, desc(SA))
```

```
## # A tibble: 301 x 4
##      age    SA    TA sex   
##    <dbl> <dbl> <dbl> <chr> 
##  1    19  3.85  3.85 female
##  2    20  3.6   3.6  female
##  3    20  3.6   3.55 female
##  4    18  3.4   4    female
##  5    19  3.4   3.35 female
##  6    20  3.35  2.8  female
##  7    20  3.3   3.5  male  
##  8    19  3.2   2.95 male  
##  9    19  3.1   3.1  female
## 10    20  3.1   3.15 female
## # ... with 291 more rows
```
]
---
class: inverse, middle, center
# Grouping and summarising

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Summarising rows

.pull-left[
A common task when analysing data is to create summaries of statistical characteristics.

Here I calculate the *mean*, *standard deviation*, and *variance* of the State Anxiety variable.

Other summary functions, besides `mean()`, `sd()`, and `var()`, include `max()`, `min()`, `IQR()`, and `median()`.
]
.pull-right[

```r
summarise(crime,
          mean = mean(SA),
          standard_dev = sd(SA),
          variance = var(SA))
```

```
## # A tibble: 1 x 3
##    mean standard_dev variance
##   <dbl>        <dbl>    <dbl>
## 1  1.92        0.554    0.307
```
]
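
The other summary functions slot into `summarise()` in exactly the same way. A small sketch on made-up scores (`toy_anxiety` is hypothetical, not the real data):

```r
library(dplyr)

# Made-up anxiety scores for illustration only
toy_anxiety <- tibble(SA = c(1.0, 1.5, 2.0, 2.5, 4.0))

# Each argument creates one column in the one-row summary
toy_summary <- summarise(toy_anxiety,
                         median_SA = median(SA),
                         iqr_SA    = IQR(SA),
                         min_SA    = min(SA),
                         max_SA    = max(SA))
toy_summary
```
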

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Grouping observations

`group_by()` is used to organise data frames into groups according to categorical variables.

```r
grouped_crime <- group_by(crime, sex, victim_crime)
grouped_crime
```

```
## # A tibble: 301 x 15
## # Groups:   sex, victim_crime [4]
##    Participant   sex     age victim_crime     H     E     X     A     C     O    SA
##    <chr>         <chr> <dbl> <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 R_01TjXgC191~ male     55 yes            3.7   3     3.4   3.9   3.2   3.6  1.15
##  2 R_0dN5YeULcy~ fema~    20 no             2.5   3.1   2.5   2.4   2.2   3.1  2.05
##  3 R_0DPiPYWhnc~ male     57 yes            2.6   3.1   3.3   3.1   4.3   2.8  2   
##  4 R_0f7bSsH6Up~ male     19 no             3.5   1.8   3.3   3.4   2.1   2.7  1.55
##  5 R_0rov2RoSkP~ fema~    20 no             3.3   3.4   3.9   3.2   2.8   3.9  1.3 
##  6 R_0wioqGERxE~ fema~    20 no             2.6   2.6   3     2.6   2.9   3.4  2.55
##  7 R_0wRO8lNe0k~ male     34 yes            3.2   2.5   3.2   2.8   4     3.2  1.85
##  8 R_116nEdFsGD~ fema~    19 no             2.9   4     3.9   4.2   3.7   1.9  1.1 
##  9 R_11ZmBd5VEk~ fema~    19 yes            3.4   3.4   3.3   3.4   3.2   3.2  2.2 
## 10 R_12i26Qzosm~ male     20 no             2.4   2.1   1.8   2.2   3.4   2.9  2.15
## # ... with 291 more rows, and 4 more variables: TA <dbl>, OHQ <dbl>, FoC <dbl>,
## #   Foc2 <dbl>
```

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Summarising groups

Once data is *grouped*, the most common thing to do is to `summarise()` those groups.

```r
summarise(grouped_crime,
          state_anxiety = mean(SA),
          sd_SA = sd(SA),
          var_SA = var(SA))
```

```
## # A tibble: 4 x 5
## # Groups:   sex [2]
##   sex    victim_crime state_anxiety sd_SA var_SA
##   <chr>  <chr>                <dbl> <dbl>  <dbl>
## 1 female no                    1.90 0.518  0.268
## 2 female yes                   1.98 0.643  0.413
## 3 male   no                    2.02 0.553  0.306
## 4 male   yes                   1.74 0.472  0.223
```

---
class: inverse, center, middle
# Removing unwanted rows

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Filtering rows

The `prison_pop` dataset has 22409 rows, but we don't need (or want) them all!

```r
unique(prison_pop$View)
```

```
## [1] "a Establishment*Sex*Age Group"     "b Establishment*Sex*Custody type" 
## [3] "c Establishment*Sex*Nationality"   "d Establishment*Sex*Offence group"
```

The data is actually *repeated* four times, but organised differently each time.

```
## # A tibble: 4 x 3
##   View                              total_pop num_entries
##   <chr>                                 <dbl>       <int>
## 1 a Establishment*Sex*Age Group        938760        2042
## 2 b Establishment*Sex*Custody type     939314        2740
## 3 c Establishment*Sex*Nationality      938841        3215
## 4 d Establishment*Sex*Offence group    936191       14412
```
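
The code behind this table is not shown on the slide. One way to produce a summary like it is to group by `View`, then sum and count within each group. This sketch uses a small made-up tibble (`toy_prison` is hypothetical) and assumes the totals ignore missing values.

```r
library(dplyr)

# Made-up data: two "views" of the same 150 people, split differently
toy_prison <- tibble(
  View       = c("a", "a", "b", "b", "b"),
  Population = c(100, 50, 60, NA, 90)
)

# Group by View, then total the population and count the rows in each group
by_view <- group_by(toy_prison, View)
view_totals <- summarise(by_view,
                         total_pop   = sum(Population, na.rm = TRUE),
                         num_entries = n())
view_totals
```

In the toy data both views add up to the same total even though they contain different numbers of rows, which is the pattern to look for in the real table.
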

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Filtering rows
If we started investigating the data without accounting for this repetition, the results would be misleading, because every establishment's population appears once per view.

.pull-left[

```r
ggplot(prison_pop, aes(x = Population)) +
  geom_histogram(binwidth = 100)
```

![](Week-6-data-wrangling_files/figure-html/unnamed-chunk-16-1.png)
]
.pull-right[

```r
ggplot(prison_pop, aes(x = Population)) +
  geom_histogram(binwidth = 100) + facet_wrap(~View)
```

![](Week-6-data-wrangling_files/figure-html/unnamed-chunk-17-1.png)
]

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Filtering rows

We can use the `filter()` function to select only the rows we're interested in, using *logical conditions* and *relational operators*.

```r
filter(prison_pop,
       View == "a Establishment*Sex*Age Group")
```

```
## # A tibble: 2,042 x 6
##    View           Date   Establishment Sex    `Age / Custody / National~ Population
##    <chr>          <chr>  <chr>         <chr>  <chr>                           <dbl>
##  1 a Establishme~ 2015-~ Altcourse     Male   Adults (21+)                      922
##  2 a Establishme~ 2015-~ Altcourse     Male   Juveniles and Young Adult~        169
##  3 a Establishme~ 2015-~ Ashfield      Male   Adults (21+)                      389
##  4 a Establishme~ 2015-~ Askham Grange Female Adults (21+)                       NA
##  5 a Establishme~ 2015-~ Askham Grange Female Juveniles and Young Adult~         NA
##  6 a Establishme~ 2015-~ Aylesbury     Male   Adults (21+)                      113
##  7 a Establishme~ 2015-~ Aylesbury     Male   Juveniles and Young Adult~        268
##  8 a Establishme~ 2015-~ Bedford       Male   Adults (21+)                      459
##  9 a Establishme~ 2015-~ Bedford       Male   Juveniles and Young Adult~         30
## 10 a Establishme~ 2015-~ Belmarsh      Male   Adults (21+)                      794
## # ... with 2,032 more rows
```

---
# Relational operators

Relational operators compare two (or more) things and return a **logical** value (i.e. TRUE/FALSE).

|Operator|Meaning| Example|
|---|------------------|---|
|>  | Greater than    |5 > 4|
|>= | Greater than or equal to| 4 >= 4|
|<  | Less than | Population < 400|
|<= | Less than or equal to | Population <= 400|
|== | Exactly equal to | Sex == "Male"| 
|!= | Not equal to | Establishment != "Ashfield"|
|%in%| Is contained in| Establishment %in% c("Bedford", "Oakwood")|
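
These operators work element by element on vectors, which is why they can be used on whole columns. A quick sketch with made-up values (not taken from the prison data):

```r
# Made-up values for illustration only
Population    <- c(150, 400, 900)
Establishment <- c("Bedford", "Ashfield", "Oakwood")

under_400    <- Population < 400                             # TRUE FALSE FALSE
at_most_400  <- Population <= 400                            # TRUE TRUE FALSE
not_ashfield <- Establishment != "Ashfield"                  # TRUE FALSE TRUE
in_set       <- Establishment %in% c("Bedford", "Oakwood")   # TRUE FALSE TRUE

under_400
at_most_400
not_ashfield
in_set
```
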

---
# Logical operators

Logical operators can be used to combine multiple relational operators or *negate* a relational operator.

|Operator| Meaning| Example|
|-|-|-|
|&| AND| Population < 1000 & Sex == "Male"|
|&#124;| OR| Population < 200 &vert; Population > 500|
|!| NOT| !(Establishment %in% c("Bedford", "Oakwood")) |
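
Like the relational operators, these work element by element. A quick sketch with made-up values:

```r
# Made-up values for illustration only
Population <- c(150, 400, 900)
Sex        <- c("Male", "Female", "Male")

# AND: TRUE only where both conditions are TRUE
small_and_male <- Population < 1000 & Sex == "Male"      # TRUE FALSE TRUE

# OR: TRUE where at least one condition is TRUE
very_small_or_large <- Population < 200 | Population > 500   # TRUE FALSE TRUE

# NOT: flips TRUE to FALSE and FALSE to TRUE
not_male <- !(Sex == "Male")                             # FALSE TRUE FALSE

small_and_male
very_small_or_large
not_male
```
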

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Filtering rows

We can have multiple *conditions* for selection with `filter()`.

Suppose we only wanted to include rows where Population is over 300 but under 600.

```r
filter(prison_pop,
       View == "a Establishment*Sex*Age Group",
       Population > 300 & Population < 600)
```

```
## # A tibble: 487 x 6
##    View           Date   Establishment Sex    `Age / Custody / National~ Population
##    <chr>          <chr>  <chr>         <chr>  <chr>                           <dbl>
##  1 a Establishme~ 2015-~ Ashfield      Male   Adults (21+)                      389
##  2 a Establishme~ 2015-~ Bedford       Male   Adults (21+)                      459
##  3 a Establishme~ 2015-~ Brinsford     Male   Juveniles and Young Adult~        349
##  4 a Establishme~ 2015-~ Bristol       Male   Adults (21+)                      553
##  5 a Establishme~ 2015-~ Bronzefield   Female Adults (21+)                      459
##  6 a Establishme~ 2015-~ Buckley Hall  Male   Adults (21+)                      440
##  7 a Establishme~ 2015-~ Coldingley    Male   Adults (21+)                      515
##  8 a Establishme~ 2015-~ Deerbolt      Male   Juveniles and Young Adult~        311
##  9 a Establishme~ 2015-~ Eastwood Park Female Adults (21+)                      331
## 10 a Establishme~ 2015-~ Erlestoke     Male   Adults (21+)                      514
## # ... with 477 more rows
```
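
Conditions separated by commas in `filter()` are combined with AND, so a comma and `&` give the same rows. A minimal sketch on a made-up tibble (`toy_prison` is hypothetical):

```r
library(dplyr)

# Made-up data for illustration only
toy_prison <- tibble(
  Establishment = c("A", "B", "C", "D", "E"),
  Population    = c(250, 389, 459, 794, NA)
)

# Two conditions separated by a comma...
with_comma <- filter(toy_prison, Population > 300, Population < 600)

# ...and the same two conditions joined with &
with_and <- filter(toy_prison, Population > 300 & Population < 600)

with_comma
with_and
```

Note that the row with a missing `Population` is dropped too: `filter()` keeps only rows where the condition is `TRUE`.
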

---
class: inverse, middle, center
# Putting it all together

---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Pipes

Often you want to conduct several steps, one after the other.

You could do this using objects to store each intermediate step.

```r
temp_pris <- filter(prison_pop,
                    View == "a Establishment*Sex*Age Group",
                    Date == "2015-06")
temp_pris <- group_by(temp_pris,
                      Sex,
                      `Age / Custody / Nationality / Offence Group`)
temp_pris <- summarise(temp_pris,
                       mean_pop = mean(Population, na.rm = TRUE), 
                       median_pop = median(Population, na.rm = TRUE),
                       total_pop = sum(Population, na.rm = TRUE),
                       max_pop = max(Population, na.rm = TRUE))
```
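
The `na.rm = TRUE` argument tells each summary function to drop missing values before calculating. Without it, a single `NA` makes the whole result `NA`. A quick sketch with made-up numbers:

```r
# Made-up populations, one of them missing
pops <- c(922, 169, NA, 389)

mean_with_na    <- mean(pops)                 # NA, because one value is missing
mean_without_na <- mean(pops, na.rm = TRUE)   # mean of the three non-missing values

mean_with_na
mean_without_na
```
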
---
background-image: url(images/05/dplyr-logo.png)
background-size: 6%
background-position: 90% 5%

# Pipes

A simpler way is to use *pipes* (`%>%`).

A pipe can be read as meaning "AND THEN": it passes the result on its left to the function on its right as the first argument.

```r
prison_pop %>%
  filter(View == "a Establishment*Sex*Age Group",
         Date == "2015-06") %>%
  group_by(Sex, `Age / Custody / Nationality / Offence Group`) %>%
  summarise(mean_pop = mean(Population, na.rm = TRUE), 
            median_pop = median(Population, na.rm = TRUE),
            total_pop = sum(Population, na.rm = TRUE),
            max_pop = max(Population, na.rm = TRUE))
```

```
## # A tibble: 4 x 6
## # Groups:   Sex [2]
##   Sex    `Age / Custody / Nationality / Offe~ mean_pop median_pop total_pop max_pop
##   <chr>  <chr>                                   <dbl>      <dbl>     <dbl>   <dbl>
## 1 Female Adults (21+)                            356          333      3560     480
## 2 Female Juveniles and Young Adults (15-20)       18.6         19       167      35
## 3 Male   Adults (21+)                            717.         677     76730    1587
## 4 Male   Juveniles and Young Adults (15-20)      101.          54      5559     490
```
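
To see that a pipe only changes how the code is written, not what it does, here is a small sketch comparing a nested call with its piped equivalent. It uses a made-up tibble (`toy_prison` is hypothetical):

```r
library(dplyr)

# Made-up data for illustration only
toy_prison <- tibble(
  Sex        = c("Male", "Male", "Female"),
  Population = c(922, 169, 459)
)

# Nested calls: read from the inside out
nested <- summarise(filter(toy_prison, Sex == "Male"),
                    total_pop = sum(Population))

# Piped: read from top to bottom, "and then"
piped <- toy_prison %>%
  filter(Sex == "Male") %>%
  summarise(total_pop = sum(Population))

nested
piped
```
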

---
# Reading materials

## Revision

For revision of this week's concepts, see the *Data transformation* chapter of R for Data Science.

For practice, use the "Work with Data" RStudio cloud primer.

## Next week

Discovering Statistics using R (Field et al.)
  - Chapter 9, Comparing two means
  - Chapter 5, Exploring assumptions (additional)