Making better spaghetti (plots)

class: left, middle, inverse, title-slide

# Making better spaghetti (plots)
## Exploring the individuals in longitudinal data with the <code>brolgar</code> package
### Nicholas Tierney, Monash University
### YSC, Canberra, Wednesday 2nd October, 2019 <a href="https://bit.ly/ysc-njt">bit.ly/ysc-njt</a> <a href="https://twitter.com/nj_tierney">@nj_tierney</a>

---

layout: true
<div class="my-footer">bit.ly/ysc-njt • @nj_tierney</div>

---
class: inverse, middle

# What is longitudinal data?

.huge[
> Something observed sequentially over time
]

---

# What is longitudinal data?

.large[

```
## # A tsibble: 1 x 4 [!]
## # Key: country [1]
## country year height_cm continent
## <chr> <dbl> <dbl> <chr> 
## 1 Australia 1910 173. Oceania
```
]

---

# What is longitudinal data?

.large[

```
## # A tsibble: 2 x 4 [!]
## # Key: country [1]
## country year height_cm continent
## <chr> <dbl> <dbl> <chr> 
## 1 Australia 1910 173. Oceania 
## 2 Australia 1920 173. Oceania
```
]

---

# What is longitudinal data?

.large[

```
## # A tsibble: 3 x 4 [!]
## # Key: country [1]
## country year height_cm continent
## <chr> <dbl> <dbl> <chr> 
## 1 Australia 1910 173. Oceania 
## 2 Australia 1920 173. Oceania 
## 3 Australia 1960 176. Oceania
```
]

---

# What is longitudinal data?

.large[

```
## # A tsibble: 4 x 4 [!]
## # Key: country [1]
## country year height_cm continent
## <chr> <dbl> <dbl> <chr> 
## 1 Australia 1910 173. Oceania 
## 2 Australia 1920 173. Oceania 
## 3 Australia 1960 176. Oceania 
## 4 Australia 1970 178. Oceania
```
]

---
class: center, middle, inverse

.huge[
But we are **statisticians**:

let's **visualise**
]

---


# All of Australia

---

# ... And New Zealand

---

# ... And Afghanistan and Albania

---

# And the rest?

---

# And the rest?

---

# Does transparency help?

---

# Does transparency + a model help?

---
class: inverse, middle, center

.vhuge[
I've got ~~99 problems~~ **153 countries** but I can't see **anything**
]

---
class: inverse, middle, center

.huge[
Problem #1: How do I look at **some** of the data?
]

.huge[
Problem #2: How do I find **interesting** observations?
]

---

# Introducing `brolgar`:

.pull-left.large[
* **br**owsing
* **o**ver
* **l**ongitudinal data 
* **g**raphically, and
* **a**nalytically, in
* **R**
]

.pull-right[
<img src="imgs/brolga-bird.jpg" width="569" style="display: block; margin: auto;" />
]

???

* It's a crane, it fishes, and it's a native Australian bird

---
class: inverse, middle, center

# What is longitudinal data?

.vlarge[
> Something observed sequentially over time
]
---
class: inverse, middle, center

# What is longitudinal data?

.vlarge[
> ~~Something~~ **Anything that is** observed sequentially over time **is a time series**
]

.large[
[-- Rob Hyndman and George Athanasopoulos,
Forecasting: Principles and Practice](https://otexts.com/fpp2/data-methods.html)
]

---

# Longitudinal data as a time series <img src="https://tsibble.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

```r
heights <- as_tsibble(heights,
                      index = year,
                      key = country,
*                     regular = FALSE)
```

1. **index**: Your time variable
2. **key**: Variable(s) defining individual groups (or series)

Together, the **index** and **key** (items 1 and 2) uniquely identify each row in a tsibble.

(From Earo Wang's talk: [Melt the clock](https://slides.earo.me/rstudioconf19/#8))
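
As a rough illustration of why the index and key matter, the base R sketch below rebuilds the four Australian rows from the earlier slides. The heights are rounded to whole centimetres and the name `heights_toy` is made up for this example. It checks that `country` plus `year` never repeats, and that the gaps between years are uneven, which is consistent with declaring the series irregular via `regular = FALSE`.

```r
# Toy data: the four Australian rows shown earlier (heights rounded).
heights_toy <- data.frame(
  country   = rep("Australia", 4),
  year      = c(1910, 1920, 1960, 1970),
  height_cm = c(173, 173, 176, 178),
  continent = rep("Oceania", 4)
)

# key + index should identify each row uniquely: 0 means no duplicates.
n_duplicates <- anyDuplicated(heights_toy[c("country", "year")])

# Gaps between consecutive observations; unequal gaps suggest an
# irregular series.
year_gaps  <- diff(heights_toy$year)
is_regular <- length(unique(year_gaps)) == 1

n_duplicates
year_gaps
is_regular
```

This only mimics the checks conceptually; `as_tsibble()` performs its own validation of the key and index.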

---

.large[

```
## # A tsibble: 1,499 x 4 [!]
## # Key: country [153]
## country year height_cm continent
## <chr> <dbl> <dbl> <chr> 
## 1 Afghanistan 1870 168. Asia 
## 2 Afghanistan 1880 166. Asia 
## 3 Afghanistan 1930 167. Asia 
## 4 Afghanistan 1990 167. Asia 
## 5 Afghanistan 2000 161. Asia 
## 6 Albania 1880 170. Europe 
## # … with 1,493 more rows
```
]

---
class: inverse, middle, center

.huge[
Remember:

**key**  = variable(s) defining individual groups (or series)
]

---

# `sample_n_keys()` to sample ... **keys**

```r
heights %>% sample_n_keys(5)
```

```
## # A tsibble: 24 x 4 [!]
## # Key: country [5]
## country year height_cm continent
## <chr> <dbl> <dbl> <chr> 
## 1 Eritrea 1860 166. Africa 
## 2 Eritrea 1880 165. Africa 
## 3 Eritrea 1930 164 Africa 
## 4 Eritrea 1950 157. Africa 
## 5 Eritrea 1960 156. Africa 
## 6 Eritrea 1970 156 Africa 
## # … with 18 more rows
```
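
Conceptually, sampling keys means drawing whole individuals rather than single rows. Here is a minimal base R sketch of that idea, using a hypothetical `heights_toy` data frame built from rounded values shown on these slides. It is not how `sample_n_keys()` is implemented, only an illustration of the result it aims for.

```r
# Toy data: three countries, heights rounded from the slides.
heights_toy <- data.frame(
  country   = c(rep("Afghanistan", 5), rep("Albania", 4), rep("Australia", 4)),
  year      = c(1870, 1880, 1930, 1990, 2000,
                1880, 1890, 1900, 2000,
                1910, 1920, 1960, 1970),
  height_cm = c(168, 166, 167, 167, 161,
                170, 170, 169, 168,
                173, 173, 176, 178)
)

set.seed(2019)

# Step 1: sample the keys (countries), not the rows.
sampled_keys <- sample(unique(heights_toy$country), size = 2)

# Step 2: keep every observation belonging to a sampled key.
heights_sampled <- heights_toy[heights_toy$country %in% sampled_keys, ]

table(heights_sampled$country)
```

Each sampled country keeps its full series, which is what makes the sample suitable for line plots.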

---

# `sample_n_keys()` to sample ... **keys**

---

# `facet_sample()`: See more individuals

```r
ggplot(heights, aes(x = year, 
                    y = height_cm, 
                    group = country)) + 
  geom_line() 
```

---

# `facet_sample()`: See more individuals

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) + 
  geom_line() + 
* facet_sample()
```

---

# `facet_sample()`: See more individuals

---

# `facet_strata()`: See all individuals

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) + 
  geom_line() + 
* facet_strata()
```
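
One way to think about `facet_strata()` is that every key is assigned to exactly one panel, so all individuals are shown once. The sketch below illustrates that assignment in base R. The panel count of 12 and the placeholder key names are assumptions for the example, and brolgar's own implementation may differ in detail.

```r
set.seed(2019)

n_keys   <- 153  # number of countries reported in the tsibble header
n_strata <- 12   # assumed number of panels for this sketch

# Placeholder key names standing in for the real country names.
keys <- paste0("country_", seq_len(n_keys))

# Repeat 1..n_strata until every key has a label, then shuffle the labels
# so that assignment to panels is random but group sizes stay balanced.
strata <- sample(rep_len(seq_len(n_strata), n_keys))

key_strata <- data.frame(key = keys, strata = strata)

# Number of keys per panel.
table(key_strata$strata)
```

Because the labels are recycled before shuffling, panel sizes differ by at most one key.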

---

# `facet_strata()`: See all individuals

---

## `facet_strata(along = -year)`: see all individuals **along** some variable

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) + 
  geom_line() + 
* facet_strata(along = -year)
```

---

## `facet_strata(along = -year)`: see all individuals **along** some variable

---

## Problem #1: How do I look at some of the data?

.large[

`as_tsibble()`

`sample_n_keys()`

`facet_sample()`

`facet_strata()`

]

---

## ~~Problem #1: How do I look at some of the data?~~

.large[

`as_tsibble()`

`sample_n_keys()`

`facet_sample()`

`facet_strata()`

]

---

## Problem #2: How do I find **interesting** observations?

---
class: inverse, center, middle

.huge[
Define interesting?
]

---

## Identify features: one per **key** <img src="https://feasts.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

```r
heights %>%
* features(height_cm,
*          feat_five_num)
```
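
To make "one per key" concrete, here is a minimal base R sketch that computes a five-number summary per country and returns one row for each. The data frame `heights_toy` and the helper `five_num()` are hypothetical names for this example, the heights are rounded from the slides, and `quantile()` is used with its default settings, which may not match the quantile method used by `feat_five_num`.

```r
# Toy data: two countries, heights rounded from the slides.
heights_toy <- data.frame(
  country   = c(rep("Afghanistan", 5), rep("Australia", 4)),
  year      = c(1870, 1880, 1930, 1990, 2000,
                1910, 1920, 1960, 1970),
  height_cm = c(168, 166, 167, 167, 161,
                173, 173, 176, 178)
)

# Hypothetical helper: five summary values for one series.
five_num <- function(x) {
  c(min = min(x),
    q25 = unname(quantile(x, 0.25)),
    med = median(x),
    q75 = unname(quantile(x, 0.75)),
    max = max(x))
}

# Split the measurements by key, summarise each piece, and bind the
# results into one row per key.
per_key   <- split(heights_toy$height_cm, heights_toy$country)
summaries <- do.call(rbind, lapply(per_key, five_num))

heights_five_toy <- data.frame(country = rownames(summaries),
                               summaries,
                               row.names = NULL)
heights_five_toy
```

The result has as many rows as there are keys, regardless of how many observations each key has.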

---

## Identify features: summarise down to one observation

---

## Identify features: summarise down to one observation

---

## Identify important features and decide how to filter

---

## Identify important features and decide how to filter

---

## Identify important features and decide how to filter

---

## Join this feature back to the data

---

## Join this feature back to the data

---

## 🎉 Countries with smallest and largest max height

---
class: inverse, middle, center

.vhuge[
Let's see that **one more time**, but with the data
]

---

## Identify features: summarise down to one observation

```
## # A tsibble: 1,499 x 4 [!]
## # Key: country [153]
## country year height_cm continent
## <chr> <dbl> <dbl> <chr> 
## 1 Afghanistan 1870 168. Asia 
## 2 Afghanistan 1880 166. Asia 
## 3 Afghanistan 1930 167. Asia 
## 4 Afghanistan 1990 167. Asia 
## 5 Afghanistan 2000 161. Asia 
## 6 Albania 1880 170. Europe 
## 7 Albania 1890 170. Europe 
## 8 Albania 1900 169. Europe 
## 9 Albania 2000 168. Europe 
## 10 Algeria 1910 169. Africa 
## # … with 1,489 more rows
```

---

## Identify features: summarise down to one observation

```
## # A tibble: 153 x 6
## country min q25 med q75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 161. 164. 167. 168. 168.
## 2 Albania 168. 168. 170. 170. 170.
## 3 Algeria 166. 168. 169 170. 171.
## 4 Angola 159. 160. 167. 168. 169.
## 5 Argentina 167. 168. 168. 170. 174.
## 6 Armenia 164. 166. 169. 172. 172.
## 7 Australia 170 171. 172. 173. 178.
## 8 Austria 162. 164. 167. 169. 179.
## 9 Azerbaijan 170. 171. 172. 172. 172.
## 10 Bahrain 161. 161. 164. 164. 164 
## # … with 143 more rows
```

---

## Identify important features and decide how to filter

```r
heights_five <- heights %>% features(height_cm, feat_five_num)

heights_five %>%
  filter(max == max(max) | max == min(max))
```

```
## # A tibble: 2 x 6
## country min q25 med q75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark 165. 168. 170. 178. 183.
## 2 Papua New Guinea 152. 152. 156. 160. 161.
```

---

## Join summaries back to data

```r
heights_five %>% 
  filter(max == max(max) | max == min(max)) %>% 
  left_join(heights, by = "country")
```

```
## # A tibble: 21 x 9
## country min q25 med q75 max year height_cm continent
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 
## 1 Denmark 165. 168. 170. 178. 183. 1820 167. Europe 
## 2 Denmark 165. 168. 170. 178. 183. 1830 165. Europe 
## 3 Denmark 165. 168. 170. 178. 183. 1850 167. Europe 
## 4 Denmark 165. 168. 170. 178. 183. 1860 168. Europe 
## 5 Denmark 165. 168. 170. 178. 183. 1870 168. Europe 
## 6 Denmark 165. 168. 170. 178. 183. 1880 170. Europe 
## 7 Denmark 165. 168. 170. 178. 183. 1890 169. Europe 
## 8 Denmark 165. 168. 170. 178. 183. 1900 170. Europe 
## 9 Denmark 165. 168. 170. 178. 183. 1910 170 Europe 
## 10 Denmark 165. 168. 170. 178. 183. 1920 174. Europe 
## # … with 11 more rows
```
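
The filter-then-join pattern can also be sketched in base R. The example below uses a hypothetical `heights_toy` data frame with values rounded from the slides. It uses `merge()` as a stand-in for `left_join()`. `merge()` is an inner join by default, but it gives the same rows here because every summarised country also appears in the data.

```r
# Toy data: three countries, heights rounded from the slides.
heights_toy <- data.frame(
  country   = c(rep("Afghanistan", 5), rep("Albania", 4), rep("Australia", 4)),
  year      = c(1870, 1880, 1930, 1990, 2000,
                1880, 1890, 1900, 2000,
                1910, 1920, 1960, 1970),
  height_cm = c(168, 166, 167, 167, 161,
                170, 170, 169, 168,
                173, 173, 176, 178)
)

# One summary row per key: the maximum height of each country.
max_by_country <- aggregate(height_cm ~ country, data = heights_toy, FUN = max)
names(max_by_country)[2] <- "max"

# Keep only the countries with the smallest and largest maximum.
extremes <- max_by_country[max_by_country$max %in% range(max_by_country$max), ]

# Join the summaries back to the full data to recover each series.
heights_extremes <- merge(extremes, heights_toy, by = "country")
heights_extremes
```

After the join, the summary value is repeated on every row of its country, just as `max` is in the output above.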

---
class: middle, center
# Other available `features()` in `brolgar`

---

# What is the range of the data? `feat_ranges`

```r
heights %>%
  features(height_cm, feat_ranges)
```

```
## # A tibble: 153 x 5
## country min max range_diff iqr
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 161. 168. 7 3.27
## 2 Albania 168. 170. 2.20 1.53
## 3 Algeria 166. 171. 5.06 2.15
## 4 Angola 159. 169. 10.5 7.87
## 5 Argentina 167. 174. 7 2.21
## 6 Armenia 164. 172. 8.82 5.30
## 7 Australia 170 178. 8.4 2.58
## 8 Austria 162. 179. 17.2 5.35
## 9 Azerbaijan 170. 172. 1.97 1.12
## 10 Bahrain 161. 164 3.3 2.75
## # … with 143 more rows
```

---

# Does my data only increase or decrease? `feat_monotonic`

```r
heights %>%
  features(height_cm, feat_monotonic)
```

```
## # A tibble: 153 x 5
## country increase decrease unvary monotonic
## <chr> <lgl> <lgl> <lgl> <lgl> 
## 1 Afghanistan FALSE FALSE FALSE FALSE 
## 2 Albania FALSE TRUE FALSE TRUE 
## 3 Algeria FALSE FALSE FALSE FALSE 
## 4 Angola FALSE FALSE FALSE FALSE 
## 5 Argentina FALSE FALSE FALSE FALSE 
## 6 Armenia FALSE FALSE FALSE FALSE 
## 7 Australia FALSE FALSE FALSE FALSE 
## 8 Austria FALSE FALSE FALSE FALSE 
## 9 Azerbaijan FALSE FALSE FALSE FALSE 
## 10 Bahrain TRUE FALSE FALSE TRUE 
## # … with 143 more rows
```
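
A hedged reading of these four flags, written as a small base R function. `monotonic_features()` is a hypothetical name, the input vectors are invented, and the strict inequalities are an assumption about how the flags are defined rather than a copy of brolgar's code.

```r
# Hypothetical helper: flag whether a series only rises, only falls,
# or never changes, based on the differences between neighbours.
monotonic_features <- function(x) {
  steps    <- diff(x)
  increase <- all(steps > 0)
  decrease <- all(steps < 0)
  c(increase  = increase,
    decrease  = decrease,
    unvary    = all(steps == 0),
    monotonic = increase || decrease)
}

monotonic_features(c(160, 162, 165))  # rises at every step
monotonic_features(c(170, 169, 168))  # falls at every step
monotonic_features(c(168, 166, 167))  # falls, then rises
```

Under this reading, a single reversal anywhere in the series is enough to make `monotonic` false.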

---

# What is the spread of my data? `feat_spread`

```r
heights %>%
  features(height_cm, feat_spread)
```

```
## # A tibble: 153 x 5
## country var sd mad iqr
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 7.20 2.68 1.65 3.27
## 2 Albania 0.950 0.975 0.667 1.53
## 3 Algeria 3.30 1.82 0.741 2.15
## 4 Angola 16.9 4.12 3.11 7.87
## 5 Argentina 2.89 1.70 1.36 2.21
## 6 Armenia 10.6 3.26 3.60 5.30
## 7 Australia 7.63 2.76 1.66 2.58
## 8 Austria 26.6 5.16 3.93 5.35
## 9 Azerbaijan 0.516 0.718 0.621 1.12
## 10 Bahrain 3.42 1.85 0.297 2.75
## # … with 143 more rows
```

---

# Take homes

.large[
1. Longitudinal data is a time series
2. Specify structure once
3. Use `facet_sample()` / `facet_strata()` to look at data
4. Summarise with `features()` to find interesting observations
5. Reconnect summaries to data with a **left join**
]

---

# Thanks

.large[
- Di Cook
- Tania Prvan
- Stuart Lee
- Mitchell O'Hara-Wild
- Earo Wang
- Rob Hyndman
- Miles McBain
- Monash University
]

---

# Resources

.large[
- [feasts](http://feasts.tidyverts.org/)
- [tsibble](http://tsibble.tidyverts.org/)
- [Time series graphics using feasts](https://robjhyndman.com/hyndsight/feasts/)
- [Feature-based time series analysis](https://robjhyndman.com/hyndsight/fbtsa/)
]

---

# Colophon

.large[
- Slides made using [xaringan](https://github.com/yihui/xaringan)
- Extended with [xaringanthemer](https://github.com/gadenbuie/xaringanthemer)
- Colours taken + modified from [lorikeet theme from ochRe](https://github.com/ropenscilabs/ochRe)
- Header font is **Josefin Sans**
- Body text font is **Montserrat**
- Code font is **Fira Mono**
]

---

# Learning more

.large[
[brolgar.njtierney.com](http://brolgar.njtierney.com/)

[bit.ly/ysc-njt](https://bit.ly/ysc-njt)

Twitter: @nj_tierney

GitHub: njtierney

nicholas.tierney@gmail.com

]

---

.vhuge[
bonus round
🎉 💃 🎉
]
 
---

## Identify features: summarise down to one observation (variance)

---
class: inverse, middle, center

# Example: What is the growth of countries like?

---

# `key_slope()`: Fit a linear model to each key

```r
heights_slope <- key_slope(heights, height_cm ~ year)
heights_slope
```

```
## # A tibble: 153 x 3
## country .intercept .slope_year
## <chr> <dbl> <dbl>
## 1 Afghanistan 217. -0.0263
## 2 Albania 202. -0.0170
## 3 Algeria 111. 0.0297
## 4 Angola 43.9 0.0648
## 5 Argentina 147. 0.0117
## 6 Armenia 87.9 0.0419
## 7 Australia 46.1 0.0665
## 8 Austria 38.2 0.0695
## 9 Azerbaijan 150. 0.0111
## 10 Bahrain -157. 0.165 
## # … with 143 more rows
```
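
The idea of fitting one linear model per key can be sketched in base R as follows. `heights_toy`, `fit_slope()` and `heights_slope_toy` are hypothetical names, and the heights are rounded from the slides, so the coefficients will not match the output above exactly.

```r
# Toy data: two countries, heights rounded from the slides.
heights_toy <- data.frame(
  country   = c(rep("Afghanistan", 5), rep("Australia", 4)),
  year      = c(1870, 1880, 1930, 1990, 2000,
                1910, 1920, 1960, 1970),
  height_cm = c(168, 166, 167, 167, 161,
                173, 173, 176, 178)
)

# Hypothetical helper: intercept and slope of height on year for one key.
fit_slope <- function(d) coef(lm(height_cm ~ year, data = d))

# Fit the model separately for each country; rows = keys, columns = coefficients.
coefs <- t(sapply(split(heights_toy, heights_toy$country), fit_slope))

heights_slope_toy <- data.frame(
  country     = rownames(coefs),
  .intercept  = unname(coefs[, "(Intercept)"]),
  .slope_year = unname(coefs[, "year"]),
  row.names   = NULL
)
heights_slope_toy
```

A negative slope means heights fell on average over the observed years, and a positive slope means they rose.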

---

# Who is similar to the summary?

```r
summary(heights_slope$.slope_year)
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -0.10246  0.02104  0.04025  0.04630  0.06858  0.32150        9
```

.vlarge[
Which keys are **nearest** to the summary statistics of the slope?
]

---

# `keys_near()`

```r
heights_slope_near <- heights_slope %>%
  keys_near(key = country,
            var = .slope_year)
```

---

# `keys_near()`

```
## # A tibble: 6 x 5
## country .slope_year stat stat_value stat_diff
## <chr> <dbl> <fct> <dbl> <dbl>
## 1 Eritrea -0.102 min -0.102 0 
## 2 Tajikistan 0.0199 q_25 0.0205 0.000632
## 3 Mali 0.0401 med 0.0403 0.000120
## 4 Spain 0.0404 med 0.0403 0.000120
## 5 Austria 0.0695 q_75 0.0690 0.000515
## 6 Burundi 0.321 max 0.321 0
```
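
The "nearest key" idea can be sketched in base R: for each summary statistic, pick the key whose value has the smallest absolute difference from it. The sketch uses only the six slopes printed above rather than all 153, and `nearest_key()` is a hypothetical helper, so this illustrates the idea and does not reproduce `keys_near()`.

```r
# The six slopes shown in the output above.
slopes <- c(Eritrea = -0.102, Tajikistan = 0.0199, Mali = 0.0401,
            Spain = 0.0404, Austria = 0.0695, Burundi = 0.321)

# Summary statistics to compare against (of these six values only).
targets <- c(min = min(slopes), med = median(slopes), max = max(slopes))

# Hypothetical helper: name of the key closest to a target value.
# which.min() returns only the first match, so ties are not reported.
nearest_key <- function(target) {
  names(slopes)[which.min(abs(slopes - target))]
}

nearest <- vapply(targets, nearest_key, character(1))
nearest
```

Mali and Spain sit almost equally far from the median, which is consistent with both appearing in the `med` rows above.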

---

# Join back to data

```r
heights_near <- heights_slope_near %>% 
  left_join(heights, by = "country")

heights_near
```

```
## # A tibble: 67 x 8
## country .slope_year stat stat_value stat_diff year height_cm continent
## <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <chr> 
## 1 Eritrea -0.102 min -0.102 0 1860 166. Africa 
## 2 Eritrea -0.102 min -0.102 0 1880 165. Africa 
## 3 Eritrea -0.102 min -0.102 0 1930 164 Africa 
## 4 Eritrea -0.102 min -0.102 0 1950 157. Africa 
## 5 Eritrea -0.102 min -0.102 0 1960 156. Africa 
## 6 Eritrea -0.102 min -0.102 0 1970 156 Africa 
## 7 Tajiki… 0.0199 q_25 0.0205 0.000632 1860 165 Asia 
## 8 Tajiki… 0.0199 q_25 0.0205 0.000632 1870 165. Asia 
## 9 Tajiki… 0.0199 q_25 0.0205 0.000632 1880 167. Asia 
## 10 Tajiki… 0.0199 q_25 0.0205 0.000632 1890 166. Asia 
## # … with 57 more rows
```

---

.vhuge[
**End.**
]
---