class: left, middle, inverse, title-slide

# Making better spaghetti (plots)
## Exploring the individuals in longitudinal data with the brolgar package
### Nicholas Tierney, Monash University
### YSC, Canberra
Wednesday 2nd October, 2019
bit.ly/ysc-njt
nj_tierney
---
layout: true

<div class="my-footer"><span>bit.ly/ysc-njt • @nj_tierney</span></div>

---
class: inverse, middle

# What is longitudinal data?

.huge[
> Something observed sequentially over time
]

---

# What is longitudinal data?

.large[
```
## # A tsibble: 1 x 4 [!]
## # Key:       country [1]
##   country    year height_cm continent
##   <chr>     <dbl>     <dbl> <chr>
## 1 Australia  1910      173. Oceania
```
]

---

# What is longitudinal data?

.large[
```
## # A tsibble: 2 x 4 [!]
## # Key:       country [1]
##   country    year height_cm continent
##   <chr>     <dbl>     <dbl> <chr>
## 1 Australia  1910      173. Oceania
## 2 Australia  1920      173. Oceania
```
]

---

# What is longitudinal data?

.large[
```
## # A tsibble: 3 x 4 [!]
## # Key:       country [1]
##   country    year height_cm continent
##   <chr>     <dbl>     <dbl> <chr>
## 1 Australia  1910      173. Oceania
## 2 Australia  1920      173. Oceania
## 3 Australia  1960      176. Oceania
```
]

---

# What is longitudinal data?

.large[
```
## # A tsibble: 4 x 4 [!]
## # Key:       country [1]
##   country    year height_cm continent
##   <chr>     <dbl>     <dbl> <chr>
## 1 Australia  1910      173. Oceania
## 2 Australia  1920      173. Oceania
## 3 Australia  1960      176. Oceania
## 4 Australia  1970      178. Oceania
```
]

---
class: center, middle, inverse

.huge[
But we are **statisticians**: let's **visualise**
]

---

<img src="figures/reveal-height-1.gif" width="150%" style="display: block; margin: auto;" />

---

<img src="figures/gg-example-1.png" width="150%" style="display: block; margin: auto;" />

---

# All of Australia

<img src="figures/gg-all-australia-1.png" width="936" style="display: block; margin: auto;" />

---

# ...And New Zealand

<img src="figures/gg-show-a-few-countries-1.png" width="936" style="display: block; margin: auto;" />

---

# ... And Afghanistan and Albania

<img src="figures/sample-more-heights-1.png" width="936" style="display: block; margin: auto;" />

---

# And the rest?

<img src="figures/animate-all-data-1.gif" style="display: block; margin: auto;" />

---

# And the rest?
<img src="figures/gg-show-all-1.png" width="936" style="display: block; margin: auto;" />

---

<img src="gifs/noodle-explode.gif" width="50%" style="display: block; margin: auto;" />

---

# Does transparency help?

<img src="figures/gg-show-all-w-alpha-1.png" width="936" style="display: block; margin: auto;" />

---

# Does transparency + a model help?

<img src="figures/gg-show-all-w-model-1.png" width="936" style="display: block; margin: auto;" />

---
class: inverse, middle, center

.vhuge[
I've got ~~99 problems~~ **153 countries** but I can't see **anything**
]

---
class: middle, center

<iframe width="1120" height="630" src="https://www.youtube.com/embed/UerBCXHKJ5s?start=15" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
class: inverse, middle, center

.huge[
Problem #1: How do I look at **some** of the data?
]

--

.huge[
Problem #2: How do I find **interesting** observations?
]

---

# Introducing `brolgar`:

.pull-left.large[
* **br**owsing
* **o**ver
* **l**ongitudinal data
* **g**raphically, and
* **a**nalytically, in
* **r**
]

.pull-right[
<img src="imgs/brolga-bird.jpg" width="569" style="display: block; margin: auto;" />
]

???

* It's a crane, it fishes, and it's a native Australian bird

---

<img src="figures/gg-remind-spaghetti-1.png" width="200%" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# What is longitudinal data?

.vlarge[
> Something observed sequentially over time
]

---
class: inverse, middle, center

# What is longitudinal data?
.vlarge[
> ~~Something~~ **Anything that is** observed sequentially over time **is a time series**
]

--

.large[
[-- Rob Hyndman and George Athanasopoulos, Forecasting: Principles and Practice](https://otexts.com/fpp2/data-methods.html)
]

---

# Longitudinal data as a time series

<img src="https://tsibble.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

```r
heights <- as_tsibble(heights,
                      index = year,
                      key = country,
*                     regular = FALSE)
```

1. **index**: Your time variable
2. **key**: Variable(s) defining individual groups (or series)

`1. + 2.` determine distinct rows in a tsibble.

(From Earo Wang's talk: [Melt the clock](https://slides.earo.me/rstudioconf19/#8))

---

.large[
```
## # A tsibble: 1,499 x 4 [!]
## # Key:       country [153]
##   country      year height_cm continent
##   <chr>       <dbl>     <dbl> <chr>
## 1 Afghanistan  1870      168. Asia
## 2 Afghanistan  1880      166. Asia
## 3 Afghanistan  1930      167. Asia
## 4 Afghanistan  1990      167. Asia
## 5 Afghanistan  2000      161. Asia
## 6 Albania      1880      170. Europe
## # … with 1,493 more rows
```
]

---
class: inverse, middle, center

.huge[
Remember: **key** = variable(s) defining individual groups (or series)
]

---

# `sample_n_keys()` to sample ... **keys**

```r
heights %>% sample_n_keys(5)
```

```
## # A tsibble: 24 x 4 [!]
## # Key:       country [5]
##   country  year height_cm continent
##   <chr>   <dbl>     <dbl> <chr>
## 1 Eritrea  1860      166. Africa
## 2 Eritrea  1880      165. Africa
## 3 Eritrea  1930      164  Africa
## 4 Eritrea  1950      157. Africa
## 5 Eritrea  1960      156. Africa
## 6 Eritrea  1970      156  Africa
## # … with 18 more rows
```

---

# `sample_n_keys()` to sample ... **keys**

<img src="figures/ggplot-sample-keys-1.png" width="936" style="display: block; margin: auto;" />

---

# `facet_sample()`: See more individuals

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line()
```

<img src="figures/gg-facet-sample-all-1.png" width="60%" style="display: block; margin: auto;" />

---

# `facet_sample()`: See more individuals

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
* facet_sample()
```

---

# `facet_sample()`: See more individuals

<img src="figures/print-gg-facet-sample-1.png" width="936" style="display: block; margin: auto;" />

---

# `facet_strata()`: See all individuals

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
* facet_strata()
```

---

# `facet_strata()`: See all individuals

<img src="figures/print-gg-facet-strata-1.png" width="936" style="display: block; margin: auto;" />

---

## `facet_strata(along = -year)`: see all individuals **along** some variable

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
* facet_strata(along = -year)
```

---

## `facet_strata(along = -year)`: see all individuals **along** some variable

<img src="figures/print-gg-facet-strata-along-1.png" width="936" style="display: block; margin: auto;" />

---

## Problem #1: How do I look at some of the data?

--

.large[
`as_tsibble()` `sample_n_keys()` `facet_sample()` `facet_strata()`
]

---

## ~~Problem #1: How do I look at some of the data?~~

.large[
`as_tsibble()` `sample_n_keys()` `facet_sample()` `facet_strata()`
]

---

## Problem #2: How do I find **interesting** observations?

<img src="figures/quote-interesting-obs-1.png" width="936" style="display: block; margin: auto;" />

---
class: inverse, center, middle

.huge[
Define interesting?
]

---

## Identify features: one per **key**

<img src="https://feasts.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

```r
heights %>%
* features(height_cm,
*          feat_five_num)
```

```
## # A tibble: 153 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 147 more rows
```

---

## Identify features: summarise down to one observation

<img src="figures/anim-line-flat-max-1.gif" style="display: block; margin: auto;" />

---

## Identify features: summarise down to one observation

<img src="figures/show-line-range-point-1.png" width="936" style="display: block; margin: auto;" />

---

## Identify important features and decide how to filter

<img src="figures/gg-show-point-1.png" width="936" style="display: block; margin: auto;" />

---

## Identify important features and decide how to filter

<img src="figures/gg-show-red-points-1.png" width="936" style="display: block; margin: auto;" />

---

## Identify important features and decide how to filter

<img src="figures/gg-just-red-points-1.png" width="936" style="display: block; margin: auto;" />

---

## Join this feature back to the data

<img src="figures/gg-join-red-1.png" width="936" style="display: block; margin: auto;" />

---

## Join this feature back to the data

<img src="figures/gg-join-red-show-all-1.png" width="936" style="display: block; margin: auto;" />

---

## 🎉 Countries with smallest and largest max height

<img src="figures/show-red-all-again-1.png" width="936" style="display: block; margin: auto;" />

---
class: inverse, middle, center

.vhuge[
Let's see that **one more time**, but with the data
]

---

## Identify features: summarise down to one observation

```
## # A tsibble: 1,499 x 4 [!]
## # Key:       country [153]
##    country      year height_cm continent
##    <chr>       <dbl>     <dbl> <chr>
##  1 Afghanistan  1870      168. Asia
##  2 Afghanistan  1880      166. Asia
##  3 Afghanistan  1930      167. Asia
##  4 Afghanistan  1990      167. Asia
##  5 Afghanistan  2000      161. Asia
##  6 Albania      1880      170. Europe
##  7 Albania      1890      170. Europe
##  8 Albania      1900      169. Europe
##  9 Albania      2000      168. Europe
## 10 Algeria      1910      169. Africa
## # … with 1,489 more rows
```

---

## Identify features: summarise down to one observation

```
## # A tibble: 153 x 6
##    country       min   q25   med   q75   max
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  161.  164.  167.  168.  168.
##  2 Albania      168.  168.  170.  170.  170.
##  3 Algeria      166.  168.  169   170.  171.
##  4 Angola       159.  160.  167.  168.  169.
##  5 Argentina    167.  168.  168.  170.  174.
##  6 Armenia      164.  166.  169.  172.  172.
##  7 Australia    170   171.  172.  173.  178.
##  8 Austria      162.  164.  167.  169.  179.
##  9 Azerbaijan   170.  171.  172.  172.  172.
## 10 Bahrain      161.  161.  164.  164.  164
## # … with 143 more rows
```

---

## Identify important features and decide how to filter

```r
heights_five <- heights %>% features(height_cm, feat_five_num)

heights_five %>%
  filter(max == max(max) | max == min(max))
```

```
## # A tibble: 2 x 6
##   country            min   q25   med   q75   max
##   <chr>            <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark           165.  168.  170.  178.  183.
## 2 Papua New Guinea  152.  152.  156.  160.  161.
```

---

## Join summaries back to data

```r
heights_five %>%
  filter(max == max(max) | max == min(max)) %>%
  left_join(heights, by = "country")
```

```
## # A tibble: 21 x 9
##    country   min   q25   med   q75   max  year height_cm continent
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <chr>
##  1 Denmark  165.  168.  170.  178.  183.  1820      167. Europe
##  2 Denmark  165.  168.  170.  178.  183.  1830      165. Europe
##  3 Denmark  165.  168.  170.  178.  183.  1850      167. Europe
##  4 Denmark  165.  168.  170.  178.  183.  1860      168. Europe
##  5 Denmark  165.  168.  170.  178.  183.  1870      168. Europe
##  6 Denmark  165.  168.  170.  178.  183.  1880      170. Europe
##  7 Denmark  165.  168.  170.  178.  183.  1890      169. Europe
##  8 Denmark  165.  168.  170.  178.  183.  1900      170. Europe
##  9 Denmark  165.  168.  170.  178.  183.  1910      170  Europe
## 10 Denmark  165.  168.  170.  178.  183.  1920      174. Europe
## # … with 11 more rows
```

---
class: middle, center

# Other available `features()` in `brolgar`

---

# What is the range of the data? `feat_ranges`

```r
heights %>%
  features(height_cm, feat_ranges)
```

```
## # A tibble: 153 x 5
##    country       min   max range_diff   iqr
##    <chr>       <dbl> <dbl>      <dbl> <dbl>
##  1 Afghanistan  161.  168.       7     3.27
##  2 Albania      168.  170.       2.20  1.53
##  3 Algeria      166.  171.       5.06  2.15
##  4 Angola       159.  169.      10.5   7.87
##  5 Argentina    167.  174.       7     2.21
##  6 Armenia      164.  172.       8.82  5.30
##  7 Australia    170   178.       8.4   2.58
##  8 Austria      162.  179.      17.2   5.35
##  9 Azerbaijan   170.  172.       1.97  1.12
## 10 Bahrain      161.  164        3.3   2.75
## # … with 143 more rows
```

---

# Does my data only increase or decrease? `feat_monotonic`

```r
heights %>%
  features(height_cm, feat_monotonic)
```

```
## # A tibble: 153 x 5
##    country     increase decrease unvary monotonic
##    <chr>       <lgl>    <lgl>    <lgl>  <lgl>
##  1 Afghanistan FALSE    FALSE    FALSE  FALSE
##  2 Albania     FALSE    TRUE     FALSE  TRUE
##  3 Algeria     FALSE    FALSE    FALSE  FALSE
##  4 Angola      FALSE    FALSE    FALSE  FALSE
##  5 Argentina   FALSE    FALSE    FALSE  FALSE
##  6 Armenia     FALSE    FALSE    FALSE  FALSE
##  7 Australia   FALSE    FALSE    FALSE  FALSE
##  8 Austria     FALSE    FALSE    FALSE  FALSE
##  9 Azerbaijan  FALSE    FALSE    FALSE  FALSE
## 10 Bahrain     TRUE     FALSE    FALSE  TRUE
## # … with 143 more rows
```

---

# What is the spread of my data? `feat_spread`

```r
heights %>%
  features(height_cm, feat_spread)
```

```
## # A tibble: 153 x 5
##    country        var    sd   mad   iqr
##    <chr>        <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  7.20  2.68  1.65   3.27
##  2 Albania      0.950 0.975 0.667  1.53
##  3 Algeria      3.30  1.82  0.741  2.15
##  4 Angola      16.9   4.12  3.11   7.87
##  5 Argentina    2.89  1.70  1.36   2.21
##  6 Armenia     10.6   3.26  3.60   5.30
##  7 Australia    7.63  2.76  1.66   2.58
##  8 Austria     26.6   5.16  3.93   5.35
##  9 Azerbaijan   0.516 0.718 0.621  1.12
## 10 Bahrain      3.42  1.85  0.297  2.75
## # … with 143 more rows
```

---

# Take homes

.large[
1. Longitudinal data is a time series
2. Specify structure once
3. Use `facet_sample()` / `facet_strata()` to look at data
4. Summarise with `features` to find interesting observations
5. Reconnect summaries to data with a **left join**
]

---

# Thanks

.large[
- Di Cook
- Tania Prvan
- Stuart Lee
- Mitchell O'Hara-Wild
- Earo Wang
- Rob Hyndman
- Miles McBain
- Monash University
]

---

# Resources

.large[
- [feasts](http://feasts.tidyverts.org/)
- [tsibble](http://tsibble.tidyverts.org/)
- [Time series graphics using feasts](https://robjhyndman.com/hyndsight/feasts/)
- [Feature-based time series analysis](https://robjhyndman.com/hyndsight/fbtsa/)
]

---

# Colophon

.large[
- Slides made using [xaringan](https://github.com/yihui/xaringan)
- Extended with [xaringanthemer](https://github.com/gadenbuie/xaringanthemer)
- Colours taken + modified from [lorikeet theme from ochRe](https://github.com/ropenscilabs/ochRe)
- Header font is **Josefin Sans**
- Body text font is **Montserrat**
- Code font is **Fira Mono**
]

---

# Learning more

.large[
[brolgar.njtierney.com](http://brolgar.njtierney.com/)
[bit.ly/ysc-njt](https://bit.ly/ysc-njt)
nj_tierney
njtierney
nicholas.tierney@gmail.com
]

---

.vhuge[
bonus round 🎉 💃 🎉
]

---

## Identify features: summarise down to one observation (variance)

<img src="figures/show-line-range-1.gif" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# Example: What is the growth of countries like?

---

# `key_slope`: Fit a linear model to each key

```r
heights_slope <- key_slope(heights, height_cm ~ year)
heights_slope
```

```
## # A tibble: 153 x 3
##    country     .intercept .slope_year
##    <chr>            <dbl>       <dbl>
##  1 Afghanistan      217.      -0.0263
##  2 Albania          202.      -0.0170
##  3 Algeria          111.       0.0297
##  4 Angola            43.9      0.0648
##  5 Argentina        147.       0.0117
##  6 Armenia           87.9      0.0419
##  7 Australia         46.1      0.0665
##  8 Austria           38.2      0.0695
##  9 Azerbaijan       150.       0.0111
## 10 Bahrain         -157.       0.165
## # … with 143 more rows
```

---

# Who is similar to the summary?

```r
summary(heights_slope$.slope_year)
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
## -0.10246  0.02104  0.04025  0.04630  0.06858  0.32150        9
```

--

.vlarge[
which keys are **nearest** to the summary statistics of the slope?
]

---

# `keys_near()`

```r
heights_slope_near <- heights_slope %>%
  keys_near(key = country,
            var = .slope_year)
```

---

# `keys_near()`

```
## # A tibble: 6 x 5
##   country    .slope_year stat  stat_value stat_diff
##   <chr>            <dbl> <fct>      <dbl>     <dbl>
## 1 Eritrea        -0.102  min      -0.102   0
## 2 Tajikistan      0.0199 q_25      0.0205  0.000632
## 3 Mali            0.0401 med       0.0403  0.000120
## 4 Spain           0.0404 med       0.0403  0.000120
## 5 Austria         0.0695 q_75      0.0690  0.000515
## 6 Burundi         0.321  max       0.321   0
```

---

# Join back to data

```r
heights_near <- heights_slope_near %>%
  left_join(heights, by = "country")

heights_near
```

```
## # A tibble: 67 x 8
##    country .slope_year stat  stat_value stat_diff  year height_cm continent
##    <chr>         <dbl> <fct>      <dbl>     <dbl> <dbl>     <dbl> <chr>
##  1 Eritrea     -0.102  min      -0.102   0         1860      166. Africa
##  2 Eritrea     -0.102  min      -0.102   0         1880      165. Africa
##  3 Eritrea     -0.102  min      -0.102   0         1930      164  Africa
##  4 Eritrea     -0.102  min      -0.102   0         1950      157. Africa
##  5 Eritrea     -0.102  min      -0.102   0         1960      156. Africa
##  6 Eritrea     -0.102  min      -0.102   0         1970      156  Africa
##  7 Tajiki…      0.0199 q_25      0.0205  0.000632  1860      165  Asia
##  8 Tajiki…      0.0199 q_25      0.0205  0.000632  1870      165. Asia
##  9 Tajiki…      0.0199 q_25      0.0205  0.000632  1880      167. Asia
## 10 Tajiki…      0.0199 q_25      0.0205  0.000632  1890      166. Asia
## # … with 57 more rows
```

---

<img src="figures/show-palap-1.png" width="936" style="display: block; margin: auto;" />

---

<img src="figures/show-palap-label-1.png" width="936" style="display: block; margin: auto;" />

---

.vhuge[
**End.**
]

---

<img src="gifs/dog-solve-problem.gif" width="50%" style="display: block; margin: auto;" />
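
---

# Putting it together (a sketch)

The "summarise, filter, join" steps from Problem #2 can be sketched end to end as below. This is a minimal sketch, not the code behind the figures in this talk. It assumes the `heights` tsibble that ships with `brolgar`, and that `features()` is available once `brolgar` is loaded; `heights_extremes` is an illustrative name, and the `colour` aesthetic is a choice made for this example.

```r
library(brolgar)
library(dplyr)
library(ggplot2)

# Step 1: summarise each key (country) down to one row of features
heights_five <- heights %>%
  features(height_cm, feat_five_num)

# Step 2: keep the countries with the smallest and largest maximum height,
# Step 3: then join those summaries back to the full series
heights_extremes <- heights_five %>%
  filter(max == max(max) | max == min(max)) %>%
  left_join(heights, by = "country")

# Step 4: plot only the series that were flagged as interesting
ggplot(heights_extremes,
       aes(x = year,
           y = height_cm,
           group = country,
           colour = country)) +
  geom_line()
```

Swapping `feat_five_num` for `feat_ranges`, `feat_spread` or `feat_monotonic` changes the definition of "interesting" without changing the shape of the pipeline.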