
Exploring and understanding the individual experience from longitudinal data, or…

How to make better spaghetti (plots)

Nicholas Tierney, Monash University

1 / 95

A Bit About Me

2 / 95

Background: Undergraduate

Undergraduate in Psychology

  • Statistics
  • Experiment Design
  • Cognitive Theory
  • Neurology
  • Humans

3 / 95

Background: PhD

  • "Ah, statistics, everything is black and white!"
  • "There's always an answer"
  • "data in, answer out"

4 / 95

Background: PhD

  • Data is really messy
  • Missing values are frustrating
  • How to explore data?

5 / 95

EDA: Why it's worth it

6 / 95

(My personal) motivation

A lot of research in new statistical methods, e.g., imputation, inference, prediction

Not much research on how we explore data

7 / 95

(My personal) motivation

Focus on building a bridge across a river. Less focus on how it is built, and the tools used.

8 / 95
  • I became very interested in how we explore our data - exploratory data analysis.

My research:

Design and improve tools for (exploratory) data analysis

9 / 95

10 / 95

Current work:

How to explore longitudinal data effectively

11 / 95

What is longitudinal data?

Something observed sequentially over time

12 / 95

What is longitudinal data?

country    year  height_cm
Australia  1910  173
Australia  1920  173
Australia  1960  176
Australia  1970  178
16 / 95

17 / 95

All of Australia

18 / 95

...And New Zealand

19 / 95

And the rest?

21 / 95

22 / 95

Problems:

  • Overplotting
  • We don't see the individuals
  • We could look at 144 individual plots, but this doesn't help.

23 / 95

Answers: Transparency?

24 / 95

Answers: Transparency + a model?

25 / 95
  • This helps reduce the overplotting
  • It's not that this is wrong, it is useful - but we lose the individuals
  • We only get the overall average. We don't get the rest of the information
  • How do we even get started?

But we forget about the individuals

26 / 95
  • The model might make some good overall predictions
  • But it can be really ill-suited for some individuals
  • Exploring this is somewhat clumsy - we need another way to explore

Problem #1: How do I look at some of the data?

Problem #2: How do I find interesting observations?

27 / 95

Introducing brolgar: brolgar.njtierney.com

  • browsing
  • over
  • longitudinal data
  • graphically, and
  • analytically, in
  • r

28 / 95
  • It's a crane, it fishes, and it's a native Australian bird

29 / 95

What is longitudinal data?

Something observed sequentially over time

30 / 95

Longitudinal data is a time series.

Anything that is observed sequentially over time is a time series

32 / 95

Longitudinal data as a time series

library(tsibble)

# Declare the time index and the key once; later steps reuse them.
heights <- as_tsibble(heights,
                      index = year,
                      key = country,
                      regular = FALSE)
  1. index: Your time variable
  2. key: Variable(s) defining individual groups (or series)

1. + 2. determine distinct rows in a tsibble.

(From Earo Wang's talk: Melt the clock)
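
To make "index + key determine distinct rows" concrete, here is a minimal base R sketch. It is not how tsibble checks this internally. The data frame `heights_toy` and the extra duplicated row are made up for illustration; only the four Australia values come from the slides.

```r
# Toy data in the same shape as the heights table on the earlier slides.
heights_toy <- data.frame(
  country   = c("Australia", "Australia", "Australia", "Australia"),
  year      = c(1910, 1920, 1960, 1970),
  height_cm = c(173, 173, 176, 178)
)

# A (key, index) pair should identify at most one row.
key_index <- heights_toy[, c("country", "year")]
has_duplicates <- anyDuplicated(key_index) > 0
has_duplicates

# A hypothetical second 1910 row for Australia breaks that rule.
bad <- rbind(
  heights_toy,
  data.frame(country = "Australia", year = 1910, height_cm = 174)
)
bad_has_duplicates <- anyDuplicated(bad[, c("country", "year")]) > 0
bad_has_duplicates
```

If the second check is TRUE, the key and index do not yet describe the data, and either the key needs another variable or the duplicates need resolving.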

33 / 95

Longitudinal data as a time series

Key Concepts:

Record important time series information once, and use it many times in other places

  • We add information about index + key:
    • Index = Year
    • Key = Country
34 / 95
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##   country      year height_cm
##   <chr>       <dbl>     <dbl>
## 1 Afghanistan  1870      168.
## 2 Afghanistan  1880      166.
## 3 Afghanistan  1930      167.
## 4 Afghanistan  1990      167.
## 5 Afghanistan  2000      161.
## 6 Albania      1880      170.
## # … with 1,484 more rows
35 / 95

Remember:

key = variable(s) defining individual groups (or series)

36 / 95

Problem #1: How do I look at some of the data?

Look at only a sample of the data:

37 / 95

Sample n rows with sample_n()

heights %>% sample_n(5)
## # A tsibble: 5 x 3 [!]
## # Key:       country [5]
##   country           year height_cm
##   <chr>            <dbl>     <dbl>
## 1 Cambodia          1860      165.
## 2 Bolivia           1890      164.
## 3 Macedonia         1930      169.
## 4 United States     1920      173.
## 5 Papua New Guinea  1880      152.
38 / 95

Sample n rows with sample_n()

## # A tsibble: 5 x 3 [!]
## # Key:       country [5]
##   country           year height_cm
##   <chr>            <dbl>     <dbl>
## 1 Cambodia          1860      165.
## 2 Bolivia           1890      164.
## 3 Macedonia         1930      169.
## 4 United States     1920      173.
## 5 Papua New Guinea  1880      152.

... but sampling should select random keys (the countries), not random rows of the data.

40 / 95

sample_n_keys() to sample ... keys

sample_n_keys(heights, 5)
## # A tsibble: 32 x 3 [!]
## # Key:       country [5]
##   country     year height_cm
##   <chr>      <dbl>     <dbl>
## 1 Congo, DRC  1810      163.
## 2 Congo, DRC  1870      166.
## 3 Congo, DRC  1880      163.
## 4 Congo, DRC  1890      163.
## 5 Congo, DRC  1910      165.
## 6 Congo, DRC  1920      163.
## # … with 26 more rows
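
A rough base R sketch of the idea behind sampling keys rather than rows. This is not brolgar's implementation of sample_n_keys(); the helper `sample_keys()`, the toy data, and the seed are all hypothetical.

```r
set.seed(2019)  # arbitrary seed, only so the sketch is reproducible

# Toy long-format data: 6 made-up countries, 4 observations each.
toy <- data.frame(
  country   = rep(paste0("country_", 1:6), each = 4),
  year      = rep(c(1900, 1910, 1920, 1930), times = 6),
  height_cm = round(rnorm(24, mean = 168, sd = 3), 1)
)

# Sample the keys first, then keep every row that belongs to those keys.
sample_keys <- function(data, key, n) {
  keys <- unique(data[[key]])
  chosen <- sample(keys, size = n)
  data[data[[key]] %in% chosen, , drop = FALSE]
}

sub <- sample_keys(toy, "country", 2)
length(unique(sub$country))  # number of countries kept
nrow(sub)                    # every row of each sampled country is kept
```

The point is the order of operations: the random draw happens on the keys, so each sampled individual keeps its whole series.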
41 / 95

Problem #1: How do I look at some of the data?

Look at subsamples

Sample keys with sample_n_keys()

Look at many subsamples

?

43 / 95

Portion out your spaghetti! 🍝 🍝 🍝 🍝

44 / 95

Look at one set of subsamples 🍝

45 / 95

Look at many subsamples 🍝 🍝

46 / 95

How to look at many subsamples

  • How many facets to look at? (2, 4, ... 16?)
  • How many keys per facet?
    • 144 keys into 16 facets = 9 each
  • Randomly pick 16 groups of size 9.
  • This might not look like much extra work, but it hits the distraction threshold quite quickly.
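
The manual steps in the list above might look something like this in base R. It is only a sketch of the by-hand approach, with made-up key names; `keys`, `facet_id`, and `assignment` are illustrative names, not part of any package.

```r
set.seed(1)  # arbitrary seed for reproducibility

keys <- paste0("key_", 1:144)            # stand-ins for the 144 countries
n_facets <- 16
n_per_facet <- length(keys) / n_facets   # 144 / 16 = 9

# rep(..., each = n) repeats every label n times in a row:
# 1 1 1 ... (9 times), 2 2 2 ... (9 times), and so on up to 16.
facet_id <- rep(seq_len(n_facets), each = n_per_facet)

# Shuffle the keys, then pair them with the facet labels.
assignment <- data.frame(key = sample(keys), facet = facet_id)

table(assignment$facet)  # count of keys in each facet
```

Even this small version needs a count of the keys, a division, a choice between `each`, `times`, and `length.out`, and a shuffle, which is the kind of detour the next slides describe.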
47 / 95
48 / 95

Distraction threshold (time to rabbit hole)

(Something I made up)

If solving a problem requires solving 3+ smaller problems

Your focus shifts from the current goal to something else.

You are distracted.

49 / 95
  • Task one

  • Task one being overshadowed slightly by minor task 1

  • Task one being overshadowed slightly by minor task 2
  • Task one being overshadowed slightly by minor task 3

Distraction threshold (time to rabbit hole)

I want to look at many subsamples of the data

How many keys are there?

How many facets do I want to look at?

How many keys per facet should I look at?

How do I ensure there are the same number of keys per plot?

What is rep, rep.int, and rep_len?

Do I want length.out or times?

50 / 95

51 / 95

Avoiding the rabbit hole

We can blame ourselves when we are distracted for not being better.

It's not that we should be better, rather with better tools we could be more efficient.

We need to make things as easy as reasonable, with the least amount of distraction.

52 / 95

Remove distraction by asking relevant questions

How many keys per facet?

How many plots do I want to look at?

facet_sample(
  n_per_facet = 3,
  n_facets = 9
)
53 / 95

54 / 95

facet_sample(): See more individuals

library(ggplot2)

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line()

55 / 95

facet_sample(): See more individuals

library(brolgar)  # provides facet_sample() and facet_strata()

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_sample()
56 / 95

facet_sample(): See more individuals

57 / 95

How to see all individuals?

58 / 95

How to see all individuals?

facet_strata()

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_strata()
58 / 95

facet_strata(): See all individuals

59 / 95

Can we re-order these facets in a meaningful way?

60 / 95

In asking these questions we can solve something else interesting

facet_strata(along = -year): see all individuals along some variable

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_strata(along = -year)
61 / 95

facet_strata(along = -year): see all individuals along some variable

62 / 95

Focus on answering relevant questions instead of the minutiae:

"How many lines per facet?"

"How many facets?"

facet_sample(
  n_per_facet = 10,
  n_facets = 12
)

"How many facets to put all the data in?"

"What to arrange the plots along?"

facet_strata(
  n_strata = 10,
  along = -year
)
63 / 95

facet_strata() & facet_sample() Under the hood

using sample_n_keys() & stratify_keys()

You can still get at data and do manipulations
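
As a rough illustration of what "putting every key into a stratum" means, here is a base R sketch. It is not how stratify_keys() is implemented; the toy data, the `strata` column name, and the random assignment are assumptions made for the example.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Toy data: 12 made-up individuals with 3 observations each.
toy <- data.frame(
  id    = rep(1:12, each = 3),
  year  = rep(c(1900, 1950, 2000), times = 12),
  value = round(runif(36, min = 150, max = 180), 1)
)

n_strata <- 4
ids <- unique(toy$id)

# Shuffle the keys and deal them out into n_strata equal-sized groups.
strata <- data.frame(
  id     = sample(ids),
  strata = rep_len(seq_len(n_strata), length(ids))
)

# Attach the stratum to every row; no rows are dropped.
toy_strata <- merge(toy, strata, by = "id")

table(strata$strata)  # keys per stratum
```

Because the result is an ordinary data frame with one extra column, it can be filtered, summarised, or plotted by stratum like any other variable.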

64 / 95

Problem #1: How do I look at some of the data?

as_tsibble(): Store useful information

sample_n_keys(): View subsamples of data

facet_sample(): View many subsamples

facet_strata(): View all subsamples

66 / 95

Problem #2: How do I find interesting observations?

67 / 95

A workflow

Define what is interesting

maximum height

68 / 95

Identify features: one observation per key

69 / 95

Identify important features and decide how to filter

72 / 95

Join this feature back to the data

74 / 95

🎉 Countries with smallest and largest max height

76 / 95

Let's see that one more time, but with the data

77 / 95

Identify features: one observation per key

## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##    country      year height_cm
##    <chr>       <dbl>     <dbl>
##  1 Afghanistan  1870      168.
##  2 Afghanistan  1880      166.
##  3 Afghanistan  1930      167.
##  4 Afghanistan  1990      167.
##  5 Afghanistan  2000      161.
##  6 Albania      1880      170.
##  7 Albania      1890      170.
##  8 Albania      1900      169.
##  9 Albania      2000      168.
## 10 Algeria      1910      169.
## # … with 1,480 more rows
78 / 95

Identify features: one observation per key

## # A tibble: 144 x 6
##    country       min   q25   med   q75   max
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  161.  164.  167.  168.  168.
##  2 Albania      168.  168.  170.  170.  170.
##  3 Algeria      166.  168.  169   170.  171.
##  4 Angola       159.  160.  167.  168.  169.
##  5 Argentina    167.  168.  168.  170.  174.
##  6 Armenia      164.  166.  169.  172.  172.
##  7 Australia    170   171.  172.  173.  178.
##  8 Austria      162.  164.  167.  169.  179.
##  9 Azerbaijan   170.  171.  172.  172.  172.
## 10 Bahrain      161.  161.  164.  164.  164
## # … with 134 more rows
79 / 95

Identify important features and decide how to filter

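# heights_five is presumably the per-country five-number summary
# shown above, i.e. heights %>% features(height_cm, feat_five_num)
heights_five %>%
  filter(max == max(max) | max == min(max))
## # A tibble: 2 x 6
##   country            min   q25   med   q75   max
##   <chr>            <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark           165.  168.  170.  178.  183.
## 2 Papua New Guinea  152.  152.  156.  160.  161.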
80 / 95

Join summaries back to data

heights_five %>%
  filter(max == max(max) | max == min(max)) %>%
  left_join(heights, by = "country")
## # A tibble: 21 x 8
##    country   min   q25   med   q75   max  year height_cm
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
##  1 Denmark  165.  168.  170.  178.  183.  1820      167.
##  2 Denmark  165.  168.  170.  178.  183.  1830      165.
##  3 Denmark  165.  168.  170.  178.  183.  1850      167.
##  4 Denmark  165.  168.  170.  178.  183.  1860      168.
##  5 Denmark  165.  168.  170.  178.  183.  1870      168.
##  6 Denmark  165.  168.  170.  178.  183.  1880      170.
##  7 Denmark  165.  168.  170.  178.  183.  1890      169.
##  8 Denmark  165.  168.  170.  178.  183.  1900      170.
##  9 Denmark  165.  168.  170.  178.  183.  1910      170
## 10 Denmark  165.  168.  170.  178.  183.  1920      174.
## # … with 11 more rows
81 / 95

82 / 95

Identify features: one per key

heights %>%
  features(height_cm,
           feat_five_num)
## # A tibble: 144 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 138 more rows
83 / 95

What is the range of the data? feat_ranges

heights %>%
  features(height_cm, feat_ranges)
## # A tibble: 144 x 5
##    country       min   max range_diff   iqr
##    <chr>       <dbl> <dbl>      <dbl> <dbl>
##  1 Afghanistan  161.  168.       7     3.27
##  2 Albania      168.  170.       2.20  1.53
##  3 Algeria      166.  171.       5.06  2.15
##  4 Angola       159.  169.      10.5   7.87
##  5 Argentina    167.  174.       7     2.21
##  6 Armenia      164.  172.       8.82  5.30
##  7 Australia    170   178.       8.4   2.58
##  8 Austria      162.  179.      17.2   5.35
##  9 Azerbaijan   170.  172.       1.97  1.12
## 10 Bahrain      161.  164        3.3   2.75
## # … with 134 more rows
84 / 95

Does it only increase or decrease? feat_monotonic

heights %>%
  features(height_cm, feat_monotonic)
## # A tibble: 144 x 5
##    country     increase decrease unvary monotonic
##    <chr>       <lgl>    <lgl>    <lgl>  <lgl>
##  1 Afghanistan FALSE    FALSE    FALSE  FALSE
##  2 Albania     FALSE    TRUE     FALSE  TRUE
##  3 Algeria     FALSE    FALSE    FALSE  FALSE
##  4 Angola      FALSE    FALSE    FALSE  FALSE
##  5 Argentina   FALSE    FALSE    FALSE  FALSE
##  6 Armenia     FALSE    FALSE    FALSE  FALSE
##  7 Australia   FALSE    FALSE    FALSE  FALSE
##  8 Austria     FALSE    FALSE    FALSE  FALSE
##  9 Azerbaijan  FALSE    FALSE    FALSE  FALSE
## 10 Bahrain     TRUE     FALSE    FALSE  TRUE
## # … with 134 more rows
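
One plausible way to compute flags like these for a single series, sketched in base R. The helper `is_monotonic()` is hypothetical and is not claimed to match feat_monotonic exactly; it assumes the values are already ordered by the time index.

```r
# x: numeric values of one series, assumed to be sorted by time.
is_monotonic <- function(x) {
  d <- diff(x)                 # change between consecutive observations
  increase <- all(d > 0)       # every step goes up
  decrease <- all(d < 0)       # every step goes down
  unvary   <- all(d == 0)      # no step changes at all
  c(increase  = increase,
    decrease  = decrease,
    unvary    = unvary,
    monotonic = increase || decrease || unvary)
}

falling <- is_monotonic(c(170, 169, 168))  # strictly decreasing series
mixed   <- is_monotonic(c(168, 166, 167))  # goes down, then up
falling
mixed
```

A series that dips once and recovers, like the second example, is flagged as not monotonic, which matches how most countries appear in the output above.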
85 / 95

What is the spread of my data? feat_spread

heights %>%
  features(height_cm, feat_spread)
## # A tibble: 144 x 5
##    country        var    sd   mad   iqr
##    <chr>        <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  7.20  2.68  1.65   3.27
##  2 Albania      0.950 0.975 0.667  1.53
##  3 Algeria      3.30  1.82  0.741  2.15
##  4 Angola      16.9   4.12  3.11   7.87
##  5 Argentina    2.89  1.70  1.36   2.21
##  6 Armenia     10.6   3.26  3.60   5.30
##  7 Australia    7.63  2.76  1.66   2.58
##  8 Austria     26.6   5.16  3.93   5.35
##  9 Azerbaijan   0.516 0.718 0.621  1.12
## 10 Bahrain      3.42  1.85  0.297  2.75
## # … with 134 more rows
86 / 95

features: MANY more features in feasts

Such as:

  • feat_acf: autocorrelation-based features
  • feat_stl: STL (Seasonal and Trend decomposition using LOESS) features
  • Create your own features
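
A sketch of what "create your own features" can mean: a feature is a function that turns one series into a few named numbers. The example below uses only base R to apply it per key. The function `feat_first_last()` and the toy data are hypothetical, and this does not show how features() dispatches internally.

```r
# Toy data: two made-up countries, already sorted by year within country.
toy <- data.frame(
  country   = rep(c("A", "B"), each = 4),
  year      = rep(c(1900, 1910, 1920, 1930), times = 2),
  height_cm = c(165, 167, 170, 172, 160, 160, 161, 159)
)

# A custom feature: first value, last value, and the change between them.
feat_first_last <- function(x) {
  c(first  = x[1],
    last   = x[length(x)],
    change = x[length(x)] - x[1])
}

# Apply the feature to each key's series, then stack into one row per key.
per_key <- lapply(split(toy$height_cm, toy$country), feat_first_last)
features_tbl <- data.frame(
  country = names(per_key),
  do.call(rbind, per_key),
  row.names = NULL
)
features_tbl
```

The result has the same shape as the feature tables on the earlier slides: one row per key, one column per summary.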
87 / 95

Take homes

Problem #1: How do I look at some of the data?

  1. Longitudinal data is a time series
  2. Specify structure once, get a free lunch.
  3. Look at as much of the raw data as possible
  4. Use facet_sample() / facet_strata()
88 / 95

Take homes

Problem #2: How do I find interesting observations?

  1. Decide what features are interesting
  2. Summarise down to one observation
  3. Decide how to filter
  4. Join this feature back to the data
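
The four steps above can be sketched end to end in base R. This is an illustration on made-up data, not the talk's code, which uses features(), filter(), and left_join(); the names `toy`, `max_height`, `extremes`, and `extreme_series` are hypothetical.

```r
# Toy data: three made-up countries, three observations each.
toy <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  year      = rep(c(1900, 1950, 2000), times = 3),
  height_cm = c(165, 170, 178, 160, 162, 161, 168, 169, 171)
)

# Steps 1 and 2: the feature of interest is maximum height, one row per key.
max_height <- aggregate(height_cm ~ country, data = toy, FUN = max)
names(max_height)[2] <- "max"

# Step 3: keep the keys with the largest and the smallest maximum.
is_extreme <- max_height$max == max(max_height$max) |
  max_height$max == min(max_height$max)
extremes <- max_height[is_extreme, ]

# Step 4: join the feature back to the full data to recover each series.
extreme_series <- merge(extremes, toy, by = "country")
extreme_series
```

Only the two extreme countries survive the filter, but the join brings back every observation for them, ready to plot.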
89 / 95

Future Directions

  • More features (summaries)
  • Generalise beyond time series
  • Explore stratification process
90 / 95

Thanks

  • Di Cook
  • Tania Prvan
  • Stuart Lee
  • Mitchell O'Hara-Wild
  • Earo Wang
  • Rob Hyndman
  • Miles McBain
  • Hadley Wickham
  • Monash University
91 / 95

Resources

92 / 95

Colophon

93 / 95

Learning more

brolgar.njtierney.com

bit.ly/njt-wombat

nj_tierney

njtierney

nicholas.tierney@gmail.com

94 / 95

End.

95 / 95
