Undergraduate in Psychology
A lot of research in new statistical methods, e.g., imputation, inference, prediction
Not much research on how we explore data
Focus on building a bridge across a river. Less focus on how it is built, and the tools used.
My research:
Design and improve tools for (exploratory) data analysis
Current work:
How to explore longitudinal data effectively
Something observed sequentially over time
country | year | height_cm |
---|---|---|
Australia | 1910 | 173 |
Australia | 1920 | 173 |
Australia | 1960 | 176 |
Australia | 1970 | 178 |
Problem #1: How do I look at some of the data?
Problem #2: How do I find interesting observations?
brolgar: brolgar.njtierney.com
Anything that is observed sequentially over time is a time series
heights <- as_tsibble(heights, index = year, key = country, regular = FALSE)
Together, the index (1) and the key (2) determine the distinct rows in a tsibble.
(From Earo Wang's talk: Melt the clock)
Record important time series information once, and use it many times in other places
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##   country      year height_cm
##   <chr>       <dbl>     <dbl>
## 1 Afghanistan  1870      168.
## 2 Afghanistan  1880      166.
## 3 Afghanistan  1930      167.
## 4 Afghanistan  1990      167.
## 5 Afghanistan  2000      161.
## 6 Albania      1880      170.
## # … with 1,484 more rows
Remember:
key = variable(s) defining individual groups (or series)
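As a minimal sketch of how the index and key are recorded, the four Australian rows from the table above can be turned into a tsibble. The object names `heights_small` and `heights_ts` are made up for illustration; `regular = FALSE` mirrors the call shown earlier, since the years are not evenly spaced.

```r
library(tsibble)

# The four rows shown in the table earlier in the talk.
heights_small <- tibble::tibble(
  country   = c("Australia", "Australia", "Australia", "Australia"),
  year      = c(1910, 1920, 1960, 1970),
  height_cm = c(173, 173, 176, 178)
)

# Record the time variable (index) and the series identifier (key) once.
heights_ts <- as_tsibble(
  heights_small,
  index = year,
  key = country,
  regular = FALSE
)

# The stored information can be read back later.
index_var(heights_ts)
key_vars(heights_ts)
n_keys(heights_ts)
```

Once the index and key are stored in the object, later functions can use them without being told again which column is time and which is the series.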
Look at only a sample of the data:
n rows with sample_n()
heights %>% sample_n(5)
## # A tsibble: 5 x 3 [!]
## # Key:       country [5]
##   country           year height_cm
##   <chr>            <dbl>     <dbl>
## 1 Cambodia          1860      165.
## 2 Bolivia           1890      164.
## 3 Macedonia         1930      169.
## 4 United States     1920      173.
## 5 Papua New Guinea  1880      152.
... sampling needs to select not random rows of the data, but the keys - the countries.
sample_n_keys() to sample ... keys
sample_n_keys(heights, 5)
## # A tsibble: 32 x 3 [!]
## # Key:       country [5]
##   country     year height_cm
##   <chr>      <dbl>     <dbl>
## 1 Congo, DRC  1810      163.
## 2 Congo, DRC  1870      166.
## 3 Congo, DRC  1880      163.
## 4 Congo, DRC  1890      163.
## 5 Congo, DRC  1910      165.
## 6 Congo, DRC  1920      163.
## # … with 26 more rows
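A short sketch of the difference in practice, assuming the `heights` data that ships with brolgar. The seed and the object name `keys_sample` are arbitrary choices for illustration, so the countries drawn will differ from the output above.

```r
library(brolgar)

set.seed(2019)  # arbitrary seed, only so the draw is repeatable

# Sample five whole series (countries), keeping every row of each one.
keys_sample <- sample_n_keys(heights, 5)

# Five keys, but usually many more than five rows.
tsibble::n_keys(keys_sample)
nrow(keys_sample)
```

Because whole series are kept, each sampled country can still be drawn as a complete line.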
Look at subsamples
Sample keys with sample_n_keys()
Look at many subsamples
?
(Something I made up)
If solving a problem requires solving 3+ smaller problems
Your focus shifts from the current goal to something else.
You are distracted.
Task one
Task one being overshadowed slightly by minor task 1
I want to look at many subsamples of the data
How many keys are there?
How many facets do I want to look at?
How many keys per facet should I look at?
How do I ensure there are the same number of keys per plot?
What is rep, rep.int, and rep_len?
Do I want length.out or times?
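To make the chain of small problems concrete, here is a rough base R sketch of doing the bookkeeping by hand. The key names are stand-ins and `n_per_facet`, `n_facets`, `sampled_keys`, and `assignment` are hypothetical names; this is one possible manual approach, not what brolgar does internally.

```r
# Stand-in key names; with the heights data these would be the 144 countries.
all_keys <- paste0("country_", seq_len(144))

n_per_facet <- 3
n_facets <- 9

set.seed(2019)  # arbitrary seed for a repeatable draw
# Draw exactly enough keys to fill every facet.
sampled_keys <- sample(all_keys, size = n_per_facet * n_facets)

# The rep() family differs in how the repetition is specified:
rep(1:3, times = 2)           # 1 2 3 1 2 3
rep(1:3, each = 2)            # 1 1 2 2 3 3
rep.int(1:3, times = 2)       # 1 2 3 1 2 3
rep_len(1:3, length.out = 7)  # 1 2 3 1 2 3 1

# Assign each sampled key to a facet, n_per_facet keys at a time.
facet_id <- rep(seq_len(n_facets), each = n_per_facet)
assignment <- data.frame(key = sampled_keys, facet = facet_id)

# Check that every facet received the same number of keys.
table(assignment$facet)
```

Every line here answers one of the questions above, and none of them is about the data itself.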
When we are distracted, we can blame ourselves for not being better.
It's not that we should be better; rather, with better tools we could be more efficient.
We need to make things as easy as reasonable, with the least amount of distraction.
How many keys per facet?
How many plots do I want to look at?
facet_sample( n_per_facet = 3, n_facets = 9 )
facet_sample(): See more individuals
ggplot(heights, aes(x = year, y = height_cm, group = country)) + geom_line() + facet_sample()
facet_strata(): See all individuals
ggplot(heights, aes(x = year, y = height_cm, group = country)) + geom_line() + facet_strata()
In asking these questions we can solve something else interesting
facet_strata(along = -year): see all individuals along some variable
ggplot(heights, aes(x = year, y = height_cm, group = country)) + geom_line() + facet_strata(along = -year)
"How many lines per facet"
"How many facets?"
facet_sample( n_per_facet = 10, n_facets = 12 )
"How many facets to put all the data in?"
"How to arrange plots along?"
facet_strata( n_strata = 10, along = -year )
facet_strata() & facet_sample()
Under the hood using sample_n_keys() & stratify_keys()
You can still get at data and do manipulations
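As a sketch of getting at the data directly, `stratify_keys()` can be called on its own and the result passed to an ordinary `facet_wrap()`. This assumes the `heights` data from brolgar and that the strata are stored in a column named `.strata`, as in the brolgar documentation; `heights_strata` and `p` are names made up here.

```r
library(brolgar)
library(ggplot2)

# Assign every key (country) to one of 10 strata; the result is still data
# that can be filtered, joined, or summarised before plotting.
heights_strata <- stratify_keys(heights, n_strata = 10)

# Plot the strata with a regular ggplot2 facet.
p <- ggplot(heights_strata,
            aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_wrap(~.strata)
```

This gives roughly the same picture as `facet_strata(n_strata = 10)`, but with the intermediate data available for further manipulation.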
as_tsibble(): Store useful information
sample_n_keys(): View subsamples of data
facet_sample(): View many subsamples
facet_strata(): View all subsamples
A workflow
Define what is interesting
maximum height
Let's see that one more time, but with the data
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##    country      year height_cm
##    <chr>       <dbl>     <dbl>
##  1 Afghanistan  1870      168.
##  2 Afghanistan  1880      166.
##  3 Afghanistan  1930      167.
##  4 Afghanistan  1990      167.
##  5 Afghanistan  2000      161.
##  6 Albania      1880      170.
##  7 Albania      1890      170.
##  8 Albania      1900      169.
##  9 Albania      2000      168.
## 10 Algeria      1910      169.
## # … with 1,480 more rows
## # A tibble: 144 x 6
##    country       min   q25   med   q75   max
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  161.  164.  167.  168.  168.
##  2 Albania      168.  168.  170.  170.  170.
##  3 Algeria      166.  168.  169   170.  171.
##  4 Angola       159.  160.  167.  168.  169.
##  5 Argentina    167.  168.  168.  170.  174.
##  6 Armenia      164.  166.  169.  172.  172.
##  7 Australia    170   171.  172.  173.  178.
##  8 Austria      162.  164.  167.  169.  179.
##  9 Azerbaijan   170.  171.  172.  172.  172.
## 10 Bahrain      161.  161.  164.  164.  164
## # … with 134 more rows
heights_five %>% filter(max == max(max) | max == min(max))
## # A tibble: 2 x 6
##   country            min   q25   med   q75   max
##   <chr>            <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark           165.  168.  170.  178.  183.
## 2 Papua New Guinea  152.  152.  156.  160.  161.
heights_five %>% filter(max == max(max) | max == min(max)) %>% left_join(heights, by = "country")
## # A tibble: 21 x 8
##    country   min   q25   med   q75   max  year height_cm
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
##  1 Denmark  165.  168.  170.  178.  183.  1820      167.
##  2 Denmark  165.  168.  170.  178.  183.  1830      165.
##  3 Denmark  165.  168.  170.  178.  183.  1850      167.
##  4 Denmark  165.  168.  170.  178.  183.  1860      168.
##  5 Denmark  165.  168.  170.  178.  183.  1870      168.
##  6 Denmark  165.  168.  170.  178.  183.  1880      170.
##  7 Denmark  165.  168.  170.  178.  183.  1890      169.
##  8 Denmark  165.  168.  170.  178.  183.  1900      170.
##  9 Denmark  165.  168.  170.  178.  183.  1910      170
## 10 Denmark  165.  168.  170.  178.  183.  1920      174.
## # … with 11 more rows
heights %>% features(height_cm, feat_five_num)
## # A tibble: 144 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 138 more rows
feat_ranges
heights %>% features(height_cm, feat_ranges)
## # A tibble: 144 x 5
##    country       min   max range_diff   iqr
##    <chr>       <dbl> <dbl>      <dbl> <dbl>
##  1 Afghanistan  161.  168.       7     3.27
##  2 Albania      168.  170.       2.20  1.53
##  3 Algeria      166.  171.       5.06  2.15
##  4 Angola       159.  169.      10.5   7.87
##  5 Argentina    167.  174.       7     2.21
##  6 Armenia      164.  172.       8.82  5.30
##  7 Australia    170   178.       8.4   2.58
##  8 Austria      162.  179.      17.2   5.35
##  9 Azerbaijan   170.  172.       1.97  1.12
## 10 Bahrain      161.  164        3.3   2.75
## # … with 134 more rows
feat_monotonic
heights %>% features(height_cm, feat_monotonic)
## # A tibble: 144 x 5
##    country     increase decrease unvary monotonic
##    <chr>       <lgl>    <lgl>    <lgl>  <lgl>
##  1 Afghanistan FALSE    FALSE    FALSE  FALSE
##  2 Albania     FALSE    TRUE     FALSE  TRUE
##  3 Algeria     FALSE    FALSE    FALSE  FALSE
##  4 Angola      FALSE    FALSE    FALSE  FALSE
##  5 Argentina   FALSE    FALSE    FALSE  FALSE
##  6 Armenia     FALSE    FALSE    FALSE  FALSE
##  7 Australia   FALSE    FALSE    FALSE  FALSE
##  8 Austria     FALSE    FALSE    FALSE  FALSE
##  9 Azerbaijan  FALSE    FALSE    FALSE  FALSE
## 10 Bahrain     TRUE     FALSE    FALSE  TRUE
## # … with 134 more rows
feat_spread
heights %>% features(height_cm, feat_spread)
## # A tibble: 144 x 5
##    country        var    sd   mad   iqr
##    <chr>        <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  7.20  2.68  1.65   3.27
##  2 Albania      0.950 0.975 0.667  1.53
##  3 Algeria      3.30  1.82  0.741  2.15
##  4 Angola      16.9   4.12  3.11   7.87
##  5 Argentina    2.89  1.70  1.36   2.21
##  6 Armenia     10.6   3.26  3.60   5.30
##  7 Australia    7.63  2.76  1.66   2.58
##  8 Austria     26.6   5.16  3.93   5.35
##  9 Azerbaijan   0.516 0.718 0.621  1.12
## 10 Bahrain      3.42  1.85  0.297  2.75
## # … with 134 more rows
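Beyond the built-in feature sets, `features()` also accepts a named list of summary functions, so "what is interesting" can be defined directly. The sketch below reruns the maximum-height workflow with only that one feature, assuming the `heights` data from brolgar; `heights_max` and `extremes` are hypothetical names, and fabletools is loaded explicitly only to be sure `features()` is available.

```r
library(brolgar)
library(fabletools)
library(dplyr)

# One row per country, holding a single feature: the maximum height.
heights_max <- heights %>%
  features(height_cm, list(max = max))

# Keep the countries with the largest and smallest maximum,
# then join back to the full series for plotting or inspection.
extremes <- heights_max %>%
  filter(max == max(max) | max == min(max)) %>%
  left_join(heights, by = "country")
```

The same three steps (summarise each key, filter the summaries, join back) apply to any feature, which is what makes the workflow reusable.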
feasts
Such as:
feat_acf: autocorrelation-based features
feat_stl: STL (Seasonal, Trend, and Remainder by LOESS) decomposition
facet_sample() / facet_strata()
End.