library(here)
library(tidyverse)
library(lme4)
library(brms)
library(tidybayes)
frames <- read_csv(here("samplingframes", "data", "data_samplesize.csv"))
set.seed(1) # for replicability
Because R is a statistical programming language it comes with a lot of hypothesis tests and tools built in, and of course there is an overwhelming number of packages out there that extend this. It is impossible to cover the whole thing in one tutorial, so I’m going to be a little picky. For example, I’m going to skip over the most commonly used classical tests, because they’re comparatively easy to learn and it’s not the best use of our time! For future reference though:
t.test()
function handles one-sample, independent samples and paired samples t-testschisq.test()
function handles chi-square tests of independence and Pearson goodness of fit testsprop.test()
function tests for the equality of two proportions.binom.test()
function allows you to do a binomial test of choice proportion against a known ratewilcox.test()
function handles one- and two-sample nonparametric tests of equality of meanscor.test()
function tests the significance of a correlationOf course, there are many, many others! What we’re going to focus on here is:
lm()
function, and model comparison using the anova()
functionbrm()
function (in the brms
) package, and model comparison using Bayes factorslmer()
function (in the lme4
package)glmer()
function (also from lme4
)One of our goals is to get to the point where we can analyze the sampling frames data from yesterday. After exploring and visualizing these data yesterday, you have some idea of what they look like. Below is a summary plot that shows how generalization curves varied across both conditions (category, property) and sample sizes (small, medium, large).
frames_avg <- frames %>%
mutate(sample_size = factor(sample_size, levels = c("small","medium","large"))) %>%
group_by(test_item, condition, sample_size) %>%
summarise(
n = n(),
sd=sd(response),
se=sd/sqrt(n),
response = mean(response),
) %>%
ungroup()
frames_avg %>%
ggplot(aes(x = test_item, y = response, colour = condition)) +
geom_line() +
geom_point() +
geom_errorbar(aes(ymin = response - se, ymax = response + se)) +
facet_wrap(~sample_size)
But the analyses towards the start of this tutorial are included mainly to illustrate statistical concepts — they wouldn’t be analyses we’d run if we were interested only in the sampling frames project.
Linear models should be fairly familiar to most: it’s essentially what we were all taught in undergraduate under the name multiple regression. However, what is sometimes underemphasised is the fact that correlation, ANOVA, and t-tests can all be cast within the linear modelling framework, and R allows you do do all these using the lm()
function. So that’s where we’re going to start.
To begin with, we need a data set. For this purpose, let’s construct a simplified version of the frames
data, by averaging all the responses made by a person, regardless of the number of observations or the test item:
tinyframes <- frames %>%
group_by(id, age, condition) %>%
summarise(
response = mean(response)
) %>%
ungroup()
Let’s take a look at the tinyframes
dataset we’ve just created:
glimpse(tinyframes)
## Observations: 225
## Variables: 4
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ age <dbl> 36, 46, 33, 71, 23, 31, 23, 31, 37, 46, 27, 30, 22, 31, 20,…
## $ condition <chr> "category", "category", "property", "property", "property",…
## $ response <dbl> 5.333333, 7.047619, 4.857143, 3.857143, 9.000000, 7.904762,…
A very typical way to produce descriptive statistics is to calculate mean and standard deviation for each condition, and count the number of people in each condition.
tinyframes %>%
group_by(condition) %>%
summarise(
mean_resp = mean(response),
sd_resp = sd(response),
n = n()
)
## # A tibble: 2 x 4
## condition mean_resp sd_resp n
## <chr> <dbl> <dbl> <int>
## 1 category 5.40 1.56 114
## 2 property 4.39 1.37 111
We would also want to visualise the data. It is almost always a mistake to start trying to model a data set without properly exploring it and making sure you have a good “feel” for what is going on. So let’s draw a picture!
tinyframes %>%
ggplot(aes(x = condition, y = response)) +
geom_violin(draw_quantiles=c(0.5))
So let’s start with a simple question. Is there a “significant” difference between the two conditions? I’m not a fan of orthodox null hypothesis testing, to be honest, but it does have it’s place. Traditionally, the solution is the t-test:
t.test(
formula = response ~ condition,
data = tinyframes,
var.equal = TRUE
)
##
## Two Sample t-test
##
## data: response by condition
## t = 5.1625, df = 223, p-value = 5.388e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.6259535 1.3988834
## sample estimates:
## mean in group category mean in group property
## 5.397661 4.385242
Okay we have a significant difference. So we reject the null hypothesis (i.e., that the two groups have the same population mean) and accept the alternative (i.e., that they have different population means). Yay. I guess. The moment we start caring about data analysis in any detail, though, it helps to recast these “hypotheses” in terms of statistical models.
mod1 <- lm(formula = response ~ 1, data = tinyframes)
mod2 <- lm(formula = response ~ condition, data = tinyframes)
To give you a sense of what R has just done, it has estimated the coefficients for two different regression models: mod1
only includes an intercept term (i.e., the “grand mean”), whereas mod2
contains two terms:
mod2
##
## Call:
## lm(formula = response ~ condition, data = tinyframes)
##
## Coefficients:
## (Intercept) conditionproperty
## 5.398 -1.012
Notice that the coefficients have a clear relationship to the group means: the “intercept” term is identical to the group mean for category sampling, and the “conditionproperty” term is what you have to add to that to get the group mean for property sampling (i.e., 5.4 - 1.0 = 4.4). It’s expressed in different language than the t-test, but mod2
nevertheless maps onto the alternative hypothesis.
To compare these two linear models, we can call the anova()
function:
anova(mod1, mod2)
## Analysis of Variance Table
##
## Model 1: response ~ 1
## Model 2: response ~ condition
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 224 539.98
## 2 223 482.33 1 57.645 26.652 5.388e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This doesn’t look like the output of a t-test, but if we look carefully we notice that:
In a sense, what we’ve just done is illustrate the fact that the Student t-test is equivalent to a one-way ANOVA with two groups (which we were all taught as undergraduates), but we’ve used linear models to do it!
Let’s look at mod2
in a bit more detail. The model is based on the equation: \[\begin{equation}
\text{response}_i = \beta_0 + \beta_1 * \text{condition}_i + \epsilon_i
\end{equation}\] which says that the \(i\)th value of the response variable is equal to the sum of three terms. \(\beta_0\) is the intercept, \(\beta_1\) corresponds to the variable called conditionproperty
in the summary of mod2 above, and \(\epsilon_i\) is an error term (also called residual). Here \(\text{condition}\) is a variable that takes value 1 when the \(i\)th data point is from the property sampling condition, and 0 otherwise.
Intuitively, we want to find values of \(\beta_0\) and \(\beta_1\) that make the error terms as small as possible. When fitting the model, lm()
computed values that minimize the sum squared error \(\sum_i \epsilon_i^2\).
Let’s now try out a Bayesian approach to regression. Bayesian inference involves three stages:
\[ P(h|D) \propto P(D|h) P(h) \]
For our regression example, we’re going to assume that the error terms \(\epsilon_i\) are generated from a normal distribution with zero mean and variance \(\sigma\):
\[ \epsilon_i \sim N(0, \sigma^2)\] So the hypotheses we’re interested in are triples \((\beta_0, \beta_1, \sigma)\) that specify different values of \(\sigma\) and the coefficients \(\beta_0\) and \(\beta_1\) in the regression equation above. We use the brm()
function (and its default priors) to sample from the posterior distribution \(P(\beta_0, \beta_1, \sigma | D)\)
mod2_bayes <- brm(
formula = response ~ 0 + Intercept + condition,
data = tinyframes,
save_all_pars = TRUE
)
## Compiling the C++ model
## Start sampling
##
## SAMPLING FOR MODEL '03817e0fb1a22da429f873c90218c352' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 1.6e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.029064 seconds (Warm-up)
## Chain 1: 0.03094 seconds (Sampling)
## Chain 1: 0.060004 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '03817e0fb1a22da429f873c90218c352' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 8e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.027009 seconds (Warm-up)
## Chain 2: 0.022449 seconds (Sampling)
## Chain 2: 0.049458 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '03817e0fb1a22da429f873c90218c352' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 1e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.028364 seconds (Warm-up)
## Chain 3: 0.029438 seconds (Sampling)
## Chain 3: 0.057802 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL '03817e0fb1a22da429f873c90218c352' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 9e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.032012 seconds (Warm-up)
## Chain 4: 0.025719 seconds (Sampling)
## Chain 4: 0.057731 seconds (Total)
## Chain 4:
Let’s visualize the posterior distributions:
mod2_bayes_samples_wide <- mod2_bayes %>%
spread_draws(b_Intercept, b_conditionproperty, sigma)
mod2_bayes_samples <- mod2_bayes_samples_wide %>%
gather(key='variable', value = "value", b_Intercept, b_conditionproperty, sigma)
mod2_bayes_samples %>%
mutate(variable=factor(variable, levels=c('b_Intercept', 'b_conditionproperty', 'sigma'))) %>%
ggplot(aes(x = value)) +
geom_density() +
facet_wrap(~variable)
The least-squares regression model (ie the one fit using lm()
) gave estimates of 5.4 for \(\beta_0\) and -1.0 for \(\beta_1\), and you can see that the Bayesian posterior distributions have most of their probability mass near these estimates. So in this case (as in many others) the Bayesian and the non-Bayesian approaches are leading to similar conclusions.
Above we used an anova to compare regression models with and without a term for condition
. The analogous Bayesian approach uses Bayes factors to compare two models. Let \(M_2\) be the Bayesian model we specified before, and let \(M_1\) be the simpler model \(M_1\) that doesn’t include a term for condition
. The Bayes factor comparing the two is
\[\begin{align} BF_{21} &= \frac{P(M_2 | D)}{P(M_1 | D)} \\ &= \frac{P(D|M_2) P(M_2)}{P(D| M_1)P(M_1)} \\ &= \frac{P(D|M_2)}{P(D| M_1)} \end{align}\] where the final step is valid only if we assume (as people often do) that the two models have equal prior probabilities. Note that \(P(D|M_2)\) is a marginal likelihood – it’s computed by marginalizing out the three variables \(\beta_0\), \(\beta_1\) and \(\sigma\). For us the equation isn’t really important (so feel free to skip) but here it is for those who are interested: \[\begin{align} P(D|M_2) &= \int P(D|\beta_0, \beta_1, \sigma, M_2) P(\beta_0, \beta_1, \sigma | M_2) d\beta_0 d\beta_1 d\sigma \\ &= \int P(D|\beta_0, \beta_1, \sigma, M_2) P(\beta_0)P(\beta_1) P(\sigma)d\beta_0 d\beta_1 d\sigma \end{align}\]
Bayes factors automatically control for model complexity – e.g. the fact that \(M_2\) has one more parameter than \(M_1\). Kass & Raftery suggest interpreting them as follows:
Value | Interpretation |
---|---|
\(BF_{21} < 1\) | Negative: supports \(M_1\) rather than \(M_2\) |
\(1 < BF_{21} < 3\) | Barely worth mentioning |
\(3 < BF_{21} < 10\) | Substantial |
\(10 < BF_{21} < 100\) | Strong |
\(100 < BF_{21}\) | Decisive |
Let’s set up the simpler model \(M_1\) and compute a Bayes factor comparing \(M_2\) with \(M_1\):
mod1_bayes <- brm(
formula = response ~ 1,
data = tinyframes,
save_all_pars = TRUE
)
## Compiling the C++ model
## Start sampling
##
## SAMPLING FOR MODEL 'c8c8b6431621222e2b469dfd362688a6' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 1.7e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.030829 seconds (Warm-up)
## Chain 1: 0.025842 seconds (Sampling)
## Chain 1: 0.056671 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'c8c8b6431621222e2b469dfd362688a6' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 8e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.030428 seconds (Warm-up)
## Chain 2: 0.037811 seconds (Sampling)
## Chain 2: 0.068239 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'c8c8b6431621222e2b469dfd362688a6' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 7e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.030854 seconds (Warm-up)
## Chain 3: 0.028383 seconds (Sampling)
## Chain 3: 0.059237 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'c8c8b6431621222e2b469dfd362688a6' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 6e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.029032 seconds (Warm-up)
## Chain 4: 0.032117 seconds (Sampling)
## Chain 4: 0.061149 seconds (Total)
## Chain 4:
BF <- bayes_factor(mod2_bayes, mod1_bayes)
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
cat('Bayes factor comparing mod2_bayes to mod1_bayes is', BF$bf)
## Bayes factor comparing mod2_bayes to mod1_bayes is 3933163
In practice we should use more iterations than we get by default when estimating the Bayes factor. But the overall message seems clear — there is overwhelming support for \(M_2\) (which includes a term for condition
) relative to \(M_1\) (which does not).
Suppose we wondered whether people’s responses varied by age in addition to condition. As always, visualizing the data is a good first step.
tinyframes %>%
ggplot(aes(x = age, y = response, colour = condition)) +
geom_smooth(method = "lm") +
geom_point()
We see again that responses in the category condition are higher than responses in the property condition, but both regression lines are flattish so it doesn’t seem like there is an effect of age. Let’s proceed anyway. We’ll take the two regression models we ran before (mod1
and mod2
) and now add a third model that includes age
as an additional predictor:
mod3 <- lm(formula = response ~ condition + age, data = tinyframes)
If we take a quick look at the coefficients
mod3
##
## Call:
## lm(formula = response ~ condition + age, data = tinyframes)
##
## Coefficients:
## (Intercept) conditionproperty age
## 5.102572 -1.018548 0.008536
we can see that age really isn’t having much of an effect, if any. To compare all three models using an \(F\)-test, what we would do is this:
anova(mod1, mod2, mod3)
## Analysis of Variance Table
##
## Model 1: response ~ 1
## Model 2: response ~ condition
## Model 3: response ~ condition + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 224 539.98
## 2 223 482.33 1 57.645 26.6544 5.399e-07 ***
## 3 222 480.12 1 2.214 1.0238 0.3127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this ANOVA table, what you’re looking at is a test of mod2
against mod1
, followed by mod3
against mod2
. This suggests that mod2
is preferred over mod1
(reject the null), but mod3
isn’t preferred over mod2
(retain the null).
An alternative approach is to compare the three models using a statistic that accounts for model complexity. A simple model that accounts for just the pattern of data observed seems better than a complex model that is flexible enough to fit many patterns of data, including the pattern observed as well as many others. The same conclusion holds if the more complex model fits the data slightly better – probably it does so by virtue of its flexibility, not because it is closer to the truth.
Two scoring measures that take model complexity into account are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), both of which have been around since the 1970s. For our linear models, we can evaluate them using the AIC()
and BIC()
functions:
AIC(mod1, mod2, mod3)
## df AIC
## mod1 2 839.4940
## mod2 3 816.0928
## mod3 4 817.0575
BIC(mod1, mod2, mod3)
## df BIC
## mod1 2 846.3262
## mod2 3 826.3411
## mod3 4 830.7219
Smaller values of AIC and BIC are better, and it’s hardly a surprise that mod2
turns out to be the best one! We won’t go into the definitions of the AIC and BIC here, and mention just two points. First, both criteria include a term that penalizes a model based on the number of parameters that it includes (having more parameters is worse). Second, the BIC score for a model \(M\) closely approximates the Bayes factor \(P(D|M)\) when the Bayes factor assumes one particular prior on the model parameters.
Instead of comparing the models using the BIC, we could fit a Bayesian regression model with age as an additional predictor (ie mod3_bayes
), and compare the three Bayesian regression models by computing Bayes factors. In the interests of time, however, we’ll focus mainly on frequentist approaches from here on, although each analysis presented has a Bayesian counterpart.
Overall, mod2
looks pretty sensible:
summary(mod2)
##
## Call:
## lm(formula = response ~ condition, data = tinyframes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2424 -0.9691 -0.3024 0.8052 4.6148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3977 0.1377 39.187 < 2e-16 ***
## conditionproperty -1.0124 0.1961 -5.163 5.39e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.471 on 223 degrees of freedom
## Multiple R-squared: 0.1068, Adjusted R-squared: 0.1027
## F-statistic: 26.65 on 1 and 223 DF, p-value: 5.388e-07
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 5.126217 5.6691049
## conditionproperty -1.398883 -0.6259535
To get started with mixed models, we will look at a version of the frames
data that is a little more complex than the tinyframes
data from the last section, but not quite the full thing yet. Specifically, what we’ll do is take the (within-subject) average responses across every test_item
, but we won’t average across the different values of n_obs
. That gives us a modestframes
data set:
modestframes <- frames %>%
group_by(id, age, condition, n_obs) %>%
summarise(response = mean(response)) %>%
ungroup()
glimpse(modestframes)
## Observations: 675
## Variables: 5
## $ id <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7,…
## $ age <dbl> 36, 36, 36, 46, 46, 46, 33, 33, 33, 71, 71, 71, 23, 23, 23,…
## $ condition <chr> "category", "category", "category", "category", "category",…
## $ n_obs <dbl> 2, 6, 12, 2, 6, 12, 2, 6, 12, 2, 6, 12, 2, 6, 12, 2, 6, 12,…
## $ response <dbl> 5.857143, 5.285714, 4.857143, 5.285714, 7.571429, 8.285714,…
As always, we start by visualizing the data. We’ll plot data for 80 individuals along with average responses across the entire group.
# get a random sample of 80 participants
whichids <- sample(unique(modestframes$id), 80)
modestframes_sample <- modestframes %>%
filter(id %in% whichids)
mfplot <- modestframes %>%
group_by(condition, n_obs) %>%
summarize(response = mean(response)) %>%
ggplot(aes(x = n_obs, y = response)) +
geom_point() +
geom_line() +
# now add data for the 80 individuals in whichids
geom_point(data = modestframes_sample, aes(x = n_obs, y = response, colour = factor(id)), show.legend = FALSE) +
geom_line(data = modestframes_sample, aes(x = n_obs, y = response, colour = factor(id)), show.legend = FALSE, alpha = .3) +
facet_wrap(~condition)
plot(mfplot)
Let’s start by building some possible models. The simplest model we might want to consider is one in which the population mean response is different from zero (i.e., a fixed effect for the intercept), but there is variation in the mean response across individuals (i.e., a random intercept for each participant). The model formula for this is written:
response ~ 1 + (1|id)
The first part of this model description response ~ 1
is something we’ve seen before under linear models: it’s the model that has an intercept but nothing else! Anything in parentheses (1|id)
is a random effect term. In this case, we have a separate intercept for each person (i.e., each unique id
). As an alternative, we might want to consider a model that includes fixed effects for the between-subject factor condition
and the within-subject factor n_obs
. The formula for that, expressed in lme4
notation, is
response ~ 1 + condition + n_obs + (1|id)
In real life, we might want to also consider models that only contain one of these two fixed effects, but for simplicity I’m not going to bother with that here. Instead, let’s jump straight to estimating these two models:
modest1 <- lmer(formula = response ~ 1 + (1|id), data = modestframes)
modest2 <- lmer(formula = response ~ condition + n_obs + (1|id), data = modestframes)
To compare them:
anova(modest1, modest2)
## refitting model(s) with ML (instead of REML)
## Data: modestframes
## Models:
## modest1: response ~ 1 + (1 | id)
## modest2: response ~ condition + n_obs + (1 | id)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## modest1 3 2354.9 2368.5 -1174.5 2348.9
## modest2 5 2333.5 2356.1 -1161.8 2323.5 25.403 2 3.046e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One thing that is nice about mixed models is that you can allow much more customisation than this. For instance, is it really all that plausible to think that everyone has a unique “random intercept” term, but is affected by the sample size in precisely the same way? The data for individuals in the plot above make it very clear that the curves for different individuals have different slopes. So let’s suppose that everyone has their own “random slope” term (i.e., everyone has their own regression coefficient for the effect of sample size). That gives us this model:
modest3 <- lmer(formula = response ~ condition + n_obs + (1 + n_obs|id), data = modestframes)
Okay let’s compare the expanded model to the model that only has a random intercept:
anova(modest2, modest3)
## refitting model(s) with ML (instead of REML)
## Data: modestframes
## Models:
## modest2: response ~ condition + n_obs + (1 | id)
## modest3: response ~ condition + n_obs + (1 + n_obs | id)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## modest2 5 2333.5 2356.1 -1161.8 2323.5
## modest3 7 2270.4 2302.0 -1128.2 2256.4 67.164 2 2.603e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Overall, it looks like this new model is providing a better account of the data, as evidenced by the lower AIC and BIC values. We can get a quantitative summary of how this model performs:
summary(modest3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: response ~ condition + n_obs + (1 + n_obs | id)
## Data: modestframes
##
## REML criterion at convergence: 2267.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8960 -0.3681 -0.0306 0.3557 3.3860
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 1.72618 1.3138
## n_obs 0.01684 0.1298 -0.21
## Residual 0.55277 0.7435
## Number of obs: 675, groups: id, 225
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 5.2256978 0.1385646 37.713
## conditionproperty -0.6693767 0.1874783 -3.570
## n_obs 0.0004094 0.0111058 0.037
##
## Correlation of Fixed Effects:
## (Intr) cndtnp
## cndtnprprty -0.667
## n_obs -0.311 0.000
Very nice. However, if we’re thinking that we might be satisfied with this model, we should now start the process of “model criticism”. Let’s extract the fitted values (“predictions”) and residuals and add them to the data frame:
modestframes$modelfit <- predict(modest3)
modestframes$residuals <- residuals(modest3)
One very simple check is to draw a scatterplot showing the fitted (modelled) responses against the raw data. How closely do they resemble one another?
modestframes %>%
ggplot(aes(x = modelfit, y = response)) +
geom_point() +
facet_grid(condition ~ n_obs) +
geom_abline(intercept = 0, slope = 1)
They seem reasonably close to one another. There are hints of some systematic misfits near the edges of the response range. Let’s confirm this by plotting the residuals:
modestframes %>%
ggplot(aes( x = response, y = residuals)) +
geom_point() +
facet_grid( condition ~ n_obs) +
geom_hline( yintercept = 0)
So we have a problem – the residuals tend to increase with n_obs
. If we had a good fit, we’d be unable to see any structure in the residuals. One reason for the model misfit is that the responses use a bounded scale (0 - 9), which means that we shouldn’t really be fitting a linear model.
It’s helpful also to look at the fitted values for individuals. So let’s plot the fitted values for the same 80 individuals we selected randomly earlier on:
modestframes %>%
filter(id %in% whichids) %>%
ggplot(aes(x = n_obs, y = modelfit, colour = factor(id))) +
geom_point(show.legend = FALSE) +
geom_line(show.legend = FALSE, alpha = .3) +
facet_wrap(~ condition)
Overall this seems like a reasonable, though imperfect, approximation to what’s going on in the data.
We’re about ready to start working with the actual frames
data. To guide us in this process, let’s plot the raw data for 20 randomly chosen subjects. This is pretty important, because each person is providing 21 responses that we expect to be related to one another in a systematic way, but we might not be completely sure what structure we’ll find.
whichids <- sample(unique(frames$id), 20)
frames %>%
filter(id %in% whichids) %>%
ggplot(aes(x = test_item, y = response, shape = condition, colour = factor(n_obs))) +
geom_point() +
geom_line(alpha = .3) +
facet_wrap(~ id)
There’s quite a variety of things there. None of these panels look like random responding, but it’s immediately obvious from inspection that there are quite pronounced individual differences. It’s not clear how well we’re going to do by modelling these as linear functions, but let’s give it a try and see how far we can get!
From the last exercise, we can be reasonably sure that there is a fixed effect of condition
and n_obs
, as well as random intercepts and slopes as a function of n_obs
. So our starting point will be the model that came out of that modelling exercise, and – for the sake of our sanity – I’m only going to look at one alternative model, namely one that adds a fixed and random effect of test_item
(mainly because it’s kind of a foregone conclusion that these effects exist!)
linframes1 <- lmer(formula = response ~ condition + n_obs + (1 + n_obs|id), data = frames)
linframes2 <- lmer(formula = response ~ condition + n_obs + test_item + (1 + n_obs + test_item |id), data = frames)
As before, we can call the anova()
function to run some simple model comparisons:
anova(linframes1, linframes2)
## refitting model(s) with ML (instead of REML)
## Data: frames
## Models:
## linframes1: response ~ condition + n_obs + (1 + n_obs | id)
## linframes2: response ~ condition + n_obs + test_item + (1 + n_obs + test_item |
## linframes2: id)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## linframes1 7 23128 23173 -11556.8 23114
## linframes2 11 19734 19805 -9855.8 19712 3402 4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I personally prefer to rely on BIC over \(p\)-values or AIC, but realistically none of them are ideal, and in any case the differences in model performance are so extreme it doesn’t matter what you use. They all give the same answer. So now let’s take a closer look at the model performance:
summary(linframes2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: response ~ condition + n_obs + test_item + (1 + n_obs + test_item |
## id)
## Data: frames
##
## REML criterion at convergence: 19727.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.1767 -0.5145 0.0274 0.5606 3.3035
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 7.40794 2.7218
## n_obs 0.01985 0.1409 -0.52
## test_item 0.48701 0.6979 -0.88 0.37
## Residual 2.80264 1.6741
## Number of obs: 4725, groups: id, 225
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 7.9666631 0.2105951 37.829
## conditionproperty -0.6192096 0.1683706 -3.678
## n_obs 0.0004094 0.0111058 0.037
## test_item -0.6914286 0.0480916 -14.377
##
## Correlation of Fixed Effects:
## (Intr) cndtnp n_obs
## cndtnprprty -0.394
## n_obs -0.478 0.000
## test_item -0.788 0.000 0.303
Add the model fits and residuals
linframes <- frames
linframes$modelfit <- predict(linframes2)
linframes$residuals <- residuals(linframes2)
Now that we have a model, let’s see how well it holds up under a good old-fashioned eyeball test. Here are the model predictions (lines) plotted against the raw data for the same 20 participants:
linframes %>%
filter(id %in% whichids) %>%
ggplot(aes(x = test_item, y = response, shape = condition, colour = factor(n_obs))) +
geom_point() +
geom_line(aes(y = modelfit), alpha = .3) +
facet_wrap(~ id)
It’s okay, I guess, but a little less than ideal. There are some clear nonlinearities in the data. At the individual-subject level, some responses look linear, others look S-shaped, and others look curvilinear. It would be nice to capture this in the model.
glmerframes <- frames %>% mutate(generalisation = (response+.1)/9.2) %>% mutate(id=factor(id))
logitmod <- glmer(
formula = generalisation ~ condition + test_item + n_obs + (1 + test_item + n_obs|id),
family = gaussian(link = "logit"),
data = glmerframes)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 1.71494 (tol = 0.001, component 1)
For comparison, let’s fit this model with a linear link function (i.e., same model as last time, but using the rescaled data)
linearmod <- lmer(
formula = generalisation ~ condition + test_item + n_obs + (1 + test_item + n_obs|id),
data = glmerframes)
Comparing models?
anova(linearmod, logitmod)
## refitting model(s) with ML (instead of REML)
## Data: glmerframes
## Models:
## linearmod: generalisation ~ condition + test_item + n_obs + (1 + test_item +
## linearmod: n_obs | id)
## logitmod: generalisation ~ condition + test_item + n_obs + (1 + test_item +
## logitmod: n_obs | id)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## linearmod 11 -1237.9 -1166.9 629.96 -1259.9
## logitmod 11 -2741.5 -2670.4 1381.73 -2763.5 1503.5 0 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, all three criteria (AIC, BIC, \(p\)-values if you absolutely must) lead to the same conclusion, namely that the nonlinear model provides the better account of people’s responses. As usual, we should take a good look at what the model is actually doing before we accept anything:
glmerframes$modelfit <- predict(logitmod, type="response")
glmerframes$residuals <- residuals(logitmod, type="response")
glmerframes %>%
filter(id %in% whichids) %>%
ggplot(aes(x = test_item, y = generalisation, shape = condition, colour = factor(n_obs))) +
geom_point() +
geom_line(aes(y = modelfit), alpha = .3) +
facet_wrap(~ id)
Okay, at long last I’m “happy”. It’s not a perfect model but it’s good enough that we can use it to test for “effects”:
summary(logitmod)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: gaussian ( logit )
## Formula: generalisation ~ condition + test_item + n_obs + (1 + test_item +
## n_obs | id)
## Data: glmerframes
##
## AIC BIC logLik deviance df.resid
## -2741.5 -2670.4 1381.7 -2763.5 4714
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.2697 -0.4241 0.0384 0.4929 3.4459
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 2.216628 1.48883
## test_item 0.145355 0.38125 -0.86
## n_obs 0.008568 0.09256 -0.56 0.25
## Residual 0.033600 0.18330
## Number of obs: 4725, groups: id, 225
##
## Fixed effects:
## Estimate Std. Error t value Pr(>|z|)
## (Intercept) 4.01522 0.30558 13.140 <2e-16 ***
## conditionproperty -0.48466 0.18185 -2.665 0.0077 **
## test_item -0.99181 0.08175 -12.133 <2e-16 ***
## n_obs -0.01351 0.01551 -0.871 0.3837
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) cndtnp tst_tm
## cndtnprprty -0.346
## test_item -0.848 0.018
## n_obs -0.556 0.081 0.332
## convergence code: 0
## Model failed to converge with max|grad| = 1.71494 (tol = 0.001, component 1)
It’s taken us a while, but I think we’re now at the stage where we could reasonably claim that there are genuine effects of test item and sampling condition. It’s less clear whether there is an effect of sample size.
Remember this?
frames_avg %>%
ggplot(aes(x = test_item, y = response, colour = condition)) +
geom_point() +
geom_line() +
facet_wrap(~sample_size)
Why am I fitting this model without interaction terms? Just eyeballing the mean response it’s really clear that there is a three-way interaction here. The difference between property sampling and category sampling increases as a function of sample size, but only for the distant test items.
The answer:
kitchensink <- glmer(
formula = generalisation ~ condition * test_item * n_obs + (1 + test_item * n_obs|id),
family = gaussian(link = "logit"),
data = glmerframes)
## Warning in (function (fn, par, lower = rep.int(-Inf, n), upper = rep.int(Inf, :
## failure to converge in 10000 evaluations
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 1.05812 (tol = 0.001, component 1)
Yeah, this is not a happy place to be. Once your model gets this complicated, be prepared for bad things to happen. Nevertheless, let’s take a quick check and see if this model is better:
anova(logitmod, kitchensink)
## Data: glmerframes
## Models:
## logitmod: generalisation ~ condition + test_item + n_obs + (1 + test_item +
## logitmod: n_obs | id)
## kitchensink: generalisation ~ condition * test_item * n_obs + (1 + test_item *
## kitchensink: n_obs | id)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## logitmod 11 -2741.5 -2670.4 1381.7 -2763.5
## kitchensink 19 -3950.7 -3828.0 1994.3 -3988.7 1225.2 8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yep. Even though we’re having horrible problems getting the bloody thing to converge (so our estimates are probably all a bit wrong), it’s still providing a vastly superior fit to the last one. Well, let’s take a look…
glmerframes2 <- glmerframes
glmerframes2$modelfit <- predict(kitchensink, type="response")
glmerframes2$residuals <- residuals(kitchensink, type="response")
glmerframes2 %>%
#filter(id %in% whichids) %>%
ggplot(aes(x = test_item, y = generalisation, shape = condition, colour = factor(n_obs))) +
geom_point() +
geom_line(aes(y = modelfit), alpha = .3) +
facet_wrap(~ id)
Yeah, that’s actually better. The eyeball test again agrees with AIC, BIC and even the humble \(p\)-value. This model does a much better job of fitting the data in the experiment.
So now let’s use the summary()
function to take a look at the model coefficients and the “standard” tests of significance:
summary(kitchensink)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: gaussian ( logit )
## Formula: generalisation ~ condition * test_item * n_obs + (1 + test_item *
## n_obs | id)
## Data: glmerframes
##
## AIC BIC logLik deviance df.resid
## -3950.7 -3828.0 1994.4 -3988.7 4706
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.8127 -0.3859 0.0558 0.4859 3.3474
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 1.859838 1.36376
## test_item 0.129752 0.36021 -0.88
## n_obs 0.045449 0.21319 -0.40 0.34
## test_item:n_obs 0.004104 0.06406 0.19 -0.25 -0.87
## Residual 0.026721 0.16347
## Number of obs: 4725, groups: id, 225
##
## Fixed effects:
## Estimate Std. Error t value Pr(>|z|)
## (Intercept) 1.85672 0.32072 5.789 7.07e-09 ***
## conditionproperty -0.01594 0.40223 -0.040 0.968380
## test_item -0.41441 0.09195 -4.507 6.57e-06 ***
## n_obs 0.40408 0.05207 7.761 8.43e-15 ***
## conditionproperty:test_item -0.03293 0.11816 -0.279 0.780497
## conditionproperty:n_obs 0.10659 0.05717 1.865 0.062251 .
## test_item:n_obs -0.10628 0.01871 -5.681 1.34e-08 ***
## conditionproperty:test_item:n_obs -0.07719 0.02095 -3.685 0.000229 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) cndtnp tst_tm n_obs cndtnprprty:t_ cndtnprprty:n_ tst_:_
## cndtnprprty -0.613
## test_item -0.898 0.557
## n_obs -0.140 -0.009 0.098
## cndtnprprty:t_ 0.544 -0.893 -0.617 0.042
## cndtnprprty:n_ 0.023 -0.147 0.003 -0.576 0.094
## tst_tm:n_bs -0.095 0.159 0.054 -0.886 -0.142 0.544
## cndtnpr:_:_ 0.151 -0.108 -0.124 0.536 0.073 -0.856 -0.673
## convergence code: 0
## Model failed to converge with max|grad| = 1.05812 (tol = 0.001, component 1)
## failure to converge in 10000 evaluations
Uh huh. What the hell does any of this mean??!? I mean, it’s the same output we’ve seen at every previous step in the process, but now it just looks like an arbitrary collection of asterisks and numbers.
To my mind, the only part of this output that matters is this one line, highlighting the fact that yes, once you put together a proper model that accounts for individual differences and can match the structure of the data reasonably well, there is indeed a three way interaction. Everything matters (except totally irrelevant stuff like age and the colour of the wallpaper) and everything interacts with everything else:
## conditionproperty:test_item:n_obs -0.07719 0.02095 -3.685 0.000229 ***
If you want to get more out of the data than this, statistics won’t help you any more. You’re going to have to try doing some psychology, I’m afraid…
If I were exploring the data without the guide of substantive theory and with no proper cognitive model to tell me where to look next, I would be very wary of going any further. Blindly trying to make sense of three-way interaction effects is a terrible idea, and you’ll end up chasing shadows. Nothing – and I repeat, NOTHING – in what we have done so far, is “theory”. Yes, what we have here is a “model” in the sense that statisticians refer to a model, but it is not a cognitive model in the sense that any psychologist would care about. It is not constrained by any notion of how people are solving the reasoning problem. There is no substance here: it is purely data analysis. One of the big traps that psychologists have fallen for, time and time again over the last century, is believing that statistical models can provide a substitute for theory. Call it the psychometric fallacy if you will.
This tutorial has included a number of different analyses, and you might be confused at this point about which (if any) of them should be included when writing up the experiment for publication. For some thoughts on this issue see chdss2019_content/samplingframes/analysis/analysis_samplesize.Rmd