check_overdispersion underestimates dispersion in mixed models

Hi,

so I was playing around with mixed Poisson models and noticed that for large numbers of random coefficients (also called random effects,  i guess) the dispersion ratio from `check_overdispersion` was too low. 

I used code from the Ben Bolker GLMM-FAQ (https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion) and modified it for 200 random coefficients:
```
library(lme4)
set.seed(101)  
d <- data.frame(x=runif(1000),
                f=factor(sample(1:200,size=1000,replace=TRUE))) # modified for more random effects
suppressMessages(d$y <- simulate(~x+(1|f), family=poisson,
                                 newdata=d,
                                 newparams=list(theta=1,beta=c(0,2)))[[1]])
m1 <- glmer(y~x+(1|f),data=d, family=poisson)
#overdisp_fun(m1)
performance::check_overdispersion(m1)
```
Now this should be a perfect model, but the dispersion ratio is only about 0.8 not 1. Looking deeper the assumed residual degrees of freedom are 997, which is widely inconsitent with the result of 814
```
x <- performance::check_overdispersion(m1)
x$residual_df
pchisq(x$chisq_statistic, df = x$residual_df)
```
Now I assumed, that the issue is that overdispersion is a phenomenon on the conditional residuals, as the random coefficients have a scaling factor. So they act as fixed coefficients in this analysis and should be subtracted from the degrees of freedom which gives 200 RC the intercept and x makes 798 df. Checking with 200 simulations you have to modify Ben Bolkers code to avoid the issue with lambda < 5, but you get results that look pretty consistent

```
res <- replicate(200, {
  d <- data.frame(x=runif(1000) + log(5), # small lambda(<5) slightly distort the result
                  f=factor(sample(1:200,size=1000,replace=TRUE)))
  suppressMessages(d$y <- simulate(~x +(1|f) , family=poisson,
                                   newdata=d,
                                   newparams=list(theta=1,beta=c(0,2)))[[1]])
  m1 <- glmer(y~x+(1|f),data=d, family=poisson)
  performance::check_overdispersion(m1)[[1]]
})
# doesn't fit df = 997
qqplot(res, rchisq(10000, 997))
abline(0, 1)
# does fit df = 798
mean(res)
t.test(res, mu = 798)
qqplot(res, rchisq(10000, 798))
abline(0, 1)

```
Now i wanted to make sure i got it right and used a much larger simulation and here the chisq seem to have a mean closer to 799 which confuses me, but is perhaps explained by remaining imprecision of the pearson residual:
```
library(parallel)
# 20,000 simulations on 10 cores, took 6 minutes on my machine
res_list <- mclapply(1:20000, function(i) {
    d <- data.frame(x=runif(1000) + log(5), # small lambda(<5) slightly distort the result
                    f=factor(sample(1:200,size=1000,replace=TRUE)))
    suppressMessages(d$y <- simulate(~x +(1|f) , family=poisson,
                                     newdata=d,
                                     newparams=list(theta=1,beta=c(0,2)))[[1]])
    m1 <- glmer(y~x+(1|f),data=d, family=poisson)
    performance::check_overdispersion(m1)[[1]]
}, mc.cores = 10)   

res <- unlist(res_list)
mean(res)
t.test(res, mu = 798)
qqplot(res, rchisq(10000, 798))
abline(0, 1)
```
The same problem exists in the `overdisp_fun` from the GLMM-FAQ. Do you think I should post there as well?

Session info:
```
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] performance_0.8.0 lme4_1.1-27.1     Matrix_1.2-18    

loaded via a namespace (and not attached):
 [1] minqa_1.2.4     MASS_7.3-51.6   compiler_3.6.3  tools_3.6.3     Rcpp_1.0.7      splines_3.6.3   nlme_3.1-147   
 [8] grid_3.6.3      insight_0.14.5  nloptr_1.2.2.3  boot_1.3-25     lattice_0.20-41
```





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

check_overdispersion underestimates dispersion in mixed models #464

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

check_overdispersion underestimates dispersion in mixed models #464

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions