Skip to content

check_overdispersion underestimates dispersion in mixed models #464

@Istalan

Description

@Istalan

Hi,

so I was playing around with mixed Poisson models and noticed that for large numbers of random coefficients (also called random effects, i guess) the dispersion ratio from check_overdispersion was too low.

I used code from the Ben Bolker GLMM-FAQ (https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion) and modified it for 200 random coefficients:

library(lme4)
set.seed(101)  
d <- data.frame(x=runif(1000),
                f=factor(sample(1:200,size=1000,replace=TRUE))) # modified for more random effects
suppressMessages(d$y <- simulate(~x+(1|f), family=poisson,
                                 newdata=d,
                                 newparams=list(theta=1,beta=c(0,2)))[[1]])
m1 <- glmer(y~x+(1|f),data=d, family=poisson)
#overdisp_fun(m1)
performance::check_overdispersion(m1)

Now this should be a perfect model, but the dispersion ratio is only about 0.8 not 1. Looking deeper the assumed residual degrees of freedom are 997, which is widely inconsitent with the result of 814

x <- performance::check_overdispersion(m1)
x$residual_df
pchisq(x$chisq_statistic, df = x$residual_df)

Now I assumed, that the issue is that overdispersion is a phenomenon on the conditional residuals, as the random coefficients have a scaling factor. So they act as fixed coefficients in this analysis and should be subtracted from the degrees of freedom which gives 200 RC the intercept and x makes 798 df. Checking with 200 simulations you have to modify Ben Bolkers code to avoid the issue with lambda < 5, but you get results that look pretty consistent

res <- replicate(200, {
  d <- data.frame(x=runif(1000) + log(5), # small lambda(<5) slightly distort the result
                  f=factor(sample(1:200,size=1000,replace=TRUE)))
  suppressMessages(d$y <- simulate(~x +(1|f) , family=poisson,
                                   newdata=d,
                                   newparams=list(theta=1,beta=c(0,2)))[[1]])
  m1 <- glmer(y~x+(1|f),data=d, family=poisson)
  performance::check_overdispersion(m1)[[1]]
})
# doesn't fit df = 997
qqplot(res, rchisq(10000, 997))
abline(0, 1)
# does fit df = 798
mean(res)
t.test(res, mu = 798)
qqplot(res, rchisq(10000, 798))
abline(0, 1)

Now i wanted to make sure i got it right and used a much larger simulation and here the chisq seem to have a mean closer to 799 which confuses me, but is perhaps explained by remaining imprecision of the pearson residual:

library(parallel)
# 20,000 simulations on 10 cores, took 6 minutes on my machine
res_list <- mclapply(1:20000, function(i) {
    d <- data.frame(x=runif(1000) + log(5), # small lambda(<5) slightly distort the result
                    f=factor(sample(1:200,size=1000,replace=TRUE)))
    suppressMessages(d$y <- simulate(~x +(1|f) , family=poisson,
                                     newdata=d,
                                     newparams=list(theta=1,beta=c(0,2)))[[1]])
    m1 <- glmer(y~x+(1|f),data=d, family=poisson)
    performance::check_overdispersion(m1)[[1]]
}, mc.cores = 10)   

res <- unlist(res_list)
mean(res)
t.test(res, mu = 798)
qqplot(res, rchisq(10000, 798))
abline(0, 1)

The same problem exists in the overdisp_fun from the GLMM-FAQ. Do you think I should post there as well?

Session info:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] performance_0.8.0 lme4_1.1-27.1     Matrix_1.2-18    

loaded via a namespace (and not attached):
 [1] minqa_1.2.4     MASS_7.3-51.6   compiler_3.6.3  tools_3.6.3     Rcpp_1.0.7      splines_3.6.3   nlme_3.1-147   
 [8] grid_3.6.3      insight_0.14.5  nloptr_1.2.2.3  boot_1.3-25     lattice_0.20-41

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions