-
-
Notifications
You must be signed in to change notification settings - Fork 101
Description
Hi,
so I was playing around with mixed Poisson models and noticed that for large numbers of random coefficients (also called random effects, i guess) the dispersion ratio from check_overdispersion
was too low.
I used code from the Ben Bolker GLMM-FAQ (https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion) and modified it for 200 random coefficients:
library(lme4)
set.seed(101)
d <- data.frame(x=runif(1000),
f=factor(sample(1:200,size=1000,replace=TRUE))) # modified for more random effects
suppressMessages(d$y <- simulate(~x+(1|f), family=poisson,
newdata=d,
newparams=list(theta=1,beta=c(0,2)))[[1]])
m1 <- glmer(y~x+(1|f),data=d, family=poisson)
#overdisp_fun(m1)
performance::check_overdispersion(m1)
Now this should be a perfect model, but the dispersion ratio is only about 0.8 not 1. Looking deeper the assumed residual degrees of freedom are 997, which is widely inconsitent with the result of 814
x <- performance::check_overdispersion(m1)
x$residual_df
pchisq(x$chisq_statistic, df = x$residual_df)
Now I assumed, that the issue is that overdispersion is a phenomenon on the conditional residuals, as the random coefficients have a scaling factor. So they act as fixed coefficients in this analysis and should be subtracted from the degrees of freedom which gives 200 RC the intercept and x makes 798 df. Checking with 200 simulations you have to modify Ben Bolkers code to avoid the issue with lambda < 5, but you get results that look pretty consistent
res <- replicate(200, {
d <- data.frame(x=runif(1000) + log(5), # small lambda(<5) slightly distort the result
f=factor(sample(1:200,size=1000,replace=TRUE)))
suppressMessages(d$y <- simulate(~x +(1|f) , family=poisson,
newdata=d,
newparams=list(theta=1,beta=c(0,2)))[[1]])
m1 <- glmer(y~x+(1|f),data=d, family=poisson)
performance::check_overdispersion(m1)[[1]]
})
# doesn't fit df = 997
qqplot(res, rchisq(10000, 997))
abline(0, 1)
# does fit df = 798
mean(res)
t.test(res, mu = 798)
qqplot(res, rchisq(10000, 798))
abline(0, 1)
Now i wanted to make sure i got it right and used a much larger simulation and here the chisq seem to have a mean closer to 799 which confuses me, but is perhaps explained by remaining imprecision of the pearson residual:
library(parallel)
# 20,000 simulations on 10 cores, took 6 minutes on my machine
res_list <- mclapply(1:20000, function(i) {
d <- data.frame(x=runif(1000) + log(5), # small lambda(<5) slightly distort the result
f=factor(sample(1:200,size=1000,replace=TRUE)))
suppressMessages(d$y <- simulate(~x +(1|f) , family=poisson,
newdata=d,
newparams=list(theta=1,beta=c(0,2)))[[1]])
m1 <- glmer(y~x+(1|f),data=d, family=poisson)
performance::check_overdispersion(m1)[[1]]
}, mc.cores = 10)
res <- unlist(res_list)
mean(res)
t.test(res, mu = 798)
qqplot(res, rchisq(10000, 798))
abline(0, 1)
The same problem exists in the overdisp_fun
from the GLMM-FAQ. Do you think I should post there as well?
Session info:
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] performance_0.8.0 lme4_1.1-27.1 Matrix_1.2-18
loaded via a namespace (and not attached):
[1] minqa_1.2.4 MASS_7.3-51.6 compiler_3.6.3 tools_3.6.3 Rcpp_1.0.7 splines_3.6.3 nlme_3.1-147
[8] grid_3.6.3 insight_0.14.5 nloptr_1.2.2.3 boot_1.3-25 lattice_0.20-41