Tuesday, October 16, 2012

To Binomial or To Poisson? That is the Question



Modeling pathogenic protozoa such as Cryptosporidium oocysts in water is a problem of modeling count data. An important aspect of the problem is to understand the method detection probability. The currently used method for detecting and quantifying Cryptosporidium oocysts in a water sample has an average recovery rate of about 40%. That is, the probability of detecting an oocyst in a water sample is about 0.4 when the oocyst is present. The recovery rate is quantified by having labs analyze water samples spiked with a known number of oocysts. In the recent US EPA study, 50 certified labs were given several spiked samples, and their recoveries were calculated as the number of oocysts identified divided by the number of oocysts spiked. The number of oocysts spiked is typically about 100, and the counting process has a standard deviation of 2 to 3 oocysts. Given the relatively accurate total counts, we generally use a binomial model:
$$y_{ij} \sim \mathrm{Bin}(p_{ij}, N_{ij})$$
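where the index $ij$ represents the $i$th observation from lab $j$, $y_{ij}$ is the number of oocysts detected, and $N_{ij}$ is the number spiked. A simple multilevel model is then to model the detection probability as lab specific:
$$y_{ij} \sim \mathrm{Bin}(p_j, N_{ij})$$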
and to impose a common prior on the logits of the lab-specific detection probabilities:
$$\mathrm{logit}(p_j) \sim N(\mu, \tau^2)$$
This model assumes that detection probability varies by lab and lab-specific detection probabilities are exchangeable.
There are two problems with the model. The first problem is a numerical one. The likelihood of a binomial is of the form $p^y(1-p)^{n-y}$, which can be numerically challenging to evaluate in some situations (for example, the product can underflow when the counts are large). This is, obviously, a problem that can be addressed in programming. The other problem is that the total ($N_{ij}$) is not observed exactly. When using the binomial, this uncertainty is ignored. Because ignored uncertainty does not vanish, it shows up as inflated uncertainty in the estimated model parameters. In this case, the only parameter we have is the binomial mean $p$.
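As a minimal sketch of the programming fix for the first problem, the binomial likelihood can be evaluated on the log scale. The function name `binom_loglik` and the counts below are illustrative assumptions, not part of the EPA study.

```python
import math

def binom_loglik(y, n, p):
    """Log of the binomial pmf, computed entirely on the log scale."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # log of the binomial coefficient via log-gamma, avoiding huge factorials
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + y * math.log(p) + (n - y) * math.log1p(-p)

# Hypothetical large counts: the raw product p^y (1 - p)^(n - y) underflows,
# while the log-scale version stays finite.
naive = 0.4 ** 4000 * 0.6 ** 6000
stable = binom_loglik(4000, 10000, 0.4)
print(naive, stable)
```

With counts near 100, as in the spiked samples, the raw product is usually still representable, so the log scale matters mostly when many observations are multiplied together.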
Section 1.2.5 of Agresti [2002] discusses the relationship between multinomial and Poisson. Simplifying the discussion to a binomial situation, we have the following.
1. We use $Y_1$ to denote the vector of the number of detected oocysts and $Y_2$ the number of missed oocysts. The vector of the total number of oocysts is $Y_1 + Y_2$.
2. We model $Y_1$ and $Y_2$ as two independent Poisson random variates with means $\lambda_1$ and $\lambda_2$, respectively.
3. If $Y_1$ and $Y_2$ are independent Poisson random variables, the sum of $Y_1$ and $Y_2$ is also a Poisson random variable with mean $\lambda_1 + \lambda_2$.
4. In the binomial model the sum of $Y_1$ and $Y_2$ is given, so the joint distribution of $Y_1$ and $Y_2$ must be conditional on the sum:
$$P(Y_1 = y_1, Y_2 = y_2 \mid Y_1 + Y_2 = n) = \frac{P(Y_1 = y_1)\,P(Y_2 = y_2)}{P(Y_1 + Y_2 = n)}, \qquad y_1 + y_2 = n,$$
which is shown [Agresti, 2002] to be a binomial distribution characterized by $\pi_1 = \lambda_1/(\lambda_1 + \lambda_2)$ and $\pi_2 = \lambda_2/(\lambda_1 + \lambda_2) = 1 - \pi_1$.
In other words, the binomial distribution is the conditional distribution of one of two independent Poisson variates given their sum. Or, we can model the binomial count data as two independent Poisson random variables and derive the binomial parameter.
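The conditioning argument can be checked numerically. The sketch below compares the conditional Poisson probability with the binomial probability for every possible detected count; the means 40 and 60 and the helper names are assumptions chosen for illustration.

```python
import math

def poisson_pmf(k, lam):
    # Poisson probability computed via logs for numerical stability
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def conditional_pmf(y1, n, lam1, lam2):
    # P(Y1 = y1, Y2 = n - y1 | Y1 + Y2 = n) for independent Poisson Y1, Y2
    joint = poisson_pmf(y1, lam1) * poisson_pmf(n - y1, lam2)
    return joint / poisson_pmf(n, lam1 + lam2)

# Illustrative values (assumed): 100 spiked oocysts, 40% recovery
lam1, lam2, n = 40.0, 60.0, 100
pi1 = lam1 / (lam1 + lam2)

max_diff = max(
    abs(conditional_pmf(y1, n, lam1, lam2) - binom_pmf(y1, n, pi1))
    for y1 in range(n + 1)
)
print(max_diff)  # differences are at the level of floating-point rounding
```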
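Since we know the total ($N$), we can build it into the Poisson mean. For example, we can parameterize the Poisson mean of the detected number of oocysts as the product of the spiked total ($N$) and the detection probability ($p_1$):
$$\lambda_1 = N p_1, \qquad \log \lambda_1 = \log N + \log p_1.$$
Likewise, the model for $\lambda_2$ is $\log \lambda_2 = \log N + \log p_2$, and $\log N$ is used as an offset in both models. Given that $p_1 + p_2 = 1$, the estimated binomial mean is
$$\pi_1 = \frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{N p_1}{N p_1 + N p_2} = p_1.$$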
Only one Poisson model is needed. Modeling using Poisson with the total counts as the offset is the same as modeling using binomial.
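A small sketch of this equivalence for the simplest case, a single common detection probability: fitting an intercept-only Poisson model with $\log N$ as the offset by Newton's method gives the same estimate as the pooled binomial proportion. The counts below are made up for illustration, and `fit_poisson_offset` is a hypothetical helper, not a library function.

```python
import math

def fit_poisson_offset(detected, spiked, iterations=50):
    """Intercept-only Poisson regression, log(lambda_i) = log(N_i) + beta."""
    beta = 0.0
    total_y, total_n = sum(detected), sum(spiked)
    for _ in range(iterations):
        fitted = math.exp(beta) * total_n   # sum of fitted Poisson means
        score = total_y - fitted            # derivative of the log-likelihood
        beta += score / fitted              # Newton step (information = fitted)
    return math.exp(beta)                   # back-transform to a probability

# Hypothetical spiked-sample results (not data from the EPA study)
spiked = [100, 98, 103, 101, 99]
detected = [41, 38, 44, 37, 42]

p_poisson = fit_poisson_offset(detected, spiked)
p_binomial = sum(detected) / sum(spiked)    # pooled binomial estimate
print(p_poisson, p_binomial)
```

In practice one would use a GLM routine with an offset term rather than hand-coded Newton steps; the loop is written out here only to show that the two estimates coincide.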
By using a Poisson model, we can add a multiplicative error term (e.g., a log-normal term, $\mathrm{LN}(0, \sigma^2)$) to account for the uncertainty in the number of spiked oocysts, and/or other sources of over-dispersion.
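To see what the multiplicative error term does, here is a simulation sketch under assumed values (100 spiked oocysts, detection probability 0.4, log-scale standard deviation 0.3). It compares the variance-to-mean ratio of plain Poisson counts with that of counts whose mean is multiplied by a log-normal term. The sampler `rpois` is a simple hand-written routine, not a library call.

```python
import math
import random

def rpois(rng, lam):
    """Draw one Poisson variate (Knuth's method; adequate for moderate lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(n_spiked, p, sigma, size, seed=1):
    rng = random.Random(seed)
    draws = []
    for _ in range(size):
        eps = rng.gauss(0.0, sigma)   # sigma = 0 gives the plain Poisson model
        draws.append(rpois(rng, n_spiked * p * math.exp(eps)))
    return draws

def dispersion(draws):
    """Sample variance divided by sample mean (about 1 for Poisson counts)."""
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
    return var / mean

plain = dispersion(simulate(100, 0.4, 0.0, 5000))
noisy = dispersion(simulate(100, 0.4, 0.3, 5000))
print(plain, noisy)  # the ratio with the error term is well above 1
```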

References

   A. Agresti. Categorical Data Analysis. Wiley, 2002.

1 comment:

Song Qian said...

I have been using Poisson with offset for binomial data for a long time. But I forgot where I learned it initially. I would appreciate a citation of the original work.
