Wednesday, November 2, 2011

Exchangeability

When writing chapter 10 of Environmental and Ecological Statistics with R (the use of the lme4 package to fit multilevel models), I had a difficult time in explaining the exchangeability assumption, which is needed in using a multilevel model as in the EUSE (effects of urbanization on stream ecosystems) example. I opted to skip most mathematical treatment of the assumption and side-stepped the issue in the examples. But I always believe the the assumption is more than a simple mathematical treatment. Gelman et al (2003) interpreted the assumption as a model for ignorance (which is very intriguing). Bernardo (1995) suggests that the exchangeability concept forces a Bayesian model ("the representation theorems for exchangeable sequences of random variables establish that any coherent analysis of the information thus modelled requires the specification of a joint probability distribution on all the parameters involved, hence forcing a Bayesian approach.").

My question was always the practical implication of violating the assumption when fitting a model. In revising a recent paper, I must address the request from a reviewer to explain the exchangeability assumption  (which was initially buried in a citation). The second paragraph in Section 2.2 of the paper was added.  I came to a better understanding of the assumption as a result of revising the paper. Not to brush aside something that I can't give a satisfactory explanation is obviously beneficial. 

The exchangeability assumption is a generalization of the most commonly used assumption in classical statistics: i.i.d. The assumption that observations were independent identically distributed random variables makes the formulation of a likelihood function a simple and straightforward process, that is, we can simply multiply the densities of individual observations to form the likelihood function. Checking the conformity with the assumption is, however, another issue. The assumptions (both i.i.d. and exchangeability) are associated with the model we propose. When we fit a regression model we always check the residuals to see if they are i.i.d. normal with mean 0 and a constant variance.  Non-conformity with the assumption often suggests a wrong model was used.  The i.i.d. assumption is verified after a model is fit, usually with graphs. How about the assumption that observations are exchangeable? I assume that we should check for the assumption

Friday, September 30, 2011

Reproduceable Results

Over a year ago, Duke was in the spotlight for academic fraud involving some biologists misuse statistics and fabricate results. The main culprit resigned and now practicing medicine in South Carolina. Duke University apparently does not agree that their rising star was a fraud, as the Duke cancer research head wrote a very positive letter of recommendation for the guy.

One clue that something was wrong was that no one can reproduce the results the Duke team published in Nature. Unable to reproduce a result in a published paper is such a common phenomenon in ecological studies because no one can repeat a costly experiment or even obtain the data used in a paper. Even when people put code and data as a supplement to a paper, rarely reviewers check these materials. In order for a reviewer to check the work of a submitted paper, we should ask the author to provide something like a Sweave file, code plus documentation and data. This is why I came back to Sweave today to prepare my report using Sweave so that those interested can repeat the work I did.

During the summer, I read a paper published in the journal Methods in Ecology and Evolution (1:25-37, 2010), advocating the use of a program called "TITAN" for detecting and estimating community threshold using species compositional data. In one example, the authors studied the effect of urbanization in a watershed on the biodiversity in stream using data from multiple watersheds in Maryland. The conclusion that a mere 1 to 2% urban land cover in a watershed can result in a dramatic shift in biodiversity in streams is highly suspicious as the measurement error of land cover (as a percentage of total land area in a watershed) can be very high (5 to 10%). Subsequent papers by the same authors also reached similar conclusions (very small urban land cover will lead to large changes in aquatic ecosystem biodiversity).

A careful examination of the code, I realized that the statistics behind the method was wrong. The mistake is not obvious in the description of the method, but should have been detected if the reviewers were critical enough to try the method with a simple simulation.  I and a colleague conducted an extensive simulation study and we found that TITAN cannon detect known thresholds in simulated data, unless the threshold is clearly a step function noticeable without using a computer. The effort took us several weeks. Reviewers of this paper should have the code and data set for one example. But the TITAN authors included a bootstrapping procedure that made running the program time consuming. As a result, I suspect that reviewers of the manuscript never ran the code. If they did, they would have discovered that TITAN will produce different estimates every time the model is executed using the same data. That is, the reviewers would likely to see a different result from those in the manuscript.

Thursday, September 29, 2011

Sweave

I used Sweave a while ago. It seems to be useful when preparing answer keys for my class. I stopped using it because I didn't know how to control the figure size (as in pdf(..., width=4)).  (I now know that I can use the width= and height options in Sweave.  But for some reason, the output is not exactly what I wanted.)

I picked it up again today so that I can document a consulting project without handing out two files.  When I used Sweave last time, I used it on a Mac. Today, I use a PC with texlive. When compiling the resulting .tex file, the message "Sweave.sty not found" appeared. After a few moment, I realized that the default for "stylepath" in R is now FALSE. But when switching it back to TRUE, the path in the resulting .tex file is

\usepackage{c:/PROGRA~1/R/R-213~1.1/share/texmf/tex/latex/Sweave}

which is not recognized by tex or latex as a proper path. Initially, I copied the folder to a different place (e.g., c:/texlive/Sweave) and changed the path in the resulting .tex file. But a better way is probably to reinstall R in a folder that does not have space in its name, so that I don't have to modify the generated .tex file every time the .Rnw file is updated. By moving R (e.g. to C:/R-2.13.1), the "site-start.el" file needs to be modified so that Emacs-ESS can find R.

Thursday, March 3, 2011

The PCB in Fish Example: Simple Linear Regression Model



The PCB in Fish Example: Simple Linear
Regression Model



1 Data


Data used here are PCB concentrations in lake trout collected by the Wisconsin Department of Natural Resources from 1974 to 2003 (Figure 1). The PCB concentration – fish size relationship (Figure 2)
represents the biological accumulation of PCB over time, as a larger fish is likely to be older.

Figure 1: Temporal trend of fish tissue PCB concentrations – PCB concentrations in lake trout from Lake Michigan decline over time, but shown a stabilizing trend in the last few years.



Figure 2: Fish tissue PCB concentrations vs. fish length – Large fish tend to have higher PCB concentrations in lake trout from Lake Michigan.

2 Regression with One Predictor


The first order rate model suggests that a simple
linear regression be used for assessing a temporal trend is a log linear model:

[;\log(PCB) = \beta_0 + \beta_1Year + \varepsilon;]                                            (1)

The model coefficients [;\beta_0,\beta_1;] are estimated using the least squares method, implemented in R function lm():

lake.lm1 <- lm(log(pcb) ~ year, data=laketrout)
display(lake.lm1, 3)
 
lm(formula = log(pcb) ~ year, data = laketrout)
(Intercept) 119.8467  10.9689
year         -0.0599   0.0055
---
n = 631, k = 2
residual sd = 0.8784, R-Squared = 0.16

The estimated [;\beta_0;] (the intercept) is 119.85 and the estimated [;\beta_1;] (the slope) is -0.06. With these two coefficients, we can calculate the mean log PCB concentration for a given year: [;\beta_0+\beta_1 year;]. The estimated residual standard deviation of 0.8784 describes the variability or uncertainty. When putting the two parts together, the fitted model can be seen as a conditional normal distribution describing the probability distribution of log PCB concentrations. For example, the estimated log PCB distribution for year 1974 is [;N(\beta_0+\beta_1 \times 1974, 0.88);] or [;N(1.60,0.88);].


3 Model Interpretation

3.1 Centering the Predictor


The intercept of a simple regression model is the expected value of the response variable when the predictor is 0. For this model, we don’t believe that the model can be extrapolated to year 0. Consequently, the intercept cannot be interpreted to have any physical meaning. However, if the model is refit with using [;yr=year-1974;] as the new predictor, the new intercept is 1.66, the mean log PCB concentration of 1974. The transformation [;yr=year-1974;], a linear transformation, does not change the fitted model, but the resulting intercept is easier to interpret.

3.2 Slope


The slope is the change in log PCB for a unit change in year. Because the response variable is log PCB concentration, a change of [;\beta_1;]in the logarithm scale is a change of factor of [;e^{\beta_1};] in the original scale. That is, the initial year (1974) concentration is [;PCB_{1974} = e^{1.60}e^{\varepsilon};]. The second year (1975) PCB concentration is [;PCB_{1975}=e^{1.60-0.06 \cdot 1}e^{\varepsilon}=e^{1.60e^{\varepsilon}e^{0.06};] , or [;P CB_{1975}= P CB_{1974}e^{-0.06};]. Given [;e^{-0.06} \approx 1 - 0.06;], the 1975 concentration is approximately 6% less than the 1974 concentration. The slope is the annual rate of reduction.

3.3 Residuals


The residual or model error term [;\varepsilon;] describes the variability of individuals. For this model, the estimated residual standard deviation is 0.87. When interpreting the fitted model in the original scale of PCB concentration, the predicted PCB concentration has a log normal distribution with log mean [;1.6-0.06\cdot yr;]and log standard deviation 0.88. This model suggests that the middle 50% of the PCB concentrations in 1974 will be bounded between [;qlnorm(c(0.25,0.75),1.60,0.88);] or (2.74, 8.97) mg/kg, and the middle 95% of the concentration values are bounded by (0.88, 27.79) mg/kg. The estimated mean concentration in 1974 is [;e^{1.6+0.88/2}=7.3;] mg/kg, and the estimated standard deviation is[;e^{1.6+0.88^2/2}\sqrt{e^{0.88^2}-1} = 7.89;], or [;\sqrt{e^{0.88^2}-1}=1.081;], 1.081 times of the mean (i.e., the coefficient of variation cv = 1.081).

The model can be summarized graphically as in Figure 3.

Figure 3: Simple linear regression of the PCB example – PCB concentration
data are plotted against year. The simple linear regression resulted in highly
uncertain predictions. The solid line is the predicted mean PCB concentration
and the dashed lines are the middle 95% intervals.





Wednesday, March 2, 2011

The PCB in Fish Example: Multiple Regression -- Interaction

When fitting the multiple regression model with yr and len.c as the predictors, an important assumption is that the effect of year (the slope of year) is not affected by the size of the fish and the effect of fish size (the slope of length) is the same throughout the study period. This is the additive-effect assumption imposed on a multiple regression model. Is this assumption reasonable? Madenjian et al. [1998] reported that small lake trout ([;<40;] cm) eat small alewives (Alosa pseudoharengus, which have an average PCB concentration of 0.2 mg/kg), intermediate-size lake trout (40 ~ 60 cm) eat alewives and rainbow smelt (Osmerus mordax, whose PCB concentrations ranged from 0.2 to 0.45 mg/kg) and large lake trout (60 cm) eat large alewives (with an average PCB concentration of 0.6 mg/kg). On the one hand, because larger fish tend to consume food with higher concentrations of PCB, its reduction over time should be slower than the rate of reduction of small fish. On the other hand, because PCB was banned in the 1970s, the natural reduction of PCB through microbiological metabolism resulted in the overall reduction of PCB concentration in the environment and in fish. We expect that the PCB – length relationship will change over time. In other words, the slope of year in the multiple regression model is expected to change with the size of a fish and the slope of length is expected to change over time. To model this “interaction” effect, we add a third predictor, the product of yr and len.c in the model:

#### R code ####
lake.lm4 <- lm(log(pcb) ~ I(year-1974)*len.c, data=laketrout)
display(lake.lm4, 4)
 
#### R output ####
lm(formula = log(pcb) ~ I(year - 1974)*len.c, data = laketrout)
                     coef.est coef.se
(Intercept)           1.8967   0.0465
I(year - 1974)       -0.0873   0.0036
len.c                 0.0510   0.0038
len.c:I(year - 1974)  0.0008   0.0003
---
n = 631, k = 4
residual sd = 0.5520, R-Squared = 0.67

When the interaction term [;len.c:I(year - 1974);] is included, the model is expressed as:

[;\log(P CB) = 1.89 - 0.087yr + 0.051Len.c + 0.00085yr \cdot Len.c + \varepsilon;]

                                       (1)

Because of the product term, the model is no longer a linear model. The slopes of centered length (len.c) and year (yr) are no longer constant. We can rearrange the model to understand the interaction effect. First, the interaction term is grouped with yr:

[;\log(P CB) = 1.89 + (-0.087 + 0.00085Len.c)yr + 0.051Len.c + \varepsilon;]

That is, the effect (or slope) of [;yr;] is now a function of [;Len.c;]. The slope shown (-0.087) is the slope when [;Len.c = 0;] or the year effect for an average sized fish. When the fish size is 10 cm above average, the yr effect is -0.087 + 0.00085 10 = -0.0785. In other words, not only a larger fish has a higher PCB concentration on average, PCB in a larger fish tend to dissipate at a lower rate. This interpretation is true only when we are comparing same-sized fish over time. So, when comparing fish of the average length (Len.c = 0), the annual rate of dissipation is 8.7%. The annual dissipation rate is 7.6% for fish with a size 10 cm above average. When examining the log(PCB) fish length relationship, the model can be rearranged to be:

[;\log(PCB) = 1.89+(0.051+0.00085yr)Len.c-0.087yr+\varepsilon;]

The relationship is still linear for any given year. But the slope changes over time. Initially, (yr = 0 or 1974), the size effect is 0.051. Each unit (1 cm) increase in size will result in a 5.1% increase in PCB concentration.  Ten years later (1984), the slope was 0.051 + 0.00085 10 = 0.0595. The size effect is stronger. This is reasonable because the rate of concentration decreasing for a large fish is smaller than the rate for a small fish. Consequently, the difference in concentration between the same two fish increases over time.

The interaction effect is small (albeit statistically significant). Can this small interaction effect be practically significant? Because the response variable is in logarithmic scale, we need to be careful in interpreting a small effect. For the slope of yr, the slope value for a small fish (-6.7 cm below average, or the first quartile) is 0.09 - 0.00085 × (-6.7) = 0.095 and the slope is 0.09 - 0.0008 × (8.5) = 0.083 for a large fish (8.5 cm above average, the third quartile). PCB concentration reduction is at a lower rate (~ 8%) for a large fish and a higher rate (~ 10%) for small fish. The slope of len.c increases from 0.05 in 1974 to 0.074 in 2004, indicating a much larger difference in PCB concentration between a large and a small fish.

References


C.P. Madenjian, R.J. Hesselberg, T.J. Desorcie, L.J. Schmidt, Stedman. R.M., L.J. Begnoche, and D.R. Passino-Reader. Estimate of net trophic transfer efficiency of PCBs to Lake Michigan lake trout from their prey. Environmental Science and Technology, 32:886–891, 1998.




Log or not log

LOGorNOTLOG.html Log or not log, that is the question May 19, 2018 In 2014 I taught a special topics class on statistical i...