Thursday, March 3, 2011

The PCB in Fish Example: Simple Linear Regression Model






1 Data


Data used here are PCB concentrations in lake trout collected by the Wisconsin Department of Natural Resources from 1974 to 2003 (Figure 1). The PCB concentration – fish size relationship (Figure 2)
represents the biological accumulation of PCB over time, as a larger fish is likely to be older.

Figure 1: Temporal trend of fish tissue PCB concentrations – PCB concentrations in lake trout from Lake Michigan declined over time, but have shown a stabilizing trend in the last few years.



Figure 2: Fish tissue PCB concentrations vs. fish length – Among lake trout from Lake Michigan, larger fish tend to have higher PCB concentrations.

2 Regression with One Predictor


The first-order rate model suggests that the simple linear regression used for assessing a temporal trend should be a log-linear model:

[;\log(PCB) = \beta_0 + \beta_1Year + \varepsilon;]                                            (1)
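
As a brief sketch of where equation (1) comes from (the symbols [;C;], [;C_0;], and [;k;] are introduced here for illustration only and are not part of the original notation): a first-order rate model assumes that the concentration [;C;] declines at a rate proportional to itself, so that

[;\frac{dC}{dt} = -kC \quad \Rightarrow \quad C(t) = C_0 e^{-kt} \quad \Rightarrow \quad \log C(t) = \log C_0 - kt;]

Under this reading, [;\beta_0;] plays the role of [;\log C_0;] (with time measured in calendar years), [;\beta_1;] plays the role of [;-k;], and [;\varepsilon;] absorbs the fish-to-fish variation that the deterministic rate model ignores.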

The model coefficients [;\beta_0,\beta_1;] are estimated using the least squares method, implemented in R function lm():

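library(arm)  # provides display()
# log-linear model: log PCB concentration regressed on calendar year
lake.lm1 <- lm(log(pcb) ~ year, data=laketrout)
display(lake.lm1, 3)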
 
lm(formula = log(pcb) ~ year, data = laketrout)
(Intercept) 119.8467  10.9689
year         -0.0599   0.0055
---
n = 631, k = 2
residual sd = 0.8784, R-Squared = 0.16

The estimated [;\beta_0;] (the intercept) is 119.85 and the estimated [;\beta_1;] (the slope) is -0.06. With these two coefficients, we can calculate the mean log PCB concentration for a given year: [;\beta_0+\beta_1 \cdot Year;]. The estimated residual standard deviation of 0.8784 describes the variability of individual log PCB concentrations around that mean. Putting the two parts together, the fitted model can be seen as a conditional normal distribution describing the probability distribution of log PCB concentrations in a given year. For example, the estimated log PCB distribution for year 1974 is [;N(\beta_0+\beta_1 \times 1974, 0.88);] or [;N(1.60,0.88);], where the second parameter is the standard deviation.
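
The arithmetic behind the 1974 distribution can be checked with a few lines of R. This is only a sketch that copies the coefficients from the printed output above; the variable names (b0, b1, sigma, mu_1974) are made up for illustration, and the laketrout data are not needed.

```r
# Coefficients copied from the display() output above (not re-estimated here)
b0    <- 119.8467   # intercept
b1    <- -0.0599    # slope (change in log PCB per year)
sigma <- 0.8784     # residual standard deviation

# Mean log PCB concentration implied by the fitted line for 1974
mu_1974 <- b0 + b1 * 1974
print(round(mu_1974, 2))   # 1.6

# The fitted model for 1974 is then N(mu_1974, sigma) on the log scale
print(round(sigma, 2))     # 0.88
```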


3 Model Interpretation

3.1 Centering the Predictor


The intercept of a simple regression model is the expected value of the response variable when the predictor is 0. For this model, we don’t believe that the model can be extrapolated to year 0. Consequently, the intercept cannot be interpreted to have any physical meaning. However, if the model is refit using [;yr=year-1974;] as the new predictor, the new intercept is 1.60, the mean log PCB concentration in 1974. The transformation [;yr=year-1974;], a linear transformation, does not change the fitted model (the slope, the fitted values, and the residuals are identical), but the resulting intercept is easier to interpret.
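
The effect of centering can be illustrated without the original data. The sketch below uses simulated values, not the laketrout data: the data frame sim, its columns, and the parameter values used to generate it are all assumptions chosen only to resemble the example. It fits the model with both predictors and compares the results.

```r
# Simulated stand-in for the real data (hypothetical values, for illustration only)
set.seed(1)
n <- 200
year <- sample(1974:2003, n, replace = TRUE)
log_pcb <- 1.6 - 0.06 * (year - 1974) + rnorm(n, sd = 0.88)
sim <- data.frame(year = year, log_pcb = log_pcb)
sim$yr <- sim$year - 1974          # shifted predictor: years since 1974

fit_raw <- lm(log_pcb ~ year, data = sim)  # intercept refers to year 0
fit_ctr <- lm(log_pcb ~ yr,   data = sim)  # intercept refers to 1974

b_raw <- unname(coef(fit_raw))
b_ctr <- unname(coef(fit_ctr))

# The slope is unchanged by the shift
print(all.equal(b_raw[2], b_ctr[2], tolerance = 1e-6))
# The new intercept is the old fitted line evaluated at year = 1974
print(all.equal(b_ctr[1], b_raw[1] + b_raw[2] * 1974, tolerance = 1e-6))
# Fitted values are the same, so it is the same model
print(all.equal(fitted(fit_raw), fitted(fit_ctr), tolerance = 1e-6))
```

Only the meaning of the intercept changes: it moves from an extrapolation to year 0 to the fitted mean at the start of the record.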

3.2 Slope


The slope is the change in log PCB for a unit change in year. Because the response variable is log PCB concentration, a change of [;\beta_1;] on the logarithmic scale is a change by a factor of [;e^{\beta_1};] on the original scale. That is, the initial year (1974) concentration is [;PCB_{1974} = e^{1.60}e^{\varepsilon};]. The second year (1975) PCB concentration is [;PCB_{1975}=e^{1.60-0.06 \cdot 1}e^{\varepsilon}=e^{1.60}e^{\varepsilon}e^{-0.06};], or [;PCB_{1975}= PCB_{1974}e^{-0.06};]. Given [;e^{-0.06} \approx 1 - 0.06;], the 1975 concentration is approximately 6% less than the 1974 concentration. The slope is therefore the annual rate of change: a reduction of about 6% per year.
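
The approximation [;e^{\beta_1} \approx 1 + \beta_1;] can be checked against the exact factor. The sketch below takes the slope from the printed model output; the names b1, factor, and pct_change are illustrative only.

```r
b1 <- -0.0599                      # slope copied from the model output

factor     <- exp(b1)              # multiplicative change per year on the original scale
pct_change <- 100 * (factor - 1)   # exact percent change per year

print(round(factor, 4))            # 0.9419
print(round(pct_change, 2))        # -5.81
print(round(100 * b1, 2))          # -5.99, the small-slope approximation
```

For a slope this close to zero the two numbers agree to within about a fifth of a percentage point; for larger slopes the exact factor should be used.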

3.3 Residuals


The residual or model error term [;\varepsilon;] describes the variability of individuals. For this model, the estimated residual standard deviation is 0.88. When interpreting the fitted model in the original scale of PCB concentration, the predicted PCB concentration has a log normal distribution with log mean [;1.6-0.06\cdot yr;] and log standard deviation 0.88. This model suggests that the middle 50% of the PCB concentrations in 1974 are bounded by the quartiles qlnorm(c(0.25,0.75),1.60,0.88), or (2.74, 8.97) mg/kg, and the middle 95% of the concentration values are bounded by (0.88, 27.79) mg/kg. The estimated mean concentration in 1974 is [;e^{1.6+0.88^2/2}=7.3;] mg/kg, and the estimated standard deviation is [;e^{1.6+0.88^2/2}\sqrt{e^{0.88^2}-1} = 7.89;] mg/kg, which is [;\sqrt{e^{0.88^2}-1}=1.081;] times the mean (i.e., the coefficient of variation cv = 1.081).
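
These log normal summaries can be reproduced in R from the two rounded parameters alone. The sketch below assumes log mean 1.60 and log standard deviation 0.88 as stated in the text; the variable names are illustrative.

```r
mu    <- 1.60   # log mean for 1974
sigma <- 0.88   # log standard deviation

# Middle 50% and middle 95% intervals on the original (mg/kg) scale
q50 <- qlnorm(c(0.25, 0.75),   meanlog = mu, sdlog = sigma)
q95 <- qlnorm(c(0.025, 0.975), meanlog = mu, sdlog = sigma)
print(round(q50, 2))   # 2.74 8.97
print(round(q95, 2))   # 0.88 27.79

# Mean, coefficient of variation, and standard deviation of a log normal variable
mean_pcb <- exp(mu + sigma^2 / 2)
cv       <- sqrt(exp(sigma^2) - 1)
sd_pcb   <- mean_pcb * cv
print(round(c(mean = mean_pcb, sd = sd_pcb, cv = cv), 3))
```

Note that the mean (about 7.3 mg/kg) is well above the median [;e^{1.60} \approx 4.95;] mg/kg, which reflects the right skew of the log normal distribution.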

The model can be summarized graphically as in Figure 3.

Figure 3: Simple linear regression of the PCB example – PCB concentration
data are plotted against year. The simple linear regression resulted in highly
uncertain predictions. The solid line is the predicted mean PCB concentration
and the dashed lines are the middle 95% intervals.




