Monday, June 21, 2010

An example of statistical modeling -- part 1: model formulation

I will regularly post discussions on examples used in Environmental and Ecological Statistics with R (available at Amazon).

This post discusses the general thought process behind the PCB in fish example.  In this first part, I go over the background of the data and the process of establishing the model form.

The data were fish (lake trout) tissue concentrations of PCB measured in fish caught in Lake Michigan from 1970 to 2000.  Each year, a number of fish were sampled from the lake and PCB concentrations were measured in the edible portion of each fish sample.   The main objective of the example is to study the trend in mean PCB fish tissue concentration in the lake.

A modeling project always has the following three steps:

1. Model formulation

2. Parameter estimation, and

3. Model evaluation

A properly formulated model can be based on relevant theory or on exploratory data analysis.  In this case, we start with a general model often used in environmental engineering: the first order reaction model, which states that the rate of change of a chemical concentration is proportional to the concentration itself.  That is, the derivative of concentration with respect to time is proportional to concentration:

[;\frac{dC}{dt} = - k C;]

Solving this differential equation, we have

[;C_t = C_0 e^{-kt};]

where [;C_t;] is the concentration at time [;t;] and [;C_0;] is the concentration at time 0.  Taking the logarithmic transformation of both sides:

[;\log(C_t) = \log(C_0) - k t;]

which suggests a log linear regression model with [;\log(C_t);] as the response variable and [;t;] as the predictor.  The intercept is [;\log(C_0);] and the slope is [;k;].

This process of establishing a linear regression model form through some basic mechanistic relations is common. A common feature of this approach is the aggregation/simplification of mechanistic relations from fine spatiotemporal scales into the scale represented by the data.  For example, the first order model is a reasonable model for a given fish, but there is no possibility of measuring the PCB concentration in the same fish over time.  When using concentration data from multiple fish each year, the resulting relationship describes the average PCB concentration over all fish in Lake Michigan.  Also, because of the simplification and aggregation, model coefficients may now vary due to other factors.  In this example, we will explore changes in [;k;] as a function of fish size.

In summary, we used the first order reaction model as the basis for establishing the basic form of the relationship between PCB fish tissue concentration and time (since PCB was banned in the area).

Tuesday, June 15, 2010

Quantile Regression and a week in Denver

I attended the 3rd USGS Modeling Conference in Denver last week, an event attended mostly by USGS and other DOI scientists. The conference started on Monday with a short course on quantile regression, a topic increasingly mentioned in the ecological literature, especially by those interested in estimating ecological thresholds. In the last year or so, I read a few papers and some applications of the method in the ecological literature. My initial impression was not very positive. To begin with, the original reference on the subject presented the method in terms of an optimization problem. It gave me the impression that quantile regression is what least squares is to the linear model: a computational method without statistics. Many applications, especially those presented at AWRA conferences, are incomprehensible half-baked products. In signing up for the short course, I wanted to learn more before starting to criticize quantile regression.

The instructor, Brian Cade, is a USGS statistician. He started out with a definition and went on with some examples and R code. By the coffee break, I knew how to use R to fit a quantile regression model, but not much else. I wanted to know the meaning of the output and its practical implications. During the second half of the short course, I started to think about these questions. I stopped Brian before he went into more advanced topics (such as Bayesian quantile regression) and asked for a discussion of the examples already covered, especially their practical implications. What can be achieved using quantile regression that cannot be achieved using regular regression?

Here is what I learned from the short course.

1. Quantile regression is a form of nonparametric modeling -- no probabilistic assumption is imposed on the data. Initially, quantile regression was presented as an extension of the linear regression model: instead of modeling the mean of the response, a quantile regression models a given quantile as a linear function of the predictor. The current implementation of QR in R also allows smoothing functions to be used. As a result, QR can be fully nonparametric -- not only is the response variable not limited to a specific distribution, but the model form of a quantile need not be linear either.
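The "optimization problem" framing can be made concrete with a toy example.  Estimating the [;\tau;]-th quantile amounts to minimizing the check (pinball) loss, and for a constant-only model the minimizer is simply the sample quantile -- no distributional assumption enters.  A minimal Python sketch (the data values are arbitrary, for illustration only):

```python
def check_loss(u, tau):
    """The check (pinball) loss: tau*u if u >= 0, else (tau - 1)*u."""
    return tau * u if u >= 0 else (tau - 1) * u

def fit_constant_quantile(y, tau):
    """Minimize the summed check loss over a constant c.

    The minimizer is attained at one of the data points, so a brute-force
    search over the observed values suffices for this toy case.
    """
    return min(y, key=lambda c: sum(check_loss(yi - c, tau) for yi in y))

y = [2, 3, 5, 7, 11, 13, 17]
print(fit_constant_quantile(y, 0.5))  # the sample median, 7
```

Replacing the constant with a linear function of a predictor gives the linear quantile regression estimator (in R, `rq()` in the quantreg package solves this by linear programming).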

2. Quantile regression is an exploratory data analysis tool. Its main application is the detection of changes in the response variable's distribution. In regular regression (lm and glm), we impose a fixed distributional assumption on the response. In QR, this distributional assumption is no longer applicable. When multiple quantiles are estimated, we can examine the response variable's probability distribution at different locations along the x-axis. I can see that a good graphical presentation of the estimated response variable distributions can be very useful. For example, I have been working on the USGS urbanization data. One response variable is the mean tolerance score of the macroinvertebrate community, a variable bounded between 0 and 10. It is reasonable to believe that when the urbanization level in the watershed is low, the distribution of species tolerance is limited by factors other than urbanization-induced pollution and habitat modification, but when urbanization is high, only the most tolerant species are left, leading to tolerance scores concentrated in the upper end of the spectrum. We can therefore expect a point along the urban gradient where the probability distribution of the tolerance score changes.

3. If the advantage of quantile regression is the detection of changes in the response variable's distribution, any application of the method should produce multiple quantiles so that the full distribution can be evaluated numerically. A good method for graphically displaying these distributions is essential.

4. An immediate application of quantile regression is in risk assessment. When quantile regression results are translated into response variable distributions along the x-axis, these distributions can be translated into probabilities of exceedance.
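One way to carry out that translation: at a fixed value of the predictor, the fitted quantiles trace out the conditional CDF, and interpolating [;\tau;] as a function of the quantile gives an exceedance probability for any threshold.  A Python sketch, with hypothetical fitted quantile values standing in for real QR output:

```python
# Hypothetical conditional quantiles q_tau(x0) at one predictor value x0;
# in practice these would come from fitted quantile regressions.
taus      = [0.1, 0.25, 0.5, 0.75, 0.9]
quantiles = [1.2, 2.0, 3.1, 4.5, 6.0]

def exceedance_prob(threshold, taus, quantiles):
    """P(Y > threshold) at x0, interpolating tau between estimated quantiles.

    Outside the estimated range we fall back to the nearest tau (crude, but
    nothing more is known without extrapolating the tails).
    """
    if threshold <= quantiles[0]:
        return 1 - taus[0]
    if threshold >= quantiles[-1]:
        return 1 - taus[-1]
    for (t0, q0), (t1, q1) in zip(zip(taus, quantiles),
                                  zip(taus[1:], quantiles[1:])):
        if q0 <= threshold <= q1:
            tau = t0 + (t1 - t0) * (threshold - q0) / (q1 - q0)
            return 1 - tau

print(exceedance_prob(3.1, taus, quantiles))  # the median, so 0.5
```

Estimating more quantiles tightens the interpolation; the tails remain the weak point, since extreme quantiles are the hardest to estimate reliably.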

5. I still need to investigate the theoretical background of the model assessment tools. For example, it is still not clear to me how AIC is calculated without a probabilistic assumption on the data; a double exponential distribution was imposed on the weighted residuals. I must learn more about this. But because quantile regression is a nonparametric modeling method, it should be used as an exploratory tool for hypothesis generation rather than as a modeling tool. As a result, AIC and other model diagnostic tools are less relevant.

The rest of the conference was also interesting. I sense a big difference in attitude towards modeling between people in government and in academia. Government scientists are goal oriented -- they need to complete a project. Academic scientists are interested in storytelling -- they seek a minimum publishable unit. Government scientists often pursue projects with the sole purpose of fulfilling some regulatory mandate. Academic scientists often pursue projects that tickle their fancy but do not necessarily mean much.

Denver in June is beautiful.
