Tuesday, June 15, 2010

Quantile Regression and a week in Denver

I attended the 3rd USGS Modeling Conference in Denver last week. The participants were mostly USGS and other DOI scientists. The conference started on Monday with a short course on quantile regression, a topic increasingly mentioned in the ecological literature, especially by those interested in estimating ecological thresholds. In the last year or so, I read a few papers on the method and some of its applications in the ecological literature. My initial impression was not very positive. To begin with, the original reference on the subject presented the method in terms of an optimization problem. It gave me the impression that quantile regression is what least squares is to the linear model: a computational method without statistics. Many applications, especially those presented at AWRA conferences, are incomprehensible, half-baked products. In signing up for the short course, I wanted to learn more before I started to criticize quantile regression.

The instructor, Brian Cade, is a USGS statistician. He started out with a definition and went on with some examples and R code. By the coffee break, I knew how to run R to fit a quantile model, but not much else. I wanted to know the meaning of the output and its practical implications. During the second half of the short course, I started to think about these questions. I stopped Brian before he went into more advanced topics (such as Bayesian quantile regression) and asked for a discussion of the examples already covered, especially their practical implications. What is achieved by using quantile regression that cannot be achieved using regular regression?

Here is what I learned from the short course.

1. Quantile regression is a form of nonparametric modeling -- no probabilistic assumption is imposed on the data. Initially, quantile regression was presented as an extension of the linear regression model. Instead of modeling the mean of the response, a quantile regression models a given quantile of the response as a linear function of the predictor. The current implementation of QR in R also allows smoothing functions to be used. As a result, QR can be fully nonparametric -- not only is the response variable not limited to a specific distribution, but the functional form of each quantile is also unrestricted.

2. Quantile regression is an exploratory data analysis tool. The main application is the detection of changes in the response variable distribution. In regular regression (lm and glm), we impose a fixed distributional assumption on the response. In QR, this distributional assumption no longer applies. When multiple quantiles are estimated, we can examine the probability distribution of the response variable at different locations along the x-axis. I can see that a good graphical presentation of the estimated response variable distributions can be very useful. For example, I have been working on the USGS urbanization data. One response variable is the mean tolerance score of the macroinvertebrate community, a variable bounded between 0 and 10. It is reasonable to believe that when the urbanization level in the watershed is low, the distribution of species tolerance is limited by factors other than urbanization-induced pollution and habitat modification. But when urbanization is high, only the most tolerant species are left, leading to a tolerance score concentrated in the upper end of the spectrum. Somewhere along the urban gradient, we can expect a point where the probability distribution of the tolerance score changes.

3. If the advantage of quantile regression is the detection of changes in the response variable distribution, any application of the method should produce multiple quantiles so that the full distribution can be evaluated numerically. A good method for graphically displaying these distributions is essential.

4. An immediate application of quantile regression is in risk assessment. When the quantile regression results are translated into response variable distributions along the x-axis, these distributions can be translated into probabilities of exceedance.

5. I still need to investigate the theoretical background of the model assessment tools. For example, it is still not clear to me how AIC is calculated without a probabilistic assumption on the data. A double exponential distribution was imposed on the weighted residuals. I must learn more about this. But, as quantile regression is a nonparametric modeling method, it should be used as an exploratory tool for hypothesis generation rather than as a modeling tool. As a result, AIC and other model diagnostic tools are less relevant.
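
To make items 1 through 4 concrete, here is a minimal sketch in plain Python (not the R code used in the short course). Everything in it is an illustration under stated assumptions: the data are simulated, and the names check_loss, fit_quantile_line and exceedance_probability are hypothetical helpers, not functions from any library. The fit uses brute force, relying on the fact that for a single predictor an optimal quantile line passes through two data points, so it is only sensible for small data sets. The exceedance probability is a rough linear interpolation between the estimated quantiles at a given x.

```python
import random
from itertools import combinations


def check_loss(residual, tau):
    """Check (pinball) loss of one residual at quantile level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_quantile_line(x, y, tau):
    """Brute-force linear quantile fit with one predictor.

    Tries the line through every pair of points and keeps the one with
    the smallest total check loss. Returns (intercept, slope).
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as a + b * x
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(check_loss(yk - intercept - slope * xk, tau)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


def exceedance_probability(taus, quantiles, threshold):
    """Approximate P(Y > threshold) from quantiles estimated at one x.

    Assumes taus is in ascending order. Sorting the quantiles is a crude
    guard against crossing estimates. Outside the estimated range only a
    bound is available, and that bound is what is returned.
    """
    qs = sorted(quantiles)
    if threshold <= qs[0]:
        return 1.0 - taus[0]
    if threshold >= qs[-1]:
        return 1.0 - taus[-1]
    for k in range(len(qs) - 1):
        if qs[k] <= threshold <= qs[k + 1]:
            frac = (threshold - qs[k]) / (qs[k + 1] - qs[k])
            return 1.0 - (taus[k] + frac * (taus[k + 1] - taus[k]))


# Simulated data (an assumption for illustration): the spread of the
# response grows with x, so the quantile lines should fan out.
random.seed(1)
x = [random.uniform(0, 10) for _ in range(40)]
y = [2 + 0.5 * xi + random.gauss(0, 0.2 + 0.3 * xi) for xi in x]

taus = [0.1, 0.25, 0.5, 0.75, 0.9]
fits = [fit_quantile_line(x, y, tau) for tau in taus]
for tau, (a, b) in zip(taus, fits):
    print(f"tau={tau:.2f}: intercept={a:.2f}, slope={b:.2f}")

# Estimated response distribution, and exceedance of a hypothetical
# threshold of 8, at a low and a high value of x.
for x0 in (1.0, 9.0):
    q_at_x0 = [a + b * x0 for a, b in fits]
    p = exceedance_probability(taus, q_at_x0, threshold=8.0)
    print(f"x={x0}: quantiles={[round(q, 2) for q in q_at_x0]}, "
          f"P(Y > 8) roughly {p:.2f}")
```

Printing the five fitted lines side by side, or plotting them over the scatter, is the kind of display item 3 asks for: if the slopes differ across quantiles, the response distribution is changing along the x-axis, which a single mean regression line cannot show.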

The rest of the conference was also interesting. I sensed a big difference in attitude towards modeling between people in government and people in academia. Government scientists are goal oriented -- they need to complete a project. Academic scientists are interested in storytelling -- they seek a minimum publishable unit. Government scientists often pursue projects with the sole purpose of fulfilling some regulatory mandate. Academic scientists often pursue projects that tickle our fancy but do not necessarily mean much.

Denver in June is beautiful.
