In my 2003 Ecological Modelling paper on ecological thresholds, I used bootstrapping to estimate the standard error of a change point. The change point was estimated as the binary break point that resulted in the greatest reduction in deviance. Ian McKeague of FSU (now at Columbia) pointed me to Bühlmann and Yu (2002), which suggests that bootstrapping is inappropriate for a change point problem. I was convinced that the bootstrap confidence interval is wrong, but was never able to explain the reason convincingly until recently.
A change point problem limits the potential break points to the gaps (intervals between distinct values) in the given data set. These potential break points do not change from bootstrap sample to bootstrap sample; some gaps merely become wider. If the population were sampled repeatedly, the number of potential break points would be infinite. As a result, the bootstrap-estimated break point has an artificially smaller sampling variance than that of the true sampling distribution of the break point. This reduced sampling variation is the likely cause of the much smaller standard error, or narrower confidence interval, produced by bootstrapping. Banerjee and McKeague (2007) used a simulation to show that a 95% confidence interval based on bootstrapping covers the underlying true change point far less than 95% of the time.
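To see the problem concretely, here is a minimal sketch in R on simulated data (not the data from the 2003 paper), assuming a normal response so that the reduction in deviance is a reduction in residual sum of squares. The bootstrap replicates of the break point can only fall in the gaps of the original sample, which is the source of the artificially small variance.

```r
set.seed(101)
n <- 50
x <- sort(runif(n, 0, 10))
y <- ifelse(x < 6, 2, 5) + rnorm(n, sd = 1)   # true change point at x = 6

## change point = split giving the largest reduction in deviance (RSS here)
change_point <- function(x, y) {
  xs   <- sort(unique(x))
  cand <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints of the gaps
  rss  <- sapply(cand, function(cp)
    sum((y[x <= cp] - mean(y[x <= cp]))^2) +
    sum((y[x >  cp] - mean(y[x >  cp]))^2))
  cand[which.min(rss)]
}
cp_hat <- change_point(x, y)

## bootstrap: resample (x, y) pairs and re-estimate the change point
boot_cp <- replicate(1000, {
  i <- sample(n, n, replace = TRUE)
  change_point(x[i], y[i])
})
sd(boot_cp)      # bootstrap standard error
table(boot_cp)   # estimates can only land on a few gaps of the original data
```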
Monday, September 20, 2010
Causal inference based on observational data
I have been reading about causal inference using observational data for almost three years. Three years ago, I read a paper on the effectiveness of agricultural best management practices in reducing nutrient loading from fields. The data were observational, and the estimated effects are affected by confounding factors. Using that data set, I learned how to use multilevel models and made a few homework assignments for my class. The question in that data set is clear, and the approach of using propensity score matching is obvious. I compared propensity score matching and multilevel modeling for estimating the effect of conservation practices on nutrient loading from agricultural fields. The results were not surprising -- on average, any conservation practice would lead to a reduction of nitrogen/phosphorus loading by 1/3-2/3. I have since read Rosenbaum and Rubin (1983) and other classic literature.

To summarize the main idea, the problem of causal inference of a treatment effect is always a problem of the counterfactual. That is, we can never apply both treatment and control to the same subject to assess the real treatment effect on that subject. When treatment and control are applied to two different subjects, the difference in outcome can be due to confounding factors. The statistical solution to this problem is Fisher's randomization: subjects are randomly assigned to control and treatment, effectively forming two groups that are otherwise similar, one receiving the treatment and the other the control. The difference in response can then be confidently attributed to the treatment. When using observational data, treatment and control are not applied randomly. Consequently, no causal inference should be made without careful adjustment.

The propensity score matching approach is one of many methods used for causal inference based on observational data. The basic idea is that the data can be subset such that the treatment and control groups are similar in terms of confounding factors. The similarity is achieved through the propensity score, the likelihood of a subject receiving treatment given its confounding factor levels. Operationally, a propensity score is estimated as the probability of receiving treatment through a logistic regression using all available covariates. Each observation in the treatment (or control) group is matched with an observation in the other group with a similar propensity score. Not all observations will be included in the final data set. After matching, we essentially have a treatment group and a control group that are similar with respect to confounding factors. The difference in response is then likely due to the treatment.
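As a minimal sketch of the mechanics (on simulated data with hypothetical variable names: a binary conservation practice `bmp`, confounders `slope` and `soil`, and a response `load`; this is not the data set described above):

```r
set.seed(2010)
n <- 200
dat <- data.frame(slope = runif(n), soil = rnorm(n))
dat$bmp  <- rbinom(n, 1, plogis(-1 + 2 * dat$slope + 0.5 * dat$soil))
dat$load <- 10 - 3 * dat$bmp + 4 * dat$slope + dat$soil + rnorm(n)

## 1. propensity score: probability of receiving treatment given the covariates
ps_fit <- glm(bmp ~ slope + soil, family = binomial, data = dat)
dat$ps <- fitted(ps_fit)

## 2. match each treated unit to the control with the closest propensity score
##    (with replacement, for simplicity)
treated  <- which(dat$bmp == 1)
controls <- which(dat$bmp == 0)
matched  <- sapply(treated, function(i)
  controls[which.min(abs(dat$ps[controls] - dat$ps[i]))])

## 3. treatment effect estimated from the matched pairs
mean(dat$load[treated] - dat$load[matched])
```

In practice a package such as MatchIt automates the matching and provides balance diagnostics; the point here is only the logic of the method.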
The central idea of using the propensity score is matching and subsetting to balance out the effects of confounding factors. In environmental studies, however, the treatment is often a continuous variable -- for example, a dose-response curve, or the effect of nutrient concentration on lake eutrophication. I recently read Imai and van Dyk (2004) and Hirano and Imbens (2004) and learned the generalized propensity score approach for continuous and other forms of treatment. The generalized propensity score is a natural extension of the binary-treatment version. The propensity score of a binary treatment variable is the probability of an observation receiving treatment, that is, the probability function of the binomial distribution of treatment. The generalized propensity score is the probability density of the treatment conditional on the confounding factors. One can estimate this density by fitting a regression model of the treatment on the confounding factors. The resulting regression model defines a conditional normal distribution of the treatment variable, from which the propensity scores can be calculated as the normal density of the model residuals (N(0, sigma^2)). Imai and van Dyk (2004) suggested that the resulting propensity scores be used to divide the observations into J groups of similar propensity scores, and within each group a regression model is fit for the treatment effect; the average causal effect is a weighted mean of the J effects. Hirano and Imbens (2004) suggested a two-step process: (1) the response is regressed on a polynomial of the treatment and the propensity score, and (2) the causal effect is estimated as the expected response at a given treatment value averaged over all observations.
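Following the description above, a minimal sketch of the subclassification approach on simulated data (hypothetical names: a continuous treatment `tp`, confounders `depth` and `landuse`, and a response `chla`; not the data of either paper):

```r
set.seed(2004)
n <- 300
dat2 <- data.frame(depth = runif(n, 1, 20), landuse = rnorm(n))
dat2$tp   <- 1 + 0.1 * dat2$depth + 0.5 * dat2$landuse + rnorm(n, sd = 0.5)
dat2$chla <- 2 + 0.8 * dat2$tp + 0.05 * dat2$depth + rnorm(n, sd = 0.5)

## 1. model the treatment given the covariates; the generalized propensity
##    score is the conditional normal density at the observed treatment value
gps_fit  <- lm(tp ~ depth + landuse, data = dat2)
dat2$gps <- dnorm(residuals(gps_fit), mean = 0, sd = summary(gps_fit)$sigma)

## 2. divide the observations into J groups of similar propensity scores and
##    fit the treatment-response regression within each group
J <- 5
dat2$grp <- cut(dat2$gps,
                quantile(dat2$gps, probs = seq(0, 1, length.out = J + 1)),
                include.lowest = TRUE, labels = FALSE)
effects <- sapply(split(dat2, dat2$grp),
                  function(d) coef(lm(chla ~ tp, data = d))["tp"])

## 3. average causal effect as the weighted mean of the group-specific slopes
sum(effects * table(dat2$grp) / n)
```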
In any case, the propensity score approach still aims at achieving a balance of the confounding factors. When observations are divided into groups based on the estimated propensity scores (Imai and van Dyk, 2004), the distributions of the confounding factors (the covariates used for estimating the propensity scores) are more or less balanced within each group.
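Continuing the hypothetical sketch above, one simple check is to compare the association between a covariate and the treatment overall and within each group; if the grouping balances the confounders as described, the within-group association should be reduced:

```r
cor(dat2$depth, dat2$tp)                                       # overall
sapply(split(dat2, dat2$grp), function(d) cor(d$depth, d$tp))  # within groups
```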
A very odd application of the generalized propensity score was published recently in Ecological Applications (Yuan 2010). The paper misinterpreted the definition of the propensity score: instead of the conditional probability density of the observed treatment, Yuan (2010) used the predicted mean treatment value as the propensity score. The resulting subgrouping further segregated the distributions of the confounding factors, leading to a potentially more biased estimate of the causal effect.
Monday, June 21, 2010
An example of statistical modeling -- part 1: model formulation
I will regularly post discussions on examples used in Environmental and Ecological Statistics with R (available at Amazon).
This post discusses the general thought process behind the PCB in fish example. In this first part, I go over the background of the data and the process of establishing the model form.
The data were fish (lake trout) tissue concentrations of PCB measured from fish caught in Lake Michigan from 1970 to 2000. Each year, a number of fish were sampled from the lake and PCB concentrations were measured in the edible portion of each fish. The main objective of the example is to study the trend of mean PCB fish tissue concentrations in the lake.
A modeling project always has the following three steps:
1. Model formulation
2. Parameter estimation, and
3. Model evaluation
A properly formulated model can be based on relevant theory or exploratory data analysis. In this case, we start with a general model often used in environmental engineering: the first-order reaction model, which states that the rate of change of a chemical concentration is proportional to the concentration itself. That is, the derivative of concentration with respect to time is proportional to the concentration:
\[ \frac{dC_t}{dt} = -kC_t \]
Solving this differential equation, we have
\[ C_t = C_0 e^{-kt} \]
where \( C_t \) is the concentration at time \( t \) and \( C_0 \) is the concentration at time 0. Taking the logarithmic transformation of both sides:
\[ \log(C_t) = \log(C_0) - kt \]
which suggests a log-linear regression model with \( \log(C_t) \) as the response variable and \( t \) as the predictor. The intercept is \( \log(C_0) \) and the slope is \( -k \).
This process of establishing a linear regression model form through basic mechanistic relations is common. A common feature of this approach is the aggregation/simplification of mechanistic relations at fine spatiotemporal scales into the scale represented by the data. For example, the first-order model is a reasonable model for a given fish, but there is no possibility of measuring the PCB concentration of the same fish over time. When using concentration data from multiple fish each year, the resulting relationship describes the average PCB concentration of all fish in Lake Michigan. Also, because of the simplification and aggregation, model coefficients may now vary due to other factors. In this example, we will explore changes in \( k \) as a function of fish size.
In summary, we used the first-order reaction model as the basis for establishing the basic form of the relationship between PCB fish tissue concentration and time (since PCB was banned from the area).
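A minimal sketch of the resulting regression in R, using simulated data as a stand-in for the Lake Michigan data in the book (hypothetical column names `pcb`, `year`, and `length`):

```r
set.seed(1970)
n <- 200
fish <- data.frame(year   = sample(1970:2000, n, replace = TRUE),
                   length = runif(n, 50, 90))
fish$pcb <- exp(2.5 - 0.08 * (fish$year - 1970) + 0.01 * fish$length +
                rnorm(n, sd = 0.3))

fish$time <- fish$year - 1970                # years since roughly the PCB ban
m1 <- lm(log(pcb) ~ time, data = fish)       # intercept ~ log(C0), slope ~ -k
summary(m1)

## letting the decay rate k vary with fish size, via an interaction
m2 <- lm(log(pcb) ~ time * length, data = fish)
```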
Tuesday, June 15, 2010
Quantile Regression and a week in Denver
I attended the 3rd USGS Modeling Conference in Denver last week, an event attended mostly by USGS and other DOI scientists. The conference started on Monday with a short course on quantile regression, a topic increasingly mentioned in the ecological literature, especially by those interested in estimating ecological thresholds. In the last year or so, I read a few papers and some applications of the method in the ecological literature. My initial impression was not very positive. To begin with, the original reference on the subject presented the method as an optimization problem. It gave me the impression that quantile regression is what least squares is to the linear model: a computational method without the statistics. Many applications, especially those presented at AWRA conferences, are incomprehensible half-baked products. In signing up for the short course, I wanted to learn more before starting to criticize quantile regression.
The instructor, Brian Cade, is a USGS statistician. He started out with a definition and went on with some examples and R code. By the coffee break, I knew how to run R to fit a quantile regression model, but not much else. I wanted to know the meaning of the output and its practical implications. During the second half of the short course, I started to think about these questions. I stopped Brian before he went into more advanced topics (such as Bayesian quantile regression) and asked for a discussion of the examples already covered, especially their practical implications. What can be achieved by using quantile regression that cannot be achieved using regular regression?
Here is what I learned from the short course.
1. Quantile regression is a form of nonparametric modeling -- no probabilistic assumption is imposed on the data. Initially, quantile regression was presented as an extension of the linear regression model: instead of modeling the mean of the response, a quantile regression models a given quantile as a linear function of the predictor. The current implementation of QR in R also allows smoothing functions to be used. As a result, QR can be fully nonparametric -- not only is the response variable not limited to a specific distribution, but the functional form of a quantile is also unrestricted.
2. Quantile regression is an exploratory data analysis tool. Its main application is the detection of changes in the response variable distribution. In regular regression (lm and glm), we impose a fixed distributional assumption on the response. In QR, this distributional assumption is no longer applicable. When multiple quantiles are estimated, we can examine the response variable's probability distribution at different locations along the x-axis. I can see that a good graphical presentation of the estimated response distributions can be very useful. For example, I have been working on the USGS urbanization data. One response variable is the mean tolerance score of the macroinvertebrate community, a variable bounded between 0 and 10. It is reasonable to believe that when the urbanization level in the watershed is low, the distribution of species tolerance is limited by factors other than urbanization-induced pollution and habitat modification. But when urbanization is high, only the most tolerant species are left, leading to tolerance scores concentrated in the upper end of the spectrum. Along the urban gradient, we can expect a point where the probability distribution of the tolerance score changes.
3. If the advantage of quantile regression is the detection of changes in the response variable distribution, any application of the method should produce multiple quantiles so that the full distribution can be evaluated numerically. A good method for graphically displaying these distributions is essential (see the sketch after this list).
4. An immediate application of quantile regression is in risk assessment. When the quantile regression results are translated into response variable distributions along the x-axis, these distributions can be translated into probabilities of exceedance.
5. I still need to investigate the theoretical background of the model assessment tools. For example, it is still not clear to me how AIC is calculated without a probabilistic assumption on the data; apparently a double exponential distribution is imposed on the weighted residuals. I must learn more about this. But, as quantile regression is a nonparametric modeling method, it should be used as an exploratory tool for hypothesis generation rather than a modeling tool. As a result, AIC and other model diagnostic tools are less relevant.
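A minimal sketch of fitting several quantiles with the quantreg package, using simulated data as a stand-in for the urbanization example above (a bounded response whose spread changes along the gradient); `urban` and `tol` are hypothetical names:

```r
library(quantreg)

set.seed(42)
n <- 300
urban <- runif(n, 0, 100)                     # percent urban land cover
tol   <- pmin(10, pmax(0, 4 + 0.04 * urban + rnorm(n, sd = 1 + 0.02 * urban)))
dat   <- data.frame(urban = urban, tol = tol)

## fit a set of quantiles so the whole conditional distribution can be examined
taus <- c(0.1, 0.25, 0.5, 0.75, 0.9)
fit  <- rq(tol ~ urban, tau = taus, data = dat)

## display the estimated quantile lines over the data
plot(tol ~ urban, data = dat, pch = 16, col = "grey")
for (i in seq_along(taus)) abline(coef = coef(fit)[, i], lty = i)
```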
The rest of the conference was also interesting. I sense a big difference in attitude towards modeling between people in government and in academia. Government scientists are goal oriented -- they need to complete a project. Academic scientists are interested in storytelling -- they seek a minimum publication unit. Government scientists often pursue projects with the sole purpose of fulfilling some regulatory mandate. Academic scientists often pursue projects that tickle our fancy but do not necessarily mean much.
Denver in June is beautiful.