A few days before the end of last semester, I stumbled upon a paper in the journal Ecological
Indicators [Song and Guan(2013)]. The title indicates a Bayesian estimation method was used.
The authors started with a discussion of “environmental efficiency analysis” and introduced an
indicator, which is some kind of mathematical function of inputs and outputs. They offered no intuitive explanation of the indicator. It was based on three
input variables (population, “fixed capital formation,” and nonrenewable energy consumption),
one desirable output variable (GDP), and one undesirable output (industrial SO2 emission). The
calculation resulted in 17 environmental efficiency scores (9 cities in 2 years). The main objective
of the paper is to explore factors affecting these environmental efficiency scores (EE), using regression.
The potential factors are (1) per capita GDP (RGDP), (2) total import and
export volume (IE), (3) the proportion of the “second industry” in GDP (GY), (4) the proportion
of “the industry of the second industry” in GDP (GGY), and (5) the proportion of environmental
spending in GDP (HZ). The authors explained that (1) is an indicator of economic scale, (2) is a
measure of economic exchange with the outside world (the authors used the term “opening up,” a bad translation of a Chinese term), (3) and (4) are measures of industry structure, and (5) is
the “government factor.” Not being an economist, I don’t want to comment on the choice of these
potential factors, except that GDP is now used both as part of the environmental efficiency score
and as a potential factor that will be used to explain the variance of the score.
The authors used a multiple regression approach, but the regression coefficients were estimated using MCMC. I was expecting a discussion of the choice of prior distributions for these coefficients, but it was soon clear that there were no prior distributions. So why use MCMC for a multiple regression problem? Based on the authors’ affiliation (School of Statistics and Mathematics), I assume they know that, without informative priors, there should be no substantive difference between using MCMC and using OLS. The authors presented a regression model of EE on the five potential factors, with an intercept α and slopes β1 through β5.
The estimated model coefficients may have revealed the answer:
coefficient    estimate    standard error
α               -0.1188            1.7260
β1              -0.1564            0.1562
β2               3.8130            1.5510
β3               4.8050            8.0620
β4              -3.3960            5.5560
β5               0.9362           18.8900
All slopes except β2 are statistically not different from 0! Had the authors used OLS, the typical regression output would have included a column of p-values, which would have made the paper
unpublishable. Using MCMC, the authors were able to present the estimated coefficients, standard
errors, and selected quantiles. Without the column of p-values, a busy reviewer may not
catch the problem. (But all my students in an introductory biostatistics class recognized the
problem.)
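The check my students did can be sketched in a few lines of Python. This is only a rough illustration, not the authors' analysis: it treats the reported MCMC summaries as if they were OLS estimates and standard errors, and it assumes 11 degrees of freedom (17 scores minus 6 coefficients), for which the two-sided 5% critical value of Student's t is 2.201. The names `coefs`, `T_CRIT`, and `summarize` are made up for this example.

```python
# Estimates and standard errors copied from the coefficient table.
coefs = {
    "alpha": (-0.1188, 1.7260),
    "beta1": (-0.1564, 0.1562),
    "beta2": (3.8130, 1.5510),
    "beta3": (4.8050, 8.0620),
    "beta4": (-3.3960, 5.5560),
    "beta5": (0.9362, 18.8900),
}

# Assumption: 17 scores and 6 coefficients leave 11 degrees of freedom;
# 2.201 is the two-sided 5% critical value of Student's t with 11 df.
T_CRIT = 2.201


def summarize(table, t_crit=T_CRIT):
    """Return {name: (ratio, lower, upper, differs_from_zero)}."""
    rows = {}
    for name, (est, se) in table.items():
        ratio = est / se              # estimate in units of its standard error
        lower = est - t_crit * se     # approximate 95% interval
        upper = est + t_crit * se
        rows[name] = (ratio, lower, upper, abs(ratio) > t_crit)
    return rows


results = summarize(coefs)
for name, (ratio, lower, upper, differs) in results.items():
    verdict = "differs from 0" if differs else "not distinguishable from 0"
    print(f"{name:6s} ratio={ratio:7.3f}  "
          f"interval=({lower:8.3f}, {upper:8.3f})  {verdict}")
```

Under these assumptions, only β2 has an estimate more than 2.201 standard errors away from zero; every other interval comfortably covers zero.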
Is this a success story of cheating by using “Bayesian” statistics?
References
[Song and Guan(2013)] Malin Song and Youyi Guan. The environmental efficiency of Wanjiang demonstration area: a Bayesian estimation approach. Ecological Indicators, 36:59–67, 2013.