Friday, September 30, 2011

Reproducible Results

Over a year ago, Duke was in the spotlight for academic fraud involving biologists who misused statistics and fabricated results. The main culprit resigned and is now practicing medicine in South Carolina. Duke University apparently does not agree that its rising star was a fraud, as the head of cancer research at Duke wrote a very positive letter of recommendation for him.

One clue that something was wrong was that no one could reproduce the results the Duke team published in Nature. Being unable to reproduce a result in a published paper is a common phenomenon in ecological studies, because no one can repeat a costly experiment or even obtain the data used in a paper. Even when authors provide code and data as a supplement to a paper, reviewers rarely check these materials. For a reviewer to check the work in a submitted paper, we should ask the author to provide something like a Sweave file: code plus documentation and data. This is why I came back to Sweave today to prepare my report, so that those interested can repeat the work I did.

During the summer, I read a paper published in the journal Methods in Ecology and Evolution (1:25-37, 2010) advocating the use of a program called "TITAN" for detecting and estimating community thresholds using species compositional data. In one example, the authors studied the effect of urbanization in a watershed on biodiversity in streams, using data from multiple watersheds in Maryland. The conclusion that a mere 1 to 2% urban land cover in a watershed can result in a dramatic shift in stream biodiversity is highly suspicious, because the measurement error of land cover (as a percentage of total land area in a watershed) can be very high (5 to 10%). Subsequent papers by the same authors reached similar conclusions (very small urban land cover leads to large changes in aquatic ecosystem biodiversity).

After a careful examination of the code, I realized that the statistics behind the method were wrong. The mistake is not obvious in the description of the method, but it should have been detected if the reviewers had been critical enough to try the method with a simple simulation. A colleague and I conducted an extensive simulation study and found that TITAN cannot detect known thresholds in simulated data unless the threshold is clearly a step function noticeable without using a computer. The effort took us several weeks. Reviewers of this paper should have had the code and the data set for one example. But the TITAN authors included a bootstrapping procedure that made running the program time-consuming. As a result, I suspect that the reviewers of the manuscript never ran the code. If they had, they would have discovered that TITAN produces different estimates every time the model is run on the same data. That is, the reviewers would likely have seen results different from those in the manuscript.
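The run-to-run variation is easy to see in a toy example. What follows is a minimal Python sketch, not TITAN and not the simulation study described above: it simulates a response with a known step at x = 5, estimates the change point by a simple least-squares split, and summarizes bootstrap resamples by their median. The function names, the sample size, the size of the step, and the seeds are all assumptions made up for the illustration. The only point is that a bootstrap summary depends on the random draws, so it can be repeated exactly only when the random seed is fixed and reported.

```python
import random


def simulate_step(n, threshold, rng):
    """Simulate y with mean 0 below the threshold and mean 2 above it (sd = 1)."""
    xs = sorted(rng.uniform(0.0, 10.0) for _ in range(n))
    ys = [(2.0 if x > threshold else 0.0) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def estimate_change_point(xs, ys):
    """Return the split point in x that minimizes the within-group sum of squares."""
    pairs = sorted(zip(xs, ys))
    best_x, best_ss = pairs[0][0], float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # tied x values (common in resamples) cannot be split
        ss = sum_sq([y for _, y in pairs[:i]]) + sum_sq([y for _, y in pairs[i:]])
        if ss < best_ss:
            best_x, best_ss = (pairs[i - 1][0] + pairs[i][0]) / 2.0, ss
    return best_x


def bootstrap_median(xs, ys, n_boot, rng):
    """Median of change-point estimates over n_boot resamples drawn with rng."""
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            estimate_change_point([xs[i] for i in idx], [ys[i] for i in idx])
        )
    estimates.sort()
    return estimates[len(estimates) // 2]


# One fixed data set, analyzed three times.
xs, ys = simulate_step(60, 5.0, random.Random(2011))
run_a = bootstrap_median(xs, ys, 100, random.Random(1))
run_b = bootstrap_median(xs, ys, 100, random.Random(1))  # same seed as run_a
run_c = bootstrap_median(xs, ys, 100, random.Random(2))  # different seed

print("same seed, same answer:", run_a == run_b)
print("seed 1:", run_a, "seed 2:", run_c)
```

The two runs that share a seed agree exactly. The run with a different seed need not, and the size of any difference depends on the data and on the number of resamples.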

Thursday, September 29, 2011

Sweave

I used Sweave a while ago. It seemed useful for preparing answer keys for my class. I stopped using it because I didn't know how to control the figure size (as in pdf(..., width=4)). (I now know that I can use the width= and height= options in Sweave, but for some reason the output is not exactly what I wanted.)

I picked it up again today so that I can document a consulting project without handing out two files. When I last used Sweave, it was on a Mac. Today I am using a PC with TeX Live. When I compiled the resulting .tex file, the message "Sweave.sty not found" appeared. After a few moments, I realized that the default for "stylepath" in R is now FALSE. But after I switched it back to TRUE, the path in the resulting .tex file was

\usepackage{c:/PROGRA~1/R/R-213~1.1/share/texmf/tex/latex/Sweave}

which is not recognized by tex or latex as a proper path. Initially, I copied the folder to a different place (e.g., c:/texlive/Sweave) and changed the path in the resulting .tex file. But a better way is probably to reinstall R in a folder that does not have a space in its name, so that I don't have to modify the generated .tex file every time the .Rnw file is updated. After moving R (e.g., to C:/R-2.13.1), the "site-start.el" file needs to be modified so that Emacs-ESS can find R.
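If reinstalling R is not an option, the manual edit of the generated .tex file can at least be scripted. The following is a minimal Python sketch of that workaround, under the assumption that a copy of Sweave.sty has been placed where TeX can find it (for example, next to the .tex file). The function name use_plain_sweave and the file name report.tex are hypothetical.

```python
import re

# Matches a \usepackage line whose argument is a path ending in "/Sweave"
# (or "\Sweave"); a bare \usepackage{Sweave} does not match.
SWEAVE_LINE = re.compile(r"\\usepackage\{[^{}]*[/\\]Sweave\}")


def use_plain_sweave(tex_source):
    """Replace a path-qualified Sweave package line with a plain one."""
    # A function is used as the replacement so that the backslash in the
    # output is taken literally rather than parsed as a regex escape.
    return SWEAVE_LINE.sub(lambda match: r"\usepackage{Sweave}", tex_source)


example = r"\usepackage{c:/PROGRA~1/R/R-213~1.1/share/texmf/tex/latex/Sweave}"
print(use_plain_sweave(example))
```

To patch a file in place, read it, pass the text through use_plain_sweave, and write it back (for instance with open("report.tex") for reading and again for writing). This still has to be repeated each time the .Rnw file is re-run through Sweave, which is why moving R to a path without spaces remains the cleaner fix.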

Log or not log

Log or not log, that is the question. May 19, 2018. In 2014 I taught a special topics class on statistical i...