13 March 2018

Announcing a crowdsourced reanalysis project

(Update 2018-03-14 10:18 UTC: I have received lots of offers to help with this, and I now have enough people helping.  So please don't send me an e-mail about this.)

Back in the spring of 2016, for reasons that don’t matter here, I found myself needing to understand a little bit about the NHANES (National Health and Nutrition Examination Survey) family of datasets.  NHANES is an ongoing programme that has been running in the United States since the 1970s, looking at how nutrition and health interact.

Most of the datasets produced by the various waves of NHANES are available to anyone who wants to download them. Before I got started on my project (which, in the end, was abandoned, again for reasons that don't matter here), I thought it would be a good idea to check that I understood the structure of the data by reproducing the results of an article based on them. This seemed especially important because the NHANES files (at least, the ones I was interested in) are supplied in a format that requires SAS to read, and I needed to convert them to CSV before analysing them in R.  So my plan was to take a well-cited article and reproduce its table of results, which would allow me to be reasonably confident that I had done the conversion right, understood the variable names, and so on.

Since I was using the NHANES-III data (from the third wave of the NHANES programme, conducted in the mid-1990s) I chose an article at random by looking for references to NHANES-III in Google Scholar (I don’t remember the exact search string) and picking the first article that had several hundred citations.  I won't mention its title here (read on for more details), but it addresses what is clearly an important topic and seemed like a very nice paper—exactly what I was looking for to test whether or not I was converting, importing, and interpreting the NHANES data correctly.

The NHANES-III datasets are distributed in good old-fashioned mainframe magnetic tape format, with every field having a fixed width. There is some accompanying SAS code to interpret these files (splitting up the tape records and adding variable names).  Since I was going to do my analyses in R, I needed to run this code and export the data in CSV format. I didn't have access to SAS (in fact, I had never previously used it), and it seemed like a big favour to ask someone to convert over 280 megabytes of data (the combined size of the three files that I downloaded) for me, especially because I thought (correctly) that it might take a couple of iterations to get the right set of files.  Fortunately, I discovered the remarkable SAS University Edition, a free-to-use version of SAS that seems to have most of the features one might want from a statistics package. This, too, is a big download (around 2GB, plus another 100MB for Oracle VirtualBox, which you also need: SAS are not going to let you run their software on just any operating system, it has to be Red Hat Linux, and even if you already have Red Hat Linux, you have to run their virtualised version on top!), but amazingly, it all worked first time.  As long as you have a recent computer (64-bit processor, 4GB of RAM, a few GB free on the disk), this should work on Windows, Mac, or Linux.

Having identified and downloaded the NHANES files that I needed, opening those files using SAS University Edition and exporting them to CSV format turned out to require just a couple of lines of code using PROC EXPORT, for which I was able to find the syntax on the web quite easily.  Once I had those CSV files, I could write my code to read them in, extract the appropriate variables, and repeat most of the analyses in the article that I had chosen.
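In case it helps to see what this conversion actually involves: the SAS input code is essentially a recipe of fixed column positions and variable names, which any language can apply. Here is a minimal sketch in Python, with an entirely hypothetical three-field layout (the real NHANES-III positions and names come from the accompanying SAS files, and a real record has hundreds of fields):

```python
import csv
import io

# Hypothetical fixed-width layout: (variable name, start column, end column).
# The real NHANES-III layouts are specified in the accompanying SAS code.
LAYOUT = [
    ("SEQN",    0, 5),  # respondent sequence number
    ("HSSEX",   5, 6),  # sex
    ("HSAGEIR", 6, 8),  # age in years
]

def parse_record(line):
    """Split one fixed-width tape record into named fields."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}

def to_csv(lines):
    """Convert fixed-width records to CSV text (roughly what PROC EXPORT does)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=[name for name, _, _ in LAYOUT])
    writer.writeheader()
    for line in lines:
        writer.writerow(parse_record(line))
    return out.getvalue()

# Two made-up records in the hypothetical layout above.
sample = ["0000113 4"[:8], "000022 7"]
sample[0] = "00001" + "1" + "34"
print(to_csv(sample))
```

This is an illustration of the principle only; in practice, letting SAS do the conversion via the supplied code is much less error-prone.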

Regular readers of this blog may be able to guess what happened next: I didn’t get the same results as the authors.  I won’t disclose too many details here because I don’t want to bias the reanalysis exercise that I’m proposing to conduct, but I will say that the differences did not seem to me to be trivial.  If my numbers are correct then a fairly substantial correction to the tables of results will be required.  At least one (I don't want to give more away) of the statistically significant results is no longer statistically significant, and many of the significant odds ratios are considerably smaller.  (There are also a couple of reporting errors in plain sight in the article itself.)

When I discovered these apparent issues back in 2016, I wrote to the lead author, who told me that s/he was rather busy and invited me to get in touch again after the summer. I did so, but s/he then didn't reply further. Oh well. People are indeed often very busy, and an e-mail from one person (who perhaps doesn't understand everything that you did in your study) is not necessarily a reason to drop everything and start going back through calculations you ran more than a decade ago.  I let the matter drop at the time because I had other stuff to do, but a few weeks ago it stuck its nose up through the pile of assorted back-burner projects (we all have one) and came to my attention again.

So, here's the project.  I want to recruit a few (ideally around three) people to independently reanalyse this article using the NHANES-III datasets and see if they come up with the same results as the original authors, or the same as me, or some different set of results altogether.  My idea is that, if several people working completely independently (within reason) come up with numbers that are (a) the same as each other and (b) different from the ones in the article, we will be well placed to submit a commentary article for publication in the journal (which has an impact factor over 5), suggesting that a correction might be in order. On the other hand, if it turns out that my analyses were wrong, and the article is correct, then I can send the lead author a note to apologise for the (brief) waste of his/her time that my 2016 correspondence represented. Whatever the outcome, I hope that we will all learn something.

For the moment I'm not going to name the article here, because I don't want to have too many people running around reanalysing it outside of this "crowdsourced" project.  Of course, if you sign up to take part, I will tell you what the article is, and then I can't stop you shouting its DOI from the rooftops, but I'd prefer to keep this low-key for now.

If you would like to take part, please read the conditions below.

1. If the line below says "Still accepting offers", proceed. If it says "I have enough people who have offered to help", stop here, and thanks for reading this far.

========== I have enough people who have offered to help ==========

2. You will need a computer that either already has SAS on it, or on which you can install SAS (e.g., University Edition).  This is so that you can download and convert the NHANES data files yourself.  I'm not going to supply these, for several reasons: (a) I don't have the right to redistribute them, (b) I might conceivably have messed something up when converting them to CSV format, and (c) I might not even have the right files (although my sample sizes match the ones in the article pretty closely).  If you are thinking of volunteering, and you don't have SAS on your computer, please download SAS University Edition and make sure that you can get it to work.  (An alternative, if you are an adventurous programmer, is to download the data and SAS files, and use the latter as a recipe for splitting the data and adding variable names.)

3. You need to be reasonably competent at performing logistic regressions in SAS, or in a software package that can read SAS or CSV files.  I used R; the original authors used proprietary software (not SAS).  It would be great if all of the people who volunteered used different packages, but I'm not going to turn down anyone just because someone else wants to use the same analysis software. However, I'm also not going to give you a tutorial on how to run a logistic regression (not least because I am not remotely an expert on this myself).

4. Volunteers will be anonymous until I have all the results (to avoid, as far as possible, people collaborating with each other).  However, by participating, you accept that once the results are in, your name and your principal results may be published in a follow-up blog post. You also accept, in principle, to be a co-author on any letter to the editor that might result from this exercise.  (This point isn't a commitment to be signed in blood at this stage, but I don't want anyone to be surprised or offended when I ask if I can publish their results or use them to support a letter.)

5. If you want to work in a team on this with some colleagues, please feel free to do so, but I will only put one person's name forward per reanalysis on the hypothetical letter to the editor; others who helped may get an acknowledgement, if the journal allows.  Basically, ensure that you can say "Yes, I did most of the work on this reanalysis, I meet the criteria for co-authorship".

6. The basic idea is for you to work on your own and solve your own problems, including understanding what the original authors did.  The article is reasonably transparent about this, but it's not perfect and there are some ambiguities. I would have liked to have the lead author explain some of this, but as mentioned above, s/he appears to be too busy. If you hit problems then I can give you a minimum amount of help based on my insights, but of course the more I do that, the more we risk not being independent of each other. (That said, I could do with some help in understanding what the authors did at one particular point...)

7. You need to be able to get your reanalysis done by June 30, 2018.  This deadline may be moved (by me) if I have trouble recruiting people, but I don't want to repeat a recent experience in which a couple of the people who had offered to help me on a project stopped responding to e-mails for several months, leaving me to decide whether or not to drop them.  I expect that the reanalysis will take between 10 and 30 hours of your time, depending on your level of comfort with computers and regression analyses.
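For anyone wondering what point 3 amounts to in practice: the article reports its results as odds ratios from logistic regressions. As a toy illustration only (made-up data, plain gradient ascent, nothing to do with the article's actual model or the NHANES variables), here is how a fitted logistic coefficient turns into an odds ratio:

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit y ~ intercept + slope * x by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # predicted probability
            g0 += (y - p)        # gradient w.r.t. intercept
            g1 += (y - p) * x    # gradient w.r.t. slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Made-up exposure/outcome data: the outcome becomes more likely as x rises,
# with some overlap so the maximum-likelihood estimate stays finite.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
odds_ratio = math.exp(b1)  # odds multiplier per one-unit increase in x
print(f"slope = {b1:.3f}, odds ratio = {odds_ratio:.3f}")
```

In real work you would of course use a standard routine (PROC LOGISTIC in SAS, glm(..., family = binomial) in R) rather than rolling your own, but the exp(coefficient) step is the same.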

Are you still here? Then I would be very happy if you would decide whether you think this reanalysis is within your capabilities, and then make a small personal commitment to follow through with it.  If you can do that, please send me an e-mail (nicholasjlbrown, gmail) and I will give you the information you need to get started.