Next Generation DocGraph Data

After many months of mildly harassing government employees, and delays caused mostly by the “other projects” that HHS has been working on (have I ever told you how many of my friends at HHS were pulled onto the healthcare.gov launch? Almost all of them), I am proud to announce that a new, improved update of the DocGraph Edge data set has been released.

This is not just an update, but a dramatic improvement in what data is available. After working with the original DocGraph, we thought of several fundamental improvements, and our FOIA request was much meatier this time. If you liked DocGraph before, you are going to love it now.

First, let's talk about the contents of the data. The original data set had three columns:

FirstNPI, SecondNPI, SharedTransactionCount

The SharedTransactionCount was the number of times that FirstNPI had seen a given patient first, and SecondNPI had seen the same patient later, within a 30 day window. (If that is tough to follow, you can read the full documentation of the original version of the data set). SharedTransactionCount was a measure of overlapping patient transactions, but we did not know how many patients were included. All we knew was that a threshold of at least 11 patients had to be met. So if the SharedTransactionCount was 1100, there was no way to know if that meant 1100 patients once each, or 11 patients 100 times each. At least, that’s how the previous data set worked.

The new data set includes the actual number of patients in the patient sharing relationship. The new data set has the following data structure:

FirstNPI, SecondNPI, SharedTransactionCount, PatientTotal, SameDayTotal

The PatientTotal field is the total number of the patients involved in a treatment event (a healthcare transaction), which means that you can now tell the difference between high transaction providers (lots of transactions on few patients) and high patient flow providers (a few transactions each but on lots of patients).
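To make that concrete, here is a minimal Python sketch of the "1100 transactions" example from above. It assumes the file is a headerless CSV with the five columns in the order listed (that layout is my assumption, so check the actual file), and the NPIs and counts in the sample are made up for illustration.

```python
import csv
import io

# Hypothetical sample rows, in the column order given in the post.
# The NPIs and counts are invented for illustration only.
SAMPLE = """\
1000000001,1000000002,1100,1100,40
1000000003,1000000004,1100,11,0
"""

# Assumed layout: no header row, columns in this order.
FIELDS = [
    "FirstNPI",
    "SecondNPI",
    "SharedTransactionCount",
    "PatientTotal",
    "SameDayTotal",
]


def transactions_per_patient(rows):
    """Map each (FirstNPI, SecondNPI) edge to transactions per shared patient."""
    result = {}
    for row in rows:
        shared = int(row["SharedTransactionCount"])
        patients = int(row["PatientTotal"])
        if patients == 0:
            continue  # skip rows we cannot divide; not expected in practice
        result[(row["FirstNPI"], row["SecondNPI"])] = shared / patients
    return result


reader = csv.DictReader(io.StringIO(SAMPLE), fieldnames=FIELDS)
ratios = transactions_per_patient(reader)
for edge, ratio in ratios.items():
    print(edge, ratio)
```

Both sample edges have the same SharedTransactionCount, but the first works out to one transaction per patient (high patient flow) and the second to one hundred per patient (high transaction), which is exactly the distinction the old data could not make.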

In the original data set, you knew that the two treatment events happened somewhere between “on the same day” and within 30 days. In this new data set, you can differentiate treatment events that happened on the same day, using the SameDayTotal field. Now you can see how often the services were provided on the same day, which is really a whole new graph, with a 0-day window.
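One way to pull out that 0-day graph is to keep only the edges where SameDayTotal is greater than zero and use SameDayTotal as the edge weight. The sketch below does that on a couple of made-up rows; the tuple layout and the sample values are assumptions for illustration, not the real data.

```python
def same_day_edges(rows):
    """Keep edges with at least one same-day event, weighted by SameDayTotal.

    Each row is assumed to be a tuple of
    (FirstNPI, SecondNPI, SharedTransactionCount, PatientTotal, SameDayTotal).
    """
    edges = {}
    for first, second, _shared, _patients, same_day in rows:
        if same_day > 0:
            edges[(first, second)] = same_day
    return edges


# Hypothetical rows; NPIs and counts are invented.
rows = [
    ("1000000001", "1000000002", 1100, 1100, 40),
    ("1000000003", "1000000004", 1100, 11, 0),
]

zero_day = same_day_edges(rows)
print(zero_day)
```

The second edge drops out because none of its shared events happened on the same day.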

But wait… there’s more!! We also got additional “windows” beyond 30 days: we have data for 60, 90, 180 and 365 day windows. The data is spread between 2012 and the middle of 2013 (which is not actually what we asked for, but we will take it). These data sets are enormous:

Window     Edge Count
30 day     73 Million Edges
60 day     93 Million Edges
90 day     107 Million Edges
180 day    132 Million Edges
365 day    154 Million Edges
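For a rough sense of how the graph grows as the window widens, this small Python sketch divides each edge count by the 30 day count. The numbers are the rounded figures from the table, so the ratios are approximate; the function name is just something I made up for the example.

```python
# Edge counts in millions, rounded, taken from the table above.
EDGES_MILLIONS = {30: 73, 60: 93, 90: 107, 180: 132, 365: 154}


def growth_vs_baseline(counts, baseline=30):
    """Return each window's edge count as a multiple of the baseline window."""
    base = counts[baseline]
    return {window: count / base for window, count in counts.items()}


growth = growth_vs_baseline(EDGES_MILLIONS)
for window, ratio in growth.items():
    print(f"{window:>3} day: {ratio:.2f}x the 30 day edge count")
```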

This means that for every edge in the database, we now have three weights instead of just one, and our largest-window data set (365 days, 154 million edges) has more than double the edges of the 30 day data set (73 million). I look forward to the DocGraph community doing a much more detailed analysis of this data set.

Probably the most significant announcement that we have to make is that we are releasing this data set for free and without any restriction. We have started a new DocGraph Alliance in which large companies pay the DocGraph Journal to reveal new and more interesting data sets, and to support the DocGraph community in analyzing open data. We will probably still crowdfund data sets when they are “brand new” but for older datasets like the DocGraph Edge data set, we are moving towards a fully open model, sponsored by Alliance Members. With that in mind, we asked HHS to go ahead and publish the newest version of the DocGraph data directly on their site for everyone to see. This means that the data can be used by anyone, for any reason, without a license restriction. We hope to be announcing the initial DocGraph Alliance members soon, but you can thank them for sponsoring this model!

With that in mind, please find the data below:

What kinds of amazing things can you do with this new dataset?

Let us know on the DocGraph Community Google Group.