Cajun Codefest 2014

The DocGraph Journal is sponsoring Cajun Codefest 2014 (April 23-25 in Lafayette, LA) and simultaneously holding a virtual codeathon focused on the recently released DocGraphRX Medicare Part D prescribing data.  Since our friends in Louisiana themed this year’s event “Aging in Place” and we’ve spoken with dozens of community members about DocGraphRX in the last few months, we thought the combined opportunity could not be better. Further, we’re providing $2,500 in cash prizes for the virtual and in-person competitions. Each team will receive access to the prescribing patterns for physicians in Louisiana.

Register today to be included in the activities leading up to and during the DocGraph challenge!

Follow the event @cajuncodefest #ccf3…

Note: Registration below is for the DocGraph virtual challenge only. If you would like to also register for the Cajun Codefest main event please visit

Next Generation DocGraph Data

After many months of mild government employee harassment, and delays based mostly on “other projects” that HHS has been working on (have I ever told you how many of my friends at HHS were pulled on to the launch… almost all of them), I am proud to announce that a new, improved update of the DocGraph Edge data set has been released.

This is not just an update, but a dramatic improvement in what data is available. After working with the original DocGraph, we thought of several fundamental improvements, and our FOIA request was much meatier this time. If you liked DocGraph before, you are going to love it now.

First, let’s talk about the contents of the data. The original data set had three columns:

FirstNPI, SecondNPI, SharedTransactionCount

The SharedTransactionCount was the number of times that FirstNPI had seen a given patient first, and SecondNPI had seen the same patient later, within a 30-day window. (If that is tough to follow, you can read the full documentation of the original version of the data set.) SharedTransactionCount was a measure of overlapping patient transactions, but we did not know how many patients were included. We only knew that an edge had to involve at least 11 patients to appear in the data at all. So if the SharedTransactionCount was 1100, there was no way to know whether that meant 1100 patients seen once each, or 11 patients seen 100 times each. At least, that’s how the previous data set worked.

The new data set includes the actual number of patients in the patient sharing relationship. The new data set has the following data structure:

FirstNPI, SecondNPI, SharedTransactionCount, PatientTotal, SameDayTotal

The PatientTotal field is the total number of the patients involved in a treatment event (a healthcare transaction), which means that you can now tell the difference between high transaction providers (lots of transactions on few patients) and high patient flow providers (a few transactions each but on lots of patients).
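To make that distinction concrete, here is a minimal Python sketch that divides SharedTransactionCount by PatientTotal. The field names come from the column list above; the Edge class, the helper function, and the NPI values and counts are invented for illustration and are not taken from the real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One row of the new data set (hypothetical container class)."""
    first_npi: str
    second_npi: str
    shared_transaction_count: int
    patient_total: int
    same_day_total: int


def transactions_per_patient(edge: Edge) -> float:
    """Average number of shared transactions per shared patient."""
    if edge.patient_total <= 0:
        raise ValueError("patient_total must be positive")
    return edge.shared_transaction_count / edge.patient_total


# Made-up rows: the same transaction count with very different patient totals.
edges = [
    Edge("1000000001", "1000000002", 1100, 11, 40),
    Edge("1000000003", "1000000004", 1100, 1100, 0),
]

for e in edges:
    print(e.first_npi, e.second_npi, transactions_per_patient(e))
```

The first made-up edge works out to 100 transactions per patient (a high transaction relationship), the second to 1 (a high patient flow relationship). In the original three-column data these two rows would have looked identical.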

In the original data set, you knew that the two treatment events happened somewhere between “on the same day” and within 30 days. In this new data set, you can differentiate treatment events that happened on the same day, using the SameDayTotal field. Now you can see how often the services were provided on the same day, which is really a whole new graph, with a 0-day window.
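As a rough sketch of pulling out that 0-day graph, the Python below keeps only the edges with a non-zero SameDayTotal. It assumes the file is a headerless CSV in the column order listed above, which is an assumption about the layout rather than something confirmed here, and the two sample rows are made up.

```python
import csv
import io

# Column order as listed in the post; treating the file as headerless CSV
# is an assumption for this sketch.
COLUMNS = ["FirstNPI", "SecondNPI", "SharedTransactionCount",
           "PatientTotal", "SameDayTotal"]

# Made-up sample rows, not real DocGraph data.
SAMPLE = """\
1000000001,1000000002,1100,11,40
1000000003,1000000004,250,200,0
"""


def same_day_edges(lines):
    """Yield (FirstNPI, SecondNPI, SameDayTotal) for edges with same-day activity."""
    reader = csv.DictReader(lines, fieldnames=COLUMNS)
    for row in reader:
        same_day = int(row["SameDayTotal"])
        if same_day > 0:
            yield row["FirstNPI"], row["SecondNPI"], same_day


zero_day_graph = list(same_day_edges(io.StringIO(SAMPLE)))
print(zero_day_graph)
```

Because the function is a generator, the same approach should work on an open file handle for the full data set without loading everything into memory.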

But wait…there’s more!! We also got additional “windows” beyond 30 days: we have data for 60, 90, 180 and 365 day windows. The data is spread between 2012 and the middle of 2013 (which is not actually what we asked for, but we will take it). These data sets are enormous:

Window     Edge Count
30 day     73 million
60 day     93 million
90 day     107 million
180 day    132 million
365 day    154 million

This means that for every edge in the database, we now have three weights instead of just one, and we have more than double the number of edges in our largest-window data set. I look forward to the DocGraph community doing a much more detailed analysis of this data set.
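For a quick sanity check of the “more than double” figure, this small Python sketch compares each window’s edge count to the 30 day window, using the rounded numbers quoted above. The dictionary and variable names are just for illustration.

```python
# Rounded edge counts (in millions) as quoted in this post.
EDGE_COUNTS_MILLIONS = {30: 73, 60: 93, 90: 107, 180: 132, 365: 154}

baseline = EDGE_COUNTS_MILLIONS[30]

# Ratio of each window's edge count to the 30 day window.
growth = {window: count / baseline
          for window, count in EDGE_COUNTS_MILLIONS.items()}

for window, ratio in growth.items():
    print(f"{window:>3} day window: {ratio:.2f}x the 30 day edge count")
```

With these rounded counts, the 365 day window comes out at roughly 2.11 times the 30 day window.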

Probably the most significant announcement that we have to make is that we are releasing this data set for free and without any restriction. We have started a new DocGraph Alliance in which large companies pay the DocGraph Journal to reveal new and more interesting data sets, and to support the DocGraph community in analyzing open data. We will probably still crowdfund data sets when they are “brand new” but for older datasets like the DocGraph Edge data set, we are moving towards a fully open model, sponsored by Alliance Members. With that in mind, we asked HHS to go ahead and publish the newest version of DocGraph data directly on their site for everyone to see. This means that the data can be used by anyone, for any reason, without any license restriction. We hope to be announcing the initial DocGraph Alliance members soon, but you can thank them for sponsoring this model!

With that in mind, please find the data below:

What kinds of amazing things can you do with this new dataset?

Let us know on the DocGraph Community Google Group.


Fair Health: One step forward, One step back

Fair Health was formed as the result of a settlement between a group of insurance companies and the New York Attorney General’s office. The site provides a front end to its considerable pricing data, but the data is coded in CPT codes. Fair Health has taken the approach of licensing CPT descriptions from the AMA, as well as making agreements to get access to even more claims data than it was originally entitled to under the settlement. Specifically, from the Fair Health FAQ:

FAIR Health welcomes organizations to link to our website and download materials for consumer use. FAIR Health incurs fees from third parties such as the American Medical Association for use of healthcare codes in its Lookup tools, however, so links to for commercial purposes require a license agreement and payment of nominal fees.  Such commercial purposes include, but are not limited to, links established by providers or third party payors in connection with participation on state or federal health benefit exchanges.  Please contact us at for further information.

FAIR Health also licenses our consumer resources, including educational material, videos and cost lookup tools for use on organization websites and for other uses. To learn more about licensing opportunities and the associated costs, contact

(emphasis mine)

These agreements culminate in a service agreement that actually attempts to ensure that third parties that link to the site pay a fee. I would be hard pressed to find any other site on the Internet that takes such a stance in its Terms/Conditions/AUP, and I am a little surprised that Fair Health believes this is reasonable… Here it is in short:

Hyperlink Use and Disclaimer. If you or your company has accessed the FAIR Health Consumer Site through the use of a hyperlink, you agree and acknowledge that you will follow the rules set forth below:

a. All links shall link only to the FAIR Health Consumer site home page currently located at (“Consumer Site”).

b. You shall not attempt to modify, alter or frame any content on the Consumer Site. We reserve the right to review your website at any time to ensure that the link is being used appropriately.

c. FAIR Health is a New York not-for-profit corporation qualifying under section 501(c) (3) of the Internal Revenue Code. Your use of a hyperlink shall not be construed to imply sponsorship or endorsement by FAIR Health of you, your website or your products.

d. We do not necessarily review or approve of the content displayed on all websites that have linked to the Consumer Site.

e. Your website shall not include any description of FAIR Health or its products without the prior written consent of FAIR Health.

f. You agree that all FAIR Health proprietary trademarks, service marks and logos (collectively, “Marks”), belong exclusively to FAIR Health and when you use these Marks on your website, you must comply with FAIR Health’s standards. Any such use must be approved in writing by FAIR Health.

g. If we object to the link between your website and the Consumer Site for any reason in our sole discretion, you agree to remove it within twenty-four (24) hours of receiving notice from us.

h. Your use of a hyperlink linking your website with the Consumer Site is at your own risk.

(emphasis mine)

You can read the whole Terms and Conditions here. It is not lost on us that, in Fair Health’s view, providing a direct link to the Terms and Conditions is itself a violation of term (a). Of course, given that we are writing this article without permission, we are also violating (e). We will obviously not be subjecting this article to approval from Fair Health, which means we are also violating (b) and (g). In the interest of full disclosure, we also submitted the FAQ and terms to the Internet Archive, so that we can tell if they have been changed. This would also violate the terms, which would categorize this action as equivalent to uploading a virus to their servers. Given that we obviously cannot accept the Terms and Conditions, we will obviously not be using the website. However, I am not sure how Fair Health believes that, given our outright rejection of these terms, they can control what we do or do not link to. Nor do I see how they expect to give Google and Yahoo access to the facts about their changing content while denying us the same privileges.  Most importantly, in Fair Health’s mind, our ability to read the terms means that we have agreed to the terms.  Fascinating, no?

Their commitment to “reading but no parsing” goes so deep that Fair Health has disabled right mouse clicks using JavaScript. This prevents not only copy and paste, but also “open in a new tab” and so on. Normally when I see draconian steps like this, I also see a complete lack of attention to accessibility issues. To their credit, this is not the case with Fair Health. With the exception of some missing labels, their initial form was actually fairly accessible. While I am concerned that some users with disabilities might rely on right click menu items to enable their plugins, most of the standard screen reader technology should work just fine on the site.

This puts Fair Health in the interesting position of providing some transparency without taking an open data approach. This is problematic because the “interests of the public” are only halfway met. Fair Health does enable patients to look up cost data, but it does not allow data journalists and data scientists to examine the trends and patterns in the same data. This halfway measure would not be such a problem were it not for the implicit endorsement that Fair Health seems to be getting from both industry and government: that this approach equates to transparency for the health insurance industry. It clearly does not.

Fair Health is obviously providing an important service for consumers, but I fear that in the long term this half measure may come at the expense of true transparency. It is easy enough to endorse them for working hard to move in the right direction. However much we dither about open data tactics, they deserve credit for the progress they have made! Obviously, the DocGraph project will never pay for access to data that comes with a caveat that it cannot be shared openly, and we certainly do not accept the terms of the Fair Health AUP. But in the spirit of not being needlessly confrontational, we can at least link you to Google Search results for Fair Health, rather than linking to them directly.

Great reference article about the settlement.

Comments on the aftermath of Fair Health

UPDATE April 29, 2015: Recently, Fair Health further restricted the number of searches that are available.


HIMSS always has lots of Open Source and Open Data, but you have to know where to look.

This year, the standout sessions were the FDA Open Data talk and the Fair Health session.

The FDA is slowly announcing its new set of APIs. You can read more at and follow the latest at the @openfda twitter account. I heard a talk from Taha Kass-Hout, the Chief Informatics Officer of the FDA.

I will try to get slides to post here, or I will upload the blurry photos I took of the most important slides…

For now I want to cover some of the important themes from the talk.

First, the FDA is continuously investing in cloud-based technology that enables it to be open when it should be open, and secure when it should be secure. Pharma companies are now uploading genomic sequencing data to support many of their drug approval processes, and these uploads can be many terabytes of data each. The space needed to effectively process and examine this data is frequently an order of magnitude larger than the upload itself. This is far from the only source of genomic data that the FDA is processing. When it investigates a food-borne pathogen outbreak, it samples those pathogens and sequences many of them. All in all, the FDA is now dealing with a deluge of incoming and outgoing data.

The FDA has built/is building a three-tiered cloud approach in response to this massive growth in data processing requirements. First, it is investigating internal cloud infrastructure with strict access control in order to protect the trade secrets that it implicitly gets from pharma companies, along with their data uploads. It is also building a public cloud infrastructure in order to effectively collaborate in the open when its data allows openness, which it frequently does. Lastly, it is developing a hybrid cloud infrastructure to handle “middle cases”. All of these cloud infrastructures are designed to exchange data with each other, and with FDA sites spread across the country.

Of course, the DocGraph project is probably most interested in coming improvements to FDA labeling data. Drug labels are famously difficult to deal with: they are flat text files which must be carefully processed in order to correctly interpret them. The FDA is promising dramatic improvements in what is available here. This could have tremendous implications for in-the-open analysis of medication data…

The Open FDA also has a mostly empty GitHub page, which is worth watching as it populates.

Janos Hajagos analyzes DocGraph RX

I have just realized that I had missed a very comprehensive analysis of the DocGraph RX dataset by Janos Hajagos.

He looked at providers in Hawaii to limit the scope. You can see his full-sized chart here. You can look at his source code here.

Highlights from the post:

The second core aspect of the work was to map semantic clinical drugs (TTY=SCD) to WHO’s ATC drug classification system. ATC drug codes are free to use for non commercial purposes. RxNorm includes an accidental mapping to ATC drug codes but they are based on a single ingredient and not on the route, e.g., oral versus topical. A rather time consuming process, in terms of writing SQL, was done to improve the mapping. The final result while not 100% complete allows drugs to be sorted by a synthetic ATC code. Certain branded drugs like Skelaxin (Metaxalone) are not part of the current ATC release. Whenever possible I try to map to the longest length ATC code. The advantage of using ATC to sort the drugs is that we put drugs that are similar spatially near each other. The MySQL queries for generating the refined drug database are on GitHub.
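To illustrate the sorting idea only, here is a small Python sketch. It is not Janos’s SQL: the candidate lists, the function name, and the selection rule are assumptions made for this example. For each drug it picks the longest candidate ATC code and then sorts by that code, so drugs in the same ATC branch end up next to each other.

```python
# Hypothetical candidate ATC codes per drug, at different levels of detail.
# The full seven-character codes are standard ATC codes for these ingredients;
# the shorter entries are simply their parent levels.
CANDIDATE_ATC_CODES = {
    "ibuprofen": ["M01A", "M01AE01"],
    "atorvastatin": ["C10", "C10AA05"],
    "metformin": ["A10BA02"],
    "simvastatin": ["C10", "C10AA", "C10AA01"],
    "omeprazole": ["A02", "A02BC01"],
}


def longest_code(codes):
    """Prefer the most specific (longest) ATC code among the candidates."""
    return max(codes, key=len)


def sort_by_atc(candidates):
    """Return (drug, atc_code) pairs ordered by the chosen ATC code."""
    chosen = {drug: longest_code(codes) for drug, codes in candidates.items()}
    return sorted(chosen.items(), key=lambda pair: pair[1])


for drug, code in sort_by_atc(CANDIDATE_ATC_CODES):
    print(code, drug)
```

In this toy example the two statins (C10AA01 and C10AA05) land side by side, which is the “similar drugs near each other” effect described in the quote.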

This is truly unprecedented work!!!

(Update Feb 2014: I misspelled Hajagos in this article. How embarrassing.)


Berkman Center Supports DocGraph

In an effort to assist the DocGraph Project with adding state-level physician credentialing data, the Community has started working with the Berkman Center for Internet and Society at Harvard University.  The Berkman Center is working to farm out the state-level FOIA issues to Harvard-trained FOIA experts in each state.  The Berkman Center is the leading expert in Internet-related legal issues and has been at the foundation of some of the most innovative cyberspace self-regulation efforts.  The Creative Commons licenses are among the successful Berkman projects.