Announcing MrPUP

We are happy to announce that the DocGraph Journal is releasing the first Medicare Public Use File published by a private organization.

The new public use file is called MrPUP, and it details how outpatient providers in Medicare refer procedures. MrPUP stands for Medicare Referring Provider Utilization for Procedures. You can download the data here; the Open Source Eventually version is free to attendees of Datapalooza!

There are many Medicare procedures, including most lab and imaging tests, that require a specific physician to refer them. When a Medicare beneficiary gets an X-Ray or a blood test done, Medicare wants to know what provider ordered these tests. As a result, for some, but not all, procedures, Medicare requires the referring doctor’s NPI number.

Previously, CMS has released information on the performed procedures in Medicare, coded by NPI and hcpcs_code. MrPUP has a similar data structure, but instead of including the performing NPI, it includes the referring NPI.
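To make that structure concrete, here is a minimal Python sketch of grouping such a file by referring NPI. The column names `referring_npi` and `line_count`, and every sample row, are assumptions made up for this illustration; only `hcpcs_code` comes from the description above, and the real MrPUP layout may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real MrPUP column names and values may differ.
SAMPLE = """referring_npi,hcpcs_code,line_count
1234567893,80053,120
1234567893,85025,95
1234567893,71020,14
1111111112,80053,40
"""


def procedures_by_referrer(csv_text):
    """Map each referring NPI to a {hcpcs_code: count} dict of what it referred."""
    referred = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        npi = row["referring_npi"]
        code = row["hcpcs_code"]
        referred[npi][code] = referred[npi].get(code, 0) + int(row["line_count"])
    return dict(referred)


by_referrer = procedures_by_referrer(SAMPLE)
print(by_referrer["1234567893"])
```

The same grouping keyed on a performing NPI would describe the existing CMS file; the only change here is which NPI the rows are keyed on.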


The simplest way to explain the value of this new data is to show how lab referrals could be analyzed with open data before this release, and then how they can be analyzed after it. Before this data release, it was possible to understand which providers were sending patients to which labs by leveraging the teaming dataset, which was the first data set that we released and the one that gave DocGraph its name. We also know, from the current performing NPI PUF from CMS, what kinds of tests those laboratories are performing. Using the analysis tools provided by CareSet Systems, we can see how that data analysis might look for a specific doctor:



This is a flow graph from an actual interface inside CareSet Systems. On the left, in blue, we see the procedures that our doctor is responsible for in Medicare. Our doctor, Dr. Magid, is in orange in the center. The green dots represent the laboratories that Dr. Magid shares patients with. The red dots show the procedures that the laboratories are performing. Obviously the labs that this provider works with, which include LabCorp, Quest and the lab at this provider’s group practice, do hundreds of distinct lab tests, with a huge variety and overlap. Our doctor could only be responsible for a tiny fraction of these… but which ones?

Enter MrPUP. Using MrPUP we can generate a new graph, where the edges of the connections between the labs and the provider can be labeled with specific lab tests. This gives us insight into how this doctor is referring labs, in a way that the first analysis just does not.


You can see how much more powerful this analysis method is.


There are some caveats to how this data should be analyzed. Not every procedure in Medicare requires a referring NPI to be included in the claim, but all claims allow for this field to be filled in. There are some cases where CMS requires this field to be properly filled in, but there are no cases that we are aware of where its use is forbidden. That means that this data set might include some interesting data that is not indicative of any trends. Let’s imagine a cardiologist who configures her billing system to always include the NPI of the primary care doctor who referred a patient. In our data set, those primary care providers would appear to have an unusual amount of “referred cardiology procedures”. It would make them really stand out in the data set as unique. But in fact, that information does not say anything interesting at all about those primary care doctors… it’s just an artifact of the strange way that one cardiologist has decided to bill.

There are many cases where the underlying CMS claims data includes what appear to be self-referrals. Of course, self-referral is technically not allowed by policy, but that policy does not extend to requirements about how providers have to fill out specific claims forms. So the underlying data includes lots of cases where the same provider is included in both the referring provider NPI field and the performing provider NPI field. The vast majority of these providers are not actually doing anything shady, but are just honoring CMS requirements about how to use the various claims fields. More importantly, when a self-referral is made, those procedure patterns already appear in the standard CMS outpatient utilization PUF. For those reasons, and to generally avoid drama, we have excluded self-referred procedures from this PUF. We might change how we address this in the future, but this is the simplest way to keep this data release clean.
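The exclusion described above amounts to dropping every row where the two NPI fields match. A minimal sketch, assuming a hypothetical row layout of (referring NPI, performing NPI, procedure code) with made-up values; this is not the actual processing code behind MrPUP:

```python
# Hypothetical claim rows: (referring_npi, performing_npi, hcpcs_code).
claims = [
    ("1234567893", "1234567893", "93000"),  # same NPI in both fields
    ("1234567893", "1111111112", "80053"),
    ("1111111112", "1234567893", "93000"),
]


def drop_self_referrals(rows):
    """Keep only rows where the referring and performing NPIs differ."""
    return [row for row in rows if row[0] != row[1]]


kept = drop_self_referrals(claims)
print(len(kept))
```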

Hopefully the release of the MrPUP data will draw attention to the requirements that CMS makes regarding this specific billing field, and future policies will ensure that the data becomes more reliable over time. Or not, one never can tell about these things.

Data Licensing

Although this is open data, it is not costless. It takes money for us to work on DocGraph and as a result we are charging a nominal fee for access to the data. If you are a student, researcher, academic or hacker, you probably want to purchase the Open Source Eventually (OSE) version of the data. This version of the data is much cheaper (think textbook) and in one year will become a Creative Commons licensed data file. In the meantime, any work you do with or on the file must be released under a Creative Commons license or some other open source license. This version does not allow you to share the data in any way.

If you would like to use the data in your product or service, or otherwise leverage it commercially, you can purchase a commercial-friendly license for it. This costs a little more, but it is still hundreds of thousands of dollars less than it would cost to create the data set yourself. We appreciate those who choose to purchase this license, because this is what allows us to continue our work at DocGraph.

How to get the data for free for Datapalooza attendees:

In order to get the data for free (you will be getting the OSE version), you must @ mention @DocGraph in a tweet that shows you pictured with something fun that clearly demonstrates that you are in attendance at Datapalooza. In fact, if you are not at Datapalooza, and your tweet pretending that you are is clever enough, we might just decide to give you a free copy anyway. After that, go ahead and apply for the free data at the MrPUP download page. Once you have tweeted at us, follow @DocGraph so that we can DM you a link to the download file!


DocGraph Teaming data update

CMS has released an updated version of the DocGraph teaming data set* that was retracted on October 5, 2015. The DocGraph Teaming data set documents how healthcare providers in the U.S. work together. We believe this release corrects the critical issues we identified prior to the retraction. Access to the raw data can be found through our Linea data portal, and higher-level support and software services can be found from our sister company, CareSet Systems.

The improvements to the data set include:

  • Updated documentation which clearly states the date ranges found in each downloadable file. (The previous version of the data was retracted because the date ranges were mislabeled);
  • Very recent data. This data release has 2015 data updated through Oct 1, 2015;
  • Use of a consistent algorithm for all years from 2009-2015, which makes year-over-year analysis possible. The CareSet Systems blog will have several articles coming out soon about our year-over-year analysis of these data sets.

We expect to work with CMS to validate the data set in the coming months. We know that this data is a dramatic improvement on what was previously available, but we have not had the opportunity to review the data creation methods and validate that the data was generated according to our algorithm specification. Until that happens, DocGraph cannot vouch for the data. We can only say that CMS is “vouching” that the data is fixed by releasing it, and that most of the issues with the previous data sets appear to have been corrected.

As always, DocGraph would like to thank CMS and HHS for their continued commitment to openness.

There are still a few documentation issues with this data release, and we will be coordinating with CMS on an ongoing basis to correct them. Until then, we encourage the DocGraph community to keep the following in mind:

  • CMS has not explicitly documented the algorithm they used to create the data set. This algorithm has changed and is not 1-to-1 comparable with the data sets previously released. They appear to have made the modifications to the algorithm that we suggested, but we have no way of verifying that for now.
  • CMS continues to refer to the data as a “referral” data set, despite it being a “shared patient in time” data set that includes “referrals” as one subset. While this data does include traditional referrals as a subset, this is not strictly referral data. There are fields used in medical claims to document referrals and this data set was not generated using those fields.
  • CMS continues to label the file as “physician” despite it covering all Medicare provider types (except pharmacies, due to the exclusion of Part D data). Being a physician is not a prerequisite for being included in these data sets; any provider who bills Medicare enough to meet the patient privacy threshold will be included in the data set.
  • There are coding problems in the datasets. Specifically, there are many “impossible” NPIs (National Provider Identifiers). For instance, all of these NPIs are returned from a query of the 2014 data:

npi, npi_count, problem
6073299,2,"This NPI is not 10 digits, it has 7: 6073299"
16073299,2,"This NPI is not 10 digits, it has 8: 16073299"
135632034,8,"This NPI is not 10 digits, it has 9: 135632034"
162909399,1,"This NPI is not 10 digits, it has 9: 162909399"
162915969,1,"This NPI is not 10 digits, it has 9: 162915969"
174031524,1,"This NPI is not 10 digits, it has 9: 174031524"
999999992,1,"This NPI is not 10 digits, it has 9: 999999992"
1063828204,1,"This NPI does not pass Luhn: 1063828204"
1194809840,3,"This NPI does not pass Luhn: 1194809840"
1245396655,1,"This NPI does not pass Luhn: 1245396655"
1619944960,1,"This NPI does not pass Luhn: 1619944960"
1740228645,3,"This NPI does not pass Luhn: 1740228645"
1750458455,5,"This NPI does not pass Luhn: 1750458455"
9999999991,502,"This NPI does not pass Luhn: 9999999991"
9999999992,11634,"This NPI does not pass Luhn: 9999999992"
9999999994,116,"This NPI does not pass Luhn: 9999999994"
9999999996,2975,"This NPI does not pass Luhn: 9999999996"

Based on previous investigations, we know that the contractors who generated the files are faithfully returning what is listed in the NPI field. The underlying problem is with the actual Medicare claims database. We suspect that these are the last vestigial organs of pre-NPI billing systems, but we cannot be sure. Happily these strange numbers are far less common in the 2015 data set. Perhaps CMS is succeeding in squashing the non-NPI coded transactions once and for all.

Here is an NPI validity report for the 2015 data:

npi, npi_count, problem
9999999991,103,"This NPI does not pass Luhn: 9999999991"
9999999992,1569,"This NPI does not pass Luhn: 9999999992"

Much better!
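For readers who want to run the same kind of check on their own copy of the data, here is a minimal Python sketch of the two tests named in the reports: a length check, and the Luhn check digit test applied with the 80840 prefix that the NPI standard prepends. The function names and message wording are our own choices for this example, not the code that produced the reports above.

```python
def luhn_ok(digits):
    """Standard Luhn check over a string of decimal digits."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def npi_problem(npi):
    """Return a description of the problem, or None if the NPI looks valid."""
    npi = str(npi)
    if not npi.isdigit() or len(npi) != 10:
        return f"This NPI is not 10 digits, it has {len(npi)}: {npi}"
    # The check digit is computed with the 80840 prefix prepended to the NPI.
    if not luhn_ok("80840" + npi):
        return f"This NPI does not pass Luhn: {npi}"
    return None


for candidate in ["1234567893", "6073299", "9999999992"]:
    print(candidate, npi_problem(candidate))
```

Passing both tests only means a number is well formed. It does not show that the NPI was ever issued to a real provider.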

For real-time discussion about these data sets, join the DocGraph Google group.

For a more thorough exploration of this data, sign up for CareSet news (link to sign up here).

Enjoy… and watch this space! We will be opening lots more data in 2016 and beyond.


Fred Trotter

Co-founder, DocGraph and CareSet Systems
* DocGraph teaming data shows how healthcare providers who bill Medicare cooperate to deliver care to their patients. Essentially, the teaming dataset documents Medicare providers who share patients in a given year. The result of this is a data structure that data scientists call a weighted directed graph of relationships. To be more specific, the method used to generate the graph converts the bipartite (two types of nodes, patients and providers) graph structure into a graph with just one type of node, a graph showing relationships between providers. In layman’s terms, the DocGraph teaming data set is a massive map of the healthcare system in the United States. It shows referrals, ordering patterns and many other types of healthcare provider collaborations. This data set exists as the result of a FOIA request made by DocGraph. Since that time, DocGraph has continued to collaborate with CMS to ensure that the data was updated and reliable.
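The conversion from a bipartite graph to a provider-only graph can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm CMS used: the visit rows are made up, and the rule "provider A saw the patient before provider B, with no limit on the time between visits" is an assumption chosen to keep the example short.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical visits: (patient_id, provider_npi, day_of_visit).
visits = [
    ("p1", "A", 1), ("p1", "B", 5),
    ("p2", "A", 2), ("p2", "B", 3),
    ("p3", "B", 1), ("p3", "A", 9),
]


def project_to_provider_graph(rows):
    """Collapse patient-provider visits into weighted, directed provider edges.

    The weight of edge (a, b) is the number of distinct patients who were
    seen by provider a before they were seen by provider b.
    """
    by_patient = defaultdict(list)
    for patient, provider, day in rows:
        by_patient[patient].append((day, provider))

    patients_on_edge = defaultdict(set)
    for patient, seen in by_patient.items():
        seen.sort()  # order each patient's visits by day
        for (_, first), (_, second) in combinations(seen, 2):
            if first != second:
                patients_on_edge[(first, second)].add(patient)
    return {edge: len(patients) for edge, patients in patients_on_edge.items()}


edges = project_to_provider_graph(visits)
print(edges)
```

The patients disappear from the result entirely: only provider-to-provider edges and their shared-patient counts remain, which is what makes the graph both weighted and directed.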

Batea Chrome Extension Announcement

Today we announced the release of Batea, a Chrome extension designed to help editors improve Wikipedia pages. 

Batea allows participants to donate specific browsing data and provide comments on Wikipedia pages. The extension is part of the Batea study, an IRB-approved data donation process.

We invite the DocGraph community to check it out, and if you are interested, to install it!  As always, let us know what you think by emailing us at
Or you can go straight to the extension in the Chrome store here:

DocGraph Launches Linea

Press Release on PR Newswire:

HOUSTON, June 1, 2015 /PRNewswire/ -- DocGraph is launching a new web-based portal, Linea, to enable the health data science community to discover, aggregate and enrich new open healthcare datasets.

DocGraph Linea is based on technology developed and contributed by Merck (known as MSD outside the United States and Canada). DocGraph Linea will provide data scientists a socially-enabled community open data platform that collects details about disparate healthcare datasets, and further allows the community to extend what data is available. Users will be able to search datasets, understand data lineage, view relationship matrices, add metadata, and see community algorithms.

Initially, DocGraph will seed the site with its known list of viable data sources.  Users will be able to contribute data they discover or create themselves, and DocGraph Linea will act as a marketplace for innovative data releases and code. DocGraph Linea will pull together and link to assorted datasets under Public Domain, Open Source, Creative Commons, and other data licenses specific to the data’s source. The community will be able to review and evaluate datasets on the site to ensure quality. DocGraph Linea will provide a curated, disambiguated, and accessible directory of open data.

Fred Trotter, Founder and Data Journalist at DocGraph, said, “While there are already several places to discover and download open healthcare data, there is almost nothing available to help people learn to exercise these data sets. Merck’s IT group has made a substantial technology contribution, which will allow the larger healthcare community to derive new open healthcare data sets. The end result will be lots of new innovations, many new healthcare data startups, and ultimately better healthcare as our society’s understanding of the nuances of healthcare delivery accelerates.”

Peter Lega, Director of Emerging Technology at Merck, said, “As the ecosystem of open data grows, these new capabilities to easily discover, share and enrich it will help foster collaboration, a better corpus of data, and new insights in the open data community.”

About DocGraph

DocGraph is an organization that works to create, maintain, and improve open healthcare datasets.  It aims to grow the open health data movement and build a community of data scientists, journalists, and clinical enterprises who use open data to understand and help evolve the healthcare system.

DocGraph Datathon

As always, DocGraph will be attending Datapalooza, May 31 to June 3, 2015, in Washington, D.C.

Thanks to our friends at we will be hosting a datathon the day before Datapalooza starts.

Join us for:

  • DocGraph data visualization hacking
  • Data structure tutorials for multiple open data sets
  • State-level doctor data hacking
  • Food data demos
  • Sessions on the new prescribing pattern data
  • Sessions on the procedure pattern data
  • Sessions on the open payments data


1776 (12th Floor)
1133 15th Street Northwest
Washington, DC 20005

Time and Date:

Saturday, May 30, 2015 from 9:00 AM to 5:00 PM (EDT)

It costs $35 to register for

Data release: Open Provider Directory and Open Formulary comment data

Recently, HHS released a proposed rule regarding new regulations for health insurance companies. The specific document is called:

Patient Protection and Affordable Care Act: HHS Notice of Benefit and Payment Parameters for 2016

In that proposed rule were two open data concepts that are worth noting:

  • A suggestion that insurance companies be required to release their formulary data as machine readable data sets.
  • A suggestion that insurance companies be required to release data about their current provider directory as machine readable data sets.

As you might imagine, the DocGraph Journal consistently advocates for open data and indeed, we did submit comments regarding this issue…

We had one of our part-time researchers (thanks Armie!!) search all of the comments for mentions of “machine readable” and/or “data” to see who had commented on this matter besides us. Then we created a Google Sheets page with all of the relevant comments in one place. We are now releasing this data to the public.

Read on to access the data, and to read our first-pass analysis of what we found!
