DocGraph Releases First Cancer Dataset at Digital Pharma East

“The data is aggregated Medicare data starting in 2010 and covers six years’ worth of data,” says DocGraph founder Ashish Patel, who announced the dataset’s availability in Philadelphia at the 10th Annual Digital Pharma East conference.

Use of this dataset will be open to anyone, including scientists, oncologists, and digital health entrepreneurs. According to DocGraph data scientist Fred Trotter, “We’re interested in exploring important differences in the experience of cancer patients based on factors such as treatment pathway, geography, and types of physicians and providers.”

The Digital Pharma Series is part of the Life Science Digital Marketers’ Forum for learning the latest digital, mobile and social strategies that yield results. The series is held all over the world, and this year’s Digital Pharma East conference hosted more than 800 pharma executives.

“Our Moonshot dataset can be used to understand the patient’s journey and guide new therapies and trials,” says Patel. “Releasing the cancer dataset at Digital Pharma East puts the resource in front of the right people to speed the oncology ecosystem towards cures.”

DocGraph pioneered the release of national provider referral data made available by the US government. DocGraph data releases have empowered researchers and entrepreneurs to create new data-backed healthcare solutions, and have spawned a growing community of problem solvers who have used the data to create new, innovative solutions.

The Cancer Moonshot datasets are available at no charge under the Open Source Eventually License and can be requested at

For a commercial-friendly license of this data, contact DocGraph’s sister company CareSet Systems at


DocGraph joins Claudia Williams, Niall Brennan and Jess Kahn in DC to Promote Transparency and Fairness in the Healthcare System

WASHINGTON, September 28, 2016 – Health data scientist Fred Trotter presented today at the White House Open Data Innovation Summit on national health data transparency. Trotter is the founder of DocGraph, which in 2012 released the US government’s first national provider referral pattern data, enabling researchers, journalists, and companies around the nation to provide data-backed healthcare solutions.

The White House Open Data Innovation Summit is an initiative that showcases the power of data in tackling the biggest challenges of our democracy, creating positive change, and strengthening the broader civic community.

“Our work at DocGraph is designed to bring transparency and ultimately fairness to the healthcare system,” says Trotter. “The Obama administration has been committed to the ideal of using open data to improve the healthcare system since the first year of his first term and indeed, many of the open data policy decisions made then are now bearing fruit.”

DocGraph data is used by a range of organizations from small data science companies to big pharma. Commercialized use of this data can be found at sister company, CareSet Systems, where Trotter serves as Chief Technology Officer.

“DocGraph and CareSet are able to partner with HHS generally and CMS specifically to create new data sets, especially around Medicare and Medicaid data sets,” he says. “Being invited to the Open Data Innovation Summit to demonstrate our work on open healthcare data is an acknowledgement that our work at DocGraph contributes to the public good, and that the public-private collaboration that is the basis of the CareSet business model is sustainable in the long term.”

On June 29, 2016, Fred Trotter joined Vice President Joe Biden and other national healthcare leaders for the invitation-only National Cancer Moonshot Summit. The Vice President announced that DocGraph would release an open cancer dataset in the Fall of 2016 to help improve outcomes in treating cancer. The dataset will contain summarized information about almost a million Medicare cancer patients and more than 10 million specific claims events, providing the most accurate picture to date of how cancer is treated by Medicare.

Trotter presented along with Claudia Williams, Senior Advisor, Health Innovation and Technology at the White House Office of Science and Technology Policy; Niall Brennan, Chief Data Officer, Centers for Medicare & Medicaid Services; and Jessica Kahn, Director, Data and Systems Group (DSG), Center for Medicaid and CHIP Services (CMCS), Centers for Medicare & Medicaid Services.

About DocGraph
DocGraph creates, maintains, and improves open healthcare datasets and is a pioneer in the open health data movement. DocGraph has helped establish a growing community of data scientists, journalists, and clinical enterprises who use open data to understand and evolve the healthcare system.

About CareSet Systems
CareSet Systems is the nation’s first vendor with access to 100% of Medicare claims and enables the nation’s leading pharmaceutical companies to decode Medicare claims data to guide new drug launches.

DocGraph to release the most accurate data picture to date of how cancer is treated by Medicare

Health data scientist Fred Trotter joined Vice President Joe Biden and other national healthcare leaders today for the invitation-only National Cancer Moonshot Summit. The Vice President announced that Trotter’s company, DocGraph, will release an open cancer dataset this year. The new dataset contains summarized information about almost a million Medicare cancer patients and more than 10 million specific claims events, providing the most accurate picture to date of how cancer is treated by Medicare.

Use of this dataset will be open to anyone, including scientists, oncologists, and digital health entrepreneurs. DocGraph will work with analytics organization CareSet Systems to develop data challenges that engage the data science community in deriving insights from this data. According to Trotter, “we’re interested in exploring important differences in the experience of cancer patients based on factors such as treatment pathway, geography, and types of physicians and providers.”

CareSet CEO Laura Shapland said, “We are honored to support Vice President Biden’s efforts to end cancer as we know it. DocGraph will release the dataset in the fourth quarter, and it will illustrate how Medicare patients travel through the healthcare system in the years before and immediately after their cancer diagnoses, including data about their treatment providers, procedures, medications and survival.”

Trotter is the founder of DocGraph and pioneered the release of the first national provider referral data made available by the US government. DocGraph data releases have empowered researchers and entrepreneurs to create new data-backed healthcare solutions, and have spawned a growing community of problem solvers who have used the data to create new, innovative solutions. The DocGraph analysis is based on Medicare claims data made possible by the Obama Administration’s open data policies.


Announcing MrPUP

We are happy to announce that the DocGraph Journal is publishing the first Medicare Public Use File released by a private organization.

The new public use file is called MrPUP, and it details how outpatient providers in Medicare refer procedures. MrPUP stands for Medicare Referring Provider Utilization for Procedures. You can download the data here; the Open Source Eventually version is free to attendees of Datapalooza!

There are many Medicare procedures, including most lab and imaging tests, that require a specific physician to refer them. When a Medicare beneficiary gets an X-Ray or a blood test done, Medicare wants to know what provider ordered these tests. As a result, for some, but not all, procedures, Medicare requires the referring doctor’s NPI number.

Previously, CMS has released information on the performed procedures in Medicare, coded by NPI and hcpcs_code. MrPUP has a similar data structure, but instead of including the performing NPI, it includes the referring NPI.


The simplest way to explain the value of this new data is to show how lab referrals could be analyzed on open data before this release, and how they can be analyzed after it. Before this data release, it was possible to understand which providers were sending data to which labs by leveraging the teaming dataset, which was the first data set that we released and the one that gave DocGraph its name. We also know, from the current performing NPI PUF from CMS, what kinds of tests those laboratories are performing. Using the analysis tools provided by CareSet Systems, we can see how that data analysis might look for a specific doctor:



This is a flow graph from an actual interface inside CareSet Systems. On the left we see the procedures that our doctor is responsible for with Medicare in blue. Our doctor, Dr. Magid, is in orange in the center. The green dots represent the laboratories that Dr. Magid shares patients with. The red dots show the procedures that the laboratories are performing. Obviously the labs that this provider works with, which include LabCorp, Quest, and the lab at this provider’s group practice, do hundreds of distinct lab tests with a huge variety and overlap. Our doctor could only be responsible for a tiny fraction of these… but which ones?

Enter MrPUP. Using MrPUP we can generate a new graph, where the edges of the connections between the labs and the provider can be labeled with specific lab tests. This gives us insight into how this doctor is referring labs, in a way that the first analysis just does not.


You can see how much more powerful this analysis method is.


There are some caveats to how this data should be analyzed. Not every procedure in Medicare requires a referral NPI to be included in the claim. But all claims allow for this field to be filled in. There are some cases where CMS requires this field to be properly filled in, but there are no cases that we are aware of where its use is forbidden. That means that this data set might include some interesting data that is not indicative of any trends. Let’s imagine a cardiologist who configures her billing system to always include the NPI of the primary care doctor who referred a patient. In our data set, those primary care providers would appear to have a strange and unusual amount of “referred cardiology procedures”. It would make them really stand out in the data set as unique. But in fact, that information does not say anything interesting at all about those primary care doctors… it’s just an artifact of the strange way that one cardiologist has decided to bill.

There are many cases where the underlying CMS claims data includes what appear to be self-referrals. Of course, self-referral is technically not allowed by policy, but that policy does not extend to requirements about how providers have to fill out specific claims forms. So the underlying data includes lots of cases where the same provider is included in both the referring provider NPI field and the performing provider NPI field. The vast majority of these providers are not actually doing anything shady, but are just honoring CMS requirements about how to use the various claims fields. More importantly, when a self-referral is made, those procedure patterns already appear in the standard CMS outpatient utilization PUF file. For those reasons, and to generally avoid drama, we have excluded self-referred procedures from this PUF. We might change how we address this in the future, but this is the simplest way to keep this data release clean.
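To make the shape of the data and the self-referral exclusion concrete, here is a minimal Python sketch. It is an illustration only, not the process used to build MrPUP: the row layout (referring NPI, performing NPI, HCPCS code), the placeholder NPIs and the sample codes are all assumptions, and real releases also involve steps such as patient privacy thresholds that are not modeled here.

```python
from collections import Counter

# Hypothetical claim rows: (referring_npi, performing_npi, hcpcs_code).
# The layout and the values are illustrative placeholders, not the real file.
claims = [
    ("1111111111", "2222222222", "80053"),
    ("1111111111", "2222222222", "80053"),
    ("1111111111", "3333333333", "85025"),
    ("2222222222", "2222222222", "71020"),  # same NPI in both fields
    ("", "3333333333", "85025"),            # referring field left blank
]


def referring_utilization(rows):
    """Count procedures per (referring_npi, hcpcs_code), skipping self-referrals."""
    counts = Counter()
    for referring_npi, performing_npi, hcpcs_code in rows:
        if not referring_npi:
            continue  # no referring provider to attribute the procedure to
        if referring_npi == performing_npi:
            continue  # apparent self-referral, excluded from the output
        counts[(referring_npi, hcpcs_code)] += 1
    return counts


utilization = referring_utilization(claims)
for (npi, code), n in sorted(utilization.items()):
    print(npi, code, n)
```

The result is keyed by referring NPI and procedure code, which mirrors the performing-provider files but from the ordering side.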

Hopefully the release of the MrPUP data will draw attention to the requirements that CMS makes regarding this specific billing field, and future policies will ensure that the data becomes more reliable over time. Or not, one never can tell about these things.

Data Licensing

Although this is open data, it is not costless. It takes money for us to work on DocGraph and as a result we are charging a nominal fee for access to the data. If you are a student, researcher, academic or hacker, you probably want to purchase the Open Source Eventually (OSE) version of the data. This version of the data is much cheaper (think textbook) and in one year will become a Creative Commons licensed data file. In the meantime, any work you do with or on the file must be released under a Creative Commons license or some other open source license. This version does not allow you to share the data in any way.

If you would like to use the data in your product or service or otherwise leverage this data, you can purchase a commercial-friendly license for it. This costs a little more, but it is still hundreds of thousands of dollars less than it would cost to create the data set yourself. We appreciate those who choose to purchase this license, because this is what allows us to continue our work at DocGraph.

How to get the data for free for Datapalooza attendees:

In order to get the data for free (you will be getting the OSE version), you must @ mention @DocGraph in a tweet that shows you pictured with something fun that clearly demonstrates that you are in attendance at Datapalooza. In fact, if you are not at Datapalooza, and your tweet pretending that you are is clever enough, we might just decide to give you a free copy anyway. After that, go ahead and apply for the free data at the MrPUP download page. Once you have tweeted at us, follow @DocGraph so that we can DM you a link to the download file!


DocGraph Teaming data update

CMS has released an updated version of the DocGraph teaming data set* that was retracted on October 5, 2015. The DocGraph Teaming data set documents how healthcare providers in the U.S. work together. We believe this release corrects the critical issues we identified prior to the retraction. Access to the raw data can be found through our Linea data portal, and higher-level support and software services can be found from our sister company, CareSet Systems.

The improvements to the data set include:

  • Updated documentation which clearly states the date ranges found in each downloadable file. (The previous version of the data was retracted because the date ranges were mislabeled);
  • Very recent data. This data release has 2015 data updated through Oct 1, 2015;
  • Use of a consistent algorithm for all years from 2009-2015, which makes year-over-year analysis possible. The CareSet Systems blog will have several articles coming out soon about our year-over-year analysis of these data sets.

We expect to work with CMS to validate the data set in the coming months. We know that this data is a dramatic improvement on what was previously available, but we have not had the opportunity to review the data creation methods and validate that they followed our algorithm specification. Until that happens, DocGraph cannot vouch for the data. We can only say that CMS is “vouching” that the data is fixed by releasing it, and that most of the issues with the previous data sets appear to have been corrected.

As always, DocGraph would like to thank CMS and HHS for their continued commitment to openness.

There are still a few documentation issues with this data release, and we will be coordinating with CMS on an ongoing basis to correct them. Until then, we encourage the DocGraph community to keep the following in mind:

  • CMS has not explicitly documented the algorithm they used to create the data set. This algorithm has changed and is not 1-to-1 comparable with the data sets previously released. They appear to have made the modifications to the algorithm that we suggested, but we have no way of verifying that for now.
  • CMS continues to refer to the data as a “referral” data set, despite it being a “shared patient in time” data set that includes “referrals” as one subset. While this data does include traditional referrals as a subset, this is not strictly referral data. There are fields used in medical claims to document referrals and this data set was not generated using those fields.
  • CMS continues to label the file as “physician” despite it covering all Medicare provider types (except pharmacies, due to the exclusion of Part D data). Being a physician is not a prerequisite for being included in these data sets; any provider who bills Medicare enough to meet the patient privacy threshold will be included in the data set.
  • There are coding problems in the datasets. Specifically, there are many “impossible” NPIs (National Provider Identifiers). For instance, all of these NPIs are returned from a query of the 2014 data:

npi, npi_count, problem
6073299,2,"This NPI is not 10 digits, it has 7: 6073299"
16073299,2,"This NPI is not 10 digits, it has 8: 16073299"
135632034,8,"This NPI is not 10 digits, it has 9: 135632034"
162909399,1,"This NPI is not 10 digits, it has 9: 162909399"
162915969,1,"This NPI is not 10 digits, it has 9: 162915969"
174031524,1,"This NPI is not 10 digits, it has 9: 174031524"
999999992,1,"This NPI is not 10 digits, it has 9: 999999992"
1063828204,1,"This NPI does not pass luhn: 1063828204"
1194809840,3,"This NPI does not pass luhn: 1194809840"
1245396655,1,"This NPI does not pass luhn: 1245396655"
1619944960,1,"This NPI does not pass luhn: 1619944960"
1740228645,3,"This NPI does not pass luhn: 1740228645"
1750458455,5,"This NPI does not pass luhn: 1750458455"
9999999991,502,"This NPI does not pass luhn: 9999999991"
9999999992,11634,"This NPI does not pass luhn: 9999999992"
9999999994,116,"This NPI does not pass luhn: 9999999994"
9999999996,2975,"This NPI does not pass luhn: 9999999996"

Based on previous investigations, we know that the contractors who generated the files are faithfully returning what is listed in the NPI field. The underlying problem is with the actual Medicare claims database. We suspect that these are the last vestigial organs of pre-NPI billing systems, but we cannot be sure. Happily these strange numbers are far less common in the 2015 data set. Perhaps CMS is succeeding in squashing the non-NPI coded transactions once and for all.

Here is a NPI validity report for the 2015 data:

npi, npi_count, problem
9999999991,103,"This NPI does not pass luhn: 9999999991"
9999999992,1569,"This NPI does not pass luhn: 9999999992"

Much better!
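For readers who want to run a similar check on their own copy of the data, the two problems flagged in the validity reports above (wrong length, failed Luhn check) can be reproduced with a few lines of Python. This is a minimal sketch, not the query behind those reports: the function names are made up for illustration, and the messages merely imitate the report wording. It relies on the NPI standard, in which the Luhn check digit is computed over the 10-digit NPI with the prefix 80840 prepended.

```python
NPI_PREFIX = "80840"  # prepended to the NPI before applying the Luhn check


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit, counting from the right."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def npi_problem(npi: str):
    """Return a description of what is wrong with an NPI, or None if it looks valid."""
    if not (npi.isascii() and npi.isdigit()) or len(npi) != 10:
        return f"This NPI is not 10 digits, it has {len(npi)}: {npi}"
    if not luhn_valid(NPI_PREFIX + npi):
        return f"This NPI does not pass luhn: {npi}"
    return None


for candidate in ["1234567893", "6073299", "1063828204", "9999999991"]:
    print(candidate, npi_problem(candidate))
```

Passing the Luhn check only means a number is well formed; it does not mean the NPI has actually been assigned to a provider.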

For real-time discussion about these data sets, join the DocGraph Google group.

For a more thorough exploration of this data, sign up for CareSet news.

Enjoy… and watch this space! We will be opening lots more data in 2016 and beyond.


Fred Trotter

Co-founder, DocGraph and CareSet Systems
* DocGraph teaming data shows how healthcare providers who bill Medicare cooperate to deliver care to their patients. Essentially, the teaming dataset documents Medicare providers who share patients in a given year. The result of this is a data structure that data scientists call a weighted directed graph of relationships. To be more specific, the method used to generate the graph converts the bipartite (two types of nodes, patients and providers) graph structure into a graph with just one type of node, a graph showing relationships between providers. In layman’s terms, the DocGraph teaming data set is a massive map of the healthcare system in the United States. It shows referrals, ordering patterns and many other types of healthcare provider collaborations. This data set exists as the result of a FOIA request made by DocGraph. Since that time, DocGraph has continued to collaborate with CMS to ensure that the data is updated and reliable.
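As a rough illustration of that conversion, the Python sketch below projects a toy patient-provider visit list onto a weighted directed provider graph. It is a sketch under stated assumptions, not the algorithm used to produce the released files: the record layout, the 30-day window, and the rule that an edge points from the earlier provider to the later one are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical visit records: (patient_id, provider_id, visit_date).
visits = [
    ("p1", "A", date(2014, 1, 5)),
    ("p1", "B", date(2014, 1, 20)),
    ("p2", "A", date(2014, 3, 1)),
    ("p2", "B", date(2014, 3, 10)),
    ("p2", "C", date(2014, 9, 1)),
]


def project_to_provider_graph(records, window_days=30):
    """Collapse patient-provider visits into provider-to-provider edges.

    An edge (X, Y) exists when the same patient saw X and then Y within
    the window; its weight is the number of distinct shared patients.
    """
    by_patient = defaultdict(list)
    for patient, provider, when in records:
        by_patient[patient].append((when, provider))

    window = timedelta(days=window_days)
    shared = defaultdict(set)  # (from_provider, to_provider) -> patient ids
    for patient, seen in by_patient.items():
        seen.sort()  # chronological order
        for i, (first_date, first_provider) in enumerate(seen):
            for later_date, later_provider in seen[i + 1:]:
                if later_date - first_date > window:
                    break  # later visits are even further away
                if later_provider != first_provider:
                    shared[(first_provider, later_provider)].add(patient)
    return {edge: len(patients) for edge, patients in shared.items()}


graph = project_to_provider_graph(visits)
print(graph)
```

Patients disappear from the output entirely; only provider pairs and their shared-patient counts remain, which is why the result can include referrals without being limited to them.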

Batea Chrome Extension Announcement

Today we announced the release of Batea, a Chrome extension designed to help editors improve Wikipedia pages. 

Batea allows participants to donate specific browsing data and provide comments on Wikipedia pages. The extension is part of the Batea study, an IRB-approved data donation process.

We invite the DocGraph community to check it out, and if you are interested, to install it!  As always, let us know what you think by emailing us at
Or you can go straight to the extension in the Chrome store here:

Qualifications for 2014 CEHRT Flexibility Rule

In 2014, ONC passed a rule that allowed providers to receive funding for EHR systems that did not meet the standards that were current at the time.

The funding program for Electronic Health Records (EHRs) in the United States is called the Meaningful Use program, because healthcare providers, mostly doctors and hospitals, are required to demonstrate that they have actually used EHR technology to benefit patients in a meaningful way.

There are three stages for funding in the EHR incentive program, and in 2014 many healthcare providers were supposed to upgrade from the first stage to the second stage of that funding program. Each stage requires hospitals and doctors to leverage more and more complex EHR software. EHR software that was certified to the 2011 standards was meant to be used by hospitals and providers to achieve Meaningful Use Stage 1, in 2011, 2012 and 2013. After that, doctors and hospitals were supposed to adopt EHR software that was certified to the standards released in 2014.

However, many hospitals and doctors protested that the EHR vendor community was not providing access to EHR software certified to the 2014 standards in a timely manner. This created a situation where providers “through no fault of their own” were unable to attest to Stage 2 of the Meaningful Use standards.

In order to ensure that healthcare providers were able to access Meaningful Use funds, ONC released a rule that allowed providers to receive Stage 2 Meaningful Use funds while using Stage 1 EHR software. However, Meaningful Use participants were only able to claim access to these grandfathered funds if they claimed that they had been unable to fully adopt current certified EHR technology (or CEHRT) because some vendor did not offer, or improperly supported, the rollout of the new 2014 certified EHR software. This grandfathering of funds is generally called the “flexibility rule” because it allows hospitals and providers flexibility in which versions of CEHRT they attest under.

This was a very controversial policy decision at the time, and it is only with newly accessible data that we are now able to analyze the impacts of this policy on the EHR marketplace. We expect to be releasing more analysis on this issue, and before we do that, we thought it would be helpful for us to release a short “required reading” list for those who are interested in this policy decision. This is in addition to our already released comparisons between the Meaningful Use overview for hospitals and Meaningful Use overview for providers.

Specifically we wanted to detail the rules for when this grandfathering was allowed, and when it was not. If you want to track the source policy documents for the flexibility rule you can find them here:

The final rule has specific responses to comments that clarify what counts as valid reasons for exercising the flexibility rule vs. those which do not.

The specific phrase that the final rule puts forward is:

Providers who choose this option must attest that they are unable to fully implement 2014 Edition CEHRT because of issues related to 2014 Edition CEHRT availability delays when they attest to the meaningful use objectives and measures.

and later it emphasizes:

…we stress the delay in 2014 Edition CEHRT availability must be attributable to the issues related to software development, certification, implementation, testing, or release of the product by the EHR vendor which affected 2014 CEHRT availability, which then results in the inability for a provider to fully implement 2014 Edition CEHRT.

Specific reasons that are listed as acceptable under these criteria include:

  • Waiting for the availability of certification of 2014 CEHRT.
  • Waiting for installation of 2014 CEHRT.
  • Waiting for updates or patches to 2014 CEHRT.
  • Patient safety issues related to the adoption of flawed 2014 CEHRT.
  • The attester has multiple software components, at potentially multiple sites, and some of them are incompatible with a 2014 edition CEHRT.
  • A site was unable to reach an interoperability goal because their referral partners did not have 2014 CEHRT.

The last one is especially interesting, because it allows a hospital or doctor to use the flexibility rule because some other hospital or doctor was not using the right version of CEHRT.

The response to comments also lists specific reasons that would not qualify a hospital or a practice to use the flexibility rule:

  • Financial problems (i.e. not paying for 2014 CEHRT or not paying to implement it).
  • A provider waited too long to attest.
  • Failing to meet a threshold for a 2014 attestation measure.
  • Staff attrition.

As we perform our analysis on this data, it would be invaluable to know precisely why a specific provider or hospital represented that they were exercising one of the options under the flexibility rule. It would be lovely if we knew, for instance, which sites (a) had no CEHRT to install at all, (b) had CEHRT available that was flawed, or (c) were unable to exchange data because none of their peer sites were using CEHRT, etc. This information is apparently unavailable. During the attestation process, there was no requirement to categorize why the flexibility option was chosen, but only to state that it was due to “the unavailability of 2014 CEHRT”.


Providers are required to maintain documentation of their reasons for choosing the flexibility option, but this checkbox was all of the data that was gathered regarding the flexibility option. This documentation requirement was emphasized for the interoperability threshold issues:

However, the referring provider must retain documentation clearly demonstrating that they were unable to meet the 10 percent threshold for the measure to provide an electronic summary of care document for a transition or referral for the reasons previously stated.

Currently there are two different auditing programs that could theoretically check up on the reasons for choosing the flexibility option. The Office of the Inspector General audits the Meaningful Use program; this is distinct from the Meaningful Use audit program run by CMS itself, which is subcontracted to Figliozzi and Company. CMS also has guidance on which documentation hospitals and providers should expect to have available for a Meaningful Use audit.

If you are interested in even more detail than this, we found this article from Hostetler Management Group to be especially helpful as we were researching the qualifications for the flexibility rule, because they specifically quote sections from the final version of the flexibility rule to show what is and is not allowed. We also found that McDermott Will & Emery had helpful overviews of the OIG auditing program and the CMS auditing program.