DocGraph Teaming data update – The DocGraph Journal

CMS has released an updated version of the DocGraph teaming data set* that was retracted on October 5, 2015. The DocGraph teaming data set documents how healthcare providers in the U.S. work together. We believe this release corrects the critical issues we identified prior to the retraction. The raw data is available through our Linea data portal, and higher-level support and software services are available from our sister company, CareSet Systems.

The improvements to the data set include:

Updated documentation which clearly states the date ranges found in each downloadable file. (The previous version of the data was retracted because the date ranges were mislabeled);
Very recent data. This data release has 2015 data updated through Oct 1, 2015;
Use of a consistent algorithm for all years from 2009-2015, which makes year-over-year analysis possible. The CareSet Systems blog will have several articles coming out soon about our year-over-year analysis of these data sets.

We expect to work with CMS to validate the data set in the coming months. We know that this data is a dramatic improvement on what was previously available, but we have not had the opportunity to review the data creation methods and validate that they were carried out according to our algorithm specification. Until that happens, DocGraph cannot vouch for the data. We can only say that CMS is “vouching” that the data is fixed by releasing it, and that most of the issues with the previous data sets appear to have been corrected.

As always, DocGraph would like to thank CMS and HHS for their continued commitment to openness.

There are still a few documentation issues with this data release, and we will be coordinating with CMS on an ongoing basis to correct them. Until then, we encourage the DocGraph community to keep the following in mind:

CMS has not explicitly documented the algorithm they used to create the data set. The algorithm has changed, so this data is not 1-to-1 comparable with the data sets previously released. CMS appears to have made the modifications to the algorithm that we suggested, but we have no way of verifying that for now.
CMS continues to refer to the data as a “referral” data set, despite it being a “shared patient in time” data set. Traditional referrals are only one subset of what it contains, so this is not strictly referral data. There are fields used in medical claims to document referrals, and this data set was not generated using those fields.
CMS continues to label the file as “physician” despite it covering all Medicare provider types (except pharmacies, due to the exclusion of Part D data). Being a physician is not a prerequisite for being included in these data sets. Any provider who bills Medicare enough to meet the patient privacy threshold will be included.
There are coding problems in the datasets. Specifically, there are many “impossible” NPIs (National Provider Identifiers). For instance, all of these NPIs are returned by our validity query against the 2014 data:

npi, npi_count, problem
6073299,2,"This NPI is not 10 digits, it has 7: 6073299"
16073299,2,"This NPI is not 10 digits, it has 8: 16073299"
135632034,8,"This NPI is not 10 digits, it has 9: 135632034"
162909399,1,"This NPI is not 10 digits, it has 9: 162909399"
162915969,1,"This NPI is not 10 digits, it has 9: 162915969"
174031524,1,"This NPI is not 10 digits, it has 9: 174031524"
999999992,1,"This NPI is not 10 digits, it has 9: 999999992"
1063828204,1,"This NPI is does not pass luhn: 1063828204"
1194809840,3,"This NPI is does not pass luhn: 1194809840"
1245396655,1,"This NPI is does not pass luhn: 1245396655"
1619944960,1,"This NPI is does not pass luhn: 1619944960"
1740228645,3,"This NPI is does not pass luhn: 1740228645"
1750458455,5,"This NPI is does not pass luhn: 1750458455"
9999999991,502,"This NPI is does not pass luhn: 9999999991"
9999999992,11634,"This NPI is does not pass luhn: 9999999992"
9999999994,116,"This NPI is does not pass luhn: 9999999994"
9999999996,2975,"This NPI is does not pass luhn: 9999999996"
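
For readers who want to run a similar check on their own copy of the data, the two tests named in the problem column (a length test and a Luhn test) can be sketched in Python. This is a minimal sketch and not the query that produced the report above. The function names `passes_luhn` and `npi_problem` are hypothetical, and the sketch assumes the standard NPI check-digit rule, in which the Luhn checksum is computed over the NPI prefixed with 80840.

```python
from typing import Optional

# Assumption: the standard NPI check-digit scheme, which applies Luhn
# to the 10-digit NPI with the constant prefix 80840 in front of it.
NPI_PREFIX = "80840"


def passes_luhn(digits: str) -> bool:
    """Return True if the digit string satisfies the Luhn checksum."""
    total = 0
    # Walk right to left, doubling every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as summing the two digits of the product
        total += value
    return total % 10 == 0


def npi_problem(npi: str) -> Optional[str]:
    """Hypothetical helper: describe why an NPI is impossible, else None."""
    if len(npi) != 10 or not npi.isdigit():
        return f"This NPI is not 10 digits, it has {len(npi)}: {npi}"
    if not passes_luhn(NPI_PREFIX + npi):
        return f"This NPI does not pass luhn: {npi}"
    return None


# Two values from the report above, plus 1234567893, an illustrative
# value (not from the report) that satisfies the checksum.
for candidate in ["6073299", "9999999992", "1234567893"]:
    print(candidate, "->", npi_problem(candidate))
```

Passing the checksum only means the number is well formed. It does not show that the NPI was ever issued to a provider.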

Based on previous investigations, we know that the contractors who generated the files are faithfully returning what is listed in the NPI field. The underlying problem is with the actual Medicare claims database. We suspect that these are the last vestigial organs of pre-NPI billing systems, but we cannot be sure. Happily, these strange numbers are far less common in the 2015 data set. Perhaps CMS is succeeding in squashing the non-NPI coded transactions once and for all.

Here is an NPI validity report for the 2015 data:

npi, npi_count, problem
9999999991,103,"This NPI is does not pass luhn: 9999999991"
9999999992,1569,"This NPI is does not pass luhn: 9999999992"

Much better!

For real-time discussion about these data sets, join the DocGraph Google group.

For a more thorough exploration of this data, sign up for CareSet news.

Enjoy… and watch this space! We will be opening lots more data in 2016 and beyond.

Fred Trotter

Co-founder, DocGraph and CareSet Systems
* DocGraph teaming data shows how healthcare providers who bill Medicare cooperate to deliver care to their patients. Essentially, the teaming data set documents Medicare providers who share patients in a given year. The result is a data structure that data scientists call a weighted directed graph of relationships. To be more specific, the method used to generate the graph converts the bipartite graph structure (two types of nodes, patients and providers) into a graph with just one type of node, showing relationships between providers. In layman’s terms, the DocGraph teaming data set is a massive map of the healthcare system in the United States. It shows referrals, ordering patterns and many other types of healthcare provider collaborations. This data set exists as the result of a FOIA request made by DocGraph. Since that time, DocGraph has continued to collaborate with CMS to ensure that the data is updated and reliable.
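
The bipartite-to-provider conversion described in this footnote can be illustrated with a toy example. This is only a sketch of the general idea and is not the algorithm CMS used, which has not been documented. The function name `project_to_provider_graph`, the `(patient, provider, date)` input layout and the 30-day `window_days` default are all assumptions made for illustration. An edge from provider A to provider B is weighted by the number of distinct patients who saw A and then saw B within the window.

```python
from collections import defaultdict
from datetime import date, timedelta


def project_to_provider_graph(visits, window_days=30):
    """Project patient-provider visits onto a provider-provider graph.

    visits: iterable of (patient_id, provider_npi, visit_date) tuples.
    Returns {(from_npi, to_npi): number_of_shared_patients}.
    The window length is an illustrative assumption.
    """
    # Group each patient's visits together (the bipartite side of the data).
    by_patient = defaultdict(list)
    for patient, provider, day in visits:
        by_patient[patient].append((day, provider))

    window = timedelta(days=window_days)
    shared = defaultdict(set)  # (from_npi, to_npi) -> set of patient ids
    for patient, events in by_patient.items():
        for day_a, prov_a in events:
            for day_b, prov_b in events:
                gap = day_b - day_a
                # Directed edge: B saw the patient on or after A's visit,
                # within the window. Patients are counted once per edge.
                if prov_a != prov_b and timedelta(0) <= gap <= window:
                    shared[(prov_a, prov_b)].add(patient)

    return {edge: len(patients) for edge, patients in shared.items()}


# Made-up visits for two patients and three providers.
toy_visits = [
    ("p1", "A", date(2015, 1, 1)),
    ("p1", "B", date(2015, 1, 10)),
    ("p2", "A", date(2015, 2, 1)),
    ("p2", "B", date(2015, 2, 5)),
    ("p2", "C", date(2015, 6, 1)),
]
print(project_to_provider_graph(toy_visits))
```

In this toy data, providers A and B share two patients in the A-then-B direction, while provider C gets no edges because its only visit falls outside the window. The patient nodes disappear from the output, which is what makes a graph of this shape publishable once a privacy threshold is applied.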