Notice: Redaction of Teaming Data

Recently, we discovered that the DocGraph Teaming (Shared-Patient-in-Time) data set (aka referral data), which is now available from CMS as a standard data set here, has significantly different contents from what we had been assuming. Specifically, it appears that data sets labeled 2012 or later cover periods quite different from those labeled on the website.

We discovered this problem as part of our ongoing collaboration with CMS and ProPublica to improve the documentation around the Teaming data set. CMS and DocGraph both became fully aware of the extent of this problem last week, and CMS is continuing to collaborate with us both to update the labels and to create new versions of the data set that better meet industry expectations.

Here is a screengrab of the notice from CMS:


DocGraph originally requested that the data cover claims in a single given year. We think that the data sets for 2009, 2010 and 2011 are in fact one-year periods in conformance with our original request (although we are not completely sure of anything at this point). Based on when CMS released later data sets, and the dates in the file names, we had previously assumed that subsequent data released had coverage periods of 18 months. Specifically, we had assumed that data files starting with “2012-2013” had a period of 18 months.

In fact, based on our reading of recent communications from the contractor who actually created the data set, it appears that some later versions of the data have a coverage period of four years, rather than 18 months. At least some later data sets also appear to have shorter periods. The contractor sent us this information on Sept 24, 2015, and we first carefully reviewed the document on Oct 1, 2015. We had a call with CMS confirming the problem today, Oct 5, 2015. With that confirmation, both CMS and DocGraph are confident that the data CMS is currently releasing is substantially mislabeled.

CMS is in the process of adding a notice to the download page, and we are working to fix our documentation and inform the community of companies that leverage this data in various tools.

There are other minor documentation-related problems with the data as well, but the issue with the periods substantially changes how the data should be interpreted, and none of the other problems appear to do that (an assertion that we will try to nail down in the coming weeks).

What data sets are impacted?

This alert covers the shared-patient-in-time data set which is the first data set that DocGraph originally released. In fact, many people continue to call this data set “DocGraph”, which is not unreasonable given that it was the only one we offered at the founding of The DocGraph Journal.

This data set shows relationships between providers, coded by their National Provider Identifier (NPI), based on shared patients. The data set takes the form of a directed graph of relationships, coded by NPI, and weighted by the number of patients, as well as the frequency of patient interactions. The data is created from the Medicare claims data set. Whenever two providers bill Medicare for the same patient at roughly the same time, they get a +1 on the weights between them. To protect patient privacy and to reduce noise, no provider-to-provider relationships that included fewer than 11 patients are released. This is the most basic description of the data; for further detail, please consult the documentation we originally released on O’Reilly Radar.
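
To make that description concrete, here is a minimal sketch of how such a graph could be built. It is not the contractor's actual process, which is exactly what we are still trying to pin down. The claim layout (NPI, patient ID, service date), the function name, the 30-day default window, and the rule that an edge points from the provider who billed first to the provider who billed later are all assumptions made for illustration. Only the 11-patient threshold comes from the description above.

```python
from collections import defaultdict
from datetime import date

MIN_PATIENTS = 11  # relationships with fewer patients are not released


def build_teaming_edges(claims, window_days=30, min_patients=MIN_PATIENTS):
    """Build weighted, directed provider-to-provider edges.

    claims: iterable of (npi, patient_id, service_date) tuples.
            This layout is hypothetical, chosen for illustration.
    Returns {(from_npi, to_npi): {"pair_count": int, "patient_count": int}}.
    """
    by_patient = defaultdict(list)
    for npi, patient_id, service_date in claims:
        by_patient[patient_id].append((service_date, npi))

    pair_count = defaultdict(int)  # edge -> number of paired claims
    patients = defaultdict(set)    # edge -> distinct shared patients

    for patient_id, visits in by_patient.items():
        visits.sort()  # earliest claim first
        for i, (first_date, from_npi) in enumerate(visits):
            for later_date, to_npi in visits[i + 1:]:
                if (later_date - first_date).days > window_days:
                    break  # visits are sorted, so nothing later can match
                if to_npi == from_npi:
                    continue  # a provider does not team with itself
                pair_count[(from_npi, to_npi)] += 1  # the "+1" on the weight
                patients[(from_npi, to_npi)].add(patient_id)

    # Suppress small relationships for privacy and noise reduction.
    return {
        edge: {"pair_count": pair_count[edge], "patient_count": len(shared)}
        for edge, shared in patients.items()
        if len(shared) >= min_patients
    }
```

Note that the sketch takes whatever claims it is handed. The coverage period is decided entirely by which claims are fed in, which is why a mislabeled period changes the meaning of every weight in the output.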

At present, the following list represents our best guess as to the status of the mislabeled data sets:

  • Physician Referral Patterns – 2009 – one year label, period is one year
  • Physician Referral Patterns – 2010 – one year label, period is one year
  • Physician Referral Patterns – 2011 – one year label, period is one year
  • Physician Referral Patterns – 2012 – 2013 30 day interval – two year label, period is four years
  • Physician Referral Patterns – 2012 – 2013 60 day interval – two year label, period is four years
  • Physician Referral Patterns – 2012 – 2013 90 day interval – two year label, period is four years
  • Physician Referral Patterns – 2012 – 2013 180 day interval – two year label, period is four years
  • Physician Referral Patterns – 2012 – 2013 365 day interval – two year label, period is four years
  • Physician Referral Patterns – 2013 – 2014 30 day interval – two year label, period is one year
  • Physician Referral Patterns – 2013 – 2014 60 day interval – two year label, period is one year
  • Physician Referral Patterns – 2013 – 2014 90 day interval – two year label, period is one year
  • Physician Referral Patterns – 2013 – 2014 180 day interval – two year label, period is one year
  • Physician Referral Patterns – 2013 – 2014 365 day interval – two year label, period is four years
  • Physician Referral Patterns – 2014 – 2015 30 day interval – two year label, period is one quarter
  • Physician Referral Patterns – 2014 – 2015 60 day interval – two year label, period is one quarter
  • Physician Referral Patterns – 2014 – 2015 90 day interval – two year label, period is one quarter
  • Physician Referral Patterns – 2014 – 2015 180 day interval – two year label, period is one quarter
  • Physician Referral Patterns – 2014 – 2015 365 day interval – two year label, period is one quarter

At this time, this information should not be taken as authoritative. However, once new information is made available on the CMS download page itself, you can trust that it is correct. It should also be noted that at some point the 2013-2014 30, 60, 90, 180 day window files were changed to have what appears to be a one-year window, but the 2013-2014 365 day window was not changed.
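
If you need to audit which of your own analyses touched which file, the status list can be written down as a small lookup table. The sketch below simply transcribes our best guesses from this notice; the dictionary layout, key format, and function names are our own invention for illustration, and the values carry the same caveat as the list itself.

```python
# Best-guess coverage periods, transcribed from the list in this notice.
# NOT authoritative: the CMS download page supersedes it once updated.
INTERVALS = (30, 60, 90, 180, 365)

# Keys are (year label, interval in days); single-year files have no interval.
BELIEVED_PERIOD = {
    ("2009", None): "one year",
    ("2010", None): "one year",
    ("2011", None): "one year",
}
for days in INTERVALS:
    BELIEVED_PERIOD[("2012-2013", days)] = "four years"
    # Only the 365 day file in this group still appears to cover four years.
    BELIEVED_PERIOD[("2013-2014", days)] = "four years" if days == 365 else "one year"
    BELIEVED_PERIOD[("2014-2015", days)] = "one quarter"


def labeled_period(label):
    """Period implied by the label: two years for a range, one year otherwise."""
    return "two years" if "-" in label else "one year"


def is_mislabeled(label, interval_days=None):
    """True when our best-guess period disagrees with the label."""
    return BELIEVED_PERIOD[(label, interval_days)] != labeled_period(label)


# List every file whose label we currently believe to be wrong.
for (label, days), period in BELIEVED_PERIOD.items():
    if is_mislabeled(label, days):
        print(f"{label} {days} day interval: labeled {labeled_period(label)}, believed {period}")
```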

Potential impacts

Many people and companies have consumed “DocGraph” teaming data, either from DocGraph directly or from CMS directly once the data became a standard data set.  Our sister organization CareSet offers services around this data, and CareSet customers have leveraged this data to sever at least some business relationships with specific physicians. We believe that some of these decisions might have been errors, and CareSet is reaching out to all of its customers to let them know about these issues.  Ashish Patel, with CareSet, says, “While the volumes are problematic, the structure is valid.”  CareSet is not the only commercial service around this data set, and it is entirely possible that there were many detrimental business decisions based on wrong assumptions about this data. It is also possible that organizational decisions to move away from certain markets might have been made as the result of this mislabeled data. As a result, it is possible that some of these decisions could have resulted in some patients having reduced access to care.

When looking at specific data files, we assumed that persons or organizations who shared an “edge” had to have worked together in the time period labeled by the data file. In fact, the “edge” could still exist, even though the provider or organization no longer had a business relationship, because data was being included for previous years. That means that it might appear that doctors/hospitals/etc were being “disloyal” by not changing their business relationships in a way that was expected, when they took on new roles.
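
A toy example shows how a stale edge can arise. All of the dates below are hypothetical, including the period boundaries: we do not yet know the exact start and end dates of any of the mislabeled files. The function name is ours, and the sketch ignores the 11-patient threshold for simplicity.

```python
from datetime import date


def edge_in_file(shared_claim_dates, period_start, period_end):
    """True if any shared-patient claim falls inside the file's coverage period."""
    return any(period_start <= d <= period_end for d in shared_claim_dates)


# Hypothetical: two providers shared patients only during 2010,
# then ended their business relationship.
shared = [date(2010, 3, 1), date(2010, 9, 15)]

# What a reader would assume from a two-year "2012-2013" label (illustrative).
assumed = (date(2012, 1, 1), date(2013, 12, 31))
# A four-year period ending on the same day (illustrative, not confirmed).
actual = (date(2010, 1, 1), date(2013, 12, 31))

print(edge_in_file(shared, *assumed))  # False: no edge expected from the label
print(edge_in_file(shared, *actual))   # True: the old relationship still shows up
```

Under these assumptions, the edge appears in the file even though the two providers stopped working together well before the labeled period began.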

There could be other misinterpretations that could cause decisions that result in harm, but we know for certain that this error has occurred, and this is enough for us to feel that getting this information out to our competitors and collaborators is critical. Again, we are not sure of the amount or severity of problems that come with misinterpreting this data, but this potential harm alone justifies this notice.

Based on our estimates, we believe that doctors being fired or hospitals closing based on misinterpretations of this data should be relatively rare. However, this is based on our understanding of how other people use the teaming/referral data to offer services, and how the market reacts to these services, both of which include substantial assumptions. We cannot be sure what damage has been done as a result of the mislabeling of this data set.

What caused the problem

There are several factors which have contributed to this problem. First, our FOIA occurred in two communication exchanges. The first exchange created the early one-year, three-field version of the teaming data set, and the second exchange created the multi-year, five-field version of the data set. At some point during the second exchange, I had a miscommunication with CMS about what period of time I wanted the claims to cover, which resulted in a four-year analysis being run, but mislabeled as a two-year analysis.

As a result of this miscommunication, the CMS website does not offer accurate documentation of what the data actually means for any of the multi-year, five-field versions of the data set. The available documentation includes duplications of the contents of my FOIA request, and other communication that I have had with CMS, which are not clear about what the periods should be in later versions.

The documentation that we had released about the data set is far more comprehensive and discusses clearly what we thought the data set was, as well as the caveats and implications that we felt applied to the data sets. If the communications that we have just received are correct, both the extensive documentation that we released and the sparse documentation that CMS released are substantially incorrect regarding the periods covered by the data.  It is also clear that this is not a problem within CMS alone. This problem is due in part to the way in which this data was generated and released inside CMS, and in part because of assumptions that we made with the data set.

CMS is responsible for this problem, and I am also responsible. I should have verified the specific process used to generate this data earlier than I did. CMS has been very willing to work with me on various open data initiatives, including clarifying the labeling of this original data set. I chose to make this particular project a lower priority, which is why we are only discovering this problem now, years after these data sets have been released. This improper data labeling is the result of a systematic problem, and I am very much a part of that system, and deserve my share of the blame. As someone who dedicates most of my time to thinking of ways to improve the healthcare system, the revelation that I may have damaged it instead is hard to face. As healthcare becomes more data-driven, bad data or bad interpretations can cause real harm to people.

I am responsible for any harm in this case and for that I am deeply sorry. Almost all of the benefit, or harm, that I end up contributing to the healthcare system will be indirect, typically 4, 5 or 6 times removed from the doctors, nurses and other healthcare providers who directly help patients. It is likely that I will never meet anyone who will be able to attribute a good or bad outcome directly to my work, but that does not mean that I am not indirectly contributing to a good or bad outcome. In this case, it looks like I could have contributed to at least a few bad outcomes, and because I do not know who I should apologize to directly, I am doing so in public instead. If you know that you have been personally harmed by a misinterpretation of this data set, feel free to contact me for a personal apology. You might also be able to give me new information to help me understand how this mistake might have caused harm to others, which may in turn help me to reduce it.

While there are likely several parties within or “near” CMS who are also partly responsible for this issue, no one person or organization is more responsible than I am personally. Further, this problem would never have been discovered, except that CMS generally, the CMS FOIA office, as well as the contractor who completed the coding work were all willing to openly collaborate on fixing the details of the labeling on this data. Multiple offices at CMS, as well as their contractors, have contributed to maintaining a culture of safety regarding this data release. I will not pretend that I or anyone else fully understands what a culture of safety means for large-scale open data releases, but CMS continues to take appropriate responsibility for this problem. Their actions are consistent with the commitments that CMS has made towards supporting a culture of safety among its providers.

Blame and accusation are destructive to ongoing patient safety efforts at any level. CMS deserves applause for being willing to expose their mistakes and fix them. The only appropriate way to construe responsibility for this problem is to believe that either it was nobody’s fault, or that it was everyone’s fault and that each person involved has to own up to how they contributed to the problem.

What we are going to do next

CMS, DocGraph and ProPublica are currently working to correct the labeling of the data and to provide improved versions of the data.

At this point we are still investigating the current data set, and defining what the subsequent versions will look like, so any further specificity is premature. As I discover details about the current data set, I will post them to the DocGraph mailing list, so if you want to follow our discovery process, please join us there.

Our collaboration may take some time, because as far as we can tell there is no one person who has a perfect understanding of what the data currently available for download actually is, which is a prerequisite for proper labeling.

I expect that our collaboration with CMS/ProPublica will result in improved documentation available in the next few weeks. I further expect that the collaboration will result in improved data releases within the next few months. Neither of these estimates should be regarded as a promise. However, we are certain that the current data plus labels cannot be fully trusted to be used for any of its intended purposes. It is likely that the previously available data set is still valuable and useful. Correct labeling could improve its usefulness, but we just have too many open questions to be completely sure at this point.

While there are other problems with the data set, we believe that this issue is the only thing that merits a full alert. In the event that we learn that there are other problems that we did not think of, we will either update this alert, create a new one, or for more minor news, update the community using our mailing list.


If you have questions regarding this matter, you can reach us through the DocGraph mailing list or you can tweet to us at @DocGraph. If you want to communicate about this privately, reach out to laura at docgraph dot org.

Thank you,

-Fred Trotter