DocGraph Releases First Cancer Dataset at Digital Pharma East

“The data is aggregated Medicare data starting in 2010 and covers six years’ worth of data,” says DocGraph founder Ashish Patel, who announced the dataset’s availability in Philadelphia at the 10th Annual Digital Pharma East conference.

Use of this dataset will be open to anyone, including scientists, oncologists, and digital health entrepreneurs. According to DocGraph data scientist Fred Trotter, “We’re interested in exploring important differences in the experience of cancer patients based on factors such as treatment pathway, geography, and types of physicians and providers.”

The Digital Pharma Series is part of the Life Science Digital Marketers’ Forum for learning the latest digital, mobile and social strategies that yield results. The series is held all over the world; this year’s Digital Pharma East conference hosted more than 800 pharma executives.

“Our Moonshot dataset can be used to understand the patient’s journey and guide new therapies and trials,” says Patel. “Releasing the cancer dataset at Digital Pharma East puts the resource in front of the right people to speed the oncology ecosystem towards cures.”

DocGraph pioneered the release of national provider referral data made available by the US government. DocGraph data releases have empowered researchers and entrepreneurs to create new data-backed healthcare solutions, and have spawned a growing community of problem solvers who have used the data to create new, innovative solutions.

The Cancer Moonshot datasets are available at no charge under the Open Source Eventually License and can be requested at

For a commercial-friendly license of this data, contact DocGraph’s sister company CareSet Systems at


DocGraph joins Claudia Williams, Niall Brennan and Jess Kahn in DC to Promote Transparency and Fairness in the Healthcare System

WASHINGTON, September 28, 2016 – Health data scientist Fred Trotter presented today at the White House Open Data Innovation Summit on national health data transparency. Trotter is the founder of DocGraph, which in 2012 released the US government’s first national provider referral pattern data, enabling researchers, journalists, and companies around the nation to provide data-backed healthcare solutions.

The White House Open Data Innovation Summit is an initiative that showcases the power of data in tackling the biggest challenges of our democracy, creating positive change, and strengthening the broader civic community.

“Our work at DocGraph is designed to bring transparency and ultimately fairness to the healthcare system,” says Trotter. “The Obama administration has been committed to the ideal of using open data to improve the healthcare system since the first year of his first term and indeed, many of the open data policy decisions made then are now bearing fruit.”

DocGraph data is used by a range of organizations, from small data science companies to big pharma. Commercialized use of this data can be found at sister company CareSet Systems, where Trotter serves as Chief Technology Officer.

“DocGraph and CareSet are able to partner with HHS generally and CMS specifically to create new data sets, especially around Medicare and Medicaid data sets,” he says. “Being invited to the Open Data Innovation Summit to demonstrate our work on open healthcare data is an acknowledgement that our work at DocGraph contributes to the public good, and that the public-private collaboration that is the basis of the CareSet business model is sustainable in the long term.”

On June 29, 2016, Fred Trotter joined Vice President Joe Biden and other national healthcare leaders for the invitation-only National Cancer Moonshot Summit. The Vice President announced that DocGraph would release an open cancer dataset in the fall of 2016 to help improve outcomes in treating cancer. The dataset will contain summarized information about almost a million Medicare cancer patients and more than 10 million specific claims events, providing the most accurate picture to date of how cancer is treated by Medicare.

Trotter presented along with Claudia Williams, Senior Advisor, Health Innovation and Technology at the White House Office of Science and Technology Policy; Niall Brennan, Chief Data Officer, Centers for Medicare & Medicaid Services; and Jessica Kahn, Director, Data and Systems Group (DSG), Center for Medicaid and CHIP Services (CMCS), Centers for Medicare & Medicaid Services.

About DocGraph
DocGraph creates, maintains, and improves open healthcare datasets and is a pioneer in the open health data movement. DocGraph has helped establish a growing community of data scientists, journalists, and clinical enterprises who use open data to understand and evolve the healthcare system.

About CareSet Systems
CareSet Systems is the nation’s first vendor with access to 100% of Medicare claims and enables the nation’s leading pharmaceutical companies to decode Medicare claims data to guide new drug launches.

DocGraph to release the most accurate data picture to date of how cancer is treated by Medicare

Health data scientist Fred Trotter joined Vice President Joe Biden and other national healthcare leaders today for the invitation-only National Cancer Moonshot Summit. The Vice President announced that Trotter’s company, DocGraph, will release an open cancer dataset this year. The new dataset contains summarized information about almost a million Medicare cancer patients and more than 10 million specific claims events, providing the most accurate picture to date of how cancer is treated by Medicare.

Use of this dataset will be open to anyone, including scientists, oncologists, and digital health entrepreneurs. DocGraph will work with analytics organization CareSet Systems to develop data challenges that engage the data science community in deriving insights from this data. According to Trotter, “we’re interested in exploring important differences in the experience of cancer patients based on factors such as treatment pathway, geography, and types of physicians and providers.”

CareSet CEO Laura Shapland said, “We are honored to support Vice President Biden’s efforts to end cancer as we know it. DocGraph will release the dataset in the fourth quarter, and it will illustrate how Medicare patients travel through the healthcare system in the years before and immediately after their cancer diagnoses, including data about their treatment providers, procedures, medications and survival.”

Trotter is the founder of DocGraph and pioneered the release of the first national provider referral data made available by the US government. DocGraph data releases have empowered researchers and entrepreneurs to create new data-backed healthcare solutions, and have spawned a growing community of problem solvers who have used the data to create new, innovative solutions. The DocGraph analysis is based on Medicare claims data made possible by the Obama Administration’s open data policies.


DocGraph Teaming data update

CMS has released an updated version of the DocGraph teaming data set* that was retracted on October 5, 2015. The DocGraph Teaming data set documents how healthcare providers in the U.S. work together. We believe this release corrects the critical issues we identified prior to the retraction. Access to the raw data can be found through our Linea data portal, and higher-level support and software services can be found from our sister company, CareSet Systems.

The improvements to the data set include:

  • Updated documentation which clearly states the date ranges found in each downloadable file. (The previous version of the data was retracted because the date ranges were mislabeled);
  • Very recent data. This data release has 2015 data updated through Oct 1, 2015;
  • Use of a consistent algorithm for all years from 2009-2015, which makes year-over-year analysis possible. The CareSet Systems blog will have several articles coming out soon about our year-over-year analysis of these data sets.

We expect to work with CMS to validate the data set in the coming months. We know that this data is a dramatic improvement on what was previously available, but we have not had the opportunity to review the data creation methods and confirm that the data was generated according to our algorithm specification. Until that happens, DocGraph cannot vouch for the data. We can only say that CMS is “vouching” that the data is fixed by releasing it, and that most of the issues with the previous data sets appear to have been corrected.

As always, DocGraph would like to thank CMS and HHS for their continued commitment to openness.

There are still a few documentation issues with this data release, and we will be coordinating with CMS on an ongoing basis to correct them. Until then, we encourage the DocGraph community to keep the following in mind:

  • CMS has not explicitly documented the algorithm they used to create the data set. This algorithm has changed and is not 1-to-1 comparable with the data sets previously released. They appear to have made the modifications to the algorithm that we suggested, but we have no way of verifying that for now.
  • CMS continues to refer to the data as a “referral” data set, despite it being a “shared patient in time” data set that includes “referrals” as one subset. While this data does include traditional referrals as a subset, this is not strictly referral data. There are fields used in medical claims to document referrals and this data set was not generated using those fields.
  • CMS continues to label the file as “physician” despite it covering all Medicare provider types (except pharmacies, due to the exclusion of Part D data). Being a physician is not a prerequisite for being included in these data sets; any provider who bills Medicare enough to meet the patient privacy threshold will be included in the data set.
  • There are coding problems in the datasets. Specifically, there are many “impossible” NPIs (National Provider Identifiers). For instance, a validity query against the 2014 data returns all of these NPIs:

npi, npi_count, problem
6073299,2,"This NPI is not 10 digits, it has 7: 6073299"
16073299,2,"This NPI is not 10 digits, it has 8: 16073299"
135632034,8,"This NPI is not 10 digits, it has 9: 135632034"
162909399,1,"This NPI is not 10 digits, it has 9: 162909399"
162915969,1,"This NPI is not 10 digits, it has 9: 162915969"
174031524,1,"This NPI is not 10 digits, it has 9: 174031524"
999999992,1,"This NPI is not 10 digits, it has 9: 999999992"
1063828204,1,"This NPI is does not pass luhn: 1063828204"
1194809840,3,"This NPI is does not pass luhn: 1194809840"
1245396655,1,"This NPI is does not pass luhn: 1245396655"
1619944960,1,"This NPI is does not pass luhn: 1619944960"
1740228645,3,"This NPI is does not pass luhn: 1740228645"
1750458455,5,"This NPI is does not pass luhn: 1750458455"
9999999991,502,"This NPI is does not pass luhn: 9999999991"
9999999992,11634,"This NPI is does not pass luhn: 9999999992"
9999999994,116,"This NPI is does not pass luhn: 9999999994"
9999999996,2975,"This NPI is does not pass luhn: 9999999996"

Based on previous investigations, we know that the contractors who generated the files are faithfully returning what is listed in the NPI field. The underlying problem is with the actual Medicare claims database. We suspect that these are the last vestigial organs of pre-NPI billing systems, but we cannot be sure. Happily these strange numbers are far less common in the 2015 data set. Perhaps CMS is succeeding in squashing the non-NPI coded transactions once and for all.
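
For readers who want to run the same kind of check on their own copy of the data, here is a minimal sketch in Python. It assumes the standard NPI check-digit convention, in which the Luhn algorithm is applied to the 10-digit NPI with the prefix 80840 prepended. The function names are ours, chosen for illustration, and this is not the exact query that produced the report above.

```python
# Prefix prepended to an NPI before the Luhn check (standard NPI convention).
NPI_PREFIX = "80840"


def luhn_sum(digits: str) -> int:
    """Return the Luhn sum of a digit string (check digit included)."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        # Double every second digit, counting from the right.
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total


def npi_problem(npi: str):
    """Return a description of what is wrong with an NPI, or None if it looks valid."""
    if not npi.isdigit() or len(npi) != 10:
        return f"not 10 digits, it has {len(npi)}: {npi}"
    if luhn_sum(NPI_PREFIX + npi) % 10 != 0:
        return f"does not pass luhn: {npi}"
    return None


# A few values from the report above, plus one well-formed sample NPI.
for candidate in ["6073299", "1063828204", "9999999992", "1234567893"]:
    print(candidate, npi_problem(candidate) or "looks valid")
```

Passing the check only means a number is well formed; it does not mean the NPI was ever issued to a real provider.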

Here is an NPI validity report for the 2015 data:

npi, npi_count, problem
9999999991,103,"This NPI is does not pass luhn: 9999999991"
9999999992,1569,"This NPI is does not pass luhn: 9999999992"

Much better!

For real-time discussion about these data sets, join the DocGraph Google group.

For a more thorough exploration of this data, sign up for CareSet news (link to sign up here).

Enjoy… and watch this space! We will be opening lots more data in 2016 and beyond.


Fred Trotter

Co-founder, DocGraph and CareSet Systems
* DocGraph teaming data shows how healthcare providers who bill Medicare cooperate to deliver care to their patients. Essentially, the teaming dataset documents Medicare providers who share patients in a given year. The result of this is a data structure that data scientists call a weighted directed graph of relationships. To be more specific, the method used to generate the graph converts the bipartite (two types of nodes, patients and providers) graph structure into a graph with just one type, a graph showing relationships between providers. In layman’s terms, the DocGraph teaming data set is a massive map of the healthcare system in the United States. It shows referrals, ordering patterns and many other types of healthcare provider collaborations. This data set exists as the result of a FOIA request made by DocGraph. Since that time, DocGraph has continued to collaborate with CMS to ensure that the data was updated and reliable.
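
The projection described in this footnote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: CMS has not documented the algorithm it actually used, the input rows are made up, and the function name, the (patient, provider, date) layout and the 30-day window are hypothetical choices of ours.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import permutations


def project_to_provider_graph(claims, window_days=30):
    """Collapse patient-provider visits into weighted, directed provider edges.

    claims: iterable of (patient_id, provider_id, service_date) tuples.
    Returns {(from_provider, to_provider): number_of_shared_patients}.
    An edge A -> B counts patients who saw A and then saw B within the window.
    """
    visits_by_patient = defaultdict(list)
    for patient, provider, day in claims:
        visits_by_patient[patient].append((provider, day))

    window = timedelta(days=window_days)
    shared_patients = defaultdict(set)
    for patient, visits in visits_by_patient.items():
        # Look at every ordered pair of visits for the same patient.
        for (first, first_day), (second, second_day) in permutations(visits, 2):
            gap = second_day - first_day
            if first != second and timedelta(0) <= gap <= window:
                shared_patients[(first, second)].add(patient)

    # The edge weight is the count of distinct shared patients.
    return {edge: len(patients) for edge, patients in shared_patients.items()}


# Made-up example rows (hypothetical patients and providers).
example_claims = [
    ("patient-1", "provider-A", date(2015, 1, 1)),
    ("patient-1", "provider-B", date(2015, 1, 10)),
    ("patient-2", "provider-A", date(2015, 2, 1)),
    ("patient-2", "provider-B", date(2015, 2, 5)),
    ("patient-2", "provider-C", date(2015, 6, 1)),
]
print(project_to_provider_graph(example_claims))
```

In this sketch the patient nodes disappear from the output entirely; only provider-to-provider edges and their weights remain, which is what makes the result a one-type graph. Same-day visits produce an edge in both directions.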

Data release: Open Provider Directory and Open Formulary comment data

Recently, HHS released a proposed rule regarding new regulations for health insurance companies. The specific document is called:

Patient Protection and Affordable Care Act: HHS Notice of Benefit and Payment Parameters for 2016

In that proposed rule were two open data concepts that are worth noting:

  • A suggestion that insurance companies be required to release their formulary data as machine readable data sets.
  • A suggestion that insurance companies be required to release data about their current provider directory as machine readable data sets.

As you might imagine, the DocGraph Journal consistently advocates for open data and indeed, we did submit comments regarding this issue…

We had one of our part-time researchers (thanks Armie!!) search all of the comments for mentions of “machine readable” and/or “data” to see who had commented on this matter besides us. Then we created a Google Sheets page with all of the relevant comments in one place. We are now releasing this data to the public.

Read on to access the data, and to read our first-pass analysis of what we found!


Tired of “out of network” insurance games

UPDATE (May 20, 2015) It looks like the forces for open data won this one! Here is the summary at fiercehealth and the actual policy change letter sent to payers.

If you spend much time in the patient community, you meet someone who has been badly burned by the “out of network” game that insurance companies play with/against healthcare providers.

It’s simple: you get insurance plan A from company Z. Then you go to a specialist or get a scan or something and you ask, “Do you take company Z insurance?” They say, “Sure.” You hand them the insurance card. What they don’t tell you is that they will be billing “out of network,” which means the services will hardly be covered at all.

You go to the insurance company, they point to the provider. You go to the provider, they point to the insurance company. Who is left with the huge bill? The patient.

Sometimes this gets really bad: in the worst cases, important treatments to relieve suffering are delayed.

Are you tired of this? In order to fix this, we need to be able to build systems that tell us for sure which providers are in a given plan at a given time. We need to have that system available when we purchase our health insurance so that we can buy insurance that covers the doctors that we already use, or the ones that we want to use. We can imagine a theoretical tool that solves this problem in a user-friendly way.

There are lots of companies and journalists in the DocGraph community that would love to be able to build such a tool. DocGraph would love to provide the data for such a tool but right now that would require that we scrape the websites of every insurance company provider directory in the country. Those websites are really unfriendly to such efforts. The following text was taken from the user agreement of the doctor finder tool for Aetna:

By using DocFind, you acknowledge and agree that DocFind and all of the data contained in DocFind belongs exclusively to Aetna Inc. and is protected by copyright and other law. DocFind is provided solely for the personal, non-commercial use of current and prospective Aetna members and providers. Use of any robot, spider or other intelligent agent to copy content from DocFind, extract any portion of it or otherwise cause DocFind to be burdened with unwarranted high access or transaction activity is strictly prohibited. Aetna reserves all rights to take appropriate civil, criminal or injunctive action to enforce these terms of use. 

Provider information contained in this directory is updated 6 days per week, excluding holidays, Sundays, or interruptions due to system maintenance, upgrades or unplanned outages. This information is subject to change at any time. Therefore please check with the provider before scheduling your appointment or receiving services to confirm he or she is participating in Aetna’s network. Participating physicians, hospitals and other health care providers are independent contractors and are neither agents nor employees of Aetna. The availability of any particular provider cannot be guaranteed, and provider network composition is subject to change. Notice of the change shall be provided in accordance with applicable state law.

The underlines are mine.

First, Aetna does not want anyone scraping their website. They do not want people like DocGraph to create these data sets. They view their list of providers as a protected information asset that only they can leverage.

But more importantly, they put the responsibility for “who is in what plan” squarely on the doctors. Which really means the patients, because the doctors’ websites will just say “check the insurance company website”. See what I mean about finger pointing?

Insurance companies and healthcare providers need to be held accountable for their in- vs. out-of-network status. The only way to do this is to create an open data set that maps Plans to Providers so that projects like this are really easy to build.

The policy wonks at HHS/CMS/ONC et al get this. They have recently added the following text to the rules for the 2016 insurance plans.

…we propose that a QHP issuer must publish an up-to-date, accurate, and complete provider directory, including information on which providers are accepting new patients, the provider’s location, contact information, specialty, medical group, and any institutional affiliations, in a manner that is easily accessible to plan enrollees, prospective enrollees, the State, the Exchange, HHS and OPM. As part of this requirement, we propose that a QHP issuer must update the directory information at least once a month, and that a provider directory will be considered easily accessible when the general public is able to view all of the current providers for a plan on the plan’s public Web site through a clearly identifiable link or tab without having to create or access an account or enter a policy number….(blah blah)…We also are considering requiring issuers to make this information publicly available on their Web sites in a machine-readable file and format specified by HHS.

underlines are mine…

This would solve the problem. Anyone who wanted to create a website that showed what plans any given provider accepted would be able to do so easily.
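
As a rough sketch of how little code such a site would need once the data is open, assume (hypothetically, since HHS had not specified a file format at this point) that an issuer publishes a JSON list of plans, each with a plan ID and the NPIs of its in-network providers. Inverting that into a provider-to-plans lookup takes a few lines of Python. The field names and sample values below are invented for illustration.

```python
import json
from collections import defaultdict

# Hypothetical machine-readable directory. The field names and values are
# assumptions made for this sketch, not a format specified by HHS.
SAMPLE_DIRECTORY = """
[
  {"plan_id": "PLAN-A", "issuer": "Company Z",
   "provider_npis": ["1234567893", "provider-2"]},
  {"plan_id": "PLAN-B", "issuer": "Company Z",
   "provider_npis": ["1234567893"]}
]
"""


def plans_by_provider(directory):
    """Invert a list of plans into {provider_id: set of plan IDs}."""
    index = defaultdict(set)
    for plan in directory:
        for provider_id in plan["provider_npis"]:
            index[provider_id].add(plan["plan_id"])
    return index


index = plans_by_provider(json.loads(SAMPLE_DIRECTORY))

# Which plans does a given provider accept?
print(sorted(index.get("1234567893", set())))
```

The hard part is getting every issuer to publish the file in one agreed format and keep it current, not the code.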

But the key word here is “propose”. Insurance companies in this country benefit greatly from the confusion about in network and out of network, and so do some unethical healthcare providers. There will be lots of people who oppose this proposal.

I hope that I have made the case that this information needs to be open and machine readable. If you’re convinced, then you can find the comment page to support this policy here. If you disagree with us, and you still want to submit a comment, you can use this page.

Please take a few moments and write in to support this policy change. The comments are due Dec 22nd 2014 which is basically tomorrow.

If you would like to read the in-progress comments from the DocGraph Journal you can go here. Feel free to cut and paste from our comments into your own comments; we would be flattered.

Feel free to tell them that I sent you 😉

-Fred Trotter


DocGraph Summit recap


The DocGraph Summit was a great success. A big thanks to everyone who made it down! We filled our day discussing current open health data initiatives, questions, and goals. We are grateful to Houston Technology Center for providing an excellent venue, and we are already scheming for the 2nd annual Summit next year! Check out the Storify here: .


DocGraph Open Health Data Summit – Oct 8

The DocGraph Summit is just around the corner!

This “unconference” will include short presentations on current projects of the participants, and discussions on the topics, challenges, and ideas deemed most relevant and paramount to the open health data community. Our goal is to set an atmosphere conducive to in-depth dialogue, concept mapping, networking, and brainstorming.

The Summit will also review DocGraph’s open healthcare data initiatives. These projects include food, medical, doctor, and hospital data, as well as other fun topics that are not easily categorized.

Currently we have academics, corporate delegates, researchers and entrepreneurs attending the Summit. Their areas of focus include data analytics, open source drug databases, EHRs, gene/drug interactions, VistA, Health IT, ACOs, statistics, etc. Attendees are coming from Rice, Stony Brook, UTHSC, e-mds, PwC, Baylor Medicine, the DocGraph community, and more.

Join us!

Eventbrite - The DocGraph Summit

Email for university student and faculty discount codes.

The DocGraph Summit is being held alongside the International Conference on Biomedical Ontology (ICBO) 14.