Janos Hajagos analyzes DocGraph RX

I have just realized that I had missed a very comprehensive analysis of the DocGraph RX dataset by Janos Hajagos.

He looked at providers in Hawaii to limit the scope. You can see his full-sized chart here. You can look at his source code here.

Highlights from the post:

The second core aspect of the work was to map semantic clinical drugs (TTY=SCD) to WHO’s ATC drug classification system. ATC drug codes are free to use for non-commercial purposes. RxNorm includes an accidental mapping to ATC drug codes but they are based on a single ingredient and not on the route, e.g., oral versus topical. A rather time consuming process, in terms of writing SQL, was done to improve the mapping. The final result while not 100% complete allows drugs to be sorted by a synthetic ATC code. Certain branded drugs like Skelaxin (Metaxalone) are not part of the current ATC release. Whenever possible I try to map to the longest length ATC code. The advantage of using ATC to sort the drugs is that we put drugs that are similar spatially near each other. The MySQL queries for generating the refined drug database are on GitHub.

This is truly unprecedented work!!!
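
To make the "longest ATC code" idea from the excerpt concrete, here is a minimal Python sketch. It is not Hajagos's method (his work is done in MySQL queries); the drug names, the candidate code lists, and the helper names longest_atc and sort_by_atc are all hypothetical and exist only for illustration. The sketch assumes each drug already has one or more candidate ATC codes, picks the longest (most specific) one, and then sorts by that code so that related drugs end up next to each other.

```python
# Hypothetical candidate mappings: drug description -> ATC codes found for it.
# These entries are illustrative only and are not taken from the real mapping.
candidates = {
    "ibuprofen 200 MG oral tablet": ["M01A", "M01AE01"],
    "atorvastatin 10 MG oral tablet": ["C10AA05", "C10A"],
    "acetaminophen 325 MG oral tablet": ["N02BE01"],
}


def longest_atc(codes):
    """Pick the longest (most specific) code; ties are broken alphabetically."""
    return max(codes, key=lambda code: (len(code), code))


def sort_by_atc(mapping):
    """Return (drug, atc_code) pairs ordered by the chosen ATC code."""
    chosen = {drug: longest_atc(codes) for drug, codes in mapping.items()}
    return sorted(chosen.items(), key=lambda item: item[1])


ordered = sort_by_atc(candidates)
for drug, code in ordered:
    print(code, drug)
```

Because ATC codes are hierarchical, a plain string sort on the code is enough to group drugs from the same class together.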

(Update Feb 2014: I misspelled Hajagos in this article. How embarrassing.)


Should we have more public data on doctors?

Obviously, we here at DocGraph want to see more and more doctor data released. So we should admit that bias to start.

We also care about what other people think, so we were very excited to see the release of the comments on the change of policy regarding the release of doctor-identified, patient-blinded claims data. In many cases it is hard to understand how policy makers are thinking until you see how the public is reacting to their decisions. With that in mind, I arm-twisted my sister, Alma, into doing some data analysis on the comments themselves. I wanted to understand what kind of patterns the comments showed. I was expecting, for instance, that organizations that represented patients would typically comment very differently than organizations representing doctors.

Alma read every comment, and answered specific questions about each comment in a spreadsheet that we plan to release soon. She also summarized the comments as she went along, and she did such a fine job with that summary that I thought the DocGraph community might be interested in her analysis. Here it is in its entirety:

Regarding the release of more doctor data by CMS

The form and direction of the data release was a frequently covered topic. Raw, disaggregate line-item data, aggregated data, or both? Into the hands of the public, or first to an intermediary such as a Qualified Entity, stakeholder or other approved party? Limited to internal or public uses?

A large portion of respondents (frequently, but certainly not limited to, providers) believe the data should be aggregated initially, or that raw data should be presented only to entities with appropriate security safeguards and mechanisms to analyze the data and prevent misinterpretation. The misinterpretation factor was brought up time and again, by different types of responders. The public could easily misread the information and make faulty assumptions about certain doctors, providers and procedures, ultimately taking a step backward from the goal of better care.

While the method of release was highly contested, there was a general agreement that timely updates of the data are key to making it useful. Some frustration was expressed with CMS’ past performance in this regard.

Many journalists and open data proponents/analysts think that releasing data to the public, or relaxing the prerequisites and ensuring affordability in obtaining the data, could prevent backlog or delay from information requests and allow for faster analysis. Providing developers and interested parties access to line-item, disaggregate data could also expedite the flow and achieve more meaningful results. Aggregating the data before its release could not only delay its availability but, most importantly, constrain the results and insights of future data processing.

Both providers and non-providers suggested an expansion of the QE program to allow a greater number of (experienced and trustworthy) hands access to the data. See pg 209 and 178 for more about that idea.

Most anxiety came from providers, primarily that physician identity and privacy may be at risk and that the data will almost certainly be misinterpreted if released raw to the public. Generally, providers wanted CMS or a party with proven skills to aggregate the data or remove physician-identifiable information, such as the NPI, before it is released to the public. Their major reason is that the public will misconstrue the payments to physicians and make poor health care decisions. The public will simply see a physician’s payment data, without understanding factors such as team-based care, overhead and operational costs, how services are billed, geography, patient load, specialized services, etc. Providers hope to be able to review their information regularly to check for accuracy, and to attach comments and explanations.

Some (mostly non-provider) groups believed that physician privacy was not an issue, as physicians are business entities, making transactions with the government. Their privacy, by law, is not an issue. Others were certain that the precautions taken by CMS, or by approved data handlers, to protect physician (and patient) privacy will be quite sufficient.

Patient privacy was also a major concern from providers (though obviously not as great as physician privacy), with fears that the patients could be identified through rare, costly, or unusual procedures.

Still, I got the impression that providers believe the move to release this data is inevitable (and perhaps beneficial in the long term), so they addressed the most important risks with hope that the release could improve healthcare more than hurt it. They want CMS to work with them in developing the policies for this release, and to monitor the results from data analyses.

Most everyone believed that a great benefit will arise from combining the Medicare data with other types of payment data, to get a broader sense of how physicians are performing. Most also believed that the public will need encouragement and aid to actually use this information and understand it in a clear and meaningful way.

The administrative burdens and the misdirected audits or suspicion of misuse arising from this information were a concern for providers, especially the possibility that this will lead to doctors no longer accepting Medicare patients. However, many groups of all kinds believed that with the proper context in place, prevention of fraud and waste could be possible.

Regarding the comments made by Individuals rather than groups:

Lots of regular Joes are ready to pounce on the data. Many are disgruntled folks who want to know where their tax dollars are going, and/or who feel blind in a healthcare system where they are frequently gouged. Again, they believe that if someone is doing business with the government, they lose their right to privacy.

However, some Joes think this will only lead to more bureaucratic nonsense: a waste of time and effort that will make lawsuits more frequent and invade physician privacy. Might as well release individuals’ tax payments towards Medicare publicly as well. America!

Individual doctors were more vehement and caustic about releasing their data than their representatives above (though some are eager to see change and progress through data availability). Some threatened to drop Medicare patients to avoid the hassle. Some mentioned that a greater class divide will result, and loads feel that their privacy will be greatly violated.

Thanks for the analysis, Sis!

IMS Informatics releases Wikipedia/Social Media Correlations

I found out from Rachel Feltman at Quartz that the IMS Institute for Health Informatics has released a very comprehensive report titled “Engaging Patients on Social Media”. Here is the link to their Press Release on the same. (Apparently IMS is not “into” human readable and/or stable urls.)

The simplified title belies the complexity of the release here, which is basically IMS correlating things that are happening on the web, in social media and Wikipedia, with the tremendous amount of healthcare system data that they are privy to. I was most interested in the conclusions they put forward regarding Wikipedia. From the report:

  • Wikipedia is the leading single source of healthcare information for patients and healthcare professionals.
  • Visits to Wikipedia pages are higher for rarer diseases than for common diseases.
  • Wikipedia is used throughout the entire patient journey, not just at the point of treatment initiation or change in therapy.
  • Correlation between Wikipedia use and medicine use can be identified for a large number of disease areas.
  • Younger people tend to investigate conditions and treatment options online before treatment is started whereas patients of age 50+ tend to start their treatment first and then seek information online thereafter.
  • Content incorporated or changed at healthcare related Wikipedia pages is subject to constant change, often overseen by informal or formal working groups.
  • At least half of all healthcare related changes on assessed Wikipedia disease articles are changes to patient relevant information.

These are pretty profound conclusions and it is worth reading the whole report carefully. IMS needs to be applauded for releasing this type of treasure trove of healthcare insights.

Here is a graphic from their separately published figures.


NPPES collaboration underway

The NPPES database is the central repository of data about doctors and hospitals in the United States. It is the core of the DocGraph dataset, forming the node data, which we add various edges to. The database has been a mess for years. Lots of non-compliant data.
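
As a rough illustration of the "node data plus edges" structure described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the NPIs are made up, the node fields are simplified stand-ins for NPPES records, the edge weights are assumed to be shared-patient counts, and build_adjacency is a hypothetical helper rather than part of any DocGraph tooling. It also shows one simple way to cope with messy data, by skipping edges that point at an NPI with no node record.

```python
from collections import defaultdict

# Hypothetical node data keyed by NPI (simplified stand-ins for NPPES records).
nodes = {
    "1000000001": {"name": "Example Clinic", "entity": "organization"},
    "1000000002": {"name": "Dr. Example A", "entity": "individual"},
    "1000000003": {"name": "Dr. Example B", "entity": "individual"},
}

# Hypothetical edges: (from_npi, to_npi, weight). The weight is assumed here
# to be a shared-patient count.
edges = [
    ("1000000001", "1000000002", 12),
    ("1000000002", "1000000003", 40),
    ("1000000003", "9999999999", 7),  # points at an NPI with no node record
]


def build_adjacency(nodes, edges):
    """Attach edges to known nodes and collect edges with unknown endpoints."""
    adjacency = defaultdict(dict)
    dangling = []
    for src, dst, weight in edges:
        if src in nodes and dst in nodes:
            adjacency[src][dst] = weight
        else:
            dangling.append((src, dst, weight))
    return adjacency, dangling


adjacency, dangling = build_adjacency(nodes, edges)
print(dict(adjacency))
print("edges with unknown NPIs:", dangling)
```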

For those who have not been paying attention, CMS has an internal innovation project to improve the NPPES database. Recently, Alan Viars, one of the strongest Health IT hackers I know, took the job as the outside Entrepreneur to improve the database. I am deeply thankful that Alan decided to take his turn serving in the government directly. He and I have been complaining together for years about our shared frustrations with the quality of the NPPES data.

Recently Alan announced a new Google Group mailing list for those interested in collaborating with him and CMS on improvements to the data.


So if you are “into” the NPPES dataset as much as we are over at the DocGraph project, you should join the group and start contributing.




Which Doctor’s Next? More from RWeald

In this article, Weald explores typical treatment paths revealed in the DocGraph data.  An excerpt of his work is below:

Provider Type Seen First | Provider Type Seen Second | Number of Patients
Radiology – Diagnostic Radiology | Internal Medicine – General | 115,602,860
Internal Medicine – General | Radiology – Diagnostic Radiology | 91,632,055
Internal Medicine – Cardiovascular Disease | Internal Medicine – General | 54,260,749
Radiology – Diagnostic Radiology | Internal Medicine – Cardiovascular Disease | 49,406,691
Internal Medicine – Cardiovascular Disease | Radiology – Diagnostic Radiology | 47,820,945
Internal Medicine – General | Internal Medicine – Cardiovascular Disease | 47,351,852
Radiology – Diagnostic Radiology | Family Medicine – General | 45,078,839
Family Medicine – General | Radiology – Diagnostic Radiology | 40,181,846
Emergency Medicine – General | Radiology – Diagnostic Radiology | 33,797,598
Emergency Medicine – General | Internal Medicine – General | 32,236,140


The full table and the Hive scripts used to perform the analysis can be found in this article.
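
For readers who want a feel for how a "seen first / seen second" table like the one above could be tallied, here is a small Python sketch. It is not Weald's analysis, which was done with Hive scripts on the DocGraph data. The visit records, the shortened provider type labels, and the count_transitions helper are hypothetical, and the sketch assumes that a transition means two consecutive visits by the same patient, counted once per patient.

```python
from collections import Counter

# Hypothetical visit records: (patient_id, visit_order, provider_type).
visits = [
    ("p1", 1, "Radiology"),
    ("p1", 2, "Internal Medicine"),
    ("p2", 1, "Radiology"),
    ("p2", 2, "Internal Medicine"),
    ("p2", 3, "Cardiology"),
    ("p3", 1, "Internal Medicine"),
    ("p3", 2, "Radiology"),
]


def count_transitions(visits):
    """Count ordered (first, second) provider-type pairs, once per patient."""
    sequences = {}
    # Sorting the tuples orders them by patient, then by visit order.
    for patient, _order, provider_type in sorted(visits):
        sequences.setdefault(patient, []).append(provider_type)

    pairs = Counter()
    for sequence in sequences.values():
        # zip pairs each visit with the next one; set() avoids counting the
        # same pair twice for a single patient.
        pairs.update(set(zip(sequence, sequence[1:])))
    return pairs


transitions = count_transitions(visits)
for (first, second), patients in transitions.most_common():
    print(first, "->", second, patients)
```

Note that the pairs are ordered, so Radiology followed by Internal Medicine is counted separately from Internal Medicine followed by Radiology, which matches the way the table lists both directions.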