Recruiting for Your Pragmatic Clinical Study
4. Healthplan Data: A Treasure Trove for Your Pragmatic Study, Part 2
Because it covers the care continuum for huge populations, healthplan data offers strong advantages for planning, recruiting for, and analyzing outcomes of your pragmatic clinical study. But it’s designed for payment, not studies. Do you know how to leverage its strengths while minimizing the ‘gotchas’?
Published on
January 11, 2019
Healthplan data can be extremely valuable, if you know how to interpret it
“If you’re not confused, you’re not paying attention.” - Tom Peters

“How do I find 3,000 geographically-dispersed diabetics who are currently taking metformin, and who do not have chronic kidney disease--of whom at least 1,000 are currently in good control, and at least 1,000 are not?”

Welcome back! Our opening question illustrates the opportunity of healthplan data for efficient recruiting for a widely-distributed pragmatic study (WDPS, as discussed in Episodes 2 and 3 of this series); and we’ll use the question to illustrate the strengths, limitations and “gotchas” inherent in using this kind of data--and how it can be used wisely in expert hands.

Cut to the punchline: Healthplan data carries many advantages in recruiting for your pragmatic study (as well as for the retrospective analytics often needed to clarify questions and design your study). But, it’s designed for payment, not for clinical outcomes studies. Yet it can be extremely useful in expert hands when properly used. Working directly with a healthplan can have additional advantages, too, but also certain caveats. We provide a handy checklist to help you navigate.

Why healthplan data? The WDPS can yield actionable insights about the use, adherence to, and consequences (outcomes) of treatments over a wide range of appropriate real-world clinical scenarios, patient and clinician characteristics, and healthcare settings. This implies your study will need to recruit and track a lot of patients (subjects) and clinicians over a breadth of geography. Large healthplans have this kind of data.

How big a population do you need to do your study? For example, if you need 3,000 type 2 diabetics age 45-64, let’s estimate that your healthplan’s age distribution mirrors the overall US population, with 26.4% age 45-64 (1) (of course, if you don’t have Medicare data, this proportion is probably larger); that the prevalence of diagnosed diabetes in your healthplan data approximates the CDC’s estimate of 12.7% (2); that 8.5% of those members will be excluded due to chronic kidney disease (3); and that 60% are taking metformin, of whom 85% aren’t using insulin and, of those, 70% aren’t taking other oral hypoglycemics (4). This gets us to about 109,000 potential candidates.
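
The funnel above is just a chain of multiplications, so it is easy to sketch in a few lines of Python. The percentages come from the paragraph above; the plan size does not appear in the text, so the 10 million members used here is purely an assumption, chosen because it roughly reproduces the candidate count quoted above. The function and variable names are illustrative only.

```python
import math

# Shares taken from the estimates in the text, applied in order.
FUNNEL = [
    ("age 45-64", 0.264),
    ("diagnosed diabetes", 0.127),
    ("no chronic kidney disease", 1 - 0.085),
    ("taking metformin", 0.60),
    ("not using insulin", 0.85),
    ("no other oral hypoglycemics", 0.70),
]


def funnel_fraction(steps):
    """Share of all plan members who survive every step of the funnel."""
    fraction = 1.0
    for _label, share in steps:
        fraction *= share
    return fraction


def candidates(plan_members, steps=FUNNEL):
    """Expected number of potential candidates in a plan of the given size."""
    return plan_members * funnel_fraction(steps)


def members_needed(target, steps=FUNNEL):
    """Smallest plan size expected to yield `target` potential candidates."""
    return math.ceil(target / funnel_fraction(steps))


# ASSUMPTION: plan size is not stated in the text; 10 million is hypothetical.
PLAN_MEMBERS = 10_000_000

print(f"Funnel fraction: {funnel_fraction(FUNNEL):.2%}")
print(f"Potential candidates: {candidates(PLAN_MEMBERS):,.0f}")
print(f"Members needed for 3,000 candidates: {members_needed(3000):,}")
```

Under these assumptions only about 1.1% of members pass every step, so even the bare minimum of 3,000 candidates implies a plan of roughly 274,000 members, before any doctors decline or patients drop out.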

But then you (probably through the healthplan) have to recruit their doctors (directly, or indirectly by first reaching out to potentially qualifying healthplan members), who in turn screen and recruit patients, some of whom drop out.

Finding your recruiting candidates: Assume you will be reaching out to clinicians whose patients appear to fit your study’s basic criteria. After eliminating healthplan members who aren’t the right age/gender, who aren’t covered by the right plan type (e.g., members of school or military plans for some studies), and who don’t have sufficient data history (for confident clinical inclusion/exclusion and other pre-study characteristics), you must now identify (1) type 2 diabetics who (2) don’t have chronic kidney disease and (3) are taking metformin but (4) not insulin and (5) aren’t taking any other oral hypoglycemic agents.

Those numbers look promising, but if you know healthplan data you’ll know that you still have some cutting down to do. For example, your protocol may specify that qualifying members must have been “continuously eligible (insured)” for at least 12 months, must not turn 65 during the 12-month study, and must not have been in the ER or hospital with hypo- or hyperglycemia or acute cardiovascular disease in the past 12 months. And, of course, they must give consent if the study involves being randomized to, or offered a choice to receive, an intervention (or if your IRB says consent is required, whatever you may think).
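
Two of these protocol rules, continuous eligibility and the age cutoff, translate directly into date logic. Here is a minimal sketch, assuming each member's eligibility is available as a list of covered (year, month) pairs and that the study starts on the first of a month. Real eligibility files are usually stored as coverage spans and vary by plan, so the data layout and function names here are hypothetical.

```python
from datetime import date


def month_index(year, month):
    """Map a calendar month to a running integer so months can be compared."""
    return year * 12 + (month - 1)


def continuously_eligible(covered_months, study_start, lookback=12):
    """True if each of the `lookback` months before the study start month is covered."""
    start = month_index(study_start.year, study_start.month)
    covered = {month_index(y, m) for y, m in covered_months}
    return all(start - k in covered for k in range(1, lookback + 1))


def stays_under_65(birth_date, study_start, study_months=12):
    """True if the 65th birthday falls on or after the study end date.

    ASSUMPTION: the study starts on the first of a month.
    Tuples are compared instead of dates so a Feb 29 birthday cannot fail.
    """
    end_year, end_month0 = divmod(
        month_index(study_start.year, study_start.month) + study_months, 12
    )
    study_end = (end_year, end_month0 + 1, 1)
    turns_65 = (birth_date.year + 65, birth_date.month, birth_date.day)
    return turns_65 >= study_end


# Hypothetical member: covered for all of 2018, study starting January 2019.
full_year = [(2018, m) for m in range(1, 13)]
gap_in_june = [(2018, m) for m in range(1, 13) if m != 6]
start = date(2019, 1, 1)

print(continuously_eligible(full_year, start))    # True
print(continuously_eligible(gap_in_june, start))  # False
print(stays_under_65(date(1954, 6, 15), start))   # False
```

A single missing month is enough to fail the check, which is one reason these rules can cut the candidate pool more than you might expect.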

The accuracy vs. inclusiveness trade-off: The CDC prevalence is based on more accurate sources than claims, which are notoriously susceptible to false-positive disease identifications. (5) False-negatives (failure to identify a diabetic) may occur, too, if there’s inadequate data history. In a study (6) that compared several Medicare claims-based algorithms and used self-report as the gold standard (“correct by definition”), the best model’s sensitivity was about 70% and specificity 97.5%. (7) Of course, self-report is less of a gold standard than lab testing, which--with today’s electronic medical records--may automatically inform a claim (billing) diagnosis code. Unsurprisingly, the study found that combining more than one data source (e.g., inpatient, outpatient, lab, pharmacy), having longer claims history, and requiring multiple claims if services were outpatient reduced the false-positive rate. The higher the bar, the likelier your clues point to the real McCoy, but also, the more real McCoys you will miss.
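
To see what those sensitivity and specificity figures mean for recruiting, you can combine them with a prevalence to get predictive values (Bayes' rule). The sketch below plugs in the 70% and 97.5% quoted above together with the 12.7% prevalence used earlier. Pairing a Medicare-based algorithm with that prevalence is an assumption made only for illustration, and the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with this prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # P(diabetic | flagged by claims)
    npv = true_neg / (true_neg + false_neg)  # P(not diabetic | not flagged)
    return ppv, npv


# ASSUMPTION: 12.7% prevalence is borrowed from the earlier estimate.
ppv, npv = predictive_values(sensitivity=0.70, specificity=0.975, prevalence=0.127)
print(f"PPV: {ppv:.1%}")
print(f"NPV: {npv:.1%}")
```

With these inputs the positive predictive value is about 80%, so roughly one in five members flagged by the claims model would not actually be diabetic, while 30% of true diabetics would never be flagged at all.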

Strengths and limitations of healthplan data for recruiting, outcomes:

Whether you’re looking to use healthplan data for recruiting or as part of assessing treatment, you must understand how to utilize this fabulous data trove wisely.

Here’s a checklist for using healthplan data:

  • Are you able to track individuals (people) through time and the continuum of healthcare? (Even better if you can track them if they change insurance plans or employers!)
  • How complete is the data? For example, what % of individuals have data for age, gender, insurance type, insurance carrier, zip code? What % of claims have a valid procedure (CPT or ICD-9-CM Volume 3/ICD-10-PCS) code?
  • Is there a lookup table for clinicians? (For example, are ‘John Jones,’ ‘Jonathan Jones,’ ‘John Q. Jones,’ and ‘Jonathan Q. Jones’ all the same doctor?) Are the doctors’ specialties all listed and consistent?
  • Can you tell whether each individual member has only medical, only pharmacy, or both coverages for each month in which they have insurance eligibility?
  • Do you know how to identify people who ‘have’ disease X? If so, do you know how to distinguish people who are incident (no prior history) from prevalent (have had X for some period of time)?
  • Can you associate diagnoses and procedures with the care setting, such as doctor’s office, ER, urgent care, inpatient, etc.?
  • Are your claims ‘compressed’ (see table), so that you don’t mistakenly conclude a person was hospitalized 11 times when it was only once (because the hospitalization generated 11 service sub-claims)?
  • If you want to study newborns, do you know how to identify them? (In fact, do you know why I’m asking this?)
  • Do you know how to use the data structure to analyze cost-drivers?
  • Do you know how to identify and take into account the components of cost trend? (Roughly, these components include health risk, age/gender risk net of health risk, service unit cost, service unit volume, and benefit plan design including demand elasticity.)
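
The claims ‘compression’ item in the checklist is worth a quick illustration. Here is a minimal sketch, assuming each claim line carries a member ID, a care setting, and from/thru service dates. These field names are hypothetical, and real claim layouts (and the plan's own compression rules) will differ. Inpatient lines for the same member that overlap or touch in time are merged into a single stay.

```python
from datetime import date, timedelta


def compress_inpatient(lines, max_gap_days=1):
    """Merge a member's overlapping or adjacent inpatient lines into stays.

    Returns {member_id: [(admit_date, discharge_date), ...]}.
    ASSUMPTION: lines starting within `max_gap_days` of the current stay's
    end belong to that stay; the right gap is a protocol decision.
    """
    stays = {}
    inpatient = [ln for ln in lines if ln["setting"] == "inpatient"]
    for ln in sorted(inpatient, key=lambda ln: (ln["member_id"], ln["from_date"])):
        member_stays = stays.setdefault(ln["member_id"], [])
        gap = timedelta(days=max_gap_days)
        if member_stays and ln["from_date"] <= member_stays[-1][1] + gap:
            admit, discharge = member_stays[-1]
            member_stays[-1] = (admit, max(discharge, ln["thru_date"]))
        else:
            member_stays.append((ln["from_date"], ln["thru_date"]))
    return stays


# Hypothetical data: 11 sub-claims from one five-day hospitalization,
# plus an office visit that should be ignored.
example_lines = [
    {
        "member_id": "A",
        "setting": "inpatient",
        "from_date": date(2018, 3, 1) + timedelta(days=i % 5),
        "thru_date": date(2018, 3, 1) + timedelta(days=i % 5),
    }
    for i in range(11)
]
example_lines.append(
    {
        "member_id": "A",
        "setting": "office",
        "from_date": date(2018, 6, 1),
        "thru_date": date(2018, 6, 1),
    }
)

stays = compress_inpatient(example_lines)
print(len(example_lines), "claim lines ->", len(stays["A"]), "hospitalization")
```

Counting raw lines here would suggest 11 admissions; after compression there is one stay running from March 1 to March 5.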

Do we have to work with a healthplan to access healthplan data?

The easiest way to work with healthplan data is, of course, to work with a healthplan! While it’s unlikely they’ll give you direct access to their data, their informatics specialists--who know their data extremely well--will know how to elicit your research questions and recruiting criteria, convert them to queries, and--if the data will be used as part of outcomes evaluation--develop analytics. In addition, healthplans are entitled to reach out to their members and contracted clinicians and facilities (with certain constraints).  

Working with a healthplan may also bring a heightened sense of collaboration to the relationship. A downside of working with a single healthplan is the limited number of potential subjects for a study, which could matter for a rare disease or treatment. In some circumstances it may be possible to work with more than one plan; and in the near future, multi-data vendors may arise, with capabilities of identifying patients and doctors for recruiting.

Closing remarks: Healthplan data offers a wealth of advantages and opportunities for recruiting and gaining insights into the drivers and outcomes of therapies. This is especially important to the widely-distributed type of pragmatic study--but please engage (or be) an expert!

Next up: Let’s talk recruiting!



  1. US Census Bureau, 2010.
  2. 2015 data from the National Diabetes Statistics Report, 2017, National Center for Chronic Disease Prevention and Health Promotion.
  3. Estimated GFR < 60 in adults under age 65 in Bailey RA et al. Chronic kidney disease in US adults with type 2 diabetes: An updated national estimate of prevalence based on Kidney Disease: Improving Global Outcomes (KDIGO) staging. BMC Res Notes 2014;7:415.
  4. The latest CDC estimate (2011) was that 50.3% were taking oral hypoglycemics only, but many would be taking metformin alone or metformin plus another oral hypoglycemic.
  5. The reasons for this are numerous but all depend on the use of diagnostic codes on claims being intended for billing--not science.
  6. Hebert PL, et al. Identifying persons with diabetes using Medicare claims data. Am J Med Qual 1999;14(6):270-77.
  7. Sensitivity (in this context) is the percentage of people with diabetes who are identified by the algorithm; specificity is the percentage of people who don’t have diabetes who are identified as nondiabetic (high specificity means few false-positive identifications). However, to tell how useful a test is, we also need to know the prevalence of the condition in the population of interest; then we can calculate what we really want to know: how likely it is that a person with a positive test (by claims model) is diabetic, and that a person with a negative test is not diabetic. Make sure you understand this distinction if you use healthplan data for recruitment.
  8. For example, the Healthcare Effectiveness Data and Information Set (HEDIS) measure for testing for glycemic control requires only one HbA1c performed in a 12-month period, though clinically, the American Diabetes Association recommends testing at 3 to 6 month intervals (depending on how well controlled the patient’s diabetes is).