LibGuides: CHARTing Health Information: Data Problems

Common Data Problems

As you view the various links to sources of health data, keep in mind that data does not tell the whole story. There may be a story behind the facts. As you will see, data collection and gathering are not perfect – if you see an anomaly or large deviation, find out why. Don't assume it’s correct.

However, as you look over the caveats detailed on this page, keep in mind that the data can still be compared once you understand the anomalies. Whenever possible, read the technical notes for more insight into the data.

1. Do you know what data is really gathered?

Be certain you understand the rules of data gathering before you try to interpret the data! Below is an example of how the data gathering can mislead.

A. Infectious Diseases: Changes to the list of notifiable diseases

In 1994 there were 59 infectious diseases notifiable at the national level. In 2010, there were 100. Not knowing when an infectious disease became notifiable can lead to misinterpretation of the data. Looking at chlamydia, for example, we see the following data for Texas.

Chlamydia in Texas
Year	2000	1999	1995	1990
Count	68,814	62,958	44,627	20,560
Rate per 100,000	328.39	306.24	235.39	120.54

(Source: CDC WONDER, Sexually Transmitted Disease Morbidity, 1984-2009)

At first glance there appears to be a chlamydia epidemic in Texas: between 1990 and 1999, the number of chlamydia cases tripled. But when did chlamydia become a notifiable disease? Based on the data, we might guess 1995, since that is when we see a large increase in both count and rate. That guess is correct: prior to 1995, chlamydia was reported only voluntarily, so at least part of the apparent increase reflects the change in reporting rules rather than a true rise in infections. Learn more about data reporting for chlamydia.
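The arithmetic behind this comparison can be checked directly from the table. The short sketch below is illustrative only: the figures are the ones shown above, and the helper names (fold_change, implied_population) are invented for this example.

```python
# Texas chlamydia figures copied from the table above (CDC WONDER).
counts = {1990: 20_560, 1995: 44_627, 1999: 62_958, 2000: 68_814}
rates_per_100k = {1990: 120.54, 1995: 235.39, 1999: 306.24, 2000: 328.39}


def fold_change(series, start_year, end_year):
    """Return how many times larger the end value is than the start value."""
    return series[end_year] / series[start_year]


def implied_population(year):
    """Back out the population from rate = count / population * 100,000."""
    return counts[year] / rates_per_100k[year] * 100_000


print(f"Count, 1990 to 1999: x{fold_change(counts, 1990, 1999):.2f}")
print(f"Rate,  1990 to 1999: x{fold_change(rates_per_100k, 1990, 1999):.2f}")
for year in sorted(counts):
    print(f"{year}: implied population of about {implied_population(year):,.0f}")
```

The count grows about 3.1 times while the rate grows about 2.5 times, because the population also grew over the decade. Neither figure, on its own, says anything about when reporting became mandatory.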

The CDC lists other concerns when interpreting data:

"Incidence data in the Summary are presented by the date of report to CDC as determined by the MMWR week and year assigned by the state or territorial health department.....Thus, surveillance data reported by other CDC programs may vary from data reported in the Summary because of differences in 1) the date used to aggregate data (e.g., date of report, date of disease occurrence), 2) the timing of reports, 3) the source of the data, 4) surveillance case definitions, and 5) policies regarding case jurisdiction (i.e., which state should report the case to CDC).

The data reported in the Summary are useful for analyzing disease trends and determining relative disease burdens. However, these data must be interpreted in light of reporting practices. Some diseases that cause severe clinical illness (e.g., plague and rabies) are most likely reported accurately if they were diagnosed by a clinician. However, persons who have diseases that are clinically mild and infrequently associated with serious consequences (e.g., salmonellosis) might not seek medical care from a health-care provider. Even if these less severe diseases are diagnosed, they are less likely to be reported.

The degree of completeness of data reporting also is influenced by the diagnostic facilities available; the control measures in effect; public awareness of a specific disease; and interests, resources, and priorities of state and local officials responsible for disease control and public health surveillance. Finally, factors such as changes in the case definitions for public health surveillance, introduction of new diagnostic tests, or discovery of new disease entities can cause changes in disease reporting that are independent of the true incidence of disease."

2. How have standards changed in the reporting or collection of data?

A. Has an age-adjustment been made on the data? If so, which standard was used?
Age adjustments are used to compare two populations during the same time period or the same population during different time periods. They remove observed differences between populations that are due only to differences in age structure. There are four standards you are likely to encounter, the most current being the 2000 standard. The others are the 1980 standard (not as common), the 1970 standard (common), and the 1940 standard (common). To get a valid comparison, both rates must be adjusted to the same standard.

B. Which international classification of disease (ICD) revision was used to report the data?
ICD-9 and ICD-10 are both still used when classifying mortality data. This international classification provides a means of comparison between the U.S. and other countries. ICD-10 is more detailed than ICD-9 and uses an alphanumeric system; ICD-9 was a largely numeric system. For the purpose of comparison, see Anderson, RN, et al. (2001). Comparability of Cause of Death Between ICD-9 and ICD-10: Preliminary Estimates. National Vital Statistics Reports, 49(2). The World Health Organization has posted ICD-10 codes online.

C. Is the mortality data measuring the underlying cause of death or multiple causes of death?
Some Healthy People data (specifically for diabetes) is reported using multiple causes of death; mortality data from many of the sources on these pages shows the underlying cause only. Be certain you know which one you are looking at so you aren't misled by seemingly conflicting data.

3. What is the unit of measure of the data?
As you look at the data, be sure you understand the actual unit of measure. Are you looking at a count or a rate? Is the rate age-adjusted? Which standard was used? If you aren't certain what that means, take a look at the Rates and Formulae page for additional information.

Be sure you understand the unit of measure so you can compare apples to apples. You cannot compare a non-adjusted (crude) rate with an age-adjusted rate. You also probably do not want to compare crude rates that are separated by several years (e.g. a decade), especially in areas that are changing rapidly, because the age structure of the population may have shifted in the meantime. It is possible to calculate an age-adjusted rate fairly easily. A general epidemiology book will explain how.
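As a rough illustration of the calculation, here is a minimal sketch of direct age adjustment: each age group's rate is weighted by that group's share of a standard population. Everything in it is an assumption made for the example. The three age groups, the two communities, and the standard weights are hypothetical and are not the actual 2000 U.S. standard; consult an epidemiology text or the technical notes of your data source for the real weights and method.

```python
def crude_rate(deaths, population, per=100_000):
    """Total deaths divided by total population, expressed per `per` people."""
    return sum(deaths) / sum(population) * per


def age_adjusted_rate(deaths, population, standard_weights, per=100_000):
    """Direct standardization: weight each age-specific rate by the
    standard population's share for that age group, then sum."""
    if not (len(deaths) == len(population) == len(standard_weights)):
        raise ValueError("all inputs must cover the same age groups")
    total_weight = sum(standard_weights)
    adjusted = 0.0
    for d, p, w in zip(deaths, population, standard_weights):
        adjusted += (d / p) * (w / total_weight)
    return adjusted * per


# Hypothetical age groups: under 40, 40-64, 65 and over.
standard_weights = [0.55, 0.30, 0.15]   # hypothetical standard population shares

# Two hypothetical communities with identical age-specific death rates
# (50, 400, and 4,000 per 100,000) but different age structures.
young_pop, young_deaths = [60_000, 30_000, 10_000], [30, 120, 400]
older_pop, older_deaths = [30_000, 40_000, 30_000], [15, 160, 1_200]

for name, deaths, pop in [("younger", young_deaths, young_pop),
                          ("older", older_deaths, older_pop)]:
    print(name,
          "crude:", round(crude_rate(deaths, pop), 1),
          "adjusted:", round(age_adjusted_rate(deaths, pop, standard_weights), 1))
```

In this made-up example the crude rates differ widely (550 versus 1,375 per 100,000) even though the risk of death at every age is the same in both communities; after adjustment to the same standard, both come out at 747.5. Using a different set of standard weights would change the adjusted figure, which is why both rates must use the same standard.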

4. What has changed in medicine to affect the data?
Medicine has made great strides in keeping alive people who, twenty or even ten years ago, would have died. Think about AIDS, heart attacks, strokes, and cancer. Mortality data is not always the most accurate reflection of the health of a population.

Another example is infant mortality. Rates increased in the United States in 2002, from 6.8 deaths per 1,000 live births in 2001 to 7.0 deaths in 2002. What happened? The causes are not fully known yet, but the CDC has some thoughts on the reasons why. Take a look at the "Supplemental Analyses of Recent Trends in Infant Mortality." Again, medical technology could have influenced infant mortality. For instance, what was once considered a miscarriage may now be recorded as a preterm delivery.

5. How has the population changed?
Has the population aged? If so, we may see a sharp increase in cancer and cardiovascular diseases. Is it a younger population? Then there may be an increase in the number of pregnancies and STDs. Be sure to look at the demographics of the population when examining the number of occurrences of an event.

6. Is the data reporting self-reported behaviors?
Data on self-reported behaviors cannot be expected to be fully accurate. After all, a survey participant may be asked about behaviors that are embarrassing or even illegal. Consequently, when questioned, participants may under-report certain behaviors (drinking while pregnant) and over-report others (exercise).

7. Is the frequency of events or the population (or both) a very small number?
Be careful when working with diseases for which there are few deaths, or with populations that are fairly small. While the mortality rate may be expressed as the number per 100,000, a rate based on a handful of events can swing widely from year to year, so you may also need to take the confidence interval into consideration (see https://www.health.ny.gov/diseases/chronic/confint.htm).
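To see why small numbers matter, here is a minimal sketch that attaches an approximate 95% confidence interval to a rate. It assumes the number of events follows a Poisson distribution and uses the common normal approximation, in which the standard error of a rate is the rate divided by the square root of the number of events. This is an assumption chosen for illustration and is not necessarily the method described on the page linked above; for very small counts, exact Poisson-based intervals are generally preferred. The function name and the example figures are hypothetical.

```python
import math


def rate_with_confidence_interval(events, population, per=100_000, z=1.96):
    """Return (rate, lower, upper) using a normal approximation.

    Assumes events ~ Poisson, so SE(rate) is roughly rate / sqrt(events).
    The lower bound is clamped at zero because a rate cannot be negative.
    """
    if events <= 0 or population <= 0:
        raise ValueError("events and population must be positive")
    rate = events / population * per
    margin = z * rate / math.sqrt(events)
    return rate, max(0.0, rate - margin), rate + margin


# Two hypothetical areas with the same rate but very different sizes.
for label, events, population in [("small county", 4, 20_000),
                                  ("large state", 400, 2_000_000)]:
    rate, low, high = rate_with_confidence_interval(events, population)
    print(f"{label}: {rate:.1f} per 100,000 (95% CI {low:.1f} to {high:.1f})")
```

Both hypothetical areas have a rate of 20 per 100,000, but the interval for the small county runs from roughly 0.4 to 39.6, while the interval for the large state runs from roughly 18.0 to 22.0. The small county's rate is compatible with almost any underlying risk, so a year-to-year change there may be nothing more than chance.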

Last Updated: 12/01/23