CarePrecise: May 2023

"Artificial intelligence (AI) and machine learning (ML) are used in healthcare to combat unsustainable spending and produce better outcomes with limited resources," says Ben Tuck in a recent article on the healthcare data blog ClosedLoop.ai. The article stresses the importance of keeping algorithmic bias in check, and goes on to offer four steps to address it.

When a system learns from data, particularly a neural network-based system where it is essentially impossible to fully grasp what's happening within the "mind" of the AI, it may rely on data that reflects cultural biases such as racism, sexism, homophobia, ageism, and all of the other stereotyping structures that have become written across our languages, interests, parenting, and habits - whether we can precisely identify them (or openly admit them) or not.

Tuck's post identifies two general causes, or types, of algorithmic bias: subgroup invalidity and label choice bias.

Subgroup Invalidity Bias

Subgroup invalidity arises where the AI isn't up to the task of modeling the behavior of certain subgroups, due to training on homogeneous populations. Tuck offers the example of a study of pulse oximeter algorithms that demonstrated bias as a result of training on non-diverse data. The study found that "Black patients had nearly three times the frequency of occult hypoxemia that was not detected by pulse oximetry as white patients." The possibility for adverse health outcomes is obvious.
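One simple way to look for this kind of problem is to measure how often a model misses true cases in each subgroup separately, rather than across the whole population. What follows is a minimal sketch in Python, not the method used in the study Tuck cites. The record layout, the function name miss_rate_by_group, and the toy data are all invented for illustration.

```python
from collections import defaultdict

def miss_rate_by_group(records):
    """Return the fraction of true cases the model missed, per subgroup.

    Each record is a (group, truly_positive, flagged_by_model) tuple.
    This layout is an assumption made for the illustration.
    """
    positives = defaultdict(int)
    misses = defaultdict(int)
    for group, truly_positive, flagged in records:
        if truly_positive:
            positives[group] += 1
            if not flagged:
                misses[group] += 1
    return {group: misses[group] / positives[group] for group in positives}

# Invented toy data, not taken from any study.
toy_records = [
    ("A", True, True), ("A", True, True), ("A", True, True), ("A", True, False),
    ("B", True, True), ("B", True, False), ("B", True, False), ("B", True, True),
]
rates = miss_rate_by_group(toy_records)
print(rates)
```

A large gap between groups in a report like this does not prove bias on its own, but it is a signal that the model deserves a closer look for the group with the higher miss rate.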

Label Choice Bias

Label choice bias is harder to detect. This is the situation when the AI's process predicts a proxy variable, a stand-in for the real thing when the target metric is unavailable. The use of cost data to predict the need for future healthcare resources is an example. Because Black people experience discrimination that results in their receiving less care than the White population, their past costs can understate their need, yet cost metrics derived from mostly White consumers' episodes are used as though they apply to everyone. An argument can be made that minorities receiving less acute care when needed may actually bias the model in exactly the opposite direction, and the existence of the argument is a strong reason to improve the way the model is built by including race very thoughtfully in the source investigations and in the model's computations.
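The effect of choosing a proxy label can be seen even in a tiny, made-up example. The sketch below, in Python, ranks four invented patients once by past cost (the proxy) and once by a count of chronic conditions (a hypothetical measure of need). The patient data, the field order, and the helper name top_by are assumptions for illustration only.

```python
# Invented toy patients: (patient_id, chronic_conditions, past_cost_usd).
patients = [
    ("p1", 5, 4000),   # high need, low past spending
    ("p2", 5, 9000),
    ("p3", 2, 7000),
    ("p4", 1, 1000),
]

def top_by(rows, index, n=2):
    """Return the ids of the n rows with the largest value at position index."""
    ranked = sorted(rows, key=lambda row: row[index], reverse=True)
    return [row[0] for row in ranked[:n]]

selected_by_cost = top_by(patients, 2)  # ranks on the proxy label
selected_by_need = top_by(patients, 1)  # ranks on the measure of need
print(selected_by_cost, selected_by_need)
```

In this toy data, patient p1 is as sick as p2 but has spent less, so a cost-based ranking passes over p1 in favor of the less sick p3.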

Fixing It

Limiting bias while keeping the models useful is possible, Tuck says. "Organizations are taking major steps to ensure AI/ML is unbiased, fair, and explainable," he writes, pointing to a playbook developed by the Booth School of Business at the University of Chicago - a guide for healthcare organizations and policy makers on catching, quantifying, and reducing bias. Read Ben Tuck's article for steps that can be taken, and review the Algorithmic Bias Playbook for more on how to define, measure, and mitigate bias in AI/ML algorithms.

-------------------

CarePrecise is a supplier of authoritative healthcare provider data and insights used across the healthcare community.

Record linkage is a term referring to technologies that make it possible to merge data on people and organizations from multiple, disparate sources. Early development of the technology was largely related to marketing, for instance as a means of connecting magazine subscribers' contact information to sales records belonging to retail stores. It's still used that way (more than ever), but some very important applications have emerged since those early days in the 1950s and 1960s, when computers filled whole rooms and developing highly complex software that would use years of run time was pointless.

CarePrecise uses record linkage to create business intelligence datasets from a broad range of information available through the U.S. Department of Health and Human Services, Department of Commerce, USPS, and other resources. For example, by merging Medicare claims data with NPI registry data and other federal data sources, we can build a 360 degree view of the U.S. healthcare system - from the health systems to the hospitals to the medical practice groups and clinics, to individual clinicians. Today, record linkage is also making significant inroads in improving patient care.

What is record linkage technology and how does it work?

Record linkage is becoming a vital tool for getting the most out of many types of data. Record linkage technology works by creating a unique identifier for each patient that is used to combine information from multiple sources. There are two general types of record linkage: Exact (deterministic) matching and statistical (probabilistic) matching.

Disambiguation. Exact matching is, of course, ideal. Linking records on email addresses or tax identification numbers is an excellent example. "Disambiguation" occurs when otherwise disconnected data can be "hard matched" to create an unambiguous match, for which one unique identifier - a number or other code - can be assigned.
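As a rough illustration of exact matching, the Python sketch below pairs records from two lists whenever they share the same value in a key field. The function name link_exact, the field names, and the sample records are hypothetical; a production system would also normalize the key (case, whitespace) before comparing.

```python
def link_exact(left, right, key):
    """Pair up records from two lists of dicts that share the same key value."""
    index = {}
    for record in right:
        value = record.get(key)
        if value:  # skip records with a missing or empty identifier
            index.setdefault(value, []).append(record)
    pairs = []
    for record in left:
        for match in index.get(record.get(key), []):
            pairs.append((record, match))
    return pairs

# Invented sample data for illustration.
subscribers = [
    {"name": "Pat Lee", "email": "pat@example.com"},
    {"name": "Sam Roy", "email": "sam@example.com"},
]
customers = [
    {"customer": "P. Lee", "email": "pat@example.com"},
    {"customer": "Alex Kim", "email": "alex@example.com"},
]
links = link_exact(subscribers, customers, "email")
print(links)
```

Each linked pair could then be assigned a single shared identifier, which is the "unique identifier" idea described above.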

Arriving at an unambiguous match may not be as easy as comparing Social Security Numbers. That's when we turn to statistical matching. This is trickier, and almost always less reliable. Probabilistic record linkage uses "fuzzy" matching algorithms to compare data points and make links between different records that may not have the same exact details. For example, if two records had similar birth dates or home addresses, the algorithm would recognize these as potential matches and create a statistical link between them.
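To make the idea of a statistical link concrete, here is a minimal Python sketch that scores two records by comparing several fields and combining the results with weights. It uses the standard library's difflib.SequenceMatcher for string similarity. The field names and weights are hypothetical; real probabilistic linkage systems typically estimate their weights from data rather than hard-coding them.

```python
from difflib import SequenceMatcher

# Hypothetical field weights, chosen only for illustration.
WEIGHTS = {"name": 0.5, "birth_date": 0.3, "address": 0.2}

def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Combine per-field similarities into one weighted score between 0 and 1."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())

# Invented sample records.
a = {"name": "Jon Smith", "birth_date": "1980-04-12", "address": "12 Oak St"}
b = {"name": "John Smith", "birth_date": "1980-04-12", "address": "12 Oak Street"}
c = {"name": "Maria Garcia", "birth_date": "1975-06-30", "address": "900 Pine Ave"}

likely = match_score(a, b)
unlikely = match_score(a, c)
print(round(likely, 3), round(unlikely, 3))
```

A linkage process would then pick a threshold: pairs scoring above it are treated as matches, and borderline pairs may be set aside for review.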

Relying on one or a few non-deterministic data points to match records is, naturally, a bad idea. People tend to change home addresses several times over their lifetimes, so using a street address, or phone number or email address, for that matter, would likely miss a number of records. Also, even if these markers have remained constant, another problem, frequently referred to as "fat fingering," occurs when a name, address, phone, etc. is wrongly entered in a database.

Deliberate ambiguation. Early techniques for reducing this kind of ambiguity between datasets included creating a data field in which all of the vowels are removed from a name or street address. This "works" because numbers and consonants are statistically far less likely to be typed incorrectly. Not a good system, but better than nothing. "False positives," when records are matched that shouldn't be, and "false negatives," when records that should be matched aren't, abound when only this ham-handed method is used, but it can still be a part of the record linkage process. Where patient data is involved, and where scientists are relying on clean data to glean truth, much more must be done.
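The vowel-stripping trick is easy to sketch in Python. The function name consonant_key and the sample strings below are invented for illustration; the example shows both a typo being absorbed and a false positive slipping through.

```python
def consonant_key(text):
    """Uppercase the text and keep only digits and non-vowel letters."""
    return "".join(
        ch for ch in text.upper() if ch.isalnum() and ch not in "AEIOU"
    )

# A transposed-letter typo produces the same key as the correct spelling.
typo_absorbed = consonant_key("123 Maple Avenue") == consonant_key("123 Mapel Avenue")

# Two different names also collapse to the same key: a false positive.
false_positive = consonant_key("Dean") == consonant_key("Dona")

print(consonant_key("123 Maple Avenue"), typo_absorbed, false_positive)
```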

Tighter matching for critical healthcare data

Data that can be linked include sensitive medical records, hospital records, laboratory tests, insurance claims data and administrative databases. When used for research involving patient records, record linkage often involves matching information from multiple sources to create a single unified patient record identifier, sometimes called a Master Patient Identifier (MPI), that can be used to track and analyze health outcomes over time. By combining different datasets, researchers can gain insights into the effectiveness of treatments and interventions, as well as uncover patterns in disease progression or risk factors that would not be visible if looking at one dataset alone.

As data science developed, and much larger datasets became available, scholarly efforts to improve record matching began to emerge. Systems that compare text strings and score the difference have been among these methods. An algorithm known as Soundex compares text strings phonetically; the words "Mary" and "Merry" might score poorly on a character-by-character comparison, but Soundex can add weight to the match because the words sound alike.
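For the curious, here is a compact Python sketch of Soundex following the commonly described American Soundex rules: keep the first letter, encode the remaining consonants as digits, skip vowels, collapse adjacent repeats, and pad to four characters. The function and variable names are our own for this illustration, and record linkage products may use different variants.

```python
# Map each coded consonant to its Soundex digit.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character Soundex code for a name ("" if no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "HW":  # H and W do not separate repeated codes
            previous = code
    return (result + "000")[:4]

print(soundex("Mary"), soundex("Merry"))  # both print M600
```

Because both names produce the same code, a matching process can treat the pair as phonetically equivalent even though the spellings differ.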

Other fuzzy-logic methods exist, and can even be bought as part of record linkage software. "Standardization" essentially means making all of the same kinds of data appear the same way across different datasets. One such technique is address standardization, based either on proprietary technologies such as the CoLoCode technique developed by CarePrecise, or other, less precise, methods such as the USPS "Pub 28" standard. Getting mail delivered properly is important, to be sure, but the post office has the benefit of mail carriers' knowledge of their routes and the human ability to disambiguate on the fly. When comparing thousands or millions of rows of data, as is not unusual in medical research applications, "eyeballing" is not an option.
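A very small Python sketch shows what standardization means in practice. It uppercases an address, strips punctuation, and swaps a handful of words for the abbreviations used in USPS Publication 28. This is only an illustration of the general idea: it is not the CoLoCode technique, it covers just a few of the many Pub 28 rules, and the function name standardize_address is hypothetical.

```python
import re

# A few abbreviations from USPS Publication 28; the full standard has many more.
SUFFIXES = {
    "STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD",
    "ROAD": "RD", "DRIVE": "DR", "SUITE": "STE",
}

def standardize_address(address):
    """Uppercase, remove punctuation, and abbreviate known words."""
    tokens = re.sub(r"[^\w\s]", " ", address.upper()).split()
    return " ".join(SUFFIXES.get(token, token) for token in tokens)

first = standardize_address("123 Main Street, Suite 4")
second = standardize_address("123 main st. ste 4")
print(first, "|", second)
```

Once both versions reduce to the same string, an exact comparison succeeds where the raw text would have failed.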

Rather than get too deep in the weeds here, a fine elucidation on record linkage in medicine can be found on the National Library of Medicine website.

Benefits of record linkage technology in medicine

Data merged from many sources can provide a more comprehensive view of the patient, allowing researchers to make more accurate and reliable conclusions about healthcare outcomes. By combining multiple datasets, researchers can gain deeper insight into medical conditions and how treatments affect patients over time. It also makes it easier to compare health outcomes across different populations, as well as detect potential errors or risks in patient care.

Additionally, record linkage technology can be used to reduce medical costs and improve efficiency in the healthcare system. By linking administrative databases with clinical data, researchers can better understand why certain treatments cost more than others and identify areas where cost savings can be made. This could lead to improved healthcare decisions, including changes in treatment protocols or resource allocations.

Record linkage has also been used to analyze the prevalence of medical conditions in various populations, create predictive models for patient care, and identify potential drug interactions. All of these studies have helped to improve our understanding of healthcare outcomes and inform decisions about how best to provide care for different patient groups.

Researchers at the University of California‐San Francisco used record linkage to combine patient records from different providers and examine how electronic medical records could be used to improve care coordination.

Challenges in using record linkage technology

Despite the many potential benefits of record linkage technology, there are still challenges that must be overcome. Lack of standardization between datasets can make it difficult for algorithms to identify matches, and data quality issues can lead to incorrect links or missing information.

Additionally, privacy concerns arise when combining multiple datasets, as linking patient records can reveal identifying information about individuals. In order to ensure that patient data is kept secure and confidential, there must be safeguards in place to prevent unauthorized access or misuse of the information. This includes developing secure protocols for data sharing, as well as strong regulations for protecting patient privacy.

It is important to consider the ethics of combining multiple datasets in order to identify a single patient. This could lead to potential issues such as discrimination or stigmatization, and researchers must make sure that they are adhering to ethical codes when collecting and analyzing data.

These issues must be addressed in order to ensure that record linkage technology is used responsibly and efficiently. Solutions such as secure data sharing protocols, improved standards for data quality, and rigorous processes for privacy can help researchers harness the power of record linkage technology while protecting patient privacy.
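One building block sometimes used for safer data sharing is to exchange keyed hashes of identifying fields instead of the fields themselves. The Python sketch below uses the standard library's hmac and hashlib modules; the function name linkage_token, the choice of fields, and the sample key are assumptions for illustration. It supports exact matching only, and real privacy-preserving record linkage protocols involve considerably more than this.

```python
import hashlib
import hmac

def linkage_token(secret_key, *fields):
    """Return a keyed SHA-256 digest of the normalized identifying fields."""
    normalized = "|".join(field.strip().lower() for field in fields)
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical shared secret; in practice it would be managed securely.
key = b"example-shared-secret"

token_a = linkage_token(key, "Pat Lee", "1980-04-12")
token_b = linkage_token(key, " pat lee ", "1980-04-12")
token_other_key = linkage_token(b"different-secret", "Pat Lee", "1980-04-12")
print(token_a == token_b, token_a == token_other_key)
```

Two organizations holding the same key can compare tokens to find shared patients without exchanging names or birth dates, while anyone without the key cannot easily recompute the tokens.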

Examples of recent uses of advanced record linkage technology in medical research

Mortality in children under 5 years of age with congenital syphilis in Brazil: A nationwide cohort study

Causes of death in children with congenital Zika syndrome in Brazil, 2015 to 2018 - PLOS

Using privacy preserving record linkage to understand deaths by political affiliation during ...

May 22, 2023

Algorithmic Bias in Healthcare AI