Removing uncertainty in real-world evidence and under-represented populations


Executive summary

Under-represented patient populations introduce uncertainty into the evidence used for drug development when results are applied beyond the patients included in the data. This affects feasibility, safety assessment, analytical performance, and real-world outcomes. 

This article examines: 

  • How data structure, access, and governance determine whether evidence can be generated  
  • How population gaps affect the reliability of evidence used in drug development and decision-making  
  • Where disease burden and available datasets diverge globally  
  • Why data volume alone does not resolve these gaps  

Drug development relies on robust evidence generated in defined populations and applied to broader patient groups. When those broader groups are weakly represented in the underlying data, error margins widen and model performance degrades outside the original population. This matters for: 

  • Feasibility assessments 
  • Safety characterization 
  • Treatment response expectations 
  • Real-world evidence and post-market studies 
  • AI and advanced analytics 

Population coverage determines how reliably evidence can be applied across different populations. 
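
The widening of error margins can be illustrated with a short calculation. The sketch below uses the standard normal approximation for the margin of error of an observed rate; the event rate and cohort sizes are invented for illustration and do not come from any dataset discussed in this article.

```python
from math import sqrt


def margin_of_error(rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed rate (normal approximation)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    return z * sqrt(rate * (1 - rate) / n)


# Invented example: the same 10% adverse-event rate observed in a large cohort
# and in a small cohort drawn from an under-represented population.
well_represented = margin_of_error(rate=0.10, n=10_000)
under_represented = margin_of_error(rate=0.10, n=200)

print(f"n=10,000: 10% +/- {well_represented:.1%}")
print(f"n=200:    10% +/- {under_represented:.1%}")
```

Under these assumptions the same observed rate carries a margin roughly seven times wider in the smaller cohort, which is the kind of uncertainty that population coverage is meant to reduce.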

When disease burden and evidence do not match 

Most datasets used in drug development come from a limited set of patient populations, while a large share of the global disease burden often lies in populations that are likely under-represented in those datasets. 

Asia, the Middle East, Africa, and Latin America account for a large share of global cases for serious diseases and conditions including liver cancer, nasopharyngeal carcinoma, IgA nephropathy, thalassemia, and G6PD deficiency. These diseases are often studied using Western-centric datasets and trial populations, which do not reflect the regions where they are most prevalent and can limit the relevance of resulting evidence.

At the same time, populations that represent a large share of global patients remain minimally represented. According to the PHG Foundation:

Despite comprising nearly 25% of the world’s population, South Asians make up less than 2% of participants in global genome-wide association studies (GWAS).¹

This gap introduces uncertainty when evidence generated in one population informs decisions in another. This means that treatment efficacy and safety profiles may differ when applied to populations that are not well represented in the data. 
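
The size of this gap can be expressed as a simple representation ratio: a group's share of study participants divided by its share of the population. The sketch below uses the rounded figures from the PHG Foundation quote above (nearly 25% and less than 2%), so the result is an approximate ceiling rather than a measured value. The function name is illustrative only.

```python
def representation_ratio(dataset_share: float, population_share: float) -> float:
    """Share of study participants divided by share of the population.

    1.0 means proportional representation; values below 1.0 mean the group
    is under-represented relative to its size.
    """
    if not 0 < population_share <= 1:
        raise ValueError("population_share must be in (0, 1]")
    if not 0 <= dataset_share <= 1:
        raise ValueError("dataset_share must be in [0, 1]")
    return dataset_share / population_share


# Rounded figures from the quote: "less than 2%" of GWAS participants,
# "nearly 25%" of the world's population.
ratio = representation_ratio(dataset_share=0.02, population_share=0.25)
print(f"Representation ratio: roughly {ratio:.2f} or lower")
```

A ratio near 0.08 means the group appears in the data at less than one tenth of the rate its population size would suggest.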

When population categories are too broad for analysis 

In many datasets, patients are grouped into broad categories such as “Asian” or “South Asian.” These categories combine populations that differ in genetics, disease risk, and treatment response. As the PHG Foundation notes:

This oversimplification obscures the deep internal diversity of South Asian populations.¹ 

Genetic variants that affect how drugs are metabolized can vary widely within these groups. As the PHG Foundation highlights:

Variants in these genes didn’t just differ from European populations – they varied widely within South Asian subgroups.¹ 

This grouping hides differences in drug response, safety, and biomarker signals within the same population. As a result, important variations between patient subgroups can be missed. Capturing and structuring data at a more granular level allows these subgroup differences to be identified and analyzed. 
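
A small sketch shows how a pooled category can hide subgroup variation. The subgroup names and carrier counts below are entirely invented for illustration. They stand in for any drug-metabolism variant whose frequency differs between groups that a dataset labels with a single broad category.

```python
# Invented carrier counts for a hypothetical drug-metabolism variant in three
# subgroups that a dataset might record under one broad label.
subgroups = {
    "subgroup_a": {"carriers": 30, "n": 1000},
    "subgroup_b": {"carriers": 120, "n": 1000},
    "subgroup_c": {"carriers": 300, "n": 1000},
}

# Frequency reported when only the broad category is stored.
pooled_carriers = sum(g["carriers"] for g in subgroups.values())
pooled_n = sum(g["n"] for g in subgroups.values())
pooled_frequency = pooled_carriers / pooled_n

# Frequencies visible when subgroup labels are preserved.
per_group = {name: g["carriers"] / g["n"] for name, g in subgroups.items()}
spread = max(per_group.values()) - min(per_group.values())

print(f"Pooled frequency: {pooled_frequency:.0%}")
for name, freq in per_group.items():
    print(f"  {name}: {freq:.0%}")
print(f"Range hidden by pooling: {spread:.0%}")
```

In this invented case the pooled figure of 15% describes none of the three subgroups well, which is why subgroup labels need to be captured at collection time rather than reconstructed later.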

Data exists, but cannot be reused  

Large volumes of data are generated across under-represented regions such as Asia, the Middle East, Africa, and Latin America, including clinical records, claims data, registries, biobanks, and national genomic programs. These datasets are stored in separate systems, follow different standards, and are not linked across clinical, genomic, and imaging data. This prevents them from being combined, compared, or reused across studies.

India illustrates this gap: The country represents nearly 20% of the world’s population, yet populations from this region remain under-represented in global genomic datasets, accounting for less than 2% of participants in genome-wide association studies. The same pattern appears across Asian and Middle East / North Africa (MENA) markets, where large volumes of clinical and genomic data are generated within national health systems, hospital networks, and research programs, but remain difficult to combine or reuse across countries and studies. Data volume does not reduce uncertainty if datasets lack longitudinal structure, harmonization, and interoperability.

When available data does not improve representation 

Adding more data from under-represented regions does not automatically improve population coverage. Data may exist in local healthcare systems, registries, or national programs, but if it cannot be combined with other datasets, it remains isolated. This means that populations remain under-represented in the datasets used for global analysis, even when data has been collected. 

A 2023 Deloitte Health Forward blog, Intentional data collection could help advance health equity, reports: 

Medical data often fails to provide a complete picture … because it is often incomplete, biased, or both.²

For pharma teams, this means that expanding into new regions or adding new datasets does not necessarily improve evidence. In practice, increasing data volume does not resolve population gaps unless that data can be integrated and analyzed alongside existing datasets. Improving population coverage therefore requires not only access to data, but the ability to use it across studies and regions. 

Population gaps surface as operational risk 

These limitations affect how clinical trials and real-world studies are designed, conducted, and used to make decisions across regions. 

  • Feasibility estimates fail to translate across regions: Patient availability, eligibility criteria, and disease characteristics differ across populations. Cohorts identified in one region may not be replicable in another, leading to delays in site selection and recruitment. 

  • Safety signals remain under-characterized: If certain populations are under-represented, adverse events specific to those groups may not be detected before approval, increasing uncertainty around safety in real-world use. 

  • Real-world outcomes diverge from expectations: Treatment response observed in trials or initial datasets may not match outcomes in broader populations due to differences in baseline risk, comorbidities, or care pathways.

  • Additional studies become necessary post-approval: When evidence does not transfer across populations, sponsors may need to conduct additional studies or generate new real-world evidence to support use in specific regions. 

When results from one population are used in another, they may not hold. International Council for Harmonisation guideline E5(R1) states:

Ethnic differences may affect a medicine’s safety, efficacy, dosage or dose regimen.³

If a population is not included in the original data, new data must be generated in that population:

The sponsor may need to generate … clinical data in the new region in order to ‘bridge’ the clinical data between the two regions.³

For pharma teams, this means running additional studies or generating new real-world evidence to demonstrate that results are valid in each target population. This increases development timelines, costs, and operational complexity.  

Improving population coverage earlier in the process reduces the need for these additional studies and supports faster, more reliable evidence generation across regions. 

Why infrastructure and governance matter 

Improving population coverage requires more than access to additional datasets. Data from multiple regions must also be comparable and usable within the same analysis. 

The first challenge is data harmonization. Clinical, genomic, imaging, and real-world data are captured in different formats, use different data models, and reflect local clinical practices. Without harmonization, these datasets cannot be analyzed together in a consistent way. 

Addressing this requires mapping datasets to a common data model and standardizing clinical concepts. Diagnoses, treatments, and outcomes are aligned to shared terminologies, and variables such as lab values are normalized so they can be compared across datasets. In practice, this often relies on recognized standards such as OMOP, FHIR, or SDTM, which provide common structures and definitions for organizing and exchanging health data across systems. Without this step, the same condition or outcome may be recorded differently across datasets, making combined analysis unreliable. 
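
The mapping and normalization steps described above can be sketched in a few lines. This is a minimal illustration, not an implementation of OMOP, FHIR, or SDTM: the site names, local codes, and the shared concept name are hypothetical placeholders. The only real-world constant is the glucose conversion factor (1 mmol/L is about 18.016 mg/dL, from a molar mass of roughly 180.16 g/mol).

```python
from dataclasses import dataclass

# Hypothetical local-to-shared concept map. A real project would map to a
# standard vocabulary; these site names and codes are placeholders.
CONCEPT_MAP = {
    ("site_a", "GLU"): "blood_glucose",
    ("site_b", "glucose_serum"): "blood_glucose",
}

# Conversion factors into the target unit (mmol/L) per shared concept.
UNIT_FACTORS_TO_MMOL_L = {
    ("blood_glucose", "mmol/L"): 1.0,
    ("blood_glucose", "mg/dL"): 1.0 / 18.016,
}


@dataclass(frozen=True)
class LabRecord:
    site: str
    local_code: str
    value: float
    unit: str


@dataclass(frozen=True)
class HarmonizedLab:
    concept: str
    value_mmol_l: float


def harmonize(record: LabRecord) -> HarmonizedLab:
    """Map a site-specific lab record to a shared concept and a common unit."""
    concept = CONCEPT_MAP.get((record.site, record.local_code))
    if concept is None:
        raise KeyError(f"No concept mapping for {record.site}/{record.local_code}")
    factor = UNIT_FACTORS_TO_MMOL_L.get((concept, record.unit))
    if factor is None:
        raise KeyError(f"No unit conversion for {concept} in {record.unit}")
    return HarmonizedLab(concept, round(record.value * factor, 2))


# Two invented records for comparable glucose results, recorded differently.
records = [
    LabRecord("site_a", "GLU", 99.0, "mg/dL"),
    LabRecord("site_b", "glucose_serum", 5.5, "mmol/L"),
]
harmonized = [harmonize(r) for r in records]
for row in harmonized:
    print(row)
```

Raising an error on unmapped codes or units, rather than passing values through silently, is a deliberate choice in this sketch: an unharmonized value that slips into a combined analysis is harder to detect than a failed mapping.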

The second challenge is regulatory. Even when data can be harmonized, it cannot always be centralized or transferred across borders. Data sovereignty requirements, local governance rules, and institutional constraints limit how data can be accessed and used. This is where trusted research environments (TREs) become necessary. 

A TRE provides a secure, governed workspace where approved users can access and analyze sensitive health data without copying or exporting it. Data remains under the control of the original data holder, while access rules, audit trails, and usage controls are enforced by design. This allows multiple datasets to be analyzed together across institutions and countries without being pooled into a single repository. Instead of moving data, analysis is brought to the data. For pharma teams, this changes how evidence is generated. Data from different regions can be analyzed together within a single study, rather than requiring separate studies in each geography. 
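
One common way to bring analysis to the data is for each site to compute summary statistics locally and share only those aggregates. The sketch below illustrates that general pattern for a pooled mean and standard deviation. It is an assumption about how such an analysis could be structured, not a description of any specific TRE product, and the site values are invented.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates that leave a site; no patient-level rows are included."""
    n: int
    total: float
    total_sq: float


def summarize_locally(values: list[float]) -> SiteSummary:
    """Runs inside the site's own environment and returns only aggregates."""
    return SiteSummary(len(values), sum(values), sum(v * v for v in values))


def combine(summaries: list[SiteSummary]) -> tuple[float, float]:
    """Pooled mean and sample standard deviation from per-site aggregates."""
    n = sum(s.n for s in summaries)
    if n < 2:
        raise ValueError("Need at least two observations across sites")
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean**2) / (n - 1)
    return mean, sqrt(max(variance, 0.0))


# Invented lab values held at two sites that cannot export row-level data.
site_a_values = [5.1, 5.4, 6.0]
site_b_values = [4.8, 5.9]

pooled_mean, pooled_sd = combine(
    [summarize_locally(site_a_values), summarize_locally(site_b_values)]
)
print(f"Pooled mean: {pooled_mean:.2f}, pooled SD: {pooled_sd:.2f}")
```

The result matches what a single pooled dataset would give, yet only three numbers per site cross the boundary. A production setup would typically add disclosure controls such as minimum cell counts, and would likely use a more numerically stable variance formula than the sum-of-squares shortcut shown here.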

TREs also enable collaboration across organizations. Research teams, partners, and sponsors can work on the same datasets under shared governance, with full traceability of how data is accessed and used. This is essential for generating regulatory-grade evidence. 

Combined with harmonization, TREs make it possible to include data from under-represented regions in analyses that were previously limited to a small number of patients. 

Reducing uncertainty through expanding population coverage 

Implementing this model at scale requires three capabilities: access to diverse datasets, harmonization into a common structure, and infrastructure that allows data to be analyzed across regions in compliance with local privacy and governance standards. 

BC Platforms brings these capabilities together through our global data partner network of 150+ organizations across Europe, Asia, the Middle East, Africa, and Latin America. This includes partnerships with organizations such as GeneVault Lifesciences and OmicsBank, who bring access to under-represented populations in these areas. Our global data partner network provides access to more than 187 million patient lives across over 35 countries, with more than 90% of data originating outside the United States.

Data is harmonized into research-ready datasets and made available through trusted research environments and federated architectures, enabling analysis across regions while respecting local governance and data residency requirements. This allows life sciences teams to work with analysis-ready, multi-modal datasets that can be used consistently across studies, rather than rebuilding datasets for each region. More representative data can be used earlier in development, reducing the need for region-specific validation studies and improving the transferability of results across geographies. 

For under-represented populations, it changes how data contributes to research. Instead of remaining isolated in local systems, data can be included in global analyses and used to inform development, safety, and treatment strategies that specifically target the patient profile for a given therapy. Improved population coverage enables: 

  • More accurate feasibility assessments across regions: Cohorts can be identified using data that reflects local patient populations, improving estimates of patient availability, eligibility, and site selection.  

  • Better characterization of safety and treatment response in diverse populations: Including a wider range of patient groups helps detect differences in adverse events, drug metabolism, and treatment outcomes that may not appear in limited datasets.

  • Stronger and more transferable regulatory evidence: Evidence generated across multiple populations is more likely to meet regulatory expectations and reduces the need for additional bridging studies in new regions.   

  • More reliable real-world outcomes: Observed treatment effects better reflect actual clinical practice across geographies, improving confidence in post-launch performance and long-term health impact.  

Relevant population coverage is not only a question of data availability; it determines whether evidence reflects the populations a therapy is intended for. When under-represented populations are excluded, uncertainty persists. When they are included, evidence becomes more reliable and transferable across regions – and therapies are developed with greater precision based on the unique genetic needs of that target population.

Are you using the right evidence base in your research? 

Including the right patient population in your evidence base is critical for drug development, regulatory approvals and market access. Let’s talk about how BC Platforms can provide representative real-world data that supports feasibility, safety, and regulatory submissions in your markets of interest. 

References 

  1. PHG Foundation. Closing the genomic data diversity gap: lessons from South Asia. 2023. 
    https://www.phgfoundation.org/report/closing-the-genomic-data-diversity-gap-lessons-from-south-asia  
  2. Deloitte. Intentional data collection could help advance health equity. Deloitte Health Forward Blog, 2023. 
    https://www2.deloitte.com/us/en/blog/health-forward/2023/intentional-data-collection-could-help-advance-health-equity.html  
  3. International Council for Harmonisation (ICH). E5(R1): Ethnic Factors in the Acceptability of Foreign Clinical Data. 1998. 
    https://www.ema.europa.eu/en/documents/scientific-guideline/ich-e-5-r1-ethnic-factors-acceptability-foreign-clinical-data-step-5_en.pdf