Drug development relies on robust evidence generated in defined populations and applied among broader patient groups. When those populations are weakly represented in the underlying data, error margins widen and model performance degrades when applied outside the original population. This matters for:
Population coverage determines how reliably evidence can be applied across different populations.
Most datasets used in drug development come from a limited set of patient populations, while a large share of the global disease burden often lies in populations that are likely under-represented in those datasets.
Asia, the Middle East, Africa, and Latin America account for a large share of global cases for serious diseases and conditions including liver cancer, nasopharyngeal carcinoma, IgA nephropathy, thalassemia, and G6PD deficiency. These diseases are often studied using Western-centric datasets and trial populations, which do not reflect the regions where they are most prevalent and can limit the relevance of resulting evidence.
According to the PHG Foundation:
Over 80% of the genomic data driving these advances comes from people of European ancestry, creating a biased foundation for medical research and clinical decision-making.¹
At the same time, populations that represent a large share of global patients remain minimally represented. For example:
Despite comprising nearly 25% of the world’s population, South Asians make up less than 2% of participants in global genome-wide association studies (GWAS).¹
This gap introduces uncertainty when evidence generated in one population informs decisions in another. This means that treatment efficacy and safety profiles may differ when applied to populations that are not well represented in the data.
In many datasets, patients are grouped into broad categories such as “Asian” or “South Asian.” These categories combine populations that differ in genetics, disease risk, and treatment response. As the PHG Foundation notes:
This oversimplification obscures the deep internal diversity of South Asian populations.¹
Genetic variants that affect how drugs are metabolized can vary widely within these groups. As the PHG Foundation highlights:
Variants in these genes didn’t just differ from European populations – they varied widely within South Asian subgroups.¹
This grouping hides differences in drug response, safety, and biomarker signals within the same population. As a result, important variations between patient subgroups can be missed. Capturing and structuring data at a more granular level allows these subgroup differences to be identified and analyzed.
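The masking effect of broad categories can be shown with simple arithmetic. The sketch below uses invented carrier counts for a hypothetical drug-metabolism variant in three hypothetical subgroups. The subgroup names and all numbers are assumptions made purely for illustration, not real frequencies.

```python
# Synthetic carrier counts for a hypothetical drug-metabolism variant.
# These numbers are invented for illustration and are NOT real frequencies.
SUBGROUP_COUNTS = {
    "subgroup_a": {"carriers": 30, "n": 200},
    "subgroup_b": {"carriers": 5, "n": 200},
    "subgroup_c": {"carriers": 60, "n": 200},
}


def carrier_frequency(carriers, n):
    """Share of participants carrying the variant."""
    return carriers / n


def pooled_frequency(counts):
    """Frequency reported when all subgroups are collapsed into one category."""
    total_carriers = sum(c["carriers"] for c in counts.values())
    total_n = sum(c["n"] for c in counts.values())
    return carrier_frequency(total_carriers, total_n)


pooled = pooled_frequency(SUBGROUP_COUNTS)
by_subgroup = {
    name: carrier_frequency(c["carriers"], c["n"])
    for name, c in SUBGROUP_COUNTS.items()
}

print(f"pooled: {pooled:.3f}")
for name, freq in by_subgroup.items():
    print(f"{name}: {freq:.3f}")
```

With these made-up counts, the pooled category reports a single frequency of roughly 0.158, while the underlying subgroups range from 0.025 to 0.30. A dosing or safety signal tied to the variant would look very different depending on which level of granularity is recorded.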
Large volumes of data are generated across under-represented regions such as Asia, the Middle East, Africa, and Latin America, including clinical records, claims data, registries, biobanks, and national genomic programs. These datasets are stored in separate systems, follow different standards, and are not linked across clinical, genomic, and imaging data. This prevents them from being combined, compared, or reused across studies.
India illustrates this gap: the country represents nearly 20% of the world’s population, yet populations from this region remain under-represented in global genomic datasets, accounting for less than 2% of participants in genome-wide association studies. The same pattern appears across Asian and Middle East / North Africa (MENA) markets, where large volumes of clinical and genomic data are generated within national health systems, hospital networks, and research programs, but remain difficult to combine or reuse across countries and studies. Data volume does not reduce uncertainty if datasets lack longitudinal structure, harmonization, and interoperability.
Adding more data from under-represented regions does not automatically improve population coverage. Data may exist in local healthcare systems, registries, or national programs, but if it cannot be combined with other datasets, it remains isolated. This means that populations remain under-represented in the datasets used for global analysis, even when data has been collected.
A 2023 Deloitte Health Forward blog, Intentional data collection could help advance health equity, reports:
Medical data often fails to provide a complete picture … because it is often incomplete, biased, or both.²
For pharma teams, this means that expanding into new regions or adding new datasets does not necessarily improve evidence. In practice, increasing data volume does not resolve population gaps unless that data can be integrated and analyzed alongside existing datasets. Improving population coverage therefore requires not only access to data, but the ability to use it across studies and regions.
These limitations affect how clinical trials and real-world studies are designed, conducted, and used to make decisions across regions.
When results from one population are used in another, they may not hold. International Council for Harmonisation guideline E5(R1) states:
Ethnic differences may affect a medicine’s safety, efficacy, dosage or dose regimen.³
If a population is not included in the original data, new data must be generated in that population:
The sponsor may need to generate … clinical data in the new region in order to “bridge” the clinical data between the two regions.³
For pharma teams, this means running additional studies or generating new real-world evidence to demonstrate that results are valid in each target population. This increases development timelines, costs, and operational complexity.
Improving population coverage earlier in the process reduces the need for these additional studies and supports faster, more reliable evidence generation across regions.
But improving population coverage requires more than access to additional datasets. Data from multiple regions must be comparable and usable within the same analysis.
The first challenge is data harmonization. Clinical, genomic, imaging, and real-world data are captured in different formats, use different data models, and reflect local clinical practices. Without harmonization, these datasets cannot be analyzed together in a consistent way.
Addressing this requires mapping datasets to a common data model and standardizing clinical concepts. Diagnoses, treatments, and outcomes are aligned to shared terminologies, and variables such as lab values are normalized so they can be compared across datasets. In practice, this often relies on recognized standards such as OMOP, FHIR, or SDTM, which provide common structures and definitions for organizing and exchanging health data across systems. Without this step, the same condition or outcome may be recorded differently across datasets, making combined analysis unreliable.
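As a rough illustration of this step, the sketch below maps two differently coded diagnosis fields to one shared concept label and normalizes a lab value to a common unit. It is a minimal sketch under stated assumptions: the site names, record layout, `CONCEPT_MAP` table, and concept label are hypothetical, and a real project would rely on a curated vocabulary mapping within a standard such as OMOP rather than a hand-written dictionary.

```python
from dataclasses import dataclass

# Hypothetical lookup: (source system, local code) -> shared concept label.
# Both entries describe the same diagnosis, recorded differently at each site.
CONCEPT_MAP = {
    ("site_a", "C22.0"): "hepatocellular_carcinoma",  # ICD-10 style code
    ("site_b", "HCC"): "hepatocellular_carcinoma",    # local abbreviation
}

# Multipliers that convert a hemoglobin value to grams per litre.
HEMOGLOBIN_TO_G_PER_L = {"g/L": 1.0, "g/dL": 10.0}


@dataclass(frozen=True)
class HarmonizedRecord:
    patient_id: str
    concept: str
    hemoglobin_g_per_l: float


def harmonize(source, record):
    """Translate one source record into the shared structure."""
    key = (source, record["dx_code"])
    if key not in CONCEPT_MAP:
        # Fail loudly instead of silently dropping unmapped codes.
        raise KeyError(f"No concept mapping for {key}")
    factor = HEMOGLOBIN_TO_G_PER_L[record["hb_unit"]]
    return HarmonizedRecord(
        patient_id=f"{source}:{record['id']}",
        concept=CONCEPT_MAP[key],
        hemoglobin_g_per_l=round(record["hb_value"] * factor, 1),
    )


# Hypothetical input records from two sites with different conventions.
site_a = [{"id": "001", "dx_code": "C22.0", "hb_value": 13.2, "hb_unit": "g/dL"}]
site_b = [{"id": "17", "dx_code": "HCC", "hb_value": 128.0, "hb_unit": "g/L"}]

harmonized = [harmonize("site_a", r) for r in site_a]
harmonized += [harmonize("site_b", r) for r in site_b]

for row in harmonized:
    print(row)
```

After this step both records carry the same concept label and a hemoglobin value in the same unit, so they can be filtered and compared in a single query. Raising an error on unmapped codes is a deliberate choice here, since silently skipping them would reintroduce the coverage gaps that harmonization is meant to close.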
Once data is harmonized, the remaining challenge is regulatory: harmonized data cannot always be centralized or transferred across borders. Data sovereignty requirements, local governance rules, and institutional constraints limit how data can be accessed and used. This is where trusted research environments (TREs) become necessary.
A TRE provides a secure, governed workspace where approved users can access and analyze sensitive health data without copying or exporting it. Data remains under the control of the original data holder, while access rules, audit trails, and usage controls are enforced by design. This allows multiple datasets to be analyzed together across institutions and countries without being pooled into a single repository. Instead of moving data, analysis is brought to the data. For pharma teams, this changes how evidence is generated. Data from different regions can be analyzed together within a single study, rather than requiring separate studies in each geography.
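The idea of bringing analysis to the data can be sketched in a few lines. In the simplified example below, each site computes a local summary inside its own environment and only that aggregate is shared with a coordinator, which combines the summaries into a pooled mean. This is an illustrative assumption about how a federated query might work in general, not a description of any specific TRE product. The function names, the synthetic values, and the `MIN_CELL_SIZE` threshold are all hypothetical.

```python
from typing import NamedTuple

# Hypothetical disclosure-control rule: refuse to release very small cells.
MIN_CELL_SIZE = 5


class LocalSummary(NamedTuple):
    site: str
    n: int
    total: float


def summarize_locally(site, values):
    """Runs inside the data holder's environment; row-level data never leaves."""
    if len(values) < MIN_CELL_SIZE:
        raise ValueError(f"{site}: too few records to release a summary")
    return LocalSummary(site=site, n=len(values), total=sum(values))


def pooled_mean(summaries):
    """Runs at the coordinator, which only ever sees counts and sums."""
    n = sum(s.n for s in summaries)
    return sum(s.total for s in summaries) / n


# Synthetic lab values standing in for data held at two separate sites.
site_values = {
    "site_a": [132.0, 128.0, 140.0, 121.0, 135.0],
    "site_b": [118.0, 125.0, 130.0, 122.0, 127.0, 133.0],
}

summaries = [summarize_locally(site, vals) for site, vals in site_values.items()]
overall = pooled_mean(summaries)
print(f"pooled mean across {len(summaries)} sites: {overall:.2f}")
```

The pooled mean matches what a central analysis of all rows would give, yet the coordinator only receives one count and one sum per site. Real deployments add authentication, audit logging, and stricter output checks on top of this basic pattern.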
TREs also enable collaboration across organizations. Research teams, partners, and sponsors can work on the same datasets under shared governance, with full traceability of how data is accessed and used. This is essential for generating regulatory-grade evidence.
Combined with harmonization, TREs make it possible to include data from under-represented regions in analyses that were previously limited to a small number of patients.
Implementing this model at scale requires three capabilities: access to diverse datasets, harmonization into a common structure, and infrastructure that allows data to be analyzed across regions in compliance with local privacy and governance standards.
BC Platforms brings these capabilities together through our global data partner network of 150+ organizations across Europe, Asia, the Middle East, Africa, and Latin America. This includes partnerships with organizations such as GeneVault Lifesciences and OmicsBank, who bring access to under-represented populations in these areas. Our global data partner network provides access to more than 187 million patient lives across over 35 countries, with more than 90% of data originating outside the United States.
Data is harmonized into research-ready datasets and made available through trusted research environments and federated architectures, enabling analysis across regions while respecting local governance and data residency requirements. This allows life sciences teams to work with analysis-ready, multi-modal datasets that can be used consistently across studies, rather than rebuilding datasets for each region. More representative data can be used earlier in development, reducing the need for region-specific validation studies and improving the transferability of results across geographies.
For under-represented populations, this approach changes how data contributes to research. Instead of remaining isolated in local systems, data can be included in global analyses and used to inform development, safety, and treatment strategies that specifically target the patient profile for a given therapy. Improved population coverage enables:
Relevant population coverage is not only a question of data availability; it determines whether evidence reflects the populations a therapy is intended for. When under-represented populations are excluded, uncertainty persists. When they are included, evidence becomes more reliable and transferable across regions, and therapies are developed with greater precision based on the unique genetic needs of that target population.
This article is part of a series examining how under-representation introduces uncertainty into real-world evidence and analytics, and how more representative, usable data can reduce risk in the decisions life sciences companies make across development, regulatory, and commercialization activities.
Including the right patient population in your evidence base is critical for drug development, regulatory approvals and market access. Let’s talk about how BC Platforms can provide representative real-world data that supports feasibility, safety, and regulatory submissions in your markets of interest.