Greater genetic diversity is needed in human pluripotent stem cell models

While the need for more diverse hPSC resources is clear, substantial challenges remain with regard to expanding cell collections and implementing these resources in the laboratory. First, it is critical to acknowledge that current efforts toward greater inclusivity exist in a historical context of discrimination, where actions as well as inactions have eroded trust in scientific and medical establishments (for additional discussion on historical issues of race and ancestry in medicine, please see ref. 22). Therefore, conscious efforts towards rebuilding trust and increasing participation are essential, including ensuring informed consent; broad access to collected resources, data and results; purposeful and continued engagement with all stakeholders contributing to cell collections; and clear, accurate language to describe race, ethnicity, ancestry, and their potential roles in specific biological findings. As global cell collections expand, it is essential to invest in training and capacity building specifically within currently underrepresented countries, and to establish scientific partnerships to facilitate the utilization of hPSC resources in the communities from which they are derived. In parallel, countries with established cell banking capacities must continue to improve the representation of various ancestries from donors that may be available within those countries. Cell line distribution and data-sharing can also be subject to country-specific limitations, which underscores the importance of simultaneously advocating for increased diversity of collections within a given country, as well as increasing global collaborative efforts between countries. Data governance is another important consideration. 
While deeper and more extensive clinical phenotyping and metadata would be extremely valuable when coupled with genetic and cellular resources to enable genotype-phenotype associations, it can be challenging to make such sensitive data available to the scientific community, as it often falls in the protected health information category, which, if shared, could compromise patient privacy and affect identifiability. Thus, the depth of shareable information must be appropriately balanced with donor privacy. Increasing diversity in cellular models also raises obvious questions of feasibility, as it requires laboratories to invest time and financial resources to incorporate additional hPSC lines into experimental paradigms. As discussed above, some laboratories may leverage diverse cell lines to interrogate known alleles across different genetic backgrounds using targeted approaches and thus require relatively small sample sets, while others may engage in discovery studies such as mapping the effects of genetic variants on cellular phenotypes which require substantial scale. Here, repositories with well-characterized, diverse and accessible hPSC lines, combined with additional support from research funding mechanisms specifically for purposes of incorporating hPSC lines from underrepresented populations, and clear reporting on ancestry selection in individual studies will be critical for the practical implementation of these resources (see additional recommendations from ref. 23).

As efforts are underway to increase diversity, it is worth taking a moment to consider how different populations are ascertained and described in hPSC collections. Regarding ascertainment, most cell banks use self-reported race or ethnicity as opposed to genetically inferred ancestry (with HipSci being a notable exception). Self-reported race or ethnicity reflects identity categories that can change over time, while genetically inferred ancestry (eg, quantitative estimates of ancestral components by continent) reflects aspects of underlying biology which remain static for a given individual. Both types of data provide relevant information, but reliance upon self-reported race or ethnicity alone presents several specific limitations. The 2020 United States Census provides a timely example of how shifting social, political and cultural factors can influence self-reporting24, which is less reliable for populations composed of multiple ancestries and individuals who identify with multiple races or ethnicities25. Indeed, as discussed by ref. 26, an individual’s racial or ethnic identity may have little concordance with their genetic ancestry. One study investigating the accuracy of self-reporting for over 9000 individuals found that the method of data collection itself, in this case a requisition form versus consultation, was sufficient to impact the level of concordance with genetic ancestry27. Another study analyzing nearly 2000 individuals in a pediatric HIV/AIDS cohort asked to self-identify as either “Black/African American”, “White” or “Hispanic” found that when using the highest % genetic ancestry, 9.5% of subjects were mis-identified based on self-reporting, and when ≥75% genetic ancestry of a specific population was required, 26% of individuals were mis-identified based on self-reporting28.
These and other studies underscore how reliance upon self-reported race or ethnicity in hPSC collections may impact the accuracy as well as the longevity of the resources, particularly as identity labels, some with fraught history, shift. Inclusion of genetically inferred ancestry is one strategy to improve the accuracy of cell-based resources, to achieve greater insight into the genetic architectures of specific subpopulations and to ensure that resources maintain their utility as identity labels change. Coupling this information with self-reported race or ethnicity using standardized nomenclature will provide a more complete picture of individual donors. This will of course require clear communication to ensure that donors understand and agree to genetic analyses to infer ancestry.
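
The two concordance rules mentioned above (assigning the highest-fraction ancestry versus requiring at least 75% of a single ancestry) can be illustrated with a minimal sketch. It is not the method of ref. 28. The donor records, the labels pop_A and pop_B, the function names, the one-to-one mapping between self-reported labels and ancestry components, and the choice to count a "no call" as discordant are all assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical record: (self-reported label, inferred ancestry fractions).
# Assumes each self-reported label maps onto exactly one ancestry component,
# which is itself a simplification.
Donor = Tuple[str, Dict[str, float]]


def call_by_highest(fractions: Dict[str, float]) -> Optional[str]:
    """Assign the component with the largest inferred fraction."""
    return max(fractions, key=fractions.get)


def call_by_threshold(fractions: Dict[str, float], threshold: float = 0.75) -> Optional[str]:
    """Assign the top component only if it reaches the threshold; otherwise no call."""
    top = call_by_highest(fractions)
    return top if fractions[top] >= threshold else None


def discordance_rate(
    donors: List[Donor],
    caller: Callable[[Dict[str, float]], Optional[str]],
) -> float:
    """Fraction of donors whose self-reported label differs from the genetic call.

    A donor with no call (None) is counted as discordant, which is an assumption.
    """
    mismatches = sum(1 for label, fractions in donors if caller(fractions) != label)
    return mismatches / len(donors)


# Invented toy data, not taken from any study.
DONORS: List[Donor] = [
    ("pop_A", {"pop_A": 0.90, "pop_B": 0.10}),
    ("pop_A", {"pop_A": 0.60, "pop_B": 0.40}),
    ("pop_B", {"pop_A": 0.55, "pop_B": 0.45}),
    ("pop_B", {"pop_A": 0.10, "pop_B": 0.90}),
]

rate_highest = discordance_rate(DONORS, call_by_highest)
rate_threshold = discordance_rate(DONORS, call_by_threshold)
print(rate_highest, rate_threshold)  # prints: 0.25 0.5
```

In this toy set, the stricter threshold rule flags more donors as discordant because admixed individuals no longer receive any single-ancestry call, which mirrors the direction of the difference reported above.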

Regarding the language used to describe race, ethnicity and/or ancestry, there is a general lack of concordance between different hPSC banks, between different genomic studies, and across hPSC banks and genomic studies (Fig. 2a). Moreover, some hPSC banks rely on terms such as “Other” or “More than one race”, which fail to capture the increasing degree of ancestral complexity in global populations and essentially exclude these individuals from accurate representation. These issues make it challenging to identify relevant hPSC lines in order to pursue insights from human genomic datasets. As one example, the HLA-B*5701 variant associated with hypersensitivity to abacavir, a medication used to treat HIV, has a frequency of 13.6% among individuals in the Masai group in Kenya, 0% among individuals in the Yoruba group in Nigeria and 5.8% among individuals with European ancestry29. Here, the allelic variant does not segregate within the population terms used in any hPSC banks. While different studies will require varying levels of granularity in the populations under investigation, current estimates put the number of subcontinental ancestries at a minimum of 21, with 97.3% of individuals harboring ancestral heterogeneity30. In other words, while many hPSC collections were launched prior to or simultaneous with large-scale genomic initiatives (Fig. 1a), it is essential for hPSC collections to now consider how best to adapt to the rapidly expanding genomic insights from ancestrally diverse populations (Fig. 1c).
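
To make the HLA-B*5701 example concrete, the following sketch shows how pooling two groups under a single broad label yields an allele frequency that describes neither group. The frequencies are those quoted above; the group sizes (100 individuals each) and the function name pooled_frequency are hypothetical and chosen only to keep the arithmetic simple.

```python
from typing import List, Tuple


def pooled_frequency(groups: List[Tuple[float, int]]) -> float:
    """Size-weighted mean allele frequency across (frequency, n_individuals) groups.

    Assumes a non-empty list with a positive total size.
    """
    total = sum(n for _, n in groups)
    return sum(freq * n for freq, n in groups) / total


# Frequencies from the text; sample sizes are hypothetical.
MASAI = (0.136, 100)
YORUBA = (0.0, 100)

pooled = pooled_frequency([MASAI, YORUBA])
print(round(pooled, 3))  # roughly 0.068, matching neither 0.136 nor 0.0
```

A line selected only by the broad label would be expected to carry the variant at roughly the pooled rate under these assumed sizes, although the true carrier rate depends entirely on which subgroup the donor belongs to.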

Fig. 2: Considerations for reporting and expanding stem cell diversity.

a Left, examples of how individuals of European (blue) and Asian (green) ancestry are reported in current hPSC banks, including CIRM (USA), WiCell (USA), Coriell (USA), SKiP (Japan) and HipSci (UK). Right, examples of how individuals of European (blue) and Asian (green) ancestry are reported in human genomic studies, including Bergstrom et al. 2020 (Human Genome Diversity Project (HGDP))37, Karczewski et al. 2020 (gnomAD)38 and Smedley et al. 2021 (100,000 Pilot Genomes)39. b Key recommendations towards expanding hPSC diversity. Map adapted from Templates by Yourfreetemplates.com/.

In a best-case scenario, for participants who have consented to iPSC derivation, material for iPSC reprogramming and banking would be collected alongside material for genomic and/or phenotypic studies, thus providing a direct link between these data and available cellular resources (Fig. 2b). Notably, such initiatives must be paired with community engagement efforts to ensure that participants have an appropriate understanding of how their samples will be used and stored, of the possible scientific and medical benefits that might ensue, but also that such benefits might not be immediate and/or personal. Efforts such as NeuroDev have succeeded in establishing sample collections for both exome sequencing and lymphocyte cell banking from individuals in South Africa31, and similar approaches could be undertaken for future hPSC collections. Working with participants from The Brazilian Longitudinal Study of Adult Health (ELSA-Brasil), laboratories within Brazil have derived iPSC lines and performed ancestry analyses to establish cellular resources that better reflect the Brazilian population, linked with clinical phenotyping data32. Alternatively, as groups like TOPMed33 and 1000 Genomes34 are expanding their reference populations, and cell banks like the California Institute for Regenerative Medicine are utilizing SNP microarrays to assay genomic integrity, these data could be combined to ascertain more refined ancestry predictions as opposed to relying solely on self-reported race or ethnicity. These more diverse reference panels will also further enable analyses of experimental models from underrepresented populations. At a minimum, standardized and more accurate descriptions of race, ethnicity, and ancestry should be used in future hPSC collections. Here, frameworks developed for reporting data in genomic studies could be leveraged to provide greater harmonization across disciplines (eg, Morales 2018 Genome Biology)35.