Genome-wide association studies

Genome-wide association studies (GWAS) test hundreds of thousands of genetic variants across many genomes to find those statistically associated with a specific trait or disease. This methodology has generated many robust associations for a range of traits and diseases, and the number of associated variants is expected to grow steadily as GWAS sample sizes increase. GWAS results have a range of applications, such as gaining insight into a phenotype’s underlying biology, estimating its heritability, calculating genetic correlations, making clinical risk predictions, informing drug development programmes and inferring potential causal relationships between risk factors and health outcomes. In this Primer, we provide the reader with an introduction to GWAS, explaining their statistical basis and how they are conducted, describe state-of-the-art approaches and discuss limitations and challenges, concluding with an overview of the current and future applications for GWAS results.

Introduction

Genome-wide association studies (GWAS) aim to identify associations of genotypes with phenotypes by testing for differences in the allele frequency of genetic variants between individuals who are ancestrally similar but differ phenotypically. GWAS can consider copy-number variants or sequence variations in the human genome, although the most commonly studied genetic variants in GWAS are single-nucleotide polymorphisms (SNPs). GWAS typically report blocks of correlated SNPs that all show a statistically significant association with the trait of interest, known as genomic risk loci. After 15 years of GWAS 1 , many replicated genomic risk loci have been associated with diseases and traits 1 , such as FTO 2 for obesity and PTPN22 (ref. 3 ) for autoimmune diseases. These results have sometimes provided hints into disease biology; for example, a GWAS implicated the IL-12/IL-23 pathway in the development of Crohn’s disease 4 , which supported subsequent clinical trials for drugs targeting the IL-12/IL-23 pathway 5 .

Results from GWAS can be used for a range of applications. For example, trait-associated genetic variants can be used as control variables in epidemiology studies to account for confounding genetic group differences 6 . Further, results can be used to predict an individual’s risk for physical and mental disease based on their genetic profile. Indeed, a recent study showed that genomic risk prediction using genome-wide polygenic risk scores (PRSs) for coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease and breast cancer can identify disease risk as well as monogenic risk prediction strategies based on rare, highly penetrant mutations 7 . Genomic risk prediction may soon be allowed for clinical use as a stratification tool and a genetically based biomarker 7 .

More than 5,700 GWAS have now been conducted for more than 3,300 traits 8 and a push for more statistical power has thrust GWAS sample sizes well beyond a million participants 9,10 , yielding numerous associated and replicable variants for many heritable traits. Now that reliable genetic associations for various phenotypes are known, we are faced with the next big challenge: interpreting these associations in a biological and genomic context. Previous GWAS have shown that most traits are influenced by thousands of causal variants 11 that individually confer very little risk, are often associated with many other traits 8 and are correlated with causal and non-causal variants that are physically close as a result of linkage disequilibrium 12 , making direct biological, causal inferences complicated 13 . Further, genetic associations may differ across ancestries, complicating direct comparisons between groups of individuals. Some of these limitations hamper drawing unambiguous conclusions about the biological meaning of GWAS results, sometimes limiting their utility to produce mechanistic insights or to serve as starting points for drug development 1 .

In this Primer, we aim to provide the reader with a comprehensive overview of GWAS, covering practical considerations, such as experimental design, robust data analysis and data deposition, ethical implications and reproducibility of results. We also provide guidance on how to interpret results from GWAS using several post-GWAS strategies and functional follow-up experiments, as well as a discussion of the above-mentioned limitations and future challenges of GWAS.

Experimentation

The experimental workflow of a GWAS involves several steps, including the collection of DNA and phenotypic information from a group of individuals (such as disease status and demographic information such as age and sex); genotyping of each individual using available GWAS arrays or sequencing strategies; quality control; imputation of untyped variants using haplotype phasing and reference populations; conducting the statistical test for association; conducting a meta-analysis (optional); seeking an independent replication; and interpreting the results by conducting multiple post-GWAS analyses (Fig. 1). At each step, possible biases and errors may enter the study, and therefore careful planning is required when setting up a GWAS, and adherence to standardized quality control and analysis protocols is advised. We detail these steps below. We note that most of the issues that may arise when conducting GWAS, such as carefully selecting participants or the steps that are needed in quality control, apply both to GWAS that include common variants and to studies that include rare variants such as whole-exome sequencing (WES) studies and whole-genome sequencing (WGS) studies; the sections below concern the analysis of common variants, except when explicitly stated (Box 1).
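As a rough illustration of the association-testing step in this workflow, the sketch below runs a basic allelic chi-square test on a 2x2 table of allele counts in cases and controls. It is a minimal example under simplifying assumptions (no covariates, no relatedness, no population structure); GWAS in practice typically use regression models with covariates. The function name and the example counts are hypothetical.

```python
import math


def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Allelic 2x2 chi-square test (1 degree of freedom).

    Arguments are allele counts (not genotype counts).
    Returns (chi_square, p_value, odds_ratio).
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    total = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs a non-zero total")
    chi_square = total * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return chi_square, p_value, odds_ratio


# Hypothetical counts: 500 cases and 500 controls (1,000 alleles per group).
chi_square, p_value, odds_ratio = allelic_association(300, 700, 200, 800)
print(f"chi2={chi_square:.2f} p={p_value:.2e} OR={odds_ratio:.2f}")
```

The same test would be repeated for every variant, which is why a stringent multiple-testing threshold is needed.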

figure 1

Box 1 Common and rare variants

Genome-wide association studies (GWAS) generally involve targeted genotyping of specific and pre-selected variants using microarrays, whereas whole-exome sequencing (WES) and whole-genome sequencing (WGS) studies aim to capture all genetic variation. Strictly speaking, both WES and WGS studies are also GWAS, although in the literature ‘GWAS’ mostly refers to genome-wide studies of common variants and is sometimes considered separate from WGS and WES studies. Declaring a variant as common or rare is population-specific and cannot be generalized across populations. Generally, common variants are those with a minor allele frequency above 5%, although as sample sizes grow this threshold can be as low as 1% because researchers typically adhere to a minimum minor allele count; for example, at least 100 individuals who carry at least one copy of the minor allele. With WGS and WES studies just beginning to mature, current analysis protocols may need to be extended to also cover specific issues that arise when analysing rare variants, for example, when controlling for population stratification or imputing missing genotypes.
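To make the link between a minimum minor allele count and a frequency threshold concrete, the sketch below computes the smallest minor allele frequency at which a cohort is expected to contain a given number of carriers. It assumes Hardy–Weinberg proportions, which is a simplification, and the function names are hypothetical.

```python
import math


def expected_carriers(n_individuals, maf):
    """Expected number of individuals with at least one minor allele,
    assuming Hardy-Weinberg proportions."""
    return n_individuals * (1.0 - (1.0 - maf) ** 2)


def min_maf_for_carriers(n_individuals, min_carriers=100):
    """Smallest minor allele frequency expected to yield `min_carriers`
    carriers in a cohort of `n_individuals`."""
    if not 0 < min_carriers <= n_individuals:
        raise ValueError("min_carriers must be between 1 and the cohort size")
    return 1.0 - math.sqrt(1.0 - min_carriers / n_individuals)


# The implied frequency threshold drops as the cohort grows.
for cohort_size in (10_000, 100_000, 500_000):
    print(cohort_size, f"{min_maf_for_carriers(cohort_size):.5f}")
```

The loop shows why a fixed carrier count translates into a lower frequency threshold in larger cohorts.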

Conducting GWAS

Selecting study populations

GWAS often require very large sample sizes to identify reproducible genome-wide significant associations, and the desired sample size can be determined using power calculations in software tools such as CaTS 14 or GPC 15 . Study designs can involve the inclusion of cases and controls when the trait of interest is dichotomous, or quantitative measurements on the whole study sample when the trait is quantitative. In addition, one can choose between population-based and family-based designs. The choice of data resource and study design for a GWAS depends on the required sample size, the experimental question and the availability of pre-existing data or the ease with which new data can be collected. GWAS can be conducted using data from resources such as biobanks or cohorts with disease-focused or population-based recruitment, or through direct-to-consumer studies. Assembling data sets of a sufficient size to run a well-powered GWAS for a complex trait requires major investments of time and money that go beyond the capacity of most individual laboratories. However, there are several excellent public resources available that provide access to large cohorts with both genotypic and phenotypic information, and the majority of GWAS are conducted using these pre-existing resources. Even when new data have been collected in-house, these will typically be co-analysed with data from pre-existing resources; collecting new data is usually required when more refined phenotyping is desired.
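The idea behind a power calculation can be sketched as follows for a quantitative trait. This is not how CaTS or GPC are implemented; it is a textbook normal approximation in which the test statistic has mean sqrt(N × q²), where q² is the fraction of trait variance explained by the variant. The significance threshold of 5 × 10⁻⁸ is the conventional genome-wide value and is an assumption here, as are the function names.

```python
from statistics import NormalDist

GENOME_WIDE_ALPHA = 5e-8  # conventional threshold, assumed for illustration


def gwas_power(n_samples, variance_explained, alpha=GENOME_WIDE_ALPHA):
    """Approximate power of a two-sided test for one variant that explains
    `variance_explained` of the variance of a quantitative trait."""
    std_normal = NormalDist()
    z_crit = -std_normal.inv_cdf(alpha / 2.0)
    mean_z = (n_samples * variance_explained) ** 0.5
    upper_tail = 1.0 - std_normal.cdf(z_crit - mean_z)
    lower_tail = std_normal.cdf(-z_crit - mean_z)
    return upper_tail + lower_tail


def required_sample_size(variance_explained, target_power=0.8,
                         alpha=GENOME_WIDE_ALPHA):
    """Smallest sample size reaching `target_power` (doubling, then bisection)."""
    if variance_explained <= 0 or not 0 < target_power < 1:
        raise ValueError("need variance_explained > 0 and 0 < target_power < 1")
    low, high = 1, 2
    while gwas_power(high, variance_explained, alpha) < target_power:
        high *= 2
    while low < high:
        mid = (low + high) // 2
        if gwas_power(mid, variance_explained, alpha) >= target_power:
            high = mid
        else:
            low = mid + 1
    return low


# A variant explaining 0.1% of trait variance needs tens of thousands of samples.
print(required_sample_size(0.001))
```

Because most variants for complex traits explain far less than 0.1% of the variance, this kind of calculation quickly motivates the very large cohorts described above.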

For all study designs, recruitment strategies must be carefully considered as these can induce collider bias and other forms of bias in the resultant data 16 . For example, widely used research cohorts such as the UK Biobank recruit participants through a volunteer-based strategy, which results in participants who are, on average, healthier, wealthier and more educated than the general population 17 . Further, cohorts that enrol participants from hospitals based on their disease status (such as BioBank Japan) will have different selection biases to cohorts recruited from the general population 18 . Different ethnicities can be included in the same study, as long as the population substructure is considered to avoid false positive results. Individual cohorts with detailed clinical measures may not be able to meet the required sample size; in these cases, ‘proxy’ phenotypes that are easier to measure and for which there are more data can be used (for example, educational attainment can be used as a proxy for intelligence, or depressive symptoms can be used as a proxy for a clinical diagnosis of depression) 19 .

Genotyping

Genotyping of individuals is typically done using microarrays for common variants or next-generation sequencing methods such as WES or WGS that also include rare variants. Microarray-based genotyping is the most commonly used method for obtaining genotypes for GWAS owing to the current cost of next-generation sequencing. However, the choice of genotyping platform depends on many factors and tends to be guided by the purpose of the GWAS; for example, in a consortium-led GWAS, it is usually wise to have all individual cohorts genotyped on the same genotyping platform. Ideally, WGS — which determines nearly every genotype of a full genome — is preferred over WES and microarrays, and is expected to become the method of choice over the next couple of years with the increasing availability of low-cost WGS technology.

Data processing

Input files for a GWAS include anonymized individual ID numbers, coded family relations between individuals, sex, phenotype information, covariates, genotype calls for all called variants and information on the genotyping batch. Following input of the data, generating reliable results from GWAS requires careful quality control. Some example steps include removing rare or monomorphic variants, removing variants that are not in Hardy–Weinberg equilibrium, filtering SNPs that are missing from a fraction of individuals in the cohort, identifying and removing genotyping errors, and ensuring that phenotypes are well matched with genetic data, often by comparing self-reported sex versus sex based on the X and Y chromosomes. Software tools such as PLINK have been specifically designed to analyse genetic data and can be used to conduct many of these quality control steps 20 (further software for quality control analysis and other stages of GWAS are summarized in Table 1). Once sample and variant quality control have been performed on GWAS array data, variants usually undergo phasing and are imputed using a sequenced haplotype reference panel such as the 1000 Genomes Project or TOPMed 21,22 , which involves the statistical inference of genotypes that have not been assayed directly (Box 2). GWAS consortia routinely follow pipelines for conducting quality control steps and imputation, using, for example, RICOPILI 23 or similar software, or upload their data to imputation servers (for example, the Michigan Imputation Server or the TOPMed Imputation Server) where these standardized pipelines have been implemented. Because genetic data sets are typically large and analysis pipelines can be run in parallel, computer clusters or cloud environments that can distribute jobs to many computers are often used. To achieve the large sample sizes typical in genetic studies in a logistically feasible manner that follows data protection rules, the above steps are often done separately for many different cohorts of varying sample size (see section Genome-wide association meta-analysis (GWAMA)).
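Three of the variant-level quality control steps listed above (minor allele frequency, missingness and Hardy–Weinberg equilibrium) can be sketched from genotype counts alone. This is an illustrative example rather than PLINK's implementation: it uses a chi-square approximation for the Hardy–Weinberg test, and the thresholds, class name and function names are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class VariantCounts:
    variant_id: str
    hom_ref: int   # individuals with genotype AA
    het: int       # individuals with genotype AB
    hom_alt: int   # individuals with genotype BB
    missing: int   # individuals with no genotype call


def call_rate(v):
    called = v.hom_ref + v.het + v.hom_alt
    total = called + v.missing
    return called / total if total else 0.0


def minor_allele_frequency(v):
    called = v.hom_ref + v.het + v.hom_alt
    if called == 0:
        return 0.0
    alt_freq = (2 * v.hom_alt + v.het) / (2 * called)
    return min(alt_freq, 1.0 - alt_freq)


def hwe_p_value(v):
    """Chi-square (1 df) test of Hardy-Weinberg proportions."""
    n = v.hom_ref + v.het + v.hom_alt
    if n == 0:
        return 1.0
    p = (2 * v.hom_ref + v.het) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (v.hom_ref, v.het, v.hom_alt)
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi_square / 2.0))


def passes_qc(v, min_maf=0.01, min_call_rate=0.98, min_hwe_p=1e-6):
    """Apply illustrative thresholds; real studies choose their own."""
    return (minor_allele_frequency(v) >= min_maf
            and call_rate(v) >= min_call_rate
            and hwe_p_value(v) >= min_hwe_p)


variants = [
    VariantCounts("var_ok", 490, 420, 90, 0),
    VariantCounts("var_hwe", 500, 0, 500, 0),
    VariantCounts("var_rare", 998, 2, 0, 0),
    VariantCounts("var_missing", 441, 378, 81, 100),
]
print([v.variant_id for v in variants if passes_qc(v)])
```

Sample-level checks such as sex mismatches and relatedness need individual-level genotypes and are not covered by this sketch.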

figure 2

figure 3

Statistical fine-mapping

Many non-causal variants are significantly associated with a trait of interest owing to linkage disequilibrium; whether these reach the significance threshold depends on their level of correlation with and the strength of association of the causal variant 12 . The output of GWAS is therefore clustered in risk loci — sets of correlated variants that all show a statistically significant association with the trait of interest — and linkage disequilibrium typically prevents pinpointing causal variants without further analysis.

Fine-mapping is an in silico process designed to prioritize the set of variants that are most likely to be causal to the target phenotype within each of the genetic loci identified by GWAS, based on observed patterns of linkage disequilibrium and association statistics 90,91 . The set of variants that most parsimoniously explain regional association signals are defined as credible variants. The lead variant with the most significant association would be expected to be the most credible causal variant, although there are several situations where the most significant association may be non-causal. For example, where multiple independent risk variants are present in a locus, the combination of multiple signals can shift the most significant association from causal variants to a neighbouring non-causal variant. This can also occur owing to heterogeneity in variant genotype imputation quality, which induces fluctuations in the association signal statistics among neighbouring variants in linkage disequilibrium.
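One simple way to turn association statistics into a credible set is sketched below. It is not one of the methods named in this Primer: it uses Wakefield-style approximate Bayes factors and assumes exactly one causal variant per locus with equal prior probability for every variant. The prior standard deviation of 0.2 and all function names are assumptions for illustration.

```python
import math


def log_approx_bayes_factor(beta, standard_error, prior_sd=0.2):
    """Log approximate Bayes factor in favour of association."""
    v = standard_error ** 2
    w = prior_sd ** 2
    z = beta / standard_error
    shrinkage = w / (v + w)
    return 0.5 * math.log(1.0 - shrinkage) + 0.5 * z * z * shrinkage


def posterior_probabilities(summary_stats, prior_sd=0.2):
    """Map variant id -> posterior probability of being the single causal
    variant. `summary_stats` maps variant id -> (beta, standard_error)."""
    log_bfs = {vid: log_approx_bayes_factor(b, se, prior_sd)
               for vid, (b, se) in summary_stats.items()}
    largest = max(log_bfs.values())  # subtract the maximum for stability
    weights = {vid: math.exp(lbf - largest) for vid, lbf in log_bfs.items()}
    total = sum(weights.values())
    return {vid: w / total for vid, w in weights.items()}


def credible_set(posteriors, coverage=0.95):
    """Smallest set of top-ranked variants whose posteriors reach `coverage`."""
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for variant_id, probability in ranked:
        chosen.append(variant_id)
        cumulative += probability
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical locus: two variants in strong LD have near-identical statistics.
locus = {"snp_a": (0.20, 0.03), "snp_b": (0.20, 0.03), "snp_c": (0.02, 0.03)}
print(credible_set(posterior_probabilities(locus)))
```

When two variants have indistinguishable statistics, as in the hypothetical locus, both stay in the credible set, which mirrors the difficulty of separating variants in strong linkage disequilibrium.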

The simplest fine-mapping analysis is a conditional association analysis of the regional variants, which adjusts the regional association signals by including the lead variant as a covariate in genotype–phenotype regression models. When multiple association signals exist, forward stepwise selection is commonly used, adding variants as covariates until no associations remain. This method, known as stepwise conditional analysis, cannot search all of the combinatory patterns of potential credible variants, because the variant search pattern in each iterative step is strongly dependent on the previously selected variant sets and the initial step often includes the lead variant. When full genotype data are not available, conditional association analyses can be conducted on summary statistics using GCTA-COJO software 92 .
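The intuition behind conditioning on a lead variant with summary statistics can be sketched with a two-variant approximation. This is a simplified illustration, not the GCTA-COJO algorithm: it assumes small effect sizes, that both z-scores come from the same sample, and that r is the genotype correlation between the two variants. Function names are hypothetical.

```python
import math


def conditional_z(z_target, z_lead, r):
    """Approximate z-score of the target variant after conditioning on the
    lead variant, given their genotype correlation r."""
    if abs(r) >= 1.0:
        raise ValueError("variants in perfect LD cannot be separated")
    return (z_target - r * z_lead) / math.sqrt(1.0 - r * r)


def two_sided_p(z):
    """Two-sided P value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


lead_z = 8.0
# A neighbour in strong LD: its signal is fully explained by the lead variant.
print(conditional_z(7.2, lead_z, r=0.9))
# A weakly correlated variant: most of its signal survives conditioning.
z_independent = conditional_z(6.0, lead_z, r=0.1)
print(z_independent, two_sided_p(z_independent))
```

In a stepwise analysis, a variant that remains significant after conditioning would be added to the model and the procedure repeated.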

Several sophisticated fine-mapping approaches are based on Bayesian models, including CAVIAR 93 , FINEMAP 94 , PAINTOR 95 and SuSiE 96 . These approaches optimize the selection of variables for a regression model by using a prior probability distribution, or prior, to estimate a posterior probability distribution, or posterior. An advantage of using Bayesian models over conditional association analysis is that priors can consider additional information such as imputation accuracy in addition to association signals; however, sets of credible variants output using Bayesian modelling are generally not consistent across different methods, especially when multiple independent association signals exist within a locus. In general, the statistical power to correctly detect credible variant sets declines as the number of independent signals increases 96 .

In silico fine-mapping can find credible variants that modulate the expression patterns and functions of causal genes (SNP to gene mapping) or contribute to the development of the target phenotype (SNP to biology mapping). A basic principle of successful fine-mapping is to expand the coverage of the genetic variants assessed by using, for example, WGS-based genotype imputation reference panels 97 . Reference panels with large sample sizes and/or that include other types of non-SNP genetic variants such as insertions, deletions and copy number variants can further expand the coverage of variants for fine-mapping. Recently released large-scale WGS resources with detailed variant annotations (such as the gnomAD 98 and TOPMed 22 databases, which contain >10,000 and >90,000 whole-genome sequences, respectively) serve as valuable resources for high-resolution fine-mapping. It should be noted that structural variants and short tandem repeats are not always accurately captured by current WGS technologies. Further, there are several regions where WGS-based imputation estimates genotypes inaccurately and custom imputation approaches may be needed to fine-map such regions. For example, the genomic region corresponding to the HLA complex (also known as the major histocompatibility complex (MHC)) is highly pleiotropic for various human traits related to the immune system and infectious disease 99 . The complicated linkage disequilibrium structure in this region prevents WGS-based SNP imputation from unambiguously determining HLA genotypes. The construction of HLA reference panels and custom imputation methods targeting HLA polymorphisms, such as the software packages SNP2HLA (refs 100,101,102 ), HIBAG 103 and HLA*IMP 104 , have provided a catalogue of HLA variant–phenotype association maps 105 . Customized regional imputation methods have also been reported for targeting missing variants at other gene loci; for example, the KIR*IMP software for the killer-cell immunoglobulin-like receptor (KIR) gene locus 106 . Specific resources also exist for use with mitochondrial genomes 107 .

Prioritization of a credible SNP over highly correlated SNPs with absolute linkage disequilibrium is challenging. Fine-mapping of associations from a GWAS for inflammatory bowel disease implicated a single candidate causal variant in only 12% of loci and 1–5 candidate causal variants in 30% of loci 108 , and fine-mapping of a breast cancer GWAS showed similar figures 109 . Prioritizing variants can be improved by integrating functional annotations of the SNPs — for example, expression quantitative trait loci (eQTLs) or epigenomic motifs — into the priors of the Bayesian fine-mapping models. A trans-ethnic GWAS meta-analysis can also help fine-mapping of highly correlated SNPs as differences in linkage disequilibrium structure among ancestries can narrow down the regional windows of associations 91 .

Functional inference from GWAS

A major motivation for conducting GWAS is to use the identified associations to determine the biological cause of heritable phenotypes and provide a starting point for investigating potential therapeutic interventions. Although GWAS have led to the identification of thousands of complex trait-associated genetic variants 110 and fine-mapping has provided sets of credible SNPs, the biological implications of these variants are typically not easily inferred (with some exceptions 111 ). After fine-mapping, the full mechanistic dissection of a locus identified by a GWAS includes identifying the immediate effects of causal variants (for example, on protein or enhancer function), the affected gene or genes in the locus that mediate the disease association, the downstream network or pathway effects that lead to changes in cellular and physiological function, and the relevant tissue, cell type and cell state for all these effects. Currently, this information exists for only a few loci, such as FTO 112 and SORT1 (ref. 113 ). However, a diverse set of approaches have been developed to infer the molecular effects of variants identified by GWAS.

Determining the affected gene

Prioritizing the likely affected gene is perhaps the most crucial part of the functional interpretation of GWAS loci. For the 2–3% of GWAS loci fine-mapped to coding variants 1 , tools such as ANNOVAR 114 or VEP 115 can be used to infer their potential effect on genes. However, the vast majority of associated, fine-mapped SNPs are located outside coding regions, do not affect protein structure and have unknown regulatory functions 116,117 . The causal gene or genes in the locus — those for which regulatory changes mediate disease association — are often those closest to the association signal 118,119 , although a recent preprint article suggests this is not always the case 120 . One approach for identifying regulatory target genes of genetic variants is molecular quantitative trait loci (molQTLs) analysis, which associates genetic variants with specific molecular phenotypes; for example, eQTL analysis identifies loci associated with RNA expression. The same approach can be applied to other molecular phenotypes such as splicing, chromatin accessibility or methylation status. By integrating this information with GWAS results, trait-associated variants can be mapped to the genes they are likely to regulate in specific tissues and the molecular processes mediating these associations 121,122 . Comprehensive, accessible QTL catalogues are available for community use; for example, the Genotype–Tissue Expression (GTEx) resource catalogues eQTL and splicing QTL for 49 tissues 122 , the eQTLGen resource provides a map of both cis-eQTL and trans-eQTL 123 associations in blood with data from more than 30,000 donors and the eQTL Catalogue has compiled multiple eQTL data sets, as reported in a recent preprint article 124 . The eQTL framework can be extended to transcriptome-wide association studies 125,126 , where gene expression levels are imputed into data from GWAS and tested for association with a trait.

eQTL and splicing QTL approaches suffer from some limitations. As any non-causal variant in high linkage disequilibrium with a truly causal variant will likely show a statistical association with a trait, assigning a functional or regulatory effect to a variant does not automatically mean that the variant is causal. eQTLs should be integrated with GWAS data using co-localization approaches to pinpoint loci where the regulatory association and disease association share the same causal variant 127,128,129 . Further, eQTLs often affect several genes and, therefore, other data sources or functional annotations can be used to prioritize those genes that mediate disease. Finally, molQTL catalogues lack data from many relevant tissues, and data from specific cell types and molecular phenotypes other than expression and splicing are limited. Thus, although molQTL mapping is a powerful and popular approach for creating hypotheses for the regulatory mechanisms and target genes behind GWAS loci, such gene mapping approaches are not as conclusive as those for coding variants (although it should be noted that detectable coding variants for most genes are rare).

As an alternative to molQTL mapping, fine-mapped GWAS variants in enhancers can be linked to genes using methods based on chromatin conformation capture (3C), such as chromosome conformation capture on chip (4C), chromosome conformation capture carbon copy (5C) and high-throughput chromosome conformation capture (Hi-C), which define regions of chromatin that are frequently in close spatial proximity and may reflect enhancer–promoter loops that control proximal or distal genes 130,131 . Other approaches include correlating enhancer and gene activities 132 and performing large-scale experimental perturbation of enhancers 133 , although enhancer–gene catalogues are far from complete. There is still a need for methods that integrate different types of data for probabilistic prioritization of target genes at GWAS loci.

Recently, the development of highly scalable experimental assays for perturbation of the genome has expanded the functional genomics toolkit. These assays include massively parallel reporter assays 134 , which test synthetic regulatory sequences by screening variants in thousands of untranscribed or untranslated sequences for functional effects in a single experiment, and CRISPR techniques that allow for the introduction of mutations into the genome and perturbation of regulatory element activity 133,135 . These approaches are increasingly popular and informative, but substantial work is still needed to improve the scalability and interpretability of the data. Although not restricted to existing genetic variation in linkage disequilibrium, they rely, to a large extent, on cellular model systems that may not always recapitulate cells in vivo. Furthermore, the integration of data from both human populations and experimental perturbations is still in its infancy.

Determining regulatory pathways and cellular effects

Highly polygenic signals from GWAS for any given trait converge on a limited number of biological processes, and the pathway-level effects of genetic variants can be determined and linked to cellular and physiological functions. One approach to achieve this is to test genes identified from GWAS and post-GWAS analyses for convergent functions using tools such as MAGMA 136 and DEPICT 137 . These tools test sets of genes involved in specific biological pathways or linked to specific tissues, cell types, developmental stages or protein networks that are putative, proximal causes of the studied trait for association with that trait. The way gene sets are defined is critical; for example, a randomly chosen set of genes would not be biologically meaningful and sets created based on biological annotations rely on the accuracy of those annotations. We refer readers to a recent resource for defining gene sets 13 . Another approach is to associate genetic variants with molecular changes using trans-molQTL approaches to identify distal genes that are regulated by the GWAS locus. trans-eQTL have been shown to be strongly enriched among GWAS loci and have the potential to pinpoint distal genes regulated by the GWAS locus, although this approach requires molecular data from a large number of samples and the analysis and interpretation can be challenging 122,138 . Finally, experimental perturbation of genes followed by cellular phenotyping is becoming increasingly scalable and informative for interpretation of GWAS loci and genes 139,140 .

Considering the tissue type, cell type or cell state is essential for all functional interpretation work, and particularly important when analysing network effects as genes may have pleiotropic effects across different cellular contexts. For example, tissue-level molecular data can blend cell type-specific signals, further complicating interpretation or masking true signals from rare cell types. Upcoming single-cell and cell type-specific functional genomic data sets 123,141 are therefore likely to advance GWAS interpretation.

Applications

Above, we have described how GWAS can pinpoint statistically associated variants and be used to understand the role of these variants in a biological context. The results of GWAS can also be used for applications such as predicting disease risk and understanding the genetic architecture of traits. We discuss several of these applications of GWAS below.

Risk prediction

PRSs are commonly used to predict the risk of disease in a target cohort using the GWAS summary statistics of an independent discovery cohort (Fig. 4). PRSs can be used to identify individuals at a high risk of disease for clinical interventions and provide additional information over traditional clinical risk scores for stratified screening. They are calculated as weighted sum scores of risk alleles, with weights based on the effect sizes from GWAS 142,143 . There are many methods for computing a PRS; the simplest and most practical method is pruning and thresholding, which involves pruning correlated SNPs and then selecting subsets of the remaining SNPs based on P values of statistical association with the trait 144,145 . More complex methods include those that model the linkage disequilibrium structure, incorporate functional information, weight the results of multiple discovery cohorts in proportion to genome-wide admixture proportions and consider additional types of genomic or functional information; these methods can improve PRS prediction accuracy through improved estimation of marginal effect sizes 146,147,148,149,150,151 . Accuracy of the PRS can be assessed by various metrics, with the choice of metric based on downstream goals and whether the phenotype is continuous or binary. Accuracy measurements can be inflated if the discovery GWAS and the target cohort share individuals. For continuous traits, the phenotypic variance explained by the PRS is typically quantified as a coefficient of determination (R²). When computing effects of PRSs in regression models, covariates such as age, sex and ancestry are typically included, and PRS effects are assessed by comparing the difference in explained variance in two models, which can be written as follows:

$$H_0: \text{phenotype} = \text{covariates} + e$$

$$H_1: \text{phenotype} = \text{PRS} + \text{covariates} + e$$

where H0 represents the model used in the null hypothesis with no effect of the PRS, H1 represents the model used in the alternative hypothesis that does include an effect of PRS on the phenotype and e denotes an error term. Analysis of variance comparing these two models can be performed to determine the phenotypic variance explained specifically by the PRS term and not the other covariates included in the comparison model. For binary traits, pseudo-R² values are typically computed using logistic regression models. To ensure that pseudo-R² values are comparable across studies and scaled appropriately, these are typically interpreted on the liability scale by adjusting for the prevalence of a trait or disease 152,153 . The maximum predictive accuracy of polygenic scores is determined by the SNP-based heritability of the disease (the proportion of phenotypic variance explained by all SNPs), and the performance of PRS analysis depends on the polygenicity of the disease and the magnitude of the effect sizes of causal variants. One of the best-performing PRSs to date has been developed for glaucoma; individuals in the top decile of the score distribution have a 4.2-fold increase in risk compared with the bottom 90% 154 . A commonly used metric for assessing PRS accuracy is the area under the receiver operating characteristic curve (AUC). The AUC quantifies the performance of the models when the aim is to discriminate between two groups. For the best-performing model, a threshold must be set at which to classify individuals as high risk; choosing a threshold is based on weighing the costs and benefits of false positives versus false negatives, and is thus context-specific and often subjective (see ref. 155 for software that can aid in selecting thresholds). Importantly, metrics such as the AUC or pseudo-R² do not necessarily reflect clinical utility 156,157 . A high AUC or odds ratio (the odds of an event given an exposure versus the odds in the absence of an exposure) does not guarantee an enrichment of high-risk individuals in the top percentile of the score distribution 158 ; a study converting odds ratios into other screening performance measures found that, at a 5% false positive rate, the polygenic score for coronary artery disease proposed in a recent study 7 would miss 85% of individuals with the disease. Reclassification measures such as the net reclassification index are more clinically relevant than odds ratios or AUCs and can assess the extent to which polygenic scores improve the reclassification of both patients and controls over existing clinical risk predictors 159,160,161,162 .
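The weighted-sum definition of a PRS and the AUC can both be sketched in a few lines. This is a minimal example with hypothetical variant identifiers, weights and scores; it performs no pruning, thresholding or linkage disequilibrium modelling, and it computes the AUC directly as the probability that a randomly chosen case scores higher than a randomly chosen control.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1 or 2, or imputed values).

    `weights` maps variant id -> GWAS effect size from the discovery cohort.
    Variants without a dosage for this individual are skipped.
    """
    return sum(weight * dosages[variant_id]
               for variant_id, weight in weights.items()
               if variant_id in dosages)


def area_under_curve(case_scores, control_scores):
    """AUC as the fraction of case-control pairs in which the case has the
    higher score, counting ties as one half."""
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical effect sizes and one individual's risk-allele dosages.
weights = {"var1": 0.20, "var2": -0.10, "var3": 0.05}
individual = {"var1": 2, "var2": 1, "var3": 0}
print(polygenic_score(individual, weights))

# Hypothetical scores in a target cohort that is independent of discovery.
print(area_under_curve([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

The pairwise AUC is quadratic in cohort size, which is fine for a sketch; rank-based formulas give the same value more efficiently for large cohorts.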

figure 4

An obstacle to the equitable clinical implementation of PRSs is that their accuracy decays with increasing ancestral distance between the GWAS discovery cohorts and the target cohorts. As most discovery cohorts are of European ancestry, PRS accuracy often diminishes with ancestral distance from European populations 163,164,165 . These disparities are predictable and can be explained by differences across populations in factors such as minor allele frequencies and linkage disequilibrium. Further, subtle population stratification, even within a single population, is known to induce regional biases in the baseline values of PRS estimates 29,166 . Increasing diversity in GWAS discovery cohorts is the most impactful approach for improving PRS accuracy for all populations, with the greatest benefit for populations currently under-represented in GWAS cohorts 167,168 .

The Polygenic Risk Score Reporting Standards 169 and the Polygenic Score Catalog 170 , a database of PRSs, have recently been developed to improve the dissemination of PRSs and encourage their application and translation into clinical care. Such continued standardization of PRS reporting and deposition promises to increase the reproducibility of PRSs in the future.

Understanding trait genetic architecture

Determining the genetic architecture of a trait involves estimating the number of causal variants, their corresponding effect sizes and their frequencies, and allows the estimation of heritability, or the proportion of variation in the trait that can be explained by genetic variation in the population. Modern large-scale human genetics studies commonly estimate heritability in genotyped data sets of unrelated individuals. There are numerous statistical methods and computational tools for quantifying heritability 171 . Approaches are typically delineated into broad-sense heritability (H²), which measures the fraction of phenotypic variation explained by all genetic effects, including additive and non-additive (for example, dominance) effects, and narrow-sense heritability (h²), which considers additive effects only 172 . Population-based methods can estimate SNP-based heritability using individual-level genotype and phenotype data; for example, genome-based restricted maximum likelihood, as implemented in genome-wide complex trait analysis 173 , fits variance component models using a genomic relationship matrix, which allows the regression of the level of phenotypic similarity on the level of genotypic similarity. Alternatively, linkage disequilibrium score regression can be used to estimate SNP-based heritability from GWAS summary statistics and a panel of linkage disequilibrium scores 174 . Importantly, SNP-based heritability only measures the variance explained by additive effects of the genotyped or imputed SNPs. Data discussed in a recent preprint article have highlighted the importance of including rare variants when assessing SNP-based heritability 175 . Indeed, whereas common variants contribute more to SNP-based heritability in a population 176 , rare variants can nevertheless have large effects in individuals 177 . Importantly, regardless of approach, heritability is not a fixed quantity: it varies with age 178 , sex 179 , social factors 180 , phenotype precision and other complex factors. 
Ancestry heterogeneity is also important to consider, as population structure can inflate heritability estimates 181 .
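
The logic of linkage disequilibrium score regression can be sketched in a few lines. Under the standard model, the expected χ² statistic of a SNP increases linearly with its linkage disequilibrium score, with a slope of N × h²/M (N, sample size; M, number of SNPs), so SNP-based heritability can be recovered from the regression slope. The Python sketch below is a deliberately simplified, unweighted version run on noise-free toy values; the function name and inputs are assumptions for illustration, and dedicated software additionally uses regression weights and resampling-based standard errors.

```python
import numpy as np


def ld_score_regression_h2(chi2, ld_scores, n_samples, n_snps):
    """Unweighted regression of chi-square statistics on LD scores.

    Returns (h2_estimate, intercept). The slope of the regression line
    estimates n_samples * h2 / n_snps, so h2 = slope * n_snps / n_samples.
    An intercept above 1 suggests confounding such as population stratification.
    """
    chi2 = np.asarray(chi2, dtype=float)
    ld_scores = np.asarray(ld_scores, dtype=float)
    design = np.column_stack([np.ones_like(ld_scores), ld_scores])
    (intercept, slope), *_ = np.linalg.lstsq(design, chi2, rcond=None)
    return slope * n_snps / n_samples, intercept


if __name__ == "__main__":
    # Toy, noise-free values (hypothetical): true h2 = 0.3, no confounding.
    n_samples, n_snps, true_h2 = 50_000, 1_000, 0.3
    ld = np.linspace(1.0, 200.0, n_snps)
    chi2 = 1.0 + n_samples * true_h2 / n_snps * ld
    print(ld_score_regression_h2(chi2, ld, n_samples, n_snps))
```

Because only per-SNP test statistics and a reference panel of linkage disequilibrium scores enter the calculation, no individual-level data are required.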

Although it is informative to know heritability for a single trait, it is often more useful to understand the genetic relationships between multiple traits, as SNPs are often associated with many, sometimes seemingly unrelated, phenotypes 8,182 . Both linkage disequilibrium score regression and genome-wide complex trait analysis allow the estimation of genetic correlations, or the extent to which genetic variants that account for a trait are also important for another trait, provided that the effects are in the same direction. Tools such as superGNOVA 183 , ρ-HESS 184 and LAVA 185 from a recent preprint article allow the estimation of local correlations, determining which specific genomic regions exert genetic effects on the correlated phenotypes in the same or opposing directions. Genetic correlations should be interpreted in the context of SNP-based heritabilities; for example, if these are low for the respective phenotypes, genetic correlation is not expected to play a major part in explaining why two traits correlate at the phenotypic level. Further, genetic correlation does not provide information about causation between two traits. Indeed, genetic correlation can be caused by vertical pleiotropy, where trait A causes trait B; horizontal pleiotropy, where a variant directly influences two traits; linkage disequilibrium-induced horizontal pleiotropy, where two different variants that are in linkage disequilibrium each influence one of two traits; or polygenicity-induced pleiotropy, where multiple variants influence both traits and the underlying patterns are a mix of the above 186 .
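
The dependence on SNP-based heritability can be made concrete with a standard quantitative-genetic identity: the part of the phenotypic correlation attributable to shared additive genetic effects equals the genetic correlation multiplied by the square root of the product of the two heritabilities. The short Python sketch below (the function name and input values are illustrative assumptions) shows that the same genetic correlation contributes far less to the phenotypic correlation when both heritabilities are low.

```python
import math


def genetic_share_of_phenotypic_correlation(h2_trait_a, h2_trait_b, genetic_correlation):
    """Contribution of shared additive genetic effects to the phenotypic correlation."""
    return math.sqrt(h2_trait_a * h2_trait_b) * genetic_correlation


if __name__ == "__main__":
    # Same genetic correlation (0.8), different SNP-based heritabilities.
    print(genetic_share_of_phenotypic_correlation(0.50, 0.50, 0.8))  # high heritability
    print(genetic_share_of_phenotypic_correlation(0.05, 0.05, 0.8))  # low heritability
```

With heritabilities of 0.5 the genetic contribution is about 0.4, whereas with heritabilities of 0.05 it is about 0.04, even though the genetic correlation is identical.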

Mendelian randomization can be employed to assess causal relations between different phenotypes using GWAS summary statistics 187 . Mendelian randomization is an epidemiological technique that uses genetic variants as instrumental variables acting as proxy measures for an environmental exposure. These techniques can be applied when a randomized controlled trial is not feasible. Although Mendelian randomization is a powerful design, it relies on several strong assumptions: the genetic variants used as instrumental variables need to be associated with the exposure; those genetic variants should not be associated with any confounding variables; and those genetic variants should be associated with the outcome only through their effect on the exposure 188 .
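
With summary statistics, the simplest Mendelian randomization estimators are the Wald ratio for a single variant (the variant-outcome effect divided by the variant-exposure effect) and the inverse-variance-weighted (IVW) combination of these ratios across variants. The Python sketch below implements the fixed-effect IVW estimator with first-order standard errors; the function names and toy effect sizes are assumptions for illustration, and the sketch does not test any of the three instrumental variable assumptions listed above.

```python
import numpy as np


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-variant causal estimate and its first-order standard error."""
    estimate = beta_outcome / beta_exposure
    standard_error = se_outcome / abs(beta_exposure)
    return estimate, standard_error


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted average of per-variant Wald ratios."""
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    se = np.asarray(se_outcome, dtype=float)
    weights = bx**2 / se**2  # inverse variance of each Wald ratio
    estimate = np.sum(weights * (by / bx)) / np.sum(weights)
    standard_error = np.sqrt(1.0 / np.sum(weights))
    return estimate, standard_error


if __name__ == "__main__":
    # Hypothetical variant effects on the exposure and on the outcome.
    bx = np.array([0.10, 0.20, 0.30])
    by = np.array([0.05, 0.11, 0.14])
    se_by = np.array([0.01, 0.01, 0.02])
    print(ivw_estimate(bx, by, se_by))
```

Large differences between the per-variant Wald ratios would hint at violations of the assumptions, for example horizontal pleiotropy, and call for more robust estimators.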

Reproducibility and data deposition

GWAS for most traits require large (>10,000) sample sizes to yield reproducible results. Such sample sizes can only be generated through collaboration and data sharing agreements. Further, reproducible results depend on sound study design and robust methodology. To further the usefulness of GWAS results, a minimum set of statistics needs to be reported. We discuss these considerations below.

Collaboration and data sharing in GWAS

One of the key factors driving the success of GWAS was an early commitment to collaboration and data sharing. In 1997, the Bermuda Principles set out that “all human genomic sequence information, generated by centres funded for large-scale human sequencing, should be freely available and in the public domain”. These principles were reinforced in the 2003 Fort Lauderdale Agreement 189 , which proposed the continued prepublication release of genomic data as a community resource and suggested a system in which funders, data generators and data users all carry responsibility for fostering the responsible sharing of genomic data before publication. Sharing of prepublication genomic data is now a standard condition of funding for genomics research projects. Many genetics consortia and initiatives, such as the Psychiatric Genomics Consortium and the recently formed COVID-19 Host Genetics Initiative 190 , build on these initial agreements and are enabled by the willingness of contributors to share and aggregate data. Attempts at fostering the interoperability of genomic databases through the agreement of shared principles and practices for data governance, for instance through the Global Alliance for Genomics and Health 191 , have strengthened the ability of researchers to share and use publicly available genomic data.

Data protections increasingly rely on specific consent by individuals before data can be shared or used. In the European Union, the General Data Protection Regulation has introduced stringent requirements for de-identification and consent 192 , which complicate the sharing of genomic data both within and between countries. Other jurisdictions, including some in Africa, have likewise moved to increase privacy protections 193 . To address concerns about the impact of data protection legislation on research, researchers globally have argued for the development of codes of conduct for the sharing of genomic data in ways that are aligned with legislated data protection principles 194 . Codes of conduct would encourage data controllers or processors such as genomic research institutes to apply data protection provisions effectively and allow them to demonstrate compliance in a way that promotes national and international transfers of data. To date, the development of such codes of conduct has proven to be time-intensive and resource-intensive, and it is not clear how perceived tensions between privacy concerns and sharing of research data will be adequately resolved. Other potential solutions are the introduction of separate privacy consent forms that specifically cover the use of personal information in research, the preparation of data privacy notices for participants and the completion of data privacy impact assessments for each research project. Several universities across Europe and North America have issued guidance to researchers for the preparation of privacy documents, and templates for data privacy documents are available online.

To foster effective collaboration and to increase the use of genomic data — especially for rare conditions — it is essential that genomic data sets are interoperable. In recent years, steps have been taken to develop the tools and approaches that allow for interoperability. Central to this aim are the FAIR (findability, accessibility, interoperability, reusability) principles for scientific data management and stewardship 195 , which are now a condition of funding for many GWAS.

Data equity

An important ethical challenge in the sharing of genomic data is ensuring fairness for researchers. A key consideration is that data should be shared in a way that affords researchers across the world equal opportunities to analyse and publish results, including researchers in smaller institutions or based in lower-income and middle-income countries 196 . To address these concerns, initiatives such as the Ebola Data Platform and the H3Africa Consortium have identified principles and practices for governing genomics data to advance equity for researchers from lower-resourced countries 197,198 , including solidarity, reciprocity, transparency and trust 199 . Other broader concerns relate to mitigating harmful uses of publicly available data and ensuring public benefit. To address these various concerns, many international genomic research collaborations have turned to the use of governance frameworks. A recent analysis of these initiatives found five key functions of good governance for data sharing, namely that the governance framework enables data access, ensures legal compliance, supports appropriate data use and mitigates harms, promotes equity in the use of genomic data and uses genomic data for public benefit 200 .

In addition to the sharing of individual-level data, there is also an evolution towards the sharing of GWAS summary statistics. Databases such as the GWAS Catalog 110 and GWAS Atlas 8 allow easy access to summary statistics for thousands of traits (Table 3). Access to and use of GWAS summary statistics can further be improved through adoption of universal data formats, such as the recently proposed GWAS-VCF format 201 . Summary statistics should include the genomic build, SNP ID and location, allele, strand information, effect size and associated standard error, P value, test statistics, minor allele frequency and sample size.
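
The reporting checklist above lends itself to a simple automated check before deposition. The Python sketch below tests whether a summary statistics header contains every recommended field and whether a reported P value agrees with the effect size and standard error. The column names are hypothetical (no particular file format, including GWAS-VCF, is implied), 'location' and 'allele' are split into separate columns as an assumption, and the P value check assumes a two-sided Wald test.

```python
import math

# Hypothetical column names covering the recommended fields.
REQUIRED_FIELDS = (
    "genome_build", "snp_id", "chromosome", "position",
    "effect_allele", "other_allele", "strand",
    "effect_size", "standard_error", "p_value", "test_statistic",
    "minor_allele_frequency", "sample_size",
)


def missing_fields(header):
    """Return the recommended fields that are absent from a list of column names."""
    present = {name.strip().lower() for name in header}
    return [field for field in REQUIRED_FIELDS if field not in present]


def two_sided_p(effect_size, standard_error):
    """Two-sided P value of a Wald test, from the normal distribution."""
    z = effect_size / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_value_consistent(effect_size, standard_error, reported_p, rel_tol=0.05):
    """Check that the reported P value matches the one implied by beta and its SE."""
    return math.isclose(two_sided_p(effect_size, standard_error), reported_p, rel_tol=rel_tol)


if __name__ == "__main__":
    header = ["snp_id", "chromosome", "position", "effect_size", "standard_error", "p_value"]
    print("missing:", missing_fields(header))
    print("consistent:", p_value_consistent(0.05, 0.01, 5.7e-7))
```

Checks of this kind catch common deposition problems, such as missing allele or build information, before the files are used in meta-analyses or downstream tools.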

Table 3 Databases of GWAS summary statistics

Preregistration in GWAS

Preregistration of GWAS can improve reproducibility. In preregistration 202 , all analyses, variables, available protocols, data sets and analytic decisions are pre-specified and recorded before the study is conducted to prevent post hoc rationalizing and ‘HARKing’ (hypothesizing after results are known) 203,204 , which could potentially invalidate statistical inferences and inflate type I error rates. Indeed, these practices have contributed to a lack of reproducible results in genetic association studies 205 . Today, GWAS are generally performed in a hypothesis-free manner, and corrected, reported and published regardless of the results; however, post-GWAS analyses have many more researcher degrees of freedom and now weigh more heavily in publication decisions than the mere number of GWAS hits. Hence, there are more incentives and possibilities for questionable research practices 206 and the benefit of preregistration is greater for these analyses. Analysis plans can be uploaded at the Open Science Framework with a preset moratorium. In a format known as registered reports 207 , peer review occurs before data are collected or analysed and is based on the introduction and methods sections alone. As a consequence, publication is conditional on methodological rigour as opposed to results, which aids in attenuating publication bias 208 . In contrast to preregistration, registered reports are submitted to specific journals that offer this scheme (more details can be found at the Open Science Framework Registered Reports resource). Preregistrations and registered reports are mostly used in data-generating research but can also be beneficial for the more common analysis of secondary data 209,210 .

Limitations and optimizations

GWAS have proven to be a highly successful method for identifying trait-associated variants, yet several outstanding methodological challenges still need to be addressed, such as population stratification and high polygenicity. Additionally, GWAS raise a range of ethical issues that require careful consideration, which we discuss below.

Methodological challenges

Population stratification

Although current methods can address unaccounted-for population stratification, it can still cause spurious or biased associations — particularly in the meta-analyses of multiple cohorts 211,212 . Effects are most pronounced in the analyses of polygenic scores that include thousands of SNPs below genome-wide significance 29,213 . Population stratification can occur even in homogeneous populations; for example, studies have uncovered population stratification and related bias in the UK Biobank, which is predominantly composed of white British participants 214,215 . As current methods for correcting the effects of stratification are based on common variants, such as principal component analysis or linear mixed models, they are insufficient when many rare variants are included in the analyses, especially when population stratification is driven by recent demographic changes 26,30 . Family-based association studies 31,50,216 can avoid stratification, although they tend to be underpowered compared with population-based studies. Significant variants can be identified in population-based GWAS and effect sizes re-estimated in family-based studies to try to obtain estimates that are not confounded by population structure 50,51,211,217 . However, this approach cannot completely eliminate population stratification in PRS data if the lead SNPs identified in the original GWAS are correlated with the environment 30,51 . Further work is needed to better correct for population structure in GWAS and associated analyses. Methods based on principal component analysis of rare variants or identity by descent may be appropriate in cases of recently acquired population substructure.
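
The principal component correction mentioned above can be sketched as follows. The Python example simulates two subpopulations that differ in allele frequencies and in mean phenotype (with no true SNP effect), derives principal components from standardized common-variant genotypes, and compares the SNP effect estimated with and without the components as covariates. All function names and simulation settings are assumptions chosen for illustration; as discussed above, components of common variants may fail to capture recent, fine-scale structure.

```python
import numpy as np


def simulate_two_populations(n_per_pop=100, n_snps=200, seed=0):
    """Toy data: allele frequency 0.2 vs 0.8, phenotype shifted by population only."""
    rng = np.random.default_rng(seed)
    label = np.repeat([0, 1], n_per_pop)
    freq = np.where(label[:, None] == 0, 0.2, 0.8) * np.ones((1, n_snps))
    genotypes = rng.binomial(2, freq)  # allele counts 0, 1 or 2
    phenotype = 2.0 * label + rng.normal(size=label.size)
    return genotypes, phenotype, label


def top_principal_components(genotypes, n_components=2):
    """Principal components of the standardized genotype matrix (individuals x SNPs)."""
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0
    keep = (freq > 0.0) & (freq < 1.0)  # drop monomorphic SNPs
    g, freq = g[:, keep], freq[keep]
    z = (g - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def snp_effect(snp, phenotype, covariates=None):
    """Least-squares effect of one SNP on the phenotype, optionally adjusted for covariates."""
    columns = [np.ones(len(phenotype)), np.asarray(snp, dtype=float)]
    if covariates is not None:
        columns.append(np.asarray(covariates, dtype=float))
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, np.asarray(phenotype, dtype=float), rcond=None)
    return beta[1]


if __name__ == "__main__":
    genotypes, phenotype, _ = simulate_two_populations()
    pcs = top_principal_components(genotypes)
    print("unadjusted effect:", snp_effect(genotypes[:, 0], phenotype))
    print("PC-adjusted effect:", snp_effect(genotypes[:, 0], phenotype, pcs))
```

In this toy setting the unadjusted estimate reflects the difference in mean phenotype between the subpopulations rather than any effect of the SNP, and adjusting for the components should pull the estimate towards zero.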

Polygenicity

The extreme polygenicity of many traits 8,11,218,219,220 can pose a challenge when attempting to uncover underlying biological mechanisms, particularly in cases where thousands of variants each have a small effect on a trait 13,221 . To avoid these issues, WES and WGS studies are increasingly being used to discover rare variants of large effect (particularly coding variants from exome sequencing), for which causal mechanisms are generally easier to elucidate 87,222,223,224 . However, rare variants of large effect have not yet been reported for every trait, and looking for convergence of the effects of thousands of variants remains the best strategy for traits not linked to such variants. Further novel methods are needed that address polygenicity and facilitate translating the findings of GWAS into mechanistic insight. High polygenicity also implies that individuals with the same disease may have unique genetic profiles that map distinct biological routes towards the same disease. If genetic heterogeneity is also linked to treatment sensitivity, the development of novel treatments should take this into account. However, as it is mostly unknown how patients should be genetically stratified, this remains an outstanding challenge, with treatments not yet fully tailored to relevant genetic profiles.

Ethical challenges

In addition to the data protection and equity issues discussed in the Reproducibility and data deposition section, GWAS raise ethical issues relating to consent for future use of samples and data, storage and reuse of samples and data, privacy challenges and sharing data with individual participants. Over the past decade, apparent consensus amongst researchers and bioethicists suggests that broad and tiered consent models that seek permission for sample and data storage and unspecified future use are appropriate 225,226,227 . There is also apparent agreement in the research community that individual genetic research results that are medically actionable, robustly associated with the phenotype and predictive of conditions that are unlikely to have been otherwise diagnosed should be fed back to research participants if they consent to receive such results 228,229 , although this may not yet be possible in resource-scarce contexts 230 .

Arguably, the primary ethical challenge facing GWAS today relates to issues of diversity and inclusion, ensuring that GWAS result in fair opportunities to promote health and well-being for all humans regardless of race, gender or geographical location 231,232 . This means, amongst other factors, proactively working to ensure that the samples and data used for GWAS are representative of the global human population and that the genomics workforce is diverse. Equally important is the leadership that indigenous researchers in different parts of the world have shown in designing culturally appropriate approaches to indigenous genomics 233,234 and the real-time tracking of diversity in GWAS 235 .

The increasing research on and clinical use of PRSs raise questions about the communication of risk information 236,237 and about genetic determinism, the perception that traits are unavoidable and unalterable. First, PRSs have been proposed as a means for embryo selection based on GWAS results, which has proved to be highly controversial 238 . Second, genetic determinism may lead to stigma for patients or their family members 239,240 . Robust community engagement, explicit mitigation strategies and a high degree of cultural competence within research teams 234 are imperative to reduce the possibility of stigmatization. Additionally, researchers must not sensationalize their findings or link them to pejorative stereotypes; an example of the latter is linking study findings to a supposed ‘warrior predisposition’ of the Maori 241 .

Finally, two developments raise important ethical challenges: the growth of direct-to-consumer laboratory testing 242 by companies offering genetic risk profiles or genetic ancestry information of sometimes questionable scientific validity 243 , and recruitment practices in which scientists or companies recruit participants via the Internet 244 . These challenges concern the scientific evidence, the quality of the informed consent process, the maintenance of privacy and confidentiality, benefit-sharing arrangements, and social justice and equity. There are few agreed international guidelines or standards for ethical conduct in situations where GWAS and commercial interests are interwoven, and there is a great need for their development.

Outlook

Following the publication of the first GWAS 15 years ago, an impressive number of trait-associated variants have been revealed, along with important insights into biology. Current trends in GWAS include an increasingly interdisciplinary approach, covering statistics, data science, genetics and molecular biology. As sample sizes reach more than 1 million participants and genotyping and sequencing costs reduce, GWAS are increasingly using WES and WGS to allow the identification of rare variants, which could potentially explain much of the missing heritability in complex traits 175,245,246 (however, see ref. 246 for a discussion of potential methodological issues in ref. 175 ). Minimal phenotyping may be a cost-effective and quick way of gaining power 247 and deep phenotyping and item-level analyses 248 are becoming important to further our understanding of distinct symptoms as opposed to diagnoses, which tend to be a collection of symptoms. Finally, the GWAS field is expanding to better represent the global community through the inclusion of under-represented populations.

GWAS could improve on the current low success rates and increasing costs and time required for drug development 249 . Retrospective reviews of drug development projects have shown that studies targeting GWAS disease risk genes were less likely to fail owing to lack of efficacy 250 . Drug discovery efforts have been especially successful when targeting rare variants identified by Mendelian pedigree studies; for example, the indication of an inhibitor of the key cholesterol metabolism regulator PCSK9 for hyperlipidaemia was inspired by the discovery of the rare PCSK9 loss-of-function variant 249 . Identifying drug targets from GWAS results is now a promising area of research. Chemical compounds that directly target the protein products of GWAS risk genes are promising candidates for drug repurposing; for example, CDK4/CDK6 inhibitors for rheumatoid arthritis 251 . Databases such as Open Targets 252 and software such as GREP 253 — which integrate connective networks among GWAS risk genes, compounds and clinical indications — should accelerate the integration of GWAS disease risk genes into drug discovery efforts.

Genetic studies of complex disease may inform the clinical application of therapies. GWAS for measures of treatment responses could allow for the stratification of individuals into responders and non-responders based on genetic factors. Further, integration of multi-omics data and the application of new machine learning approaches to these data sets could improve patient stratification. A push for personalized medicine based on complex disease genetics seems ethically and economically necessary given that even the highest-grossing drugs in the United States benefit only between 1 in 24 and 1 in 4 of the patients who take them 254 .

Lastly, GWAS results are now actively used to direct biomedical science in novel, transdisciplinary collaborations between geneticists and domain-specific molecular biologists. The International Common Disease Alliance has assembled a host of funders and scientists in academia and industry with the aim of using genetic disease maps to gain biological and medical insight into common diseases. Similarly, the goal of the BRAINSCAPES consortium is to bridge the gap between genetics and neurobiology by designing and conducting GWAS-informed functional follow-up studies. The promise of the next 15 years of GWAS is thus to gain biological insight into more refined phenotypes, link genetics to biology, develop genetically informed drug treatments, improve clinical risk prediction and ensure that these have positive impacts for the global community.

References

  1. Visscher, P. M. et al. 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017). This article provides an excellent overview of the main conclusions from 10 years of GWAS and addresses future challenges for the field.
  2. Frayling, T. M. et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889–894 (2007).
  3. Siminovitch, K. A. PTPN22 and autoimmune disease. Nat. Genet. 36, 1248–1249 (2004).
  4. Wang, K. et al. Diverse genome-wide association studies associate the IL12/IL23 pathway with Crohn disease. Am. J. Hum. Genet. 84, 399–405 (2009).
  5. Moschen, A. R., Tilg, H. & Raine, T. IL-12, IL-23 and IL-17 in IBD: immunobiology and therapeutic targeting. Nat. Rev. Gastroenterol. Hepatol. 16, 185–196 (2019).
  6. Benjamin, D. J. et al. The promises and pitfalls of genoeconomics. Annu. Rev. Econ. 4, 627–662 (2012).
  7. Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
  8. Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 51, 1339–1348 (2019). This paper analyses thousands of complex traits to chart the extent of pleiotropy in the human genome, finding trait-associated loci spread across much of the genome, and the majority associated with more than one trait.
  9. Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat. Genet. 50, 1112–1121 (2018).
  10. Jansen, P. R. et al. Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nat. Genet. 51, 394–403 (2019). Together with Lee et al. (2018), this study was the first GWAS to have a sample size >1,000,000.
  11. Holland, D. et al. Beyond SNP heritability: polygenicity and discoverability of phenotypes estimated with a univariate Gaussian mixture model. PLOS Genet. 16, e1008612 (2020).
  12. Slatkin, M. Linkage disequilibrium — understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet. 9, 477–485 (2008).
  13. Uffelmann, E. & Posthuma, D. Emerging methods and resources for biological interrogation of neuropsychiatric polygenic signal. Biol. Psychiatry 89, 41–53 (2021).
  14. Skol, A. D., Scott, L. J., Abecasis, G. R. & Boehnke, M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 38, 209–213 (2006).
  15. Purcell, S., Cherny, S. S. & Sham, P. C. Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19, 149–150 (2003).
  16. Holmes, M. V., Ala-Korpela, M. & Smith, G. D. Mendelian randomization in cardiometabolic disease: challenges in evaluating causality. Nat. Rev. Cardiol. 14, 577–590 (2017).
  17. Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).
  18. Nagai, A. et al. Overview of the BioBank Japan Project: study design and profile. J. Epidemiol. 27, S2–S8 (2017).
  19. Rietveld, C. A. et al. Common genetic variants associated with cognitive performance identified using the proxy-phenotype method. Proc. Natl Acad. Sci. USA 111, 13790–13794 (2014).
  20. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
  21. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
  22. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature 590, 290–299 (2021).
  23. Lam, M. et al. RICOPILI: rapid imputation for COnsortias PIpeLIne. Bioinformatics 36, 930–933 (2020).
  24. Marchini, J., Cardon, L. R., Phillips, M. S. & Donnelly, P. The effects of human population structure on large genetic association studies. Nat. Genet. 36, 512–517 (2004).
  25. Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98–101 (2008).
  26. Lawson, D. J. et al. Is population structure in the genetic biobank era irrelevant, a challenge, or an opportunity? Hum. Genet. 139, 23–41 (2020).
  27. Morris, T. T., Davies, N. M., Hemani, G. & Smith, G. D. Population phenomena inflate genetic associations of complex social traits. Sci. Adv. 6, eaay0328 (2020).
  28. Young, A. I. et al. Relatedness disequilibrium regression estimates heritability without environmental bias. Nat. Genet. 50, 1304–1310 (2018).
  29. Kerminen, S. et al. Geographic variation and bias in the polygenic scores of complex diseases and traits in Finland. Am. J. Hum. Genet. 104, 1169–1181 (2019).
  30. Zaidi, A. A. & Mathieson, I. Demographic history mediates the effect of stratification on polygenic scores. eLife 9, e61548 (2020). This paper investigates the effects of residual population structure on GWAS in simulated populations with different demographic histories and shows that commonly used methods such as principal components of common variants cannot correct for recent population stratification.
  31. Brumpton, B. et al. Avoiding dynastic, assortative mating, and population stratification biases in Mendelian randomization through within-family analyses. Nat. Commun. 11, 3519 (2020).
  32. Lander, E. S. & Schork, N. J. Genetic dissection of complex traits. Science 265, 2037–2048 (1994).
  33. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
  34. Pirinen, M., Donnelly, P. & Spencer, C. C. A. Including known covariates can reduce power to detect genetic effects in case–control studies. Nat. Genet. 44, 848–851 (2012).
  35. Zhou, W. et al. Efficiently controlling for case–control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
  36. Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
  37. Jiang, L. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51, 1749–1755 (2019).
  38. Altshuler, D. & Donnelly, P., The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
  39. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
  40. Baselmans, B. M. L. et al. Multivariate genome-wide analyses of the well-being spectrum. Nat. Genet. 51, 445–451 (2019).
  41. Rangamaran, V. R., Uppili, B., Gopal, D. & Ramalingam, K. EasyQC: tool with interactive user interface for efficient next-generation sequencing data quality control. J. Comput. Biol.25, 1301–1311 (2018). Google Scholar
  42. Winkler, T. W. et al. Quality control and conduct of genome-wide association meta-analyses. Nat. Protoc.9, 1192–1212 (2014). Google Scholar
  43. Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet.89, 82–93 (2011). Google Scholar
  44. Neale, B. M. et al. Testing for an unusual distribution of rare variants. PLoS Genet.7, e1001322 (2011). Google Scholar
  45. Zaitlen, N. et al. Informed conditioning on clinical covariates increases power in case–control association studies. PLoS Genet.8, e1003032 (2012). Google Scholar
  46. Moskvina, V., Holmans, P., Schmidt, K. M. & Craddock, N. Design of case–controls studies with unscreened controls. Ann. Hum. Genet.69, 566–576 (2005). Google Scholar
  47. Pirastu, N. et al. Genetic analyses identify widespread sex-differential participation bias. Nat. Genet.53, 663–671 (2021). Google Scholar
  48. Benyamin, B., Visscher, P. M. & McRae, A. F. Family-based genome-wide association studies. Pharmacogenomics10, 181–190 (2009). Google Scholar
  49. Teng, J. & Risch, N. The relative power of family-based and case–control designs for linkage disequilibrium studies of complex human diseases. II. Individual genotyping. Genome Res. 9, 234–241 (1999).
  50. Mostafavi, H. et al. Variable prediction accuracy of polygenic scores within an ancestry group. eLife 9, e48376 (2020).
  51. Robinson, M. R. et al. Population genetic differentiation of height and body mass index across Europe. Nat. Genet. 47, 1357–1362 (2015).
  52. Purcell, S., Sham, P. & Daly, M. J. Parental phenotypes in family-based association analysis. Am. J. Hum. Genet. 76, 249–259 (2005).
  53. Abecasis, G. R., Cardon, L. R. & Cookson, W. O. C. A general test of association for quantitative traits in nuclear families. Am. J. Hum. Genet. 66, 279–292 (2000).
  54. Fulker, D. W., Cherny, S. S., Sham, P. C. & Hewitt, J. K. Combined linkage and association sib-pair analysis for quantitative traits. Am. J. Hum. Genet. 64, 259–267 (1999).
  55. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
  56. Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).
  57. Kong, A. et al. The nature of nurture: effects of parental genotypes. Science 359, 424–428 (2018). This paper shows for the first time that part of the signal in the GWAS for some traits is from ‘indirect genetic effects’ that act through parents rather than directly on the index individual, and shows how these can be disentangled with family data.
  58. Bates, T. C. et al. The nature of nurture: using a virtual-parent design to test parenting effects on children’s educational attainment in genotyped families. Twin Res. Hum. Genet. 21, 73–83 (2018).
  59. Young, A. I. et al. Mendelian imputation of parental genotypes for genome-wide estimation of direct and indirect genetic effects. Preprint at bioRxiv https://doi.org/10.1101/2020.07.02.185199v1 (2020).
  60. Howe, L. J. et al. Within-sibship GWAS improve estimates of direct genetic effects. Preprint at bioRxiv https://doi.org/10.1101/2021.03.05.433935v1 (2021). This study is the largest within-sibship GWAS to date and illustrates the value of this method for disentangling direct genetic effects from indirect genetic effects and population structure.
  61. Liu, J. Z., Erlich, Y. & Pickrell, J. K. Case–control association mapping by proxy using family history of disease. Nat. Genet. 49, 325–331 (2017).
  62. Hujoel, M. L. A., Gazal, S., Loh, P.-R., Patterson, N. & Price, A. L. Liability threshold modeling of case–control status and family history of disease increases association power. Nat. Genet. 52, 541–547 (2020).
  63. Hatzikotoulas, K., Gilly, A. & Zeggini, E. Using population isolates in genetic association studies. Brief. Funct. Genomics 13, 371–377 (2014).
  64. Xue, Y. et al. Enrichment of low-frequency functional variants revealed by whole-genome sequencing of multiple isolated European populations. Nat. Commun. 8, 15927 (2017).
  65. Chheda, H. et al. Whole-genome view of the consequences of a population bottleneck using 2926 genome sequences from Finland and United Kingdom. Eur. J. Hum. Genet. 25, 477–484 (2017).
  66. Lim, E. T. et al. Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS Genet. 10, e1004494 (2014). This paper gives a good illustration of the value of isolated populations for identifying founder variants of large effect that are rare in other populations.
  67. Service, S. et al. Magnitude and distribution of linkage disequilibrium in population isolates and implications for genome-wide association studies. Nat. Genet. 38, 556–560 (2006).
  68. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).
  69. Palin, K., Campbell, H., Wright, A. F., Wilson, J. F. & Durbin, R. Identity-by-descent-based phasing and imputation in founder populations using graphical models. Genet. Epidemiol. 35, 853–860 (2011).
  70. Glodzik, D. et al. Inference of identity by descent in population isolates and optimal sequencing studies. Eur. J. Hum. Genet. 21, 1140–1145 (2013).
  71. Uricchio, L. H., Chong, J. X., Ross, K. D., Ober, C. & Nicolae, D. L. Accurate imputation of rare and common variants in a founder population from a small number of sequenced individuals. Genet. Epidemiol. 36, 312–319 (2012).
  72. Herzig, A. F. et al. Strategies for phasing and imputation in a population isolate. Genet. Epidemiol. 42, 201–213 (2018).
  73. Zeggini, E., Gloyn, A. L. & Hansen, T. Insights into metabolic disease from studying genetics in isolated populations: stories from Greece to Greenland. Diabetologia 59, 938–941 (2016).
  74. Sidore, C. et al. Genome sequencing elucidates Sardinian genetic architecture and augments association analyses for lipid and blood inflammatory markers. Nat. Genet. 47, 1272–1281 (2015).
  75. Do, R. et al. Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction. Nature 518, 102–106 (2015).
  76. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). This paper describes the production of genetic data for the UK Biobank, which has been widely used in GWAS.
  77. Yengo, L. et al. Meta-analysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry. Hum. Mol. Genet. 27, 3641–3649 (2018).
  78. Astle, W. J. et al. The allelic landscape of human blood cell trait variation and links to common complex disease. Cell 167, 1415–1429.e19 (2016).
  79. Sinnott-Armstrong, N. et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat. Genet. 53, 185–194 (2021).
  80. Hill, W. D. et al. A combined analysis of genetically correlated traits identifies 187 loci and a role for neurogenesis and myelination in intelligence. Mol. Psychiatry 24, 169–181 (2019).
  81. Elliott, L. T. et al. Genome-wide association studies of brain imaging phenotypes in UK Biobank. Nature 562, 210–216 (2018).
  82. Thorp, J. G. et al. Symptom-level modelling unravels the shared genetic architecture of anxiety and depression. Nat. Hum. Behav. https://doi.org/10.1038/s41562-021-01094-9 (2021).
  83. Christophersen, I. E. et al. Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation. Nat. Genet. 49, 946–952 (2017).
  84. Ferreira, M. A. R. et al. Age-of-onset information helps identify 76 genetic variants associated with allergic disease. PLoS Genet. 16, e1008725 (2020).
  85. Purves, K. L. et al. A major role for common genetic variation in anxiety disorders. Mol. Psychiatry https://doi.org/10.1038/s41380-019-0559-1 (2019).
  86. Peterson, R. E. et al. Genome-wide association studies in ancestrally diverse populations: opportunities, methods, pitfalls, and recommendations. Cell 179, 589–603 (2019).
  87. Van Hout, C. V. et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 586, 749–756 (2020).
  88. Watanabe, K., Taskesen, E., van Bochoven, A. & Posthuma, D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017).
  89. Pruim, R. J. et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26, 2336–2337 (2010).
  90. Raychaudhuri, S. Mapping rare and common causal alleles for complex human diseases. Cell 147, 57–69 (2011).
  91. Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
  92. Yang, J. et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44, 369–375 (2012).
  93. Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. Identifying causal variants at loci with multiple signals of association. Genetics 198, 497–508 (2014).
  94. Benner, C. et al. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016).
  95. Kichaev, G. et al. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet. 10, e1004722 (2014).
  96. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Ser. B Stat. Methodol. 82, 1273–1300 (2020).
  97. Durbin, R. M. et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
  98. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
  99. Dendrou, C. A., Petersen, J., Rossjohn, J. & Fugger, L. HLA variation and disease. Nat. Rev. Immunol. 18, 325–339 (2018).
  100. The International HIV Controllers Study. The major genetic determinants of HIV-1 control affect HLA class I peptide presentation. Science 330, 1551–1557 (2010).
  101. Raychaudhuri, S. et al. Five amino acids in three HLA proteins explain most of the association between MHC and seropositive rheumatoid arthritis. Nat. Genet. 44, 291–296 (2012).
  102. Jia, X. et al. Imputing amino acid polymorphisms in human leukocyte antigens. PLoS ONE 8, e64683 (2013).
  103. Zheng, X. et al. HIBAG — HLA genotype imputation with attribute bagging. Pharmacogenomics J. 14, 192–200 (2014).
  104. Dilthey, A. T., Moutsianas, L., Leslie, S. & McVean, G. HLA*IMP — an integrated framework for imputing classical HLA alleles from SNP genotypes. Bioinformatics 27, 968–972 (2011).
  105. Hirata, J. et al. Genetic and phenotypic landscape of the major histocompatibility complex region in the Japanese population. Nat. Genet. 51, 470–480 (2019).
  106. Vukcevic, D. et al. Imputation of KIR types from SNP variation data. Am. J. Hum. Genet. 97, 593–607 (2015).
  107. Yamamoto, K. et al. Genetic and phenotypic landscape of the mitochondrial genome in the Japanese population. Commun. Biol. 3, 1–11 (2020).
  108. Huang, H. et al. Fine-mapping inflammatory bowel disease loci to single variant resolution. Nature 547, 173–178 (2017).
  109. Fachal, L. et al. Fine-mapping of 150 breast cancer risk regions identifies 191 likely target genes. Nat. Genet. 52, 56–73 (2020).
  110. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
  111. Sinnott-Armstrong, N., Naqvi, S., Rivas, M. & Pritchard, J. K. GWAS of three molecular traits highlights core genes and pathways alongside a highly polygenic background. eLife 10, e58615 (2021).
  112. Smemo, S. et al. Obesity-associated variants within FTO form long-range functional connections with IRX3. Nature 507, 371–375 (2014).
  113. Musunuru, K. et al. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466, 714–719 (2010).
  114. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164–e164 (2010).
  115. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
  116. Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
  117. Tak, Y. G. & Farnham, P. J. Making sense of GWAS: using epigenomics and genome engineering to understand the functional relevance of SNPs in non-coding regions of the human genome. Epigenetics Chromatin 8, 57 (2015).
  118. Barbeira, A. N. et al. Exploiting the GTEx resources to decipher the mechanisms at GWAS loci. Genome Biol. 22, 49 (2021).
  119. Nasser, J. et al. Genome-wide enhancer maps link risk variants to disease genes. Nature 593, 238–243 (2021).
  120. Morris, J. A. et al. Discovery of target genes and pathways of blood trait loci using pooled CRISPR screens and single cell RNA sequencing. Preprint at bioRxiv https://doi.org/10.1101/2021.04.07.438882v1 (2021).
  121. Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science 352, 600–604 (2016).
  122. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
  123. van der Wijst, M. et al. The single-cell eQTLGen consortium. eLife 9, e52155 (2020).
  124. Kerimov, N. et al. eQTL Catalogue: a compendium of uniformly processed human gene expression and splicing QTLs. Preprint at bioRxiv https://doi.org/10.1101/2020.01.29.924266v1 (2020).
  125. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
  126. GTEx Consortium et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
  127. Hormozdiari, F. et al. Colocalization of GWAS and eQTL signals detects target genes. Am. J. Hum. Genet. 99, 1245–1260 (2016).
  128. Wen, X., Pique-Regi, R. & Luca, F. Integrating molecular QTL data into genome-wide genetic association analysis: probabilistic assessment of enrichment and colocalization. PLoS Genet. 13, e1006646 (2017).
  129. Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014).
  130. Kleinjan, D. A. & van Heyningen, V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am. J. Hum. Genet. 76, 8–32 (2005).
  131. Greenwald, W. W. et al. Subtle changes in chromatin loop contact propensity are associated with differential gene regulation and expression. Nat. Commun. 10, 1054 (2019).
  132. Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
  133. Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390.e19 (2019).
  134. Mulvey, B., Lagunas, T. & Dougherty, J. D. Massively parallel reporter assays: defining functional psychiatric genetic variants across biological contexts. Biol. Psychiatry https://doi.org/10.1016/j.biopsych.2020.06.011 (2020).
  135. Canver, M. C. et al. BCL11A enhancer dissection by Cas9-mediated in situ saturating mutagenesis. Nature 527, 192–197 (2015).
  136. de Leeuw, C. A., Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 11, e1004219 (2015).
  137. Pers, T. H. et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat. Commun. 6, 5890 (2015).
  138. Võsa, U. et al. Unraveling the polygenic architecture of complex traits using blood eQTL metaanalysis. Preprint at bioRxiv https://doi.org/10.1101/447367 (2018).
  139. Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866.e17 (2016).
  140. Adamson, B. et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 167, 1867–1882.e21 (2016).
  141. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
  142. Choi, S. W., Mak, T. S.-H. & O’Reilly, P. F. Tutorial: a guide to performing polygenic risk score analyses. Nat. Protoc. 15, 2759–2772 (2020).
  143. Martin, A. R., Daly, M. J., Robinson, E. B., Hyman, S. E. & Neale, B. M. Predicting polygenic risk of psychiatric disorders. Biol. Psychiatry 86, 97–109 (2019).
  144. Euesden, J., Lewis, C. M. & O’Reilly, P. F. PRSice: polygenic risk score software. Bioinformatics 31, 1466–1468 (2015).
  145. International Schizophrenia Consortium. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
  146. Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776 (2019).
  147. Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 10, 5086 (2019).
  148. Márquez-Luna, C., Loh, P.-R., South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium & Price, A. L. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet. Epidemiol. 41, 811–823 (2017).
  149. Márquez-Luna, C. et al. Modeling functional enrichment improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Preprint at bioRxiv https://doi.org/10.1101/375337v1 (2018).
  150. Privé, F., Arbel, J. & Vilhjálmsson, B. J. LDpred2: better, faster, stronger. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa1029 (2020).
  151. Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
  152. Lee, S. H., Wray, N. R., Goddard, M. E. & Visscher, P. M. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 88, 294–305 (2011).
  153. Golan, D., Lander, E. S. & Rosset, S. Measuring missing heritability: inferring the contribution of common variants. Proc. Natl Acad. Sci. USA 111, E5272–E5281 (2014).
  154. Craig, J. E. et al. Multitrait analysis of glaucoma identifies new risk loci and enables polygenic prediction of disease susceptibility and progression. Nat. Genet. 52, 160–166 (2020).
  155. López-Ratón, M., Rodríguez-Álvarez, M. X., Cadarso-Suárez, C. & Gude-Sampedro, F. OptimalCutpoints: an R package for selecting optimal cutpoints in diagnostic tests. J. Stat. Softw. 61, 1–36 (2014).
  156. Wald, N. J. & Old, R. The illusion of polygenic disease risk prediction. Genet. Med. 21, 1705–1707 (2019).
  157. Mihaescu, R. et al. Improvement of risk prediction by genomic profiling: reclassification measures versus the area under the receiver operating characteristic curve. Am. J. Epidemiol. 172, 353–361 (2010).
  158. Li, R., Chen, Y., Ritchie, M. D. & Moore, J. H. Electronic health records and polygenic risk scores for predicting disease risk. Nat. Rev. Genet. 21, 493–502 (2020).
  159. Mars, N. et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat. Med. 26, 549–557 (2020).
  160. Riveros-Mckay, F. et al. Integrated polygenic tool substantially enhances coronary artery disease prediction. Circ. Genomic Precis. Med. 14, e003304 (2021). This paper proposes a method to integrate clinical risk scores and PRSs for coronary artery disease and shows the improved predictive accuracy of PRSs over established clinical risk factors in European-ancestry individuals from the UK Biobank.
  161. Sun, L. et al. Polygenic risk scores in cardiovascular risk prediction: a cohort study and modelling analyses. PLoS Med. 18, e1003498 (2021). This paper recalibrated risk prediction models in the UK Biobank to what would be expected in an unbiased UK population to account for the bias caused by UK Biobank participants being healthier and wealthier, which is seldom considered in other studies in this field.
  162. Weale, M. E. et al. Validation of an integrated risk tool, including polygenic risk score, for atherosclerotic cardiovascular disease in multiple ethnicities and ancestries. Am. J. Cardiol. 148, 157–164 (2021). This paper applies the integrated model proposed by Riveros-Mckay et al. (2021) to diverse populations in the UK Biobank and provides the first cross-ancestry validation of the clinical utility of adding polygenic scores into clinical risk tools.
  163. Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 100, 635–649 (2017).
  164. Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
  165. Scutari, M., Mackay, I. & Balding, D. Using genetic distance to infer the accuracy of genomic prediction. PLoS Genet. 12, e1006288 (2016).
  166. Sakaue, S. et al. Functional variants in ADH1B and ALDH2 are non-additively associated with all-cause mortality in Japanese population. Eur. J. Hum. Genet. 28, 378–382 (2020).
  167. Cavazos, T. B. & Witte, J. S. Inclusion of variants discovered from diverse populations improves polygenic risk score transferability. HGG Adv. 2, 100017 (2021).
  168. Lam, M. et al. Comparative genetic architectures of schizophrenia in East Asian and European populations. Nat. Genet. 51, 1670–1678 (2019).
  169. Wand, H. et al. Improving reporting standards for polygenic scores in risk prediction studies. Nature 591, 211–219 (2021).
  170. Lambert, S. A. et al. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat. Genet. 53, 420–425 (2021).
  171. Fisher, R. A. XV. — The correlation between relatives on the supposition of Mendelian inheritance. Earth Environ. Sci. Trans. R. Soc. Edinb. 52, 399–433 (1919).
  172. Falconer, D. S. & Mackay, T. F. C. Introduction to Quantitative Genetics (Pearson, Prentice Hall, 2009).
  173. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
  174. Schizophrenia Working Group of the Psychiatric Genomics Consortium. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
  175. Wainschtein, P. et al. Recovery of trait heritability from whole genome sequence data. Preprint at bioRxiv https://doi.org/10.1101/588020 (2019).
  176. Schoech, A. P. et al. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nat. Commun. 10, 790 (2019).
  177. Bomba, L., Walter, K. & Soranzo, N. The impact of rare and low-frequency genetic variants in common disease. Genome Biol. 18, 77 (2017).
  178. Bergen, S. E., Gardner, C. O. & Kendler, K. S. Age-related changes in heritability of behavioral phenotypes over adolescence and young adulthood: a meta-analysis. Twin Res. Hum. Genet. 10, 423–433 (2007).
  179. Bernabeu, E. et al. Sexual differences in genetic architecture in UK Biobank. Preprint at bioRxiv https://doi.org/10.1101/2020.07.20.211813v1 (2020).
  180. Heath, A. C. et al. Education policy and the heritability of educational attainment. Nature 314, 734–736 (1985).
  181. Browning, S. R. & Browning, B. L. Population structure can inflate SNP-based heritability estimates. Am. J. Hum. Genet. 89, 191–193; author reply 193–195 (2011).
  182. Verbanck, M., Chen, C.-Y., Neale, B. & Do, R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat. Genet. 50, 693–698 (2018).
  183. Zhang, Y. et al. Local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits. Preprint at bioRxiv https://doi.org/10.1101/2020.05.08.084475v1 (2020).
  184. Shi, H., Mancuso, N., Spendlove, S. & Pasaniuc, B. Local genetic correlation gives insights into the shared genetic architecture of complex traits. Am. J. Hum. Genet. 101, 737–751 (2017).
  185. Werme, J., van der Sluis, S., Posthuma, D. & de Leeuw, C. A. LAVA: an integrated framework for local genetic correlation analysis. Preprint at bioRxiv https://doi.org/10.1101/2020.12.31.424652v1 (2021).
  186. Jordan, D. M., Verbanck, M. & Do, R. HOPS: a quantitative score reveals pervasive horizontal pleiotropy in human genetic variation is driven by extreme polygenicity of human traits and diseases. Genome Biol. 20, 222 (2019).
  187. Smith, G. D. & Ebrahim, S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int. J. Epidemiol. 32, 1–22 (2003).
  188. Evans, D. M. & Smith, G. D. Mendelian randomization: new applications in the coming age of hypothesis-free causality. Annu. Rev. Genomics Hum. Genet. 16, 327–350 (2015).
  189. Wellcome Trust. Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility Vol. 6 (Wellcome Trust, 2003).
  190. COVID-19 Host Genetics Initiative. The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic. Eur. J. Hum. Genet. 28, 715–718 (2020). This paper presents the recently established COVID-19 Host Genetics Initiative as a prime example of collaboration and team science, forming within a few months, rapidly aggregating data into a massive resource, rapidly crystallizing results and making it all freely available to academics.
  191. Knoppers, B. M. Framework for responsible sharing of genomic and health-related data. HUGO J. 8, 3 (2014).
  192. Peloquin, D., DiMaio, M., Bierer, B. & Barnes, M. Disruptive and avoidable: GDPR challenges to secondary research uses of data. Eur. J. Hum. Genet. 28, 697–705 (2020).
  193. Staunton, C. et al. Protection of Personal Information Act 2013 and data protection for health research in South Africa. Int. Data Priv. Law 10, 160–179 (2020).
  194. Molnár-Gábor, F. & Korbel, J. O. Genomic data sharing in Europe is stumbling — could a code of conduct prevent its fall? EMBO Mol. Med. 12, e11421 (2020).
  195. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
  196. Bezuidenhout, L. & Chakauya, E. Hidden concerns of sharing research data by low/middle-income country scientists. Glob. Bioeth. Probl. Bioet. 29, 39–54 (2018).
  197. Bull, S. Review: Ensuring global equity in open research. Wellcome Trust https://doi.org/10.6084/M9.FIGSHARE.4055181.V1 (2016).
  198. de Vries, J. et al. The H3Africa policy framework: negotiating fairness in genomics. Trends Genet. 31, 117–119 (2015).
  199. Yakubu, A. et al. Model framework for governance of genomic research and biobanking in Africa — a content description. AAS Open Res. 1, 13 (2018).
  200. O’Doherty, K. C. et al. Toward better governance of human genomic data. Nat. Genet. 53, 2–8 (2021).
  201. Lyon, M. S. et al. The variant call format provides efficient and robust storage of GWAS summary statistics. Genome Biol. 22, 32 (2021).
  202. Nosek, B. A., Ebersole, C. R., DeHaven, A. C. & Mellor, D. T. The preregistration revolution. Proc. Natl Acad. Sci. USA 115, 2600–2606 (2018).
  203. Bosco, F. A., Aguinis, H., Field, J. G., Pierce, C. A. & Dalton, D. R. HARKing’s threat to organizational research: evidence from primary and meta-analytic sources. Pers. Psychol. 69, 709–750 (2016).
  204. Kerr, N. L. HARKing: hypothesizing after the results are known. Personal. Soc. Psychol. Rev. 2, 196–217 (1998).
  205. Colhoun, H. M., McKeigue, P. M. & Smith, G. D. Problems of reporting genetic associations with complex outcomes. Lancet 361, 865–872 (2003).
  206. John, L. K., Loewenstein, G. & Prelec, D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532 (2012).
  207. Chambers, C. D., Feredoes, E., Muthukumaraswamy, S. D. & Etchells, P. J. Instead of ‘playing the game’ it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond. AIMS Neurosci. 1, 4 (2014). This paper introduces the Registered Reports concept, a publishing format in which peer review occurs before data collection and analysis.
  208. Song, F., Hooper, L. & Loke, Y. Publication bias: what is it? How do we measure it? How do we avoid it? Open Access J. Clin. Trials https://doi.org/10.2147/OAJCT.S34419 (2013).
  209. Syed, M. & Donnellan, M. B. Registered reports with developmental and secondary data: some brief observations and introduction to the special issue. Emerg. Adulthood8, 255–258 (2020). Google Scholar
  210. Van den Akker, O. et al. Preregistration of secondary data analysis: a template and tutorial. Preprint at PsyArXivhttps://doi.org/10.31234/osf.io/hvfmr (2019). ArticleGoogle Scholar
  211. Berg, J. J. et al. Reduced signal for polygenic adaptation of height in UK Biobank. eLife8, e39725 (2019). This paper shows that the polygenic selection signal of height in European-ancestry individuals is strongly attenuated when using GWAS summary statistics generated from the UK Biobank rather than the largest GWAS meta-analysis (GIANT consortium). Google Scholar
  212. Refoyo-Martínez, A. et al. How robust are cross-population signatures of polygenic adaptation in humans? Preprint at medRxivhttps://doi.org/10.1101/2020.07.13.200030v2 (2020). ArticleGoogle Scholar
  213. Sohail, M. et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife8, e39702 (2019). Google Scholar
  214. Abdellaoui, A. et al. Genetic correlates of social stratification in Great Britain. Nat. Hum. Behav.3, 1332–1342 (2019). Google Scholar
  215. Haworth, S. et al. Apparent latent structure within the UK Biobank sample has implications for epidemiological analysis. Nat. Commun.10, 333 (2019). ADSGoogle Scholar
  216. Selzam, S. et al. Comparing within- and between-family polygenic score prediction. Am. J. Hum. Genet. 105, 351–363 (2019).
  217. Turchin, M. C. et al. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat. Genet. 44, 1015–1019 (2012).
  218. O’Connor, L. J. et al. Extreme polygenicity of complex traits is explained by negative selection. Am. J. Hum. Genet. 105, 456–476 (2019).
  219. Zeng, J. et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet. 50, 746–753 (2018).
  220. Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
  221. Liu, X., Li, Y. I. & Pritchard, J. K. Trans effects on gene expression can drive omnigenic inheritance. Cell 177, 1022–1034.e6 (2019).
  222. Flannick, J. et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature 570, 71–76 (2019).
  223. Singh, T. et al. The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nat. Genet. 49, 1167–1173 (2017).
  224. Luo, Y. et al. Exploring the genetic architecture of inflammatory bowel disease by whole-genome sequencing identifies association at ADCY7. Nat. Genet. 49, 186–192 (2017).
  225. Tindana, P., Molyneux, S., Bull, S. & Parker, M. ‘It is an entrustment’: broad consent for genomic research and biobanks in sub-Saharan Africa. Dev. World Bioeth. 19, 9–17 (2019).
  226. Fisher, C. B. & Layman, D. M. Genomics, big data, and broad consent: a new ethics frontier for prevention science. Prev. Sci. 19, 871–879 (2018).
  227. Nembaware, V. et al. A framework for tiered informed consent for health genomic research in Africa. Nat. Genet. 51, 1566–1571 (2019).
  228. Weiner, C. Anticipate and communicate: ethical management of incidental and secondary findings in the clinical, research, and direct-to-consumer contexts (December 2013 Report of the Presidential Commission for the Study of Bioethical Issues). Am. J. Epidemiol. 180, 562–564 (2014).
  229. Eckstein, L., Garrett, J. R. & Berkman, B. E. A framework for analyzing the ethics of disclosing genetic research findings. J. Law Med. Ethics 42, 190–207 (2014).
  230. Wonkam, A. & de Vries, J. Returning incidental findings in African genomics research. Nat. Genet. 52, 17–20 (2020).
  231. McGuire, A. L. et al. The road ahead in genetics and genomics. Nat. Rev. Genet. 21, 581–596 (2020).
  232. Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature 538, 161–164 (2016).
  233. Hudson, M. et al. Rights, interests and expectations: Indigenous perspectives on unrestricted access to genomic data. Nat. Rev. Genet. 21, 377–384 (2020).
  234. Claw, K. G. et al. A framework for enhancing ethical genomic research with Indigenous communities. Nat. Commun. 9, 2957 (2018).
  235. Mills, M. C. & Rahal, C. The GWAS Diversity Monitor tracks diversity by disease in real time. Nat. Genet. 52, 242–243 (2020).
  236. Lautenbach, D. M., Christensen, K. D., Sparks, J. A. & Green, R. C. Communicating genetic risk information for common disorders in the era of genomic medicine. Annu. Rev. Genomics Hum. Genet. 14, 491–513 (2013).
  237. Palk, A. C., Dalvie, S., de Vries, J., Martin, A. R. & Stein, D. J. Potential use of clinical polygenic risk scores in psychiatry — ethical implications and communicating high polygenic risk. Philos. Ethics Humanit. Med. 14, 4 (2019).
  238. Regalado, A. Eugenics 2.0: we’re at the dawn of choosing embryos by health, height, and more. MIT Technology Review https://www.technologyreview.com/2017/11/01/105176/eugenics-20-were-at-the-dawn-of-choosing-embryos-by-health-height-and-more/ (2017).
  239. Kong, C., Dunn, M. & Parker, M. Psychiatric genomics and mental health treatment: setting the ethical agenda. Am. J. Bioeth. 17, 3–12 (2017).
  240. de Vries, J., Landouré, G. & Wonkam, A. Stigma in African genomics research: gendered blame, polygamy, ancestry and disease causal beliefs impact on the risk of harm. Soc. Sci. Med. 258, 113091 (2020).
  241. Merriman, T. & Cameron, V. Risk-taking: behind the warrior gene story. N. Z. Med. J. 120, U2440 (2007).
  242. Gronowski, A. M. & Budelier, M. M. The ethics of direct-to-consumer testing. Clin. Lab. Med. 40, 93–103 (2020).
  243. Blell, M. & Hunter, M. A. Direct-to-consumer genetic testing’s red herring: ‘genetic ancestry’ and personalized medicine. Front. Med. 6, 48 (2019).
  244. Rothstein, M. A. et al. Legal and ethical challenges of international direct-to-participant genomic research: conclusions and recommendations. J. Law Med. Ethics 47, 705–731 (2019).
  245. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009). This paper describes the concept of ‘missing heritability’, the observation that heritability estimates from GWAS are much lower than those from twin studies.
  246. Young, A. I. Solving the missing heritability problem. PLoS Genet. 15, e1008222 (2019).
  247. Cai, N. et al. Minimal phenotyping yields genome-wide association signals of low specificity for major depression. Nat. Genet. 52, 437–447 (2020).
  248. Nagel, M., Watanabe, K., Stringer, S., Posthuma, D. & van der Sluis, S. Item-level analyses reveal genetic heterogeneity in neuroticism. Nat. Commun. 9, 1–10 (2018).
  249. Plenge, R. M., Scolnick, E. M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12, 581–594 (2013).
  250. Cook, D. et al. Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. Nat. Rev. Drug Discov. 13, 419–431 (2014).
  251. Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014).
  252. Peat, G. et al. The Open Targets post-GWAS analysis pipeline. Bioinforma. Oxf. Engl. 36, 2936–2937 (2020).
  253. Sakaue, S. & Okada, Y. GREP: genome for REPositioning drugs. Bioinforma. Oxf. Engl. 35, 3821–3823 (2019).
  254. Schork, N. J. Personalized medicine: time for one-person trials. Nature 520, 609–611 (2015).
  255. Abraham, G., Qiu, Y. & Inouye, M. FlashPCA2: principal component analysis of Biobank-scale genotype datasets. Bioinformatics 33, 2776–2778 (2017).
  256. Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).
  257. Howie, B., Marchini, J. & Stephens, M. Genotype imputation with thousands of genomes. G3 1, 457–470 (2011).
  258. Browning, B. L., Zhou, Y. & Browning, S. R. A one-penny imputed genome from next-generation reference panels. Am. J. Hum. Genet. 103, 338–348 (2018).
  259. Scott, L. J. et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316, 1341–1345 (2007).
  260. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39, 906–913 (2007).
  261. Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
  262. Mägi, R. & Morris, A. P. GWAMA: software for genome-wide association meta-analysis. BMC Bioinforma. 11, 288 (2010).
  263. Delaneau, O. et al. A complete tool set for molecular QTL discovery and analysis. Nat. Commun. 8, 15452 (2017).
  264. Speed, D. & Balding, D. J. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat. Genet. 51, 277–284 (2019).
  265. Grotzinger, A. D. et al. Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat. Hum. Behav. 3, 513–525 (2019).
  266. Burgess, S. et al. Using published data in Mendelian randomization: a blueprint for efficient identification of causal risk factors. Eur. J. Epidemiol. 30, 543–552 (2015).
  267. Kanai, M. et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat. Genet. 50, 390–400 (2018).
  268. Chen, Z. et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol. 40, 1652–1666 (2011).
  269. Finer, S. et al. Cohort Profile: East London Genes & Health (ELGH), a community-based population genomics and health study in British Bangladeshi and British Pakistani people. Int. J. Epidemiol. 49, 20–21i (2020).
  270. The H3Africa Consortium. Enabling the genomic revolution in Africa. Science 344, 1346–1348 (2014).
  271. Giri, A. et al. Trans-ethnic association study of blood pressure determinants in over 750,000 individuals. Nat. Genet. 51, 51–62 (2019).
  272. All of Us Research Program Investigators. The ‘All of Us’ Research Program. N. Engl. J. Med. 381, 668–676 (2019).
  273. Canela-Xandri, O., Rawlik, K. & Tenesa, A. An atlas of genetic associations in UK Biobank. Nat. Genet. 50, 1593–1599 (2018).

Acknowledgements

D.P. is supported by Netherlands Organization for Scientific Research (NWO) grant VICI 435-14-005, the NWO Gravitation project BRAINSCAPES: A Roadmap from Neurogenetics to Neurobiology (024.004.012) and European Research Council advanced grant ERC-2018-ADG 834057. N.S.M. is supported by National Institutes of Health (NIH) grant U24HL135600. J.d.V. is supported by NIH grant U54HG009790 and Wellcome Trust grant 219600/Z/19/Z. Y.O. is supported by Japan Society for the Promotion of Science (JSPS) KAKENHI grants 19H01021 and 20K21834 and Japan Agency for Medical Research and Development (AMED) grants JP20km0405211, JP20ek0109413, JP20ek0410075, JP20gm4010006 and JP20km0405217. T.L. is supported by NIH grants R01GM122924, R01HL142028, 1R01AG057422, 1UM1HG008901 and R01MH106842. H.C.M. is supported by a Wellcome Trust core grant to the Sanger Institute (098051).

Author information

Authors and Affiliations

  1. Department of Complex Trait Genetics, Center for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit Amsterdam, Amsterdam, Netherlands Emil Uffelmann & Danielle Posthuma
  2. Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK Qin Qin Huang & Hilary C. Martin
  3. Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa Nchangwi Syntia Munung & Jantina de Vries
  4. Department of Statistical Genetics, Osaka University Graduate School of Medicine, Osaka, Japan Yukinori Okada
  5. Laboratory of Statistical Immunology, Immunology Frontier Research Center (WPI-IFReC), Osaka University, Osaka, Japan Yukinori Okada
  6. Analytic & Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA Alicia R. Martin
  7. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA Alicia R. Martin
  8. Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA Alicia R. Martin
  9. New York Genome Center, New York, NY, USA Tuuli Lappalainen
  10. Department of Systems Biology, Columbia University, New York, NY, USA Tuuli Lappalainen
  11. Department of Child and Adolescent Psychiatry and Pediatric Psychology, Section Complex Trait Genetics, Amsterdam Neuroscience, Vrije Universiteit Medical Center, Amsterdam, Netherlands Danielle Posthuma
  12. Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden Tuuli Lappalainen
  1. Emil Uffelmann