Advancing Knowledge for a Better Tomorrow

August 2023

What is a GWAS?

A real breakthrough came in 2006 with the advent of Genome-wide Association Studies (GWAS). This is based on an insight that dates right back to the inventor of the heritability statistic, R.A. Fisher. To resolve the impasse created by the failure of candidate gene studies, researchers had to dig back to theoretical fundamentals. The model that Fisher originally developed for genes treated them as independent additive effects, each of tiny size. He thought that most traits are polygenic, with many genes contributing to the outcome, as opposed to the more modest number of single-gene traits like Mendelian disorders. It was a deliberate simplification, but it turned out to be extremely useful in practice as well as in theory.

The GWAS approach fits a separate statistical model for each of a very large number of genes to predict a trait, because there are far too many genes to model them jointly. Unlike candidate gene studies, the method is hypothesis-free and does not prejudge which genes will turn out to matter. This allows estimation of the (usually tiny) effect that each gene has independently on the outcome, together with the statistical significance of the association and related statistics. By trawling through large gene sets and samples of enormous numbers of individuals in this way, many associations are found.
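The one-model-per-variant idea can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not a real GWAS pipeline: the function name `fit_snp`, the sample sizes, the allele frequency of 0.3 and the choice of one "causal" variant are all assumptions made for the example, and real analyses also adjust for covariates such as ancestry.

```python
import math
import random


def fit_snp(allele_counts, trait):
    """Fit trait = intercept + beta * allele_count by least squares for one variant.

    Returns (beta, standard_error, z), or None if nobody varies at this position.
    """
    n = len(trait)
    mean_g = sum(allele_counts) / n
    mean_y = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in allele_counts)
    if sxx == 0:
        return None  # no variation, so no slope can be estimated
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(allele_counts, trait))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(allele_counts, trait))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    z = beta / std_err if std_err > 0 else math.copysign(float("inf"), beta)
    return beta, std_err, z


# Simulated data (all values are assumptions for illustration).
rng = random.Random(0)
n_people, n_snps, causal = 1000, 20, 3

# Each person carries 0, 1 or 2 copies of the effect allele at each variant.
genotypes = [
    [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_people)]
    for _ in range(n_snps)
]
# Only the variant at index `causal` influences the simulated trait.
trait = [0.5 * genotypes[causal][i] + rng.gauss(0, 1) for i in range(n_people)]

# One separate model per variant, never a joint model.
results = [fit_snp(snp, trait) for snp in genotypes]

scored = {i: r for i, r in enumerate(results) if r is not None}
top = max(scored, key=lambda i: abs(scored[i][2]))
print("strongest association at variant index", top)
```

Each variant gets its own effect estimate, standard error and test statistic, which is what allows the method to scan very large numbers of variants without a prior hypothesis about any of them.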

First, some loose usage of ‘gene’ here must be clarified. Strictly speaking, genes are regions of the genome which code for proteins. Most of the genome does not directly code for proteins, but is still important in other ways, for instance by regulating actions of protein-coding genes, with consequences downstream which are still poorly understood. Below we will often refer to Single Nucleotide Polymorphisms (SNPs) instead. These are single positions within the genome. Genes are made up of sequences of such positions, but not all SNPs lie within genes. These single positions can take alternative values, which are also referred to as ‘alleles’ (as for gene variants). When we associate a SNP variant with an outcome, we are glossing over the fact that we don’t necessarily know how that happens physiologically (but that is also true for many genes).

GWAS permits at least two kinds of inference about SNPs and traits: causal and predictive. Causal inference is important when searching for interventions, such as drugs to target specific conditions. If it is highly likely that a specific SNP causes, say, a disorder, then the physiological effects of that SNP, once those are determined, can provide targets for treating the condition through drugs. In some cases, direct genome editing is even possible in living, fully developed subjects, as we will see later. However, it is easy to find associations involving SNPs that are not causal, because positions that are physically close together in the genome are linked: if the value at one position changes, the other tends to change too, so they travel together. The closer they are, the more likely they are to be linked (cumbersomely known as ‘linkage disequilibrium’, because they do not assort independently after sexual mating). Causal inference also presupposes a high level of confidence that an association found is not a spurious by-product of the very large number of SNPs tested. By controlling this false discovery rate using statistical means, say by requiring stringent levels of ‘statistical significance’, this possibility can be reduced and confidence raised.
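To see what a stringent significance level means in practice, here is a small Python sketch using a Bonferroni correction, which divides the usual 0.05 threshold by the number of tests. Note the swap: Bonferroni strictly controls the chance of any false positive rather than the false discovery rate, and it is used here only because it is the simplest correction to show. The z statistics are made up for illustration. The same logic underlies the conventional genome-wide threshold of 5e-8, which corresponds to 0.05 divided by roughly one million independent tests.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_hits(z_scores, alpha=0.05):
    """Indices of tests whose p-value survives dividing alpha by the number of tests."""
    threshold = alpha / len(z_scores)
    return [i for i, z in enumerate(z_scores) if two_sided_p(z) < threshold]


# Made-up test statistics for five SNPs (illustrative only).
z_scores = [0.3, -1.9, 2.5, 6.1, -0.7]
hits = bonferroni_hits(z_scores)
print("SNP indices passing the corrected threshold:", hits)
```

With five tests the threshold becomes 0.01, so the SNP with z = 2.5, which would pass an uncorrected 0.05 test, no longer counts as a finding; only the very strong association survives.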

Predictive inference doesn’t care about causality per se, and only attempts to explain the outcome. The SNPs found may be merely associated with (along for the ride with, so to speak) the hidden SNPs actually doing the work, which may not have been specifically identified, and that is fine. The SNPs that have been found are acting as proxies, and still enable explanation of the outcome.

For predictive purposes, the numerous SNPs involved in a GWAS can be combined into a polygenic score (PGS). In the simplest case, such scores can be created, for any particular person (or animal), by adding up the SNP effect sizes found by the models, given the variants actually carried by that individual. This composite score can then be used as a predictor for the outcome, possibly with other predictors in a joint model.
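The simplest form of such a score can be sketched in Python as a weighted sum. The SNP names, effect sizes and allele counts below are invented for illustration, and the function name `polygenic_score` is just a label for this example rather than any library's API.

```python
def polygenic_score(effect_sizes, allele_counts):
    """Add up per-SNP effect sizes, each weighted by how many effect alleles are carried.

    SNPs the person has data for but that have no estimated effect are ignored.
    """
    return sum(
        effect_sizes[snp] * count
        for snp, count in allele_counts.items()
        if snp in effect_sizes
    )


# Hypothetical effect sizes, as might come out of the per-SNP models.
effect_sizes = {"snp_a": 0.02, "snp_b": -0.01, "snp_c": 0.005}

# Hypothetical person: number of effect alleles (0, 1 or 2) carried at each SNP.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_score(effect_sizes, person)
print(f"polygenic score: {score:.3f}")
```

Here the score is 2 × 0.02 + 1 × (−0.01) + 0 × 0.005, or about 0.03. On its own the number means little; it becomes useful when compared across individuals or entered as a predictor alongside others in a joint model.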
