Single nucleotide polymorphisms (SNPs)
Introduction
The DNA of the human genome contains three billion base pairs made up of the four DNA bases: adenine, thymine, guanine and cytosine. While our genetic make-up is 99.9% the same, small differences in our DNA can predispose us to different diseases and make us respond to medicines differently. We call these differences in our DNA single nucleotide polymorphisms (SNPs, pronounced “snips”). A SNP is a specific location in our DNA where different people have different DNA bases. For example, at a specific point in your DNA you may have the DNA base cytosine (C) and another person may have the DNA base thymine (T). If you possess two copies of C or two copies of T at this location, one on each chromosome of the pair, you are homozygous. If you possess a C and a T at this location, you are heterozygous.
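The homozygous/heterozygous distinction can be sketched in a few lines of Python. This is only an illustration: the function name zygosity and the representation of a genotype as two single-letter bases are assumptions made here, not part of any standard tool.

```python
def zygosity(allele_a: str, allele_b: str) -> str:
    """Classify the genotype at one SNP from the two alleles a person carries.

    Each argument is the base found on one chromosome of the pair
    (hypothetical representation chosen for this sketch).
    """
    valid_bases = {"A", "C", "G", "T"}
    a, b = allele_a.upper(), allele_b.upper()
    if a not in valid_bases or b not in valid_bases:
        raise ValueError(f"unexpected base(s): {allele_a!r}, {allele_b!r}")
    # Same base on both chromosomes: homozygous; different bases: heterozygous.
    return "homozygous" if a == b else "heterozygous"


print(zygosity("C", "C"))  # two copies of C
print(zygosity("T", "T"))  # two copies of T
print(zygosity("C", "T"))  # one C and one T
```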
SNPs are the most common type of difference in our DNA: about 9 million SNPs have been identified. The majority of SNPs are thought to be biologically “silent”, that is, they do not affect gene function or inherited traits. Some SNPs may affect gene expression in disease situations, or be present in the gene itself and affect protein function. If we find the key to reveal the “SNP secret”, we can predict and diagnose diseases more accurately, discover new medicines, and identify the patients likely to benefit from particular medicines.
Polymorphisms and diseases
The human population has relatively limited genetic diversity, reflecting its young age and historically small size. Many rare genetic variants exist in the human population, but most of the heterozygosity in the population is attributable to common alleles (that is, those that are present at a frequency of >1% in the general population). The infrequent variants include the primary causes of rare, Mendelian genetic diseases, with these alleles typically being recent in origin and highly penetrant. By contrast, some authors have recently hypothesized that the common variants may contribute significantly to genetic risk for common disease. If this common disease-common variant (CD-CV) hypothesis is true, it permits a conceptually straightforward approach to identifying disease-causing mutations: build a comprehensive catalogue of the limited number of common gene mutations in the human population and test them directly for association with clinical phenotypes. Such an approach is possible owing to the progress of the Human Genome Project in identifying genes and to advances in the technology for discovering and typing DNA sequence variants. Until now, it has been difficult to compare variation among classes of sites within genes, among genes and between populations, owing to the small sample sizes and to differences in the populations studied. To define the nature of variation in human genes, as well as to provide a catalogue of gene polymorphisms for association studies, we performed an extensive survey of coding sequence diversity of many genes in many individuals.
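Why common alleles account for most heterozygosity can be seen from a small calculation. The sketch below assumes a site with two alleles in Hardy-Weinberg proportions, in which case the expected fraction of heterozygotes is 2p(1 - p) for allele frequency p. The function names and the example frequencies are illustrative assumptions, not values from the survey described above.

```python
def expected_heterozygosity(allele_freq: float) -> float:
    """Expected heterozygote fraction 2p(1 - p) at a two-allele site.

    Assumes Hardy-Weinberg proportions (an assumption of this sketch).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 2.0 * allele_freq * (1.0 - allele_freq)


def is_common(allele_freq: float, threshold: float = 0.01) -> bool:
    """Apply the >1% frequency definition of a common allele."""
    return allele_freq > threshold


# Illustrative frequencies only, from a rare variant to a common one.
for freq in (0.001, 0.01, 0.10, 0.30):
    label = "common" if is_common(freq) else "rare"
    print(f"p = {freq:.3f} ({label}): heterozygosity = "
          f"{expected_heterozygosity(freq):.4f}")
```

Under these assumptions a variant at a frequency of 0.1% makes about 0.2% of people heterozygous at that site, whereas a variant at 30% makes 42% of people heterozygous.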
Using SNP maps to find disease genes
The frequency, stability, and relatively even distribution of SNPs in the genome make them particularly valuable as genetic markers. A high-density SNP map, in which the position of SNPs on the genome is identified, can be used in the search for disease susceptibility genes. Where particular SNP variants are close to a susceptibility gene allele, they will tend to be inherited together over many generations. SNPs that frequently differ in individuals with a disease compared with individuals without the disease act as beacons to tell us that a disease susceptibility gene may be nearby. An increased knowledge of the genetic basis of common diseases will lead to a better understanding of how these diseases develop and progress. This understanding will help us to identify new ways for medicines to affect disease progression, which will lead to the development of new medicines that act on the cause of disease to cure, treat or prevent it.
A high-density SNP map of the human genome will enable us to identify SNP variations that differ in patients who have a certain response when given a medicine. These SNPs could be used as part of a medicine response test to identify patients likely to benefit or experience a specific side effect. In this way, healthcare providers will be able to provide a greater degree of personalized medicine by prescribing medicines based on a patient’s predicted response.
Association studies
The Human Genome Project is a project to determine the complete sequence of human DNA. One of the fruits of this project is the discovery of millions of DNA sequence variants in the human genome. The majority of these variants are SNPs. The SNP Consortium (a consortium of pharmaceutical companies, technology companies, academic centers and the Wellcome Trust) is producing an ordered high-density SNP map of the human genome that is being placed in the public domain. There were over 9 million SNPs in this map as of 2005. The use of SNPs in small areas of the genome has been shown to rapidly narrow the search for disease susceptibility genes. With a thousand DNA samples in a typical study, each typed for around 100,000 to 300,000 SNPs, rapid read-out technology must be available to genotype millions of SNPs cost-effectively and reproducibly. Then we will need to correlate patients' genotypes (SNPs) with their phenotypes (clinical measurements), which requires complex statistical analysis software.
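To give a feel for the scale and for the simplest form such a statistical analysis can take, the sketch below first multiplies out the study size quoted above and then compares allele counts between cases and controls at one SNP with a Pearson chi-square test on a 2x2 table (one degree of freedom, no continuity correction). The allele counts are invented for illustration, and the function name is an assumption. A real study would also need to correct for testing many SNPs.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 d.f.) on a 2x2 table of allele counts.

    Rows are cases/controls, columns are the two alleles of one SNP.
    Returns (chi_square, p_value).
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if min(margins) == 0:
        raise ValueError("every row and column needs a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / math.prod(margins)
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Scale of a typical study as described in the text.
samples = 1_000
for snps_per_sample in (100_000, 300_000):
    print(f"{samples} samples x {snps_per_sample} SNPs = "
          f"{samples * snps_per_sample:,} genotypes")

# Hypothetical allele counts at one SNP (made up for this example):
# cases carry 120 T and 80 C alleles, controls carry 90 T and 110 C.
chi2, p = allelic_chi_square(120, 80, 90, 110)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```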
SNP-based association studies can be performed in two ways: direct testing of an SNP with functional consequence for association with a disease trait, or using an SNP as a marker for linkage disequilibrium (LD). LD is generally defined as a measure of the degree of association (co-segregation) of two genetic markers and can thus be used to identify those regions of the genome associated with the disease, allowing the subsequent identification of the adjacent causative gene. This is analogous to the use of linkage analysis to identify disease-related genes in families. However, due to the limited number of generations in a family study, and consequently a limited number of recombination events, linkage can be detected over large genetic distances in pedigrees. Approximately 300 highly informative simple tandem repeat markers evenly spaced across the human genome (about 1 every 10 cM) are typically required to localize the gene responsible for a monogenic disorder. Conversely, LD in populations extends over far shorter distances due to erosion of inter-marker association as a result of recombination over successive generations. Furthermore, because of the extensive time periods involved relative to the three- or four-generation pedigrees used in a linkage study, LD in populations can reflect not only recombination, but also new mutation events and genetic drift.
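The degree of association between two markers is commonly summarized by the coefficients D, D' and r². The sketch below computes them from counts of the four two-marker haplotypes, and also shows the textbook decay of D by a factor of (1 - θ) per generation, where θ is the recombination fraction between the markers. That decay model assumes random mating and ignores drift and mutation. The function names and all counts are assumptions made for illustration.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r_squared) from two-marker haplotype counts.

    A/a are the alleles at the first marker, B/b at the second.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both markers must be polymorphic")
    # D: observed AB haplotype frequency minus that expected if independent.
    d = n_AB / total - p_A * p_B
    # D' rescales D by the largest magnitude possible at these frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r_squared


def ld_after_generations(d_initial, recombination_fraction, generations):
    """Expected D after a number of generations of random mating."""
    return d_initial * (1.0 - recombination_fraction) ** generations


# Hypothetical haplotype counts (made up for this example).
d, d_prime, r2 = ld_statistics(40, 10, 10, 40)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r2:.3f}")

# Erosion of the same D over time for markers with theta = 0.01.
for generations in (3, 100, 1000):
    print(generations, f"{ld_after_generations(d, 0.01, generations):.5f}")
```

With these assumptions, three or four generations (a pedigree) leave D almost unchanged, while hundreds of generations (a population) erode most of it, which is the contrast drawn in the paragraph above.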
The question of the number of markers required for an LD genome scan to identify genes associated with complex disease has been hotly debated. With a detailed knowledge of the structure of the human genome likely to result from the current sequencing efforts, it should be possible to reduce the number of markers required for an LD genome scan by concentrating initially on those areas that are rich in genes. A precise knowledge of the degree and pattern of fluctuation of recombination frequency across the genome would also allow us to distribute markers in an intelligent fashion and possibly reduce further the number of markers required for a genome scan. Simple comparison of the genetic and physical maps for different genomic regions reveals large variations in the ratio of physical to genetic map distance, indicating wide differences in the levels of recombination in different parts of the genome.
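The comparison of genetic and physical maps mentioned above amounts to a simple ratio per region. The following sketch computes the physical-to-genetic ratio (Mb per cM) for two regions. The region names and distances are invented purely to show the calculation and do not describe real loci.

```python
def physical_to_genetic_ratio(physical_mb: float, genetic_cm: float) -> float:
    """Megabases of DNA per centimorgan of genetic distance in a region.

    A high value means little recombination per unit of DNA; a low value
    marks a region where recombination is frequent.
    """
    if physical_mb <= 0 or genetic_cm <= 0:
        raise ValueError("both distances must be positive")
    return physical_mb / genetic_cm


# Hypothetical regions: (physical length in Mb, genetic length in cM).
regions = {
    "region_1": (5.0, 2.5),
    "region_2": (5.0, 10.0),
}

for name, (mb, cm) in regions.items():
    ratio = physical_to_genetic_ratio(mb, cm)
    print(f"{name}: {ratio:.2f} Mb per cM")
```

In a region like the hypothetical region_2, where recombination is frequent, LD would be expected to extend over less DNA, so markers would need to be placed more densely there than in region_1.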
Approach to study
Candidate gene analysis, where a gene is selected on the basis of biological function and tested for association with the disease phenotype, is a practical method that can be used to identify genes with a role in complex disease. A more pragmatic approach is to use linkage analysis to identify tentatively linked regions in families and then extend the analysis to the relevant population, simultaneously increasing confidence in the linkage and narrowing the critical interval harbouring the culprit gene. However, this approach also has limitations, as the initial linkage analysis is unlikely to detect small effects, even at low confidence levels.