Practical 3: Genetics Practical (GD5302)

Due date: Tuesday 16th April 2024 (Week 13) at 12:00 (midday)

This accounts for 25% of the coursework grade. Please note that the Module Management System (MMS) is definitive for weighting and deadlines, which occasionally have to be changed.

Genome-Wide Association Studies in Type 2 Diabetes Mellitus

Objective

The objective of this task is to familiarize students with Genome-Wide Association Study (GWAS) genotype data. Students will perform Quality Control (QC), conduct Principal Component Analysis (PCA) to account for population structure, run association tests for a simulated binary phenotype, and build genotypes for a set of single nucleotide polymorphisms (SNPs).

Background

A research team investigating genetic data related to Type 2 Diabetes Mellitus (T2DM) experienced a hard disk drive failure on their High Performance Computing (HPC) system. Fortunately, they had a backup, which allowed them to retrieve some of the lost data. However, instead of obtaining a single file, they found several versions of the same dataset with similar timestamps. The research team had a set of population structure and association plots from an earlier analysis that they could use as ground truth. Uncertain about the integrity of each version, they tasked their data scientists with determining whether running a standard GWAS pipeline would yield consistent results across the recovered datasets.

Dataset

  1. Genotype data (PLINK format with three files: .bed, .bim, .fam) from phase 3 of the 1000 Genomes Project, available at the path /scratch/bioinf/gd5302/coursework/student_datasets/USERNAME.
  2. Simulated binary phenotype for T2DM: *_phenotype.txt.
  3. Sample annotation file: integrated_call_samples_v3.20130502.ALL.panel.
  4. File with the regions of high levels of pairwise linkage disequilibrium (LD): high-ld-hg19.txt.

Tasks

  1. Quality Control (QC):
    • Use PLINK software to perform QC on the genotype data.
    • Include steps such as removing low-quality variants, filtering based on Minor Allele Frequency (MAF), and excluding individuals with high missingness or relatedness.
    • Reuse the scripts (02_data_qc) from the teaching practicals.
    • How many samples (people) and variants are available at the first QC step? Tip: use the output .log file.
    • How many samples (people) and variants are available at the last QC step? Tip: use the output .log file.
  2. Population Structure Analysis:
    • Conduct Principal Component Analysis (PCA) using PLINK software and display the PCA plot. Add a caption to the plot describing X-axis and Y-axis.
    • Reuse the scripts (03_pca) from the teaching practicals to visualize population structure by plotting the first two principal components (PCs) and assessing clustering patterns.
    • Which populations (pop) are represented in the study?
    • How does the plot compare with the one from the practicals? Speculate what could have gone wrong.
    Population Code   Description                           Super Population Code
    CHB               Han Chinese in Beijing, China         EAS
    JPT               Japanese in Tokyo, Japan              EAS
    CHS               Southern Han Chinese                  EAS
    CDX               Chinese Dai in Xishuangbanna, China   EAS
    KHV               Kinh in Ho Chi Minh City, Vietnam     EAS
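
For the population question in task 2, one way to see which populations are present is to look up each sample ID from the PCA output in the sample annotation file. The sketch below is a minimal, hedged example in Python and is not part of the teaching scripts. It assumes the panel file is tab-separated with columns named sample, pop and super_pop, and that the PLINK .eigenvec file lists FID and IID in its first two columns. The function names and the tiny inline inputs are made up for illustration.

```python
import csv
import io
from collections import Counter


def load_panel(handle):
    """Map sample ID -> (pop, super_pop) from a tab-separated panel file.

    Assumes a header row with columns named sample, pop and super_pop.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: (row["pop"], row["super_pop"]) for row in reader}


def count_populations(eigenvec_handle, panel):
    """Count samples per population among the IIDs of a PLINK .eigenvec file."""
    counts = Counter()
    for line in eigenvec_handle:
        fields = line.split()
        # Skip blank lines and an optional header row.
        if not fields or fields[0] in ("FID", "#FID"):
            continue
        iid = fields[1]  # assumed layout: FID IID PC1 PC2 ...
        pop, _super_pop = panel.get(iid, ("unknown", "unknown"))
        counts[pop] += 1
    return counts


# Made-up inputs; the real files live in your dataset directory.
panel_text = (
    "sample\tpop\tsuper_pop\tgender\n"
    "S1\tCHB\tEAS\tmale\n"
    "S2\tJPT\tEAS\tfemale\n"
    "S3\tCHB\tEAS\tfemale\n"
)
eigenvec_text = "S1 S1 0.01 0.02\nS2 S2 -0.03 0.04\nS3 S3 0.02 -0.01\n"

panel = load_panel(io.StringIO(panel_text))
pop_counts = count_populations(io.StringIO(eigenvec_text), panel)
print(dict(pop_counts))
```

With real data, replace the io.StringIO objects with open(...) calls on the panel and .eigenvec files. Samples missing from the panel are counted under "unknown", which can itself be a useful integrity check.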
  3. Association testing:
    • Perform an association test using the T2DM binary phenotype, correcting for population structure (as a covariate).
    • Reuse the scripts (04_association_test) from the teaching practicals.
    • How many SNPs (ID) have a p-value (P) below the threshold \(P < 5 \times 10^{-8}\)? Tip: use the slurm-out-manhattan-plot.txt
    • Which SNP (ID) has the lowest p-value (P)? Tip: use the slurm-out-manhattan-plot.txt
    • Display the Manhattan plot and add a caption describing X-axis and Y-axis.
    • How does the plot compare with the one from the practicals? Speculate what could have gone wrong.
  4. Extract SNP genotypes:
    • Build the genotypes for the following SNP: 19:47137162:C:A.
    • Reuse the scripts (05_genotype) from the teaching practicals.
    • How many alleles are in each SNP genotype? Tip: use the slurm-out-plot_genotypes.txt
    • Display the distribution plots for each SNP and add a caption describing X-axis and Y-axis.
  5. Describe the working environment of the analysis:
    • Using a maximum of 500 words, describe the set of tools and resources that allowed you to perform this analysis. This should include the HPC, the SSH connection from your local machine, the job scheduler, and the coding languages used in the scripts.
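
For the log-file questions in the QC task, the sample and variant counts can be read directly from the PLINK .log output. The following Python sketch is one hedged way to pull them out automatically. It assumes PLINK 1.9-style log wording ("variants loaded from .bim file", "people ... loaded from .fam", "variants and ... people pass filters and QC"); other PLINK versions may phrase these lines differently, so check your own log first. The function name and the numbers in the example log are invented.

```python
import re


def parse_plink_log(text):
    """Extract loaded and retained counts from PLINK log text.

    The patterns assume PLINK 1.9-style wording; adjust them if your
    log phrases these lines differently.
    """
    counts = {}
    m = re.search(r"(\d+) variants loaded from \.bim file", text)
    if m:
        counts["variants_loaded"] = int(m.group(1))
    m = re.search(r"(\d+) people \([^)]*\) loaded from \.fam", text)
    if m:
        counts["people_loaded"] = int(m.group(1))
    m = re.search(r"(\d+) variants and (\d+) people pass filters and QC", text)
    if m:
        counts["variants_kept"] = int(m.group(1))
        counts["people_kept"] = int(m.group(2))
    return counts


# Invented log excerpt, for illustration only.
example_log = """\
1000 variants loaded from .bim file.
50 people (25 males, 25 females) loaded from .fam.
900 variants and 48 people pass filters and QC.
"""

qc_counts = parse_plink_log(example_log)
print(qc_counts)
```

Running this on the log of the first and of the last QC step gives the two pairs of numbers to report. Keys are simply absent when a line is not found, which makes a wording mismatch easy to spot.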
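
For the association-testing questions, counting SNPs below \(P < 5 \times 10^{-8}\) and finding the lowest p-value is a simple filter over the results table. The Python sketch below illustrates the idea under stated assumptions: a whitespace-delimited results file with a header row containing columns named ID and P (the names used in the task text), and, if a TEST column is present, one row per model term where the SNP effect is labelled ADD. The function name, column defaults and example rows are hypothetical, so adapt them to the actual output of the teaching scripts.

```python
import io
import math

THRESHOLD = 5e-8  # significance threshold given in the task


def summarise_hits(handle, id_col="ID", p_col="P", threshold=THRESHOLD):
    """Return (IDs with P < threshold, (ID, P) of the smallest p-value)."""
    header = handle.readline().split()
    id_idx = header.index(id_col)
    p_idx = header.index(p_col)
    test_idx = header.index("TEST") if "TEST" in header else None

    hits = []
    best = None
    for line in handle:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip malformed or blank rows
        # Assumed convention: keep only the SNP term, not covariate rows.
        if test_idx is not None and fields[test_idx] != "ADD":
            continue
        try:
            p = float(fields[p_idx])
        except ValueError:
            continue  # non-numeric entries such as "NA"
        if math.isnan(p):
            continue
        if p < threshold:
            hits.append(fields[id_idx])
        if best is None or p < best[1]:
            best = (fields[id_idx], p)
    return hits, best


# Invented rows, for illustration only.
results_text = (
    "#CHROM POS ID REF ALT A1 TEST OBS_CT OR P\n"
    "1 1000 1:1000:A:G A G G ADD 100 1.10 0.20\n"
    "1 1000 1:1000:A:G A G G PC1 100 0.90 1e-10\n"
    "2 2000 2:2000:C:T C T T ADD 100 2.50 3e-9\n"
    "3 3000 3:3000:G:A G A A ADD 100 NA NA\n"
)

hits, best = summarise_hits(io.StringIO(results_text))
print(len(hits), best)
```

The covariate row (PC1) and the NA row are skipped on purpose, so that neither a covariate p-value nor a failed test is counted as a significant SNP.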

Report

Submit a report in PDF format (max. 10 pages) via MMS. The report should contain details of each step of the process (tasks 1-5), describing the methodology applied and the decisions you took.

Marking

This practical will be marked according to the graduate school mark descriptors. All documents relating to the mark descriptors (and their conversion to the 20-point scale) can be found on the graduate school webpages.

Details for good academic practice are outlined on the University webpages.
For details on Graduate School penalties, extension requests, etc., please refer to the student rules section.

Any use of AI tools, including large language models such as ChatGPT, needs to be acknowledged, referenced, and logged. If used, text generated by AI should be in quotation marks and referenced as private communication. AI-generated code and its comments need to be clearly highlighted and referenced. All AI interactions used for coding or for report writing should be annexed to the submission as a searchable text file.