Next generation genomics 2010-2011 Mapping Reads Workshop devised by Timothée Cézard, GenePool (Jan 2011) In this workshop you will take a set of Illumina data, map it to a reference genome, and examine the mapped data for single nucleotide polymorphisms, insertions and deletions. The raw data are from an Escherichia coli of type O157. O157 E. coli are associated with severe clinical outcomes in humans, but appear to be essentially nonpathogenic commensals of cows. Thus it is thought that most human E. coli O157 disease derives from cow faecal contamination of the human food chain. The reference genome is a fully (Sanger dideoxy) sequenced human isolate from a spinach-associated outbreak in the USA. The experimental data are paired-end illumina reads from two isolates from a surveillance epidemiology project in Scotland. Text in lines spaced by ++++++ below is what you should type in the terminal window. For UNIX command help, see http://unixhelp.ed.ac.uk/. 1Connect to the Bioinformatics MSc server using the same username and password as you used for PBL. If you were not in the PBL class, we have a password for you. Working on the Gifford Laboratory Computers 1.Log in to windows as normal using your UUN (sXXXXXXX) 2.Start Exceed (Start menu -> All Programmes -> Communications -> Exceed -> Exceed) 3.Start the Exceed SSH client (Start menu -> All Programmes -> Communications -> Secure Shell Client) 4.Turn on X windows tunnelling (Edit -> Settings -> Profile -> Connection -> Tunnelling -> OK) 5.Click Quick Connect 6.Click Yes when asked if you want to save 7.Enter the server name: bioinfmsc1.bio.ed.ac.uk 8.Enter your UUN (sXXXXXXX) 9.Click Connect 10.Click Yes when asked if you want to save 11.Enter your password when prompted From UNIX (and Mac OSX) computers, you can probably just open a terminal and type ++++++ ssh -X @bioinfmsc1.bio.ed.ac.uk ++++++ 2 Set up some essential links to the workshop data and executables. Use nano (or your favourite editor) to add the line export PATH=/home/ngg_shared/mapping_workshop/bin/:$PATH to the file “.bashrc” Save .bashrc and read it using source ++++++ nano .bashrc source .bashrc ++++++ 3. Copy the workshop data files to your home directory and make a project directory ++++++ cp -r /home/ngg_shared/mapping_workshop/ecoli_data/ . cp -r /home/ngg_shared/mapping_workshop/ecoli_genome/ . mkdir aligned_reads ++++++ 4.Check the content and quality of the data in these files. This uses the utility script fastqc ++++++ cd ecoli_data/ fastqc 6991_1.fastq fastqc 6991_2.fastq ++++++ You can review the content and quality of the sequence data via the html output of fastqc in a web browser as follows: (The command below should be entered on one line) ++++++ firefox 6991_1_fastqc/fastqc_report.html 6991_2_fastqc/fastqc_report.html & ++++++ 5. Use BWA to align the reads. BWA stands for Brurrows-Wheeler Aligner (which we will discuss in a later lecture). The BWA software was written by Heng Li, and is open source. You can visit the BWA project page at [http://bio-bwa.sourceforge.net/bwa.shtml] and read the original paper describing BWA at Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [http://www.ncbi.nlm.nih.gov/pubmed/19451168] 5A: Use BWA to generate an index file for the reference genome ++++++ cd ~/ecoli_genome/ bwa index -a is E_Coli_O157H7_TW14359.fasta cd ~ ++++++ 5B: Perform the alignment and save the output in aligned_reads. (The two commands below should be entered on one line) ++++++ bwa aln ecoli_genome/E_Coli_O157H7_TW14359.fasta ecoli_data/6991_1.fastq > aligned_reads/6991_1.sai bwa aln ecoli_genome/E_Coli_O157H7_TW14359.fasta ecoli_data/6991_2.fastq > aligned_reads/6991_2.sai ++++++ 6: Prepare the aligned reads for viewing in a genome browser. BWA includes a number of tools to help convert the aligned read data into useful formats. One of the most useful is SAM format - see the SAM specifications at [http://samtools.sourceforge.net/SAM-1.3.pdf]. [http://samtools.sourceforge.net/samtools.shtml] Li H., Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [http://www.ncbi.nlm.nih.gov/pubmed/19505943] Once in SAM format, we use samtools to reconvert the sam file into a binary sam (or bam) file, and to perform some sorting of this file so that the genome viewer can access it rapidly and efficiently. (The commands below should be entered on one line each) ++++++ bwa sampe ecoli_genome/E_Coli_O157H7_TW14359.fasta aligned_reads/6991_1.sai aligned_reads/6991_2.sai ecoli_data/6991_1.fastq ecoli_data/6991_2.fastq > aligned_reads/6991.sam samtools view -bt ecoli_genome/E_Coli_O157H7_TW14359.fasta aligned_reads/6991.sam > aligned_reads/6991.bam samtools sort aligned_reads/6991.bam aligned_reads/6991_sorted ++++++ 7. Collect some statistics about the alignment using samtools. ++++++ samtools flagstat aligned_reads/6991_sorted.bam > aligned_reads/6991_sorted.bam.stat ++++++ You can look at these statistics using more, less or cat: ++++++ cat aligned_reads/6991_sorted.bam.stat ++++++ 8. Look at the inferred insert size of the paired-end read library using samtools to produce a file that can be processed by R. (The commands below should be entered on one line each) ++++++ samtools index aligned_reads/6991_sorted.bam samtools view -F 4 -F 8 -f 64 aligned_reads/6991_sorted.bam | perl -lane 'if (abs($F[8])<1000){print abs($F[8])}' > aligned_reads/insert_size_value.txt less aligned_reads/insert_size_value.txt | wc -l ++++++ There should be 991530 insert sizes. ++++++ samtools view -F 4 -F 8 -f 64 aligned_reads/6991_sorted.bam | perl -lane 'if (abs($F[8])>=1000){print $F[8]}' | wc -l ++++++ There should be 14356 entries in the file. Now call the R environment and read the “insert-size_value.txt” output file, and make a pretty histogram: ++++++ R data = read.table("aligned_reads/insert_size_value.txt") hist(data$V1, breaks=100) q() ++++++ 9. View the aligned reads in a genome viewer. We have installed IGV, the Integrative Genome Viewer from the Boad institute. [http://www.broadinstitute.org/software/igv/] IGV is fast, customisable and free. The developers say: "The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types including sequence alignments, microarrays, and genomic annotations." James T. Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S. Lander, Gad Getz, Jill P. Mesirov. Integrative Genomics Viewer. Nature Biotechnology 29, 24–26 (2011) Launch igv with the command ++++++ igv_linux-64.sh & ++++++ Once in IGV, use the GUI menus to perform the following tasks. #File --> Import Genome #Give a name to the genome #Select the Genome file as a sequence file #ignore the Cytoband File and Gene File #save the newly created igv-genome #File --> Load from file # Select the sorted bam file containing the reads Tim Czard and Mark Blaxter, UoE, 12/01/2011