Acrobeles complexus

Mark Blaxter's teaching pages

@ The Blaxter Lab, Institute of Evolutionary Biology, School of Biological Sciences, The University of Edinburgh

University of Edinburgh crest   

Bioinformatics MSc


Next Generation Genomics

 

Workshop 2 Assembly using Velvet

In this workshop you will explore the use of Velvet (Zerbino and Birney) (see http://www.ebi.ac.uk/~zerbino/velvet/) in assembly of next generation genomics data.

The raw data are from a community isolate of Escherichia coli that is associated with disease - the strains are called O157 and can cause fatal infections in humans. We (the GenePool) generated paired-end data for the strain. You will analyse these data to explore assembly parameters.

Velvet is a de Bruijn assembler. There is a manual available here. Velvet was published in 2008.


Part 1: Setting up

Connect to the Bioinformatics MSc server using the same username and password as you used for PBL. If you were not in the PBL class, we have a password for you.

Working on the Gifford Laboratory Computers
1. Log in to windows as normal using your UUN (sXXXXXXX)
2. Start Exceed (Start menu -> All Programmes -> Communications -> Exceed -> Exceed)
3. Start the Exceed SSH client (Start menu -> All Programmes -> Communications -> Secure Shell Client)
4. Turn on X windows tunnelling (Edit -> Settings -> Profile -> Connection -> Tunnelling -> OK)
5. Click Quick Connect
6. Click Yes when asked if you want to save
7. Enter the server name: bioinfmsc1.bio.ed.ac.uk
8 .Enter your UUN (sXXXXXXX)
9. Click Connect
10. Click Yes when asked if you want to save
11. Enter your password when prompted

From UNIX (and Mac OSX) computers, you can probably just open a terminal and type

++++++
ssh -X <username>@bioinfmsc1.bio.ed.ac.uk
++++++

Set up some essential links to the workshop data and executables.

[you should have done this for workshop 1]

Use nano (or your favourite editor) to add the line

export PATH=/home/ngg_shared/assembly_workshop/bin/:$PATH

to the file ".bashrc"
Save .bashrc and read it using source

++++++
nano .bashrc
<<do the editing>>
source .bashrc

++++++

Copy the workshop data files to your home directory and make a project directory

The raw data are in a shared directory on the bioinformatics server

++++++
cp -r /home/ngg_shared/assembly_workshop/ecoli_data/ .
++++++


Part 2: The raw data

Check the content and quality of the data in these files using the utility script fastqc

++++++
cd ecoli_data/
fastqc 6991_1.fastq
fastqc 6991_2.fastq

++++++

You can review the content and quality of the sequence data via the html output of fastqc in a web browser as follows:
(The command below should be entered on one line)

++++++
firefox 6991_1_fastqc/fastqc_report.html 6991_2_fastqc/fastqc_report.html &
++++++

  • What sort of Illumina data do we have? (single,paired,fasta/fastq)
  • How many entries are in the files?
  • What is the read length?

Part 3: Running velvet in single-end mode

When you run Velvet, you do so in two stages. First you ask velveth to make the hash table of all the k-mers at your chosen value of k. Then you ask velvetg to build the de Bruijn graph, and to output the sequences of the assembled contigs.

Type

++++++
velveth
++++++

to see the expected structure of a velveth command and to identify the flags you should use. There are LOTS of options for velveth (and velvetg).

We will start off with a simple velvet run (velveth+velvetg) with kmer 21 using 'full_21' as the directory parameter.

++++++
velveth full_21 21 -fastq -short 6991_1.fastq 6991_2.fastq
velvetg full_21

++++++


Part 4: Velvet results

What is in the output directory?

++++++
cd full_21
ls -l

++++++

How good is the assembly?

We use a script, contig_stats.pl to check the number of contigs, their span, and additional information such as N50, etc:

++++++
contig_stats.pl -f contigs.fa -t 100 -h
++++++

What is the N50?
How many bases are in the assembly?
How many contigs have been generated?


Part 5: Tuning coverage expectations for Velvet - finding the most likely coverage

Velvet works best if it is told what to expect in terms of coverage, and when to ignore low and high coverage contigs (errors and repeats respectively).

The contig_stats.pl script outputs a file "stats.txt". We can use the statistical programming language R to summarise these data.

++++++
R
x11()
data = read.table("stats.txt", header=TRUE)
install.packages("plotrix")
library(plotrix)
weighted.hist(data$short1_cov, data$lgth, breaks=0:50, xlab="coverage of node",ylab="frequency, number of nodes with given coverage")
++++++

What is the most likely value for the expected coverage?
Where would you set the coverage cut-off?
Which other options might be useful? (to remove high coverage nodes)?


Part 6: Tuning coverage expectations for Velvet - rerunning Velvet with chosen coverage values

6.1 coverage cutoff

After looking at the stats.txt file (with R), run velvetg again with different -cov_cutoff values.

How does the output change? (run process_contigs.pl)
What is the optimal parameter?
How long does it take to run? (time)

6.2 Expected coverage

Run velvet again adding the parameter -exp_cov.

How does the output change? (run process_contigs.pl)
What is the optimal parameter?
How long does it take to run? (time)

6.3 Additional parameter exploration

Have you had a look at the 'full_21/Log' file?
Which files does Velvet produce / change after each run?

6.4 You could try using additional advanced options... (-max_coverage)


Part 7: Paired end data in Velvet - setting up

Create a new directory 'small_paired', and copy the data files there.

++++++
mkdir small_paired
cd small_paired
cp -r /home/ngg_shared/assembly_workshop/ecoli_data/ .

++++++

In order to run velvet with these data in paired mode, we need to merge the two files into one.

You can use shuffleSequences_fastq.pl to merge them together.

++++++
shuffleSequences_fastq.pl 6991_1.fastq 6991_2.fastq output_shuffled_paired.txt
++++++

Look at the file (use the command "more").

Are the entries in the correct order and format?


Part 8: Paired end data in Velvet - running velvet in paired mode

8.1 Checking the data

We will start off with a simple velvet run (velveth+velvetg) with kmer 21 using 'full_21' as the directory parameter.

++++++
velveth pairs_21 21 -fastq -shortPaired output_shuffled_paired.txt
velvetg pairs_21

++++++

Are there differences in the coverage distribution?
Which cut-off do you propose to use this time?

Compare the stats.txt files (using R) of the single end and small paired set

Is there a difference between the low (0-5) / high (double the expected) coverage reads?

8.2 paired end with ins_length set k=21

Run the small paired set again with velvetg adding the parameter -ins_length.
You should specify the following parameters: -cov_cutoff, -exp_cov and -ins_length)

How does the output change? (run process_contigs.pl)

8.3 paired end with ins_length set k=31

Rerun the analyses using k=31, and setting 'pairs_31' as the directory parameter.

Why do you have to run velveth again?
Have you noticed the change of the expected coverage?
How does the runtime change?
How does N50 change?
Do you think a bigger Kmer is better?


Part 9: The effects of data quality trimming - if you have time

Removing low quality data from the read sets should reduce the number of 'bubbles' in the graph, and thus improve the overall assembly.... but does it?

9.1 Trim the data set to remove poor quality bases

++++++
cd ../
mkdir trimmed_data
cd trimmed_data
cp -r /home/ngg_shared/assembly_workshop/ecoli_data/ .
fastq_quality_trimmer -t 20 -l 35 -i 6991_1.fastq -o 6991_1.fastq.trim
fastq_quality_trimmer -t 20 -l 35 -i 6991_2.fastq -o 6991_2.fastq.trim
++++++

9.2 Check the data quality using fastqc as above (section 2)

9.3 Rerun the velvet assembly as per sections 3, 4, 5,6 ,7 and 8 above

Rerun the analyses in Velvet again (both single-end and paired-end) for these trimmed data and compare the results


10: Velvet advanced options - if you have time

This section will go through some advanced options and discuss the effects they have. There are LOTS of options for velveth and velvetg: try typing the commands with no other arguments...

'-min_pair_count'

Run velvet (give output directory 'full_21-pairCnt') adding the -min_pair-count parameter to the paired end run command. Try different values e.g. 2 and 20
e.g. -cov_cutoff 14 -exp_cov 28 -ins_length 215 -ins_length_sd 35 -min_pair_count …

How does the result change?
Is a bigger n50 a better result?

'-max_gap_count'

Repeat the previous exercise with '-max_gap_count'
How does the result change?
What problems could you get by increasing this option?

 

 

 

 

 


Website Highlight
Pratylenchus penetrans

The plant-parasitic nematode Pratylenchus penetrans .
Plant parasitic nematodes cause major economic losses worldwide. Pratylenchus penetrans is a migratory endoparasite of the roots of many crop species . See NEMBASE4 for analyses of ESTs from this parasite and many other nematodes.

[nematodes.org v4.0] the content of these pages is copyright Mark Blaxter and colleagues. Contact the webmaster if there are problems.