Pre-processing of fastq sequences with Qiime2
Resources
https://docs.qiime2.org/2020.11/
Fastq files
We are using fastq files that were produced by sequencing 16S rRNA marker genes on an Illumina MiSeq instrument.
Briefly on script-files for the HPC
The code shown below is executed on a high performance computer (HPC). All commands are collected in a .txt file, which is queued for execution on the HPC with Slurm (i.e. with sbatch examplescript.txt). The first lines of the .txt file typically look similar to this:
#!/bin/bash
#SBATCH --cpus-per-task=6
#SBATCH --mem-per-cpu=6000
#SBATCH --time=24:00:00
#SBATCH --partition=compute
#SBATCH --job-name=Q_fun
echo "Starting at: $(date)"
and the last line contains this:
echo "Finished at: $(date)"
However, code can also be executed on any local computer that has Qiime2 installed natively. In that case Slurm and sbatch are not needed. But note that some computations require large amounts of RAM and multiple CPUs.
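As a minimal sketch of how the header and footer above fit around the QIIME 2 commands, the following writes a complete job script to a file and checks its syntax. The `set -euo pipefail` line and the placeholder comment are additions for illustration and not part of the original script; the `#SBATCH` values are the ones shown above and may need adjusting for your cluster.

```shell
# Write a minimal job script (file name matches the sbatch example above).
cat > examplescript.txt <<'EOF'
#!/bin/bash
#SBATCH --cpus-per-task=6
#SBATCH --mem-per-cpu=6000
#SBATCH --time=24:00:00
#SBATCH --partition=compute
#SBATCH --job-name=Q_fun

# Assumption: stop at the first failing command instead of continuing.
set -euo pipefail

echo "Starting at: $(date)"

# ... qiime commands go here ...

echo "Finished at: $(date)"
EOF

# Check the script for syntax errors without running it.
bash -n examplescript.txt && echo "syntax OK"

# On the HPC, submit it with:
# sbatch examplescript.txt
```

On a local computer the same file can be run directly with `bash examplescript.txt`, because the `#SBATCH` lines are plain comments to bash.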
Paired end
All steps are based on paired end processing (i.e. forward and reverse reads are paired).
Required Files
.fastq.gz
- 2 files per sample: forward reads and reverse reads
manifest.tsv
- tab separated file with three columns: the first column is the sample ID (as per fastq file); the second and third columns are the paths to the forward and reverse fastq files, respectively. If you process fastq files on an HPC, the paths must be the full paths to the HPC folder containing the fastq files. If you process the files on your own hard disk, the paths must point to your local drive. See the example below with paths to an HPC server.
metadata.tsv
- file containing sample data; example below.
Note: Create the .tsv file as a tab separated .csv file first and then change the file extension from .csv to .tsv. Any spreadsheet program or text editor that can save tab-delimited text can be used for this.
Example of a manifest.tsv file
sample-id | forward-absolute-filepath | reverse-absolute-filepath |
---|---|---|
C-M3-01 | /data/group/labname/home/studentID/C-M3-01_S1_L001_R1_001.fastq.gz | /data/group/labname/home/studentID/C-M3-01_S1_L001_R2_001.fastq.gz |
C-M3-02 | /data/group/labname/home/studentID/C-M3-02_S2_L001_R1_001.fastq.gz | /data/group/labname/home/studentID/C-M3-02_S2_L001_R2_001.fastq.gz |
C-M3-03 | /data/group/labname/home/studentID/C-M3-03_S3_L001_R1_001.fastq.gz | /data/group/labname/home/studentID/C-M3-03_S3_L001_R2_001.fastq.gz |
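A manifest like the one above can also be generated with a short shell loop instead of being typed by hand. The following is a sketch only: it assumes the Illumina-style file names used in the example (`<sample>_S<number>_L001_R1_001.fastq.gz`), and it creates empty placeholder files in a temporary folder so that it can be run without real data. The variable names are illustrative.

```shell
# Demo input: empty placeholder files named like the examples above.
# In real use, set fastq_dir to the folder that holds your fastq files.
fastq_dir=$(mktemp -d)
for prefix in C-M3-01_S1 C-M3-02_S2 C-M3-03_S3; do
    touch "$fastq_dir/${prefix}_L001_R1_001.fastq.gz" \
          "$fastq_dir/${prefix}_L001_R2_001.fastq.gz"
done

manifest="$fastq_dir/manifest.tsv"

# Header line, tab separated.
printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n' > "$manifest"

# One row per forward read file; the reverse file name is derived from it.
for fwd in "$fastq_dir"/*_R1_001.fastq.gz; do
    rev="${fwd%_R1_001.fastq.gz}_R2_001.fastq.gz"
    if [ ! -f "$rev" ]; then
        echo "No reverse reads found for $fwd" >&2
        continue
    fi
    file_name=$(basename "$fwd")
    # Assumption: the sample ID is everything before "_S<number>_L001_R1_001.fastq.gz".
    sample_id="${file_name%_S[0-9]*_L001_R1_001.fastq.gz}"
    printf '%s\t%s\t%s\n' "$sample_id" "$fwd" "$rev" >> "$manifest"
done

cat "$manifest"
```

Because `fastq_dir` is an absolute path, the paths written to the manifest are absolute as well, which is what the column names require.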
Example of a metadata.tsv file
#SampleID | Soil | Paddock | P | pH | Clay | Silt | Sand |
---|---|---|---|---|---|---|---|
#q2:types | categorical | categorical | numeric | numeric | numeric | numeric | numeric |
C-M3-01 | Kurosol | 1 | 38.26 | 4.52 | 12.281 | 37.94449975 | 48.3582225 |
C-M3-02 | Kurosol | 1 | 50.98 | 4.44 | 12.857 | 36.91042988 | 49.524291 |
C-M3-03 | Chromosol | 2 | 47.58 | 4.56 | 13.433 | 35.87636 | 50.6903595 |
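Since the sample IDs in metadata.tsv have to match those in manifest.tsv, a quick comparison of the first columns can catch typos early. The sketch below builds two small demo files in a temporary folder (the metadata deliberately lacks one sample, and the fastq paths are placeholders) and lists the IDs that appear in the manifest but not in the metadata. All file and variable names are illustrative.

```shell
work_dir=$(mktemp -d)

# Demo manifest with three samples (paths are placeholders).
printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n' > "$work_dir/manifest.tsv"
for id in C-M3-01 C-M3-02 C-M3-03; do
    printf '%s\t/path/to/%s_R1.fastq.gz\t/path/to/%s_R2.fastq.gz\n' "$id" "$id" "$id" >> "$work_dir/manifest.tsv"
done

# Demo metadata that deliberately lacks C-M3-03.
{
    printf '#SampleID\tSoil\tPaddock\tP\tpH\n'
    printf '#q2:types\tcategorical\tcategorical\tnumeric\tnumeric\n'
    printf 'C-M3-01\tKurosol\t1\t38.26\t4.52\n'
    printf 'C-M3-02\tKurosol\t1\t50.98\t4.44\n'
} > "$work_dir/metadata.tsv"

# Sample IDs from the manifest (skip the header line).
tail -n +2 "$work_dir/manifest.tsv" | cut -f1 | sort > "$work_dir/manifest_ids.txt"
# Sample IDs from the metadata (skip the two lines starting with '#').
grep -v '^#' "$work_dir/metadata.tsv" | cut -f1 | sort > "$work_dir/metadata_ids.txt"

# IDs present in the manifest but absent from the metadata.
missing_ids=$(comm -23 "$work_dir/manifest_ids.txt" "$work_dir/metadata_ids.txt")
if [ -n "$missing_ids" ]; then
    echo "Samples without metadata: $missing_ids"
fi
```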
Paired end manifest import (Step 1)
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest.tsv \
--output-path demux-paired-end.qza \
--input-format PairedEndFastqManifestPhred33V2
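Before running the import, it may be worth confirming that every path in manifest.tsv is absolute and points to an existing file, as a wrong path is an easy mistake to make. The helper below is a hypothetical sketch (the function name `check_manifest` is not part of QIIME 2); the demo creates a tiny manifest in a temporary folder so the check can be tried without real data.

```shell
# check_manifest FILE: report manifest paths that are relative or missing.
# Returns 0 if every path is fine, 1 otherwise.
check_manifest() {
    manifest_file=$1
    status=0
    tab=$(printf '\t')
    {
        read -r header_line   # skip the header row
        while IFS=$tab read -r sample_id fwd rev; do
            for path in "$fwd" "$rev"; do
                case $path in
                    /*) ;;  # absolute path, as required
                    *)  echo "$sample_id: not an absolute path: $path" >&2
                        status=1 ;;
                esac
                if [ ! -f "$path" ]; then
                    echo "$sample_id: file not found: $path" >&2
                    status=1
                fi
            done
        done
    } < "$manifest_file"
    return "$status"
}

# Demo: one sample with two empty placeholder files.
demo_dir=$(mktemp -d)
touch "$demo_dir/A_R1.fastq.gz" "$demo_dir/A_R2.fastq.gz"
printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n' > "$demo_dir/manifest.tsv"
printf 'A\t%s\t%s\n' "$demo_dir/A_R1.fastq.gz" "$demo_dir/A_R2.fastq.gz" >> "$demo_dir/manifest.tsv"

if check_manifest "$demo_dir/manifest.tsv"; then
    echo "manifest paths look fine"
fi
```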
Cutadapt (Step 2)
Use this step to trim primer sequences from the reads. Replace the primer sequences below with those used for your amplicon.
qiime cutadapt trim-paired \
--i-demultiplexed-sequences demux-paired-end.qza \
--p-front-f GTGARTCATCGAATCTTTG \
--p-front-r TCCTCCGCTTATTGATATGC \
--o-trimmed-sequences trimmed_demux-paired-end.qza
Summary of output
qiime demux summarize \
--i-data trimmed_demux-paired-end.qza \
--o-visualization trimmed_demux-paired-end.qzv
From the output, decide where to truncate the forward and reverse reads with --p-trunc-len-f and --p-trunc-len-r in dada2 below.
Note: View any .qzv file on https://view.qiime2.org/
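When choosing truncation lengths, keep in mind that the truncated forward and reverse reads still have to overlap so that they can be merged. The arithmetic can be sketched as below. The amplicon length is a hypothetical value that you need to replace with the expected length of your own amplicon after primer removal, and the minimum overlap of 12 nucleotides is stated here as an assumption about the DADA2 default.

```shell
trunc_len_f=270     # value used for --p-trunc-len-f in this tutorial
trunc_len_r=235     # value used for --p-trunc-len-r in this tutorial
amplicon_len=440    # hypothetical amplicon length after primer removal
min_overlap=12      # assumption: default minimum overlap DADA2 needs for merging

# Overlap = combined read length minus amplicon length.
overlap=$((trunc_len_f + trunc_len_r - amplicon_len))
echo "Expected overlap: $overlap nt"

if [ "$overlap" -lt "$min_overlap" ]; then
    echo "Too short: reads are unlikely to merge, truncate less aggressively"
else
    echo "Overlap is sufficient for merging"
fi
```

With these numbers the expected overlap is 270 + 235 - 440 = 65 nt. Amplicons that vary in length need the check to be done against the longest expected amplicon.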
Denoise paired end sequences with dada2 (Step 3)
qiime dada2 denoise-paired \
--i-demultiplexed-seqs trimmed_demux-paired-end.qza \
--o-table feature_table.qza \
--o-representative-sequences sample_rep_seqs.qza \
--p-trim-left-f 0 --p-trim-left-r 0 \
--p-trunc-len-f 270 \
--p-trunc-len-r 235 \
--output-dir dada2 \
--verbose
Summary of output
qiime feature-table summarize \
--i-table feature_table.qza \
--o-visualization feature_table.qzv \
--m-sample-metadata-file metadata.tsv
Note: Look at feature_table.qzv and record the median number of reads per sample.
Taxonomic assignment (Step 4)
Use the latest pre-trained SILVA or Greengenes classifier (if you used the same primers) or train your own.
qiime feature-classifier classify-sklearn \
--i-classifier silva-132-99-515-806-nb-classifier.qza \
--p-reads-per-batch 10000 \
--i-reads sample_rep_seqs.qza \
--o-classification taxonomy_silva.qza \
--quiet
Summary of output
qiime metadata tabulate \
--m-input-file taxonomy_silva.qza \
--o-visualization taxonomy_silva.qzv
Note: Load taxonomy_silva.qzv into https://view.qiime2.org/ to download the .tsv file for later import into R.
Build phylogenetic tree (Step 5)
In this case we are using the insertion tree method. See https://library.qiime2.org/plugins/q2-fragment-insertion/16/
As not all ASVs will be inserted, we will filter feature_table.qza again to keep only those ASVs that are in the tree.
You will need the reference file from SILVA or Greengenes. In this case we are using sepp-refs-silva-128.qza.
qiime fragment-insertion sepp \
--i-representative-sequences sample_rep_seqs.qza \
--i-reference-database sepp-refs-silva-128.qza \
--o-tree insertion-tree.qza \
--o-placements insertion-placements.qza
qiime fragment-insertion filter-features \
--i-table feature_table.qza \
--i-tree insertion-tree.qza \
--o-filtered-table feature_table_insertiontreefiltered.qza \
--o-removed-table removed_features.qza
Done!
Everything else, including further quality filtering, happens with phyloseq in R, where we will import the following files: feature_table_insertiontreefiltered.qza, taxonomy_silva.qza and insertion-tree.qza.
This will be covered in the next chapter.
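Before moving on to R, a small check that the three files named above were actually written can save a failed import later. The helper below is a hypothetical sketch (`check_outputs` is not a QIIME 2 command); it only tests whether the files exist and does not inspect their content.

```shell
# check_outputs DIR FILE...: report which of the listed files exist in DIR.
# Returns 0 if all are present, 1 otherwise.
check_outputs() {
    out_dir=$1
    shift
    missing=0
    for artifact in "$@"; do
        if [ -f "$out_dir/$artifact" ]; then
            echo "found:   $artifact"
        else
            echo "missing: $artifact"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, run from the folder that holds the QIIME 2 output:
# check_outputs . feature_table_insertiontreefiltered.qza taxonomy_silva.qza insertion-tree.qza
```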
All in one
# Manifest Import
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest.tsv \
--output-path demux-paired-end.qza \
--input-format PairedEndFastqManifestPhred33V2
# Cutadapt
qiime cutadapt trim-paired \
--i-demultiplexed-sequences demux-paired-end.qza \
--p-front-f GTGARTCATCGAATCTTTG \
--p-front-r TCCTCCGCTTATTGATATGC \
--o-trimmed-sequences trimmed_demux-paired-end.qza
qiime demux summarize \
--i-data trimmed_demux-paired-end.qza \
--o-visualization trimmed_demux-paired-end.qzv
# Denoise
qiime dada2 denoise-paired \
--i-demultiplexed-seqs trimmed_demux-paired-end.qza \
--o-table feature_table.qza \
--o-representative-sequences sample_rep_seqs.qza \
--p-trim-left-f 0 --p-trim-left-r 0 \
--p-trunc-len-f 270 \
--p-trunc-len-r 235 \
--output-dir dada2 \
--verbose
# Taxonomic assignment
qiime feature-classifier classify-sklearn \
--i-classifier silva-132-99-515-806-nb-classifier.qza \
--p-reads-per-batch 10000 \
--i-reads sample_rep_seqs.qza \
--o-classification taxonomy_silva.qza \
--quiet
# Phylogenetic tree
qiime fragment-insertion sepp \
--i-representative-sequences sample_rep_seqs.qza \
--i-reference-database sepp-refs-silva-128.qza \
--o-tree insertion-tree.qza \
--o-placements insertion-placements.qza
# Final filtering
qiime fragment-insertion filter-features \
--i-table feature_table.qza \
--i-tree insertion-tree.qza \
--o-filtered-table feature_table_insertiontreefiltered.qza \
--o-removed-table removed_features.qza
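When all steps run in one script, it can be useful to log when each step starts and ends, and to stop as soon as one step fails so that later steps do not run on missing input. The wrapper below is a hypothetical sketch of that idea (`run_step` is an illustrative name, not a QIIME 2 or Slurm feature); the demo uses `true` as a stand-in for a real command.

```shell
# run_step NAME COMMAND [ARGS...]: run one pipeline step with timestamps.
# Returns the exit code of the command.
run_step() {
    step_name=$1
    shift
    echo "[$(date)] starting: $step_name"
    if "$@"; then
        echo "[$(date)] finished: $step_name"
    else
        exit_code=$?
        echo "[$(date)] FAILED:   $step_name (exit code $exit_code)" >&2
        return "$exit_code"
    fi
}

# Demo with a stand-in command that always succeeds.
run_step "demo step" true

# In the job script, a real step could be wrapped like this:
# run_step "Summary" qiime demux summarize \
#     --i-data trimmed_demux-paired-end.qza \
#     --o-visualization trimmed_demux-paired-end.qzv || exit 1
```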
Qiime2 reference
Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37: 852–857. https://doi.org/10.1038/s41587-019-0209-9