Pre-processing of fastq sequences with Qiime2

Resources

https://docs.qiime2.org/2020.11/

Fastq files

We are using fastq files that were produced by sequencing 16S rRNA marker genes on an Illumina MiSeq instrument.

Briefly on script-files for the HPC

The code shown below is executed on a high-performance computing (HPC) cluster. All commands are collected in a plain-text script file (here a .txt file), which is submitted to the Slurm queue for execution (e.g. with sbatch examplescript.txt). The first lines of the script typically look similar to this:

#!/bin/bash
#SBATCH --cpus-per-task=6
#SBATCH --mem-per-cpu=6000
#SBATCH --time=24:00:00
#SBATCH --partition=compute
#SBATCH --job-name=Q_fun
echo "Starting at: $(date)"

and the last line contains this:

echo "Finished at: $(date)"

However, code can also be executed on any local computer that has Qiime2 installed natively. In that case Slurm and sbatch are not needed. But note that some computations require large amounts of RAM and multiple CPUs.
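To let the same script run both under Slurm and on a local machine, the number of threads can be read from the environment. The following is a minimal sketch: SLURM_CPUS_PER_TASK is the variable Slurm sets when --cpus-per-task is requested, while the helper name n_threads and the fallback of 1 are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: use the CPU count granted by Slurm, or fall back to 1 thread locally.
# n_threads is a hypothetical helper name, not part of QIIME 2 or Slurm.
n_threads() {
  echo "${SLURM_CPUS_PER_TASK:-1}"
}

THREADS=$(n_threads)
echo "Using $THREADS thread(s)"
```

The value could then be passed to the threading options of the individual plugins, for example --p-cores (cutadapt trim-paired), --p-n-threads (dada2 denoise-paired), --p-n-jobs (classify-sklearn) or --p-threads (fragment-insertion sepp). Without such an option these steps may use a single CPU even if more were requested.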

Paired end

All steps are based on paired-end processing (i.e. forward and reverse reads are paired).

Required Files

.fastq.gz - 2 files per sample - forward reads and reverse reads
manifest.tsv - tab-separated file with three columns: the sample ID (matching the fastq file name), the absolute path to the forward fastq file, and the absolute path to the reverse fastq file. If you process the fastq files on an HPC, each path must be the full path to the HPC folder containing the files; if you process them on your own machine, use the paths on your local drive. See the example below, which uses paths on an HPC server.
metadata.tsv - file containing sample data; example below.

Note: One way to build a .tsv file is to save the table from a spreadsheet program as a tab-separated text file and then change the file extension to .tsv.

Example of a manifest.tsv file

sample-id	forward-absolute-filepath	reverse-absolute-filepath
C-M3-01	/data/group/labname/home/studentID/C-M3-01_S1_L001_R1_001.fastq.gz	/data/group/labname/home/studentID/C-M3-01_S1_L001_R2_001.fastq.gz
C-M3-02	/data/group/labname/home/studentID/C-M3-02_S2_L001_R1_001.fastq.gz	/data/group/labname/home/studentID/C-M3-02_S2_L001_R2_001.fastq.gz
C-M3-03	/data/group/labname/home/studentID/C-M3-03_S3_L001_R1_001.fastq.gz	/data/group/labname/home/studentID/C-M3-03_S3_L001_R2_001.fastq.gz
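A manifest like the one above can also be generated in the shell instead of being typed by hand. The sketch below assumes Illumina-style file names as in the example (sample ID followed by _S<number>_L001_R1_001.fastq.gz); the function name make_manifest is hypothetical, and the pattern would need adjusting for other naming schemes.

```shell
#!/bin/bash
# Sketch: print a paired-end manifest for every *_R1_001.fastq.gz file in a folder.
# Assumes names such as C-M3-01_S1_L001_R1_001.fastq.gz (an assumption, adjust as needed).
make_manifest() {
  local dir r1 r2 base sample
  dir=$(cd "$1" && pwd) || return 1            # resolve the folder to an absolute path
  printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
  for r1 in "$dir"/*_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue                   # glob matched nothing
    r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
    if [ ! -e "$r2" ]; then
      echo "Skipping $r1: no matching reverse read" >&2
      continue
    fi
    base=$(basename "$r1")
    sample="${base%_S[0-9]*}"                  # strip _S1_L001_R1_001.fastq.gz
    printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
  done
}
```

It could be called as make_manifest /data/group/labname/home/studentID > manifest.tsv. Samples without a reverse file are skipped with a warning rather than written as incomplete rows.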

Example of a metadata.tsv file

#SampleID	Soil	Paddock	P	pH	Clay	Silt	Sand
#q2:types	categorical	categorical	numeric	numeric	numeric	numeric	numeric
C-M3-01	Kurosol	1	38.26	4.52	12.281	37.94449975	48.3582225
C-M3-02	Kurosol	1	50.98	4.44	12.857	36.91042988	49.524291
C-M3-03	Chromosol	2	47.58	4.56	13.433	35.87636	50.6903595
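Because the sample IDs in the manifest and the metadata file must agree, a quick cross-check before starting can save a failed job. This is a sketch only; the function name missing_in_metadata is hypothetical, and it assumes both files are tab-separated with the sample ID in the first column.

```shell
#!/bin/bash
# Sketch: list sample IDs that appear in the manifest but not in the metadata file.
# Usage: missing_in_metadata manifest.tsv metadata.tsv
missing_in_metadata() {
  awk -F'\t' '
    # First file read (metadata): remember every ID, skipping the header and "#" lines.
    NR == FNR { if (FNR > 1 && $1 !~ /^#/) seen[$1] = 1; next }
    # Second file read (manifest): print IDs not found in the metadata.
    FNR > 1 && !($1 in seen) { print $1 }
  ' "$2" "$1"
}
```

An empty result means every sample in the manifest has a metadata row.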

Paired end manifest import (Step 1)

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv \
  --output-path demux-paired-end.qza \
  --input-format PairedEndFastqManifestPhred33V2
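The import stops if a path in the manifest does not point to an existing file, so it can help to check the paths first. A minimal sketch, with the hypothetical function name check_manifest_paths and assuming the three-column layout shown above:

```shell
#!/bin/bash
# Sketch: report manifest entries whose fastq files cannot be found.
# Prints one line per missing file and returns non-zero if any file is missing.
check_manifest_paths() {
  local manifest=$1 missing=0 id fwd rev f
  # "|| [ -n "$id" ]" also handles a last line without a trailing newline.
  while IFS=$'\t' read -r id fwd rev || [ -n "$id" ]; do
    [ -z "$id" ] && continue                 # skip empty lines
    [ "$id" = "sample-id" ] && continue      # skip the header row
    for f in "$fwd" "$rev"; do
      if [ ! -f "$f" ]; then
        echo "$id: missing $f"
        missing=1
      fi
    done
  done < "$manifest"
  return "$missing"
}
```

Running check_manifest_paths manifest.tsv before the import lists the problem files in one go.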

Cutadapt (Step 2)

This step trims the primer sequences from the reads. Replace the primer sequences below with those used for your amplicon.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-end.qza \
  --p-front-f GTGARTCATCGAATCTTTG \
  --p-front-r TCCTCCGCTTATTGATATGC \
  --o-trimmed-sequences trimmed_demux-paired-end.qza

Summary of output

qiime demux summarize \
  --i-data trimmed_demux-paired-end.qza \
  --o-visualization trimmed_demux-paired-end.qzv

From the quality plots in this output, decide where to truncate the forward and reverse reads with --p-trunc-len-f and --p-trunc-len-r in dada2 below. Note: Any .qzv file can be viewed at https://view.qiime2.org/
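When choosing truncation lengths, the truncated forward and reverse reads still need to overlap so that they can be merged. The arithmetic can be sketched as below. The amplicon length of 450 is a hypothetical value used only for illustration (use the primer-free length of your own amplicon), and the function name expected_overlap is likewise an assumption. The DADA2 R package asks for at least 12 overlapping bases by default.

```shell
#!/bin/bash
# Sketch: expected overlap (in bases) between truncated forward and reverse reads.
# Arguments: forward truncation length, reverse truncation length, amplicon length.
expected_overlap() {
  local trunc_f=$1 trunc_r=$2 amplicon_len=$3
  echo $(( trunc_f + trunc_r - amplicon_len ))
}

# 270 and 235 are the truncation lengths used in Step 3; 450 is hypothetical.
expected_overlap 270 235 450   # prints 55
```

If the result is small or negative, most read pairs are likely to fail merging and the truncation lengths should be relaxed.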

Denoise paired end sequences with dada2 (Step 3)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed_demux-paired-end.qza \
  --o-table feature_table.qza \
  --o-representative-sequences sample_rep_seqs.qza \
  --p-trim-left-f 0 --p-trim-left-r 0 \
  --p-trunc-len-f 270 \
  --p-trunc-len-r 235 \
  --output-dir dada2 \
  --verbose
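The denoising statistics show how many reads per sample survived filtering, merging and chimera removal, which helps to judge the chosen truncation lengths. The sketch below assumes that, because no explicit output path was given for the statistics, they were written to dada2/denoising_stats.qza by --output-dir dada2; the function name tabulate_denoising_stats and the output file name are hypothetical.

```shell
#!/bin/bash
# Sketch: turn the DADA2 denoising statistics into a viewable .qzv file.
# Assumes the stats artifact is at dada2/denoising_stats.qza (see --output-dir dada2).
tabulate_denoising_stats() {
  local stats=${1:-dada2/denoising_stats.qza}
  local out=${2:-denoising_stats.qzv}
  qiime metadata tabulate \
    --m-input-file "$stats" \
    --o-visualization "$out"
}
```

Calling tabulate_denoising_stats without arguments uses the default paths; the resulting .qzv can be opened at https://view.qiime2.org/.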

Summary of output

qiime feature-table summarize \
  --i-table feature_table.qza \
  --o-visualization feature_table.qzv \
  --m-sample-metadata-file metadata.tsv

Note: Look at feature_table.qzv and record the median number of reads per sample.

Taxonomic assignment (Step 4)

Use the latest pre-trained SILVA or Greengenes classifier if it was trained on the region amplified by your primers; otherwise train your own.

qiime feature-classifier classify-sklearn \
  --i-classifier silva-132-99-515-806-nb-classifier.qza \
  --p-reads-per-batch 10000 \
  --i-reads sample_rep_seqs.qza \
  --o-classification taxonomy_silva.qza \
  --quiet

Summary of output

qiime metadata tabulate \
  --m-input-file taxonomy_silva.qza \
  --o-visualization taxonomy_silva.qzv

Note: Load taxonomy_silva.qzv into https://view.qiime2.org/ to download the .tsv file for later import into R.
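As an alternative to downloading the table through the browser, the taxonomy can be exported on the command line with qiime tools export, which writes a taxonomy.tsv file into the chosen folder. This is a sketch; the function name export_taxonomy and the folder name exported_taxonomy are hypothetical.

```shell
#!/bin/bash
# Sketch: export a taxonomy artifact to a folder containing taxonomy.tsv.
# Usage: export_taxonomy taxonomy_silva.qza exported_taxonomy
export_taxonomy() {
  local artifact=${1:-taxonomy_silva.qza}
  local outdir=${2:-exported_taxonomy}
  qiime tools export \
    --input-path "$artifact" \
    --output-path "$outdir"
}
```

This keeps the whole workflow inside the batch script, which is convenient on an HPC without a browser.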

Build phylogenetic tree (Step 5)

In this case we are using the fragment insertion method; see https://library.qiime2.org/plugins/q2-fragment-insertion/16/. Because not all ASVs will be inserted into the tree, we filter feature_table.qza again to keep only those ASVs that are in the tree. You will need a SEPP reference database built from SILVA or Greengenes; in this case we are using sepp-refs-silva-128.qza.

qiime fragment-insertion sepp \
  --i-representative-sequences sample_rep_seqs.qza \
  --i-reference-database sepp-refs-silva-128.qza \
  --o-tree insertion-tree.qza \
  --o-placements insertion-placements.qza

qiime fragment-insertion filter-features \
  --i-table feature_table.qza \
  --i-tree insertion-tree.qza \
  --o-filtered-table feature_table_insertiontreefiltered.qza \
  --o-removed-table removed_features.qza

Done!

Everything else including further quality filtering happens with phyloseq in R where we will import the following files: feature_table_insertiontreefiltered.qza, taxonomy_silva.qza and insertion-tree.qza.
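Before moving on to R, it is worth confirming that the three files named above were actually written, since a failed step on the HPC can go unnoticed. A minimal sketch with the hypothetical function name check_outputs:

```shell
#!/bin/bash
# Sketch: verify that each expected output file exists and is not empty.
# Prints one line per problem file and returns non-zero if any check fails.
check_outputs() {
  local f status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f"
      status=1
    fi
  done
  return "$status"
}
```

For this workflow the call would be check_outputs feature_table_insertiontreefiltered.qza taxonomy_silva.qza insertion-tree.qza, placed just before the final echo line of the script.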

This will be covered in the next chapter.

All in one

# Manifest Import
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv \
  --output-path demux-paired-end.qza \
  --input-format PairedEndFastqManifestPhred33V2

# Cutadapt
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-end.qza \
  --p-front-f GTGARTCATCGAATCTTTG \
  --p-front-r TCCTCCGCTTATTGATATGC \
  --o-trimmed-sequences trimmed_demux-paired-end.qza

qiime demux summarize \
  --i-data trimmed_demux-paired-end.qza \
  --o-visualization trimmed_demux-paired-end.qzv

# Denoise
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed_demux-paired-end.qza \
  --o-table feature_table.qza \
  --o-representative-sequences sample_rep_seqs.qza \
  --p-trim-left-f 0 --p-trim-left-r 0 \
  --p-trunc-len-f 270 \
  --p-trunc-len-r 235 \
  --output-dir dada2 \
  --verbose

# Taxonomic assignment
qiime feature-classifier classify-sklearn \
  --i-classifier silva-132-99-515-806-nb-classifier.qza \
  --p-reads-per-batch 10000 \
  --i-reads sample_rep_seqs.qza \
  --o-classification taxonomy_silva.qza \
  --quiet

# Phylogenetic tree
qiime fragment-insertion sepp \
  --i-representative-sequences sample_rep_seqs.qza \
  --i-reference-database sepp-refs-silva-128.qza \
  --o-tree insertion-tree.qza \
  --o-placements insertion-placements.qza

# Final filtering
qiime fragment-insertion filter-features \
  --i-table feature_table.qza \
  --i-tree insertion-tree.qza \
  --o-filtered-table feature_table_insertiontreefiltered.qza \
  --o-removed-table removed_features.qza

Qiime2 reference

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37: 852–857. https://doi.org/10.1038/s41587-019-0209-9

Last updated on Jan 29, 2021