02: Building a data analysis pipeline for long-read, amplicon-based, raw DNA-sequence data generated by Oxford Nanopore Technology

Mink, Sylvia1,2; Busch, Yannik3; Kiefer, Johanna3; Attenberger, Christian4; Peter, Wolfgang3; Gassner, Christoph2

  1. Medizinisches Zentrallaboratorium GmbH, Feldkirch, Austria
  2. Private Universität im Fürstentum Liechtenstein, Institut für Translationale Medizin, Triesen, Fürstentum Liechtenstein
  3. Stefan Morsch Stiftung, Birkenfeld, Deutschland
  4. Private Universität im Fürstentum Liechtenstein, Triesen, Fürstentum Liechtenstein

Introduction:

Modern DNA sequencing techniques can generate so-called "long reads" of single-stranded DNA. Such long reads enable haplotype-specific analysis of the two parental alleles. This is particularly important when two or more single nucleotide variants (SNVs) are several hundred base pairs apart, which makes it difficult to determine with conventional methods whether they lie on the same parental allele. Oxford Nanopore Technologies (ONT) currently offers the longest read lengths, of up to four megabase pairs.
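
The role of read length in phasing can be illustrated with a toy example. The sketch below is not part of the pipeline described here; the function name `phase_two_snvs` and the gap-free toy alignments are assumptions made purely for illustration. It counts which alleles co-occur on the same read at two SNV positions, and shows that only reads spanning both positions contribute phase information.

```python
from collections import Counter


def phase_two_snvs(reads, pos_a, pos_b):
    """Count allele pairs observed on the same read at two SNV positions.

    reads: iterable of (start, sequence) tuples, assumed to be aligned
    without gaps (a simplification). Positions are 0-based reference
    coordinates.
    """
    pairs = Counter()
    for start, seq in reads:
        end = start + len(seq)
        # Only reads that span both positions carry phase information.
        if start <= pos_a < end and start <= pos_b < end:
            pairs[(seq[pos_a - start], seq[pos_b - start])] += 1
    return pairs


# Hypothetical toy reads with SNVs at reference positions 2 and 9.
toy_reads = [
    (0, "ACGTTTTTTATT"),  # long read: G at position 2, A at position 9
    (0, "ACATTTTTTCTT"),  # long read: A at position 2, C at position 9
    (1, "CGTTTTTTAT"),    # long read: G at position 2, A at position 9
    (0, "ACATT"),         # short read: covers position 2 only, ignored
]

haplotype_support = phase_two_snvs(toy_reads, 2, 9)
print(haplotype_support.most_common())
```

In this toy data set, the allele pairs G/A and A/C are each supported by reads covering both sites, whereas the short read cannot link the two variants.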

However, downstream data analysis is a major challenge when using this promising technology. Currently, there is no clearly established software pipeline for processing and analyzing the raw sequencing data to obtain haplotype-specific results. Instead, several publicly available software tools must be combined. These individual tools are often poorly documented and require a high level of bioinformatics knowledge to use efficiently.

We have therefore developed a data analysis pipeline that allows the automated processing of amplicon-specific long-read ONT sequencing data.

Materials and Methods:

We designed long-range PCRs for the partly homologous FUT1, FUT2 and FUT3 genes. These genes underlie the human blood group systems "Lewis" and "H". Thirty-two DNA samples were amplified and processed according to the ONT protocol for amplicon barcoding with native barcoding expansion 96 (EXP-NBD196 and SQK-LSK109), then analyzed on a MinION using an R10.3 flow cell.

Basecalling was performed in high-accuracy mode using the Guppy basecalling software (© Oxford Nanopore Technologies plc, version 6.1.3+cc1d765d3), and reads were separated by barcode ID. An initial quality check was performed with NanoPlot (version 1.40.0).

Reads were subsequently filtered for quality score and expected amplicon length using NanoFilt (version 2.8.0).
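
The filtering criterion can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the NanoFilt implementation; the function names and all thresholds are hypothetical, and the mean quality is computed here by averaging error probabilities rather than raw Phred scores, which is an assumption about how such a filter may be defined.

```python
import math


def mean_quality(quality_string, offset=33):
    """Mean read quality, averaged over per-base error probabilities."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(records, min_quality, min_length, max_length):
    """Yield (name, sequence, quality) records that pass both filters."""
    for name, seq, qual in records:
        length_ok = min_length <= len(seq) <= max_length
        if length_ok and mean_quality(qual) >= min_quality:
            yield name, seq, qual


# Hypothetical records; real input would be parsed from a FASTQ file.
toy_records = [
    ("good", "ACGT" * 3, "I" * 12),      # Q40, length 12
    ("low_quality", "ACGT" * 3, "#" * 12),  # Q2, length 12
    ("too_short", "ACG", "III"),         # Q40, length 3
]

kept = [name for name, _, _ in
        filter_reads(toy_records, min_quality=10, min_length=10, max_length=20)]
print(kept)
```

Restricting reads to a length window around the expected amplicon size removes truncated reads and chimeric products before alignment.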

Alignment to the reference sequence was done with minimap2 (version 2.24-r1122). The resulting SAM (sequence alignment map) file was then converted to its binary counterpart using samtools (version 1.15, htslib 1.15). Bcftools (version 1.15) was used for calling variants and filtering according to quality score.
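
A minimal sketch of how these steps might be chained is given below. The exact options used in our pipeline are not listed in this abstract, so the file names, the `map-ont` preset and the quality threshold of 20 are assumptions chosen for illustration. The function only builds the command lines; each one could then be executed with `subprocess.run(cmd, check=True)`.

```python
import shlex


def alignment_and_calling_commands(reference, reads, prefix, min_qual=20):
    """Build command lines for alignment, BAM conversion and variant calling.

    All file names derived from `prefix` are hypothetical.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    pileup = f"{prefix}.pileup.bcf"
    raw_vcf = f"{prefix}.calls.vcf.gz"
    filtered_vcf = f"{prefix}.filtered.vcf.gz"
    return [
        # Align reads to the reference and write SAM output.
        ["minimap2", "-a", "-x", "map-ont", "-o", sam, reference, reads],
        # Sort and convert SAM to BAM, then index the BAM file.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Compute genotype likelihoods and call variants.
        ["bcftools", "mpileup", "-f", reference, "-O", "u", "-o", pileup, bam],
        ["bcftools", "call", "-m", "-v", "-O", "z", "-o", raw_vcf, pileup],
        # Keep only variants at or above the quality threshold.
        ["bcftools", "view", "-i", f"QUAL>={min_qual}",
         "-O", "z", "-o", filtered_vcf, raw_vcf],
    ]


commands = alignment_and_calling_commands("FUT1_ref.fasta", "barcode01.fastq", "barcode01")
for cmd in commands:
    print(shlex.join(cmd))
```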

Phasing of haplotypes was performed with whatshap (version 1.4) and ONToHap (ONToHap-v1.0.0). Whatshap statistics yielded an overview of phasing data including variant characteristics for each gene. Haplotype-specific sequences in FASTA format were generated with bcftools (version 1.15).
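
The whatshap and bcftools steps might be assembled as sketched below. The options actually used are not stated in this abstract, so the file names and the choice of `--ignore-read-groups` are assumptions; ONToHap is omitted because its command-line interface is not described here. As before, the function only builds command lines that could be run with `subprocess.run(cmd, check=True)`.

```python
import shlex


def phasing_commands(reference, vcf, bam, prefix):
    """Build command lines for phasing, statistics and haplotype FASTA export.

    All file names derived from `prefix` are hypothetical.
    """
    phased = f"{prefix}.phased.vcf"
    phased_gz = f"{prefix}.phased.vcf.gz"
    cmds = [
        # Phase the called variants using the aligned long reads.
        ["whatshap", "phase", "--ignore-read-groups",
         "--reference", reference, "-o", phased, vcf, bam],
        # Summarize the phasing result as a tab-separated table.
        ["whatshap", "stats", "--tsv", f"{prefix}.stats.tsv", phased],
        # Compress and index the phased VCF for bcftools consensus.
        ["bcftools", "view", "-O", "z", "-o", phased_gz, phased],
        ["bcftools", "index", phased_gz],
    ]
    for hap in (1, 2):
        # -H 1 / -H 2 apply the first / second allele of each genotype.
        cmds.append(["bcftools", "consensus", "-H", str(hap), "-f", reference,
                     "-o", f"{prefix}.hap{hap}.fasta", phased_gz])
    return cmds


phase_cmds = phasing_commands("FUT1_ref.fasta", "barcode01.filtered.vcf.gz",
                              "barcode01.sorted.bam", "barcode01")
for cmd in phase_cmds:
    print(shlex.join(cmd))
```

Note that `-H 1` and `-H 2` take the first and second genotype allele regardless of phasing, so heterozygous variants that remain unphased would need separate handling in a real analysis.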

Finally, this sequence of analysis steps was automated using a Bash script on Linux.

Results:

For each gene, this data analysis pipeline generated phased haplotypes of approximately 11 kilobase pairs. The raw data of all 32 DNA samples were processed simultaneously.

Conclusion:

The described data analysis pipeline allows for simple and time-efficient processing of long-read, amplicon-based sequencing data generated with Oxford Nanopore Technologies instruments.