ncov-recombinant v0.6.1 - v0.7.0

Test Summary Package

This report was automatically generated on February 28, 2023.

Authors

Katherine Eaton | National Microbiology Laboratory, PHAC

1. Summary

The ncov-recombinant update from v0.6.1 to v0.7.0 has 3 major changes.

The first change is a Nextclade dataset upgrade from 2022-10-27 to 2023-02-01, which adds nomenclature for the newly designated recombinants XBH to XBP.

The second change is the detection of the recursive recombinants XBL and XBN, which arose from two separate recombination events between BA.2.75* and XBB*. Currently, the pipeline is only configured to detect recursive recombination between XBB and VOCs circulating in late 2022 and early 2023.

The third major change is that all documentation has been migrated to Read the Docs. This includes a detailed Developer’s Guide for those looking to contribute to the project.

Between v0.6.1 and v0.7.0, 15.2% of sequences in the controls-gisaid dataset had different detection results. 5.1% of sequences were newly classified (NA → X) and represent lineages not present in the v0.6.1 model. 6.6% of sequences had lineage assignment changes and 3.5% of sequences had sublineage assignment changes as a result of the Nextclade dataset upgrade. 0% of positive controls were dropped (X → NA), indicating no observed loss in sensitivity.
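
As a quick consistency check, the three change categories should add up to the overall figure of 15.2%. A minimal sketch using awk; the helper name sum_percent is illustrative and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Sum a list of percentages and print the total to one decimal place.
# sum_percent is a hypothetical helper, not part of ncov-recombinant.
sum_percent() {
    printf '%s\n' "$@" | LC_ALL=C awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# Newly classified + lineage changes + sublineage changes
sum_percent 5.1 6.6 3.5   # prints 15.2
```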

ncov-recombinant v0.7.0 is a recommended upgrade for recombinant surveillance to accurately classify the latest recombinant lineages (up to XBP) and to detect recursive recombination (e.g. XBL is a recombinant of XBB).

For a comprehensive summary of the methodological changes, please see the release notes for v0.7.0.

2. Purpose

Verify that the update of the ncov-recombinant pipeline from version 0.6.1 to 0.7.0:

  1. Maintains specificity for recombinants trained in previous versions.
  2. Increases sensitivity for newly designated recombinant sublineages.

3. Datasets

Controls

This dataset includes SARS-CoV-2 genomes from GISAID that reflect the known diversity of recombinant sequences to date. These include 572 positive controls (recombinants) representing lineages XA to XBP, and 186 negative controls (non-recombinants) selected from the Nextstrain Reference Phylogeny.

In total, 758 control sequences were used as input and a strain list is available here.
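
The input size can be sanity-checked by counting FASTA header lines. A minimal sketch, assuming the control sequences are stored at data/controls-gisaid/sequences.fasta as in the Supplementary; count_fasta_records is an illustrative helper name, not part of the pipeline.

```shell
#!/usr/bin/env bash
# Count the records in a FASTA file by counting header lines (those starting with ">").
# count_fasta_records is a hypothetical helper, not part of ncov-recombinant.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches, so mask the status
    grep -c '^>' "$1" || true
}

# Example (path taken from the Supplementary). If all 572 positive and
# 186 negative controls are present, this should report 758:
#   count_fasta_records data/controls-gisaid/sequences.fasta
```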

Canada VirusSeq

This dataset includes publicly available SARS-CoV-2 genomes from the Canadian VirusSeq Data Portal. Sequences were downloaded on 2023-01-23 and include 441,234 genomes in total.

4. Procedure

The Snakemake pipelines for v0.6.1 and v0.7.0 were run independently on the controls-gisaid and virusseq datasets. Please see the Procedure section of the Supplementary for detailed command-line instructions.

5. Results

Controls GISAID

Note: Lineage assignments in v0.7.0 are identical to those in pango-designation and are the expected values.

Figure 1: Comparison of lineage assignments in the controls-gisaid dataset between v0.6.1 and v0.7.0. Strain names denote the cluster_id of a novel lineage, that is, the sequence with the earliest collection date.

Canada VirusSeq

Note: Lineage assignments in v0.7.0 are identical to those in pango-designation and are the expected values.

Figure 2: Comparison of lineage assignments in the virusseq dataset between v0.6.1 and v0.7.0. Strain names denote the cluster_id of a novel lineage, that is, the sequence with the earliest collection date.

Changes

New Detections

New detections (NA → X*) result from the following changes in v0.7.0:

  1. Nextclade dataset upgrades to include newly designated lineages: XBG, XBK, XBM.

    Lineage (v0.7.0)    Lineage (v0.6.1)    Parents
    XBG                 NA                  BA.2.76*, BA.5.2*
    XBK                 NA                  BA.5.2*, CJ.1*
    XBM                 NA                  BA.2.76*, BF.3*

Lineage Changes

Lineage changes result from the following updates in v0.7.0:

  1. Nextclade dataset upgrades to include newly designated lineages: XBH, XBJ, XBL, XBN, XBP.

    Lineage (v0.7.0)    Lineage (v0.6.1)    Parents
    XBH                 BY.1                BA.2.3*, BA.2.75*
    XBJ                 BA.2.3.20           BA.2.3*, BA.5.2*
    XBL                 XBB.1-like          BA.2.75*, XBB*
    XBN                 XBB-like            BA.2.75*, XBB*
    XBP                 XBD-like            BA.2.75*, BA.5*

Sublineage Changes

Sublineage changes result from the following updates in v0.7.0:

  1. Nextclade dataset upgrades to include new sublineages for: XAY and XBB.

    Lineage (v0.7.0)    Lineage (v0.6.1)    Parents
    XAY.2               XAY, XAY-like       BA.2*, Delta (21J)
    XBB.1               XBB.1.1             BA.2.10*, BA.2.75*
    XBB.1.5             XBB.1, XBB-like     BA.2.10*, BA.2.75*

Dropped Positives

Dropped positives are only observed in the virusseq dataset, and include the unpublished cluster_id hCoV-19/Canada/ON-PHL-22-53186/2022 (N=19, 2022-12-09 to 2023-01-02). In v0.6.1 this was classified as a BA.5.2/BA.5.3 recombinant with breakpoints extremely close to the 5’ terminus (Figure 3). The most likely reason it is dropped in v0.7.0 is that the 3 mutations attributed to BA.5.2 are no longer considered diagnostic based on the latest global mutation frequencies.

Figure 3: Genomic composition of the dropped positive (hCoV-19/Canada/ON-PHL-22-53186/2022) which is composed of 19 sequences with identical mutation profiles.
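
The change categories reported above (new detection, lineage change, sublineage change, dropped positive) can be approximated from two strain-to-lineage tables. The following is a minimal sketch and is not the logic of compare_positives.py. It assumes tab-separated input with a header row, the strain in column 1 and the lineage in column 2 (for example after extracting those two columns from each positives.tsv). It treats two assignments as a sublineage change when they share the same top-level name once any "-like" suffix is removed. The function name classify_changes, the category labels and this heuristic are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# classify_changes OLD.tsv NEW.tsv
# Prints: strain <TAB> old lineage <TAB> new lineage <TAB> category
# Assumption: column 1 = strain, column 2 = lineage, one header row per file.
classify_changes() {
    LC_ALL=C awk -F'\t' -v OFS='\t' '
        # Top-level name: drop a trailing "-like", then everything after the first "."
        function root(lin) { sub(/-like$/, "", lin); sub(/\..*$/, "", lin); return lin }
        FNR == 1  { next }                         # skip the header row of each file
        NR == FNR { old[$1] = $2; next }           # first file: remember the old calls
        {
            seen[$1] = 1
            o = ($1 in old) ? old[$1] : "NA"
            if (o == "NA")                 cat = "new_detection"       # NA -> X
            else if (o == $2)              cat = "unchanged"
            else if (root(o) == root($2))  cat = "sublineage_change"
            else                           cat = "lineage_change"
            print $1, o, $2, cat
        }
        END {
            # Strains called in the old version but absent from the new one (X -> NA)
            for (s in old) if (!(s in seen)) print s, old[s], "NA", "dropped_positive"
        }
    ' "$1" "$2"
}
```

With this heuristic, XBB.1 to XBB.1.5 and XAY-like to XAY.2 count as sublineage changes, while BY.1 to XBH and XBB.1-like to XBL count as lineage changes, matching the tables above.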

Acknowledgements

The results here are in whole or in part based upon data hosted at the Canadian VirusSeq Data Portal: https://virusseq-dataportal.ca/. We wish to acknowledge the Canadian Public Health Laboratory Network (CPHLN), Genome Canada and the CanCOGeN VirusSeq Consortium for their contribution to the Portal.

Supplementary

Procedure

Download Data

  1. Download the sequences and metadata for the strains in the strain list from GISAID to data/controls-gisaid/.

  2. Download the VirusSeq sequences and metadata.

    wget -O virusseq.tar.gz https://singularity.virusseq-dataportal.ca/download/archive/2d9ace2c-0808-475f-bc93-6ad5808581a4
    tar -xvf virusseq.tar.gz
    
    mkdir -p data/virusseq
    
    # Prep metadata
    csvtk cut -t -f "fasta header name,sample collection date,geo_loc_name (country),geo_loc_name (state/province/territory)" *files-archive*.tsv \
        | csvtk rename -t -f "fasta header name" -n "strain" \
        | csvtk rename -t -f "sample collection date" -n "date" \
        | csvtk rename -t -f "geo_loc_name (country)" -n "country" \
        | csvtk rename -t -f "geo_loc_name (state/province/territory)" -n "division" \
        > data/virusseq/metadata.tsv
    
    # Prep sequences
    mv *files-archive*.fasta data/virusseq/sequences.fasta
    
    # Cleanup
    rm *files-archive*.tsv
    rm virusseq.tar.gz
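
After preparing the files, it may be worth confirming that every sequence has a metadata row. A minimal sketch, assuming the strain is the first column of metadata.tsv and that the record name in each FASTA header is the text up to the first whitespace; missing_metadata is an illustrative helper name, not part of the pipeline.

```shell
#!/usr/bin/env bash
# missing_metadata METADATA.tsv SEQUENCES.fasta
# Prints the FASTA record names that have no matching strain in the metadata.
# Assumption: strain is column 1 of the metadata, which has one header row.
missing_metadata() {
    LC_ALL=C awk -F'\t' '
        NR == FNR { if (FNR > 1) strains[$1] = 1; next }   # metadata: collect strains, skip header
        /^>/ {
            name = substr($0, 2)           # drop the leading ">"
            sub(/[ \t].*$/, "", name)      # keep the text up to the first whitespace
            if (!(name in strains)) print name
        }
    ' "$1" "$2"
}

# Example (paths from the steps above); no output means every record is covered:
#   missing_metadata data/virusseq/metadata.tsv data/virusseq/sequences.fasta
```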

Version 0.7.0 | 3f3d4438

  1. Download the pipeline.

    git clone https://github.com/ktmeaton/ncov-recombinant.git 0.7.0
    cd 0.7.0
    git checkout v0.7.0
  2. Create a version-controlled conda environment.

    # Local
    mamba env create -f workflow/envs/environment.yaml -n ncov-recombinant-0.7.0
    
    # HPC
    sbatch -J conda-ncov-recombinant-0.7.0 --wrap="mamba env create -f workflow/envs/environment.yaml -n ncov-recombinant-0.7.0"
  3. Symlink the controls-gisaid data.

    ln -s ../../../data/controls-gisaid/metadata.tsv data/controls-gisaid/metadata.tsv
    ln -s ../../../data/controls-gisaid/sequences.fasta data/controls-gisaid/sequences.fasta
  4. Symlink the virusseq data.

    ln -s ../../data/virusseq data/virusseq
  5. Run the pipeline for controls-gisaid.

    # Local
    conda activate ncov-recombinant-0.7.0
    snakemake --profile profiles/controls-gisaid
    
    # HPC
    scripts/slurm.sh --profile profiles/controls-gisaid-hpc --conda-env ncov-recombinant-0.7.0
  6. Run the pipeline for virusseq (must be run on HPC).

    scripts/slurm.sh --profile profiles/virusseq-hpc --conda-env ncov-recombinant-0.7.0
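
Before launching either run, it can help to confirm that the symlinked inputs resolve to real files, since ln -s creates a dangling link without raising an error when the relative target is wrong. A minimal sketch; check_inputs is an illustrative helper, not part of the pipeline, and the file names follow the paths used in the steps above.

```shell
#!/usr/bin/env bash
# check_inputs DIR
# Returns non-zero if metadata.tsv or sequences.fasta under DIR is missing,
# empty, or a dangling symlink. The -s test follows symlinks to the target file.
check_inputs() {
    local dir="$1" status=0 f
    for f in metadata.tsv sequences.fasta; do
        if [ ! -s "${dir}/${f}" ]; then
            echo "missing or empty: ${dir}/${f}" >&2
            status=1
        fi
    done
    return "$status"
}

# Example, run from inside the pipeline directory:
#   check_inputs data/controls-gisaid && check_inputs data/virusseq
```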

Version 0.6.1 | 4d1f495a

  1. Download the pipeline.

    git clone https://github.com/ktmeaton/ncov-recombinant.git 0.6.1
    cd 0.6.1
    git checkout v0.6.1-hotfix.1
  2. Create a version-controlled conda environment.

    # Local
    mamba env create -f workflow/envs/environment.yaml -n ncov-recombinant-0.6.1
    
    # HPC
    sbatch -J conda-ncov-recombinant-0.6.1 --wrap="mamba env create -f workflow/envs/environment.yaml -n ncov-recombinant-0.6.1"
  3. Symlink the controls-gisaid data.

    ln -s ../../../data/controls-gisaid/metadata.tsv data/controls-gisaid/metadata.tsv
    ln -s ../../../data/controls-gisaid/sequences.fasta data/controls-gisaid/sequences.fasta
  4. Symlink the virusseq data.

    ln -s ../../data/virusseq data/virusseq
  5. Run the pipeline for controls-gisaid.

    # Local
    conda activate ncov-recombinant-0.6.1
    snakemake --profile profiles/controls-gisaid
    
    # HPC
    scripts/slurm.sh --profile profiles/controls-gisaid-hpc --conda-env ncov-recombinant-0.6.1
  6. Run the pipeline for virusseq (must be run on HPC).

    scripts/slurm.sh --profile profiles/virusseq-hpc --conda-env ncov-recombinant-0.6.1

Comparison

After the pipelines are complete for each version, run the following to compare lineage assignments.

old_ver="0.6.1"
new_ver="0.7.0"

Controls GISAID

conda activate ncov-recombinant-0.7.0

# Run the comparison once per --min-link-size value
link_sizes=("1" "3" "5" "10")
for size in "${link_sizes[@]}"; do
    python3 0.7.0/scripts/compare_positives.py \
      --positives-1 "${old_ver}/results/controls-gisaid/linelists/positives.tsv" \
      --positives-2 "${new_ver}/results/controls-gisaid/linelists/positives.tsv" \
      --ver-1 "v${old_ver}" \
      --ver-2 "v${new_ver}" \
      --outdir "compare/controls-gisaid-${size}" \
      --node-order alphabetical \
      --min-link-size "${size}"
done

Canada VirusSeq

conda activate ncov-recombinant-0.7.0

# Run the comparison once per --min-link-size value
link_sizes=("1" "3" "5" "10")
for size in "${link_sizes[@]}"; do
    python3 0.7.0/scripts/compare_positives.py \
      --positives-1 "${old_ver}/results/virusseq/linelists/positives.tsv" \
      --positives-2 "${new_ver}/results/virusseq/linelists/positives.tsv" \
      --ver-1 "v${old_ver}" \
      --ver-2 "v${new_ver}" \
      --outdir "compare/virusseq-${size}" \
      --node-order alphabetical \
      --min-link-size "${size}"
done

New Lineages

old_ver="0.6.1"
new_ver="0.7.0"
csvtk cut -t -f "strain" ${old_ver}/results/controls-gisaid/linelists/positives.tsv \
  | tail -n+2 \
  | csvtk grep -t -f "strain" -P - -v ${new_ver}/results/controls-gisaid/linelists/positives.tsv \
  | csvtk cut -t -f "strain" \
  | tail -n+2 \
  | csvtk grep -t -f "strain" -P - ${old_ver}/results/controls-gisaid/linelists/linelist.tsv \
  | csvtk pretty -t \
  | less -S

Dropped Lineages

csvtk cut -t -f "strain" ${new_ver}/results/controls-gisaid/linelists/positives.tsv \
  | tail -n+2 \
  | csvtk grep -t -f "strain" -P - -v ${old_ver}/results/controls-gisaid/linelists/positives.tsv \
  | csvtk cut -t -f "strain" \
  | tail -n+2 \
  | csvtk grep -t -f "strain" -P - ${new_ver}/results/controls-gisaid/linelists/linelist.tsv \
  | csvtk pretty -t \
  | less -S
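
Both csvtk pipelines above are set differences on the strain column. Where csvtk is unavailable, the same idea can be sketched with awk. This assumes tab-separated files with a header row that contains a column literally named strain; strains_only_in is an illustrative helper, not part of the pipeline, and it only lists the strain names rather than the full linelist rows.

```shell
#!/usr/bin/env bash
# strains_only_in A.tsv B.tsv
# Prints the strains that are present in A.tsv but absent from B.tsv.
strains_only_in() {
    # B is passed to awk first so its strains are loaded before A is scanned
    LC_ALL=C awk -F'\t' '
        FNR == 1 {
            # Locate the "strain" column in the header of the current file
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "strain") col = i
            if (col == 0) { print "no strain column in " FILENAME > "/dev/stderr"; exit 1 }
            next
        }
        NR == FNR { in_b[$col] = 1; next }     # first file read by awk is B
        !($col in in_b) { print $col }         # second file is A: keep strains not in B
    ' "$2" "$1"
}

# New detections (in v0.7.0 but not in v0.6.1):
#   strains_only_in "${new_ver}/results/controls-gisaid/linelists/positives.tsv" \
#                   "${old_ver}/results/controls-gisaid/linelists/positives.tsv"
# Dropped positives: swap the two arguments.
```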