The web, pdf , and official versions of this thesis were generated from ktmeaton/obsidian-public@19ca4a67 on January 28, 2022.
The Plague is a disease that has profoundly impacted human history and is responsible for some of the most fatal pandemics ever recorded. It may surprise many to know that this disease is not a bygone of a past era, but in fact is still present in many regions of the world. Although researchers have been studying plague for hundreds of years, there are many aspects of its epidemiology that are enigmatic. In this thesis, I focus on how DNA from the plague bacterium can be used to estimate where and when this disease appeared in the past. To do so, I reconstruct the evolutionary relationships between modern and ancient strains of plague, using publicly available data and new DNA sequences retrieved from the skeletal remains of plague victims in Denmark. This work offers a new methodological framework for large-scale genetic analysis, provides a critique on what questions DNA evidence can and cannot answer, and expands our knowledge of the global diversity of plague.
Pandemics of plague have reemerged multiple times throughout human history with tremendous mortality and extensive geographic spread. The First Pandemic (6th - 8th century) devastated the Mediterranean world, the Second Pandemic (14th - 19th century) swept across much of Afro-Eurasia, and the Third Pandemic (19th - 20th century) reached every continent except for Antarctica, and continues to persist in various endemic foci around the world. Despite centuries of historical research, the epidemiology of these pandemics remains enigmatic. However, recent technological advancements have yielded a novel form of evidence: ancient DNA of the plague bacterium Yersinia pestis. In this thesis, I explore how genomic data can be used to unravel the mysteries of when and where this disease appeared in the past. In particular, I focus on phylogenetic approaches to studying this ‘small microbe’ with ‘big data’ (ie. 100s - 1000s of genomes). I begin by describing novel software I developed that supports the acquisition and curation of large amounts of DNA sequences in public databases. I then use this tool to create an updated global phylogeny of Y. pestis, which includes ~600 genomes with standardized metadata. I devise and validate a new approach for temporal modeling (ie. molecular clock) that produces robust divergence dates in pandemic lineages of Y. pestis. In addition, I critically examine the questions that genomic evidence can and cannot address in isolation, such as whether the timing and spread of short-term epidemics can be confidently reconstructed. Finally, I apply this theoretical and methodological insight to a case study in which I reconstruct the appearance, persistence, and disappearance of plague in Denmark during the Second Pandemic. The three papers enclosed in this sandwich-thesis contribute to a larger body of work on the anthropology of plague, which seeks to understand how disease exposure and experience change over time and between human populations. Furthermore, this dissertation more broadly impacts both prospective studies of infectious disease, such as environmental surveillance and outbreak monitoring, and retrospective studies, which seek to date the emergence and spread of past pandemics.
I’d like to thank my parents, Michelle and Michael Eaton. When I was little, I thought you knew everything. And now that I’m writing my doctoral dissertation… I realize you do know everything! I hope when I grow up, I turn out to be just like you <3 Thank you for your love, support, and encouragement over all these years.
To Miriam: Thank you for being my partner, my best friend, my everything. I hope that one day I have 1/10 of your intellect, kindness, and patience. (Maybe we won’t hold our breath for that last one.)
To Hendrik Poinar: Thank you for your unending support and enthusiasm. Your mentorship and passion for research has been my rock during the hard times. Thank you for taking leaps of faith and trusting me when I proposed ridiculous project ideas. At least a few of those wound up in this thesis! And most importantly, thank you for traveling to Edmonton in 2013 to give a talk at the University of Alberta!
Thank you to members of my doctoral supervisory committee: Brian Golding, Tracy Prowse, and Nükhet Varlık. I am indebted to you for your generous support, careful guidance, and prompt feedback. Our affectionate motto of ‘Keep It Simple Stupid’ has played on loop in my head as I prepared this dissertation.
To John Silva, Marcia Furtado, and Delia Hutchinson: Thank you for guiding me through the labyrinth that is McMaster’s administration. Your smiling faces up on the 5th floor were always so reassuring. I knew that if I ever had a problem, you would be there to investigate and advocate on my behalf.
To the Plague Team: Jennifer Klunk, what would I do without you? I think everything I’ve ever known and ever will know comes back to you. Thank you for being a dedicated mentor, a brilliant scientist, and the best companion for dancing in the lab. Madeline Tapson, I dearly miss sharing a desk with you. Your warm and friendly spirit was always comforting, and you opened my eyes to so many new avenues of plague research. Ravneet Sidhu, I learned so much from training you and I loved that you questioned everything. I’m so excited to be collaborating with you on current and future projects. Michael Klowak and Julianna Stangroom, thank you for your HARD work in screening a dizzying number of plague samples and making the lab such a fun and exciting place to be.
To Emil Karpinski: I always looked forward to our bus rides into campus together. You played such an important role in creating a welcoming atmosphere when I first arrived and throughout my whole degree. Also, you are lab notebooks goals (wow).
To Ana Duggan: You have been a role model for me in so many avenues of my academic, personal, and professional life! You were the first woman I met that was also passionate about computational analysis, and have made me feel more comfortable and confident in my own skin.
To Nathalie Mouttham: Thank you for being such a stellar trainer and friend! I have vivid (but positive) memories of long hours in a laboratory basement doing Phenol Chloroform extractions together and playing 20 questions.
Thank you to all past and present members of the McMaster Ancient DNA Centre. In particular: Melanie Kuch, Matthew Emery, Jess Hider, Samantha Price, Marie-Hélène B.-Hardy, and Dirk Hackenberger. You created such a unique sense of community, and left pretty big shoes to fill!
Thank you to all collaborators who have generously shared their time, energy, and resources with me. In particular, thank you to Rebecca Redfern for working with me on the Roman Londoners project!
To my colleagues at Red Lobster: I worked in many kitchens to fund my education, but working at Red Lobster was by far my favorite. Also to the managers, thank you for letting me have so many free Cheddar Bay Biscuits, they were a crucial component of my student diet (no joke).
Finally, I would like to acknowledge all individuals who financially supported me throughout my doctoral research. I thank Hendrik Poinar, the Department of Anthropology, the MacDATA Institute, the Sherman Centre for Digital Scholarship, McMaster University, the Province of Alberta, and the Social Sciences Research and Humanities Research Council.
aDNA | Ancient DNA |
DNA | Deoxyribonucleic acid |
MRCA | Most Recent Common Ancestor |
NCBI | National Center for Biotechnology Information |
SRA | Sequence Read Archive |
tMRCA | Time to the most recent common ancestor |
I, Katherine Eaton, declare that this thesis titled, ‘Big Data, Small Microbes: Genomic analysis of the plague bacterium Yersinia pestis’ and the work presented in it are my own. I confirm that:
This work was done wholly or mainly while in candidature for a research degree at McMaster University.
Where any part of this thesis has previously been submitted for a degree or any other qualification at McMaster University or any other institution, this has been clearly stated.
Where I have consulted the published work of others, this is always clearly attributed.
Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
In 2011, I learned about a researcher named Dr. Hendrik Poinar. His team had just published a seminal paper, in which they identified the causative agent of the infamous Black Death [1]. I discovered that this morbid term describes a pandemic that devastated the world in the 14th century, with unprecedented mortality and spread. In less than 10 years (1346-1353) the Black Death swept across Afro-Eurasia, killing 50% of the population [2]. Outbreaks of this new and mysterious disease, often referred to as The Plague, reoccurred every 10 years on average [3]. This epidemic cycling continued for 500 long years in Europe, but in Western Asia, the disease never truly disappeared [4]. The 10-year window of the Black Death alone has an estimated global mortality of 200 million people, making it the most fatal pandemic in human history [5], and also one of the most mysterious.
The cryptic nature of this medieval disease led to significant debate among contemporaries. The dominant theory of contagion at the time was miasma, in which diseases were spread through noxious air [6]. However, Ibn al-Khatib, a notable Islamic scholar, took issue with this theory. After studying outbreaks of The Plague in the 14th century, he proposed an alternative hypothesis in which minute bodies were transmissible between humans [7]. Like most controversial theories, this idea was not readily embraced. Some 400 years later, the British botanist Richard Bradley wrote a radical treatise on The Plague [8] where he similarly proposed that infectious diseases were caused by living, microscopic agents. Again, this theory was rejected. It was not until the 19th century that this “new” perspective would receive widespread acceptance [9]. It is quite remarkable that our modern conceptions of epidemiology and bacteriology can be traced back to diverse “founders” throughout history, who all happened to be grappling with the perplexing nature of The Plague.
After it was established that a living organism caused the Black Death, the intuitive next step was to precisely identify the organism. The symptoms described in historical texts seemed to incriminate bubonic plague [2], a bacterial pathogen that passes from rodents to humans, and leads to grotesquely swollen lymph nodes (buboes). On the other hand, the rapid spread of the Black Death suggests this was a contagion primarily driven by human to human transmission, which more closely fit the profile of an Ebola-like virus [10]. In the 1990s and 2000s, geneticists began contributing novel evidence to the debate, by retrieving pathogenic DNA from skeletal remains [11]. The plague bacterium, Yersinia pestis, played a central role in these molecular investigations, as researchers sought to either establish or refute its presence in medieval victims [12]. The competitive nature of this discourse fueled significant technological progress, and over the next decade, the study of ancient DNA became a well-established discipline. However, the origins of the Black Death remained unresolved, due to numerous controversies surrounding DNA contamination and scientific rigor [13].
As an undergraduate student of forensic anthropology, I was fascinated by the rapid pace at which the field of ancient DNA was developing. I attribute my developing academic obsession to two early-career experiences. First, was reading the highly entertaining back-and-forth commentaries in academic journals [14], where plague researchers would occasionally exchange personal insults [15]. It was clear that these researchers cared deeply about their work. Despite the occasional toxicity, I found these displays of passion to be engaging and refreshing, compared to the otherwise emotionally-sterile field of scientific publishing.
The second defining experience, was the perplexing and often frustrating task of diagnosing infectious diseases from skeletal remains. I was intrigued by the idea of reconstructing an individual’s life story from their skeleton, and using this information to solve the mysteries of the dead. However, while some forms of trauma leave diagnostic marks on bone (ex. sharp force), acute infectious diseases rarely manifest in the skeleton [16,17] and thus are ‘invisible’ to an anthropologist. Because of this, I found the new field of ancient DNA to be extremely appealing, as it offered a novel solution to this problem. Anthropologists could now retrieve the precise pathogen that had infected an individual, and contribute new insight regarding disease exposure and experience throughout human history. These experiences suggested to me that studying the ancient DNA of pathogens would be an exciting, dynamic, and productive line of research for a graduate degree. I’m happy to say that 10 years later, I still agree with this statement, and by writing this dissertation I hope to convince you, the reader, as well.
Which brings us back to Dr. Hendrik Poinar and his team’s seminal work on the mysterious Black Death. The study, led by first author Kirsten Bos, had found DNA evidence of the plague bacterium Y. pestis in several Black Death victims buried in a mass grave in London [1]. However, they did not just retrieve a few strands of DNA, they captured millions of molecules (10.5 million to be precise) which allowed them to reconstruct the entire Y. pestis genome, comprising four million DNA bases. The amount of molecular evidence was staggering, and offered irrefutable proof that the plague bacterium was present during the time of Black Death. However, with a sample size of N=1, the genetic link between Y. pestis and this ancient pandemic was tentative at best.
Armed with the proposal of finding more evidence of Y. pestis in the archaeological record, I applied to work for Dr. Hendrik Poinar at the McMaster Ancient DNA Centre. In 2014, I had the delight and privilege of being accepted into the graduate program at McMaster University. Alongside other members of the “McMaster Plague Team”, I set about the daunting task of screening the skeletal remains of more than 1000 individuals for molecular evidence of Y. pestis. This material was generously provided by archaeological collaborators, who were similarly invested in the idea that ancient DNA techniques could identify infectious diseases in their sites. These archaeological remains reflected nearly a millennium of human history, with sampling ages ranging from the 9th to the 19th century CE. The geographic diversity was also immense, with individuals sampled across Europe, Africa, and Asia.
Of the 1000+ individuals screened, approximately 30% originated in Denmark. Due to this large sample size, we, the “Plague Team”, had the greatest success in identifying ancient Y. pestis in this region. Over a period of 5 years, we retrieved Y. pestis DNA from 13 Danish individuals dated to the medieval and early modern periods. To contextualize these plague isolates, we reconstructed their evolutionary relationships using a large comparative dataset of global Y. pestis. In Chapter 4, I present the results of this collaborative study, which marks the first longitudinal analysis of an ancient pathogen in a single region. I explore whether the genetic evidence of Y. pestis aligns with the historical narrative of the Black Death, and whether or not subsequent epidemics can be attributed to long-distance reintroductions. However, while this high-throughput study was the first one I embarked on, as the chapter numbering indicates, it would be the last project I completed due to several unforeseen complications.
While the McMaster Plague Team was busy screening for Y. pestis, so too were other ancient DNA centres throughout the world. Between 2011 and 2021, more than 100 ancient Y. pestis genomes were published, making plague the most intensively sequenced historical disease. The sequencing of modern isolates accelerated in tandem, with over 1500 genomes produced from culture collections of 20th and 21st century plague outbreaks [18]. Because of this influx of evidence, the research questions changed accordingly. Geneticists were no longer interested in just establishing the presence of Y. pestis during the short time frame of the Black Death (1346-1353), they wanted to know how it behaved and spread throughout the long 500 years of this pandemic. The longitudinal study design of Chapter 4 was therefore well-positioned to address these nuanced epidemiological questions. However, this novel genetic evidence also introduced new complexities.
It quickly became clear that isolates of Y. pestis sampled during epidemic periods were highly similar in terms of genetic content, if not indistinguishable clones [19]. This called into question the resolution of genomic evidence, and whether the geographic origins and spread of the Black Death could be accurately inferred using ancient DNA studies. This was further confounded by the finding that the rate of evolutionary change in Y. pestis could vary tremendously [20] which led to the discovery that previously published temporal models were erroneous [21]. It became increasingly uncertain whether genetic evidence could be used to produce informative estimates of the timing of plague’s frequent reemergences [22]. As I read these critical studies, I began developing an idea to address the substantial gaps in our evolutionary theory and methodology concerning the plague bacterium Y. pestis. This idea culminated in Chapter 3, where I curated and contextualized the largest global data set of plague genomes. I critiqued the existing spatiotemporal models of plague’s evolutionary history, and with the assistance of my co-authors, devised a new methodological approach. This method would then be repurposed for Chapter 4, so that I could infer the emergence and disappearance of Y. pestis in Denmark with greater accuracy. However, as the chapter numbering once again reflects, there was one final obstacle.
Synthesizing the largest genomic data set was a lofty ambition, especially considering that there were few software tools available to perform such a task. New plague genomes of Y. pestis were being published monthly, and at times even weekly, with such volume that manual tracking became impossible. My excel spreadsheet of genetic metadata became riddled with errors and fields with missing data. The era of “Big Data” had arrived, and I was woefully unequipped to effectively manage this deluge of information. In response, I ventured into the tumultuous waters of software development. In Chapter 2, I describe my original software that automates the acquisition and organization of genetic metadata. Academic publishing in the field of software was a unique experience, as I had to both produce a scholarly manuscript and demonstrate expertise as a service-provider. This database tool has continually proven to be indispensable, and is the backbone upon which the studies in Chapter 3 and Chapter 4 would be rebuilt upon.
At this point, I re-introduce the dissertation as a collection of three hierarchical, but independently published, studies. I first describe an original piece of software in Chapter 2, which automates the retrieval and organization of publicly available sequence data. In Chapter 3, I outline how this tool was used to generate an updated and curated phylogeny of Y. pestis, which yielded novel insight regarding the timing and origins of past pandemics. In this chapter, I also conduct a critical examination of the historical questions that genomic evidence can, or cannot, address. In Chapter 4, I use these theories and methods to reconstruct the emergence and continuity of plague in Denmark over a period of 400 years. I conclude in Chapter 5 with a discussion of the contributions of each study, with a particular focus on their significance within the broader field of anthropology.
Published 03 February 2020 in
The Journal of Open Source Software, 5(46), 1990.
https://doi.org/10.21105/joss.01990
Licensed under a Creative Commons Attribution 4.0 International License.
Katherine Eaton1,2
1 McMaster Ancient DNA Centre, McMaster University
2 Department of Anthropology, McMaster University
NCBImeta
is a command-line application that downloads and organizes biological metadata from the National Centre for Biotechnology Information (NCBI). While the NCBI web portal provides an interface for searching and filtering molecular data, the output offers limited options for record retrieval and comparison on a much larger and broader scale. NCBImeta
tackles this problem by creating a reformatted local database of NCBI metadata based on user search queries and customizable fields. The output of NCBImeta
, optionally a SQLite database or text file(s), can then be used by computational biologists for applications such as record filtering, project discovery, sample interpretation, and meta-analyses of published work.
Recent technological advances in DNA sequencing have propelled biological research into the realm of big data. Due to the tremendous output of Next Generation Sequencing (NGS) platforms, numerous fields have transformed to explore this novel high-throughput data. Projects that quickly adapted to incorporate these innovative techniques included monitoring the emergence of antibiotic resistance genes [23], epidemic source tracking in human rights cases [24], and global surveillance of uncharacterized organisms [25]. However, the startling rate at which sequence data are being deposited online have presented significant hurdles to the efficient reuse of published data. In response, there is growing recognition within the computational community that effective data mining techniques are a dire necessity [26,27].
An essential step in the data mining process is the efficient retrieval of comprehensive metadata. These metadata fields are diverse in nature, but often include the characteristics of the biological source material, the composition of the raw data, the objectives of the research initiative, and the structure of the post-processed data. Several software applications have been developed to facilitate bulk metadata retrieval from online repositories. Of the available tools, SRAdb
[28], the Pathogen Metadata Platform [29], MetaSRA
[30], and pysradb
[31] are among the most widely utilised and actively maintained. While these software extensions offer substantial improvements over the NCBI web browser experience, there remain several outstanding issues.
In response, NCBImeta
aims to provide a more user-inclusive experience to metadata retrieval, that emphasizes real-time access and provides generalized frameworks for a wide variety of NCBI’s databases.
NBCImeta
is a command-line application that executes user queries and metadata retrieval from the NCBI suite of databases. The software is written in Python 3, using the BioPython
module [32] to connect to, search, and download XML records with NCBI’s E-Utilities [33]. The lxml
package is utilised to perform XPath queries to retrieve nodes containing biological metadata of interest. SQLite
is employed as the database management system for storing fetched records, as implemented with the sqlite3
python module. Accessory scripts are provided to supply external annotation files, to join tables within the local database so as to re-create the relational database structure, and finally to export the database as tabular text for downstream analyses. NCBImeta
currently interfaces with the molecular and literature databases [34] described in Table 2.1.
Database | Description |
---|---|
Assembly | Descriptions of the names and structure of genomic assemblies, statistical reports, and sequence data links. |
BioSample | Characteristics of the biological source materials used in experiments. |
BioProject | Goals and progress of the experimental initiatives, originating from an individual organization or a consortium. |
Nucleotide | Sequences collected from a variety of sources, including GenBank, RefSeq, TPA and PDB. |
PubMed | Bibliographic information and citations for biomedical literature from MEDLINE, life science journals, and other online publications. |
SRA | Composition of raw sequencing data and post-processed alignments generated via high-throughput sequencing platforms. |
The typical workflow of NCBImeta
follows four major steps as outlined in Figure 2.1. Users first configure the program with their desired search terms. NCBImeta
is then executed on the command-line to fetch relevant records and organize them into a local database. Next, the user optionally edits the database to, for example, add their own custom metadata. Finally, the resulting database, kept in SQLite format or exported to text, delivers 100+ biologically-relevant metadata fields to researcher’s fingertips. This process not only saves significant time compared to manual record retrieval through the NCBI web portal, but additionally unlocks attributes for comparison that were not easily accessible via the web-browser interface.
NCBImeta
’s implementation offers a novel approach to metadata management and presentation that improves upon the prevously described limitations of existing software in a number of ways. First, NCBImeta
is run on the command-line, and the final database can be exported to a text file, thus no knowledge of an external programming language is required to generate or explore the output. Second, a general parsing framework for tables and metadata fields was developed which can be extended to work with diverse database types contained within NCBI’s infrastructure. Finally, a query system was implemented for record retrieval that allows users to access records in real-time, as opposed to working with intermittent or out-dated database snapshots.
The following section demonstrates how NCBImeta
can be used to obtain current and comprehensive metadata for a pathogenic bacteria, Pseudomonas aeruginosa, from various sequencing projects across the globe. P. aeruginosa is an opportunistic pathogen associated with the disease cystic fibrosis (CF) and is highly adaptable to diverse ecological niches [35]. As such, it is a target of great interest for comparative genomics and there are currently over 15,000 genomic sequence records available which are spread across two or more databases. In cases such as this, it is critical to leverage the tremendous power of these existing datasets while being conscious of the labor typically required to retrieve and contextualize this information. NCBImeta
renders the problem of acquiring and sifting through this metadata trivial and facilitates the integration of information from multiple sources.
To identify publicly available P. aeruginosa genomes, NCBImeta
is configured to search through the tables Assembly (assembled genomes) and SRA (raw data). For additional context, NCBImeta
is used to retrieve metadata from the Nucleotide table for descriptive statistics of the genomic content, from the BioProject table to examine the research methodology of the initiative, from Pubmed to identify existing publications, and finally from the Biosample table to explore characteristics of the biological material. A small subset of the 100+ retrieved columns is shown in Figure 2.2, to provide a visual example of the output format and the metadata that is retrieved.
Subsequently, the output of NCBImeta can be used for exploratory data visualization and analysis. The text file export function of NCBImeta ensures downstream compatibility with both user-friendly online tools (ex. Google Sheets Charts) as well as more advanced visualization packages [36]. In Figure 2.3, the geospatial distribution of P. aeruginosa isolates are plotted alongside key aspects of genomic composition (ex. number of genes).
Finally, NCBImeta can be used to streamline the process of primary data acquisition following careful filtration. FTP links are provided as metadata fields for databases attached to an FTP server (ex. Assembly, SRA) which can be used to download biological data for downstream analysis.
The development of NCBImeta
has primarily focused on a target audience of researchers whose analytical focus is prokaryotic genomics and the samples of interest are the organisms themselves. Chief among those include individuals pursuing questions concerning epidemiology, phylogeography, and comparative genomics. Future releases of NCBImeta
will seek to broaden database representation to include gene-centric and transcriptomics research (ex. NCBI’s Gene and GEO databases).
NCBImeta is a command-line application written in Python 3 that is supported on Linux and macOS systems. It is distributed for use under the OSD-compliant MIT license (https://opensource.org/licenses/MIT). Source code, documentation, and example files are available on the GitHub repository (https://github.com/ktmeaton/NCBImeta).
I would like to thank Dr. Hendrik Poinar and Dr. Brian Golding for their guidance and support on this project, as well as for insightful conversations regarding biological metadata, the architecture of NCBI, and software deployment. Thank you to Dr. Andrea Zeffiro, Dr. John Fink, Dr. Matthew Davis, and Dr. Amanda Montague for valuable discussions regarding APIs, digital project management, and software publishing. Thank you to all past and present members of the McMaster Ancient DNA Centre and the Golding Lab. This work was supported by the MacDATA Institute (McMaster University, Canada) and the Social Sciences and Humanities Research Council of Canada (#20008499).
Submitted 17 December 2021 to
Communications Biology.
https://www.researchsquare.com/article/rs-1146895
Licensed under a Creative Commons Attribution 4.0 International License
Katherine Eaton1,2, Leo Featherstone3, Sebastian Duchene3, Ann G. Carmichael4, Nükhet Varlık5, G. Brian Golding6, Edward C. Holmes7, Hendrik N. Poinar1,2,8,9,10
1McMaster Ancient DNA Centre, McMaster University, Hamilton, Canada.
2Department of Anthropology, McMaster University, Hamilton, Canada.
3The Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Australia.
4Department of History, Indiana University Bloomington, Bloomington, USA.
5Department of History, Rutgers University-Newark, Newark, USA.
6Department of Biology, McMaster University, Hamilton, Canada.
7Sydney Institute for Infectious Diseases, School of Life & Environmental Sciences and School of Medical Sciences, University of Sydney, Sydney, Australia.
8Department of Biochemistry, McMaster University, Hamilton, Canada.
9Michael G. DeGroote Institute of Infectious Disease Research, McMaster University, Hamilton, Canada.
10Canadian Institute for Advanced Research, Toronto, Canada.
Plague has an enigmatic history as a zoonotic pathogen. This potentially devastating infectious disease will unexpectedly appear in human populations and disappear just as suddenly. As a result, a long-standing line of inquiry has been to estimate when and where plague appeared in the past. However, there have been significant disparities between phylogenetic studies of the causative bacterium, Yersinia pestis, regarding the timing and geographic origins of its reemergence. Here, we curate and contextualize an updated phylogeny of Y. pestis using 601 genome sequences sampled globally. We perform a detailed Bayesian evaluation of temporal signal in subsets of these data and demonstrate that a Y. pestis-wide molecular clock model is unstable. To resolve this, we devised a new approach in which each Y. pestis population was assessed independently. This enabled us to recover significant temporal signal in five populations, including the ancient pandemic lineages which we now estimate may have emerged decades, or even centuries, before a pandemic was historically documented from European sources. Despite this, we only obtain robust divergence dates from populations sampled over a period of at least 90 years, indicating that genetic evidence alone is insufficient for accurately reconstructing the timing and spread of short-term plague epidemics. Finally, we identify key historical data sets that can be used in future research, which will complement the strengths and mitigate the weaknesses of genomic data.
Plague has an impressively long and expansive history as a zoonosis of rodents. The earliest “fossil” evidence of the plague bacterium, Yersinia pestis, stems from ancient DNA studies which date its first emergence in humans to the Late Neolithic Bronze Age (LNBA) approximately 5000 years ago [37]. During this time, Y. pestis has dispersed globally on multiple occasions due to an ability to infect a variety of mammalian hosts [38] and ever-expanding trade networks [39]. Few regions of the ancient and modern world remain untouched by this disease, as plague has demonstrated a persistent presence on every continent except Australia and Antarctica [40]. There are three historically documented pandemics of plague: the First Pandemic (6th to 8th century CE) [21], the Second Pandemic (14th to 19th century CE) [41], and the Third Pandemic (19th to 20th century CE) [42]. The advent of each has been marked by a series of outbreaks, such as the medieval Black Death (1346 - 1353 CE), which is estimated to have killed more than half of Europe’s population [43].
One long-standing line of inquiry in plague’s evolutionary history has been estimating the timing, origins, and spread of these past pandemics. The most intensively researched events have been: (1) the first appearance of Y. pestis in human populations [44], (2) the onset and progression of the three pandemics [1,20,21], and (3) the inter-pandemic or “quiescent” periods where Y. pestis recedes into wild rodent reservoirs and disappears from the historical record [45,46]. Our knowledge of these events has deepened considerably in recent years, owing in part to technological advancements in the retrieval and sequencing of ancient DNA alongside new molecular clock dating methods.
Despite intensive interest and methodological advancement, the rate and time scale of evolution in Y. pestis remains notoriously difficult to estimate. This is largely attributed to the substantial variation in evolutionary rates that has been documented across the phylogeny [19,20]. As a result, considerable debate has emerged over whether Y. pestis has no temporal signal [21], thereby preventing meaningful rate estimates, or if some Y. pestis populations have such distinct rates that a species-wide signal is obscured [22]. This uncertainty has resulted in radically different rate and date estimates between studies, with node dates shifting by several millennia [20,44].
The geographic origins and dispersal of past pandemics are similarly contentious, particularly concerning mechanisms of spread and their underlying ecology. This contention concerns competing hypotheses about the relative importance of localized persistence versus long-distance reintroduction [47,48,49]. Among both sides of this issue, there is an expectation that genomic evidence will play a significant role [50], if not resolve the debate [47]. However, no study to date has statistically evaluated whether Y. pestis genomes have sufficient geographic signal to confidently infer ancestral locations and spread.
To address these debates and obstacles, we: (1) curated and contextualized the most recent Y. pestis genomic evidence, (2) reviewed and critiqued our current understanding of plague’s population structure, (3) devised a new approach for recovering temporal signal in Y. pestis, and (4) critically assessed the reliability of phylogeographic analysis. We ground our results and their interpretation using informative historical case studies to demonstrate the methodological and interpretive consequences. We anticipate these results will impact both prospective studies of plague, such as environmental surveillance and outbreak monitoring, and retrospective studies, which seek to date the emergence and spread of past pandemics.
To determine the population structure of Y. pestis, we first estimated a maximum likelihood phylogeny using 601 global isolates including 540 modern (89.9%) and 61 ancient (10.1%) strains (Methods). We rooted the tree using two genomes of the outgroup taxa Yersinia pseudotuberculosis. The alignment consisted of 10,249 variant positions exclusive to Y. pestis, with 3,844 sites shared by at least two strains. Following phylogenetic estimation, we pruned the outgroup taxa from the tree to more closely examine the genetic diversity of Y. pestis. In Figure 3.1 A, we contextualize the global phylogeny using three nomenclature systems: the metabolic biovars, major branches, and historical time periods. In the following section, we compare and critique each system, identify any incongruent divisions and uncertainty, and explore an integrative approach for molecular clock analysis.
The oldest classification system of Y. pestis is the biovar nomenclature that uses metabolic differences to define population structure. Accordingly, Y. pestis can be categorized into four classical biovars: Antiqua (ANT), Medievalis (MED), Orientalis (ORI), and microtus/pestoides (PE) [51,52]. Non-classical biovars have also been introduced, such as the Intermedium biovar (IN), which reflects a transitional state from Antiqua to Orientalis [53]. The biovar system is simple in application, as it largely focuses on two traits: the ability to ferment glycerol and reduce nitrate [52]. However, this simplicity is offset by the growing recognition of regional inconsistencies in metabolic profiles [54]. This is further exacerbated by the sequencing of non-viable, “extinct” Y. pestis for which metabolic sub-typing is challenging [1]. Researchers have responded to this uncertainty in a variety of ways, by extrapolating existing biovars [21] and creating new pseudo-biovars (PRE) [44]. Others have foregone the biovar nomenclature altogether in favor of locally-developed taxonomies [54]. Despite extensive research, it remains unclear which metabolic traits, if any, can be used to classify Y. pestis into distinct populations on a global scale.
In contrast to the biovar nomenclature which emphasizes phenotype, the major branch nomenclature focuses on genotype. This system divides the global phylogeny of Y. pestis into populations according to their relative position to a multifurcation called the “Big Bang” polytomy [20]. All lineages that diverged prior to this polytomy are grouped into Branch 0 and those diverging after form Branches 1 through 4. Because this multifurcation plays such a central role in this system, there is great interest in estimating its timing and geographic origins [46,55].
Ancient Y. pestis genomes now represent a substantive portion of the known genetic diversity yet cannot be easily classified via direct metabolic testing. An alternative strategy has been employed that incorporates contextual evidence such as the sampling age, historical time period, and potential pandemic associations. In ancient DNA studies, the genetic diversity of Y. pestis is commonly divided into four time periods: the Late Neolithic Bronze Age [44], the First Pandemic [21], the Second Pandemic [19], and the Third Pandemic [20] (Figure 3.1 B).
The key strengths of the time period nomenclature are two-fold. First, it provides a foundation for interdisciplinary discourse, in which the genetic diversity can be contextualized and explained using relevant historical records. Second, this system effectively categorizes the historical outbreaks of plague recorded in Europe. This can be seen in Figure 3.1, where the Bronze Age strains (0.PRE) in Europe are replaced by those of the First Pandemic (0.ANT4), which in turn are replaced by strains of the Second Pandemic (1.PRE). However, this “strength” comes at a cost, as this system is far less effective in describing plague populations outside of Europe and incurs two significant risks.
The first risk is artificially grouping unrelated populations. Contemporaneous strains can have distinct evolutionary histories [56] even when originating from the same plague foci. The Pestoides (0.PE) and Medievalis (2.MED) biovars are informative examples, as these populations have co-existed in the Caucasus mountains since at least the 20th century (Figure 3.1 C). The second risk is artificially separating related populations. The Second and Third Pandemics were previously seen as mutually exclusive events dated to the 14th to 18th century, and the late 19th to mid-20th century respectively [57]. Recent historical scholarship has contested this claim and demonstrated that these dating constraints are a product of a Eurocentric view of plague [41]. The Second Pandemic is now recognized to have extended into at least the 19th century [4,58] and the Third Pandemic is hypothesized to have began as early as the 18th century [42,59]. Phylogenetic analysis reveals genetic continuity between these two events, as the Third Pandemic (1.ORI) is a direct descendant of the Second Pandemic (1.PRE) [60]. What remains unknown is the extent of temporal overlap, and as such, it is unclear how to distinguish these pandemics using genetic evidence.
A final limitation is that several populations are curiously excluded from the pandemic nomenclature altogether. For example, Branch 2 populations emerged at the same time as, but separate from, the Second Pandemic and have been associated with high mortality epidemics [61]. In particular, the Medievalis population (2.MED) has dispersed throughout Asia (Figure 3.1) with the fastest spread velocity of any Y. pestis lineage [42]. Despite its epidemiological significance, Branch 2 populations across Asia continue to be overlooked in the pandemic taxonomy of Y. pestis. As ancient DNA sampling strategies expand in geographic scope, and as more non-European historical sources are brought to bear, it will be important to consider how best to refashion the historical period nomenclature to encompass this diversity.
There exists no current classification system which comprehensively represents the global population structure of Y. pestis. Instead, integrative approaches have been previously used in large comparative studies of Y. pestis [20,62]. We therefore take the intersection of the three taxonomic systems discussed previously and describe 12 populations for further statistical analysis (Figure 3.1, Table S1, Table ??). In the following sections, we highlight the novel insight and issues that arise when this population structure is explicitly incorporated into molecular clock models and phylogeographic reconstructions.
The extent of rate variation present in our updated genomic data set is notably larger than those depicted in previous studies [19,63] . A root-to-tip regression on sampling age reproduces the finding that substitution rates in Y. pestis are poorly represented by a simple linear model or “strict clock” (Figure 3.2 A). We found a very low coefficient of determination (R2=0.09) that indicates a large degree of unaccounted variation. This finding suggests that evolutionary change in Y. pestis may be more appropriately estimated using a “relaxed clock”, where rate variation is explicitly modeled. To test this hypothesis, we performed a Bayesian Evaluation of Temporal Signal (BETS) [64]. In brief, this method tested four model configurations including: (1) a strict clock, (2) a relaxed clock, (3) the true sampling ages, and (4) no sampling ages. Configurations with no sampling ages explicitly test for the presence of temporal signal. A comparison of the model likelihoods, or Bayes factors, was then used to assess the degree of temporal signal.
BETS was inconclusive when attempting to fit a single clock to the updated global diversity of Y. pestis. The Markov chain Monte Carlo (MCMC) inference exhibited poor sampling of parameter space (effective sample size, ESS < 200) across all model configurations, even when we reduced sources of variation by removing tip date uncertainty, fixing the tree topology, and removing up to 70% of the genomes. The poor performance of a single clock model is consistent with several other studies, in which low ESS values were observed [44] and divergence dates could not be estimated [21]. A single clock model is not a viable approach for Y. pestis, as there is excessive rate variation across the global phylogeny, which likely explains node-dating disparities between previous studies [20,44,62].
In contrast to the single clock approach, we observed substantial improvements when each population was assessed independently. All model parameters in our Bayesian analysis demonstrated MCMC convergence with ESS values well above 200 and we detected temporal signal in 9 out of 12 Y. pestis populations (Table S2). Several of these appeared more clock-like than others, which was observed through the root-to-tip regression and Bayesian rate estimation. For example, we found rate variation to be low in the Bronze Age (R2=0.92), moderate in the Second Pandemic (R2=0.76) and high in Medievalis (R2>=0.02). Our results indicate that population specific models are a more effective approach for estimating substitution rates across the global phylogeny.
To demonstrate the application of our molecular clock method and the interpretive consequences, we explored three outcomes as case studies. First, as a control, we examined Y. pestis populations that had (i) no temporal signal. These “negative” cases inform us about the minimum sampling time, or phylodynamic threshold, required to obtain robust temporal estimates in Y. pestis. Second, we examined populations with (ii) irreproducible estimates between studies, such as the time to most recent common ancestor (tMRCA). We discuss how sampling bias drives this outcome, and how it can be identified and corrected with ancient DNA calibrations where available. Finally, we identify the populations with the most (iii) informative rates and dates. We discuss how these molecular dates change our understanding of pandemic “origins” and complement recent historical scholarship.
We found several Y. pestis populations with no detectable temporal signal that include the Intermedium (1.IN) and Antiqua biovars (2.ANT, 3.ANT). Despite being sampled over a period as long as 84 years (2.ANT), these populations have not accumulated sufficient evolutionary change to yield informative divergence dates. This limited diversity is identifiable in the maximum likelihood phylogeny as populations with the highest density of nodes sitting close to their roots (Figure S6). Out of caution, we also consider the rates and dates associated with the Antiqua population 4.ANT to be non-informative, as it has a similar node distribution, with a smaller number of samples (N=12) collected over an even shorter time frame (38 years).
Our results show that for robust temporal estimates to be obtained, Y. pestis must be sampled over multiple decades at minimum. This time frame is largely consistent with the finding that Y. pestis has one of the slowest substitution rates observed among bacterial pathogens [22]. Here we found that all populations had a median rate of less than 1 substitution per year (Figure 3.2 C, Table S4), with the lowest rate in Antiqua (0.ANT) at 1 substitution every 14.1 years (1.7 x 10-8 subs/site/year) and the highest rate in Pestoides (0.PE) at 1 substitution every 1.1 years (2.1 x 10-7 subs/site/year). In application, this means that Y. pestis lineages often cannot be differentiated until at least several decades have passed, a concept referred to in the literature as the phylodynamic threshold [65].
The phylodynamic threshold has been rigorously explored in other pathogens, such as SARS-CoV-2 [66], but not explicitly in Y. pestis. The challenges in reconstructing intra-epidemic plague diversity have been noted previously. For example, several isolates from the Second Pandemic dated to the medieval Black Death (1346-1353) are indistinguishable clones [60], extinguishing any hope of reconstructing its spread from genetic evidence alone. Our median rate estimation for the Second Pandemic (1.PRE) of 1 substitution every 9.5 years (2.5 x 10-8 subs/site/year) is congruent with this finding. The clonal nature of the Black Death is not an exceptional event, but rather the norm based on the sampling time frame. Our results highlight a significance limitation and cautionary note for plague research, as genetic evidence alone is not suitable for reconstructing the timing of short-term, episodic epidemics.
We observed two populations with detectable temporal signal associated with substantial node-dating conflicts: the Pestoides (0.PE) and Antiqua (0.ANT) biovars: both of which are paraphyletic. Conflicts were identified by comparing their estimated time to the most recent common ancestor (tMRCA) to that of their descendant populations. For example, the First Pandemic (0.ANT4) is a descendant clade of the larger Antiqua (0.ANT) population based on the maximum likelihood phylogeny (Figure 3.3). We would expect the tMRCA of the ancestral 0.ANT to pre-date the First Pandemic, for which ancient DNA calibrations are available. However, the tMRCA of 0.ANT is far too young (95% HPD: 1357 - 1797 CE), and incorrectly post-dates the tMRCA of 0.ANT4 (95% HPD: 39 - 234 CE) by more than a millennium. This outcome is somewhat paradoxical, as these populations have robust temporal signal and yet a critical examination of their divergence dates reveals they are unreliable.
This conflicting pattern has been previously described and attributed to sampling bias [67,68], specifically, insufficient sampling of basal branches and the presence of extensive rate variation. The two affected populations, Antiqua (0.ANT) and Pestoides (0.PE), have a low density of nodes at their roots in the maximum likelihood phylogeny (Figure S6). This pattern is also observed in another Antiqua population (1.ANT) which has a small sample size (N=4) and has previously been linked to rate acceleration events [20]. The dates associated with these three populations (0.PE, 0.ANT, 1.ANT) should be considered non-informative.
These node-dating issues reveal a clear limitation in our approach of estimating divergence times from population-specific models. Defining Y. pestis population by time periods has adverse effects, as ancient plague genomes can serve as crucial calibration points for rate changes that are otherwise unsampled in extant populations. In populations with poorly sampled basal branches (0.PE, 0.ANT, 1.ANT) we expect an optimization approach to be more ideal, in which a few closely related populations are merged or select ancient DNA calibrations are introduced [56]. Otherwise, divergence dates in these populations tend to be overly young, sometimes by more than a 1000 years [63], and are difficult to replicate between studies (Table ??).
The inability to infer divergence dates due to sampling bias also has several historical implications. Perhaps the most significant concerns the emergence of plague in Africa which makes up 90% of all modern plague cases [69], yet for which there remains not a single ancient sequence. Little progress has been made in sampling extant African plague diversity, with this region represented by only 1.5% (9/601) of all genomes. Furthermore, the oldest genetic evidence of African plague comes from the 1.ANT population, which has only four representative strains. Despite this sparse sampling, researchers have repeatedly attempted to use genomic evidence to date the first appearance of Y. pestis in Africa [20,62,63]. The result is a complete lack of congruent dates for this event, as the majority of tMRCA estimates for 1.ANT do not overlap (Table ??). These divergence dates are of limited value for historical interpretation [63,70,71] and should be treated with great skepticism.
Category | Population | Morelli et al. 2010 | Cui et al. 2013 | Pisarenko et al. 2021 | This Study |
---|---|---|---|---|---|
Informative Dates | 1.ORI | -326, 1793 | 1735, 1863 | 1744, 1842 | 1806, 1901 |
No Temporal Signal | 1.IN | -2388, 1606 | 1500, 1750* | 1791, 1897 | 1651, 1913 |
Sampling Bias | 1.ANT | -4909, 1322 | 1377, 1650 | 1483, 1704 | 1655, 1835 |
Informative Dates | 1.PRE | – | 1312, 1353 | – | 1214, 1315 |
Informative Dates | 2.MED | -583, 1770 | 1550, 1800* | 1413, 1653 | 1560, 1845 |
No Temporal Signal | 2.ANT | -3994, 1460 | 1550, 1800* | 1373, 1628 | 1509, 1852 |
No Temporal Signal | 4.ANT | – | 1200, 1700* | 1611, 1816 | 1848, 1968 |
No Temporal Signal | 3.ANT | – | 1450, 1850* | 1531, 1742 | 1769, 1947 |
Sampling Bias | 0.ANT | -6857, 1199 | 100, 1100* | 1033, 1435 | 1357, 1797 |
Informative Dates | 0.ANT4 | – | – | – | 39, 234 |
Sampling Bias | 0.PE | -26641, -598 | -4394, 510 | -377, 499 | 1573, 1876 |
Informative Dates | 0.PRE | – | – | – | -3098, -2786 |
* Visually estimated from the published time-scaled phylogeny.
Excluding populations with no detectable signal, we identified five populations with potentially informative rates and dates. These include the Bronze Age (0.PRE), Medievalis (2.MED), the First Pandemic (0.ANT4), the Second Pandemic (1.PRE), and the Third Pandemic (1.ORI). The Bronze Age marks the first identified appearance of Y. pestis in humans, and the three pandemics, along with Medievalis, are historically associated with high mortality and rapid spread [42]. Due to this epidemiological significance, these five populations have been sampled over the longest time frames, ranging from 92 years for the Third Pandemic (1.ORI) to 1250 years for the Bronze Age (1.PRE). This affirms the importance of long-term heterochronous sampling for Y. pestis, made possible through the retrieval of ancient DNA [1] and recent sequencing of early 20th century culture collections [61]. By curating and contextualizing this new heterochronous data, we were able to detect temporal signal in extant Y. pestis populations without the use of ancient DNA calibrations for the first time.
Our estimates of the tMRCA for the First and Second Pandemics share a common theme, in that the genetic origins potentially pre-date the appearance of plague in traditional (ie. European) historical narratives. For example, the earliest textual evidence of the Second Pandemic (1.PRE) in Europe comes from the Black Death (1346) [43]. However, we estimate the mean tMRCA of this population to be earlier, between 1214 and 1315 CE. Similarly, the first recorded outbreaks of plague during the First Pandemic (0.ANT4) come from the Plague of Justinian (541 CE) [72]. Instead, we estimate that the strains of Y. pestis associated with this pandemic shared a common ancestor between 272 and 465 CE.
One explanation for these disparate timelines is sampling bias, as western European sources dominate both the genetic and historical record. Recent historical scholarship has contested Eurocentric timelines [4,73] by demonstrating the presence of plague in western Asia far earlier than previously thought. Arabic historical chronicles suggest that the Second Pandemic may have begun as early as the 13th century [74]. Genetic dating appears to support these historical critiques, by expanding the timelines of past pandemics to make space for more diverse historical narratives. An alternative explanation for our earlier dates is tip date uncertainty. The radiocarbon estimates for the majority of ancient Y. pestis samples have confidence intervals of ±50 years or more. As we only used the mean sampling age for molecular clock models, it’s possible that the true tMRCA intervals are larger and do overlap with historical estimates. How much uncertainty can be included in molecular clock models for Y. pestis, while still achieving convergence of parameter estimates, remains to be tested.
In contrast to the ancient pandemics, our temporal estimates of the Third Pandemic were more closely aligned to the historical evidence. We estimated that isolates from the Third Pandemic (1.ORI) shared a common ancestor between 1806 and 1901 CE, which aligns well with the timeline as reconstructed from epidemiological reports. Highly localized plague cases began appearing in southern China and (1772-1880) and later spread globally out of Hong Kong (1894-1901) [42,75,76]. Our estimate also overlaps with the majority of previous studies, although it is the youngest tMRCA to date (Table ??). This comparison demonstrates the reproducibility of our estimate, but also reveals how the “origin” story of the Third Pandemic continues to change. The phylogenetic root once estimated to be as old as 326 BCE [62] is now resolved to be much younger (19th century CE). This younger date is particularly intriguing, as a major epidemiological transition occurred in the 19th century with the reemergence of several other notable pathogens [78]. Reconstructing the evolutionary history of the Third Plague Pandemic may not only inform us about the epidemiology of plague, but contribute to a broader understanding of the factors that led to reemerging diseases in the modern era [79].
Even less is known about the Medievalis population whose strains were hypothesized to be responsible for plague outbreaks in the Caspian Sea region which reoccurred throughout the 19th and 20th centuries [61]. We estimated the tMRCA of Medievalis (2.MED) to be between 1560 and 1845 CE, which overlaps with all previously published estimates (Table ??). While this population was once thought to have emerged as early as 583 BCE, there is now growing consensus that the earliest possible emergence was in the 16th century CE. Interestingly, the Caspian Sea region appears to be a nexus of plague as the only known area where the distributions of both European and Asian Y. pestis strains overlap (Figure 3.4). This raises the interesting possibility that distinct populations of Y. pestis were co-circulating during the Second Pandemic, a hypothesis that ancient DNA from Medievalis could help elucidate. In the absence of direct genetic sampling, an alternative approach is to infer the ancestral locations and spread of plague using phylogeographic analysis.
Phylogeographic inference relies heavily on the degree of geographic signal in the data. To assess this in Y. pestis, we tested whether phylogenetic relationships correlate with sampling locations. We identified the closest genetic relative of every genome in our data set, using the maximum likelihood phylogeny. We then recorded whether these genomes were collected from the same location at three levels of resolution: (1) continent, (2) country, and (3) province. As a statistical measure of geographic structure, we reported the percent of genomes that had a closest relative sampled from the same location.
The majority of Y. pestis populations (6/12) were localized to a single continent (Figure S2). Of those distributed across multiple continents, geographic structure ranged from 76% to 99%. At the country level, the degree of geographic structure dropped drastically in some populations (Bronze Age 0.PRE: 38%) while remaining stable in others (3.ANT: 100%). The inverse of this pattern appeared at the province level, where Antiqua (3.ANT) dropped to 45% while the Bronze Age (0.PRE) was unchanged. As expected, geographic structure decreases at finer resolutions but the extent to which varies by population.
The factors which appeared to govern these patterns are wide-ranging, but primarily concern mobility. One striking aspect is the difference in host composition between populations driving this signal. We observed a correlation (R2=0.43) between the degree of geographic structure and the percentage of samples collected from a non-human host (Figure 3.5). Populations primarily sampled from rodents and arthropod vectors had far more geographic structure than those found exclusively in humans. This is epidemiologically consistent with the greater mobility of human populations, which disrupt geographic clustering via long-distance spread.
Another factor is the difference in substitution rates relative to migration rates. In populations that are spread faster (locations/year) than they evolve (substitutions/year), geographic structure decreases as identical isolates (clones) are found in different locations. Lineages of the Second Pandemic (1.PRE) exemplify this, as clonal isolates have been sampled across multiple countries [19], leading to uncertainty regarding the routes of spread. Along similar lines is the disparity in the sizes of locations across the globe used in the analyses. China, for example, is approximately the same size (0.94) as the European continent. Thus, at the country level, plague populations in Asia are sampled over fewer locations and have stronger geographic structure. An informative comparison is Antiqua (3.ANT) with 100% structure across China and Mongolia, and the Second Pandemic (1.PRE) with 55% structure across 11 European countries. Plague populations that are distributed over multiple, smaller locations have less geographic structure, leading to greater uncertainty when inferring past migrations.
These observations suggests that phylogeographic inference is best suited to populations that are slow-spreading and/or rapidly-evolving, a significant problem for Y. pestis, as it is both a rapidly-spreading and slow-evolving pathogen [22]. To explore how this feature impacts our ability to infer ancestral locations for Y. pestis in the past, we independently fit three discrete migration models [80] to the maximum likelihood phylogeny using the sampling locations by: (1) continent, (2) country, and (3) province. For each internal node, we extracted the ancestral location with the highest likelihood given the data. To explore whether genomic evidence can provide meaningful geographic estimates, we compared two case studies: the Third Pandemic, which serves as a “control” for our phylogeographic analysis, and the Second Pandemic, where the origins and spread remain contentious due to limited non-European historical evidence.
The re-emergence of plague during the Third Pandemic was closely monitored by contemporary researchers [81]. As a result, the geographic origins are well-documented and firmly established. Highly localized plague cases first appeared in Yunnan, China (1772-1800), later spreading throughout the province (1800-1880) [42,76]. Plague then traveled eastwards to the coast (1880-1900), where it dispersed globally out of Hong Kong (1894-1901) [82].
We estimated that the Third Pandemic (1.ORI) diverged from an ancestral Intermedium (1.IN) population located in Yunnan, China (probability: 1.00) (Figure 3.6, Figure S3). Plague then rapidly diversified (reflected by a polytomy), after which new lineages appeared in North America (probability: 0.99), South America (probability: 1.00), and Africa (probability 1.00). Due to the unresolved branch structure, we could not confidently estimate the routes of this dispersal. The migrations that could be reconstructed all occurred post this radiation, and included endemic cycling in Southeast Asia (China, Myanmar) as well as North America (USA), which led to a re-introduction of plague into South America (Peru).
The strength and specificity of our estimated origin is striking, given that we could not confidently locate the ancestral divergence for any other population (Figure S3). This may be because the Third Pandemic (1.ORI) is a direct descendant of the Intermedium (1.IN) population, which has strong geographic structure at the province level (87%). In addition, isolates from Yunnan fall both basal to, and within, the known diversity of the Third Pandemic (1.ORI). This combination provides strong evidence of the geographic origin, which is congruent with the historical narrative. This level of precision was only possible due to the extensive sampling of non-human hosts. Yunnan is solely represented by rodent and arthropod samples (N=18) and therefore this reservoir would be entirely invisible if only human isolates were used. Like others [50], we caution that the presence and location of rodent reservoirs should not be inferred from phylogenetic evidence alone. Instead, new modeling approaches have been developed [83] that could leverage multi-disciplinary sources [76] to correct for sampling biases in the genomic data.
In comparison to the Third Pandemic, there is far less surviving historical evidence from the Second Pandemic. Historians have identified early accounts of plague appearing in 1346 in the Golden Horde, which encompass Central Asia and Eastern Europe [2]. The disease then appears to have spread southward through the Caucasus to reach Western Asia, and westward to the Crimea, from which it dispersed throughout Europe, the Middle East, and North Africa. Plague reoccurred for several centuries in these territories, with successive waves varying in scale from localized epidemics to continent-wide outbreaks [43]. In Western Europe, plague receded after 1720 and would not re-emerge again until 1899 [84], while in Western Asia the disease never disappeared [4].
We estimated that the Second Pandemic (1.PRE) diverged from an ancestral population located in China (probability: 0.93) as part of the “Big Bang” polytomy. The ancestral province was poorly resolved, with the most likely location being near Xinjiang (probability): 0.64) which includes the Tien Shan mountains. The location of the Second Pandemic MRCA was also uncertain and estimated to be in Russia (probability: 0.63), specifically in Tatarstan (probability: 0.37) which was part of the Golden Horde. However, these low likelihoods indicate that our estimated origins are poorly supported by the data. With regards to spread, only four migrations could be confidently inferred (likelihood > 0.95) across the full sampling time frame (500 years). The available genetic data therefore provides little definitive evidence as to the spread of plague during the Second Pandemic.
This then begs the question of whether more ancient DNA samples will improve these geographic estimates? As it currently stands, the relationships between all countries could not be resolved during the 14th century, nor among the Baltic states sampled in the 15th century, or between England and Russia in the 17th century. Furthermore, the historical evidence indicates that plague often spread to multiple countries, if not continents, in the span of a decade [85]. This migration rate is far higher than the substitution rate of the Second Pandemic (1.PRE), which accumulates 1 mutation every 9.5 years. The genomic data alone does not have sufficient resolving power to reconstruct the spread of short-term, episodic waves of plague. Instead, this evidence is best used in conjunction with higher-resolution evidence, such as historical case records [67,86].
We sought to contribute to five lines of debate concerning the evolutionary history of Yersinia pestis. The first, was whether Y. pestis has sufficient temporal signal (ie. molecular clock) to accurately estimate rates and dates. Accordingly, we found that a species-wide clock model was methodologically unstable and did not lead to reproducible estimates. However, we observed significant improvements when each Y. pestis population was assessed independently. We therefore recommend this approach for future studies, as the full global diversity of Y. pestis can be utilized without down-sampling.
Second, we explored the minimum sampling time frame for Y. pestis that yields informative rates and dates. The lowest substitution rate was observed in Antiqua (0.ANT) with a median rate of 1 substitution every 14.1 years, meaning that some Y. pestis lineages cannot be differentiated until several decades have passed. In addition, we found no temporal signal in several populations (1.IN, 2.ANT, 3.ANT) which have been sampled over a period as long as 84 years. Genetic evidence alone may not be suitable for reconstructing the timing of short-term, epidemic events of plague.
In the third instance we tackled node dating disparities between studies. We explored how phylogenetic sampling bias drives this outcome, and how it can be detected and remedied with ancient DNA calibrations. In particular, we focused on the non-overlapping tMRCA estimates of the first appearance of Y. pestis in Africa (1.ANT). Until sampling strategies diversify, we caution that the published divergence dates for this population, and several others (Antiqua 0.ANT, Pestoides 0.PE), are of limited value for historical interpretation.
The fourth point revolved around the timing of past pandemics. We observed a common theme in which the genetic dates (tMRCAs) of pandemic Y. pestis potentially pre-date the historical dates by several decades or centuries. For example, we estimated the tMRCA of the Second Pandemic to be between 1214 and 1315 CE which pre-dates the Black Death (1346 - 1353 CE). Similarly, we estimated the tMRCA of the First Pandemic to be between 272 and 465 CE, also pre-dating the Plague of Justinian (531 CE). We discussed this disparity in light of methodological concerns such as radiocarbon dating uncertainty and geographic sampling biases that have historically favored European sources. We anticipate that additional samples from non-European locations will add greater clarity to how the timelines of past pandemics can be expanded to include more diverse historical narratives.
Finally, we asked whether Y. pestis has sufficient geographic signal to accurately infer ancestral locations and spread. As expected, geographic structure diminished at finer resolutions but also varied by population. We found that the geographic origins of the Third Pandemic (1.ORI) were unambiguously inferred to be in Yunnan province, China (likelihood=1.00) and attributed this to comprehensive sampling of rodent reservoirs. In contrast, we demonstrated how the origins and spread of the Second Pandemic (1.PRE) cannot be resolved from genetic evidence alone, as this population is exclusively sampled from human remains which have high mobility. In isolation, Y. pestis genomic evidence may be unsuitable for inferring point migrations and the directionality of spread. Alternatively, new methods which incorporate non-genetic evidence, such as outbreak case-occurrence records [67], into phylogeographic analysis presents an exciting avenue for interdisciplinary collaboration, as explicitly integrative models will complement the strengths of genetic and historical evidence, while mitigating their respective weaknesses.
A visual overview of the computational methods is provided in Figure S7 and is publicly available as a snakemake pipeline (https://github.com/ktmeaton/plague-phylogeography/).
Y. pestis genome sequencing projects were retrieved from the National Centre for Biotechnology Information (NCBI) databases using NCBImeta v0.7.0 [87]. 1657 projects were identified and comprised three genomic types. 1473 projects came from isolates sampled during the 20th and 21st centuries, which we label as “modern”. Of these, (i) 586 projects were available as assembled genomic contigs (FASTA), and (ii) 887 were only available as unassembled sequences (FASTQ). An additional (iii) 184 projects came from skeletal remains with sampling ages older than the 19th century, which we label as “ancient”. The 887 modern unassembled genomes were excluded from this project, as the wide variety of laboratory methods and sequencing strategies precluded a standardized workflow. In contrast, the 184 ancient unassembled genomes were retained given the relatively standardized, albeit specialized, laboratory procedures required to process ancient tissues.
Collection location, date, and host metadata were curated by cross-referencing the original publications. Locations were transformed to latitude and longitude coordinates using GeoPy v2.0.0 and the Nominatim API (https://github.com/osm-search/Nominatim) for OpenStreetMap. Coordinates were standardized at the level of country and province/state, using the centroid of each. Collection dates were standardized according to their year and recording uncertainty arising from missing data and radiocarbon estimates. Genomes were removed if no associated date or location information could be identified in the literature, or if there was documented evidence of laboratory manipulation.
Two additional data sets were required for downstream analyses. First, Y. pestis strain CO92 (GCA_000009065.1) was used as the reference genome for sequence alignment and annotation. Second, Yersinia pseudotuberculosis strains NCTC10275 (GCA_900637475.1) and IP32953 (GCA_000834295.1) served as an outgroup to root the maximum likelihood phylogeny.
Modern assembled genomes were aligned to the reference genome using snippy v4.6.0 (https://github.com/tseemann/snippy), a pipeline for core genome alignments. Default parameters were used, along with the following minimum thresholds: depth of 10X, base quality of 20, mapping quality of 30, major allele frequency of 0.9. Modern genomes were excluded if the number of sites covered at a minimum depth of 10X was less than 70% of the reference genome.
Ancient unassembled genomes were downloaded from the SRA database in FASTQ format using the SRA Toolkit. Pre-processing and alignment to the reference genome was performed using the nf-core/eager pipeline v2.2.1, a reproducible workflow for ancient genome reconstruction [88]. Default parameters were used, along with the following minimum filters: read length of 35 bp, an edit distance of 0.01, and a 16 bp seed length. Only merged reads were retained from paired end-sequencing projects. Ancient genomes were removed if the number of sites covered at a minimum depth of 3X was less than 70% of the reference genome.
A multiple sequence alignment was constructed using the snippy core module of the snippy v4.6.0 pipeline. The output alignment was filtered to only include chromosomal sites that were present in at least 95% of samples (ie. a missing data threshold of 5%). The filtered alignment included 10,249 variant positions exclusive to Y. pestis, with 3,844 sites shared by at least two strains
Model selection was performed on the full data set (N=601) using Modelfinder [89] which identified the K3Pu+F+I model as the optimal choice based on the Bayesian Information Criterion (BIC). The K3P model, also known as K81, estimates substitution rates using three categories, in this case: (1) A<->C equals G<->T, (2) A<->G equals C <->T, and (3) A<->T equals C<->G). The “u+F”suffix indicates that base frequencies will be empirically evaluated and thus are not assumed to be equal. The “+I” suffix indicates that a proportion of the alignment includes invariable sites (ie. non-SNPS),
A maximum likelihood phylogeny was estimated for this data across 10 independent runs of IQTREE2 [90]. Branch support was evaluated using 1000 iterations of the ultrafast bootstrap approximation [91], with a threshold of 95% required for strong support.
The full multiple sequence alignment was alternatively split into 12 populations, referred to as the population data sets. These populations were defined by the intersection of the following nomenclature systems: the major branches, metabolic biovars, and historical time period (Table S1). One sample was excluded, a Pestoides isolate from the Bronze Age (Strain RT5, BioSample Accession SAMEA104488961), as this population would be of size N=1.
In an attempt to improve the performance and convergence of molecular clock analyses, a subsampled data set was also constructed. Populations that contained multiple samples drawn from the same geographic location and the same time period were reduced to one representative sample. The sample with the shortest terminal branch length was prioritized, to diminish the influence of uniquely derived mutations on the estimated substitution rate. An interval of 25 years was identified as striking an optimal balance, resulting in 191 samples, which is a 68% reduction from the original data set.
To explore the degree of temporal signal present in the data, two categories of tests were performed. The first was a root-to-tip (RTT) regression on the mean sampling age using the statsmodels python package. Given the relative simplicity of a regression model, the full data set of 601 genomes was used.
For the second test of temporal signal, a Bayesian Evaluation of Temporal Signal (BETS) was conducted. This consisted of running four model configurations: either with or without sampling dates, and under a strict or uncorrelated lognormal relaxed clock models (strict and UCLN, respectively). We calculated the log marginal likelihood under each model configuration using stepping-stone sampling as implemented in BEAST v1.10 [92]. To this end, we ran 200 path steps, each with a Markov chain Monte Carlo (MCMC) of length 106 steps. In addition to the clock model we used a constant-size coalescent tree prior, a GTR+gamma nucleotide substitution model.
Importantly, the models involved priors that were proper for all parameters, which is essential for marginal likelihood calculations [93]. In particular, the molecular clock rate (ie. the mean of the UCLN clock model or the global rate of the strict clock) had a continuous time Markov chain reference prior [94], the population size of the constant-size coalescent an exponential prior distribution with mean 10, and the standard deviation of the UCLN had an exponential prior with mean 0.33. Marginal likelihood estimation with stepping- stone sampling does not require from the posterior distribution. To obtain the posterior distribution we used an MCMC of 109 steps, sampling every 103 steps. For situations where the effective sample size (ESS) of any parameters was below 200 we increased the chain length by 50% and reduced sampling frequency accordingly.
To explore underlying phylogeography, we performed ancestral state reconstruction using the maximum likelihood method implemented in TreeTime [80]. We independently fit three discrete mugration models to the maximum likelihood phylogeny using the sampling locations by: (1) continent, (2) country, and (3) province. The mapping of countries to continents was defined according to the open-source resource GeoJSONRegions (https://geojson-maps.ash.ms/). For each internal node, we extracted the ancestral location with the highest likelihood given the data.
We also conducted a discrete trait analysis in BEAST [92,95]. Country of sample origin was chosen as the discrete trait of interest. A coalescent constant population size tree prior was chosen with an exponential prior placed on the effective population size with mean 100000. We modeled evolutionary rate with an uncorrelated relaxed lognormal clock, with a CTMC scale prior on the mean and exponential prior with mean 1/3 on the standard deviation of the underlying lognormal distribution [96]. A GTR+gamma nucleotide substitution model with estimated base frequencies for 1.ORI, 1.PRE, 0.ANT4, and 0.PRE. The same settings were used for 2.MED with the exception of swapping the GTR+gamma model to an HKY+gamma model. MCMC chains were run for 107 steps with sampling every 103 steps. We used logCombiner to combine between 3-5 replicate runs, with 10% burnin, for each clade to achieve ESS above 200 for each parameter and Maximum Clade Credibility (MCC) trees [97].
Data visualization was performed using the python package seaborn [98] and Auspice [99], a component of the Nextstrain visualization suite.
This work was supported by the Social Sciences and Humanities Research Council of Canada (#20008499), CIFAR humans and the microbiome program (HP), and the MacDATA Institute (McMaster University, Canada). This research was enabled in part by support provided by Compute Ontario (https://www.computeontario.ca/) and Compute Canada (https://www.computecanada.ca). We would like to thank Madeline Tapson, Dr. Dan Salkeld, and Dr. Jennifer Klunk for their expertise in contextualizing the ecology and evolutionary history of plague. We also thank Jessica Hider and Marie-Hélène B.-Hardy for discussions on the interpretation of genomic data from zoonotic pathogens. We are indebted to Dr. Ana Duggan and Dr. Emil Karpinski for their insight regarding Bayesian methods for phylogenetic analysis. We thank members of the Sherman Centre for Digital Scholarship, including Dr. Andrea Zeffiro, Dr. John Fink, Dr. Matthew Davis, and Dr. Amanda Montague, for their assistance in developing the genomic database. Finally, we would like to thank all past and present members of the McMaster Ancient DNA Centre and the Golding Lab at McMaster University.
K.E, G.B.G, and H.N.P designed the study. K.E, L.F., and S.D performed computational analysis. A.G.C and N.V. provided historical sources and interpretation. E.C.H critiqued and revised the computational methods and discussion. G.B.G provided access to computational resources and data storage. H.N.P and G.B.G supervised the study. K.E wrote the manuscript with contributions from all co-authors.
The authors declare no competing interests.
Prepared 08 December 2021 for submission to
The Proceedings of the National Academy of Sciences
Katherine Eaton*1,2, Ravneet Sidhu*1,3, Jennifer Klunk1,4, Julia Gamble5, Jesper Boldsen6, Ann G. Carmichael7, Nükhet Varlık8, Sebastian Duchene9, Leo Featherstone9, Vaughan Grimes10, G. Brian Golding3, Sharon DeWitte11, Hendrik N. Poinar1,2,12,13,14
*Contributed equally.
1McMaster Ancient DNA Centre, McMaster University, Hamilton, Canada.
2Department of Anthropology, McMaster University, Hamilton, Canada.
3Department of Biology, McMaster University, Hamilton, Canada.
4Daicel Arbor Biosciences, Ann Arbor, USA.
5Department of Anthropology, University of Manitoba, Winnipeg, Canada.
6Department of Forensic Medicine, Unit of Anthropology (ADBOU), University of Southern Denmark, Odense, Denmark.
7Department of History, Indiana University Bloomington, Bloomington, USA.
8Department of History, Rutgers University-Newark, Newark, USA.
9The Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Australia.
10Department of Archaeology, Memorial University of Newfoundland, St. Johns, Canada.
11Department of Anthropology, University of South Carolina, Columbia, USA.
12Department of Biochemistry, McMaster University, Hamilton, Canada.
13Michael G. DeGroote Institute of Infectious Disease Research, McMaster University, Hamilton, Canada.
14Canadian Institute for Advanced Research, Toronto, Canada.
The epidemiology of plague in the past is highly controversial, owing to the scarcity and ambiguity of historical evidence. A frequent source of debate is the re-emergence and continuity of plague in Europe during the 14th to 18th centuries CE. Scandinavia is particularly underrepresented in the historical archives, despite having a uniquely long history of plague (5000 years) as revealed through ancient DNA analysis. To better understand the historical epidemiology of plague in this region, we performed in-depth (N=298), longitudinal screening (800 years) for the plague bacterium, Yersinia pestis, across 13 archaeological sites in Denmark. We captured the emergence and continuity of Y. pestis in this region over a period of 400 years (14th - 17th century CE), for which the plague-positivity rate was 8.3% (3.3% - 14.3% by site). These results deepen the epidemiological link between the plague bacterium, Y. pestis, and the unknown pestilence that afflicted medieval and early modern Europe. Furthermore, this study paves the way for the next generation of historical disease research, in which hypotheses concerning mortality can be tested using population-scale, genomic evidence from ancient pathogens.
Europe endured a 500-year long pandemic from the 14th to 18th centuries CE [19]. During this period, mysterious outbreaks reoccurred every 10 years with mortality estimates as high as 50% during the infamous Black Death (1346-1353) [3]. Paleogeneticists have increasingly identified the plague bacterium Yersinia pestis as the most likely agent, although the epidemiology of this pandemic remains controversial [100]. The major source of debate concerns two aspects: mortality and spread. The ecology of Y. pestis is highly complex, and involves both zoonotic “spillover” from rodent populations as well as inter-human transmission [38]. As a result, both disease exposure and spread are known to vary between regions and over time [100]. These differences are challenging to reconcile, and have led to significant controversy concerning the location of plague reservoirs in the past [50].
Recent studies have explored this question by synthesizing genetic evidence [50] and historical records [39,47] across Europe. These sources have significant geographic gaps, such as the complete lack of evidence from Scandinavia in digitized databases [86]. This gap has been attributed to the sparseness of historical sources and ambiguity with regards to disease terminology during the medieval period [3]. However, recent ancient DNA research [101] has revealed that the history of plague in Scandinavia is among the oldest in the world, and established the presence of Y. pestis in Sweden 5000 years ago. This raises the possibility of long-term persistence of plague in Scandinavia, with Y. pestis re-emerging as a local, endemic disease.
To evaluate the possibility of undocumented plague persistence, we screened for the presence of Y. pestis in the Anthropological DataBase Odense University (ADBOU) collection. This extraordinary collection includes preserved and curated skeletal remains from over 16,000 Danish individuals, dated from the Viking Age to the Early Modern period. To ensure a wide variety of locations were represented, we sampled 298 individuals across 13 archaeological sites from the mainland (Jutland), as well as two islands (Funen and Lolland). Based on the skeletal dates, these individuals represent 800 years of population history (1000-1800 CE) which includes both the known pandemic period in Denmark (1350-1657) and the quiescent periods (1000-1350 CE, 1658-1800) for which no outbreaks of plague are historically documented [100].
We detected Y. pestis in 7 archaeological sites using PCR assays and targeted sequencing (Figure 4.1 A). Across the 7 sites, 8.3% of individuals (13/157) tested positive for Y. pestis, ranging from 3.3% at Ribe Lindegärden to 14.3% at Hågerup. This positivity rate could be considered an underestimate of the ‘true’ prevalence of Y. pestis in Danish populations, due to variable DNA preservation. On the other hand, it may be an overestimate due to the osteological paradox [102], in which mortality is selective and the deceased are not representative of the living population. While the exact extrapolation is unclear, our Y. pestis positivity rate (3.3 - 14.3%) does align with mortality estimates (5 - 20%) during the later epidemics of the medieval and early modern period [85,103].
Of the 13 plague-positive individuals, 9 had sufficient sequencing depth (>3X) of the Y. pestis chromosome for phylogenetic analysis (Figure 4.2 C). To estimate a time-scaled phylogeny and dates for these 9 samples, we fit a relaxed molecular clock to an alignment of Y. pestis genomes which included 40 other isolates (Figure 4.1 B). We observed that all Danish strains clustered strongly (posterior: 1.0) within the known diversity of medieval and early modern Y. pestis in Europe (Figure 4.3). We found no evidence to suggest that Neolithic lineages of Y. pestis in Scandinavia (5000 YBP) [101] left descendants in medieval Danish populations. If long-term persistence of Y. pestis did occur in this region, it fell outside the geographic and temporal scope of this study.
We found no evidence of Y. pestis in Denmark between 1000 and 1300 CE. The factors influencing the preservation of ancient DNA are wide-ranging and complex, thus the absence of evidence cannot prove evidence of absence. That being said, we sampled a minimum of 85 individuals and a maximum of 165 individuals that pre-date the 14th century (Figure 4.2 A). Taking the mean positivity rate observed in this study (8.3%), we would expect to detect Y. pestis in 7 to 13 individuals from this time frame if it were present. We therefore interpret our negative results from this period as tentative evidence that Y. pestis was a relatively new pathogen in medieval Denmark, that did not become abundant and/or widespread until at least the 14th century.
The earliest evidence of Y. pestis in Denmark was found in the town of Ribe. Two individuals were associated with Y. pestis from the first half of the 14th century, dated to 1333 (1301-1366) and 1350 (1319-1383). These estimates are highly congruent with the historical record, as the first documented appearance of plague in Denmark was at Ribe in 1349 [104]. Furthermore, these strains fell within an unresolved cluster (posterior: 0.15) of samples from Northern and Western Europe (Figure 4.3) which has previously been linked to clonal spread of the Black Death (1343-1356) [19]. Our molecular dates support this historical association, albeit only weakly, as the precise epidemic period cannot be resolved due to the large confidence intervals of our estimates (>50 years).
The next period in which we identified Y. pestis was in the latter half of the 14th century. A third isolate from Ribe was dated to 1370 (1336-1408) and strongly clustered (posterior: 0.99) with post-Black Death samples from The Netherlands and Russia. These samples have previously been attributed to the pestis secunda (1357-1366) [105], although we find the pestis tertia (1364-76) [85] to be an equally likely candidate. This clade also has broader epidemiological significance, as it is directly ancestral to the Third Pandemic of plague (19th-20th century) [19]. Our results therefore reveal new global connections, as the same lineage that afflicted medieval Danish populations would later re-emerge to cause modern epidemics of plague, including the recent outbreaks in Madagascar [106].
We observed a gap in the continuity of plague at Ribe, as no Y. pestis was detected there between 1408 and 1484. This was surprising, as 86% of individuals (43/50) from this site were archaeologically dated to between 1400 and 1536. Instead, the distribution of Y. pestis appeared to shift during this period from the eastern coast of Jutland to the western coast. We recovered 3 distinct, and possibly contemporaneous, isolates of Y. pestis from 3 sites near Horsens dated to 1429 (1392-1467), 1433 (1403-1464) and 1457 (1427-1487). These genomes were most closely related to individuals sampled in Germany, Lithuania, Poland, and England (Figure 4.3). This geographic association parallels the historical record, in which outbreaks in Denmark coincided with those in the Baltic region [85]. However, recent studies have demonstrated that the directionality and spread of zoonotic diseases cannot be robustly inferred from genomic data alone [83,107]. Instead, our results establish an epidemiological link between Y. pestis and historical case records in Denmark, which could be jointly modeled with greater resolving power [67] in future work.
In the 16th century, we once again observed Y. pestis at Ribe. We dated two Y. pestis isolates from this region to 1513 (1484-1546) and 1525 (1494-1560). Furthermore, we also found evidence of Y. pestis in the northern site of Faldborg dated to 1594 (1550-1649). As an estimate of plague’s disappearance (1649), this is congruent with the historical record which documents the last recorded outbreak of plague in Jutland to last from 1654-1657 [100]. We found no evidence of Y. pestis in Denmark after this point, specifically between 1649 and 1800 CE. However, no individuals definitively post-date 1649 CE, although this period could include a maximum of 70 individuals (Figure 4.2 A). We would therefore expect to detect Y. pestis in 0 to 2 individuals (3.3%) from this time frame if it were present. Our results do not differ from this expectation, and are therefore not informative with regards to the disappearance of Y. pestis in Denmark. To address this question, additional samples would be required from the 17th and 18th centuries.
This study marks the first population-level analysis of ancient Y. pestis, where we performed in-depth (N=298), longitudinal sampling (800 years) within a single country (Denmark). We describe the earliest known appearance of Y. pestis in Denmark (14th century), and document the continuity of this pathogen in Scandinavia over a period of 400 years (17th century). Furthermore, we provide the first positivity rates of historical plague from molecular evidence, as we detected Y. pestis in 8.3% of Danish individuals. Our phylogenetic analysis was highly congruent with the sparse textual evidence of pestilence in Denmark, with regards to the timing of outbreaks and geographic ties to the Baltic region. We also provide novel evidence of plague exposure among Danish populations, such as the site of Tirup, where there is no surviving historical evidence. These results are of importance for both researchers of plague and other infectious diseases, as they (1) illuminate undocumented pathogens in the historical record, (2) reveal new connections between our past and present experience of plague, (3) broaden our understanding of the epidemiology of re-emerging diseases.
We sampled 298 individuals across 13 archaeological sites in Denmark (Figure 4.2 A, Dataset S1). Site occupation dates spanned from the 11th to the 19th century CE. We estimated individual date ranges based on burial position, which was categorized according to cultural shifts that occurred in Denmark throughout the medieval and early modern period [108]. When the original stratigraphic context was preserved, we refined these individual estimates further. For individuals with ambiguous or conflicting archaeological estimates, we performed radiocarbon dating when additional destructive sampling was permitted.
DNA was extracted from teeth and dental pulp according to a specialized protocol for ancient DNA [109]. Reagent blanks were introduced as negative controls to monitor DNA contamination in subsequent steps. We screened for plague using a PCR assay that targets the pla virulence gene in Yersinia pestis [21]. Extracts showing amplification in at least 4/6 replicates were converted into paired-end sequencing libraries [110]. Targeted capture of the Y. pestis genome was performed using previously designed probes [21] and sequenced on an Illumina platform.
Sequenced molecules were aligned to a reference genome using the nf-core/eager pipeline [88]. To phylogenetically place these new samples, we downloaded a comparative dataset of 40 high-coverage Y. pestis genomes (>3X) dated between the 14th and 18th centuries. We then constructed a multiple alignment with the snippy pipeline (https://github.com/tseemann/snippy), which included 356 variation positions and 4,289,810 constant sites.
To tip-date each genome, we performed a Bayesian Evaluation of Temporal Signal (BETS) [64] with BEAST2 [111]. We assumed a constant population size and compared the use of a strict clock and an uncorrelated lognormal (UCLN) relaxed clock. Diffuse normal priors were constructed for all tip-dates, using the mean radiocarbon/mortuary date and half the uncertainty as the standard deviation. All Danish samples were assigned equivalent priors with a mean date of 1330 CE and a standard deviation of 115 years. Bayes factors were calculated by comparing the marginal likelihoods of each candidate model, as estimated with a generalized stepping stone (GSS) computation. The model with the highest marginal likelihood was then run for 100,000,000 generations to ensure the effective sample size (ESS) of all relevant parameters was greater than 200.
Data visualization was performed using the python package seaborn and auspice, a component of the Nextstrain visualization suite [99].
Raw sequence reads have been deposited in NCBI BioProject PRJNAXXXXX. Archaeological metadata is provided in the supplementary information (Dataset SI).
This work was supported by the Social Sciences and Humanities Research Council of Canada (#20008499) and the MacDATA Institute (McMaster University, Canada). This research was enabled in part by support provided by Compute Ontario (https://www.computeontario.ca/) and Compute Canada (https://www.computecanada.ca). We would like to thank Julianna Stangroom, Michael Klowak, Dr. Emil Karpinski, Dr. Matthew Emery, and Dr. Stephanie Marciniak for their assistance in laboratory procedures. We also thank Dr. Ana Duggan for her insight regarding Bayesian methods for phylogenetic analysis. We thank members of the Sherman Centre for Digital Scholarship, including Dr. Andrea Zeffiro, Dr. John Fink, Dr. Matthew Davis, and Dr. Amanda Montague, for their assistance in developing the underlying genomic database. Finally, we would like to thank all past and present members of the McMaster Ancient DNA Centre and the Golding Lab at McMaster University.
K.E, R.S, J.K, and H.N.P designed the study. J.G, J.B, and S.D provided access to archaeological sites and materials. V.G performed radiocarbon dating. K.E, R.S, and J.K performed laboratory analysis. A.G.C and N.V. provided historical sources and interpretation. S.D and L.F critiqued and revised the computational methods and discussion. G.B.G provided access to computational resources and data storage. H.N.P and G.B.G supervised the study. K.E wrote the manuscript with contributions from all co-authors.
The authors declare no competing interests.
In this dissertation, I developed computational methods for genomics research and used them to reconstruct past and present pandemics of plague. In Chapter 2, I resented a novel software called NCBImeta
that facilitates the acquisition of sequence data and metadata from the NCBI repository. This specialized tool supports genomics research in the era of big data, where manual processing of abundant sequence records (10,000+) is impossible. As a paper on software development, its contributions and significance to the field of anthropology are understandably unclear. I targeted this article exclusively towards computational biologists because, at the time, few anthropologists had expressed interested in the issue of collecting and curating sequencing data. Reflecting this, NCBImeta
has mainly been cited across biological fields including studies of the human microbiome [112], plant-associated bacteria in agriculture [113], and emerging infectious diseases in public health (Matthew Gopez & Philip Mabon, personal communication, https://github.com/ktmeaton/NCBImeta/pull/9).
In 2021, I took a more active approach in my discipline and used this software to support several bodies of anthropological research. NCBImeta
was recently used in an environmental reconstruction of Beringia [114], the former land-bridge that facilitated early human migrations into North America from northeast Asia. The study by Murchie et al. furthers our understanding of the peopling of the Americas, and the possible interactions between early human populations and large animals (ie. megafauna) before the Last Glacial Period (~12,000 years ago). NCBImeta
was also recently used to curate sequence data in a case study of the zoonotic disease brucellosis in the 14th century [115]. The pioneering work by Hider et al. demonstrates how pathogen DNA preserves differently throughout the body, ranging from being the dominant microorganism in several tissues while being completely absent in others. It raises an important cautionary note for ancient DNA analysis and the anthropology of disease, by empirically demonstrating how sampling strategies can bias our understanding of what diseases were present in past populations.
In Chapter 3, I explored the challenges in estimating where and when plague appeared in the past, and why these estimates are often not reproducible between studies. I used the software tool from Chapter 2 to collect all publicly available Y. pestis genomes, and carefully curated their collection dates, locations, and hosts. My co-authors and I then used this data set for phylodynamic analysis, and devised a new approach for modeling the rates of evolutionary change (ie. molecular clock). We used these results to explain why divergence dates varied between studies, and outlined a critical framework for identifying which divergence dates should be considered non-informative. In addition, we found that past pandemics of plague may have emerged decades, or even centuries, before they were historically documented in European sources. These early dates are in agreement with recent historical work that examines more diverse (ie. non-European) sources. Through this finding, we demonstrated how genomic dating plays an important role in expanding the timelines of past pandemics to make space for more diverse narratives.
In contrast to our claims of the ‘power’ of genomic evidence, a prominent takeway from Chapter 3 was our discussion of the limitations of DNA. In particular, we found that Y. pestis genomes in isolation are not suitable for reconstructing evolutionary relationships during short-term epidemics. This is because the evolutionary rate of past pandemic lineages is approximately 1 substitution every 10 years. Isolates collected within this time frame (<10 years) are often identical, which means that the directionality of spread cannot be confidently inferred. To mitigate this weakness, complementary evidence is needed that has a higher temporal resolution. Historical case records are an excellent candidate, where plague cases are recorded annually if not weekly [86]. Based on initial comments from readers of the preprint, this conclusion was particularly exciting as it provided guidance on how to avoid over-interpreting ancient DNA evidence, and suggested a new avenue for inter-disciplinary collaboration (Boris Schmidt, personal communication).
In Chapter 4, I applied this updated genomic dataset and molecular clock method to a new problem. While in Chapter 3 we were concerned with estimating the first emergence of pandemic lineages, in Chapter 4 we reconstructed the persistence or continuity of ancient pandemics. We designed a unique longitudinal study, where we sampled skeletal remains spanning 800 years (1000 - 1800 CE) dated to before, during, and after the Second Pandemic (14thth - 18th century). Our sampling strategy focused on Scandinavia, particularly Denmark, as this region is underrepresented in the historical narrative and because the Anthropological DataBase Odense University collection (ADBOU, University of Southern Denmark) has exquisitely curated over 17,000 skeletal remains dated from the Viking Age (10th century) to the Early Modern Period (18th century). Using ancient DNA techniques, we recovered evidence of Y. pestis throughout the 14th to 17th centuries, which perfectly aligns with the historical narrative, limited as it is. Furthermore, our positivity rate for Y. pestis (3.3% - 14.3%) overlaps with mortality estimates from several historical outbreaks during the Second Pandemic. Our results strengthen the argument that Y. pestis was the causative agent of the Second Pandemic, and suggests that plague was a relatively new disease in medieval Denmark. These findings are being expanded on in two upcoming studies. The first, is an examination of how Danish populations responded to this new disease with regards to changes in the human immune system [116]. The second, is a reconstruction of how and when virulence in Y. pestis became attenuated during the Second Pandemic. Taken together, we anticipate these studies will deepen our understanding of disease exposure and experience in Denmark and across Europe.
A reoccurring problem in plague research is how best to integrate multidisciplinary sources, as there is great interest in combining genetic, historical, and environmental records to better understand past pandemics of plague [47,117]. An approach that is commonly used in ancient DNA studies of Y. pestis involves two steps: (1) reconstructing the relationships between epidemics using genetic evidence, and then (2) interpreting those relationships using historical records [19,49,105]. However, a major limitation of this method is that multidisciplinary sources are only integrated during the final interpretation phase. This runs the risk that errors and uncertainty associated with the genetic analysis will propagate, leading to high levels of ambiguity when attempting to provide historical context for this genetic ‘noise’.
An alternative method, is to leverage the strengths and mitigate the weaknesses of interdisciplinary sources in a joint phylogenetic analysis. This novel approach treats historical records (ex. location and date of an outbreak) as special ‘sequence-free’ samples. These records are then combined with DNA evidence to jointly infer a phylogeny, which can then be used to estimate the timing and location of historical events. Recent studies have demonstrated how critical this approach is, as case occurrence records can effectively correct for sampling biases in sparse genomic datasets [67,83]. However, incorporating sequence-free datasets is still a relatively recent method, and to date has only been applied to the study of viruses. Furthermore, it has only been tested on outbreaks occurring over a relatively small geographic area and time range. It remains unknown whether this approach is feasible for bacterial genomics, let alone ancient DNA, where genomes are larger and sampled across greater temporal and geographic scopes. This presents a key line of inquiry for future research, for which the plague bacterium Y. pestis would be an excellent case study.
During the course of this dissertation, my interest in global pandemics turned from an academic curiosity to a lived experience. In 2019, the novel coronavirus SARS-CoV-2 emerged to cause a global pandemic, with over 270 million cases recorded worldwide. While there are many unique aspects of this pandemic, one that has captured my attention is that it is the first pandemic to be monitored with real-time genomic surveillance [118]. Over two million genomic sequences have been deposited in public repositories, which can be used to inform public health responses [119]. However, this avalanche of data has also caused numerous problems, as researchers are struggling to manage this information and utilize it effectively [120]. As a result, database tools such as NCBImeta
presented in Chapter 2, are playing an important role in information management.
One field of ongoing research involves improving the scalability of these tools. For example, NCBImeta
was developed for a data set of ‘only’ 15,000 records, and in its current implementation, cannot process the 1+ million SARS-CoV-2 records on NCBI. A second critical avenue is integrating information from multiple repositories, as surveillance data is inconsistently being deposited in national and international databases [121,122,123]. Progress towards these two objectives will result in more diverse genomic data being analyzed (geographically and temporally), which may improve of our understanding of transmission and spread between and within countries.
Another parallel between this dissertation and the ongoing pandemic involves spatiotemporal modeling. In Chapter 3, we discovered that in our expanded genomic data set, Y. pestis’ rate of spread tends to outpace its rate of evolutionary change. This leads to identical Y. pestis isolates found across multiple countries, such as the case throughout the Black Death (1346-1353). However, we sporadically observed the opposite trend, in which Y. pestis strains collected in a short time frame (<10 years) were extremely different. This tremendous diversity in evolutionary rates meant that we were unable to estimate a single molecular clock for Y. pestis. These issues, clonal spread and rate variation, were also recently documented in SARS-CoV-2 [124]. Ferreira et al. describe this as a paradox in which we “become increasingly uncertain about the relationships among specific lineages as we collect greater amounts of data”. This runs counterintuitive to the general expectation in scientific studies that the more data we collect, the closer we get to the ‘truth’. Overall, this presents a complex theoretical problem that is becoming increasingly prevalent across various disciplines moving into the era of ‘big data’.