flowchart LR A["DNA Sequence<br/>A T G C<br/>1 0 0 0<br/>0 1 0 0<br/>0 0 1 0<br/>0 0 0 1<br/>...<br/>16,048 × 4"] --> B["Conv<br/>+<br/>MaxPool"] --> C["Fully<br/>Connected"] --> D["827 labs<br/>Softmax"] style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px style B fill:#fff3e0,stroke:#f57c00,stroke-width:3px style C fill:#e8f5e8,stroke:#388e3c,stroke-width:3px style D fill:#fce4ec,stroke:#c2185b,stroke-width:3px
engineered-DNA forensics
Engineered organisms (bio-agents) are created in labs around the world. Lab leaks are relatively common, even from :BSL-4 labs. Bad actors, including nation states and terror groups, have always had a strong interest in bioweapons. But once a bio-agent is out in the wild, it’s really difficult to figure out who developed it in the first place.
The problem is called genetic engineering attribution (GEA). The challenge of GEA is tracing back engineered organisms to their designers through signatures in their DNA.
But why would that work? There are many degrees of freedom in genetic engineering, and the combination of design choices, style and tools creates a unique DNA pattern in the synthetic organism, reflecting its lab-of-origin.
Once mature, GEA will allow us to quickly respond to outbreaks, identify responsible labs to fill safety gaps and, importantly, deter bad actors. It’s something we should invest in.
While the field itself arguably started with the microbial forensics investigations of Amerithrax, GEA back then was manual, specific and slow. In the machine learning era, we’re able to develop general tools that work for a wide range of bio-agents and labs. But how far are we?
2018: raw nucleotides + CNN
Nielsen & Voigt, 2018 is the first study using machine learning for GEA. They used engineered :Plasmid sequences from Addgene, an open source database. It’s one of the few databases where each engineered sequence is linked to the lab that designed it.
Their data contained 36,764 plasmid sequences from 827 labs. To make the sequences digestible for a model, they simply :one-hot encoded each nucleotide, resulting in a 16,048 × 4 matrix for each plasmid sequence. Then they trained a simple CNN to predict which of the labs the sequence came from.
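To make that step concrete, here’s a minimal sketch in Python. The padding/truncation to a fixed length and the handling of ambiguous bases are my assumptions, not details from the paper:

```python
import numpy as np

# One nucleotide -> one row of a binary matrix (same ordering as the diagram above)
NUC_INDEX = {"A": 0, "T": 1, "G": 2, "C": 3}

def one_hot_encode(seq: str, length: int = 16_048) -> np.ndarray:
    """Encode a DNA string as a fixed-size (length, 4) binary matrix."""
    matrix = np.zeros((length, 4), dtype=np.float32)
    for i, nucleotide in enumerate(seq[:length].upper()):
        if nucleotide in NUC_INDEX:  # ambiguous bases (e.g. N) stay all-zero
            matrix[i, NUC_INDEX[nucleotide]] = 1.0
    return matrix

print(one_hot_encode("ATGC").shape)  # (16048, 4), zero-padded beyond the sequence
```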
The CNN correctly predicted the lab-of-origin in 48% of held-out sequences, and 70% of the time the correct lab was in the top 10 predictions. This isn’t quite good enough to apply to a real-world case, but they made a point: there actually is a unique lab-of-origin signature in these sequences. ML-GEA is off to a promising start.
2020: byte-pair encoding + LSTM
Alley et al, 2020 are next in line. They also used plasmid data from Addgene. After preprocessing, they had a dataset of 81,833 sequences from 1,314 labs, substantially larger than Nielsen & Voigt’s two years before.
Instead of directly training a model on nucleotides, they used :byte-pair encoding (BPE) to identify recurring patterns in the DNA, called motifs. These motifs mapped to codons, regulatory and conserved regions. Each motif then became a token that is fed to the model.
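As an illustration, here’s a toy BPE loop on DNA. The real vocabulary size and merge schedule Alley et al used will differ; `bpe_merges` is a hypothetical helper, not their code:

```python
from collections import Counter

def bpe_merges(seq: str, num_merges: int) -> list[str]:
    """Repeatedly merge the most frequent adjacent token pair."""
    tokens = list(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # merge the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_merges("ATGATGATGC", num_merges=2))  # ['ATG', 'ATG', 'ATG', 'C']
```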
In addition to DNA, they also trained their model on six phenotypes, such as growth temperature and antibiotic resistance. In a real-world scenario, this means we would have to sequence the sample and run lab tests to obtain these phenotypes.
To predict the lab-of-origin from sequences and phenotypes, they used an :LSTM trained in a two-step process. First, they trained on sequences only, then added phenotypes and fine-tuned the model. This training strategy prevented the model from getting stuck in a local minimum caused by the phenotype data early in training.
flowchart LR A["DNA Sequence<br/>ATGCGTAA...<br/>↓<br/>BPE Motifs"] --> B["LSTM"] --> C["Metadata<br/>+<br/>Concat"] --> D["1314 labs<br/>Softmax"] style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px style B fill:#fff3e0,stroke:#f57c00,stroke-width:3px style C fill:#e8f5e8,stroke:#388e3c,stroke-width:3px style D fill:#fce4ec,stroke:#c2185b,stroke-width:3px
The sequence+phenotype LSTM predicted the correct lab-of-origin 70% of the time, with an 84% top-10 accuracy. This is better than before, but it’s unclear whether the performance increase comes from LSTM vs. CNN, BPE vs. raw nucleotide encoding, more sequence data, or the addition of meta-data to the training.
The authors also show that the model is well :calibrated and that a simpler random forest model can predict the nation-of-origin with 88% top-3 accuracy compared to 47% for simply guessing the most abundant nations in the database.
2022: pangenome + alignment (no deep learning!)
Do we actually need deep learning at all? Wang et al., 2022 don’t think so. They developed PlasmidHawk, which uses a much simpler approach based on a :pangenome, sequence alignment and fragment counting. They needed full-length plasmid DNA, resulting in a dataset of 38,682 plasmids from 896 labs. This is how the method works:
Setup
- Create a pangenome from ALL plasmid sequences
- Align each plasmid sequence back to the pangenome to annotate each fragment, or sub-sequence, with lab-of-origin information. These annotations are not unique; some fragments will be annotated with multiple labs.
Prediction
- Take a new plasmid sequence and align to the pangenome
- Count how often each lab occurs among the aligned fragments. This count is already a raw score: one of the labs will occur most often, making it a good candidate for the lab-of-origin.
- But they go a step further and calculate a lab score - a more elegant, weighted version of the raw count. The idea: the fewer labs share a fragment, the more indicative it is of the labs that do have it (see the sketch below).
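Here’s that weighting idea in Python, assuming a fragment shared by k labs contributes 1/k to each of them. The exact lab-score formula in PlasmidHawk differs in its details; this is just the intuition:

```python
from collections import Counter

def lab_scores(aligned_fragments: list[set[str]]) -> Counter:
    """Each aligned fragment carries the set of labs annotated on it."""
    scores = Counter()
    for labs in aligned_fragments:
        for lab in labs:
            scores[lab] += 1.0 / len(labs)  # rarer fragments weigh more
    return scores

# Three aligned fragments: one unique to LabA, two shared with other labs
fragments = [{"LabA"}, {"LabA", "LabB"}, {"LabA", "LabB", "LabC"}]
print(lab_scores(fragments).most_common(1))  # [('LabA', 1.83...)]
```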
flowchart LR
    subgraph S1 ["Setup Phase"]
        E["Addgene<br/>Repository"] --> F["Build<br/>Pan-genome"]
        F --> G["Annotate<br/>Lab Origins"]
    end
    subgraph S2 ["Prediction Phase"]
        A["Unknown<br/>Plasmid"] --> B["Align to<br/>Pan-genome"]
        B --> C["Count<br/>Fragments"]
        C --> D["Predict<br/>Lab Origin"]
    end
    G -.-> B
    style S1 fill:none,stroke:none
    style S2 fill:none,stroke:none
    style E fill:#f3e5f5,stroke:#7b1fa2,stroke-width:4px
    style F fill:#fff9c4,stroke:#f9a825,stroke-width:4px
    style G fill:#fff9c4,stroke:#f9a825,stroke-width:4px
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:4px
    style B fill:#fff3e0,stroke:#f57c00,stroke-width:4px
    style C fill:#e8f5e8,stroke:#388e3c,stroke-width:4px
    style D fill:#fce4ec,stroke:#c2185b,stroke-width:4px
PlasmidHawk achieved a 76% top-1 accuracy for lab-of-origin prediction and an 85% top-10 accuracy, better than the two neural network models before it. The authors also argue it’s more efficient than deep learning, as updating the pangenome with new plasmids is quick compared to re-training a neural network. But the data isn’t that big, and DL hardware/software is becoming very efficient, so I don’t believe this is a substantial point anymore.
A big advantage of this approach is interpretability. For each novel sequence, the method directly surfaces the association between fragment and lab-of-origin after aligning it to the pangenome. It does leave a few more degrees of freedom, though: the fragment size has to be chosen manually, and so does the similarity threshold for alignment. The method also doesn’t allow incorporating additional meta-data like the LSTM approach above, which will be crucial in real-world GEA.
2022: GEA competition
The first GEA competition had 1,200 competitors. They worked with the Alley et al data: 81,833 plasmid sequences plus meta-data, and 1,314 labs-of-origin to predict. There are many more labs in the Addgene database, but all labs with fewer than ten plasmids were pooled into an “Unknown Engineered” category.
The top teams’ models vastly outperformed previous models, with a top-1 accuracy of 82% and a top-10 accuracy of 95%.
They were also much better at negative attribution - the ability to exclude potential designers. This is measured by an X99 score: the minimum positive integer N such that the top-N accuracy is at least 99%. For example, X99 is 279 if the lab-of-origin is among the top 279 predicted candidates 99% of the time. In Nielsen & Voigt 2018, the X99 was 898; the winning model in the competition had a score of 299.
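Given per-sequence ranks of the true lab, X99 is quick to compute. A minimal sketch:

```python
import numpy as np

def x99(true_lab_ranks: np.ndarray) -> int:
    """Smallest N such that the true lab is in the top-N predictions >= 99% of the time.

    true_lab_ranks[i] is the 1-based rank of the true lab for held-out sequence i.
    """
    ranks = np.sort(true_lab_ranks)
    # the rank below which 99% of all examples fall
    return int(ranks[int(np.ceil(0.99 * len(ranks))) - 1])

ranks = np.array([1, 1, 2, 3, 1, 5, 1, 2, 40, 1])
print(x99(ranks))  # 40: we need the top 40 candidates to cover 99% of these cases
```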
So what were the modeling strategies of the winners? Mostly :ensembles containing CNNs, though the details (pre-processing etc.) varied substantially.
Problems
Crook et al 2022 describe some of the issues they observed in the competition models:
- Low :calibration scores, except for the winning model
- A problematic, large composite “Unknown Engineered” class, pooling all labs with fewer than 10 sequences (~2,000 labs). The top-10 accuracy for this class was consistently very high, inflating overall performance, though the problem didn’t seem severe for the best models.
Deeper dive into a top solution
Soares et al published their solution, which was one of the winners. Their model had a 90% top-10 accuracy.
Their initial model was a CNN with two tweaks. First, they also used :BPE. Second, they used circular shift augmentation. The key insight here is that plasmid DNA is circular: it doesn’t have a start or an end. The sequence ATGCACTAG shifted by 3 is CACTAGATG. To reflect this, they randomly shift sequences during training, which helps the model learn local motifs and relative nucleotide arrangements rather than absolute positions, and increases the training set. This model was already better than all pre-competition models, with a 76% top-1 and 89% top-10 accuracy.
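The augmentation itself is tiny. A sketch (`circular_shift` is a hypothetical helper, not their code):

```python
import random

def circular_shift(seq: str, shift: int | None = None) -> str:
    """Rotate a circular plasmid sequence; a random rotation if no shift is given."""
    shift = random.randrange(len(seq)) if shift is None else shift % len(seq)
    return seq[shift:] + seq[:shift]

print(circular_shift("ATGCACTAG", shift=3))  # CACTAG + ATG -> CACTAGATG
```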
But the interesting bit is their second model. Here, the base-model is the same CNN as before, but instead of a final softmax layer that directly predicts the lab, they used a triplet network.
The idea of a triplet network is this: for each training example, we create a triplet consisting of a plasmid sequence (the anchor), its true lab-of-origin (the positive) and a different lab (the negative). The CNN encodes the sequence into a 200-D learnable :embedding vector, and each lab gets its own learnable 200-D embedding.
During training, the model learns to position the anchor embedding closer to the positive lab embedding than to the negative one, using a triplet loss function. Instead of predicting lab classes directly, the model produces embedding vectors. These can still be used for classification, simply by measuring the distance from a sequence embedding to the lab embeddings: the closest lab embedding is the top-1 candidate.
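A minimal PyTorch sketch of this setup. The 200-D embedding size and the 1,314 labs are from the paper; the encoder here is a stand-in, not Soares et al’s CNN:

```python
import torch
import torch.nn as nn

embed_dim, n_labs = 200, 1314
sequence_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))  # placeholder encoder
lab_embeddings = nn.Embedding(n_labs, embed_dim)  # one learnable vector per lab
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor = sequence_encoder(torch.randn(8, 4, 1000))         # encoded sequences
positive = lab_embeddings(torch.randint(0, n_labs, (8,)))  # true labs-of-origin
negative = lab_embeddings(torch.randint(0, n_labs, (8,)))  # other labs (ensure != positive in practice)
loss = triplet_loss(anchor, positive, negative)

# Inference: the predicted lab is the one whose embedding is closest
dists = torch.cdist(anchor, lab_embeddings.weight)  # (8, 1314) pairwise distances
top1 = dists.argmin(dim=1)
```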
Now, while the model was only ever so slightly better than the simpler CNN, it comes with other advantages:
- Interpretability: We can visualise the embeddings easily
- Few-shot learning: Even with a single sequence from a new lab, we can ask which lab embeddings are closest
- Dimensional > categorical: Instead of assigning a single class, we get distances in space, which can tell a more nuanced story
Summary - where are we
GEA is in its infancy. Every single study so far has built models to predict the lab-of-origin from plasmid sequences in the Addgene database.
This is largely a pragmatic choice. Addgene contains lots of curated sequences with associated meta-data and lab-of-origin. Studies so far have shown that GEA is possible, and that models can predict the lab-of-origin with accuracies of ~70-80%, decent negative attribution and good calibration.
| Study (Year) | Model | Data | Sequences | Labs | Top-1 Accuracy | Top-10 Accuracy | Key Innovation |
|---|---|---|---|---|---|---|---|
| Nielsen & Voigt (2018) | CNN | Addgene plasmids | 36,764 | 827 | 48% | 70% | First ML approach to GEA |
| Alley et al (2020) | LSTM | Addgene plasmids + metadata | 81,833 | 1,314 | 70% | 84% | BPE encoding, phenotype data |
| Wang et al (2022) | Pangenome alignment | Addgene plasmids | 38,682 (full-length) | 896 | 76% | 85% | Non-ML approach, interpretable |
| Competition Winner (2022) | CNN ensembles | Addgene plasmids + metadata | 81,833 | 1,314 | 82% | 95% | Ensemble methods, k-mers, BLAST-PCA features |
| Soares et al (2022) | CNN + Triplet Network | Addgene plasmids + metadata | 81,833 | 1,314 | 76% | 90% | Circular shift augmentation, embedding space |
The research so far also shows that small tweaks like BPE or circular shift augmentation can still make a large difference. It seems likely that there’s a lot of low-hanging fruit here.
But most models probably won’t be useful in the case of a real outbreak yet, because:
Many bio-agents won’t have a plasmid - Most viruses don’t have plasmids, and even engineered bacteria may not have an engineered plasmid if the engineered trait has been chromosomally integrated.
Models are less likely to predict bad actors - Terror organisations won’t deposit sequences in Addgene, and neither will the labs they are associated with. Models so far are therefore best at predicting trustworthy labs that care about transparency and data sharing. This is still useful for tracing accidental outbreaks though.
Open-source might be a problem - All models so far are trained on open-source data, allowing bad actors to create plasmid sequences that mimic other labs. This would leave its own traces, but it’s worth thinking about. AI designers could help further cover tracks, though other AIs could be trained to spot this.
How to move forward
So how do we make GEA practical for real-world outbreaks?
Data - We want GEA to handle any sequence, from whole bacterial and viral genomes and plasmids to :synthetic assemblies. Therefore, we need to create comprehensive databases linking these to lab-of-origin and meta-data.
Models - Most models so far are highly specific in their hyperparameter optimisation, data preprocessing and ensembling. To handle a wide range of sequences, we need flexible models that are trained across the biosec-relevant part of the tree-of-life and can handle any sequence length. Sequence foundation models in the style of evo2 might be able to do this.
Integration into a wider framework - Technical GEA is only part of the solution. We’ll also have non-technical information (location, epidemiological features etc.) of outbreaks as well as intelligence (whistleblowers, surveillance). We need to integrate them into a coherent framework.
Genetic attribution security - Once GEA is mature enough, there will be new problems, such as AI-assisted evasion of lab-of-origin detection. At this point we’ll need defensive systems that are robust to obfuscation and stay on top in a new adversarial arms race between GEA tools and evasion techniques.
The game is afoot.
:x plasmid
Plasmids are small, circular pieces of DNA that float around outside the actual genome. They are a good vector to transport genes into the cell, are relatively easy to work with and self-replicating. That’s why they are often used for genetic engineering.
:x bpe
Byte-pair encoding (BPE) is a tokenization method that breaks down DNA sequences into meaningful subunits rather than individual nucleotides. Instead of treating each A, T, G, C separately, BPE identifies frequently occurring pairs and merges them into single tokens, creating a more efficient representation. It also allows inputs of different lengths, which is very useful for DNA sequences.
:x lstm
Long Short-Term Memory (LSTM) is a type of neural network designed to remember information over long sequences. Unlike regular neural networks that forget previous inputs, LSTMs have special “gates” that control what information to keep, forget, or output. They also accept inputs of different lengths, which is great for DNA sequences.
:x pan
Traditionally, genomics has worked with a single reference genome against which samples are aligned. This approach loses variation - if a sample has genetic features that can’t be aligned to the reference, they’re discarded. A pangenome captures all genetic variation across multiple samples in a single data structure, often represented as a graph rather than a linear sequence.
:x ensemble
Ensemble methods do as they sound: They combine the predictions of multiple different models, often leading to more accurate and reliable results.
:x calibration
Calibration is about aligning stated probabilities with empirical frequencies. DL models are often overconfident. A well-calibrated model should do the following: if it predicts something with a 70% probability, that thing should happen 70% of the time. If someone tells me there’s a 30% rain probability in Edinburgh today, I expect it to rain 3 out of 10 times they give me that prediction.
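A quick way to check this empirically is to bucket predictions by stated confidence and compare against observed accuracy per bucket. A minimal sketch:

```python
import numpy as np

def calibration_table(confidences: np.ndarray, correct: np.ndarray, bins: int = 10):
    """Print stated confidence vs. observed accuracy for each confidence bin."""
    edges = np.linspace(0, 1, bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            print(f"stated {lo:.1f}-{hi:.1f}: observed {correct[mask].mean():.2f} "
                  f"over {mask.sum()} predictions")

confidences = np.array([0.72, 0.68, 0.75, 0.31, 0.29])
correct = np.array([1, 1, 0, 0, 1])  # 1 = the predicted lab was right
calibration_table(confidences, correct)
```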
:x one-hot
One-hot encoding transforms categorical variables into binary vectors where only one element is “hot” (1) and all others are “cold” (0). For DNA nucleotides: A → [1,0,0,0], T → [0,1,0,0], G → [0,0,1,0], C → [0,0,0,1]. This allows machine learning algorithms to process the data.
:x bsl4
BSL-4 labs have the strictest biosafety precautions, as they deal with agents that are aerosol-transmitted and/or highly dangerous and for which there is often no treatment or vaccine.
:x synth
A fully synthetic DNA molecule, often created from smaller components.
:x embed
Any object - a word, DNA sequence, image, or even a lab - can be translated into a vector, i.e. an array of numbers. This is called an embedding. The dimensionality (100, 200, 1024) is usually chosen manually.
The idea is that these embedding vectors are trained so that similar objects are close together in embedding space, and dissimilar ones are farther apart. After training, the positions of these vectors encode rich, abstract properties of the original objects.
For example, in LLMs, the embedding of “King” - “Man” + “Woman” ends up close to “Queen” - because the model has learned how sex and royalty are structured in language. In GEA, plasmid embeddings from the same lab will end up close together in embedding space, as they all share similar design signatures.