Experiment: Reproduce Plant Caduceus #1729

@eric-czech

Description

Plant Caduceus (Zhai et al. 2025 [June]), a.k.a. PlantCAD, is a DNA language model trained on angiosperm genomes. It is useful for a variety of applications in selective crop breeding, evolutionary biology, and genomic research. A second iteration, PlantCAD2 (Zhai et al. 2025 [September]), was also published as a preprint this year, with long-context support and a far more extensive evaluation suite (to name a couple of improvements).

It is built on top of the Caduceus architecture (Schiff et al. 2024) with a bi-directional Mamba backbone. As an experiment, it would be very useful to see how much compute is necessary to match the capability of the first-generation PlantCAD model via Marin. I'll refer to that model as PlantCAD1.

Hypothesis or Goal

I believe it may be possible to accomplish much of what PlantCAD1 did using a standard LLM architecture. In this experiment, I would like to test this primarily as a function of performance on a zero-shot evaluation task used frequently in DNA modeling: evolutionary conservation. This is typically measured as ROC AUC over a binary labeling of sequences as either conserved or not conserved. See the Methods section in PlantCAD2 for more details.
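To make the evaluation concrete, here is a minimal sketch of the scoring described above. It assumes hypothetical per-sequence model scores (e.g. log-likelihoods from the language model) and binary conservation labels, and computes ROC AUC via the Mann-Whitney U statistic, which is equivalent to the usual threshold-sweep definition:

```python
# Hedged sketch of zero-shot conservation evaluation. Scores and labels
# below are hypothetical stand-ins for real model outputs and annotations.

def roc_auc(labels, scores):
    """ROC AUC: probability a random positive outranks a random negative
    (ties count as half a win)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: conserved sequences (label 1) tend to receive
# higher model likelihoods than non-conserved sequences (label 0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(roc_auc(labels, scores))  # 8/9 ≈ 0.889
```

In practice a library implementation (e.g. scikit-learn's `roc_auc_score`) would be used instead; this version just makes the metric explicit.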

Links

References:

Results

I ran a number of training experiments related to this effort in Open-Athena#2. I'll keep most of the lower level details related to this experiment in that thread while summarizing only key results here. Similarly, #1730 contains only the minimal code necessary to reproduce that work. I'm hoping that can serve as a useful reference for anyone else potentially interested in running some experiments on DNA (e.g. @dlwh, @gonzalobenegas, @kothasuhas, @Helw150, @emarro).

Anyway, I currently have three useful results:

1. Approximate PlantCAD parity

This scaling relationship, as a function of performance on a downstream DNA conservation task, looks promising for building plant DNA models on Marin:

Image

2. Perplexity vs downstream performance

I'm not sure how often this will hold up, but my results suggest that eval perplexity is not a great predictor of downstream performance when nearing convergence:

Image

Put differently, the downstream performance saturated before eval loss did.

3. Hot/cold model gaps are substantial

I had no clue whether this should hold for DNA as it does for text, and it was notable to see how much of a difference WSD-S-style cooldowns made on eval loss (or at least that was the aspiration, modulo this complication: discord.com/marin#1425488001978339363):

Image

The text annotations indicate the gap in terms of steps, epochs, and tokens between the same eval scores for the hot and cold models to provide some context on how big the difference is.
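One way those gap annotations could be computed is to fix an eval loss reached by the cold (cooled-down) model and find when the hot model first reaches the same loss. The curves, target loss, and tokens-per-step below are all hypothetical, just to show the shape of the calculation:

```python
# Hedged sketch: measure the hot/cold gap at a matched eval score.
# All numbers here are made up for illustration.

def steps_to_reach(curve, target):
    """First step at which eval loss drops to <= target.
    `curve` is a list of (step, loss) pairs in step order."""
    for step, loss in curve:
        if loss <= target:
            return step
    return None  # never reached within the logged window

hot = [(1000, 1.30), (2000, 1.22), (4000, 1.15), (8000, 1.10)]
cold = [(1000, 1.20), (2000, 1.13), (4000, 1.08)]

tokens_per_step = 1_000_000  # hypothetical batch size in tokens
target = 1.15                # a loss the cold model hits early

gap_steps = steps_to_reach(hot, target) - steps_to_reach(cold, target)
print(gap_steps, gap_steps * tokens_per_step)  # 2000 steps, 2e9 tokens
```

The epoch gap would follow the same pattern, dividing the token gap by the dataset size.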
