Experiment: Reproduce Plant Caduceus #1729

@eric-czech

Description

Plant Caduceus (Zhai et al. 2025 [June]), a.k.a. PlantCAD, is a DNA language model trained on angiosperm genomes. It is useful for a variety of applications in selective crop breeding, evolutionary biology, and genomic research. A second iteration, PlantCAD2 (Zhai et al. 2025 [September]), was also published as a preprint this year, with long-context support and a far more extensive evaluation suite (to name a couple of improvements).

It is built on top of the Caduceus architecture (Schiff et al. 2024) with a bi-directional Mamba backbone. As an experiment, it would be very useful to see how much compute is necessary to match the capability of the first-generation PlantCAD model via Marin. I'll refer to that model as PlantCAD1.

Hypothesis or Goal

I believe it may be possible to accomplish much of what PlantCAD1 did using a standard LLM architecture. In this experiment, I would like to test this primarily as a function of performance on a zero-shot evaluation task used frequently in DNA modeling: evolutionary conservation. This is typically measured as ROC AUC over a binary labeling of sequences as either conserved or not conserved. See the Methods section in PlantCAD2 for more details.
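To make the evaluation concrete, here is a minimal sketch of the scoring described above. It assumes hypothetical per-sequence model scores (e.g. log-likelihoods from the language model) and binary conservation labels, and computes ROC AUC via the Mann-Whitney U statistic, which is equivalent to the usual threshold-sweep definition:

```python
# Hedged sketch of zero-shot conservation evaluation. Scores and labels
# below are hypothetical stand-ins for real model outputs and annotations.

def roc_auc(labels, scores):
    """ROC AUC: probability a random positive outranks a random negative
    (ties count as half a win)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: conserved sequences (label 1) tend to receive
# higher model likelihoods than non-conserved sequences (label 0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(roc_auc(labels, scores))  # 8/9 ≈ 0.889
```

In practice a library implementation (e.g. scikit-learn's `roc_auc_score`) would be used instead; this version just makes the metric explicit.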

Links

References:

Results

I ran a number of training experiments related to this effort in Open-Athena#2. I'll keep most of the lower level details related to this experiment in that thread while summarizing only key results here. Similarly, #1730 contains only the minimal code necessary to reproduce that work. I'm hoping that can serve as a useful reference for anyone else potentially interested in running some experiments on DNA (e.g. @dlwh, @gonzalobenegas, @kothasuhas, @Helw150, @emarro).

Anyway, I currently have three useful results:

1. Approximate PlantCAD parity

This scaling relationship, as a function of performance on a downstream DNA conservation task, looks promising for building plant DNA models on Marin:

Image

2. Perplexity vs downstream performance

I'm not sure how often this will hold up, but my results suggest that eval perplexity is not a great predictor of downstream performance when nearing convergence:

Image

Put differently, the downstream performance saturated before eval loss did.

3. Hot/cold model gaps are substantial

I had no clue whether this should hold for DNA as it does for text, and it was notable to see how much of a difference WSD-S-style cooldowns made on eval loss (or at least that was the aspiration, modulo this complication: discord.com/marin#1425488001978339363):

Image

The text annotations indicate the gap in terms of steps, epochs, and tokens between the same eval scores for the hot and cold models to provide some context on how big the difference is.
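One way those gap annotations could be computed is to fix an eval loss reached by the cold (cooled-down) model and find when the hot model first reaches the same loss. The curves, target loss, and tokens-per-step below are all hypothetical, just to show the shape of the calculation:

```python
# Hedged sketch: measure the hot/cold gap at a matched eval score.
# All numbers here are made up for illustration.

def steps_to_reach(curve, target):
    """First step at which eval loss drops to <= target.
    `curve` is a list of (step, loss) pairs in step order."""
    for step, loss in curve:
        if loss <= target:
            return step
    return None  # never reached within the logged window

hot = [(1000, 1.30), (2000, 1.22), (4000, 1.15), (8000, 1.10)]
cold = [(1000, 1.20), (2000, 1.13), (4000, 1.08)]

tokens_per_step = 1_000_000  # hypothetical batch size in tokens
target = 1.15                # a loss the cold model hits early

gap_steps = steps_to_reach(hot, target) - steps_to_reach(cold, target)
print(gap_steps, gap_steps * tokens_per_step)  # 2000 steps, 2e9 tokens
```

The epoch gap would follow the same pattern, dividing the token gap by the dataset size.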
