Training mRNA Language Models Across 25 Species for $165

Imagine going from a therapeutic protein concept to a synthesis-ready, codon-optimized DNA sequence in an afternoon. That’s what the team at OpenMed set out to build. And they trained the codon-optimization models behind it for $165 in cloud compute.

This is Part II of their journey, following a survey of the protein AI landscape. Now it’s about the build: a pipeline that takes a protein idea from concept to expression-ready DNA. Three stages: predict the 3D structure, design amino acid sequences that fold into that shape, and optimize the codons so the protein actually expresses in the target organism.

The folding and design parts use established tools like ESMFold and ProteinMPNN. The real work went into the mRNA optimization component. That’s where OpenMed trained new models, built new infrastructure, and ran extensive experiments.

The Architecture Exploration

The core question: which transformer architecture works best for codon-level language modeling? This matters because codon optimization is crucial for therapeutic mRNA, vaccines, and recombinant protein production. The genetic code is degenerate: most amino acids can be written with several synonymous codons, so the same protein can be encoded by astronomically many different DNA sequences. Those sequences are not equivalent in practice, and some codon arrangements reportedly express as much as 100x better than others. The Pfizer-BioNTech COVID vaccine was codon-optimized for human expression.
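To put a number on "astronomically many", here is a minimal sketch that counts the synonymous DNA encodings of a protein using the standard genetic code. The function name and the example peptide are illustrative and not taken from the OpenMed code.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code
# (61 sense codons in total; stop codons are not counted here).
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

def count_synonymous_encodings(protein: str) -> int:
    """Return how many distinct coding sequences translate to `protein`."""
    return prod(DEGENERACY[aa] for aa in protein.upper())

# Illustrative six-residue peptide: 1 * 2 * 4 * 6 * 6 * 4 encodings.
print(count_synonymous_encodings("MKTLLV"))  # 1152
```

The count multiplies with every residue, so a protein a few hundred amino acids long has far more possible encodings than anyone could enumerate. That is why a learned model of codon usage is attractive here.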

OpenMed started with a small CodonBERT baseline (6M params) and scaled up through two families: ModernBERT, representing the latest efficiency innovations from NLP, and RoBERTa, the proven workhorse. The choice of RoBERTa was deliberate: Meta’s ESM-2 protein language model, which powers ESMFold, is itself a RoBERTa-style model trained on protein sequences.
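For readers new to codon-level language modeling, the sketch below shows the general idea: a coding sequence is split into three-letter codon tokens, and some tokens are hidden for the model to predict. This is an assumption-laden illustration, not OpenMed's preprocessing. The `<mask>` token name, the 15% rate (the conventional BERT default), and both function names are hypothetical, and real MLM recipes usually also swap in random or unchanged tokens some of the time.

```python
import random

MASK = "<mask>"  # hypothetical mask token name

def codon_tokenize(cds: str) -> list[str]:
    """Split a coding sequence into codon tokens (accepts DNA or RNA letters)."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

def mask_codons(tokens, mask_prob=0.15, seed=0):
    """Replace each codon with MASK with probability `mask_prob`.

    Returns (masked_tokens, labels); labels hold the original codon at
    masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = codon_tokenize("ATGAAAACCCTGCTGGTGTAA")
masked, labels = mask_codons(tokens)
```

With 64 possible codons, the vocabulary is tiny compared with a natural-language tokenizer, which is part of why small models are viable for this task.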

The Results

CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 (lower is better) and a Spearman CAI correlation of 0.40, comfortably outperforming ModernBERT. That correlation is higher than I expected for a 312M-parameter model trained on coding sequences. The team then scaled to 25 species, training 4 production models in 55 GPU-hours on a single RTX 6000 Ada. Total cost: $165 for cloud compute.
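Two quick back-of-the-envelope checks help put these figures in context. The sketch assumes the reported perplexity is the exponential of the mean per-token cross-entropy over codon tokens, which is the usual definition but is not spelled out in the write-up.

```python
import math

def perplexity(mean_nll_nats: float) -> float:
    """Perplexity is exp of the mean negative log-likelihood (in nats)."""
    return math.exp(mean_nll_nats)

# Reported perplexity, and the cross-entropy loss it implies.
reported_ppl = 4.10
implied_loss = math.log(reported_ppl)

# Baseline: a model guessing uniformly over all 64 codons.
uniform_ppl = perplexity(math.log(64))

# Implied hourly rate from the reported totals.
total_cost_usd = 165
gpu_hours = 55
usd_per_gpu_hour = total_cost_usd / gpu_hours

print(f"implied loss: {implied_loss:.3f} nats")
print(f"uniform-guess perplexity: {uniform_ppl:.0f}")
print(f"cost: ${usd_per_gpu_hour:.2f} per GPU-hour")
```

A perplexity near 4 means the model is, on average, about as uncertain as if it were choosing among four equally likely codons at each masked position, compared with 64 for blind guessing. The cost works out to $3 per GPU-hour.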

What surprised me is that ModernBERT, despite its architectural innovations, didn’t win. The RoBERTa family, with its proven MLM architecture, still reigns supreme for biological sequences. It’s a reminder that architectural novelty doesn’t always translate to better performance on domain-specific tasks.

The Pipeline Components

Protein Folding: ESMFold v1 predictions on 30 protein chains, average pTM of 0.79. Working batch pipeline.

Sequence Design: ProteinMPNN on scaffold 7K00, 42% sequence recovery. This is the part where you take a target structure and generate amino acid sequences that fold into it.

mRNA Optimization: Trained multiple transformer variants on 250k CDS, then scaled to 381k sequences across 25 species. CodonRoBERTa-large-v2: perplexity 4.10, Spearman CAI correlation 0.40. Multi-species suite: 4 models spanning 25 organisms.
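Since CAI (Codon Adaptation Index) anchors the evaluation, here is a minimal sketch of how it is conventionally computed: each codon gets a weight equal to its usage divided by the usage of the most frequent synonymous codon, and a sequence's CAI is the geometric mean of those weights. The usage counts below are made-up toy numbers covering only two amino acids, and nothing here is taken from OpenMed's evaluation code.

```python
import math
from collections import defaultdict

# Toy subset of the genetic code (lysine and glutamate only).
CODON_TO_AA = {"AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E"}

# HYPOTHETICAL reference usage counts, for illustration only.
REF_COUNTS = {"AAA": 25, "AAG": 75, "GAA": 40, "GAG": 60}

def relative_adaptiveness(ref_counts, codon_to_aa):
    """Weight per codon: its count divided by the best synonymous codon's count."""
    best = defaultdict(int)
    for codon, n in ref_counts.items():
        aa = codon_to_aa[codon]
        best[aa] = max(best[aa], n)
    return {c: n / best[codon_to_aa[c]] for c, n in ref_counts.items()}

def cai(codons, weights):
    """Geometric mean of codon weights, computed in log space."""
    scored = [c for c in codons if c in weights]  # skip codons without a weight
    if not scored:
        raise ValueError("no scorable codons")
    return math.exp(sum(math.log(weights[c]) for c in scored) / len(scored))

weights = relative_adaptiveness(REF_COUNTS, CODON_TO_AA)
print(cai(["AAG", "GAG"], weights))  # all preferred codons
print(cai(["AAA", "GAA"], weights))  # all less-preferred codons
```

A sequence built entirely from the most-used codons scores 1.0, and rarer choices pull the score down. The reported 0.40 is a Spearman rank correlation involving CAI, not a CAI value itself.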

Where This Stands

This is not a polished success story. It’s a transparent account of what worked, what surprised them, and what they’d do differently. The code is runnable, the results are reproducible, and the cost is absurdly low. $165 for training mRNA language models across 25 species is a bargain compared to what pharma companies spend on proprietary tools.

The species-conditioned system is something I haven’t seen any other open-source project offer. You can optimize codons for E. coli, human cells, yeast, or any of the other species they trained on, 25 in total. That’s genuinely useful for researchers working on protein expression.

What’s next? OpenMed plans to release the models on Hugging Face and integrate them into a web interface. If they pull that off, we’ll have a complete open-source pipeline from protein concept to synthesis-ready DNA. That’s the kind of tool that accelerates drug discovery and synthetic biology.
