Benchmarking Biology: How Owkin is deepening its collaboration with NVIDIA towards the next frontier of Biological Artificial Superintelligence

Building on our January 2026 collaboration announcement with NVIDIA, we are benchmarking state-of-the-art NVIDIA models on rigorous biological reasoning tasks to fuel the development of K Pro, our AI Scientist. Rigorous evaluation is how we earn confidence in the models that drug discovery and development depend on.

Five months ago, we announced our collaboration with NVIDIA to accelerate frontier model development for Biological Artificial Superintelligence, using NVIDIA NeMo RL to strengthen OwkinZero, our biological large reasoning model and one of the specialist models orchestrated within K Pro. As we head into BIO 2026 in San Diego, we want to share the next layer of that work, and where we are taking it.

The problem with evaluating AI models in biology

General AI benchmarks are not biology benchmarks. A model that performs well on coding, math, legal reasoning, or general knowledge may perform surprisingly poorly when asked to reason about biology. While those fields operate within structured, human-defined rule sets, biological inference requires navigating emergent properties, incomplete datasets, and chaotic systems. What works well in deterministic environments fails when confronted with the multi-layered complexity of human biology.

And in drug discovery, the liability of a poorly reasoning LLM extends beyond operational inefficiencies like missed targets and failed discovery programs. Crucially, such models lack the explainability required to rationalize why a predicted molecule is a viable candidate, ultimately undermining scientific validity and the validation process. This is why evaluation matters so much to us, and why we invest in building rigorous, biology-specific benchmarks rather than relying on general leaderboards as a proxy for scientific performance.

What we have been doing: benchmarking Nemotron™ on biological reasoning

The newest work in our NVIDIA collaboration is a benchmark study. We evaluated NVIDIA Nemotron-3-Nano-30B-A3B-BF16 on a Q&A dataset derived from PharmacoDB, a publicly available pharmacogenomics database. The benchmark was designed to test biological reasoning at multiple levels of difficulty, employing stringent split tests that challenged models with every combination of seen and unseen drugs and cell lines. We evaluated it alongside Qwen and a set of leading state-of-the-art commercial models.
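To make the idea of seen and unseen splits concrete, here is a minimal Python sketch of one common way to build them. It is an illustration under stated assumptions, not the procedure used in the study: the function name `split_by_entity`, the `holdout_frac` parameter, the bucket labels, and the toy `drug_*` / `cell_*` identifiers are all hypothetical. The sketch holds out a fraction of drugs and cell lines entirely, then sorts every remaining pair into a training set or one of four test buckets.

```python
import random
from collections import defaultdict


def split_by_entity(pairs, holdout_frac=0.25, seed=0):
    """Split (drug, cell_line) pairs into train and four seen/unseen test buckets.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    drugs = sorted({d for d, _ in pairs})
    cells = sorted({c for _, c in pairs})

    # Entities held out here never appear in the training set.
    unseen_drugs = set(rng.sample(drugs, max(1, int(len(drugs) * holdout_frac))))
    unseen_cells = set(rng.sample(cells, max(1, int(len(cells) * holdout_frac))))

    train = []
    buckets = defaultdict(list)
    for drug, cell in pairs:
        d_unseen = drug in unseen_drugs
        c_unseen = cell in unseen_cells
        if d_unseen or c_unseen:
            key = "{}_drug/{}_cell".format(
                "unseen" if d_unseen else "seen",
                "unseen" if c_unseen else "seen",
            )
            buckets[key].append((drug, cell))
        elif rng.random() < holdout_frac:
            # Held-out pair whose drug and cell line are both trainable.
            buckets["seen_drug/seen_cell"].append((drug, cell))
        else:
            train.append((drug, cell))

    # Repair step: a "seen/seen" test pair is only valid if both of its
    # entities really occur in training; otherwise move it back to train.
    train_drugs = {d for d, _ in train}
    train_cells = {c for _, c in train}
    kept = []
    for drug, cell in buckets.pop("seen_drug/seen_cell", []):
        if drug in train_drugs and cell in train_cells:
            kept.append((drug, cell))
        else:
            train.append((drug, cell))
            train_drugs.add(drug)
            train_cells.add(cell)
    if kept:
        buckets["seen_drug/seen_cell"] = kept
    return train, dict(buckets)


# Toy usage on a full grid of made-up identifiers (not PharmacoDB data).
pairs = [(f"drug_{i}", f"cell_{j}") for i in range(8) for j in range(8)]
train, buckets = split_by_entity(pairs, holdout_frac=0.25, seed=1)
for name, items in sorted(buckets.items()):
    print(name, len(items))
```

Reporting accuracy per bucket, rather than as a single aggregate, separates what a model has memorised about familiar drugs and cell lines from how well it generalises to new ones. On sparse real-world data, a cell line or drug that is not held out may still end up with no training pairs, so a production split would need additional checks.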

The goal was straightforward: to understand how well these models perform on the kind of biological reasoning that actually matters for K Pro, and to generate concrete, domain-specific evidence that can guide the next phase of model development.

What’s next with BioNeMo Agent Toolkit

While the general-purpose capabilities of Nemotron-3-Nano-30B-A3B-BF16 provided the initial baseline for AI workflows, our next step centers on deploying NVIDIA BioNeMo Agent Toolkit to unlock highly specialized, domain-specific biological intelligence and accelerate our drug discovery and development pipeline.

Why this matters for drug discovery and development

Pharmaceutical research demands an unusual degree of trust from any AI system it relies on. Before a model's output can inform a discovery decision, we need to be able to evaluate it transparently, adapt it to specific biological contexts, audit how it reasons, and understand precisely where it performs well and where it fails, whatever its provenance. This is why we benchmark models the way we do. Rigorous, domain-grounded evaluation surfaces where today's models are strong, where gaps remain, and which targeted improvements would most benefit life sciences applications. Models that can be examined and adapted in depth, such as NVIDIA Nemotron, lend themselves particularly well to this work, and the biology-specific evidence we generate can in turn help inform the continued development of NVIDIA models for life sciences. It is also a natural extension of our collaboration with NVIDIA: applying one of Owkin’s core strengths – rigorous, domain-grounded evaluation – to the hardest problems in biology.

What this means for K Pro, our AI Scientist

Every improvement in the biological model ecosystem feeds directly into K Pro, which draws on a range of specialist and general-purpose models, from our own OwkinZero to rigorously evaluated frontier foundation models, to reason across scientific tasks. The AI Scientist we are building requires models that do not just retrieve information but autonomously explore the research space: reasoning over complex datasets, connecting a genomic variant to a clinical outcome, linking a molecular structure to a therapeutic hypothesis, and synthesising evidence across modalities faster than any single human researcher could process it.

Benchmarking is not just an internal exercise. It is the scientific infrastructure that makes it possible to build an AI Scientist we can trust in drug discovery: it is how we decide which model to rely on for which task, knowing exactly what it can do, where it needs support, and how to improve it systematically.

As we deepen this collaboration with NVIDIA, the ambition is clear. We want to bring specialist biological models to a level where they can serve as the reasoning backbone for autonomous AI scientists, trained and refined through shared expertise, evaluated against the hardest problems in biology, and ultimately working to automate R&D and accelerate the path from scientific question to treatment for patients.