February 16, 2026
5 mins

Validating Paper-to-Skills: Double/Debiased Machine Learning Case Study

At Owkin, we believe strongly in research reproducibility. Like many academic labs, we strive to publish our methods with clean, well-documented source code whenever possible. This commitment to open science helps advance the field and enables researchers to build on each other's work.

However, the reality is that not all papers come with accessible implementations. Sometimes the core methodology lacks published code entirely. Other times, the main method is available, but important "satellite" techniques, such as data preprocessing steps, benchmarking procedures, or supplementary analyses, remain documented only in the paper's methods section.

This gap inspired us to build Paper-to-Skills, a tool that extracts methodologies from scientific papers and converts them into executable skills for AI coding agents like Claude Code.

But a critical question remained: How accurate are the extracted implementations compared to the original authors' code?

To answer this, we conducted a validation study using one of the most mathematically sophisticated papers in causal inference.

Coding Agents and Skills

AI coding agents (like Claude Code) can now write and execute code based on natural language instructions. By giving them specialized "skills" (reusable methodologies extracted from papers), researchers can apply complex computational methods without deep programming expertise. We're bringing this capability to scientific research.

The Test Case: Double/Debiased Machine Learning

We chose to validate Paper-to-Skills with Chernozhukov et al.'s "Double/Debiased Machine Learning for Treatment and Structural Parameters" (2018, The Econometrics Journal). This paper presents a rigorous framework for estimating causal treatment effects in the presence of high-dimensional confounders, a common challenge in clinical trial analysis and observational studies.
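
Concretely, the partially linear regression (PLR) model at the heart of the paper takes the form below; this is the paper's standard formulation, reproduced here for reference:

```latex
% Partially linear regression (PLR) model, Chernozhukov et al. (2018).
% Y is the outcome, D the treatment, X the high-dimensional confounders.
\begin{aligned}
  Y &= D\theta_0 + g_0(X) + U, \qquad \mathbb{E}[U \mid X, D] = 0, \\
  D &= m_0(X) + V,             \qquad \mathbb{E}[V \mid X] = 0.
\end{aligned}
```

The causal parameter of interest is θ₀, while g₀ and m₀ are unknown nuisance functions estimated with flexible machine learning methods.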

Why this paper?

  1. Mathematical complexity: The methodology involves Neyman orthogonal scores, cross-fitting procedures, and influence-function-based standard errors. These concepts are theoretically deep and practically challenging to implement correctly.
  2. Wide adoption: The method has become a standard tool in econometrics and biostatistics, with an official Python implementation (the doubleml package) maintained by the original authors.
  3. Published source code exists: This allowed us to validate our extraction against the definitive implementation, the gold standard for comparison.

Important note: When official source code exists (as in this case), we always recommend using it. This validation was specifically designed to test Paper-to-Skills' accuracy for scenarios where code is not available.

Methodology

We followed a rigorous validation protocol:

  1. Extraction: We used Paper-to-Skills to extract the Double/Debiased ML methodology from the Chernozhukov et al. paper, generating executable code for the Partially Linear Regression (PLR) estimator.
  2. Application: We applied both the Paper-to-Skills-generated code and the official doubleml package to three real-world clinical trials (a minimal sketch of the official-package call appears after the figure below):
    • ACTG 175: Large HIV clinical trial (N=2,139)
    • Burn trial: Burn injury treatment study (N=154)
    • Licorice trial: Licorice root for ulcer prevention (N=233)
  3. Comparison: We compared treatment effect estimates (θ), standard errors (SE), and p-values between the two implementations.
Figure: Screenshot of the executable code for the Partially Linear Regression (PLR) estimator generated using Paper-to-Skills.
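
For readers who want to see what the official-package side of the comparison involves, here is a minimal sketch of a doubleml PLR fit with random forest nuisance learners. The toy data, learner settings, and hyperparameters are illustrative placeholders, not the trial data or our exact configuration, and argument names can vary across doubleml versions, so check the documentation for your installation:

```python
# Minimal sketch of a doubleml PLR fit (illustrative data and settings).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from doubleml import DoubleMLData, DoubleMLPLR

# Toy data standing in for a trial's covariates, treatment, and outcome.
rng = np.random.default_rng(42)
n, p = 500, 20
x = rng.normal(size=(n, p))
d = x[:, 0] + rng.normal(size=n)          # treatment confounded by x
y = 0.5 * d + x[:, 1] + rng.normal(size=n)

data = DoubleMLData.from_arrays(x, y, d)

# Nuisance learners for E[Y|X] and E[D|X] (the "PLR-RF" configuration).
ml_l = RandomForestRegressor(n_estimators=100, random_state=0)
ml_m = RandomForestRegressor(n_estimators=100, random_state=0)

# n_rep=20 repeats the cross-fitting and aggregates across repetitions,
# mirroring the 20-repetition protocol discussed in the Results below.
dml_plr = DoubleMLPLR(data, ml_l, ml_m, n_folds=5, n_rep=20)
dml_plr.fit()
print(dml_plr.summary)                    # theta, SE, t-stat, p-value
```
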
Results

The results demonstrate remarkable concordance between the Paper-to-Skills extraction and the official implementation:

Detailed Comparison
| Trial | N | Method | Metric | Paper-to-Skills | Official Package | Absolute Diff | Relative Diff |
|---|---|---|---|---|---|---|---|
| ACTG 175 | 2,139 | PLR-RF | θ | 50.90 | 50.97 | 0.07 | 0.1% |
| ACTG 175 | 2,139 | PLR-RF | SE | 5.18 | 5.19 | 0.01 | 0.2% |
| Burn | 154 | PLR-Lasso | θ | -0.154 | -0.150 | 0.004 | 2.6% |
| Burn | 154 | PLR-Lasso | p-value | 0.0383 | 0.0421 | 0.0038 | 9.9% |
| Licorice | 233 | PLR-Lasso | θ | -0.099 | -0.100 | 0.001 | 1.0% |
| Licorice | 233 | PLR-Lasso | p-value | 0.0653 | 0.0605 | 0.0048 | 7.9% |

Table: Comparison of Paper-to-Skills vs. official package results.

PLR-RF: Partially Linear Regression with Random Forest for nuisance estimation

PLR-Lasso: Partially Linear Regression with Lasso for nuisance estimation

Key Observations
  1. Treatment effect estimates (θ) matched within 0.1-2.6% across all trials. For the largest trial (ACTG 175), the difference was just 0.07 units on an estimate of ~51.
  2. Standard errors were virtually identical (0.2% difference for ACTG 175).
  3. P-values showed slightly larger relative differences (7.9-9.9%) but remained substantively identical: both implementations led to the same statistical conclusions.
  4. Source of differences: The minor variations stem from random seed handling. The Paper-to-Skills implementation used a manual median-of-20 repetitions, while the doubleml package uses its internal n_rep=20 aggregation; a sketch of this aggregation rule follows the figure below. Both approaches are methodologically sound; the differences reflect stochastic variation, not implementation errors.
Figure: Comparison of treatment effect estimates (θ) between both implementations.
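
To make the aggregation point concrete, here is a minimal sketch of the median-of-repetitions rule from the paper: the point estimate is the median of the per-repetition estimates, and the variance is the median of the per-repetition variances inflated by the squared deviation from that median. The numbers in the example are made up for illustration, not our trial results:

```python
# Median aggregation across cross-fitting repetitions, following the
# rule in Chernozhukov et al. (2018).
import numpy as np

def median_aggregate(thetas, ses):
    """Aggregate per-repetition estimates and standard errors."""
    thetas = np.asarray(thetas, dtype=float)
    ses = np.asarray(ses, dtype=float)
    theta_med = np.median(thetas)
    # Inflate each repetition's variance by its deviation from the median.
    var_med = np.median(ses**2 + (thetas - theta_med)**2)
    return theta_med, np.sqrt(var_med)

# Illustrative values: 20 repetitions differing only in cross-fitting seed.
thetas = 50.9 + 0.3 * np.random.default_rng(0).normal(size=20)
ses = np.full(20, 5.18)
theta, se = median_aggregate(thetas, ses)
print(f"theta = {theta:.2f}, SE = {se:.2f}")
```
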
Technical Deep Dive: What Was Implemented?

The Paper-to-Skills extraction successfully reproduced several sophisticated components:

  • Cross-fitting: Data splitting to avoid overfitting bias in nuisance parameter estimation
  • Neyman orthogonality: Score function construction that is locally insensitive to nuisance parameter estimation errors
  • Debiasing: Correction for regularization bias in machine learning estimators
  • Influence function standard errors: Asymptotically valid inference accounting for estimation uncertainty

These are not trivial details. They represent the core theoretical innovations of the Chernozhukov et al. framework, and getting them right requires careful attention to the paper's mathematical exposition.
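
To show how these pieces fit together, the sketch below implements cross-fitting, the Neyman-orthogonal "partialling-out" score, and the influence-function standard error from scratch on synthetic data. It is our own simplified illustration (sklearn learners, a single cross-fitting repetition), not the Paper-to-Skills output or the doubleml internals:

```python
# From-scratch sketch of cross-fitted PLR with a Neyman-orthogonal score.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p, theta_true = 1000, 10, 0.5
x = rng.normal(size=(n, p))
d = x[:, 0] + rng.normal(size=n)             # treatment confounded by x
y = theta_true * d + x[:, 0] ** 2 + rng.normal(size=n)

# Cross-fitting: each fold's nuisance predictions come from models fit
# on the other folds, removing own-observation overfitting bias.
y_hat, d_hat = np.empty(n), np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    y_hat[test] = RandomForestRegressor(random_state=0).fit(
        x[train], y[train]).predict(x[test])
    d_hat[test] = RandomForestRegressor(random_state=0).fit(
        x[train], d[train]).predict(x[test])

# Neyman-orthogonal "partialling-out" score: regress outcome residuals
# on treatment residuals, which is locally insensitive to small errors
# in the nuisance estimates (the debiasing step).
u, v = y - y_hat, d - d_hat
theta_hat = (v @ u) / (v @ v)

# Influence-function standard error: sigma^2 = E[psi^2] / J^2, J = E[v^2].
psi = (u - theta_hat * v) * v
J = np.mean(v**2)
se = np.sqrt(np.mean(psi**2) / J**2 / n)
print(f"theta = {theta_hat:.3f} (true {theta_true}), 95% CI +/- {1.96*se:.3f}")
```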

Implications

This validation demonstrates that Paper-to-Skills can accurately extract and implement complex statistical methodologies from scientific papers. The extracted code is not a rough approximation: it reproduces the mathematical framework and produces results statistically equivalent to those of an expert-maintained implementation.

When to Use Paper-to-Skills vs. Official Code

Use official implementations when available:

  • They represent the authors' definitive implementation
  • They're maintained and updated
  • They're the standard for reproducibility and methods comparisons

Use Paper-to-Skills when:

  • No official implementation exists for the method
  • You need satellite techniques (preprocessing, benchmarking) not covered in the code
  • You want to quickly prototype or understand a method before committing to a full implementation
  • You're working with legacy papers where code has been lost or deprecated

Conclusion

The Double/Debiased ML validation demonstrates that Paper-to-Skills can reliably extract sophisticated methodologies from scientific papers and generate code that matches expert implementations. While we always recommend using official source code when available, this validation gives us confidence that Paper-to-Skills fills a genuine gap for methods where code is unavailable.

We're continuing to test and improve Paper-to-Skills across diverse domains, from bioinformatics and computational biology to machine learning and beyond. The tool is currently in beta, and we welcome feedback from the research community.

Try Paper-to-Skills: https://paper2skills.com

Reference:

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, James Robins. "Double/debiased machine learning for treatment and structural parameters." The Econometrics Journal, Volume 21, Issue 1, February 2018, Pages C1–C68. https://doi.org/10.1111/ectj.12097

Authors

Davide Mantiero
