February 16, 2026
5 mins

Validating Paper-to-Skills: Double/Debiased Machine Learning Case Study

At Owkin, we believe strongly in research reproducibility. Like many academic labs, we strive to publish our methods with clean, well-documented source code whenever possible. This commitment to open science helps advance the field and enables researchers to build on each other's work.

However, the reality is that not all papers come with accessible implementations. Sometimes the core methodology lacks published code entirely. Other times, the main method is available, but important "satellite" techniques, such as data preprocessing steps, benchmarking procedures, or supplementary analyses, remain documented only in the paper's methods section.

This gap inspired us to build Paper-to-Skills, a tool that extracts methodologies from scientific papers and converts them into executable skills for AI coding agents like Claude Code.

But a critical question remained: How accurate are the extracted implementations compared to the original authors' code?

To answer this, we conducted a validation study using one of the most mathematically sophisticated papers in causal inference.

Coding Agents and Skills

AI coding agents (like Claude Code) can now write and execute code based on natural language instructions. By giving them specialized "skills" (reusable methodologies extracted from papers), researchers can apply complex computational methods without deep programming expertise. We're bringing this capability to scientific research.

The Test Case: Double/Debiased Machine Learning

We chose to validate Paper-to-Skills with Chernozhukov et al.'s "Double/Debiased Machine Learning for Treatment and Structural Parameters" (2018, The Econometrics Journal). This paper presents a rigorous framework for estimating causal treatment effects in the presence of high-dimensional confounders, a common challenge in clinical trial analysis and observational studies.
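
Concretely, the partially linear regression (PLR) model at the heart of the paper takes the form below; this is the paper's standard formulation, reproduced here for reference:

```latex
% Partially linear regression (PLR) model, Chernozhukov et al. (2018).
% Y is the outcome, D the treatment, X the high-dimensional confounders.
\begin{aligned}
  Y &= D\theta_0 + g_0(X) + U, \qquad \mathbb{E}[U \mid X, D] = 0, \\
  D &= m_0(X) + V,             \qquad \mathbb{E}[V \mid X] = 0.
\end{aligned}
```

The causal parameter of interest is θ₀, while g₀ and m₀ are unknown nuisance functions estimated with flexible machine learning methods.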

Why this paper?

  1. Mathematical complexity: The methodology involves Neyman orthogonal scores, cross-fitting procedures, and influence-function-based standard errors. These concepts are theoretically deep and practically challenging to implement correctly.
  2. Wide adoption: The method has become a standard tool in econometrics and biostatistics, with an official Python implementation (the doubleml package) maintained by the original authors.
  3. Published source code exists: This allowed us to validate our extraction against the definitive implementation, the gold standard for comparison.

Important note: When official source code exists (as in this case), we always recommend using it. This validation was specifically designed to test Paper-to-Skills' accuracy for scenarios where code is not available.

Methodology

We followed a rigorous validation protocol:

  1. Extraction: We used Paper-to-Skills to extract the Double/Debiased ML methodology from the Chernozhukov et al. paper, generating executable code for the Partially Linear Regression (PLR) estimator.
  2. Application: We applied both the Paper-to-Skills-generated code and the official doubleml package to three real-world clinical trials (a minimal sketch of the official-package call appears after the figure below):
    • ACTG 175: Large HIV clinical trial (N=2,139)
    • Burn trial: Burn injury treatment study (N=154)
    • Licorice trial: Licorice root for ulcer prevention (N=233)
  3. Comparison: We compared treatment effect estimates (θ), standard errors (SE), and p-values between the two implementations.
Figure: Screenshot of the executable code for the Partially Linear Regression (PLR) estimator generated using Paper-to-Skills.
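
For readers who want to see what the official-package side of the comparison involves, here is a minimal sketch of a doubleml PLR fit with random forest nuisance learners. The toy data, learner settings, and hyperparameters are illustrative placeholders, not the trial data or our exact configuration, and argument names can vary across doubleml versions, so check the documentation for your installation:

```python
# Minimal sketch of a doubleml PLR fit (illustrative data and settings).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from doubleml import DoubleMLData, DoubleMLPLR

# Toy data standing in for a trial's covariates, treatment, and outcome.
rng = np.random.default_rng(42)
n, p = 500, 20
x = rng.normal(size=(n, p))
d = x[:, 0] + rng.normal(size=n)          # treatment confounded by x
y = 0.5 * d + x[:, 1] + rng.normal(size=n)

data = DoubleMLData.from_arrays(x, y, d)

# Nuisance learners for E[Y|X] and E[D|X] (the "PLR-RF" configuration).
ml_l = RandomForestRegressor(n_estimators=100, random_state=0)
ml_m = RandomForestRegressor(n_estimators=100, random_state=0)

# n_rep=20 repeats the cross-fitting and aggregates across repetitions,
# mirroring the 20-repetition protocol discussed in the Results below.
dml_plr = DoubleMLPLR(data, ml_l, ml_m, n_folds=5, n_rep=20)
dml_plr.fit()
print(dml_plr.summary)                    # theta, SE, t-stat, p-value
```
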
Results

The results demonstrate remarkable concordance between the Paper-to-Skills extraction and the official implementation:

Detailed Comparison
| Trial | N | Method | Metric | Paper-to-Skills | Official Package | Absolute Diff | Relative Diff |
|---|---|---|---|---|---|---|---|
| ACTG 175 | 2,139 | PLR-RF | θ | 50.90 | 50.97 | 0.07 | 0.1% |
| ACTG 175 | 2,139 | PLR-RF | SE | 5.18 | 5.19 | 0.01 | 0.2% |
| Burn | 154 | PLR-Lasso | θ | -0.154 | -0.150 | 0.004 | 2.6% |
| Burn | 154 | PLR-Lasso | p-value | 0.0383 | 0.0421 | 0.0038 | 9.9% |
| Licorice | 233 | PLR-Lasso | θ | -0.099 | -0.100 | 0.001 | 1.0% |
| Licorice | 233 | PLR-Lasso | p-value | 0.0653 | 0.0605 | 0.0048 | 7.9% |

Table: Comparison of Paper-to-Skills vs. official package results.

PLR-RF: Partially Linear Regression with Random Forest for nuisance estimation

PLR-Lasso: Partially Linear Regression with Lasso for nuisance estimation

Key Observations
  1. Treatment effect estimates (θ) matched within 0.1-2.6% across all trials. For the largest trial (ACTG 175), the difference was just 0.07 units on an estimate of ~51.
  2. Standard errors were virtually identical (0.2% difference for ACTG 175).
  3. P-values showed slightly larger relative differences (7.9-9.9%) but remained substantively identical: both implementations led to the same statistical conclusions.
  4. Source of differences: The minor variations stem from random seed handling. The Paper-to-Skills implementation used a manual median-of-20 repetitions, while the doubleml package uses its internal n_rep=20 aggregation; a sketch of this aggregation rule follows the figure below. Both approaches are methodologically sound; the differences reflect stochastic variation, not implementation errors.
Figure: Comparison of treatment effect estimates (θ) between both implementations.
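
To make the aggregation point concrete, here is a minimal sketch of the median-of-repetitions rule from the paper: the point estimate is the median of the per-repetition estimates, and the variance is the median of the per-repetition variances inflated by the squared deviation from that median. The numbers in the example are made up for illustration, not our trial results:

```python
# Median aggregation across cross-fitting repetitions, following the
# rule in Chernozhukov et al. (2018).
import numpy as np

def median_aggregate(thetas, ses):
    """Aggregate per-repetition estimates and standard errors."""
    thetas = np.asarray(thetas, dtype=float)
    ses = np.asarray(ses, dtype=float)
    theta_med = np.median(thetas)
    # Inflate each repetition's variance by its deviation from the median.
    var_med = np.median(ses**2 + (thetas - theta_med)**2)
    return theta_med, np.sqrt(var_med)

# Illustrative values: 20 repetitions differing only in cross-fitting seed.
thetas = 50.9 + 0.3 * np.random.default_rng(0).normal(size=20)
ses = np.full(20, 5.18)
theta, se = median_aggregate(thetas, ses)
print(f"theta = {theta:.2f}, SE = {se:.2f}")
```
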
Technical Deep Dive: What Was Implemented?

The Paper-to-Skills extraction successfully reproduced several sophisticated components:

  • Cross-fitting: Data splitting to avoid overfitting bias in nuisance parameter estimation
  • Neyman orthogonality: Score function construction that is locally insensitive to nuisance parameter estimation errors
  • Debiasing: Correction for regularization bias in machine learning estimators
  • Influence function standard errors: Asymptotically valid inference accounting for estimation uncertainty

These are not trivial details. They represent the core theoretical innovations of the Chernozhukov et al. framework, and getting them right requires careful attention to the paper's mathematical exposition.
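
To show how these pieces fit together, the sketch below implements cross-fitting, the Neyman-orthogonal "partialling-out" score, and the influence-function standard error from scratch on synthetic data. It is our own simplified illustration (sklearn learners, a single cross-fitting repetition), not the Paper-to-Skills output or the doubleml internals:

```python
# From-scratch sketch of cross-fitted PLR with a Neyman-orthogonal score.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p, theta_true = 1000, 10, 0.5
x = rng.normal(size=(n, p))
d = x[:, 0] + rng.normal(size=n)             # treatment confounded by x
y = theta_true * d + x[:, 0] ** 2 + rng.normal(size=n)

# Cross-fitting: each fold's nuisance predictions come from models fit
# on the other folds, removing own-observation overfitting bias.
y_hat, d_hat = np.empty(n), np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    y_hat[test] = RandomForestRegressor(random_state=0).fit(
        x[train], y[train]).predict(x[test])
    d_hat[test] = RandomForestRegressor(random_state=0).fit(
        x[train], d[train]).predict(x[test])

# Neyman-orthogonal "partialling-out" score: regress outcome residuals
# on treatment residuals, which is locally insensitive to small errors
# in the nuisance estimates (the debiasing step).
u, v = y - y_hat, d - d_hat
theta_hat = (v @ u) / (v @ v)

# Influence-function standard error: sigma^2 = E[psi^2] / J^2, J = E[v^2].
psi = (u - theta_hat * v) * v
J = np.mean(v**2)
se = np.sqrt(np.mean(psi**2) / J**2 / n)
print(f"theta = {theta_hat:.3f} (true {theta_true}), 95% CI +/- {1.96*se:.3f}")
```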

Implications

This validation demonstrates that Paper-to-Skills can accurately extract and implement complex statistical methodologies from scientific papers. The extracted code is not a rough approximation: it reproduces the mathematical framework and produces results statistically equivalent to those of an expert-maintained implementation.

When to Use Paper-to-Skills vs. Official Code

Use official implementations when available:

  • They represent the authors' definitive implementation
  • They're maintained and updated
  • They're the standard for reproducibility and methods comparisons

Use Paper-to-Skills when:

  • No official implementation exists for the method
  • You need satellite techniques (preprocessing, benchmarking) not covered in the code
  • You want to quickly prototype or understand a method before committing to a full implementation
  • You're working with legacy papers where code has been lost or deprecated

Conclusion

The Double/Debiased ML validation demonstrates that Paper-to-Skills can reliably extract sophisticated methodologies from scientific papers and generate code that matches expert implementations. While we always recommend using official source code when available, this validation gives us confidence that Paper-to-Skills fills a genuine gap for methods where code is unavailable.

We're continuing to test and improve Paper-to-Skills across diverse domains, from bioinformatics and computational biology to machine learning and beyond. The tool is currently in beta, and we welcome feedback from the research community.

Try Paper-to-Skills: https://paper2skills.com

Reference:

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, James Robins. "Double/debiased machine learning for treatment and structural parameters." The Econometrics Journal, Volume 21, Issue 1, February 2018, Pages C1–C68. https://doi.org/10.1111/ectj.12097

Authors

Davide Mantiero
