gemma-4-e4b-scientific-reviewer

Fine-tuned Gemma 4 E4B for academic peer review on CS/AI conference papers — generates a structured review (Summary · Strengths · Weaknesses · Technical Soundness · Clarity · Significance) and an accept/reject decision.

  • Base model: unsloth/gemma-4-E4B-it
  • Method: Sequence-level distillation from Gemini 3.1 Pro (LoRA SFT on teacher-generated reviews, r=32, 5 epochs)
  • Teacher: Gemini 3.1 Pro Preview (gemini-3.1-pro-preview) via Vertex AI
  • Training data: 334 (paper, teacher review, decision) triples. Prompts were sampled from PeerRead ICLR 2017 and OpenReview ICLR 2019-2021; targets were generated by Gemini under the same reviewer system prompt
  • Benchmark: 95 PeerRead papers (ACL 2017 + CoNLL 2016 + ICLR 2017 test/dev), with zero train-set overlap

Performance

The fine-tuned and base models were evaluated with the same benchmark prompt.

Fine-tuned vs Base

Summary

| Metric | Fine-tuned | Base (gemma-4-E4B-it) |
|---|---|---|
| Decision F1 | 0.892 | 0.843 |
| Accuracy | 0.811 | 0.737 |
| Precision | 0.860 | 0.848 |
| Recall | 0.925 | 0.838 |
| Predicted accept rate | 90.5% | 83.2% |
| Avg inference time | ~2 s / paper | ~2 s / paper |

Ground-truth accept rate on this benchmark: 84.2%.

Per-venue F1 (fine-tuned): ICLR 2017 0.922 · ACL 2017 0.762 · CoNLL 2016 0.500 (n=3).
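
The summary metrics treat ACCEPT as the positive class. Below is a minimal sketch of how they follow from a confusion matrix. The function name `decision_metrics` is hypothetical, and the counts are not taken from the metrics file: they are inferred from the reported rates on 95 papers, so treat them as an assumption.

```python
def decision_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Binary decision metrics with ACCEPT as the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "predicted_accept_rate": (tp + fp) / total,
    }


# Counts inferred from the reported fine-tuned rates (assumption, not from the repo):
# 80 of 95 papers accepted in the ground truth, 86 of 95 predicted as accepted.
fine_tuned = decision_metrics(tp=74, fp=12, fn=6, tn=3)

# Rounded to three decimals, these reproduce the fine-tuned column of the table.
print({name: round(value, 3) for name, value in fine_tuned.items()})
```

With only 15 rejected papers in the benchmark, a handful of flipped decisions moves precision and accuracy noticeably, which is worth keeping in mind when comparing the two columns.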

Calibration

[Figure: calibration plot]

A key property of the distilled model is that its decisions stay reasonably calibrated. The teacher (Gemini 3.1 Pro) accepts ~74% of papers on the training prompts rather than accepting everything, and the student inherits this behavior: its predicted accept rate is 90.5% against a ground-truth rate of 84.2%, a 6.3 pp gap on a test set that is already skewed toward acceptance.

Usage

Serve with vLLM:

python -m vllm.entrypoints.openai.api_server \
  --model NhatCuong22/gemma-4-e4b-scientific-reviewer \
  --dtype bfloat16 \
  --max-model-len 12288 \
  --gpu-memory-utilization 0.88 \
  --max-num-seqs 16 \
  --enable-prefix-caching

Call with the benchmark prompt:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM = (
    "You are an experienced academic peer reviewer for top AI/ML conferences "
    "(ICLR, NeurIPS, ACL, CVPR). Produce structured and substantive reviews.\n\n"
    "RULES:\n"
    "1. Be SPECIFIC - reference concrete details from the paper.\n"
    "2. Be FAIR - list strengths AND weaknesses honestly.\n"
    "3. Your final recommendation should reflect real conference accept rates (~30-50%). "
    "Reserve rejection for papers with major methodological issues."
)

USER_TMPL = """Review the following paper. Format:

## Summary
<1-2 paragraphs on contribution>

## Strengths
- <specific strength 1>

## Weaknesses
- <specific weakness 1>

## Technical Soundness
<paragraph>

## Clarity and Writing
<paragraph>

## Significance
<paragraph>

## Final Recommendation
End with EXACTLY ONE line:
    I recommend this paper for ACCEPTANCE.
    I recommend this paper for REJECTION.

---

Title: {title}
Abstract: {abstract}
Paper content:
{body}
"""

resp = client.chat.completions.create(
    model="NhatCuong22/gemma-4-e4b-scientific-reviewer",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": USER_TMPL.format(title=..., abstract=..., body=...)},
    ],
    max_tokens=1400,
    temperature=0.2,
    top_p=0.9,
)
print(resp.choices[0].message.content)
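
The decision is carried by the final recommendation line of the review. A minimal sketch of extracting it, reusing the regex from the filtering step in the Method section. The helper name `parse_decision` is hypothetical, and taking the last match is an assumption for the case where the phrase appears more than once.

```python
import re
from typing import Optional

DECISION_RE = re.compile(r"I recommend this paper for (ACCEPTANCE|REJECTION)\.?")


def parse_decision(review: str) -> Optional[str]:
    """Return 'ACCEPTANCE' or 'REJECTION', or None if no decision line is found."""
    matches = DECISION_RE.findall(review)
    # Assumption: if the phrase occurs several times, the last one is the verdict.
    return matches[-1] if matches else None


sample = "## Final Recommendation\nI recommend this paper for ACCEPTANCE."
print(parse_decision(sample))
```

In practice `parse_decision(resp.choices[0].message.content)` would be applied to the response from the call above; a `None` result usually means the generation was cut off by `max_tokens` before the final line.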

Method: Gemini distillation

  1. Prompt pool: 434 unique papers from PeerRead ICLR 2017 + OpenReview ICLR 2019-2021 with their system/user prompts.
  2. Teacher generation: Gemini 3.1 Pro Preview produced one review per prompt on Vertex AI (temperature 0.3, thinking budget 1024). Total teacher tokens: ~6.5M in / ~1.6M out.
  3. Filtering: keep only rows where the regex `I recommend this paper for (ACCEPTANCE|REJECTION)\.?` parses cleanly; dedupe by review-text hash. Yields 334 train + 100 val.
  4. Student SFT: LoRA r=32, α=64 on Gemma 4 E4B via Unsloth; 5 epochs, cosine LR 2e-5, batch 8, bf16. Final train loss 2.628. Merged to 16-bit.
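
The filtering step (3) can be sketched as follows. This is an illustration, not the author's script: the row schema (a dict with a `review` key), the choice of SHA-256 as the hash, and the reading of "parses cleanly" as "exactly one match" are all assumptions.

```python
import hashlib
import re

DECISION_RE = re.compile(r"I recommend this paper for (ACCEPTANCE|REJECTION)\.?")


def filter_and_dedupe(rows):
    """Keep rows with exactly one parseable decision; drop duplicate review texts."""
    seen_hashes = set()
    kept = []
    for row in rows:
        matches = DECISION_RE.findall(row["review"])
        if len(matches) != 1:  # assumed meaning of "parses cleanly"
            continue
        digest = hashlib.sha256(row["review"].strip().encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # same review text already kept
            continue
        seen_hashes.add(digest)
        kept.append({**row, "decision": matches[0]})
    return kept


rows = [
    {"review": "Solid work. I recommend this paper for ACCEPTANCE."},
    {"review": "Solid work. I recommend this paper for ACCEPTANCE."},  # duplicate
    {"review": "The review stops before a verdict."},                  # no decision
    {"review": "Flawed setup. I recommend this paper for REJECTION."},
]
kept = filter_and_dedupe(rows)
print([row["decision"] for row in kept])
```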

The student inherits the teacher's reviewing style and calibrated decision distribution. By contrast, training directly on human PeerRead reviews tends to collapse to "accept everything" because the labels are heavily skewed toward acceptance (the test set is 84% accept).

Training Setup

| Setting | Value |
|---|---|
| Base model | unsloth/gemma-4-E4B-it (4-bit via Unsloth) |
| Method | LoRA (r=32, α=64, dropout=0, all linear projections) |
| Teacher | gemini-3.1-pro-preview (Vertex AI) |
| Training examples | 334 pairs (72.1% accept) + 100 val |
| Epochs | 5 |
| Steps | 210 |
| Effective batch | 8 (per-device 1 × grad-accum 8) |
| Max seq len | 8192 |
| LR | 2e-5 cosine, warmup 3% |
| Optimizer | AdamW 8-bit |
| Precision | bf16 |
| Seed | 42 |
| Hardware | 1 × RTX 4080 SUPER 32 GB |
| Wall-clock | 95 min training + 13 min merge |
| Training loss | 9.52 → 2.63 |
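
The step count in the table follows from the dataset size and batch settings. The sketch below checks that arithmetic and illustrates a cosine schedule with linear warmup. The function names are hypothetical, rounding a partial last batch up to a full optimizer step is an assumption, and the exact schedule used by the training framework may differ in detail.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Optimizer steps when a partial final batch still counts as one step."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


def lr_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.03):
    """Linear warmup followed by cosine decay to zero (a common form, assumed here)."""
    warmup_steps = max(1, math.ceil(warmup_ratio * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


steps = total_optimizer_steps(num_examples=334, per_device_batch=1, grad_accum=8, epochs=5)
print(steps)  # 210: ceil(334 / 8) = 42 steps per epoch, times 5 epochs
```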

Artifacts

  • model.safetensors — merged 16-bit weights
  • adapter_model.safetensors — LoRA adapter (for applying on top of the base)
  • benchmark/peerread_ft_full.jsonl — per-paper generations on the 95-paper test set
  • benchmark/peerread_ft_metrics.json — aggregated metrics
  • benchmark/peerread_base_full.jsonl — base-model predictions on the same test set
  • benchmark/peerread_base_metrics.json — base-model metrics

Known limitations

  • Benchmark is skewed toward acceptance (84.2%), so a model that always accepts scores a deceptively high F1 (about 0.914, with accuracy 0.842). The calibration plot is the more robust signal.
  • Training prompts are dominated by ICLR; performance on non-ICLR venues is lower (F1 0.762 on ACL 2017 and 0.500 on CoNLL 2016 with n=3, versus 0.922 on ICLR 2017).
  • Teacher and student share the Gemma/Gemini SentencePiece tokenizer family but are different architectures, and training uses only the teacher's generated text, so this is text-level SFT distillation without soft labels, not logit-level KD.
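
The always-accept baseline from the first limitation can be checked directly. This is a small sketch with a hypothetical function name; it relies only on the reported 84.2% ground-truth accept rate and on ACCEPT being the positive class.

```python
def always_accept_scores(accept_rate: float) -> dict:
    """Scores of a trivial model that predicts ACCEPT for every paper."""
    precision = accept_rate  # every prediction is ACCEPT, so precision equals the base rate
    recall = 1.0             # no accepted paper is ever missed
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "accuracy": accept_rate}


baseline = always_accept_scores(0.842)
print({name: round(value, 3) for name, value in baseline.items()})
```

Because this trivial baseline already scores highly on F1, comparing the predicted accept rate against the ground-truth rate gives a more informative picture than F1 alone on this benchmark.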

License

Gemma license, inherited from the base model.
