PrepareBuddy IELTS-2B (Qwen3.5): the "data can rival size" model
A small, fast, specialised content-generation model that produces IELTS Academic practice material across all four sections (Reading, Writing, Listening, Speaking) from a simple structured prompt. A fine-tune of Qwen3.5-2B (Apache-2.0), trained on PrepareBuddy's own curated IELTS content.
A content generator, not an assessment tool. It writes passages, transcripts, tasks, questions and answer keys. It does not score student work. It is a fine-tune of Qwen3.5-2B, not a from-scratch foundation model.
Why this model is interesting
This is the surprise of our 2B/4B/9B study. Fine-tuning on the same data took the 2B's verdict accuracy from 40% to 80%, matching and even edging out the 9B (a model 4.5× larger). The finding: fine-tuning's benefit was inversely related to base capability. It transformed the weak 2B and was flat on the bigger models. Data can rival size on a target skill.
But the same 2B stays weaker on facts and completion than the 4B/9B, which is exactly why you run it with grounding (generate against a real passage). Used that way, it's the most efficient choice in the family, and the best, cheapest verdict verifier (see the re-checking loop below). Full story: the technical report.
Where it fits best (real-world use cases)
Reach for the 2B when cost, speed, or footprint matter, and you can ground it:
- On-device / privacy-sensitive apps: runs in ~5 GB (or ~2 GB 4-bit) on a laptop or edge box; no cloud, data stays local.
- High-volume drafting on a budget: the cheapest model to mass-generate Writing/Speaking and grounded Reading.
- The verifier in a pipeline: its single best role is re-checking another model's answer keys (cheap, and the best checker in the family).
- Grounded generation: feed it a real passage and its weak from-scratch facts stop mattering.
Not the best pick for from-scratch fact-heavy passages or sentence completion; use the 4B/9B for those.
Pros & cons
| ✅ Pros | ⚠️ Cons |
|---|---|
| Cheapest & fastest; runs on a laptop/edge device | Weak from-scratch facts; needs grounding |
| Best, cheapest answer-key verifier in the family | Completion only ~37% in-passage (vs 100% on 4B/9B) |
| Fluent Writing & Speaking | Verdicts over-commit (tend to restate facts as TRUE) |
| 0 non-English-token leak; Apache-2.0; tiny footprint | Not for unsupervised fact-heavy generation |
Links
- Models: ielts-2b · ielts-4b · ielts-9b
- Apple Silicon / LM Studio (MLX): ielts-2b-mlx · ielts-4b-mlx
- Try the live demo: Hugging Face Space
- Full technical report & findings: ielts-qwen3.5
What it generates
| Section | Types | Output |
|---|---|---|
| Reading | TFNG, YNNG, MCQ, Sentence/Summary Completion, Matching, Long-form | passage + questions + answer key with justifications |
| Writing | Task 1 (chart), Task 2 (essay) | task prompt + word limit + timing (+ chart data for T1) |
| Listening | dialogue/monologue | transcript + questions + answer key (text for downstream TTS) |
| Speaking | Part 1, 2, 3 | examiner question / cue card + model answer |
Prompt format (not a chat model; use the tag prefix)
<TEST=IELTS><SECTION=READING><TYPE=TFNG><DIFF=medium><TOPIC=ocean currents> Generate a short passage with 4 True/False/Not Given statements and an answer key.
- SECTION = READING | WRITING | LISTENING | SPEAKING
- TYPE = (Reading) TFNG | YNNG | MCQ | MCQ_MULTI | SENTENCE_COMPLETION | SUMMARY_COMPLETION | MATCHING_HEADINGS | MATCHING_FEATURES | MATCHING_ENDINGS | LONGFORM; (Writing) TASK1 | TASK2; (Speaking) PART1 | PART2 | PART3; (Listening) LISTENING
- DIFF = easy | medium | hard
- Serve with enable_thinking=False: for this task, reasoning mode lowers verdict accuracy.
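The tag prefix is easy to get subtly wrong (for example, a Reading type paired with the Writing section), so a small builder that validates the tags can help. The following is a minimal sketch, not part of the released model or tooling: the names `SECTION_TYPES`, `DIFFICULTIES` and `build_prompt` are hypothetical, and the allowed values are simply copied from the list above.

```python
# Allowed tag values, copied from the prompt-format list in this card.
SECTION_TYPES = {
    "READING": {"TFNG", "YNNG", "MCQ", "MCQ_MULTI", "SENTENCE_COMPLETION",
                "SUMMARY_COMPLETION", "MATCHING_HEADINGS", "MATCHING_FEATURES",
                "MATCHING_ENDINGS", "LONGFORM"},
    "WRITING": {"TASK1", "TASK2"},
    "SPEAKING": {"PART1", "PART2", "PART3"},
    "LISTENING": {"LISTENING"},
}
DIFFICULTIES = {"easy", "medium", "hard"}


def build_prompt(section, type_, diff, topic, instruction):
    """Return the tag prefix followed by a free-text instruction.

    Hypothetical helper: raises ValueError when a tag is not in the
    lists above, so typos fail before the model is called.
    """
    if section not in SECTION_TYPES:
        raise ValueError(f"unknown SECTION: {section!r}")
    if type_ not in SECTION_TYPES[section]:
        raise ValueError(f"TYPE {type_!r} is not listed for SECTION {section}")
    if diff not in DIFFICULTIES:
        raise ValueError(f"unknown DIFF: {diff!r}")
    prefix = (f"<TEST=IELTS><SECTION={section}><TYPE={type_}>"
              f"<DIFF={diff}><TOPIC={topic}>")
    return f"{prefix} {instruction}"


if __name__ == "__main__":
    print(build_prompt(
        "READING", "TFNG", "medium", "ocean currents",
        "Generate a short passage with 4 True/False/Not Given statements "
        "and an answer key.",
    ))
```

Run as a script, this prints the same prompt shown in the example at the top of this section.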
Examples (real, unedited outputs from this 2B model; trimmed with […])
The 2B is fluent on Writing/Speaking and correct on grounded Reading. It is weaker on from-scratch facts and on completion, so the examples below lead with its strengths, and the Reading examples use the recommended grounded approach (keys verified against the passage).
Writing: Task 2 (temp 0.7)
Input
<TEST=IELTS><SECTION=WRITING><TYPE=TASK2><DIFF=medium><TOPIC=whether university education should be free> Generate an IELTS Academic Writing Task 2 question with word limit and timing.
Output
TASK:
Some people think that university education should be completely free for all students, while
others believe it is more practical to make it partly affordable through fees and taxes.
Discuss both views and give your own opinion.
WORD LIMIT: 250 words TIME: 40 minutes
Speaking: Part 2 (temp 0.7)
Output
CUE CARD:
Describe a memorable journey you have taken. You should say: what the destination was and when it
took place; what happened during the journey; how you felt; and what lasting impression it left.
MODEL ANSWER:
The most memorable journey of my life was a bus trip across the country to visit family in rural
southern England. I was travelling alone on a rainy Tuesday afternoon, and the journey lasted
roughly two hours and fifty minutes - at first it felt like a long, uneventful passage through the
countryside. But as I settled into my seat and watched the raindrops roll down the window, the mood
of the ride changed […] I came to realise that life journeys are rarely perfectly predictable, and
that sometimes the detours carry something of their own value.
Reading: verdicts (TFNG / YNNG)
The 2B's Reading strength is grounded MCQ (below) and its verifier role, not producing FALSE / NOT GIVEN variety from scratch (it tends to restate facts as TRUE). For worked TFNG examples with FALSE and NOT GIVEN verdicts, see the 4B / 9B cards, or run the 2B as the checker in the verification loop below.
Reading: Multiple Choice (grounded, temp 0.3; key verified)
Output
QUESTION - Choose the correct letter, A, B, C or D.
What are coral reefs built from?
A. Warm water and sunlight in shallow areas
B. Calcium carbonate secreted by tiny animals called polyps
C. Algae living in the polyps' tissues
D. The Great Barrier Reef
ANSWER KEY:
1. B - the passage states reefs are "built by tiny animals called polyps, which secrete calcium carbonate". Warm water/sunlight are growth conditions; algae are the polyps' food.
Where the 2B needs help (be honest): sentence/summary completion is unreliable from this model. It often echoes the sentence instead of extracting a word (~37% in-passage vs 100% on the 4B/9B). For completion, prefer the 4B/9B, or always verify.
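One cheap way to "always verify" completion items is to check that each keyed answer actually occurs in the passage. The sketch below is an assumption about what such a check could look like; the card does not describe how the "in-passage" figure was scored, and the function names are hypothetical.

```python
import re


def _norm(text):
    """Lowercase and collapse whitespace so matching ignores layout."""
    return re.sub(r"\s+", " ", text).strip().lower()


def answers_not_in_passage(passage, answers):
    """Return the answers that do not occur verbatim in the passage.

    Empty answers are treated as missing. This is plain substring
    matching, so "polyp" would match inside "polyps"; tighten it with
    word boundaries if that matters for your items.
    """
    haystack = _norm(passage)
    return [a for a in answers if not _norm(a) or _norm(a) not in haystack]


if __name__ == "__main__":
    passage = ("Coral reefs are built by tiny animals called polyps, "
               "which secrete calcium carbonate.")
    print(answers_not_in_passage(passage, ["calcium carbonate", "sunlight"]))
```

Any answer this returns is a candidate for regeneration or manual review.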
Supported types per section, and how to prompt each
| Section · Type | Prompt <TYPE=…> | Temp | What you get |
|---|---|---|---|
| Reading · True/False/Not Given | TFNG | 0.3 | passage + statements + key |
| Reading · Yes/No/Not Given | YNNG | 0.3 | opinion passage + statements + key |
| Reading · Multiple choice | MCQ / MCQ_MULTI | 0.3 | passage + A–D question(s) + key |
| Reading · Sentence/Summary completion | SENTENCE_COMPLETION / SUMMARY_COMPLETION | 0.3 | gap items + key (weak on 2B; verify or use 4B+) |
| Reading · Matching | MATCHING_* | 0.5 | matching task + key (experimental) |
| Reading · Long-form | LONGFORM | 0.6 | ~600-word passage + mixed questions + key |
| Writing · Task 1 / Task 2 | TASK1 / TASK2 | 0.7 | task + word limit + timing |
| Speaking · Part 1/2/3 | PART1 / PART2 / PART3 | 0.7 | examiner question / cue card + model answer |
| Listening | LISTENING | 0.7 | transcript + questions + key |
Tip: for dependable Reading keys, generate grounded. Prepend a real passage and add "Using ONLY the passage below … Do not write a new passage."
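The per-type temperatures above can be kept in one lookup so each call picks up the recommended setting. This is a minimal sketch with hypothetical names (`RECOMMENDED_TEMPERATURE`, `sampling_kwargs`); it assumes `MATCHING_*` covers the three matching types listed under the prompt format, and it borrows `top_p=0.9` and `max_new_tokens=700` from the usage settings elsewhere in this card.

```python
# Temperatures per <TYPE=...> value, copied from the table in this card.
RECOMMENDED_TEMPERATURE = {
    "TFNG": 0.3, "YNNG": 0.3, "MCQ": 0.3, "MCQ_MULTI": 0.3,
    "SENTENCE_COMPLETION": 0.3, "SUMMARY_COMPLETION": 0.3,
    "MATCHING_HEADINGS": 0.5, "MATCHING_FEATURES": 0.5,
    "MATCHING_ENDINGS": 0.5,
    "LONGFORM": 0.6,
    "TASK1": 0.7, "TASK2": 0.7,
    "PART1": 0.7, "PART2": 0.7, "PART3": 0.7,
    "LISTENING": 0.7,
}


def sampling_kwargs(type_, max_new_tokens=700):
    """Build keyword arguments for `model.generate` for one TYPE value.

    Raises KeyError for a TYPE that is not in the table.
    """
    return {
        "do_sample": True,
        "temperature": RECOMMENDED_TEMPERATURE[type_],
        "top_p": 0.9,
        "max_new_tokens": max_new_tokens,
    }


if __name__ == "__main__":
    print(sampling_kwargs("TFNG"))
```

The returned dictionary can be unpacked into a Transformers `generate` call, e.g. `model.generate(**inputs, **sampling_kwargs("TFNG"))`.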
Generating a full exam section (one passage → all question types)
Generate one passage, then each question type against it:
- Generate the passage: <…TYPE=LONGFORM…> Write ONLY a ~600-word IELTS reading passage. No questions.
- For each type: Using ONLY the passage below, write 5 TFNG statements with an answer key. Do not write a new passage.\nPASSAGE:\n<passage>
- Concatenate the results into a real-exam-style section. Grounding keeps facts consistent. (The demo Space does this.)
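The three steps above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the demo Space's actual code: `generate` is a hypothetical callable that maps a prompt string to generated text (for example, a wrapper around the Transformers usage in this card), and `grounded_prompt` and `build_section` are made-up names. Whether the grounded prompts should also carry the tag prefix is not stated here, so the sketch leaves that to the caller.

```python
def grounded_prompt(passage, task):
    """Wrap a task in the grounding wording quoted in this card."""
    return (f"Using ONLY the passage below, {task} "
            f"Do not write a new passage.\nPASSAGE:\n{passage}")


def build_section(generate, passage_prompt, tasks):
    """Generate one passage, then one grounded question set per task.

    `generate` is any callable taking a prompt string and returning text.
    The passage and the question sets are joined with blank lines.
    """
    passage = generate(passage_prompt)
    parts = [passage]
    for task in tasks:
        parts.append(generate(grounded_prompt(passage, task)))
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without loading a model.
    def fake_generate(prompt):
        return f"[model output for a {len(prompt)}-character prompt]"

    print(build_section(
        fake_generate,
        "Write ONLY a ~600-word IELTS reading passage. No questions.",
        ["write 5 TFNG statements with an answer key.",
         "write 3 multiple-choice questions with an answer key."],
    ))
```

Because every question set is generated against the same passage string, the answer keys can later be re-checked against that one source.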
Usage (transformers)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "preparebuddy/ielts-2b"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto").eval()

# System prompt: frames the model as a content generator, not an assessor.
SYSTEM = ("You generate authentic IELTS Academic practice content across reading, writing, "
          "listening, and speaking. Produce passages, transcripts, tasks, questions, and answer "
          "keys or model answers as appropriate to the section. Use IELTS-style register: "
          "academic, neutral, factually plausible. This is content generation, not assessment.")
# One SECTION+TYPE per call, expressed with the tag prefix.
user = "<TEST=IELTS><SECTION=SPEAKING><TYPE=PART2><DIFF=medium><TOPIC=a memorable journey> Generate an IELTS Speaking Part 2 cue card with a model answer."

# Build the model input with reasoning mode switched off.
inp = tok.apply_chat_template(
    [{"role": "system", "content": SYSTEM}, {"role": "user", "content": user}],
    add_generation_prompt=True, enable_thinking=False, return_tensors="pt", return_dict=True,
).to(model.device)
out = model.generate(**inp, max_new_tokens=700, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inp["input_ids"].shape[1]:], skip_special_tokens=True))
Settings: temp 0.3 for verdicts (TFNG/YNNG/MCQ), 0.7 for passages/writing/speaking; top_p 0.9; one SECTION+TYPE per call. Also available as ielts-2b-mlx for Apple Silicon / LM Studio.
Recommended architecture for reliable output (important)
The 2B is a strong drafter and an excellent verifier. For dependable answer keys, run it as a system:
- Ground: generate questions against a real passage (facts come from the source, not the model). For the 2B this matters most, because it removes its biggest weakness (from-scratch facts).
- Verify: re-check each answer key with an independent judge. The trained 2B is itself the best, cheapest verifier in the family.
- Review/regenerate the small flagged minority.
Measured end-to-end: raw grounded generation ≈ 75% → ≈ 85–90% with this verify loop.
Strengths & honest limits (2B)
- ✅ Best verdict judge / cheapest verifier in the family; fluent Writing / Speaking; fast on a laptop; 0 non-English-token leak.
- ⚠️ Weak completion grounding (~37% in-passage) and more factual slips in from-scratch passages than the 4B/9B; grounding strongly recommended.
- ⚠️ Verdict generation from scratch tends to over-commit (restates facts as TRUE); use grounding + verification, or the 4B/9B, for FALSE/NOT GIVEN variety.
- Listening/Speaking output is text (for downstream TTS); no audio. Not an assessment tool.
Training
LoRA fine-tune of Qwen3.5-2B (bf16; r=16, α=32; completion-only loss; enable_thinking=False; 2 epochs, lr 1e-4) on 1,438 curated + balanced examples (balanced share of NOT GIVEN in verdict types), trained on NVIDIA cloud GPUs; runs on a laptop. Dataset not released (proprietary). Full method, hardware and results: technical report.
License
Apache-2.0, inheriting from Qwen3.5-2B. Free to use, modify, distribute (incl. commercially); retain attribution to the base model and PrepareBuddy.
The 2B / 4B / 9B family: pick the right one
| | ielts-2b ⭐ | ielts-4b | ielts-9b |
|---|---|---|---|
| Best for | cheapest; best verdict judge/verifier | balanced general use | best facts (from scratch) |
| Verdict accuracy (fine-tuned)¹ | 80% | 74% | 77% |
| Completion answers in-passage | ⚠️ 37% | ✅ 100% | ✅ 100% |
| Facts in from-scratch passages | weakest | good | ✅ best |
| Size (bf16) | ~5 GB | ~9 GB | ~18 GB |
| Use with grounding | strongly | recommended | recommended |
¹ Greedy decoding, 101-item held-out gold set. Fine-tuning's benefit was inversely related to base capability: it transformed the 2B (+40 points) and was flat on the 4B/9B. Full method, findings and tables: technical report.
Getting better results: grounding + the re-checking loop (the biggest quality lever)
1. Ground: generate against a real passage so facts come from the source:
Using ONLY the passage below, write 4 True/False/Not Given statements with an answer key. Do NOT write a new passage.
PASSAGE: <your real passage>
2. Re-check (verify): independently re-judge each answer key and flag disagreements:
# judge() and flag_for_review_or_regenerate() are placeholders for your own code.
for statement in generated_statements:
    verdict = judge(model, passage, statement)  # TRUE / FALSE / NOT GIVEN
    if verdict != generated_key[statement]:
        flag_for_review_or_regenerate(statement)
The trained 2B is the best, cheapest verifier here: training made it a far better checker (40% → 80%), not just a generator.
3. Review / regenerate the flagged minority. Measured: ≈ 75% raw → ≈ 85–90% with this loop.
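The re-checking step can be made concrete with a small function that takes the judge as a parameter. This is a minimal sketch, not the authors' pipeline: `recheck_keys` is a hypothetical name, and `judge` is assumed to be any callable that takes a passage and a statement and returns "TRUE", "FALSE" or "NOT GIVEN" (for instance, a wrapper that prompts the 2B and parses its verdict, with the model bound inside the callable).

```python
def recheck_keys(judge, passage, generated_key):
    """Return the items whose independent verdict disagrees with the key.

    `generated_key` maps each statement to the verdict in the generated
    answer key. Each flagged item is a tuple of
    (statement, key_verdict, judge_verdict), ready for review or
    regeneration. Comparison ignores case and surrounding whitespace.
    """
    flagged = []
    for statement, key_verdict in generated_key.items():
        verdict = judge(passage, statement)
        if verdict.strip().upper() != key_verdict.strip().upper():
            flagged.append((statement, key_verdict, verdict))
    return flagged


if __name__ == "__main__":
    # Toy judge for demonstration only: exact substring match means TRUE.
    def toy_judge(passage, statement):
        return "TRUE" if statement in passage else "NOT GIVEN"

    passage = "Coral reefs are built by polyps."
    key = {
        "Coral reefs are built by polyps.": "TRUE",
        "Reefs grow only in cold water.": "TRUE",
    }
    for item in recheck_keys(toy_judge, passage, key):
        print("flagged:", item)
```

Passing the judge in as a callable keeps the loop independent of which model does the checking, so the same code works whether the verifier is the 2B or a larger model.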
Prompt tips
- Always use the tag prefix <TEST=IELTS><SECTION=…><TYPE=…><DIFF=…><TOPIC=…>; it's not a chat model.
- Temperature: 0.3 for verdicts, 0.7 for passages/writing/speaking; top_p 0.9.
- enable_thinking=False: reasoning mode lowers verdict accuracy.
- One SECTION+TYPE per call; build a full section by generating each type against one shared passage.