---
tags:
- lora
- transformers
base_model: local-synthetic-gpt2
license: mit
pipeline_tag: text-generation
---

# SQL OCR LoRA (synthetic, CPU-friendly)

This repository hosts a tiny GPT-2-style LoRA adapter trained on a synthetic SQL Q&A corpus that mimics table-structure reasoning prompts. The model and tokenizer are initialized from scratch to avoid external downloads and keep the pipeline CPU-friendly.

## Model Details

- **Architecture:** GPT-2-style causal LM (2 layers, 4 heads, hidden size 128)
- **Tokenizer:** word-level tokenizer trained on the synthetic prompts/answers, with special tokens `[BOS]`, `[EOS]`, `[PAD]`, `[UNK]`
- **Task:** text generation / instruction following for SQL-style outputs
- **Base model:** `local-synthetic-gpt2` (initialized from scratch)

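The architecture above is small enough to rebuild locally. A minimal sketch using `transformers` (the `vocab_size` here is a placeholder assumption, not the released tokenizer's actual size):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny GPT-2 configuration matching the card: 2 layers, 4 heads, hidden size 128.
config = GPT2Config(
    n_layer=2,
    n_head=4,
    n_embd=128,
    vocab_size=2000,  # assumption: small word-level vocabulary, not the real value
)
model = GPT2LMHeadModel(config)

# Count parameters to confirm the model is small enough for quick CPU runs.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params:,}")
```

At these dimensions the model has well under a million non-embedding parameters, which is why fp32 CPU training in a few dozen steps is practical.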
## Training

- **Data:** 64 synthetic Spider-inspired text pairs combining schema prompts with target SQL answers (no real images)
- **Batch size:** 2 (gradient accumulation 1)
- **Max steps:** 30
- **Precision:** fp32 on CPU
- **LoRA:** rank 8, alpha 16, applied to the `c_attn` attention projections

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("JohnnyZeppelin/sql-ocr")
tokenizer = AutoTokenizer.from_pretrained("JohnnyZeppelin/sql-ocr")

text = "<|system|>Given the database schema displayed above for database 'sales_0', analyze relations...<|end|><|user|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Limitations & Notes

- This is a demonstration LoRA trained on synthetic, text-only data; it is **not** a production OCR or SQL model.
- The tokenizer and model are tiny and intended only for quick CPU experiments.
- Because training is fully synthetic, outputs will be illustrative rather than accurate for real schemas.