SolarHive 26B A4B NF4 — Community Solar Energy Intelligence (Quantized)
Overview
SolarHive 26B A4B NF4 is the 4-bit quantized version of solarhive-26b-a4b-merged — pre-quantized to NF4 via bitsandbytes so the model loads in NF4 directly without runtime quantization. Designed for deployment on HuggingFace Spaces, Colab, or any GPU with 24+ GB VRAM.
It is a LoRA fine-tuned Gemma 4 26B A4B (MoE) model specialized in community solar energy intelligence with native function calling, multimodal VQA, and selective tool reasoning.
Why pre-quantized?
- No runtime quantization: skips the CPU-intensive quantization step at startup
- Faster Space startup: the model loads directly in NF4, ready for inference immediately
- Same quality: the NF4 model matches the BF16 baseline on the 8-question benchmark (8/8)
- Embedded quantization config: loads in NF4 automatically, with no user-specified BitsAndBytesConfig
Key Features:
- Domain expertise in solar production, battery management, grid optimization, and community coordination
- Multimodal visual question answering (sky analysis, panel inspection, neighborhood assessment)
- Native function calling for 4 energy-specific tools
- Grounded responses referencing real API data
- No Unsloth dependency: loads with standard transformers
Mission
SolarHive is an open-source intelligence layer designed to coordinate community microgrids and community-based storage via fuel cells, pool midday energy surplus across these microgrids, and eliminate stranded capacity. It also forecasts solar irradiance and cloud cover so communities can plan ahead.
Why 26B A4B? — Model Architecture Selection
Gemma 4 offers four model sizes. We evaluated all four and selected two complementary architectures for a dual fine-tune strategy — one for cloud inference (this model), one for edge deployment (E4B):
| Model | Params (Total / Active) | Architecture | Vision Encoder | Context | Modalities | Selection |
|---|---|---|---|---|---|---|
| E2B | 5.1B / 2.3B effective | Dense + PLE | ~150M | 128K | Text, Image, Audio, Video | Ollama serving target |
| E4B | 8B / 4.5B effective | Dense + PLE | ~150M | 128K | Text, Image, Audio, Video | Fine-tuned for edge |
| 26B A4B | 25.2B / 3.8B active | MoE (8/128 + 1 shared) | ~550M | 256K | Text, Image | This model — cloud inference |
| 31B | 30.7B / 30.7B | Dense | ~550M | 256K | Text, Image | Rejected |
SolarHive requires two core capabilities: multimodal VQA (analyzing sky photos and panel images) and native function calling (invoking weather, solar, battery, and grid APIs in agentic loops). The official benchmarks show why 26B A4B delivers the best capability-to-cost ratio:
| Benchmark | SolarHive Use Case | E4B | 26B A4B | 31B |
|---|---|---|---|---|
| MMMU Pro (vision) | Sky/panel VQA analysis | 52.6% | 73.8% | 76.9% |
| MATH-Vision | Visual reasoning on solar data | 59.5% | 82.4% | 85.6% |
| OmniDocBench (lower=better) | Document understanding | 0.181 | 0.149 | 0.131 |
| MMLU Pro | Domain expertise (energy advisory) | 69.4% | 82.6% | 85.2% |
| GPQA Diamond | Scientific reasoning | 58.6% | 82.3% | 84.3% |
| MRCR v2 128K | Multi-round tool-calling context | 25.4% | 44.1% | 66.4% |
Source: Gemma 4 Model Card. All four models support native function calling and agentic workflows.
Why 26B A4B wins for SolarHive:
- ~550M vision encoder delivers 73.8% MMMU Pro, about 40% higher in relative terms than E4B (52.6%) for sky/panel VQA and only about 4% (3.1 points) below 31B (76.9%)
- MoE sparse activation (3.8B active of 25.2B) achieves ~95% of 31B quality at a fraction of the compute
- 256K context window accommodates multi-round agentic tool-calling loops (4 API calls per turn)
- Best domain absorption — converged loss 0.6956 vs E4B's 0.9218 on the same training corpus
Why not 31B? Only 2-3% better on vision and reasoning but 2-4x more compute and VRAM. Not worth the cost for a community energy advisor.
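The "40% better" and "4% below" figures in the list above are relative differences rather than percentage-point gaps. A minimal sketch that recomputes them from the MMMU Pro row of the benchmark table; the function name `relative_gap` is illustrative and not part of any SolarHive code:

```python
def relative_gap(score: float, reference: float) -> float:
    """Relative difference of `score` versus `reference`, in percent."""
    return (score - reference) / reference * 100.0

# MMMU Pro scores copied from the benchmark table in this card.
mmmu_pro = {"E4B": 52.6, "26B A4B": 73.8, "31B": 76.9}

gain_over_e4b = relative_gap(mmmu_pro["26B A4B"], mmmu_pro["E4B"])
deficit_vs_31b = relative_gap(mmmu_pro["26B A4B"], mmmu_pro["31B"])

print(f"26B A4B vs E4B: {gain_over_e4b:+.1f}% relative")
print(f"26B A4B vs 31B: {deficit_vs_31b:+.1f}% relative")
```

In absolute terms the same gaps are 21.2 and 3.1 percentage points, so it matters which convention a comparison uses.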
Quantization Details
| Parameter | Value |
|---|---|
| Source Model | Truthseeker87/solarhive-26b-a4b-merged (BF16) |
| Quantization Method | NF4 (4-bit NormalFloat) via bitsandbytes |
| Compute dtype | BF16 (for dequantized computation) |
| Quantized Layers | 426 / 861 weighted layers (linear layers) |
| Non-quantized Layers | Embeddings, layer norms, gate layers (remain in BF16) |
| Model Size on Disk | ~48 GB (bitsandbytes serialized NF4 — includes quantization state) |
| Min VRAM | ~24 GB with device_map="auto" (fits L4, A10G, RTX 4090); ~49 GB on single GPU |
| Quantization Tool | bitsandbytes >=0.45.0 on Google Colab Pro (NVIDIA RTX PRO 6000 Blackwell) |
Why NF4?
NF4 (NormalFloat 4-bit) is optimized for normally-distributed neural network weights, providing better accuracy than uniform INT4 quantization. The 26B A4B uses a MoE architecture with 128 routed experts + 1 shared expert per MoE layer, activating only 8 experts (3.8B params) per forward pass out of 25.2B total. The model has 30 layers with a ~550M parameter vision encoder.
Architectural Note: Partial Quantization
bitsandbytes quantizes only torch.nn.Linear layers. Gemma 4's MoE expert weights use Gemma4ClippableLinear — a custom subclass that bitsandbytes does not recognize. This means 435 of 861 weighted layers (the MoE expert gate/routing layers) remain in BF16, resulting in ~48 GB on disk rather than the ~13 GB typical for fully-quantized 26B models.
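As a rough sanity check on these sizes, the sketch below computes the raw weight storage for the two extreme cases: every weight packed into 4 bits versus every weight kept in BF16. It is back-of-envelope arithmetic only and assumes the 25.2B total parameter count quoted in this card; it ignores quantization state, metadata, and any per-layer differences.

```python
TOTAL_PARAMS = 25.2e9  # total parameter count quoted in this card

def size_gb(params: float, bits_per_param: float) -> float:
    """Raw weight storage in decimal gigabytes, ignoring metadata."""
    return params * bits_per_param / 8 / 1e9

fully_packed_4bit = size_gb(TOTAL_PARAMS, 4)   # every weight stored in 4 bits
full_bf16 = size_gb(TOTAL_PARAMS, 16)          # every weight stored in BF16

print(f"fully packed 4-bit: ~{fully_packed_4bit:.1f} GB")
print(f"full BF16:          ~{full_bf16:.1f} GB")
```

The observed ~48 GB sits near the BF16 end of that range, which is consistent with most of the parameters living in the expert weights that stay unquantized.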
For truly compact 4-bit files (~13 GB), GPTQ or AWQ quantization would be required — these methods quantize all linear layer types regardless of subclass. The runtime BitsAndBytesConfig approach (used in our live demo) achieves the same accuracy with standard transformers loading.
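For comparison, a runtime-quantization load of the BF16 merged checkpoint might look like the sketch below. The exact arguments used by the live demo are not shown in this card, so the configuration here is an assumption built from the standard transformers and bitsandbytes options; the helper name `load_merged_in_nf4` is hypothetical.

```python
def load_merged_in_nf4(repo_id: str = "Truthseeker87/solarhive-26b-a4b-merged"):
    """Quantize the BF16 merged checkpoint to NF4 while loading (needs a CUDA GPU)."""
    # Imports are local so this file can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantized matmuls run in BF16
    )
    return AutoModelForCausalLM.from_pretrained(
        repo_id,
        quantization_config=quant_config,       # assumed settings, not confirmed by the card
        device_map="auto",
    )
```

Unlike the pre-quantized repository, this path pays the quantization cost on every startup, which is the trade-off the pre-quantized release is meant to avoid.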
Benchmark Results
Benchmarked directly on the NF4 quantized model — not inherited from BF16. Same 8-question held-out evaluation used across all SolarHive model variants.
Domain Q&A (5/5)
All domain questions answered correctly on the NF4 model:
- Solar production impact of humidity/weather
- Battery management optimization
- Diagnostic troubleshooting
- Seasonal planning
- Grid frequency interpretation
Tool Calling (3/3)
| Question | Expected Tool(s) | Result |
|---|---|---|
| "What's the current battery state?" | get_battery_state | PASS |
| "How much solar are we producing right now in Seattle?" | get_weather or get_solar_production | PASS |
| "What are the general maintenance tips for panels?" | No tool call | PASS |
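The evaluation harness itself is not published in this card, so the following is only a sketch of how a PASS/FAIL rule for the three probes above could be written. The function `score_tool_probe` and its pass criterion (at least one call, all calls within the accepted set, or no call at all when none is expected) are assumptions.

```python
from typing import Optional

def score_tool_probe(called: list, allowed: Optional[set]) -> str:
    """Return "PASS" or "FAIL" for one tool-calling probe.

    called:  names of the tools the model invoked, in order.
    allowed: acceptable tool names, or None when no tool call is expected.
    """
    if allowed is None:
        return "PASS" if not called else "FAIL"
    # At least one call is required and every call must be an accepted tool.
    ok = bool(called) and all(name in allowed for name in called)
    return "PASS" if ok else "FAIL"

# (observed calls, accepted tools) for the three probes in the table.
probes = [
    (["get_battery_state"], {"get_battery_state"}),
    (["get_weather"], {"get_weather", "get_solar_production"}),
    ([], None),
]
results = [score_tool_probe(called, allowed) for called, allowed in probes]
print(results)
```

A stricter harness could also compare call arguments (for example, that the location is Seattle); this sketch checks tool names only.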
Initial Production Benchmark (8-question set)
- 5/5 Q&A correct — matches BF16 baseline
- 3/3 tool calling correct — matches BF16 baseline
- BF16 baseline: 8/8 (initial fine-tune validation)
- NF4 result: 8/8 (post-quantization validation)
This 8-question benchmark was the initial validation harness. The May 2026 final-run multi-variant inference (10-question parity benchmark) is the canonical headline number — see below.
Multi-Variant Deployment Validation (Final Run, May 2026)
End-to-end inference run on Colab Pro G4 (NVIDIA RTX PRO 6000 Blackwell, 102 GB VRAM total). This A4B NF4 variant was loaded directly via AutoModelForCausalLM.from_pretrained(..., device_map="cuda:0") — no BitsAndBytesConfig needed because the weights are pre-quantized on disk. VRAM utilization 49.2 GB.
Score: 5/5 Q&A + 4/5 tool = 9/10 on the 10-question parity benchmark — matches the BF16 merged variant exactly, confirming NF4 quantization is not measurably degrading output quality at this benchmark resolution.
The single FAIL is the lenient multi-call probe — "Compare today's irradiance forecast across Ann Arbor, Phoenix, and Seattle" (min_calls=2) — where this variant returned no tool call. The same multi-call failure appears on 4 of 5 measured variants in this run; only the E4B LoRA + base variant chained the multi-city calls (3 × get_weather). Worth a multi-trial re-run to characterize whether this is stochastic at temperature=1.0 or systematic — the consistent A4B-NF4 vs A4B-BF16 parity on the other 9 questions suggests the multi-call failure is shared at the fine-tune level, not introduced by quantization.
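The multi-trial re-run suggested above could be organized along these lines. This is a sketch under stated assumptions: `run_probe` stands for one sampled generation that returns the list of tool names the model called, and `fake_probe` is a made-up stand-in so the example runs without a GPU; neither is part of the SolarHive code.

```python
import random
from typing import Callable, List

def multi_call_pass_rate(run_probe: Callable[[], List[str]],
                         n_trials: int, min_calls: int = 2) -> float:
    """Fraction of trials in which the model made at least `min_calls` tool calls."""
    passes = sum(len(run_probe()) >= min_calls for _ in range(n_trials))
    return passes / n_trials

# Stand-in for real inference: a hypothetical model that chains the three
# get_weather calls on roughly 30% of sampled generations.
rng = random.Random(0)

def fake_probe() -> List[str]:
    return ["get_weather"] * 3 if rng.random() < 0.3 else []

rate = multi_call_pass_rate(fake_probe, n_trials=20)
print(f"multi-call pass rate over 20 trials: {rate:.2f}")
```

A rate close to zero across many trials would point to a systematic failure, while an intermediate rate would suggest sampling variance at temperature=1.0.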
When2Call score: 3/3 — inferred from the A4B LoRA baseline. The When2Call probe suite was directly measured on the A4B LoRA baseline — score 3/3 — and on the E4B merged variant — score 2/3. This A4B NF4 variant inherits the 3/3 score by lossless equivalence: the underlying weights are the same fine-tuned A4B LoRA + base; the merge step is mathematically lossless on weights, and NF4 is hypothesized to preserve the refusal/follow-up decision boundary because that boundary is determined by the language-model linear layers rather than by the precision of any single weight. We label this score inferred (not directly measured in the May 2026 inference run) to distinguish it from the directly-measured 3/3 on A4B LoRA.
Why the inference is reasonable: the A4B family's 9-of-10 parity on Q&A + tool-routing held across LoRA, merged BF16, and NF4 (all three score 5/5 + 4/5 in the run), demonstrating that quantization does not shift first-order routing behavior. The When2Call decision is a routing decision; the same lossless-merge-then-quantize pipeline applies. We flag it as inferred for honesty — multi-trial direct measurement on this NF4 variant would close the audit gap.
Compare to the E4B family (solarhive-e4b-lora + solarhive-e4b-ollama) which scores 2/3 (fails (d) by calling get_weather for an air-quality question). The +1/3 W2C delta between A4B and E4B families is the empirical signature of size-vs-refusal scaling. A4B outperforming the smaller E4B fine-tune on reasoning-heavy probes was the pre-stated hypothesis per the official Google Gemma 4 docs "Models with higher parameters and bit counts are generally more capable" — this 26B A4B accesses ~25B total knowledge capacity (3.8B active per token via MoE sparsity) and a ~550M vision encoder vs E4B's 8B / 4.5B effective / ~150M.
Key Specifications
| Parameter | Value |
|---|---|
| Base Model | google/gemma-4-26b-a4b-it |
| Source Model | Truthseeker87/solarhive-26b-a4b-merged |
| Architecture | MoE — 25.2B total, 3.8B active (8/128 + 1 shared experts) |
| Modalities | Text + Image |
| Context Length | 256K tokens |
| Fine-Tuning Method | LoRA via Unsloth (BF16), merged to 16-bit, then quantized to NF4 |
| Training Data | 1,727 examples (solarhive-community-solar-multimodal) — text-only fine-tune; VQA at inference uses the base Gemma 4 vision encoder (~550M params), unmodified by our LoRA per the Vertex AI SFT recipe |
| Converged Loss | 0.6956 |
| Benchmark Score | 9/10 (5/5 domain Q&A + 4/5 tool calling) — May 2026 final run, multi-call regression on TQ5; matches BF16 merged baseline (NF4 is quantization-lossless on this benchmark) |
| Precision | NF4 (426/861 layers quantized, ~49 GB VRAM on single GPU) |
| License | MIT (adapters) / Gemma Terms (base model) |
Precision Note — NF4 vs the BF16 Source vs the Native Gemma 4 Release
This repository is the only quantization step in the SolarHive 26B A4B release. The pipeline is:
Google's open-source Gemma 4 26B A4B base (BF16, native release precision)
↓ LoRA fine-tuning at BF16 (Unsloth FastVisionModel)
solarhive-26b-a4b-lora — adapter weights (BF16, ~2 GB)
↓ merge_16bit (Unsloth save_pretrained_merged)
solarhive-26b-a4b-merged — full BF16 weights (same precision as base, ~48 GB)
↓ BitsAndBytesConfig(load_in_4bit=True, nf4) — THIS step is the only quantization
solarhive-26b-a4b-nf4 — partial NF4 quantization (this repo)
BF16 is Google's native release precision for Gemma 4: the open-source base at google/gemma-4-26b-a4b-it is itself BF16. The intermediate LoRA and merged artifacts preserve that precision exactly. Only this NF4 repository introduces a precision change, and only for 426 of 861 weighted layers (the standard torch.nn.Linear ones); the MoE expert weights stay in BF16 because bitsandbytes does not recognize the Gemma4ClippableLinear subclass. That is why the on-disk size is ~48 GB rather than the ~13 GB typical for fully-packed 4-bit at this parameter count.
| Artifact | Precision | Notes |
|---|---|---|
| Google's Gemma 4 26B A4B base | BF16 | Native release precision; no FP32 source exists |
| SolarHive LoRA adapters | BF16 (delta only) | Apply over base via Unsloth |
| SolarHive merged BF16 | BF16 (full model) | Same precision as base + LoRA delta folded in |
| This NF4 repo | NF4 + BF16 hybrid | 426 Linear layers quantized; 435 MoE expert layers stay BF16 |
The 8/8 benchmark in this card was run on the NF4 model after quantization and matched the BF16 baseline — confirming the partial quantization preserves quality on the SolarHive held-out set.
Training Details
| Parameter | Value |
|---|---|
| Method | LoRA via Unsloth FastVisionModel (BF16, RTX PRO 6000 Blackwell 102 GB) |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Learning rate | 2e-4 |
| Epochs | 3 |
| Max sequence length | 2048 |
| Precision | BF16 (training) → NF4 (this quantized release) |
| Trainable parameters | 505.4M / 26.3B (1.92%) |
| Training time | 7,198 seconds (~120 minutes) |
| Hardware | Google Colab Pro (NVIDIA RTX PRO 6000 Blackwell) |
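The derived figures in the table above can be checked with a few lines of arithmetic. The sketch assumes standard LoRA scaling (the low-rank update is multiplied by alpha / rank); whether the actual run used a variant such as rank-stabilized scaling is not stated in this card.

```python
lora_rank = 16
lora_alpha = 16
trainable_params = 505.4e6
total_params = 26.3e9
training_seconds = 7198

# Standard LoRA scales the low-rank update by alpha / rank.
scaling = lora_alpha / lora_rank
trainable_pct = trainable_params / total_params * 100
training_minutes = training_seconds / 60

print(f"LoRA scaling factor: {scaling}")
print(f"trainable share:     {trainable_pct:.2f}%")
print(f"training time:       {training_minutes:.0f} min")
```

With rank equal to alpha, the adapter update is applied at unit scale, so the learning rate is the main knob controlling how strongly the adapters move the base model.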
Training Data — 1,727 Examples
Canonical training corpus: solarhive-community-solar-multimodal:
- 413 hand-crafted examples across 15+ US cities, 9 energy domains
- ~1,117 API-grounded examples from Open-Meteo, PVWatts, OpenWeatherMap, EIA
- 183 tool-calling examples (positive, negative refusals, follow-up clarifications, failure-recovery — When2Call taxonomy)
- 14 image-grounded Q&A turns from 7 manually-labeled Ann Arbor sky photographs
How to Use
Loading Pre-Quantized Model (Recommended)
No BitsAndBytesConfig needed — weights are already quantized:
from transformers import AutoProcessor, AutoModelForCausalLM
import torch
processor = AutoProcessor.from_pretrained(
"google/gemma-4-26b-a4b-it",
trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
"Truthseeker87/solarhive-26b-a4b-nf4",
device_map="cuda:0",
trust_remote_code=True,
)
Two-Step Tokenization (Required)
messages = [
{"role": "system", "content": "You are SolarHive, an AI energy advisor for a community of 12 homes with rooftop solar and shared battery storage in Ann Arbor, Michigan."},
{"role": "user", "content": "How will today's weather affect our solar production?"},
]
# Step 1: render text (tokenize=False)
# `tools` is the list of tool functions defined under "Native Function Calling" below
text = processor.apply_chat_template(
    messages, tools=tools,
    add_generation_prompt=True,
    enable_thinking=False,
    tokenize=False,
)
# Step 2: tokenize separately
inputs = processor(text=text, images=None, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, temperature=1.0, top_p=0.95, top_k=64)
response = processor.tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
Native Function Calling
def get_weather(location: str) -> dict:
    """Get current weather conditions for a location.

    Args:
        location: City name, e.g. 'Ann Arbor, MI'
    Returns:
        dict with temp_f, clouds_pct, wind_mph, humidity, sunrise, sunset
    """
    ...

def get_solar_production(clouds_pct: int, temp_f: float) -> dict:
    """Get estimated community solar production using GHI irradiance data.

    Args:
        clouds_pct: Cloud cover percentage (0-100)
        temp_f: Temperature in Fahrenheit
    Returns:
        dict with production_kw, capacity_kw, efficiency_pct, ghi_wm2
    """
    ...

# get_battery_state and get_grid_status are defined in the same way (not shown here).
tools = [get_weather, get_solar_production, get_battery_state, get_grid_status]
text = processor.apply_chat_template(
    messages, tools=tools,
    add_generation_prompt=True,
    enable_thinking=False,
    tokenize=False,
)
Core Capabilities
1. Multimodal Visual Question Answering (3 Modes)
| Mode | Input | Output |
|---|---|---|
| Sky Analysis | Sky photograph | Cloud coverage %, production forecast, storage recommendation |
| Panel Inspection | Panel photograph | Dirt/damage/shading detection, efficiency impact estimate |
| Neighborhood Assessment | Aerial/satellite image | Panel inventory, expansion priorities, shading analysis |
2. Native Function Calling (5 Tools — all 3 keyed APIs wired)
| Tool | API | Returns |
|---|---|---|
| get_weather(location) | OpenWeatherMap (OWM_API_KEY) | Temperature, clouds %, wind, humidity, sunrise/sunset |
| get_solar_production(clouds_pct, temp_f) | Open-Meteo GHI (keyless) | Production kW, efficiency %, GHI W/m2, temp derating |
| get_battery_state() | Community BMS (sim) | State of charge, capacity, charging status |
| get_grid_status() | EIA Open Data (EIA_API_KEY) | Pricing period, rate/kWh, renewable %, CO2 intensity |
| get_nrel_pvwatts_baseline() | NREL PVWatts v8 (NREL_API_KEY) | Annual + current-month typical kWh + avg kW for the 72 kW array |
Tool results feed back as a 2-message sequence matching the training distribution: {"role": "assistant", "tool_calls": [...]} then {"role": "tool", "name": "<fn>", "content": json.dumps(result)}. Shared across solarhive_datagen.py, solarhive_finetune.py, solarhive_inference.py Cell 4, and test_ollama_tools.py Solution B.
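The two-message sequence described above can be assembled as in the sketch below. The helper name `append_tool_result` is hypothetical, the layout of each `tool_calls` entry follows the common Hugging Face chat-template convention (an assumption, since the card elides it as `[...]`), and the battery payload fields are made up for illustration.

```python
import json

def append_tool_result(messages: list, fn_name: str, arguments: dict, result: dict) -> list:
    """Return a new message list extended with the assistant tool call and its result."""
    assistant_turn = {
        "role": "assistant",
        # Assumed entry layout (Hugging Face chat-template convention).
        "tool_calls": [
            {"type": "function", "function": {"name": fn_name, "arguments": arguments}}
        ],
    }
    tool_turn = {"role": "tool", "name": fn_name, "content": json.dumps(result)}
    return messages + [assistant_turn, tool_turn]

messages = [{"role": "user", "content": "What's the current battery state?"}]
# Illustrative payload; these field names are not taken from the SolarHive code.
messages = append_tool_result(
    messages, "get_battery_state", {}, {"soc_pct": 64, "capacity_kwh": 100}
)
print([m["role"] for m in messages])
```

The extended list can then be rendered again with the chat template so the model produces its final answer grounded in the tool output.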
3. Selective Tool Reasoning
The model intelligently decides when to call tools:
- "What time does peak pricing start?" → Calls: get_grid_status() only
- "Is today's production above typical for January?" → Calls: get_solar_production() + get_nrel_pvwatts_baseline()
- "Should I run my pool heater now?" → Calls: all 5 tools
- "What are general maintenance tips?" → Calls: none
4. Inference-time When2Call Validation
Three held-out probes validate 3 of the 4 failure-mode categories from Ross, H., Mahabaleshwarkar, A. S., & Suhara, Y. (2025). When2Call: When (not) to Call Tools. arXiv:2504.18851 — the paper documents 9–67% tool-hallucination rates on (c)+(d) in untrained community models:
- (b) "What's the current grid rate?" → expect get_grid_status call (well-specified, in-scope)
- (c) "How much will a 10 kW array produce today?" → expect follow-up question (does NOT auto-fill location default)
- (d) "What's the current air quality index in Ann Arbor?" → expect refusal + redirect (does NOT hallucinate a tool)
Models trained without explicit unable-to-answer and follow-up clarification examples typically fail (c) + (d). The SolarHive training corpus includes 16 such examples (10 unable-to-answer + 6 follow-up clarification) following the When2Call taxonomy; the A4B family achieves 3/3 (directly measured on A4B LoRA, inferred-lossless on the merged BF16 + on this NF4 variant — see Multi-Variant Deployment Validation above).
Community Model Specifications
| Parameter | Value |
|---|---|
| Location | Ann Arbor, Michigan (42.2808N, 83.7430W) |
| Community size | 12 homes |
| Total panel capacity | 72 kW |
| Shared battery storage | 100 kWh |
| Grid region | MISO (Midcontinent Independent System Operator) |
Technical Notes
- Pre-quantized model: NF4 weights are saved directly via push_to_hub() (bitsandbytes >=0.45.0), so no BitsAndBytesConfig is needed at load time. Just AutoModelForCausalLM.from_pretrained() with device_map="cuda:0".
- Processor from base model: Use AutoProcessor.from_pretrained("google/gemma-4-26b-a4b-it"); the base model's processor has the correct chat template with native tool-call support
- Two-step tokenization: Single-step tokenize=True crashes in transformers 5.5.x on messages without a content key, so always use the two-step approach
- System prompt repetition: Repeated system prompt improves instruction following (Leviathan et al., 2024)
- VRAM requirements: Observed 49.2 GB on single GPU (device_map="cuda:0"). For GPUs with <48 GB, use device_map="auto" to enable CPU offloading
- Memory footprint: 48.30 GB (dequantized computation size); actual VRAM usage ~49 GB on RTX PRO 6000
- Quantization verified: 426 Linear4bit layers confirmed after save/load cycle from HuggingFace Hub
- Sampling: temperature=1.0, top_p=0.95, top_k=64 (Kaggle-recommended defaults)
- Dependencies: transformers>=5.5.0, accelerate, bitsandbytes>=0.45.0
- For full BF16 precision: Use the merged model (~48 GB VRAM)
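One way to reproduce the "426 Linear4bit layers" check from the notes above is to count modules by class name after loading. The sketch below is an assumption about how such a check could be written, not the author's verification script; it uses tiny stand-in classes so it runs without torch or bitsandbytes, and relies only on the `named_modules()` iterator that PyTorch models expose.

```python
from collections import Counter

def count_layer_types(model) -> Counter:
    """Count modules by class name using the model's named_modules() iterator."""
    return Counter(type(module).__name__ for _, module in model.named_modules())

# Tiny stand-in objects so the sketch runs without torch or bitsandbytes.
class Linear4bit: ...
class Embedding: ...

class FakeModel:
    def named_modules(self):
        yield "embed", Embedding()
        yield "layers.0.q_proj", Linear4bit()
        yield "layers.0.k_proj", Linear4bit()

counts = count_layer_types(FakeModel())
print(counts["Linear4bit"])
```

Applied to the real model, `count_layer_types(model)["Linear4bit"]` would be expected to report the quantized-layer count quoted in this card.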
Limitations
- Prototype tested on single community (12 homes, Ann Arbor) — validation needed across geographies
- Model occasionally uses "60 kW" instead of correct 72 kW capacity in direct VQA responses
- Tool responses depend on external API availability with rate limits
- Battery state simulator is deterministic for demonstrations
- NF4 quantization may introduce minor quality variations compared to BF16 on edge cases
- For maximum quality, use the BF16 merged model
Future Iteration — Multi-Token Prediction (MTP) Drafters
Not in the measured numbers above. Google announced Gemma 4 MTP drafters on May 5, 2026 (blog, overview, HF collection, Kaggle, @GoogleGemma) — after this artifact's final benchmark was captured. The benchmarks above reflect standard autoregressive decoding only. MTP integration is documented here as future iteration; no measured speedup is claimed in this release.
Theoretical foundation. Speculative decoding (Leviathan, Kalman & Matias, ICML 2023, arXiv:2211.17192) accelerates generation without changing the output distribution, under both argmax and stochastic sampling: a smaller drafter proposes γ candidate tokens, the target scores all γ in a single parallel forward pass, the longest accepted prefix is kept, and the first rejected token is resampled from a corrected distribution. The output distribution is preserved exactly regardless of drafter quality; only the acceptance rate α, and therefore the walltime speedup, varies.
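The accept-or-resample rule behind that guarantee can be illustrated on a toy two-token vocabulary. This is a minimal sketch of a single verification step with made-up distributions `p` (target) and `q` (drafter); it is not the Transformers implementation and ignores batching, the γ-token loop, and the bonus token.

```python
import random
from typing import Dict

def speculative_accept(token: str, p: Dict[str, float], q: Dict[str, float],
                       rng: random.Random) -> str:
    """One verification step: return the token that is actually emitted.

    token: token proposed by the drafter (sampled from q).
    p:     target-model distribution at this position.
    q:     drafter distribution at this position.
    """
    # Accept the draft with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p.get(token, 0.0) / q[token]):
        return token
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
p = {"sun": 0.7, "cloud": 0.3}   # toy target distribution
q = {"sun": 0.4, "cloud": 0.6}   # toy drafter distribution, deliberately different
n = 20000
emitted = {"sun": 0, "cloud": 0}
for _ in range(n):
    draft = rng.choices(list(q), weights=list(q.values()), k=1)[0]
    emitted[speculative_accept(draft, p, q, rng)] += 1
print({t: round(c / n, 2) for t, c in emitted.items()})
```

Even though the drafter prefers "cloud", the emitted frequencies track the target distribution `p`, which is the sense in which drafter quality affects speed but not output.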
What Google released on May 5, 2026. Paired drafter checkpoints for all four IT-tuned Gemma 4 variants (gemma-4-E2B-it-assistant, gemma-4-E4B-it-assistant, gemma-4-26B-A4B-it-assistant, gemma-4-31B-it-assistant), discoverable via the google/gemma-4 Hugging Face collection and on Kaggle Models. The drafters share the input embedding table with their paired target and consume the target's last-layer activations (architecture per the MTP overview). For this target the paired drafter is google/gemma-4-26B-A4B-it-assistant (0.4B params, BF16). Google reports up to 3× decode speedup with no quality degradation on the 26B-A4B configuration, and 2.2× on Apple Silicon at batch sizes 4–8. Tested runtimes named in the blog: LiteRT-LM, MLX, Hugging Face Transformers, vLLM, SGLang, Ollama.
Integration cost is one kwarg in Hugging Face Transformers:
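import torch
from transformers import AutoModelForCausalLM
# Other from_pretrained kwargs are elided; placing the drafter on the same GPU as the target is an assumption.
target = AutoModelForCausalLM.from_pretrained("Truthseeker87/solarhive-26b-a4b-nf4", device_map="cuda:0")
assistant = AutoModelForCausalLM.from_pretrained("google/gemma-4-26B-A4B-it-assistant", dtype=torch.bfloat16, device_map="cuda:0")
target.generate(**inputs, assistant_model=assistant)  # MTP enabled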
The integration ships as a gated future-iteration cell (§14, _RUN_MTP_DEMO = False) in solarhive_inference.py; reviewers can flip the flag to reproduce a baseline-vs-MTP comparison under argmax decoding.
Open question specific to this NF4-quantized target. Per the 2023 speculative-sampling guarantee, correctness is invariant to drafter quality — the target's verification step preserves the exact output distribution regardless of what the drafter proposes. What varies is acceptance rate α, since Google's released drafter is a BF16 model trained against the base gemma-4-26B-A4B-it, not against this NF4-quantized LoRA-merged target. Measured α for the BF16-drafter × NF4-target pairing is the planned post-hackathon contribution.
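To see why the acceptance rate α matters, the speculative decoding paper gives a closed form for the expected number of tokens emitted per target forward pass, (1 − α^(γ+1)) / (1 − α), under the simplifying assumption that acceptances are i.i.d. The sketch below evaluates it for a few illustrative α values; none of these are measurements of the BF16-drafter and NF4-target pairing, and γ = 4 is an arbitrary choice.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass with gamma draft tokens.

    Closed form (1 - alpha**(gamma + 1)) / (1 - alpha), assuming i.i.d.
    acceptance with rate alpha.
    """
    if alpha >= 1.0:
        # Every draft token is accepted, plus one token from the target itself.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Illustrative acceptance rates only; not measured on this model.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_step(alpha, gamma=4), 2))
```

Walltime speedup is lower than this token count because each step also pays for running the drafter, so a measured α is needed before any speedup can be claimed for this target.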
Companion Repositories
| Model | Repository | Purpose |
|---|---|---|
| SolarHive 26B A4B NF4 | This repo | Pre-quantized 4-bit cloud model for HF Spaces / resource-constrained GPUs |
| SolarHive 26B A4B Merged | solarhive-26b-a4b-merged | Full BF16 precision — production inference |
| SolarHive 26B A4B LoRA | solarhive-26b-a4b-lora | LoRA adapters from Unsloth fine-tune |
| SolarHive E4B LoRA | solarhive-e4b-lora | E4B adapter weights (~200 MB) — apply over base via Unsloth |
| SolarHive E4B safetensors | solarhive-e4b-ollama | Edge model — merged safetensors source for GGUF conversion via llama.cpp |
| SolarHive E4B GGUF | solarhive-e4b-gguf | Edge deployment — Q4_K_M GGUF + mmproj for Ollama / llama.cpp on 16 GB CPU laptop (10/10 benchmark) |
| SolarHive Dataset | solarhive-community-solar-multimodal | 1,727 training examples (1,713 text + 14 image-grounded) |
| Live Demo | HF Space | Interactive Gradio demo |
| LiteRT-LM Python edge runtime | solarhive_e4b_litert_v3.1.ipynb | LiteRT Special Tech Track entry: runs upstream base litert-community/gemma-4-E4B-it-litert-lm .litertlm (3.66 GB) + SolarHive UX layer + on-device agentic loop. Q&A 8/8 on Colab Pro CPU + High-RAM. Fine-tuned LiteRT-LM bundle is a planned next iteration once upstream gemma4 example module lands in ai_edge_torch.generative.examples/. |
| GitHub | the-gemma4-good-hackathon-solarhive | Full source code, training and quantization notebooks, test_ollama_tools.py |
Citation
@misc{solarhive2026,
title={SolarHive: AI-Powered Community Solar Energy Intelligence},
author={Youshen Lim},
year={2026},
url={https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive},
note={Gemma 4 Good Hackathon submission — Google DeepMind x Kaggle}
}
Gemma is a trademark of Google LLC.
Papers for Truthseeker87/solarhive-26b-a4b-nf4
Prompt Repetition Improves Non-Reasoning LLMs
When2Call: When (not) to Call Tools
Fast Inference from Transformers via Speculative Decoding
