Libraries

How to use Jetlink/JetLLMLite-3.6 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Jetlink/JetLLMLite-3.6")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Jetlink/JetLLMLite-3.6")
model = AutoModelForImageTextToText.from_pretrained("Jetlink/JetLLMLite-3.6")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Jetlink/JetLLMLite-3.6 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Jetlink/JetLLMLite-3.6"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jetlink/JetLLMLite-3.6",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Jetlink/JetLLMLite-3.6

SGLang

How to use Jetlink/JetLLMLite-3.6 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Jetlink/JetLLMLite-3.6" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jetlink/JetLLMLite-3.6",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Jetlink/JetLLMLite-3.6" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jetlink/JetLLMLite-3.6",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Jetlink/JetLLMLite-3.6 with Docker Model Runner:
```
docker model run hf.co/Jetlink/JetLLMLite-3.6
```

JetLLMLite-3.6

This repository hosts an organization-managed copy of JetLLMLite-3.6 for advanced coding, reasoning, long-context, and agentic AI workloads.

It is intended for teams that want to manage deployment, access, and internal distribution from their own namespace while preserving compatibility with the upstream model ecosystem.

Model Summary

JetLLMLite-3.6 is an open-weight post-trained model released in Hugging Face Transformers format. According to the official model card, these artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, and KTransformers. The model is described as a Causal Language Model with Vision Encoder, with 35B total parameters, 3B activated parameters, and a native context length of 262,144 tokens that is extensible up to 1,010,000 tokens.

Key Features

35B total parameters
3B activated parameters
MoE-based architecture
Vision-language capability
Strong coding and agentic performance
262,144 native context length
Extensible context up to 1,010,000 tokens
Compatible with Transformers, vLLM, SGLang, and KTransformers

Intended Use

This model is suitable for:

advanced chat assistants
coding assistants
repository-level reasoning
agentic workflows
multimodal question answering
long-context document understanding
RAG and tool-using systems
enterprise AI applications
research and benchmarking

Model Details

Architecture

According to the official Qwen model card:

Model type: Causal Language Model with Vision Encoder
Training stage: Pre-training & Post-training
Total parameters: 35B
Activated parameters: 3B
Hidden dimension: 2048
Token embedding: 248320 (padded)
Number of layers: 40
Hidden layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
Number of experts: 256
Activated experts: 8 Routed + 1 Shared
Expert intermediate dimension: 512
MTP: trained with multi-steps
Context length: 262,144 natively and extensible up to 1,010,000 tokens
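
The hidden layout listed above can be unpacked with simple arithmetic. The following is a minimal sketch that uses only the numbers from the list; the constant and function names are hypothetical and are not part of any library.

```python
# Illustrative only: expands the "Hidden layout" entry from the architecture list.
# All names here are hypothetical helpers, not a model or library API.
BLOCKS = 10                 # outer repeat count in the layout
DELTANET_PER_BLOCK = 3      # Gated DeltaNet -> MoE layers per block
ATTENTION_PER_BLOCK = 1     # Gated Attention -> MoE layers per block
TOTAL_EXPERTS = 256
ACTIVE_EXPERTS = 8 + 1      # 8 routed + 1 shared


def layer_types() -> list[str]:
    """Return the token-mixer type of every layer, in order."""
    block = (["gated_deltanet"] * DELTANET_PER_BLOCK
             + ["gated_attention"] * ATTENTION_PER_BLOCK)
    return block * BLOCKS


layers = layer_types()
print(len(layers))                       # 40, matching "Number of layers"
print(layers.count("gated_attention"))   # 10
print(f"{ACTIVE_EXPERTS / TOTAL_EXPERTS:.1%} of experts active per token")
```

Read this way, every fourth layer uses Gated Attention and roughly 3.5% of the experts run for each token, which is consistent with the gap between 35B total and 3B activated parameters.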

Performance Positioning

The official model card highlights JetLLMLite-3.6 as an open-weight model built with a focus on stability and real-world utility, with notable emphasis on:

Agentic coding
Frontend workflows
Repository-level reasoning
Thinking preservation across historical context

Hardware Requirements

This model does not have a single universal minimum hardware requirement for every deployment scenario.

Actual requirements depend on:

inference backend
precision / quantization
batch size
context length
whether vision inputs are enabled
concurrency
latency targets
KV cache configuration
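
To make the effect of context length and batch size concrete, here is a rough sketch of a KV cache size estimate. It assumes that only the 10 Gated Attention layers from the hidden layout keep a per-token key/value cache, and it uses placeholder values for the number of KV heads and the head dimension, because this card does not state them. Treat the output as an order-of-magnitude planning aid, not a measurement.

```python
def kv_cache_bytes(
    context_tokens: int,
    attention_layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,   # 2 bytes per value, e.g. a 16-bit cache
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size: keys and values for every cached layer and token."""
    per_token = 2 * attention_layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# ASSUMPTIONS: kv_heads and head_dim are hypothetical placeholders.
# Read the real values from the model's config.json before relying on this.
HYPOTHETICAL_KV_HEADS = 2
HYPOTHETICAL_HEAD_DIM = 256

for tokens in (8_192, 262_144):
    size = kv_cache_bytes(tokens, 10, HYPOTHETICAL_KV_HEADS, HYPOTHETICAL_HEAD_DIM)
    print(f"{tokens:>7} tokens: ~{size / 2**30:.2f} GiB per sequence")
```

The estimate scales linearly with both context length and batch size, which is why those two factors appear in the list above.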

Minimum System Requirements

Because the model has 35B total parameters and 3B activated parameters, real VRAM usage can vary substantially depending on the runtime and workload profile. The upstream model card does not publish a single hard minimum VRAM number. However, for practical planning, the following guidance is reasonable:

Estimated practical minimum for heavily quantized local inference: around 24 GB VRAM
More realistic for smoother local / development usage: 48–80 GB VRAM
Recommended for production serving of the original model: multi-GPU or high-memory datacenter GPU environments
Recommended for long-context or multimodal serving: high-memory datacenter-class infrastructure

Note: these values are practical estimates for deployment planning, not universal hard limits. Real memory usage can increase significantly with longer contexts, multimodal inputs, larger batch sizes, and serving-framework overhead.
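
The figures above can be sanity-checked with a back-of-the-envelope calculation of weight memory. This sketch assumes all 35B parameters must be resident in memory (in a MoE model the router can select any expert, so the 3B activated figure reduces compute rather than storage) and it ignores KV cache, activations, and framework overhead. The function name is hypothetical.

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 35e9  # total parameters, as stated in this card

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB for weights alone")
```

Under these assumptions, 4-bit weights take about 17.5 GB, which leaves limited headroom on a 24 GB card, while 16-bit weights take about 70 GB before any cache or overhead is counted.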

Reference Hardware

The upstream model card provides official serving examples for this model with common inference stacks such as vLLM, SGLang, and Transformers, and includes a Docker example for SGLang. For practical deployment planning:

Quantized local experimentation: high-memory single-GPU environments may be sufficient
Standard production-oriented serving: modern datacenter GPUs are recommended
Long-context and higher-concurrency serving: multi-GPU deployment is the safer reference setup
Multimodal production workloads: high-memory server infrastructure is strongly recommended

Practical Recommendation

For most teams:

start with quantized evaluation if you are testing locally
benchmark using your real context lengths
use dedicated serving stacks such as vLLM or SGLang
reserve high-memory infrastructure for production-scale or long-context workloads
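
One way to act on the benchmarking recommendation is a small latency harness. The sketch below is an assumption about how such a harness could look: `benchmark` and `fake_generate` are hypothetical names, prompt length in characters is used as a crude proxy for tokens, and `fake_generate` is a stand-in that should be replaced with a real call to your serving endpoint.

```python
import statistics
import time
from typing import Callable


def benchmark(
    generate: Callable[[str], str],
    prompts: list[str],
    repeats: int = 3,
) -> dict[int, float]:
    """Map each prompt length (in characters) to its median latency in seconds."""
    results: dict[int, float] = {}
    for prompt in prompts:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            generate(prompt)
            timings.append(time.perf_counter() - start)
        results[len(prompt)] = statistics.median(timings)
    return results


def fake_generate(prompt: str) -> str:
    # Stand-in for a real request to a vLLM or SGLang server.
    return prompt[::-1]


results = benchmark(fake_generate, ["x" * 1_000, "x" * 100_000], repeats=3)
for length, seconds in results.items():
    print(f"{length:>7} chars: {seconds * 1000:.3f} ms median")
```

Using prompts drawn from your actual workload, rather than synthetic strings, gives numbers that reflect the context lengths you will serve in practice.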

Software Requirements

Recommended environment:

Python 3.10+
Linux
CUDA-enabled GPU environment
One of the following runtimes:
- Transformers
- vLLM
- SGLang
- KTransformers

The official model card also notes that the latest version of transformers is required for JetLLMLite-3.6 and recommends installing torchvision and pillow for multimodal use.

Common dependencies may include:

torch
transformers
torchvision
pillow
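
A quick way to check an environment against the list above is a short standard-library script. This is a minimal sketch; the function names are hypothetical, and it only checks that the packages can be found, not that their versions are recent enough.

```python
import importlib.util
import sys

# Import names for the dependencies listed above (pillow is imported as "PIL").
REQUIRED_PACKAGES = ["torch", "transformers", "torchvision", "PIL"]


def missing_packages(names: list[str]) -> list[str]:
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(minimum: tuple[int, int] = (3, 10)) -> bool:
    """Check that the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


print("Python 3.10+:", python_ok())
print("Missing packages:", missing_packages(REQUIRED_PACKAGES) or "none")
```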

Quickstart

Install Transformers:

pip install "transformers[serving]"

Basic loading example:

from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="Jetlink/JetLLMLite-3.6",
    trust_remote_code=True
)
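
The pipeline expects chat-style messages with typed content parts. The helper below is a small sketch of how such a message could be assembled; `build_image_question` is a hypothetical name, and the message layout mirrors the Transformers example earlier in this card.

```python
def build_image_question(image_url: str, question: str) -> list[dict]:
    """Build a single-turn user message with one image and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_image_question(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    "What animal is on the candy?",
)
print(messages[0]["content"][1]["text"])

# With a pipeline loaded as shown above, the call would be:
# result = pipe(text=messages)
```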

Serving Examples

vLLM

vllm serve Jetlink/JetLLMLite-3.6
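
Once the server is running, it can be called from Python as well as from curl. The sketch below builds the same OpenAI-compatible chat request using only the standard library. It assumes the server listens on `http://localhost:8000`, as in the curl example earlier in this card; `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(
    base_url: str,
    model: str,
    text: str,
    image_url: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    content = [{"type": "text", "text": text}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",  # assumed address of the vLLM server
    "Jetlink/JetLLMLite-3.6",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Sending the request requires a running server:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```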

SGLang

docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
      --model-path "Jetlink/JetLLMLite-3.6" \
      --host 0.0.0.0 \
      --port 30000
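
Loading a model of this size can take a while, so a script that starts the container may want to wait until the server answers before sending requests. The sketch below polls the server from Python. It assumes the server exposes a `/v1/models` route as part of its OpenAI-compatible API and listens on port 30000 as configured above; the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Return the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/v1/models"


def wait_until_ready(
    base_url: str,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll the server until it responds with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(models_url(base_url), timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    return False


# Example usage, once the container has been started:
# if wait_until_ready("http://localhost:30000"):
#     print("server is ready")
```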

Docker Model Runner

docker model run hf.co/Jetlink/JetLLMLite-3.6

Long Context Notes

JetLLMLite-3.6 natively supports a context of 262,144 tokens, which can be extended up to 1,010,000 tokens. The official model card includes YaRN-style RoPE scaling configuration guidance for long-context usage.
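
The relationship between the two context figures can be expressed as a scaling factor. The sketch below computes the smallest factor that covers a target length. The `rope_scaling` dictionary uses the key names Transformers accepts for YaRN, but the values and where to apply them are assumptions here; the upstream model card remains the authority on the exact settings.

```python
NATIVE_CONTEXT = 262_144
EXTENDED_CONTEXT = 1_010_000


def min_scaling_factor(target_tokens: int, native_tokens: int = NATIVE_CONTEXT) -> float:
    """Smallest scaling factor that covers target_tokens (never below 1.0)."""
    return max(1.0, target_tokens / native_tokens)


factor = min_scaling_factor(EXTENDED_CONTEXT)
print(f"{factor:.2f}")  # 3.85

# ASSUMPTION: illustrative override only; confirm the recommended factor and
# the config location against the upstream model card before using it.
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}
```

Requests that fit within the native 262,144 tokens yield a factor of 1.0, meaning no scaling is needed for them.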

Strengths

strong coding and agentic capabilities
improved frontend and repository-level reasoning
multimodal support
very long native context
extensible ultra-long context support
modern MoE architecture
compatibility with popular open inference frameworks

Limitations

infrastructure requirements can be substantial depending on deployment style
long-context inference can greatly increase memory pressure
multimodal workloads add additional overhead
local deployment practicality depends heavily on quantization and runtime choices
real-world latency and throughput vary significantly by framework and hardware configuration

Out-of-Scope / Cautionary Use

Outputs should be reviewed before use in:

medical decision-making
legal advice
safety-critical automation
high-stakes financial decisions
fully autonomous actions without validation
sensitive production workflows without guardrails

Human review, tool validation, and policy controls are strongly recommended.

License

This repository follows the same license as the upstream release.

License: Apache-2.0

If you redistribute, fine-tune, quantize, or otherwise modify this model, make sure your usage remains compliant with the upstream license and attribution requirements.

Attribution

Original upstream model:

Qwen/Qwen3.6-35B-A3B

This repository is an organization-managed copy and is not the original upstream source.

Citation

Please cite the original Qwen3.6 release (Qwen/Qwen3.6-35B-A3B) when using this model in research, evaluation, or production documentation.

Disclaimer

This repository may include packaging, naming, or deployment-oriented changes for organizational use.

For official updates, benchmark details, long-context settings, and upstream release notes, refer to the original Qwen model card.

JetLLMLite-3.6 (translated from Turkish)

This repository hosts an organization-managed copy of the JetLLMLite-3.6 model for advanced coding, reasoning, long-context, and agentic AI workloads.

This repository is prepared for teams that want to manage the model under their own namespace, control access, and simplify distribution. The goal is to enable organizational use while preserving compatibility with the upstream model ecosystem.

Model Summary

According to the official JetLLMLite-3.6 model card, these artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, and KTransformers. The model is described as a Causal Language Model with Vision Encoder; it offers 35B total parameters, 3B activated parameters, a native context of 262,144 tokens, and a context extensible up to 1,010,000 tokens.

Key Features

35B total parameters
3B activated parameters
MoE-based architecture
Vision-language capability
Strong coding and agentic performance
262,144-token native context
Context extensible up to 1,010,000 tokens
Compatibility with Transformers, vLLM, SGLang, and KTransformers

Intended Use

This model is suitable for the following scenarios:

advanced chat assistants
coding assistants
repository-level reasoning
agentic workflows
multimodal question answering
long-context document understanding
RAG and tool-using systems
enterprise AI applications
research and benchmarking work

Model Details

Architecture

According to the official Qwen model card:

Model type: Causal Language Model with Vision Encoder
Training stage: Pre-training & Post-training
Total parameters: 35B
Activated parameters: 3B
Hidden dimension: 2048
Token embedding: 248320 (padded)
Number of layers: 40
Hidden layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
Number of experts: 256
Activated experts: 8 Routed + 1 Shared
Expert intermediate dimension: 512
MTP: trained with multi-steps
Context length: 262,144 natively, and up to 1,010,000 tokens when extended

Performance Positioning

The official model card highlights JetLLMLite-3.6 particularly in the following areas:

Agentic coding
Frontend workflows
Repository-level reasoning
Preserving thinking context from earlier messages

Hardware Requirements

There is no single universal minimum hardware requirement for this model that fits every scenario.

Actual requirements depend on:

inference backend
precision / quantization
batch size
context length
whether vision inputs are enabled
concurrency
latency targets
KV cache configuration

Minimum System Requirements

Because the model has 35B total parameters and 3B activated parameters, real VRAM usage can vary substantially depending on the runtime and the workload. The upstream model card does not give a single hard minimum VRAM number. However, the following guidance is suitable for practical planning:

Estimated practical minimum for heavily quantized local inference: around 24 GB VRAM
More realistic level for smoother local / development usage: 48–80 GB VRAM
Recommended for production serving of the original model: multi-GPU or high-memory datacenter GPU environments
Recommended for long-context or multimodal serving: high-memory datacenter-class infrastructure

Note: these are practical estimates for deployment planning, not universal hard limits. Longer contexts, multimodal inputs, large batch sizes, and serving-framework overhead can increase real memory usage significantly.

Reference Hardware

The upstream model card provides official serving examples for this model with inference stacks such as vLLM, SGLang, and Transformers, and also includes a Docker example for SGLang. For practical deployment planning:

Quantized local experimentation: high-memory single-GPU environments may be sufficient
Standard production serving: modern datacenter GPUs are recommended
Long context and higher concurrency: multi-GPU deployment is the safer reference
Multimodal production workloads: high-memory server infrastructure is strongly recommended

Practical Recommendation

For most teams, the most sensible approach is to:

start with quantized evaluation when testing locally
benchmark with real context lengths
use dedicated serving stacks such as vLLM or SGLang
reserve high-memory infrastructure for production scale or long context

Software Requirements

Recommended environment:

Python 3.10+
Linux
CUDA-enabled GPU environment
One of the following runtimes:
- Transformers
- vLLM
- SGLang
- KTransformers

The official model card also states that the latest version of transformers is required for JetLLMLite-3.6 and that torchvision and pillow should be installed for multimodal use.

Common dependencies:

torch
transformers
torchvision
pillow

Quickstart

Install Transformers:

pip install "transformers[serving]"

Basic loading example:

from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="Jetlink/JetLLMLite-3.6",
    trust_remote_code=True
)

Serving Examples

vLLM

vllm serve Jetlink/JetLLMLite-3.6

SGLang

docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
      --model-path "Jetlink/JetLLMLite-3.6" \
      --host 0.0.0.0 \
      --port 30000

Docker Model Runner

docker model run hf.co/Jetlink/JetLLMLite-3.6

Long Context Notes

JetLLMLite-3.6 natively supports 262,144 tokens and can be extended up to 1,010,000 tokens. The official model card also includes YaRN-based RoPE scaling configuration guidance for long-context usage.

Strengths

strong coding and agentic capabilities
improved frontend and repository-level reasoning
multimodal support
very long native context
extensible ultra-long context
modern MoE architecture
compatibility with popular open inference frameworks

Limitations

infrastructure needs can be substantial depending on the deployment type
long-context inference can greatly increase memory pressure
multimodal workloads consume additional resources
the practicality of local use depends heavily on quantization and runtime choices
real-world latency and throughput vary significantly by framework and hardware configuration

Out-of-Scope / Cautionary Use

Outputs should not be used without human oversight in the following areas:

medical decision-making
legal advice
safety-critical automation
high-stakes financial decisions
fully autonomous actions without validation
sensitive production workflows without guardrails

Human review, tool validation, and policy controls are strongly recommended.

License

This repository follows the same license as the upstream release.

License: Apache-2.0

If you redistribute, fine-tune, quantize, or otherwise modify the model, make sure your usage complies with the upstream license and attribution requirements.

Attribution

Original upstream model:

Qwen/Qwen3.6-35B-A3B

This repository is an organization-managed copy and is not the original upstream source.

Citation

If you use this model in research, evaluation, or production documentation, please cite the original Qwen3.6 release.

Disclaimer

This repository may include some packaging, naming, or deployment-oriented changes for organizational use.

For official updates, benchmark details, long-context settings, and upstream release notes, refer to the original Qwen model card.

Safetensors

Model size: 36B params
Tensor type: BF16
