TensorTalk
TensorTalk is a fully deployed Universiti Malaya Faculty of Computer Science and Information Technology handbook QA system built around Qwen3-8B, supervised fine-tuning, metadata-aware RAG, an official-source web helper, and a guarded harness for traceable answers.
The project includes the research and training pipeline in this repository, a separately maintained model repository, and a complete Vercel-deployed frontend experience. The live application provides conversation history, handbook and official-web routing, semantic retrieval controls, answer traces, grounding status, and source-aware responses.
Live Deployment
Try TensorTalk: https://tensor-talk.vercel.app/
| Project Component | Link | Role |
|---|---|---|
| Live web application | tensor-talk.vercel.app | Public Vercel deployment for interacting with TensorTalk. |
| Frontend source code | github.com/nfdlh/tensor-talk | Source repository for the deployed web interface. |
| Model repository | huggingface.co/nfdlh/tensortalk | Related TensorTalk model repository used by the deployed project. |
| Training and research repository | TensorCat/TensorTalk/UM_Handbook | SFT, RAG, agent-harness, PPO, datasets, adapters, and evaluation artifacts. |
Deployment Architecture
User Browser
|
v
Vercel Frontend
https://tensor-talk.vercel.app/
|
+-- Conversation threads and responsive chat interface
+-- Semantic retrieval and routing controls
+-- Evidence, grounding, and tracing views
|
v
TensorTalk Model + RAG / Agent Harness
|
+-- UM handbook knowledge base
+-- Metadata-aware dense retrieval
+-- Official UM / FSKTM web-source helper
+-- Evidence and answer-grounding checks
The frontend deployment turns the research notebooks and model artifacts into a complete user-facing application. It exposes the system's intermediate retrieval and validation states instead of presenting TensorTalk as a black-box chatbot.
Project Demonstration
The following GIF is generated from the complete deployment walkthrough. The original HDR recording was brightness-normalized for readability and accelerated to keep the README demonstration practical.
What This Project Does
TensorTalk answers handbook-style questions about UM FSKTM academic rules, student guidance, programme details, facilities, dress-code guidance, industrial training, supervision policy, postgraduate requirements, and other faculty handbook topics.
The project compares three stages:
- Baseline 1: Closed-book SFT Qwen3-8B. Fine-tunes Qwen3-8B on handbook question-answer pairs and tests how much the model can answer from parameters alone.
- Baseline 2: SFT + metadata-aware RAG + agent harness. Adds dense retrieval over handbook chunks, metadata reranking, official-source web assistance, and guardrail checks before showing the final answer.
- Improved stage: PPO rule-reward post-training + RAG + agent harness. Experiments with rule-based reward shaping so responses are more grounded, concise, and aligned with the desired handbook-answer style.
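As a rough illustration of what a rule-based reward for the improved stage could look like, the sketch below scores a response on grounding (token overlap with the retrieved evidence), conciseness, and the absence of URLs that do not appear in the evidence. The function name `rule_reward`, the weights, and the word budget are hypothetical and are not taken from the project's PPO notebook.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def tokenize(text: str) -> list:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rule_reward(response: str, evidence: str, max_words: int = 120) -> float:
    """Hypothetical rule reward clipped to [-1, 1]; all weights are illustrative."""
    resp_tokens = tokenize(response)
    if not resp_tokens:
        return -1.0  # empty answers get the minimum reward

    evidence_tokens = set(tokenize(evidence))
    # Grounding: fraction of response tokens that also occur in the evidence.
    grounding = sum(t in evidence_tokens for t in resp_tokens) / len(resp_tokens)

    # Conciseness: full credit within the budget, decaying to 0 at twice the budget.
    n_words = len(response.split())
    concise = 1.0 if n_words <= max_words else max(0.0, 2.0 - n_words / max_words)

    # Penalize every URL in the response that is not present in the evidence.
    unsupported = [u for u in URL_PATTERN.findall(response) if u not in evidence]

    reward = 0.7 * grounding + 0.3 * concise - 0.5 * len(unsupported)
    return max(-1.0, min(1.0, reward))
```

A reward of this shape is cheap to compute per rollout, which is one reason rule rewards are attractive when no human preference data is available.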
System Design
| Layer | Purpose |
|---|---|
| Qwen3-8B base model | General language model foundation. |
| LoRA / QLoRA SFT | Adapts the model to UM FSKTM handbook QA style. |
| Handbook knowledge base | Structured chunks from the undergraduate, postgraduate, and general handbook sources. |
| Dense retrieval + FAISS | Retrieves candidate evidence using BAAI/bge-base-en-v1.5 embeddings. |
| Metadata-aware reranker | Uses scope, section, subsection, and keywords to reduce wrong-context answers. |
| Official web helper | Searches constrained official UM/FSKTM-related sources when local handbook evidence is not enough. |
| Harness engineering | Runs source guards, fake-URL guards, evidence checks, grounding checks, retry logic, and fallback rules. |
| Vercel frontend | Provides the deployed conversation workspace, history, retrieval controls, and trace views. |
| TensorTalk UI | Shows answers together with traceable RAG, web, and harness evidence panels. |
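To make the reranking layer more concrete, here is a minimal sketch of how a dense similarity score might be combined with metadata signals. The `Chunk` fields mirror the metadata named in the table (scope, section, subsection, keywords), but the field names, the bonus weights, and the `rerank` function itself are illustrative assumptions rather than the project's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: str
    dense_score: float  # similarity returned by the dense retriever
    scope: str          # e.g. "undergraduate", "postgraduate", "general"
    section: str = ""
    subsection: str = ""
    keywords: List[str] = field(default_factory=list)


def rerank(query: str, query_scope: Optional[str], candidates: List[Chunk]) -> List[Chunk]:
    """Reorder dense-retrieval candidates using hypothetical metadata bonuses."""
    query_terms = set(query.lower().split())

    def score(chunk: Chunk) -> float:
        s = chunk.dense_score
        # Prefer chunks whose scope matches the scope inferred for the query.
        if query_scope and chunk.scope == query_scope:
            s += 0.20
        # Small bonus when a query term appears in the section or subsection title.
        heading_terms = set(f"{chunk.section} {chunk.subsection}".lower().split())
        if query_terms & heading_terms:
            s += 0.10
        # Capped bonus for keyword matches so keywords cannot dominate the score.
        keyword_hits = sum(1 for kw in chunk.keywords if kw.lower() in query_terms)
        s += 0.05 * min(keyword_hits, 3)
        return s

    return sorted(candidates, key=score, reverse=True)
```

With this scheme an undergraduate chunk with a slightly lower dense score can outrank a postgraduate chunk when the query is clearly about undergraduate rules, which is the wrong-context failure the reranker is meant to reduce.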
Data Assets
The repository includes the core artifacts used to build and evaluate the assistant:
| Artifact | Path | Size / Role |
|---|---|---|
| SFT QA dataset | UM_Handbook/Dataset/SFT_Dataset/SFT_QA_Training_Ready.jsonl | 1,000 question-answer rows. |
| SFT metadata | UM_Handbook/Dataset/SFT_Dataset/SFT_QA_Metadata.jsonl | 1,000 rows with scope and source metadata. |
| RAG knowledge base | UM_Handbook/Dataset/RAG/UM_RAG_Knowledge_Base.jsonl | 521 retrieval chunks. |
| RAG evaluation set | UM_Handbook/Dataset/RAG/UM_RAG_Evaluation_Dataset.jsonl | 1,000 retrieval-evaluation rows. |
| Source chunk report | UM_Handbook/Dataset/Source Chunk Dataset/Source_Chunks_Dataset_report.json | Chunk distribution and preprocessing notes. |
| Baseline 2 LoRA adapter | UM_Handbook/outputs/baseline2_rag_harness_agent/lora_adapter/ | PEFT LoRA adapter and tokenizer assets. |
The 521 handbook chunks are split into 58 general, 250 postgraduate, and 213 undergraduate chunks. Low-information cover pages and divider pages are filtered before retrieval.
Evaluation Snapshot
| Component | Result |
|---|---|
| Dataset split | 800 train / 100 validation / 100 test, seed 42. |
| Baseline 2 train loss | 0.2748. |
| Retrieval eval size | 1,000 questions. |
| Hit@1 primary chunk | 82.1%. |
| Hit@3 primary chunk | 95.4%. |
| Hit@3 same knowledge group | 99.1%. |
| Scope match at rank 1 | 99.6%. |
| Plain generation token-F1 | 0.3391 on the sampled generation evaluation. |
| RAG generation token-F1 | 0.8460 on the same sampled evaluation. |
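The evaluation results show why the project moved beyond closed-book SFT: on the sampled generation evaluation, token-F1 rises from 0.3391 for plain generation to 0.8460 for RAG generation. The model can speak in the right academic tone after fine-tuning, but RAG and harness checks make the answers more evidence-grounded and easier to audit.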
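For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute Hit@k and token-level F1. The README does not state the exact tokenization or normalization used in the project's evaluation, so the regex tokenizer and the function names `hit_at_k` and `token_f1` are assumptions.

```python
import re
from collections import Counter
from typing import List


def hit_at_k(ranked_ids: List[str], gold_id: str, k: int) -> bool:
    """True if the gold chunk id appears among the top-k retrieved ids."""
    return gold_id in ranked_ids[:k]


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and a reference answer."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Averaging `hit_at_k` over all evaluation questions gives the Hit@1 and Hit@3 percentages, and averaging `token_f1` over generated answers gives a corpus-level token-F1.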
End-to-End Project Flow
- Handbook PDFs are converted into structured Markdown.
- Source chunks and question-answer datasets are built with scope and source metadata.
- Qwen3-8B is adapted with SFT using LoRA / QLoRA.
- BGE embeddings and FAISS retrieve handbook evidence, followed by metadata-aware reranking.
- The agent harness validates sources, rejects unsupported evidence, retries weak retrieval, and checks answer grounding.
- Rule-reward PPO experiments further shape response behavior.
- The Vercel frontend exposes the complete workflow through an interactive deployed experience.
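The harness step in the flow above can be pictured as a small control loop: retrieve, discard weak evidence, generate, check grounding, then retry with a wider search or fall back. The sketch below is a simplified illustration under stated assumptions; `run_harness`, its callback signatures, the score threshold, and the fallback message are all hypothetical and do not reproduce the notebook code.

```python
from typing import Callable, Dict, List

FALLBACK = "I could not find supporting evidence in the handbook for this question."


def run_harness(
    question: str,
    retrieve: Callable[[str, int], List[Dict]],
    generate: Callable[[str, List[Dict]], str],
    is_grounded: Callable[[str, List[Dict]], bool],
    min_score: float = 0.5,
    max_attempts: int = 2,
) -> Dict:
    """Retrieve, filter, generate, and verify, widening retrieval on each retry."""
    trace = []
    top_k = 3
    for attempt in range(1, max_attempts + 1):
        chunks = retrieve(question, top_k)
        # Evidence check: drop chunks below the (assumed) similarity threshold.
        evidence = [c for c in chunks if c["score"] >= min_score]
        trace.append({"attempt": attempt, "top_k": top_k, "kept": len(evidence)})
        if evidence:
            answer = generate(question, evidence)
            # Grounding check: only return answers supported by the evidence.
            if is_grounded(answer, evidence):
                return {"answer": answer, "grounded": True, "trace": trace}
        top_k *= 2  # retry with a wider retrieval window
    return {"answer": FALLBACK, "grounded": False, "trace": trace}
```

Returning the `trace` alongside the answer is what lets a frontend show intermediate retrieval and validation states instead of only the final text.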
Repository Map
UM_Handbook/
  Baseline_1_SFT_QWEN3_UM_Handbook_.ipynb
  Baseline_2_RAG_SFT_QWEN3_UM_Handbook_A100_intelligent_harness_agent.ipynb
  Improved_Model_PPO_QWEN3_UM_Handbook_RAG_Agent_Harness.ipynb
  UM_Handbook_Markdown_Preprocess.py
  UM_Source_Chunk_Dataset_Builder.py
  UM_SFT_QA_Dataset_Builder_from_Index.py
  Dataset/
    SFT_Dataset/
    RAG/
    Source Chunk Dataset/
  outputs/
    baseline2_rag_harness_agent/
      lora_adapter/
      retrieval_eval/
      generation_eval/
      rag_augmented_dataset/
Loading the Baseline 2 Adapter
The Baseline 2 adapter is stored in a subfolder of this repository. A typical PEFT loading flow is:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "Qwen/Qwen3-8B"
adapter_repo = "TensorCat/TensorTalk"
adapter_subfolder = "UM_Handbook/outputs/baseline2_rag_harness_agent/lora_adapter"

# The tokenizer assets are stored alongside the adapter.
tokenizer = AutoTokenizer.from_pretrained(
    adapter_repo,
    subfolder=adapter_subfolder,
    trust_remote_code=True,
)

# Load the base model in bfloat16 and let Accelerate place it on available devices.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Attach the LoRA adapter weights from the repository subfolder.
model = PeftModel.from_pretrained(
    base_model,
    adapter_repo,
    subfolder=adapter_subfolder,
)
model.eval()
For the full TensorTalk behavior shown in the project demonstration, use the adapter together with the RAG knowledge base, FAISS retriever, official-source web helper, and harness checks from the notebooks. The model weights alone do not include the live retrieval index or web-agent runtime.
For an immediate end-to-end demonstration, use the deployed TensorTalk web application.
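If you do wire the adapter to your own retriever, the retrieved chunks need to be placed in the prompt before generation. The sketch below shows one plausible way to assemble such a prompt. The template wording, the `build_rag_prompt` name, and the `scope` and `text` chunk fields are assumptions for illustration; the project's actual prompt format lives in the notebooks and may differ.

```python
from typing import Dict, List


def build_rag_prompt(question: str, chunks: List[Dict], max_chunks: int = 3) -> str:
    """Assemble a hypothetical evidence-first prompt from retrieved chunks."""
    lines = [
        "Answer the question using only the handbook evidence below.",
        "If the evidence is insufficient, say so.",
        "",
        "Evidence:",
    ]
    # Number each chunk so the answer can refer back to its source.
    for i, chunk in enumerate(chunks[:max_chunks], start=1):
        lines.append(f"[{i}] ({chunk['scope']}) {chunk['text']}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)
```

The resulting string can then be tokenized with the `tokenizer` loaded above and passed to `model.generate`.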
Intended Use
TensorTalk is intended for research, education, and demonstration of:
- handbook question answering for UM FSKTM content;
- RAG-grounded answer generation;
- metadata-aware retrieval and reranking;
- controlled agent behavior over official web sources;
- harness engineering for evidence checks, fake URL detection, retries, and fallback;
- comparing closed-book SFT against retrieval-grounded and reward-shaped systems.
Out-of-Scope Use
Do not use TensorTalk as an official university policy authority, legal adviser, disciplinary decision system, or fully autonomous student-support system. University policies can change, and final answers should be checked against the latest official UM/FSKTM documents when used for real administrative decisions.
Limitations
- The strongest behavior comes from the full runtime pipeline, not from the adapter by itself.
- RAG quality depends on the handbook chunks, retrieval metadata, and official-source availability.
- The web helper is intentionally constrained to trusted domains; it is not a general web search assistant.
- PPO in this project is rule-reward post-training, not a large-scale human-feedback RLHF pipeline.
- Some notebook paths reflect the original training environment and may need local path adjustment before rerunning.
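The trusted-domain constraint on the web helper can be illustrated with a small allowlist check. This is a sketch under assumptions: the README does not list the actual domains, so `TRUSTED_DOMAINS` below uses the public UM domain `um.edu.my` as a stand-in, and `is_trusted_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Assumed allowlist for illustration; the project's real list is not given here.
TRUSTED_DOMAINS = ("um.edu.my",)


def is_trusted_url(url: str, trusted=TRUSTED_DOMAINS) -> bool:
    """Accept only http(s) URLs whose host is a trusted domain or its subdomain."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    # Match the domain exactly or as a dot-separated suffix, so that
    # look-alike hosts such as "um.edu.my.evil.example" are rejected.
    return any(host == d or host.endswith("." + d) for d in trusted)
```

Matching on the parsed hostname rather than on a substring of the URL is the important design choice, since substring checks are easy to bypass with crafted paths or look-alike hosts.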
License
This project is released under the Apache 2.0 license.
