Instructions to use TensorCat/TensorTalk with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use TensorCat/TensorTalk with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="TensorCat/TensorTalk")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("TensorCat/TensorTalk", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use TensorCat/TensorTalk with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "TensorCat/TensorTalk" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "TensorCat/TensorTalk", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/TensorCat/TensorTalk
- SGLang
How to use TensorCat/TensorTalk with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "TensorCat/TensorTalk" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "TensorCat/TensorTalk", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "TensorCat/TensorTalk" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "TensorCat/TensorTalk", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use TensorCat/TensorTalk with Docker Model Runner:
docker model run hf.co/TensorCat/TensorTalk
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("TensorCat/TensorTalk", dtype="auto")TensorTalk
TensorTalk is a fully deployed Universiti Malaya Faculty of Computer Science and Information Technology handbook QA system built around Qwen3-8B, supervised fine-tuning, metadata-aware RAG, an official-source web helper, and a guarded harness for traceable answers.
The project includes the research and training pipeline in this repository, a separately maintained model repository, and a complete Vercel-deployed frontend experience. The live application provides conversation history, handbook and official-web routing, semantic retrieval controls, answer traces, grounding status, and source-aware responses.
Live Deployment
Try TensorTalk: https://tensor-talk.vercel.app/
| Project Component | Link | Role |
|---|---|---|
| Live web application | tensor-talk.vercel.app | Public Vercel deployment for interacting with TensorTalk. |
| Frontend source code | github.com/nfdlh/tensor-talk | Source repository for the deployed web interface. |
| Model repository | huggingface.co/nfdlh/tensortalk | Related TensorTalk model repository used by the deployed project. |
| Training and research repository | TensorCat/TensorTalk/UM_Handbook | SFT, RAG, agent-harness, PPO, datasets, adapters, and evaluation artifacts. |
Deployment Architecture
User Browser
|
v
Vercel Frontend
https://tensor-talk.vercel.app/
|
+-- Conversation threads and responsive chat interface
+-- Semantic retrieval and routing controls
+-- Evidence, grounding, and tracing views
|
v
TensorTalk Model + RAG / Agent Harness
|
+-- UM handbook knowledge base
+-- Metadata-aware dense retrieval
+-- Official UM / FSKTM web-source helper
+-- Evidence and answer-grounding checks
The frontend deployment turns the research notebooks and model artifacts into a complete user-facing application. It exposes the system's intermediate retrieval and validation states instead of presenting TensorTalk as a black-box chatbot.
Project Demonstration
The following GIF is generated from the complete deployment walkthrough. The original HDR recording was brightness-normalized for readability and accelerated to keep the README demonstration practical.
What This Project Does
TensorTalk answers handbook-style questions about UM FSKTM academic rules, student guidance, programme details, facilities, dress-code guidance, industrial training, supervision policy, postgraduate requirements, and other faculty handbook topics.
The project compares three stages:
Baseline 1: Closed-book SFT Qwen3-8B Fine-tunes Qwen3-8B on handbook question-answer pairs and tests how much the model can answer from parameters alone.
Baseline 2: SFT + metadata-aware RAG + agent harness Adds dense retrieval over handbook chunks, metadata reranking, official-source web assistance, and guardrail checks before showing the final answer.
Improved stage: PPO rule-reward post-training + RAG + agent harness Experiments with rule-based reward shaping so responses are more grounded, concise, and aligned with the desired handbook-answer style.
System Design
| Layer | Purpose |
|---|---|
| Qwen3-8B base model | General language model foundation. |
| LoRA / QLoRA SFT | Adapts the model to UM FSKTM handbook QA style. |
| Handbook knowledge base | Structured chunks from the undergraduate, postgraduate, and general handbook sources. |
| Dense retrieval + FAISS | Retrieves candidate evidence using BAAI/bge-base-en-v1.5 embeddings. |
| Metadata-aware reranker | Uses scope, section, subsection, and keywords to reduce wrong-context answers. |
| Official web helper | Searches constrained official UM/FSKTM-related sources when local handbook evidence is not enough. |
| Harness engineering | Runs source guards, fake-URL guards, evidence checks, grounding checks, retry logic, and fallback rules. |
| Vercel frontend | Provides the deployed conversation workspace, history, retrieval controls, and trace views. |
| TensorTalk UI | Shows answers together with traceable RAG, web, and harness evidence panels. |
Data Assets
The repository includes the core artifacts used to build and evaluate the assistant:
| Artifact | Path | Size / Role |
|---|---|---|
| SFT QA dataset | UM_Handbook/Dataset/SFT_Dataset/SFT_QA_Training_Ready.jsonl |
1,000 question-answer rows. |
| SFT metadata | UM_Handbook/Dataset/SFT_Dataset/SFT_QA_Metadata.jsonl |
1,000 rows with scope and source metadata. |
| RAG knowledge base | UM_Handbook/Dataset/RAG/UM_RAG_Knowledge_Base.jsonl |
521 retrieval chunks. |
| RAG evaluation set | UM_Handbook/Dataset/RAG/UM_RAG_Evaluation_Dataset.jsonl |
1,000 retrieval-evaluation rows. |
| Source chunk report | UM_Handbook/Dataset/Source Chunk Dataset/Source_Chunks_Dataset_report.json |
Chunk distribution and preprocessing notes. |
| Baseline 2 LoRA adapter | UM_Handbook/outputs/baseline2_rag_harness_agent/lora_adapter/ |
PEFT LoRA adapter and tokenizer assets. |
The 521 handbook chunks are split into 58 general, 250 postgraduate, and 213 undergraduate chunks. Low-information cover pages and divider pages are filtered before retrieval.
Evaluation Snapshot
| Component | Result |
|---|---|
| Dataset split | 800 train / 100 validation / 100 test, seed 42. |
| Baseline 2 train loss | 0.2748. |
| Retrieval eval size | 1,000 questions. |
| Hit@1 primary chunk | 82.1%. |
| Hit@3 primary chunk | 95.4%. |
| Hit@3 same knowledge group | 99.1%. |
| Scope match at rank 1 | 99.6%. |
| Plain generation token-F1 | 0.3391 on the sampled generation evaluation. |
| RAG generation token-F1 | 0.8460 on the same sampled evaluation. |
The retrieval results show why the project moved beyond closed-book SFT. The model can speak in the right academic tone after fine-tuning, but RAG and harness checks make the answers more evidence-grounded and easier to audit.
End-to-End Project Flow
- Handbook PDFs are converted into structured Markdown.
- Source chunks and question-answer datasets are built with scope and source metadata.
- Qwen3-8B is adapted with SFT using LoRA / QLoRA.
- BGE embeddings and FAISS retrieve handbook evidence, followed by metadata-aware reranking.
- The agent harness validates sources, rejects unsupported evidence, retries weak retrieval, and checks answer grounding.
- Rule-reward PPO experiments further shape response behavior.
- The Vercel frontend exposes the complete workflow through an interactive deployed experience.
Repository Map
UM_Handbook/
Baseline_1_SFT_QWEN3_UM_Handbook_.ipynb
Baseline_2_RAG_SFT_QWEN3_UM_Handbook_A100_intelligent_harness_agent.ipynb
Improved_Model_PPO_QWEN3_UM_Handbook_RAG_Agent_Harness.ipynb
UM_Handbook_Markdown_Preprocess.py
UM_Source_Chunk_Dataset_Builder.py
UM_SFT_QA_Dataset_Builder_from_Index.py
Dataset/
SFT_Dataset/
RAG/
Source Chunk Dataset/
outputs/
baseline2_rag_harness_agent/
lora_adapter/
retrieval_eval/
generation_eval/
rag_augmented_dataset/
Loading the Baseline 2 Adapter
The Baseline 2 adapter is stored in a subfolder of this repository. A typical PEFT loading flow is:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model_id = "Qwen/Qwen3-8B"
adapter_repo = "TensorCat/TensorTalk"
adapter_subfolder = "UM_Handbook/outputs/baseline2_rag_harness_agent/lora_adapter"
tokenizer = AutoTokenizer.from_pretrained(
adapter_repo,
subfolder=adapter_subfolder,
trust_remote_code=True,
)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
model = PeftModel.from_pretrained(
base_model,
adapter_repo,
subfolder=adapter_subfolder,
)
model.eval()
For the full TensorTalk behavior shown in the screenshot, use the adapter together with the RAG knowledge base, FAISS retriever, official-source web helper, and harness checks from the notebooks. The model weights alone do not include the live retrieval index or web-agent runtime.
For an immediate end-to-end demonstration, use the deployed TensorTalk web application.
Intended Use
TensorTalk is intended for research, education, and demonstration of:
- handbook question answering for UM FSKTM content;
- RAG-grounded answer generation;
- metadata-aware retrieval and reranking;
- controlled agent behavior over official web sources;
- harness engineering for evidence checks, fake URL detection, retries, and fallback;
- comparing closed-book SFT against retrieval-grounded and reward-shaped systems.
Out-of-Scope Use
Do not use TensorTalk as an official university policy authority, legal adviser, disciplinary decision system, or fully autonomous student-support system. University policies can change, and final answers should be checked against the latest official UM/FSKTM documents when used for real administrative decisions.
Limitations
- The strongest behavior comes from the full runtime pipeline, not from the adapter by itself.
- RAG quality depends on the handbook chunks, retrieval metadata, and official-source availability.
- The web helper is intentionally constrained to trusted domains; it is not a general web search assistant.
- PPO in this project is rule-reward post-training, not a large-scale human-feedback RLHF pipeline.
- Some notebook paths reflect the original training environment and may need local path adjustment before rerunning.
License
This project is released under the Apache 2.0 license.

# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="TensorCat/TensorTalk")