SLIP-Llama: Sensor Language-Informed Pretraining (Llama-3.2-1B backbone)

Learning Transferable Sensor Models via Language-Informed Pretraining

Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell

Dartmouth College

[Code] [Gemma checkpoint] [Dataset] [SFT Dataset]

Backbone variants: Gemma-3-270M · Llama-3.2-1B (this repo)


Overview

This is the Llama-3.2-1B backbone variant of SLIP. It is the pretrained base checkpoint (post_train: false), before any task-specific SFT.

SLIP is a multimodal pretraining framework that learns language-aligned sensor representations transferable across diverse sensor setups. It pairs CLIP-style contrastive alignment with sensor-conditioned captioning, giving both discriminative understanding and generative reasoning over multivariate time series from heterogeneous sensors. This variant swaps the original Gemma-3-270M backbone for meta-llama/Llama-3.2-1B, repurposed into a unimodal text encoder (first split_layer layers) and a multimodal decoder (remaining layers extended with cross-attention to the sensor stream).
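The two training signals named above can be sketched as follows. This is a minimal, dependency-free illustration and not the repository's training code: the function names, the temperature of 0.07, and the equal loss weights are assumptions made for the example, and a real implementation would use batched tensor operations.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th sensor window is paired with the i-th caption."""
    s = [normalize(v) for v in sensor_emb]
    t = [normalize(v) for v in text_emb]
    n = len(s)
    # logits[i][j] = cosine similarity of sensor i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(s[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    sensor_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_sensor = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (sensor_to_text + text_to_sensor)

def caption_loss(token_logits, target_ids):
    """Mean next-token cross-entropy for one caption."""
    per_token = [-log_softmax(logits)[y] for logits, y in zip(token_logits, target_ids)]
    return sum(per_token) / len(per_token)

def slip_style_loss(sensor_emb, text_emb, token_logits, target_ids,
                    contrastive_weight=1.0, caption_weight=1.0):
    """Weighted sum of the alignment and captioning terms (weights are hypothetical)."""
    return (contrastive_weight * contrastive_loss(sensor_emb, text_emb)
            + caption_weight * caption_loss(token_logits, target_ids))
```

The contrastive term pulls matching sensor/text pairs together and pushes mismatched pairs apart, while the captioning term trains the decoder to predict caption tokens conditioned on the sensor stream.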

Architecture

| Component | Setting |
| --- | --- |
| LLM backbone | meta-llama/Llama-3.2-1B |
| Hidden size | 2048 |
| Vocab size | 128256 |
| Split layer (text-encoder / multimodal-decoder boundary) | 12 |
| Cross-attention heads (decoder) | 32 |
| Sensor pooler queries | 64 (num_img_queries) |
| Sensor pooler heads | 8 (img_attn_pool_num_heads) |
| Sensor encoder | Transformer, embed_dim=768, depth=12, heads=12, FlexMLP patch embedding + 2D RoPE |
| Total parameters | ~1.74B |
| Dtype | mixed: Llama backbone bfloat16, sensor encoder / pooler float32 |
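To make the settings above concrete, the sketch below collects them in a small dataclass and derives the layer split and per-head widths. `SlipLlamaArch` is a hypothetical name for illustration and is not the repository's `SlipConfig`. The total of 16 backbone layers is an assumption taken from the public Llama-3.2-1B configuration rather than from this card, and the head widths assume the usual even split of the embedding dimension across heads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlipLlamaArch:
    """Illustrative container for the architecture settings (hypothetical class)."""
    hidden_size: int = 2048
    vocab_size: int = 128256
    split_layer: int = 12
    cross_attn_heads: int = 32
    num_img_queries: int = 64
    img_attn_pool_num_heads: int = 8
    sensor_embed_dim: int = 768
    sensor_depth: int = 12
    sensor_heads: int = 12
    # Assumption: layer count from the public Llama-3.2-1B config, not from this card.
    num_llm_layers: int = 16

    def text_encoder_layers(self):
        """Indices of the backbone layers used as the unimodal text encoder."""
        return list(range(self.split_layer))

    def multimodal_decoder_layers(self):
        """Indices of the remaining layers, which receive cross-attention."""
        return list(range(self.split_layer, self.num_llm_layers))

    def cross_attn_head_dim(self):
        """Per-head width, assuming hidden_size is split evenly across heads."""
        return self.hidden_size // self.cross_attn_heads

    def sensor_head_dim(self):
        """Per-head width of the sensor encoder under the same assumption."""
        return self.sensor_embed_dim // self.sensor_heads

arch = SlipLlamaArch()
print(len(arch.text_encoder_layers()), arch.multimodal_decoder_layers())
print(arch.cross_attn_head_dim(), arch.sensor_head_dim())
```

Under these assumptions, 12 layers act as the text encoder and the last 4 form the multimodal decoder.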

Files

| File | Description |
| --- | --- |
| model.safetensors | Pretrained SLIP-Llama base weights (LoRA merged into the backbone) |
| config.json | SlipConfig for the Llama backbone |
| configuration_slip.py, modeling_slip.py | Custom model code (trust_remote_code) |
| multimodal_gemma.py, multimodal_llama.py, ts_transformer.py | Backbone wrappers + sensor encoder |
| tokenizer.json, tokenizer_config.json, special_tokens_map.json | Llama-3.2 tokenizer |

Datasets

Pretraining and SFT use the same data as the Gemma checkpoint (see the Dataset and SFT Dataset links above).

Quick Start

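from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required because the repo ships custom model code
# (configuration_slip.py, modeling_slip.py).
model = AutoModelForCausalLM.from_pretrained(
    "LeoChen085/SLIP-Llama", trust_remote_code=True
)
# The tokenizer is the standard Llama-3.2 tokenizer bundled with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("LeoChen085/SLIP-Llama")
model.eval()  # inference mode: disables dropout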

The sensor-conditioning API (flexi-patch sensor inputs, get_embedding, get_sensor_embedding, sensor-conditioned generate) is identical to the Gemma checkpoint — see the usage examples and sensor input format documented at LeoChen085/SLIP and the GitHub repo. The only difference is the backbone; embedding/projection dimensions follow the Llama hidden size (2048) rather than Gemma's 640.
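One common use of language-aligned embeddings is to rank candidate text labels against a sensor window by cosine similarity. The sketch below shows only that scoring step. The vectors are toy 4-dimensional stand-ins (real embeddings from this checkpoint would presumably be 2048-dimensional), `rank_labels` is a hypothetical helper, and the calls that produce the embeddings are deliberately left out because their signatures are documented in the linked repositories rather than here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_labels(sensor_embedding, label_embeddings):
    """Return (label, similarity) pairs sorted from best to worst match."""
    scores = {label: cosine(sensor_embedding, emb)
              for label, emb in label_embeddings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-ins for one sensor embedding and two text-label embeddings.
sensor_vec = [0.9, 0.1, 0.0, 0.2]
labels = {
    "walking": [1.0, 0.0, 0.0, 0.1],
    "sleeping": [0.0, 1.0, 0.3, 0.0],
}
ranking = rank_labels(sensor_vec, labels)
print(ranking[0][0])
```

Because only cosine similarity is involved, the same helper applies unchanged to embeddings of any dimension, as long as the sensor and text vectors live in the same space.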

Citation

@article{chen2026slip,
  title={Learning Transferable Sensor Models via Language-Informed Pretraining},
  author={Chen, Yuliang and Pillai, Arvind and Wu, Yu Yvonne and Griffin, Tess Z. and Marsch, Lisa and Heinz, Michael V. and Jacobson, Nicholas C. and Campbell, Andrew},
  journal={Preprint},
  year={2026}
}

License

The SLIP model code is released under the MIT License. This checkpoint embeds (LoRA-merged) weights derived from Llama-3.2-1B and is therefore additionally governed by the Llama 3.2 Community License. Built with Llama.
