SLIP-Llama: Sensor Language-Informed Pretraining (Llama-3.2-1B backbone)

Learning Transferable Sensor Models via Language-Informed Pretraining

Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell

Dartmouth College

[Code] [Gemma checkpoint] [Dataset] [SFT Dataset]

Backbone variants: Gemma-3-270M · Llama-3.2-1B (this repo)

Overview

This is the Llama-3.2-1B backbone variant of SLIP. It is the pretrained base checkpoint (post_train: false), before any task-specific SFT.

SLIP is a multimodal pretraining framework that learns language-aligned sensor representations transferable across diverse sensor setups. It pairs CLIP-style contrastive alignment with sensor-conditioned captioning, giving both discriminative understanding and generative reasoning over multivariate time series from heterogeneous sensors. This variant swaps the original Gemma-3-270M backbone for meta-llama/Llama-3.2-1B, repurposed into a unimodal text encoder (first split_layer layers) and a multimodal decoder (remaining layers extended with cross-attention to the sensor stream).
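The two objectives named above (contrastive alignment plus captioning) can be illustrated with a small, dependency-free sketch. It is not the SLIP training code: the temperature of 0.07, the equal loss weights `w_con` and `w_cap`, and all function names are assumptions chosen for illustration. The sketch shows a symmetric InfoNCE loss over paired sensor/text embeddings added to a next-token cross-entropy over caption tokens.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE; row i of each batch is assumed to be a matched pair.

    The temperature value is an assumption, not taken from the SLIP config.
    """
    s = [normalize(v) for v in sensor_emb]
    t = [normalize(v) for v in text_emb]
    # logits[i][j] = cosine similarity of sensor i and caption j, scaled.
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
              for si in s]
    n = len(logits)
    sensor_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_sensor = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (sensor_to_text + text_to_sensor)

def caption_loss(token_logits, target_ids):
    """Mean next-token cross-entropy over the caption positions."""
    nll = [-log_softmax(step)[tok] for step, tok in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)

def combined_loss(sensor_emb, text_emb, token_logits, target_ids,
                  w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights are hypothetical."""
    return (w_con * contrastive_loss(sensor_emb, text_emb)
            + w_cap * caption_loss(token_logits, target_ids))

# Toy batch: two sensor windows, two captions, one caption token each.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
uniform = caption_loss([[0.0, 0.0, 0.0, 0.0]], [2])
```

In this toy batch, correctly paired embeddings give a much lower contrastive loss than shuffled pairs, and uniform logits over four tokens give a captioning loss of log 4.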

Architecture

Component	Setting
LLM backbone	`meta-llama/Llama-3.2-1B`
Hidden size	2048
Vocab size	128256
Split layer (text-encoder / multimodal-decoder boundary)	12
Cross-attention heads (decoder)	32
Sensor pooler queries	64 (`num_img_queries`)
Sensor pooler heads	8 (`img_attn_pool_num_heads`)
Sensor encoder	Transformer, `embed_dim=768`, `depth=12`, `heads=12`, FlexMLP patch embedding + 2D RoPE
Total parameters	~1.74B
Dtype	mixed — Llama backbone `bfloat16`, sensor encoder / pooler `float32`
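A few quantities follow directly from the table and may help when sizing hardware or reading the model code. The sketch below rests on three assumptions: the 16-layer depth comes from the public Llama-3.2-1B configuration rather than from this card, the cross-attention is assumed to operate at the backbone hidden size, and the class and method names are hypothetical. Because the card does not say how the ~1.74B parameters divide between `bfloat16` and `float32`, the memory figure is given only as a lower and upper bound.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlipLlamaSpec:
    """Values copied from the architecture table unless marked otherwise."""
    hidden_size: int = 2048
    num_layers: int = 16        # assumption: public Llama-3.2-1B config
    split_layer: int = 12
    cross_attn_heads: int = 32
    sensor_embed_dim: int = 768
    sensor_heads: int = 12
    total_params: float = 1.74e9

    def layer_split(self):
        """Layer indices for the text encoder and the multimodal decoder."""
        text_encoder = range(0, self.split_layer)
        multimodal_decoder = range(self.split_layer, self.num_layers)
        return text_encoder, multimodal_decoder

    def head_dims(self):
        """Per-head width, assuming heads evenly divide the model width."""
        return {
            "cross_attn": self.hidden_size // self.cross_attn_heads,
            "sensor_encoder": self.sensor_embed_dim // self.sensor_heads,
        }

    def weight_memory_bounds_gb(self):
        """(all-bfloat16, all-float32) weight size in GB; mixed dtype lies between."""
        return self.total_params * 2 / 1e9, self.total_params * 4 / 1e9

spec = SlipLlamaSpec()
text_layers, decoder_layers = spec.layer_split()
low_gb, high_gb = spec.weight_memory_bounds_gb()
```

Under these assumptions the text encoder covers 12 layers and the multimodal decoder the remaining 4, and the weights alone occupy between roughly 3.5 GB and 7 GB.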

Files

File	Description
`model.safetensors`	Pretrained SLIP-Llama base weights (LoRA merged into the backbone)
`config.json`	`SlipConfig` for the Llama backbone
`configuration_slip.py`, `modeling_slip.py`	Custom model code (`trust_remote_code`)
`multimodal_gemma.py`, `multimodal_llama.py`, `ts_transformer.py`	Backbone wrappers + sensor encoder
`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`	Llama-3.2 tokenizer

Datasets

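SLIP-Llama uses the same data as the Gemma checkpoint. This base checkpoint was trained on the pretraining pairs only; the SFT dataset is the one used for task-specific fine-tuning: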

LeoChen085/SlipDataset — 600K+ sensor-caption pretraining pairs
LeoChen085/SlipSFTDataset — task-specific SFT (HAR / sleep / ECG / TSQA / captioning)

Quick Start

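```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is needed because the repo ships custom model code
# (configuration_slip.py, modeling_slip.py).
model = AutoModelForCausalLM.from_pretrained(
    "LeoChen085/SLIP-Llama", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("LeoChen085/SLIP-Llama")
model.eval()  # inference mode: disables dropout
```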

The sensor-conditioning API (flexi-patch sensor inputs, get_embedding, get_sensor_embedding, sensor-conditioned generate) is identical to the Gemma checkpoint — see the usage examples and sensor input format documented at LeoChen085/SLIP and the GitHub repo. The only difference is the backbone; embedding/projection dimensions follow the Llama hidden size (2048) rather than Gemma's 640.
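Once sensor and text embeddings are available, language-aligned representations are typically compared by cosine similarity. The sketch below illustrates zero-shot labelling on that basis. It deliberately does not call `get_embedding` or `get_sensor_embedding`, whose signatures are not shown in this card: the 3-dimensional vectors are toy stand-ins for the 2048-dimensional embeddings, and the helper names `cosine` and `zero_shot_label` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_label(sensor_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the sensor embedding."""
    scores = {label: cosine(sensor_embedding, emb)
              for label, emb in label_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-ins; real vectors would come from the checkpoint's embedding methods.
label_embeddings = {
    "walking": [0.9, 0.1, 0.0],
    "sleeping": [0.0, 0.2, 0.9],
}
sensor_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_label(sensor_embedding, label_embeddings)
```

With these toy vectors the sensor embedding lies closest to the "walking" prompt. The same comparison applies unchanged to full-size embeddings, provided both modalities are projected into the same space.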

Citation

@article{chen2026slip,
  title={Learning Transferable Sensor Models via Language-Informed Pretraining},
  author={Chen, Yuliang and Pillai, Arvind and Wu, Yu Yvonne and Griffin, Tess Z. and Marsch, Lisa and Heinz, Michael V. and Jacobson, Nicholas C. and Campbell, Andrew},
  journal={Preprint},
  year={2026}
}

License

The SLIP model code is released under the MIT License. This checkpoint embeds (LoRA-merged) weights derived from Llama-3.2-1B and is therefore additionally governed by the Llama 3.2 Community License. Built with Llama.
