I-JEPA ViT-Huge Target Encoder

This repository provides the target encoder weights for an I-JEPA Vision Transformer Huge model.

Model summary

Model type: Vision Transformer (ViT-H/14 style configuration)
Training paradigm: self-supervised representation learning (I-JEPA)
Input image size: 224 × 224 pixels
Patch size: 14 × 14 pixels
Embedding dimension: 1280
Transformer depth: 32
Attention heads: 16
MLP ratio: 4.0
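
The values above determine several derived quantities. The following is a quick sketch of that arithmetic; the variable names are illustrative and are not taken from config.json, and the parameter estimate is a rough approximation rather than an exact count.

```python
# Derived quantities from the model summary (names are illustrative).
image_size = 224
patch_size = 14
embed_dim = 1280
depth = 32
num_heads = 16
mlp_ratio = 4.0

patches_per_side = image_size // patch_size    # patches along one image edge
num_patches = patches_per_side ** 2            # tokens per image
head_dim = embed_dim // num_heads              # channels per attention head
mlp_hidden_dim = int(embed_dim * mlp_ratio)    # hidden width of each MLP block

# Rough weight count for the transformer blocks only: 4*d^2 for attention
# (qkv + output projection) plus 8*d^2 for the MLP. Biases, layer norms,
# and the patch/position embeddings are ignored.
approx_block_params = depth * 12 * embed_dim ** 2

print(num_patches, head_dim, mlp_hidden_dim, approx_block_params)
```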

Repository contents

model_weights.pth: PyTorch checkpoint containing target encoder weights
config.json: architecture configuration used for this checkpoint

Usage

The checkpoint contains weights only. Instantiate a matching model architecture, then load the state dict:

import json
import torch

with open("config.json", "r") as f:
    cfg = json.load(f)

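# Build a model whose architecture matches the checkpoint.
# `build_target_encoder` is a placeholder for your own constructor; it is not provided here.
# model = build_target_encoder(**cfg)

# Load the tensors onto the CPU first; move the model to a GPU afterwards if needed.
state = torch.load("model_weights.pth", map_location="cpu")
# strict=True raises an error if any key is missing or unexpected.
# model.load_state_dict(state, strict=True)
# model.eval()  # disable dropout and other training-only behavior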
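
If loading with `strict=True` reports key mismatches, comparing key sets before loading can help locate the problem. The sketch below assumes, without confirmation from this repository, that a mismatch might come from a wrapper prefix such as `module.` (as left by `DistributedDataParallel`). The helpers `strip_prefix` and `diff_keys` and the toy key names are hypothetical.

```python
from typing import Any, Dict, Mapping, Set, Tuple


def strip_prefix(state: Mapping[str, Any], prefix: str = "module.") -> Dict[str, Any]:
    """Return a copy of `state` with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state.items()
    }


def diff_keys(state: Mapping[str, Any], expected: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected) key sets relative to the expected keys."""
    found = set(state.keys())
    return expected - found, found - expected


# Toy example: placeholder values stand in for tensors, and the key names are made up.
toy_state = {"module.pos_embed": 0, "module.norm.weight": 1}
cleaned = strip_prefix(toy_state)
missing, unexpected = diff_keys(cleaned, {"pos_embed", "norm.weight", "norm.bias"})

print(sorted(cleaned))
print(missing, unexpected)
```

With a real model, the expected set would be `set(model.state_dict().keys())`, and the cleaned dictionary would be passed to `model.load_state_dict`.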

Notes

This repository does not currently include training or evaluation scripts.
For downstream tasks, attach your own task head and fine-tune as needed.
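
As one possible shape for such a head, the sketch below mean-pools patch tokens and applies a linear classifier. It assumes the encoder returns a tensor of shape (batch, num_patches, 1280); this repository does not document the output format, so check it against your own model. `LinearHead` and `num_classes` are illustrative names.

```python
import torch
from torch import nn


class LinearHead(nn.Module):
    """Mean-pool patch tokens, normalize, then classify with one linear layer."""

    def __init__(self, embed_dim: int = 1280, num_classes: int = 10) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, embed_dim)
        pooled = tokens.mean(dim=1)  # average over the patch dimension
        return self.fc(self.norm(pooled))


# Random features stand in for encoder output (2 images, 256 patches each).
head = LinearHead()
features = torch.randn(2, 256, 1280)
logits = head(features)
print(logits.shape)
```

To train only the head, the encoder can be frozen with `model.requires_grad_(False)`; for full fine-tuning, leave its parameters trainable.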

Citation

If you use I-JEPA in your work, please cite:

@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
