I-JEPA ViT-Huge Target Encoder

This repository provides the target encoder weights for an I-JEPA Vision Transformer Huge model.

Model summary

  • Model type: Vision Transformer (ViT-H/14 style configuration)
  • Training paradigm: self-supervised representation learning (I-JEPA)
  • Input image size: 224 × 224 pixels
  • Patch size: 14 × 14 pixels
  • Embedding dimension: 1280
  • Transformer depth: 32
  • Attention heads: 16
  • MLP ratio: 4.0
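
The hyperparameters above determine several quantities you may need when building a matching model. The following is a minimal sketch, not code from this repository. The function name `derived_shapes` is hypothetical, and the parameter estimate assumes a standard ViT block (biased qkv and projection layers, two LayerNorms, a two-layer MLP), which may differ from the exact architecture behind this checkpoint.

```python
def derived_shapes(image_size=224, patch_size=14, embed_dim=1280,
                   depth=32, num_heads=16, mlp_ratio=4.0):
    """Derive token counts and a rough parameter estimate from the summary values."""
    grid = image_size // patch_size          # patches along each image side
    num_patches = grid * grid                # tokens per image
    head_dim = embed_dim // num_heads        # width of each attention head
    mlp_hidden = int(embed_dim * mlp_ratio)  # hidden width of the MLP

    # Assumed standard ViT block: qkv + output projection, MLP, two LayerNorms.
    attn = 4 * embed_dim * embed_dim + 4 * embed_dim
    mlp = 2 * embed_dim * mlp_hidden + mlp_hidden + embed_dim
    norms = 2 * 2 * embed_dim
    block_params = attn + mlp + norms

    return {
        "num_patches": num_patches,
        "head_dim": head_dim,
        "mlp_hidden": mlp_hidden,
        "block_params": block_params,
        "approx_encoder_params": depth * block_params,  # excludes patch and position embeddings
    }


shapes = derived_shapes()
print(shapes)
```

With the values listed above, this gives 256 patch tokens per image, a head dimension of 80, and an MLP hidden width of 5120.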

Repository contents

  • model_weights.pth: PyTorch checkpoint containing target encoder weights
  • config.json: architecture configuration used for this checkpoint

Usage

The checkpoint contains weights only. Instantiate a matching model architecture, then load the state dict:

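import json
import torch

# Read the architecture hyperparameters shipped with the checkpoint.
with open("config.json", "r") as f:
    cfg = json.load(f)

# Build your own model class from cfg; it must match the training architecture.
# build_target_encoder is a placeholder name, not a function shipped in this repository.
# model = build_target_encoder(**cfg)

# weights_only=True (PyTorch >= 1.13) restricts unpickling to tensors and plain containers.
state = torch.load("model_weights.pth", map_location="cpu", weights_only=True)
# model.load_state_dict(state, strict=True)  # raises on missing or unexpected keys
# model.eval()  # switch off dropout and other training-only behavior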
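
If strict loading fails, the usual cause is a mismatch between the checkpoint's key names and your model's. The following is a hedged sketch of two hypothetical helpers, `strip_prefix` and `diff_keys`, for diagnosing that. Whether this checkpoint's keys carry a wrapper prefix such as `module.` is an assumption, not something this repository states. Print the keys first and adapt as needed.

```python
from collections import OrderedDict


def strip_prefix(state, prefix="module."):
    """Return a copy of `state` with `prefix` removed from keys that start with it."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state.items()
    )


def diff_keys(expected, found):
    """Return (missing, unexpected) key lists, each sorted alphabetically."""
    expected, found = set(expected), set(found)
    return sorted(expected - found), sorted(found - expected)


# Toy example with placeholder values standing in for tensors.
toy_state = OrderedDict([("module.blocks.0.attn.qkv.weight", 1), ("pos_embed", 2)])
clean = strip_prefix(toy_state)
missing, unexpected = diff_keys(["pos_embed", "norm.weight"], clean.keys())
```

In practice you would pass `model.state_dict().keys()` as `expected` and the loaded checkpoint's keys as `found`, then resolve any differences before calling `load_state_dict` with `strict=True`.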

Notes

  • This repository currently does not include training or evaluation scripts.
  • For downstream tasks, attach your own task head and fine-tune as needed.
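
As one illustration of attaching a task head, the sketch below wraps an encoder with a linear classifier. It is not part of this repository. The class name `LinearProbe` and the `num_classes` value are hypothetical, and the code assumes the encoder returns patch tokens of shape (batch, num_patches, embed_dim). Mean pooling over tokens is one common choice, not the only one.

```python
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """Linear classifier on mean-pooled encoder tokens (illustrative sketch)."""

    def __init__(self, encoder, embed_dim=1280, num_classes=10, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder
        self.freeze_encoder = freeze_encoder
        if freeze_encoder:
            # Keep the pretrained weights fixed; only the head receives gradients.
            for param in self.encoder.parameters():
                param.requires_grad = False
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images):
        if self.freeze_encoder:
            with torch.no_grad():
                tokens = self.encoder(images)  # assumed shape: (B, N, D)
        else:
            tokens = self.encoder(images)
        pooled = tokens.mean(dim=1)            # average over patch tokens -> (B, D)
        return self.head(pooled)               # class logits -> (B, num_classes)
```

To fine-tune the whole network instead, pass `freeze_encoder=False`. When the encoder is frozen, also call `encoder.eval()` after any `model.train()` call so its dropout and normalization layers stay in inference mode.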

Citation

If you use I-JEPA in your work, please cite:

@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}