# MorphStream Models

Models and TensorRT engine cache for real-time face processing used by MorphStream GPU Worker.

Private repository — requires an access token for downloads.

## Structure

```
/
├── inswapper_128.onnx           # Standard face swap (529MB)
├── inswapper_128_fp16.onnx      # FP16 optimized - default (265MB)
├── hyperswap_1a_256.onnx        # HyperSwap variant A (384MB)
├── hyperswap_1b_256.onnx        # HyperSwap variant B (384MB)
├── hyperswap_1c_256.onnx        # HyperSwap variant C (384MB)
├── yolov8n.onnx                 # Person detection (12MB)
├── dfl_xseg.onnx                # XSeg v1 face segmentation — legacy (67MB)
├── xseg_1.onnx                  # XSeg occlusion model 1 (67MB)
├── xseg_2.onnx                  # XSeg occlusion model 2 (67MB)
├── xseg_3.onnx                  # XSeg occlusion model 3 (67MB)
├── 2dfan4.onnx                  # 68-point face landmarks (93MB)
├── bisenet_resnet_34.onnx       # BiSeNet face parsing ResNet-34 (89MB)
├── bisenet_resnet_18.onnx       # BiSeNet face parsing ResNet-18 (51MB)
├── buffalo_l/                   # Direct ONNX face analysis models
│   ├── det_10g.onnx             # SCRFD face detection FP32 (16MB)
│   ├── det_10g_fp16.onnx        # SCRFD face detection FP16 (8.1MB)
│   ├── w600k_r50.onnx           # ArcFace recognition embeddings (166MB)
│   ├── 1k3d68.onnx              # 3D landmarks, 68 points (137MB)
│   ├── 2d106det.onnx            # 2D landmarks, 106 points (4.8MB)
│   └── genderage.onnx           # Gender/age estimation (1.3MB)
├── gfpgan/                      # Face enhancement (not used in real-time)
│   ├── GFPGANv1.4.pth
│   └── weights/
│       ├── detection_Resnet50_Final.pth
│       └── parsing_parsenet.pth
├── trt_cache/                   # Pre-compiled TensorRT engines
│   ├── sm89/trt10.9_ort1.24/    # RTX 4090
│   ├── sm86/trt10.9_ort1.24/    # RTX 3090
│   └── ...                      # Other GPU arch + version combos
└── scripts/
    └── convert_scrfd_fp16.py    # FP32 → FP16 conversion utility
```

## Face Swap Models

| Model | Description | Size | Input | Format |
|---|---|---|---|---|
| inswapper_128.onnx | Standard quality | 529 MB | 128px | ONNX FP32 |
| inswapper_128_fp16.onnx | FP16 optimized (default) | 265 MB | 128px | ONNX FP16 |
| hyperswap_1a_256.onnx | High quality — variant A | 384 MB | 256px | ONNX FP32 |
| hyperswap_1b_256.onnx | High quality — variant B | 384 MB | 256px | ONNX FP32 |
| hyperswap_1c_256.onnx | High quality — variant C | 384 MB | 256px | ONNX FP32 |

## Face Analysis (buffalo_l)

Models originally from the InsightFace buffalo_l pack. The GPU Worker loads them directly via ONNX Runtime (DirectSCRFD, DirectArcFace, DirectLandmark106) rather than through the InsightFace Python library.

| Model | GPU Worker Class | Description | Size |
|---|---|---|---|
| det_10g.onnx | DirectSCRFD | SCRFD face detection (FP32) | 16 MB |
| det_10g_fp16.onnx | DirectSCRFD | SCRFD face detection (FP16, ~2x faster on Tensor Cores) | 8.1 MB |
| w600k_r50.onnx | DirectArcFace | ArcFace R50 face recognition embeddings | 166 MB |
| 2d106det.onnx | DirectLandmark106 | 2D face landmarks (106 points), CLAHE + face-angle rotation. Used in the face detection pipeline; 106-pt landmarks serve as a masking fallback when 68-pt landmarks are unavailable | 4.8 MB |
| 1k3d68.onnx | — | 3D face landmarks (68 points), not used at runtime | 137 MB |
| genderage.onnx | — | Gender and age estimation, not used at runtime | 1.3 MB |

## Face Landmarks

| Model | Description | Size | Input |
|---|---|---|---|
| 2dfan4.onnx | 2DFAN4 — 68-point face landmarks | 93 MB | 256px |

FaceFusion-style 5/68 refinement: SCRFD detects the face plus 5 coarse keypoints, then 2DFAN4 produces 68 precise landmarks, which are converted back to 5 alignment points (eye centers averaged from 6 points each, exact nose tip, exact mouth corners). This improves face-alignment quality for the swap models.
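The 5-from-68 conversion can be sketched with the standard 68-point indexing (a sketch only; the GPU Worker's actual helper may differ):

```python
import numpy as np

def landmarks_68_to_5(lm68: np.ndarray) -> np.ndarray:
    """Convert 68-point landmarks, shape (68, 2), to 5 alignment points.

    Standard 68-point indexing: each eye is the mean of its 6 contour
    points; nose tip and mouth corners are taken directly.
    """
    left_eye = lm68[36:42].mean(axis=0)   # 6 points around the left eye
    right_eye = lm68[42:48].mean(axis=0)  # 6 points around the right eye
    nose_tip = lm68[30]
    mouth_left = lm68[48]
    mouth_right = lm68[54]
    return np.stack([left_eye, right_eye, nose_tip, mouth_left, mouth_right])
```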

2DFAN4 is also the primary landmark source for face masking: its 68-pt landmarks feed custom_paste_back compositing (hull, cutouts, mouth blend), with the 106-pt landmarks from 2d106det.onnx as fallback. Dual-landmark support works via has_valid_68 (preferred) and has_valid_106 (fallback), with a use_68 flag propagated through all mask functions. Landmarks are temporally smoothed by a One Euro Filter in LandmarkSmoother (attribute face.landmark_2d_68).
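The smoothing step uses the well-known One Euro Filter; here is a minimal sketch with parameter names from the original paper (Casiez et al.), not necessarily LandmarkSmoother's actual interface or tuning:

```python
import math
import numpy as np

class OneEuroFilter:
    """Minimal One Euro Filter for temporal landmark smoothing (sketch)."""

    def __init__(self, freq: float = 30.0, min_cutoff: float = 1.0,
                 beta: float = 0.007, d_cutoff: float = 1.0):
        self.freq = freq              # sampling rate (frame rate), Hz
        self.min_cutoff = min_cutoff  # baseline smoothing cutoff
        self.beta = beta              # higher beta -> less lag on fast motion
        self.d_cutoff = d_cutoff      # cutoff for the derivative filter
        self.x_prev = None
        self.dx_prev = None

    def _alpha(self, cutoff):
        # Exponential smoothing factor for a given cutoff frequency.
        return 1.0 / (1.0 + self.freq / (2.0 * math.pi * cutoff))

    def __call__(self, x):
        x = np.asarray(x, dtype=np.float64)
        if self.x_prev is None:
            self.x_prev = x
            self.dx_prev = np.zeros_like(x)
            return x
        # Smooth the velocity, then adapt the cutoff to it: slow motion
        # gets heavy smoothing (less jitter), fast motion less (less lag).
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * np.abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```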

Source: FaceFusion assets.

## Person Detection

| Model | Description | Size | Input |
|---|---|---|---|
| yolov8n.onnx | YOLOv8n — person detection (COCO class 0) | 12 MB | 640px |

Used to distinguish "person left frame" vs "face occluded" during face swap.
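The disambiguation boils down to a small decision rule; a sketch with illustrative names, not the GPU Worker's actual API:

```python
def classify_face_loss(face_detected: bool, person_detected: bool) -> str:
    """When SCRFD stops reporting a face, YOLOv8n disambiguates the cause:
    a person still in frame means the face is occluded; no person means
    they left the frame."""
    if face_detected:
        return "tracking"
    return "occluded" if person_detected else "person_left_frame"
```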

## Face Mask Models (FaceFusion 4-Mask System)

Occlusion detection (XSeg) and semantic face parsing (BiSeNet) models for composable mask pipeline. Used in GPU Worker's face_masker.py for box/occlusion/area/region masks.

Source: FaceFusion 3.x assets (Apache-2.0), mirrored here for reliability.

### XSeg — Occlusion Detection

| Model | Description | Size | Input | Output |
|---|---|---|---|---|
| dfl_xseg.onnx | XSeg v1 — legacy binary face mask (not used) | 67 MB | 256px | binary (face/bg) |
| xseg_1.onnx | XSeg model 1 — occlusion detection | 67 MB | 256px | binary (face/bg) |
| xseg_2.onnx | XSeg model 2 — occlusion detection | 67 MB | 256px | binary (face/bg) |
| xseg_3.onnx | XSeg model 3 — occlusion detection | 67 MB | 256px | binary (face/bg) |

Runtime model selection via IPC: `many` (all three intersected), `xseg_1`, `xseg_2`, `xseg_3`. Input: NHWC float32 in [0, 1]. Output: the intersection of all selected models' masks (most conservative).
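The `many` intersection can be sketched as a per-pixel minimum over the selected models' outputs (a sketch, not the face_masker.py implementation):

```python
import numpy as np

def combine_xseg_masks(masks: list) -> np.ndarray:
    """Intersect soft masks (float32 in [0, 1], e.g. shape (256, 256)).

    Taking the per-pixel minimum keeps only pixels that every selected
    XSeg model agrees belong to the face: the most conservative estimate.
    """
    out = masks[0]
    for m in masks[1:]:
        out = np.minimum(out, m)
    return out
```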

### BiSeNet — Region Segmentation

| Model | Description | Size | Input | Classes |
|---|---|---|---|---|
| bisenet_resnet_34.onnx | BiSeNet ResNet-34 (default) | 89 MB | 512px | 19 regions |
| bisenet_resnet_18.onnx | BiSeNet ResNet-18 (lighter) | 51 MB | 512px | 19 regions |

Runtime model selection via IPC. Input: NCHW float32, ImageNet-normalized. Ten configurable face regions: skin, left-eyebrow, right-eyebrow, left-eye, right-eye, glasses, upper-lip, nose, lower-lip, mouth.
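A sketch of the preprocessing implied above (assumes the 512px crop is already done; not the actual face_masker.py code):

```python
import numpy as np

# Standard ImageNet normalization constants.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_bisenet(rgb_512: np.ndarray) -> np.ndarray:
    """uint8 HWC image (512, 512, 3) -> float32 NCHW batch (1, 3, 512, 512)."""
    x = rgb_512.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD
    return x.transpose(2, 0, 1)[None]  # HWC -> CHW -> add batch dim
```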

## TensorRT Engine Cache

Pre-compiled TensorRT engines stored in trt_cache/ subfolder, keyed by GPU architecture and software versions. Eliminates cold-start TRT compilation (~180-300s) on new GPU instances.

### Layout

```
trt_cache/
├── sm89/trt10.9_ort1.24/           # RTX 4090 (Ada Lovelace)
│   ├── manifest.json               # Metadata: cache_key, engine list, timestamps
│   ├── TensorrtExecutionProvider_*.engine   # Compiled TRT engines
│   ├── TensorrtExecutionProvider_*.profile  # Profiling data
│   └── timing.cache                # cuDNN/TRT timing optimization cache
├── sm86/trt10.9_ort1.24/           # RTX 3090 (Ampere)
│   └── ...
└── sm80/trt10.9_ort1.24/           # A100 (Ampere)
    └── ...
```

### Cache Key

Format: `{gpu_arch}/trt{trt_version}_ort{ort_version}`

| Component | Example | Source |
|---|---|---|
| gpu_arch | sm89 | `nvidia-smi --query-gpu=compute_cap` → 8.9 → sm89 |
| trt_version | 10.9 | `tensorrt.__version__` major.minor |
| ort_version | 1.24 | `onnxruntime.__version__` major.minor |
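Assembling the key from those components can be sketched as a pure function (a hypothetical helper, not the trt_cache.py API; `compute_cap` is the string reported by nvidia-smi):

```python
def make_cache_key(compute_cap: str, trt_version: str, ort_version: str) -> str:
    """Build the cache key from the components in the table above."""
    gpu_arch = "sm" + compute_cap.strip().replace(".", "")  # "8.9" -> "sm89"
    trt = ".".join(trt_version.split(".")[:2])   # keep major.minor only
    ort = ".".join(ort_version.split(".")[:2])
    return f"{gpu_arch}/trt{trt}_ort{ort}"
```

For example, `make_cache_key("8.9", "10.9.0.34", "1.24.1")` yields `"sm89/trt10.9_ort1.24"`.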

### Lifecycle

1. **Download** — at container boot, the GPU Worker checks HF for a matching cache key. If found, it downloads all engines (~10-30 s vs. ~180-300 s compile).
2. **Compile** — if no cache exists on HF, ONNX Runtime compiles TRT engines from scratch on first model load.
3. **Self-seed upload** — after compilation, engines are uploaded to HF so future instances skip compilation.
4. **Incremental upload** — if engines were downloaded from HF but new models were compiled locally afterwards (e.g., YOLOv8n during warmup), only the new engines are uploaded.
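Incremental-upload detection in step 4 amounts to diffing local engine files against the manifest; a sketch with illustrative names, not the actual trt_cache.py code:

```python
import json
from pathlib import Path

def engines_to_upload(cache_dir: Path) -> list:
    """Return engine files compiled locally since the cache was downloaded.

    No manifest means the cache was never downloaded, so everything is
    self-seeded; otherwise only files absent from engine_files are new.
    """
    manifest_path = cache_dir / "manifest.json"
    if not manifest_path.exists():
        return sorted(p.name for p in cache_dir.glob("*.engine"))
    known = set(json.loads(manifest_path.read_text())["engine_files"])
    local = {p.name for p in cache_dir.glob("*.engine")}
    return sorted(local - known)
```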

### manifest.json

```json
{
  "cache_key": "sm89/trt10.9_ort1.24",
  "gpu_arch": "sm89",
  "trt_version": "10.9",
  "ort_version": "1.24",
  "created_at": "2025-03-07T12:00:00Z",
  "machine_id": "C.12345",
  "engine_files": [
    "TensorrtExecutionProvider_model_hash.engine",
    "TensorrtExecutionProvider_model_hash.profile",
    "timing.cache"
  ]
}
```

The manifest serves as both metadata and an upload gate — its presence signals that the cache was downloaded, and the engine_files list enables incremental-upload detection.

## GFPGAN (optional, not used in real-time)

Face restoration and enhancement. Too slow for real-time streaming (~50-150ms per frame).

| Model | Description | Size |
|---|---|---|
| gfpgan/GFPGANv1.4.pth | GFPGAN v1.4 restoration | 332 MB |
| gfpgan/weights/detection_Resnet50_Final.pth | RetinaFace detector | 104 MB |
| gfpgan/weights/parsing_parsenet.pth | ParseNet segmentation | 81 MB |

## Usage

### GPU Worker (production)

Models are baked into the Docker image at build time (buffalo_l + default swap + landmark + mask models). Alternative swap models (HyperSwap) are downloaded on-demand by ModelDownloadService.

TRT engine cache is downloaded asynchronously at boot via trt_cache.py (non-blocking β€” /health responds immediately).

```bash
# Manual download (local development)
HF_TOKEN=hf_xxx ./scripts/download_models.sh /models
```

### Docker build

```bash
docker build --build-arg HF_TOKEN=hf_xxx -t morphstream-gpu-worker .
```

### Python (huggingface_hub)

```python
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="latark/MorphStream",
    filename="inswapper_128_fp16.onnx",
    token="hf_xxx",
)
```

## Scripts

### convert_scrfd_fp16.py

Converts SCRFD det_10g.onnx from FP32 to FP16:

```bash
pip install onnx onnxconverter-common
python scripts/convert_scrfd_fp16.py \
    --input buffalo_l/det_10g.onnx \
    --output buffalo_l/det_10g_fp16.onnx
```

Key: `op_block_list=['BatchNormalization']` prevents epsilon underflow (1e-5 → 0 in FP16 → NaN).

## License

MIT License
