🇮🇳 ByteZira Vaani-350M

ByteZira Vaani-350M is a custom decoder-only Transformer language model trained from scratch in PyTorch and integrated into the Hugging Face ecosystem through a custom Transformers wrapper.

The model was developed by Anshul Pal under ByteZira Technologies and trained on approximately 3.3 billion tokens using a modern GPT-style architecture featuring:

  • RoPE positional embeddings
  • RMSNorm normalization
  • SwiGLU feed-forward layers
  • Scaled dot-product attention (SDPA)
  • Flash Attention compatibility
  • KV-cache support
  • Weight tying
  • Gradient checkpointing

This is a pretrained foundation model and is not instruction-tuned yet.
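
Two of the components listed above, RoPE and RMSNorm, are compact enough to illustrate in plain Python. The sketch below is not the model's own code: the interleaved pairing of dimensions, the base of 10000, and the epsilon value are assumptions chosen because they are common defaults, and the helper names `rope`, `rms_norm`, and `dot` are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles.

    ASSUMPTION: interleaved pairing and base=10000; the model may pair dimensions differently.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the inverse of its root mean square, then by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (illustrative values only).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.0, 0.5, -0.5, 1.5]

# Same relative offset (2) at two different absolute positions.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 10), rope(k, 8))
print(score_a, score_b)
```

The two scores agree up to floating-point error, because rotating queries and keys by position-dependent angles makes their dot product depend only on the offset between the two positions.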


🚀 Model Highlights

  • ~350 Million Parameters
  • Trained on 3.3B Tokens
  • Custom GPT-style Architecture
  • Built Fully in PyTorch
  • Hugging Face Compatible
  • Flash Attention Ready
  • Modern LLM Components
  • Trained From Scratch

🏗️ Model Details

Property | Value
Model Name | ByteZira Vaani-350M
Parameters | ~350 Million
Architecture | Custom Decoder-only Transformer
Training Tokens | 3.3 Billion
Framework | PyTorch
HF Compatibility | Custom Transformers Wrapper
Developer | Anshul Pal
Organization | ByteZira Technologies

🧠 Architecture

Component | Details
Transformer Layers | 24
Attention Heads | 16
Embedding Size | 1024
Context Length | 768 tokens
Vocabulary Size | 50,257
Positional Encoding | RoPE
Normalization | RMSNorm
Feed Forward Network | SwiGLU
Attention | SDPA / Flash Attention Compatible
Weight Tying | Yes
Precision | FP16
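
The figures in the table above are enough for a rough parameter count. The sketch below is a back-of-the-envelope estimate, not a readout of the actual checkpoint. The table does not state the SwiGLU hidden size or whether linear layers carry biases, so the code assumes a hidden size of 8/3 times the embedding size and no biases; `estimate_parameters` is a hypothetical helper.

```python
def estimate_parameters(n_layers=24, d_model=1024, n_heads=16,
                        vocab_size=50_257, ffn_hidden=None):
    """Rough parameter count for a decoder-only Transformer with tied embeddings."""
    if ffn_hidden is None:
        # ASSUMPTION: SwiGLU hidden size of 8/3 * d_model (not stated in the table).
        ffn_hidden = int(8 * d_model / 3)

    embedding = vocab_size * d_model          # shared with the output head (weight tying)
    attention = 4 * d_model * d_model         # Q, K, V and output projections, no biases
    feed_forward = 3 * d_model * ffn_hidden   # SwiGLU uses gate, up and down matrices
    norms = 2 * d_model                       # two RMSNorm weight vectors per layer
    per_layer = attention + feed_forward + norms

    # RoPE adds no learned parameters; the last term is the final RMSNorm.
    total = embedding + n_layers * per_layer + d_model
    return {
        "head_dim": d_model // n_heads,
        "embedding": embedding,
        "per_layer": per_layer,
        "total": total,
    }

estimate = estimate_parameters()
print(f"head dim: {estimate['head_dim']}")
print(f"total parameters: {estimate['total'] / 1e6:.1f}M")
```

Under these assumptions the estimate comes to about 353 million parameters, in line with the stated ~350 million. A different feed-forward hidden size would shift the total.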

📚 Training Data

The model was trained using a weighted mixture of large-scale web and educational datasets.

Dataset | Weight
HuggingFaceFW/fineweb (sample-10BT) | 40%
HuggingFaceFW/fineweb-edu (sample-10BT) | 30%
Wikimedia Wikipedia | 30%
TinyStories + Book Corpus | 5–10%
LexoraNLP/anshullpal | 100%
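
A weighted mixture like the one in the table above is often implemented by drawing the source of each training document at random, in proportion to its weight. The sketch below is an assumption about how such sampling could work, not the author's data pipeline. It uses only the three rows that list a single percentage and normalizes them so they sum to 1; `normalize` and `sample_source` are hypothetical helpers.

```python
import random

# Relative weights taken from the rows that list a single percentage.
weights = {
    "HuggingFaceFW/fineweb (sample-10BT)": 40,
    "HuggingFaceFW/fineweb-edu (sample-10BT)": 30,
    "Wikimedia Wikipedia": 30,
}

def normalize(raw):
    """Convert relative weights into probabilities that sum to 1."""
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

def sample_source(probs, rng):
    """Pick the dataset that the next training document is drawn from."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]

probs = normalize(weights)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [sample_source(probs, rng) for _ in range(10_000)]
for name in probs:
    print(f"{name}: {draws.count(name) / len(draws):.3f}")
```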

⚙️ Training Configuration

Setting | Value
Optimizer | AdamW
Learning Rate | 3e-4
Minimum LR | 3e-5
Warmup Steps | 51,200
LR Scheduler | Cosine Decay
Gradient Accumulation | 128
Mixed Precision | FP16
Gradient Clipping | 1.0
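
The learning-rate settings above describe a warmup followed by cosine decay from 3e-4 down to 3e-5. The sketch below shows one common way to compute such a schedule. It assumes the warmup is linear, and `total_steps` is a hypothetical value because the total number of training steps is not given; `lr_at` is likewise a hypothetical helper, not the training script's function.

```python
import math

def lr_at(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=51_200):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # ASSUMPTION: linear ramp from 0 to max_lr during warmup.
        return max_lr * step / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 200_000  # hypothetical; not stated in the configuration
for step in (0, 25_600, 51_200, 125_600, 200_000):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

The rate rises to the peak at the end of warmup, passes the midpoint between the peak and the minimum halfway through the decay phase, and settles at the minimum once `total_steps` is reached.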

✨ Features

  • Custom Transformer Architecture
  • RoPE Positional Embeddings
  • RMSNorm
  • SwiGLU
  • Scaled Dot-Product Attention (SDPA)
  • Flash Attention Compatible
  • Hugging Face generate() Support
  • KV Cache Support
  • Gradient Checkpointing
  • Weight Tying

📊 Benchmark Results

The model was evaluated with the EleutherAI LM Evaluation Harness.

Task | Metric | Score
ARC Easy | Accuracy | 0.3312
HellaSwag | Accuracy | 0.2650
PIQA | Accuracy | 0.5631

Notes

  • Results are from the pretrained base checkpoint.
  • This model is not instruction-tuned yet.
  • Future versions with larger token counts and instruction tuning are planned.
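
The scores above are easier to read next to the accuracy that random guessing would achieve on each task. The sketch below assumes chance levels of 0.25 for ARC Easy (most questions have four options), 0.25 for HellaSwag (four endings), and 0.50 for PIQA (two solutions). These baselines are general properties of the benchmarks, not figures reported by this model card.

```python
# Scores copied from the benchmark table.
scores = {"ARC Easy": 0.3312, "HellaSwag": 0.2650, "PIQA": 0.5631}

# ASSUMPTION: approximate random-guess accuracy for each multiple-choice task.
chance = {"ARC Easy": 0.25, "HellaSwag": 0.25, "PIQA": 0.50}

# Margin over chance, in absolute accuracy points.
margins = {task: round(scores[task] - chance[task], 4) for task in scores}

for task, margin in margins.items():
    print(f"{task}: score {scores[task]:.4f}, chance {chance[task]:.2f}, margin {margin:+.4f}")
```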

📦 Installation

pip install transformers torch accelerate

🔥 Usage

Load Model

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM
)

model_id = "anshullpal/ByteZira-Vaani-350M-pretrain-base-model"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True
)

✍️ Text Generation

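import torch

prompt = "India is a land of"

# Tokenize the prompt into PyTorch tensors.
inputs = tokenizer(prompt, return_tensors="pt")

# Generation is inference only, so gradient tracking is switched off.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=80,        # generate at most 80 tokens after the prompt
        do_sample=True,           # sample instead of greedy decoding
        temperature=0.45,         # values below 1 sharpen the distribution
        top_k=40,                 # keep only the 40 most likely tokens
        top_p=0.82,               # nucleus sampling threshold
        repetition_penalty=1.35,  # penalize tokens that have already appeared
        no_repeat_ngram_size=4,   # never repeat the same 4-gram
        use_cache=False,          # KV cache is disabled in this example
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
    )

# Decode the first generated sequence back into text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))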
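
The `temperature`, `top_k`, and `top_p` arguments in the call above reshape the next-token distribution before a token is sampled. The sketch below illustrates the idea on a toy list of logits in plain Python. It is a simplified illustration, not the Transformers implementation, and `filter_probs` is a hypothetical helper.

```python
import math

def filter_probs(logits, temperature=0.45, top_k=40, top_p=0.82):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: dividing by a value below 1 widens the gaps between logits.
    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring tokens, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the kept tokens (subtract the max for numerical stability).
    peak = scaled[kept[0]]
    exps = {i: math.exp(scaled[i] - peak) for i in kept}
    norm = sum(exps.values())
    probs = {i: e / norm for i, e in exps.items()}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}

toy_logits = [2.0, 1.0, 0.0, -1.0]  # illustrative values only
sharp = filter_probs(toy_logits)                  # settings from the example above
flat = filter_probs(toy_logits, temperature=1.0)  # same filters, no sharpening
print(sharp)
print(flat)
```

With the low temperature used in the example, the toy distribution collapses onto the single most likely token, while at temperature 1.0 the nucleus keeps the top two. Lower temperatures and tighter `top_p` values trade diversity for more predictable output.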

🌐 Hugging Face Space

Try the live demo here:

👉 https://huggingface.co/spaces/anshullpal/Vaani-350M-Pretrain-Model


🔮 Future Plans

  • Instruction Tuned Version
  • Larger Context Length
  • 1B+ Parameter Models
  • Better Tokenizer
  • Multilingual Training
  • Quantized Variants
  • Chat Optimized Models

⚠️ Limitations

  • Not instruction-tuned
  • Can hallucinate (produce fluent but factually incorrect text)
  • Limited reasoning capability compared to larger LLMs
  • Primarily optimized for English text generation

📜 License

Apache-2.0 License


👨‍💻 Developer

Developed by Anshul Pal
Organization: ByteZira Technologies


⭐ Acknowledgements

Special thanks to:

  • Hugging Face
  • PyTorch
  • EleutherAI
  • FineWeb Dataset Contributors
  • Open-source AI Community