ks-pret-5m: A 5 million word, 12 million token Kashmiri pretraining dataset Paper • 2604.11066 • Published 4 days ago
synthocr-gen: A synthetic OCR dataset generator for low-resource languages - breaking the data barrier Paper • 2601.16113 • Published Jan 22
ks-lit-3m: A 3.1 million word Kashmiri text dataset for large language model pretraining Paper • 2601.01091 • Published Jan 3
600k-ks-ocr: A large-scale synthetic dataset for optical character recognition in Kashmiri script Paper • 2601.01088 • Published Jan 3