
Async Web Crawler

High-performance async web scraper for dataset collection.

Install

pip install aiohttp

Usage

python crawler.py seeds.txt output_dir/ --workers 100
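The `--workers` option suggests a fixed pool of concurrent tasks pulling URLs from a shared queue. The following is a minimal sketch of that pattern, not the actual contents of `crawler.py`. The names `crawl` and `fetch` are hypothetical, and `fetch` is passed in as a parameter so the example runs with the standard library alone.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable

# Hypothetical signature: takes a URL, returns the page body as text.
Fetch = Callable[[str], Awaitable[str]]


async def crawl(urls: Iterable[str], fetch: Fetch, workers: int = 100) -> Dict[str, str]:
    """Fetch every URL using at most `workers` concurrent tasks."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results: Dict[str, str] = {}

    async def worker() -> None:
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left, so this worker exits
            try:
                results[url] = await fetch(url)
            except Exception:
                # Assumption: a failed page is skipped so that one bad
                # host does not stop the rest of the crawl.
                pass

    # Start the pool and wait until every worker has drained the queue.
    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

In a real crawl, `fetch` would probably wrap an aiohttp request, for example `async with session.get(url) as resp: return await resp.text()` on a shared `aiohttp.ClientSession`. Capping the number of workers bounds the number of requests in flight regardless of how long the seed list is.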

Get Seeds

curl -sL https://tranco-list.eu/top-1m.csv.zip -o tranco.zip && unzip tranco.zip
awk -F, '{print "https://"$2"/"}' top-1m.csv > seeds.txt

Output

Each output file contains the source URL and the text extracted from that page.
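The exact file layout is not documented here. As an illustration only, the sketch below pulls visible text out of HTML with the standard library and formats a record with the URL on the first line. The names `TextExtractor`, `extract_text`, and `format_record`, and the record layout itself, are assumptions and may not match what `crawler.py` writes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def format_record(url: str, html: str) -> str:
    # Assumed layout: URL on the first line, extracted text below it.
    return url + "\n" + extract_text(html) + "\n"
```

Calling `format_record(url, html)` and writing the result to a file under `output_dir/` would give one text file per fetched page.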

OpenTransformers Ltd
