Bucket: OpelSpeedster/Thomas_LLM-storage

7.62 GB

14 files

Updated 24 days ago

Name	Size	Uploaded	Xet hash
data		24 days ago	5 items
.gitattributes	2.76 kB xet	24 days ago	c97243b9
Evaluation prompts.yaml	11.8 kB xet	24 days ago	53df3467
MLP_input_output.npy	819 MB xet	24 days ago	83d33078
README.md	1.06 kB xet	24 days ago	14f45a98
TinyStories-train.txt	1.92 GB xet	24 days ago	e2a1497e
TinyStories-valid.txt	19.4 MB xet	24 days ago	a41b122d
TinyStoriesV2-GPT4-train.txt	2.23 GB xet	24 days ago	02e40cc5
TinyStoriesV2-GPT4-valid.txt	22.5 MB xet	24 days ago	e9c9ab08
TinyStories_all_data.tar.gz	1.61 GB xet	24 days ago	a527719a

README.md

Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary.

Described in the following paper: https://arxiv.org/abs/2305.07759.

The models referred to in the paper were trained on TinyStories-train.txt; TinyStories-valid.txt can be used to compute validation loss. These models are available on Hugging Face at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M.
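As a rough illustration of how the text files could be consumed for a validation pass, here is a minimal sketch that streams stories one at a time. It assumes stories are separated by an `<|endoftext|>` marker, which this README does not state, so check the file before relying on it. The helper name `iter_stories` is hypothetical.

```python
import io
from typing import Iterator, TextIO

# Assumption: stories are separated by this marker (not stated in this README).
DELIMITER = "<|endoftext|>"


def iter_stories(stream: TextIO, delimiter: str = DELIMITER,
                 chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield stories one at a time without loading the whole file into memory."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; the tail may be
        # a partial story (or a partial delimiter), so keep it in the buffer.
        *complete, buffer = buffer.split(delimiter)
        for story in complete:
            story = story.strip()
            if story:
                yield story
    tail = buffer.strip()
    if tail:
        yield tail


# Small in-memory demo. For the real data, pass a file handle instead, e.g.
# open("TinyStories-valid.txt", encoding="utf-8").
demo = io.StringIO("A cat sat.\n<|endoftext|>\nA dog ran.\n")
stories = list(iter_stories(demo))
```

Reading in fixed-size chunks keeps memory use flat even for the multi-gigabyte training files, at the cost of a slightly more involved loop than a single `read().split(...)`.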

Additional resources: TinyStories_all_data.tar.gz - contains a superset of the stories, together with metadata and the prompt that was used to create each story.
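Since the internal layout of the archive is not described here, one cautious way to explore it is to list its members and peek at the start of one before extracting anything. The sketch below uses only the standard-library `tarfile` module. The function names `list_members` and `peek` are hypothetical, and it makes no assumption about member names or file formats.

```python
import tarfile
from typing import List, Tuple


def list_members(archive_path: str) -> List[Tuple[str, int]]:
    """Return (name, size in bytes) for every regular file in the archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


def peek(archive_path: str, member_name: str, n_bytes: int = 500) -> str:
    """Decode the first n_bytes of one member to inspect its layout."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        handle = archive.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name} is not a regular file")
        with handle:
            return handle.read(n_bytes).decode("utf-8", errors="replace")
```

For example, `list_members("TinyStories_all_data.tar.gz")` followed by `peek(...)` on one of the returned names shows how stories, metadata, and prompts are stored. Note that listing a gzip-compressed archive of this size requires scanning the whole file, so the first call may take a while.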

TinyStoriesV2-GPT4-train.txt - a new version of the dataset based on generations by GPT-4 only (the original dataset also includes GPT-3.5 generations, which are of lower quality). It contains, as a subset, all the GPT-4-generated examples from TinyStories.txt, but is significantly larger.

Evaluation prompts.yaml - list of prompts used to evaluate our models (see paper).

Last updated: May 28
