Buckets:
| Name | Size | Uploaded | Xet hash |
|---|---|---|---|
| data | 5 items | | |
| .gitattributes | 2.76 kB xet | c97243b9 | |
| Evaluation prompts.yaml | 11.8 kB xet | 53df3467 | |
| MLP_input_output.npy | 819 MB xet | 83d33078 | |
| README.md | 1.06 kB xet | 14f45a98 | |
| TinyStories-train.txt | 1.92 GB xet | e2a1497e | |
| TinyStories-valid.txt | 19.4 MB xet | a41b122d | |
| TinyStoriesV2-GPT4-train.txt | 2.23 GB xet | 02e40cc5 | |
| TinyStoriesV2-GPT4-valid.txt | 22.5 MB xet | e9c9ab08 | |
| TinyStories_all_data.tar.gz | 1.61 GB xet | a527719a | |
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary.
Described in the following paper: https://arxiv.org/abs/2305.07759.
The models referred to in the paper were trained on TinyStories-train.txt (the file TinyStories-valid.txt can be used for validation loss). These models can be found on Hugging Face at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M.
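The model path above is a shorthand for several repositories. A minimal sketch of expanding it into individual model IDs follows; the expansion rule (one ID per slash-separated suffix, each prefixed with `TinyStories-`) is our reading of the shorthand, and the function name `model_ids` is hypothetical.

```python
# Owner and suffixes are copied from the shorthand in the text above.
OWNER = "roneneldan"
SUFFIXES = ["1M", "3M", "8M", "28M", "33M", "1Layer-21M"]


def model_ids(owner=OWNER, suffixes=SUFFIXES):
    """Expand the shorthand into one repository ID per model size.

    Assumption: every suffix shares the "TinyStories-" prefix.
    """
    return [f"{owner}/TinyStories-{suffix}" for suffix in suffixes]


if __name__ == "__main__":
    for model_id in model_ids():
        print(model_id)
```

Each resulting ID could then be passed to a loader such as `transformers.AutoModelForCausalLM.from_pretrained`, assuming that library is installed and the repository exists under that name.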
Additional resources:
- TinyStories_all_data.tar.gz: contains a superset of the stories, together with metadata and the prompt that was used to create each story.
- TinyStoriesV2-GPT4-train.txt: a new version of the dataset based on generations by GPT-4 only (the original dataset also includes generations by GPT-3.5, which are of lower quality). It contains all the GPT-4-generated examples in TinyStories.txt as a subset, but is significantly larger.
- Evaluation prompts.yaml: the list of prompts used to evaluate our models (see the paper).
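Because the training text files are around 2 GB each, it may help to read them one story at a time rather than loading them whole. The sketch below shows one way to do that. It assumes that stories are separated by a line containing only `<|endoftext|>`; that delimiter is not stated on this page and should be checked against the actual file. The function name `iter_stories` is hypothetical.

```python
from typing import Iterator, TextIO

# Assumed story separator; verify against the downloaded file before relying on it.
DELIMITER = "<|endoftext|>"


def iter_stories(handle: TextIO, delimiter: str = DELIMITER) -> Iterator[str]:
    """Yield one story at a time without loading the whole file into memory."""
    buffer = []
    for line in handle:
        if line.strip() == delimiter:
            story = "".join(buffer).strip()
            if story:  # skip empty chunks between consecutive delimiters
                yield story
            buffer = []
        else:
            buffer.append(line)
    # Emit the trailing story if the file does not end with a delimiter.
    story = "".join(buffer).strip()
    if story:
        yield story
```

A usage sketch, assuming the validation file has already been downloaded to the working directory:

```python
with open("TinyStories-valid.txt", encoding="utf-8") as f:
    for index, story in enumerate(iter_stories(f)):
        if index >= 3:
            break
        print(story[:80])
```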
- Total size: 7.62 GB
- Files: 14
- Last updated: May 28