| --- |
| language: en |
| tags: |
| - log-analysis |
| - pythia |
| - hdfs |
| license: mit |
| datasets: |
| - honicky/log-analysis-hdfs-preprocessed |
| metrics: |
| - cross-entropy |
| - perplexity |
| base_model: EleutherAI/pythia-70m |
| --- |
| |
| # pythia-70m-hdfs-logs |
|
|
| Fine-tuned Pythia-70m model for HDFS log analysis, specifically for anomaly detection. |
|
|
| ## Model Description |
|
|
| This model is fine-tuned from `EleutherAI/pythia-70m` for analyzing HDFS log sequences. It is designed to learn and predict patterns in |
| HDFS log data so that we can detect anomalies using the perplexity of a log sequence: a sequence the model predicts poorly has high perplexity, which we treat as a sign of an anomaly. The HDFS dataset is handy because it is labeled, |
| so we can use it to validate that the model can detect anomalies. |
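The perplexity-based scoring described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for this model: the function names and the threshold value are hypothetical, and `sequence_perplexity` assumes a Hugging Face causal LM and tokenizer have already been loaded by the caller.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over one sequence's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def flag_anomalies(perplexities, threshold):
    """Return one boolean per sequence: True when perplexity exceeds the threshold."""
    return [ppl > threshold for ppl in perplexities]


def sequence_perplexity(model, tokenizer, text, max_length=405):
    """Score one log sequence with a causal LM (imports torch lazily)."""
    import torch

    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        # Passing labels makes the model return the mean next-token cross-entropy.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())
```

In practice the threshold would have to be chosen on labeled validation data, for example by sweeping it and comparing the flags against the HDFS anomaly labels.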
|
|
| We will use this model to understand how well a small model can detect anomalies in a specific dataset. We will study model scale |
| and experiment with tokenization, initialization, dataset size, etc. to find a configuration that is minimal in size and fast, but can |
| still detect anomalies effectively. We will then attempt to build a model that is more robust to different log formats. |
|
|
| - Hugging Face model: [honicky/pythia-14m-hdfs-logs](https://huggingface.co/honicky/pythia-14m-hdfs-logs) |
|
|
| ## Training Details |
| - Base model: EleutherAI/pythia-70m |
| - Dataset: https://zenodo.org/records/8196385/files/HDFS_v1.zip?download=1 + preprocessed data at honicky/log-analysis-hdfs-preprocessed |
| - Batch size: 32 |
| - Max sequence length: 405 |
| - Learning rate: 0.0001 |
| - Training steps: 16000 |
| - Weights and Biases run: https://wandb.ai/honicky/log-analysis-pythia/runs/dwb96ojk |
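To put the hyperparameters above in perspective, here is a small sketch that collects them in one place and derives how much data the run processed. The `TrainingConfig` class is hypothetical, and the arithmetic assumes that the listed batch size is the effective batch size (no gradient accumulation or multi-device scaling), which the card does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters copied from the list above."""
    base_model: str = "EleutherAI/pythia-70m"
    batch_size: int = 32
    max_seq_length: int = 405
    learning_rate: float = 1e-4
    training_steps: int = 16_000

    @property
    def sequences_seen(self) -> int:
        # One batch per step, assuming no gradient accumulation.
        return self.batch_size * self.training_steps

    @property
    def max_tokens_seen(self) -> int:
        # Upper bound: real sequences may be shorter than the maximum length.
        return self.sequences_seen * self.max_seq_length


config = TrainingConfig()
print(config.sequences_seen)   # 512000
print(config.max_tokens_seen)  # 207360000
```

Under that assumption the run saw 512,000 sequences (counting repeats) and at most about 207 million tokens.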
| |
| |
| ## Special Tokens |
| - Added `<|sep|>` token for event ID separation |
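As a rough sketch of how such a separator could be used, the helpers below join a block's event IDs into one string and register the token with a Hugging Face tokenizer and model. The event IDs (`E5`, `E22`, `E11`), the helper names, and the choice to place the separator between IDs rather than after each one are assumptions for illustration; the actual format is defined by the preprocessed dataset.

```python
SEP_TOKEN = "<|sep|>"


def join_event_ids(event_ids, sep=SEP_TOKEN):
    """Join the event IDs of one log sequence into a single string."""
    return sep.join(event_ids)


def add_sep_token(tokenizer, model, sep=SEP_TOKEN):
    """Register the separator as a special token and return its token ID.

    `tokenizer` and `model` are expected to be Hugging Face objects that
    the caller has already loaded.
    """
    num_added = tokenizer.add_special_tokens({"additional_special_tokens": [sep]})
    if num_added:
        # Resize the embedding matrix to match the tokenizer's vocabulary.
        model.resize_token_embeddings(len(tokenizer))
    return tokenizer.convert_tokens_to_ids(sep)


print(join_event_ids(["E5", "E22", "E11"]))  # E5<|sep|>E22<|sep|>E11
```

Registering the separator as a special token keeps it from being split into several sub-word pieces, so each event boundary costs exactly one token of the 405-token budget.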
| |
| ## Intended Use |
| This model is intended for: |
| - Analyzing HDFS log sequences |
| - Detecting anomalies in log patterns |
| - Understanding system behavior through log analysis |
| |
| ## Limitations |
| - Model is specifically trained on HDFS logs and may not generalize to other log formats |
| - Limited to the context window size of 405 tokens |
| |
| |
| |