If your job fails after downloading the training file but before training starts, the most likely source of the error is the training data. For example, your event log might look like:
You can verify the formatting of your input file using the Together CLI with the following command:
$ together files check ~/Downloads/unified_joke_explanations.jsonl
{
    "is_check_passed": true,
    "model_special_tokens": "we are not yet checking end of sentence tokens for this model",
    "file_present": "File found",
    "file_size": "File size 0.0 GB",
    "num_samples": 356
}
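If you want to run this check from a script, for example before submitting a job, the report can be parsed programmatically. The following is a minimal sketch, not an official client: it assumes the CLI prints only the JSON report to standard output, as in the sample above, and the helper names `check_passed` and `run_file_check` are hypothetical. Other CLI versions may format their output differently.

```python
import json
import os
import subprocess


def check_passed(report_text):
    """Return True only if the report's "is_check_passed" field is true."""
    report = json.loads(report_text)
    return report.get("is_check_passed") is True


def run_file_check(path):
    """Run `together files check` on a file and interpret the report.

    Assumption: stdout contains just the JSON report shown above.
    """
    result = subprocess.run(
        ["together", "files", "check", os.path.expanduser(path)],
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI itself exits with an error
    )
    return check_passed(result.stdout)
```

For instance, `run_file_check("~/Downloads/unified_joke_explanations.jsonl")` would return `True` for the report shown above.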
Despite our best efforts, the file checker does not catch all errors. If your training data file passes the checks but you still see the error conditions above, please contact support.
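Because the checker does not catch everything, a quick local pass over the file can help narrow down a problem before you reach out. The sketch below is an illustration under stated assumptions, not a description of what the official checker does: it assumes each line of the `.jsonl` file should be a single JSON object and treats blank lines as suspicious. The function name `find_jsonl_problems` and the optional `required_keys` parameter are hypothetical.

```python
import json
from pathlib import Path


def find_jsonl_problems(path, required_keys=()):
    """Return (line_number, message) pairs for lines that look malformed.

    Assumptions: one JSON object per line; blank lines are reported.
    """
    problems = []
    with Path(path).expanduser().open(encoding="utf-8") as handle:
        for line_number, raw in enumerate(handle, start=1):
            line = raw.strip()
            if not line:
                problems.append((line_number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            # Optional: report keys the caller expects in every record.
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((line_number, f"missing keys: {missing}"))
    return problems
```

An empty result from `find_jsonl_problems("~/Downloads/unified_joke_explanations.jsonl")` means none of these basic problems were found. It does not guarantee that the file is valid for training.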
If you see an error during other steps in your training job, this may be due to internal errors in our training stack (e.g. hardware failure or bugs). We actively monitor job failures, and work as quickly as we can to resolve these issues. Once the issue has been resolved by our engineers, your job will be automatically or manually restarted. Charges for the restarted job will be refunded.