Slurm FAQ
Last updated: March 10, 2026
Getting Started with Slurm
What is Slurm and how do I use it?
Slurm (Simple Linux Utility for Resource Management) is a workload manager that schedules and manages jobs on Together AI's GPU clusters. Common commands include:
srun - Run commands on compute nodes
sbatch - Submit batch job scripts
sinfo - View cluster and partition information
squeue - View job queue status
scancel - Cancel jobs
How do I submit a job to Slurm?
For interactive sessions:
srun --time=05:00:00 --cpus-per-task=10 --gres=gpu:1 --mem=90G --pty bash -l
For batch jobs, create a script and submit with:
sbatch myjob.sh
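For reference, a minimal batch script sketch (the job name, resource requests, and your_command are placeholders to adapt; the flags mirror the interactive example above):
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --time=05:00:00
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:1
#SBATCH --mem=90G

srun your_command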
How do I specify GPU resources in my Slurm jobs?
Use the --gres flag to request GPUs:
srun --gres=gpu:1 # Request 1 GPU
srun --gres=gpu:8 # Request 8 GPUs
For multi-node jobs with GPU binding:
srun --ntasks-per-node=1 --nodes=2 --gres=gpu:8 --gpu-bind=closest your_command
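The same request can be expressed as a batch script; a minimal sketch, with your_command as a placeholder:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8

srun --gpu-bind=closest your_command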
Connection and Authentication Issues
I'm getting "Unable to contact slurm controller (connect failure)"
This error indicates the Slurm controller is unreachable. Common causes:
Controller downtime: The Slurm controller may be down or restarting
Network issues: Connectivity problems between nodes
Maintenance window: Scheduled or emergency maintenance
What to do:
Check if other team members are experiencing the same issue
Wait 5-10 minutes and retry
Contact support if the issue persists for more than 30 minutes, providing:
Cluster name
Exact error message
Timestamp when the issue started
SSH connection keeps failing or getting "Connection refused"
If you're seeing errors like:
channel 0: open failed: connect failed: Connection refused
stdio forwarding failed
Troubleshooting steps:
Verify your SSH proxy command is correct (an example SSH configuration sketch follows this list)
Check if the login node is accessible
Ensure your SSH keys are properly configured
Try connecting to a different login node if available
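If your cluster uses a jump host, a minimal ~/.ssh/config sketch might look like the following; the host alias is arbitrary and the hostname pattern is assumed from the Jupyter tunnel example later in this FAQ:
# ~/.ssh/config (sketch; adjust cluster, region, and username)
Host together-login
    HostName ssh.<cluster>.<region>.cloud.together.ai
    User <username>
Host *.slurm-compute.slurm
    User <username>
    ProxyJump together-login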
For persistent connection issues: Contact support with your SSH command and full error output.
None of us can log in / Login authentication failing
If multiple users cannot authenticate:
This usually indicates a cluster-wide authentication service issue
Contact support immediately as this requires infrastructure team intervention
The issue may affect Slurm job submission and control
My environment keeps getting corrupted or zsh is missing after reconnecting
If your login node environment becomes unstable:
Cause: Login node pod may be restarting or experiencing issues
Symptoms: Missing packages, broken shell, frequent disconnections
Solution: Contact support to investigate login node stability
Report issues with:
Frequency of crashes
What you were doing when it occurred (e.g., using Claude Code, running specific commands)
Whether environment packages disappear after reconnection
Job Submission and Scheduling Issues
My jobs are stuck in pending state for a long time
Common reasons for pending jobs:
Insufficient resources: All GPUs are currently allocated
squeue    # Check current job queue
sinfo     # Check node availability
Resource request exceeds limits: Requesting more resources than available per node
Partition configuration: Job may be targeting a partition with limited capacity
Check job status:
squeue -u $USER
scontrol show job <job_id>
How do I run a Jupyter notebook on a GPU node?
Reserve the node through Slurm with an interactive session:
Start an interactive session on your target node:
srun --pty --nodes=1 --nodelist=<node-name> --ntasks-per-node=1 --gres=gpu:8 /bin/bash
Launch Jupyter from that session:
jupyter lab --no-browser --port=8888 --ip=0.0.0.0
Create an SSH tunnel to the worker node:
ssh -N -L 8888:localhost:8888 -J $USER@ssh.<cluster>.<region>.cloud.together.ai $USER@<node-name>.slurm-compute.slurm
Open http://localhost:8888 in your browser and use the token printed when you launched Jupyter.
How do I increase the MaxArraySize for array jobs?
If you need to submit array jobs with more than 1000 tasks:
Error you'll see:
sbatch: error: Batch job submission failed: Invalid job array specification
Solution: Contact support to request increasing MaxArraySize. Specify:
Current limit (usually 1000)
Desired limit (e.g., 8192)
Use case for large array jobs
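Before filing the request, you can confirm the limit currently in effect by querying the controller configuration (same pattern as the KillWait check below):
scontrol show config | grep -i maxarraysize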
Can I adjust the KillWait timeout in Slurm?
The default KillWait setting (usually 30 seconds) may be too short for some workloads that need graceful shutdown time.
To check current setting:
scontrol show config | grep -i killwait
To request an increase (e.g., to 10 minutes), contact support with:
Current KillWait value
Desired timeout
Justification (e.g., checkpoint saving, cleanup operations)
Job Failures and Errors
Why is a node drained with reason "KillTaskFailure"?
You may see nodes in a drained state with a reason such as KillTaskFailure. To view the drain reason and the affected node:
sinfo -R
# or
scontrol show node <node>
What it means
Slurm tried to cleanly terminate one or more job steps on that node, but the termination did not complete successfully (or within the expected time). As a safety measure, Slurm drains the node so no new work is scheduled there until the issue is investigated.
Common causes
Unkillable processes: A process stuck in kernel I/O (D-state), hung GPU driver call, or similar condition that ignores signals.
Container/runtime issues: The job is running in a container (e.g., enroot/pyxis) and the container teardown fails or leaves processes behind.
cgroup / process tracking problems: Slurmd cannot reliably enumerate or kill all descendant processes (often shows up as lingering processes after job end).
Filesystem / NFS hangs: Jobs blocked on a shared filesystem can become hard to kill.
Short KillWait / slow cleanup: The workload needs more time to exit (checkpointing, large teardown), but Slurm is configured to move on and then considers the kill a failure.
How to rectify (user checklist)
Confirm the reason and impacted jobs
scontrol show node <node> | egrep -i "State=|Reason="
squeue -w <node>
sacct -j <job_id> --format=JobID,State,ExitCode,NodeList,Elapsed
Try resuming the node
scontrol update nodename <node> state=resume
If this is your job, reduce "hard to kill" behavior next time
Add signal handling in your training script so it exits quickly on SIGTERM (see the sketch after this checklist).
Save checkpoints periodically so you can exit quickly.
Avoid doing long blocking work in an exit handler.
If using distributed training, ensure all ranks exit when one rank is terminated.
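A minimal sketch of that pattern in a batch script, assuming train.sh is a placeholder for your training launcher: request an early SIGTERM from Slurm and forward it to the child process so it can checkpoint and exit before KillWait expires.
#!/bin/bash
#SBATCH --job-name=graceful
#SBATCH --gres=gpu:8
#SBATCH --signal=B:TERM@120   # send SIGTERM to this script 120s before the time limit

# Forward SIGTERM to the training process and wait for it to finish cleanly.
trap 'kill -TERM "$child" 2>/dev/null; wait "$child"' TERM

./train.sh &                  # placeholder training command
child=$!
wait "$child"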
When to contact support (recommended)
Because draining/undraining and process cleanup are typically admin actions, contact Together support with:
Cluster name
Node name
Drain reason string (e.g., KillTaskFailure)
Job ID(s) that were running
Approximate time the drain happened
Any relevant job logs (especially shutdown/teardown)
What support may do (FYI)
Identify and kill orphaned processes, or reboot the node if processes are stuck.
Verify slurmd health, cgroups configuration, and container runtime state.
Adjust relevant settings (e.g., KillWait) when appropriate.
Jobs are failing with "couldn't chdir to home directory" error
Error example:
slurmstepd-b65c909e-26: error: couldn't chdir to `/home/username': No such file or directory: going to /tmp instead
Cause: Home directory is not mounted or accessible on compute nodes
Solutions:
Use /data/username or a shared filesystem that's mounted on all nodes
Ensure your job script sets the working directory explicitly (see the sketch below)
Contact support if home directories should be available but aren't
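A minimal sketch, assuming your files live under /data/<username> (adjust the path): setting the working directory with --chdir means slurmstepd never has to chdir into a missing home directory.
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --chdir=/data/<username>/project   # explicit working directory on a shared filesystem

srun your_command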
I'm seeing CrashLoopBackOff for Slurm worker pods
If Kubernetes shows Slurm workers in CrashLoopBackOff:
kubectl get pods | grep slurm-worker
This indicates:
Init containers are failing
Configuration issues with the Slurm worker setup
Node-level problems
Action: Contact support immediately with the output of:
kubectl get pods | grep CrashLoopBackOff
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
Distributed training jobs are timing out
For PyTorch distributed training with Slurm:
Common timeout causes:
Network connectivity between nodes
Incorrect NCCL settings
Firewall blocking required ports
Debug steps:
Test inter-node communication
Check NCCL environment variables (a sketch follows this list)
Verify InfiniBand/RDMA is working (if applicable)
Review logs for specific timeout errors
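As a starting point for step 2, the sketch below turns on verbose NCCL logging before launching training; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, and your_training_command is a placeholder:
# In your batch script, before launching training:
export NCCL_DEBUG=INFO             # verbose NCCL init/transport logging
export NCCL_DEBUG_SUBSYS=INIT,NET  # limit output to init and networking
srun your_training_command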
Contact support if you suspect infrastructure issues.
Configuration and Customization
How do I set up custom Slurm partitions?
Partitions allow you to organize nodes by priority, preemption policies, or resource types.
Example partition configuration:
PartitionName=batch Default=YES MaxTime=UNLIMITED PriorityTier=30 Nodes=node-01,node-02
PartitionName=urgent Default=NO MaxTime=UNLIMITED PriorityTier=40 Nodes=node-03,node-04
PartitionName=low Default=NO MaxTime=UNLIMITED PriorityTier=20 PreemptMode=cancel Nodes=node-01,node-02,node-03,node-04
To request partition changes: Contact support with:
Partition name
Node list
Priority tier
Preemption mode (if applicable)
Default status
How do I install packages that persist on login nodes?
Issue: Packages installed with apt install or pip install may not persist after pod restarts.
Solutions:
For system packages: Contact support to have them added to the base image
For Python packages: Use a virtual environment or conda environment stored on persistent storage
# Create venv on persistent storage
python -m venv /data/username/.venv
source /data/username/.venv/bin/activate
pip install <packages>
For user-level tools: Install to your home directory or /data
Can I customize Slurm configuration settings?
Some Slurm settings can be customized per customer:
MaxArraySize
KillWait
Partition configuration
Priority tiers
Preemption policies
Contact support with your requirements and we'll assess feasibility.
Resource Limits and Quotas
What are the default resource limits per job?
Default limits vary by cluster but typically include:
Max CPUs per job
Max memory per job
Max GPUs per job
Max job runtime (often UNLIMITED)
Check current limits:
scontrol show config | grep -i max
sinfo -o "%20P %5a %.10l %16F"
How do I request more resources for my jobs?
If you need resources beyond default limits:
Temporary increase: Contact support for specific jobs
Permanent increase: Discuss with your account team
Provide:
Current limits
Desired limits
Use case and justification
Can I reserve nodes for exclusive use?
For dedicated node access:
srun --exclusive <other-options> your_command
For longer-term reservations, contact support to discuss:
Number of nodes needed
Duration
GPU types
Business justification
Monitoring and Debugging
How do I check the status of my jobs?
View your jobs:
squeue -u $USER
Detailed job information:
scontrol show job <job_id>
Job history:
sacct -u $USER
sacct -j <job_id> --format=JobID,JobName,Partition,State,ExitCode,Start,End
How do I view Slurm logs for debugging?
For job output:
Standard output: slurm-<jobid>.out
Standard error: slurm-<jobid>.err
For system-level Slurm issues: Contact support for access to:
Slurm controller logs
Slurmd (node daemon) logs
Slurmstepd (job step daemon) logs
What information should I include when reporting Slurm issues?
To help support resolve issues quickly, provide:
Required:
Cluster name/ID
Username
Exact error message
Timestamp
Commands you ran
Helpful:
Job ID (if applicable)
Output of sinfo
Output of scontrol show config | grep <relevant-setting>
Whether issue affects multiple users
Recent changes to your workflow
Known Issues and Workarounds
Slurm controller becomes unreachable periodically
If you experience periodic Slurm controller outages:
This may indicate infrastructure issues requiring investigation
Workaround: Wait for controller to recover (usually 5-15 minutes)
Long-term fix: Contact support to investigate root cause
Login node SSH disconnects frequently
For unstable login node connections:
Use tmux or screen to maintain sessions across disconnects (a brief tmux example follows this list)
Keep critical work in version control or persistent storage
Report frequent disconnections to support for investigation
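For example, a minimal tmux workflow (the session name is arbitrary):
tmux new -s work        # start a named session on the login node
# ... run your commands; if SSH drops, the session keeps running ...
tmux attach -t work     # reattach after reconnecting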
PyTorch distributed jobs fail with import errors
Common error:
ModuleNotFoundError: No module named 'torch.distributed'
Causes:
Python environment not consistent across nodes
Missing packages on compute nodes
Solutions:
Use containerized environments (Docker/Singularity)
Ensure packages are installed on shared filesystem
Use Pyxis for containerized Slurm jobs
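If the Pyxis plugin is enabled on your cluster, a containerized launch might look like the following sketch (the image reference is a placeholder; Pyxis/enroot use '#' to separate the registry from the image path):
srun --container-image=nvcr.io#nvidia/pytorch:<tag> --gres=gpu:1 \
     python -c "import torch.distributed as dist; print(dist.is_available())"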
Jobs fail with library version conflicts
For library conflicts (e.g., CUDA, NCCL, PyTorch versions):
Use containers to ensure consistent environments
Check that library versions are compatible with GPU hardware
Contact support if you need specific library versions installed cluster-wide
Best Practices
Efficient job submission
Test interactively first: Use srun for small tests before sbatch for large jobs
Request only needed resources: Don't over-allocate CPUs, memory, or GPUs
Use job arrays: For many similar jobs, use array jobs instead of individual submissions (see the sketch after this list)
Set appropriate time limits: Use --time to help the scheduler optimize allocation
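A minimal array-job sketch; process.py and the shard count are placeholders, and each task selects its input via SLURM_ARRAY_TASK_ID:
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --array=0-99            # 100 tasks; stay under MaxArraySize (see above)
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# Each task receives a distinct SLURM_ARRAY_TASK_ID to pick its work item.
srun python process.py --shard "$SLURM_ARRAY_TASK_ID"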
Data management
Use shared filesystems: Store data on /data or other mounted shared storage
Don't use local node storage: Data on local disks is lost when jobs end
Clean up temporary files: Remove job outputs and logs periodically
Back up critical data: Don't rely solely on cluster storage
Debugging strategies
Start small: Debug with single-node, single-GPU jobs before scaling
Check logs immediately: Review error messages before jobs are purged
Use interactive sessions: srun --pty bash for hands-on debugging
Test network connectivity: Verify inter-node communication for multi-node jobs