Slurm FAQ

Last updated: March 10, 2026

Getting Started with Slurm

What is Slurm and how do I use it?

Slurm (Simple Linux Utility for Resource Management) is a workload manager that schedules and manages jobs on Together AI's GPU clusters. Common commands include:

  • srun - Run commands on compute nodes

  • sbatch - Submit batch job scripts

  • sinfo - View cluster and partition information

  • squeue - View job queue status

  • scancel - Cancel jobs

How do I submit a job to Slurm?

For interactive sessions:

srun --time=05:00:00 --cpus-per-task=10 --gres=gpu:1 --mem=90G --pty bash -l

For batch jobs, create a script and submit with:

sbatch myjob.sh
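
A minimal batch script might look like the sketch below; the resource values are placeholders to adjust for your workload, and train.py stands in for whatever you actually run:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=10
#SBATCH --mem=90G
#SBATCH --time=05:00:00
#SBATCH --output=slurm-%j.out

srun python train.py  # replace with your actual command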

How do I specify GPU resources in my Slurm jobs?

Use the --gres flag to request GPUs:

srun --gres=gpu:1  # Request 1 GPU
srun --gres=gpu:8  # Request 8 GPUs

For multi-node jobs with GPU binding:

srun --ntasks-per-node=1 --nodes=2 --gres=gpu:8 --gpu-bind=closest your_command

Connection and Authentication Issues

I'm getting "Unable to contact slurm controller (connect failure)"

This error indicates the Slurm controller is unreachable. Common causes:

  1. Controller downtime: The Slurm controller may be down or restarting

  2. Network issues: Connectivity problems between nodes

  3. Maintenance window: Scheduled or emergency maintenance

What to do:

  • Check if other team members are experiencing the same issue

  • Wait 5-10 minutes and retry

  • Contact support if the issue persists for more than 30 minutes, providing:

    • Cluster name

    • Exact error message

    • Timestamp when the issue started

SSH connection keeps failing or getting "Connection refused"

If you're seeing errors like:

channel 0: open failed: connect failed: Connection refused
stdio forwarding failed

Troubleshooting steps:

  1. Verify your SSH proxy command is correct (a sample ~/.ssh/config follows this list)

  2. Check if the login node is accessible

  3. Ensure your SSH keys are properly configured

  4. Try connecting to a different login node if available
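
For step 1, a ~/.ssh/config sketch along these lines mirrors the jump-host connection string used elsewhere in this FAQ; the hostnames, username, and key path are placeholders:

# Placeholders: substitute your cluster, region, username, and key path
Host together-login
    HostName ssh.<cluster>.<region>.cloud.together.ai
    User <username>
    IdentityFile ~/.ssh/id_ed25519

Host *.slurm-compute.slurm
    User <username>
    ProxyJump together-login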

For persistent connection issues: Contact support with your SSH command and full error output.

None of us can log in / Login authentication failing

If multiple users cannot authenticate:

  • This usually indicates a cluster-wide authentication service issue

  • Contact support immediately as this requires infrastructure team intervention

  • The issue may affect Slurm job submission and control

My environment keeps getting corrupted or zsh is missing after reconnecting

If your login node environment becomes unstable:

  • Cause: Login node pod may be restarting or experiencing issues

  • Symptoms: Missing packages, broken shell, frequent disconnections

  • Solution: Contact support to investigate login node stability

Report issues with:

  • Frequency of crashes

  • What you were doing when it occurred (e.g., using Claude Code, running specific commands)

  • Whether environment packages disappear after reconnection

Job Submission and Scheduling Issues

My jobs are stuck in pending state for a long time

Common reasons for pending jobs:

  1. Insufficient resources: All GPUs are currently allocated

    squeue  # Check current job queue
    sinfo   # Check node availability
    
  2. Resource request exceeds limits: Requesting more resources than available per node

  3. Partition configuration: Job may be targeting a partition with limited capacity

Check job status:

squeue -u $USER
scontrol show job <job_id>
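
To see the scheduler's stated reason for a pending job (for example, Resources or Priority), a custom squeue format prints it in the last column:

squeue -u $USER -t PENDING -o "%.18i %.20j %.10M %R"  # %R shows the pending reason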

How do I run a Jupyter notebook on a GPU node?

Reserve the node through Slurm with an interactive session:

  1. Start an interactive session on your target node:

    srun --pty --nodes=1 --nodelist=<node-name> --ntasks-per-node=1 --gres=gpu:8 /bin/bash

  2. Launch Jupyter from that session:

jupyter lab --no-browser --port=8888 --ip=0.0.0.0

  3. Create an SSH tunnel to the worker node:

ssh -N -L 8888:localhost:8888 -J $USER@ssh.<cluster>.<region>.cloud.together.ai $USER@<node-name>.slurm-compute.slurm

  4. Open http://localhost:8888 in your browser and authenticate with the token printed by Jupyter in step 2.

How do I increase the MaxArraySize for array jobs?

If you need to submit array jobs with more than 1000 tasks:

Error you'll see:

sbatch: error: Batch job submission failed: Invalid job array specification
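
You can confirm the limit currently configured on your cluster before opening a ticket:

scontrol show config | grep -i maxarraysize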

Solution: Contact support to request increasing MaxArraySize. Specify:

  • Current limit (usually 1000)

  • Desired limit (e.g., 8192)

  • Use case for large array jobs

Can I adjust the KillWait timeout in Slurm?

The default KillWait setting (usually 30 seconds) may be too short for some workloads that need graceful shutdown time.

To check current setting:

scontrol show config | grep -i killwait

To request an increase (e.g., to 10 minutes), contact support with:

  • Current KillWait value

  • Desired timeout

  • Justification (e.g., checkpoint saving, cleanup operations)

Job Failures and Errors

Why is a node drained with reason "KillTaskFailure"?

You may see nodes in a drained state with this reason. To check the drain state and reason:

sinfo -R
# or
scontrol show node <node>

What it means

  • Slurm tried to cleanly terminate one or more job steps on that node, but the termination did not complete successfully (or within the expected time). As a safety measure, Slurm drains the node so no new work is scheduled there until the issue is investigated.

Common causes

  1. Unkillable processes: A process stuck in kernel I/O (D-state), hung GPU driver call, or similar condition that ignores signals.

  2. Container/runtime issues: The job is running in a container (e.g., enroot/pyxis) and the container teardown fails or leaves processes behind.

  3. cgroup / process tracking problems: Slurmd cannot reliably enumerate or kill all descendant processes (often shows up as lingering processes after job end).

  4. Filesystem / NFS hangs: Jobs blocked on a shared filesystem can become hard to kill.

  5. Short KillWait / slow cleanup: The workload needs more time to exit (checkpointing, large teardown), but Slurm is configured to move on and then considers the kill a failure.

How to rectify (user checklist)

  1. Confirm the reason and impacted jobs

    scontrol show node <node> | egrep -i "State=|Reason="
    squeue -w <node>
    sacct -j <job_id> --format=JobID,State,ExitCode,NodeList,Elapsed
    
  2. Try resuming the node

    scontrol update NodeName=<node> State=RESUME
    
  3. If this is your job, reduce “hard to kill” behavior next time

    • Add signal handling in your training script so it exits quickly on SIGTERM (see the sketch after this list).

    • Save checkpoints periodically so you can exit quickly.

    • Avoid doing long blocking work in an exit handler.

    • If using distributed training, ensure all ranks exit when one rank is terminated.
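
As a sketch of the signal-handling pattern mentioned above: ask Slurm to deliver SIGTERM to the batch shell shortly before the job is killed, trap it, and stop the workload promptly. The script name and timings are placeholders, not a cluster-specific recommendation:

#!/bin/bash
#SBATCH --job-name=graceful-example
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
# Deliver SIGTERM to this batch shell 120 seconds before the time limit
#SBATCH --signal=B:TERM@120

# On SIGTERM, stop the (placeholder) training process and exit quickly
cleanup() {
    echo "SIGTERM received, shutting down"
    kill -TERM "$TRAIN_PID" 2>/dev/null
    wait "$TRAIN_PID"
    exit 0
}
trap cleanup TERM

srun python train.py &   # train.py is a placeholder; it should checkpoint and exit on SIGTERM
TRAIN_PID=$!
wait "$TRAIN_PID"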

When to contact support (recommended)

Because draining/undraining and process cleanup are typically admin actions, contact Together support with:

  • Cluster name

  • Node name

  • Drain reason string (e.g., KillTaskFailure)

  • Job ID(s) that were running

  • Approximate time the drain happened

  • Any relevant job logs (especially shutdown/teardown)

What support may do (FYI)

  • Identify and kill orphaned processes, or reboot the node if processes are stuck.

  • Verify slurmd health, cgroups configuration, and container runtime state.

  • Adjust relevant settings (e.g., KillWait) when appropriate.

Jobs are failing with "couldn't chdir to home directory" error

Error example:

slurmstepd-b65c909e-26: error: couldn't chdir to `/home/username': No such file or directory: going to /tmp instead

Cause: Home directory is not mounted or accessible on compute nodes

Solutions:

  • Use /data/username or a shared filesystem that's mounted on all nodes

  • Ensure your job script sets the working directory explicitly (see the example after this list)

  • Contact support if home directories should be available but aren't
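
A minimal way to pin the working directory in a batch script, assuming /data/<username>/project exists on all compute nodes (the paths are placeholders):

#!/bin/bash
#SBATCH --chdir=/data/<username>/project                 # working directory on shared storage
#SBATCH --output=/data/<username>/project/slurm-%j.out   # keep logs off the (possibly missing) home

srun ./run_experiment.sh   # placeholder for your actual command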

I'm seeing CrashLoopBackOff for Slurm worker pods

If Kubernetes shows Slurm workers in CrashLoopBackOff:

kubectl get pods | grep slurm-worker

This indicates:

  • Init containers are failing

  • Configuration issues with the Slurm worker setup

  • Node-level problems

Action: Contact support immediately with the output of:

kubectl get pods | grep CrashLoopBackOff
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous

Distributed training jobs are timing out

For PyTorch distributed training with Slurm:

Common timeout causes:

  • Network connectivity between nodes

  • Incorrect NCCL settings

  • Firewall blocking required ports

Debug steps:

  1. Test inter-node communication

  2. Check NCCL environment variables (see the example after this list)

  3. Verify InfiniBand/RDMA is working (if applicable)

  4. Review logs for specific timeout errors
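
For step 2, these commonly used NCCL variables increase logging so the failing phase is easier to spot; the values and script name are illustrative, not cluster-specific recommendations:

export NCCL_DEBUG=INFO              # verbose NCCL initialization and transport logging
export NCCL_DEBUG_SUBSYS=INIT,NET   # limit the extra output to init and networking
srun python train_distributed.py    # placeholder for your launcher/script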

Contact support if you suspect infrastructure issues.

Configuration and Customization

How do I set up custom Slurm partitions?

Partitions allow you to organize nodes by priority, preemption policies, or resource types.

Example partition configuration:

PartitionName=batch Default=YES MaxTime=UNLIMITED PriorityTier=30 Nodes=node-01,node-02
PartitionName=urgent Default=NO MaxTime=UNLIMITED PriorityTier=40 Nodes=node-03,node-04
PartitionName=low Default=NO MaxTime=UNLIMITED PriorityTier=20 PreemptMode=cancel Nodes=node-01,node-02,node-03,node-04

To request partition changes: Contact support with:

  • Partition name

  • Node list

  • Priority tier

  • Preemption mode (if applicable)

  • Default status

How do I install packages that persist on login nodes?

Issue: Packages installed with apt install or pip install may not persist after pod restarts.

Solutions:

  1. For system packages: Contact support to have them added to the base image

  2. For Python packages: Use a virtual environment or conda environment stored on persistent storage

    # Create venv on persistent storage
    python -m venv /data/username/.venv
    source /data/username/.venv/bin/activate
    pip install <packages>
    
  3. For user-level tools: Install to your home directory or /data

Can I customize Slurm configuration settings?

Some Slurm settings can be customized per customer:

  • MaxArraySize

  • KillWait

  • Partition configuration

  • Priority tiers

  • Preemption policies

Contact support with your requirements and we'll assess feasibility.

Resource Limits and Quotas

What are the default resource limits per job?

Default limits vary by cluster but typically include:

  • Max CPUs per job

  • Max memory per job

  • Max GPUs per job

  • Max job runtime (often UNLIMITED)

Check current limits:

scontrol show config | grep -i max
sinfo -o "%20P %5a %.10l %16F"

How do I request more resources for my jobs?

If you need resources beyond default limits:

  1. Temporary increase: Contact support for specific jobs

  2. Permanent increase: Discuss with your account team

Provide:

  • Current limits

  • Desired limits

  • Use case and justification

Can I reserve nodes for exclusive use?

For dedicated node access:

srun --exclusive <other-options> your_command

For longer-term reservations, contact support to discuss:

  • Number of nodes needed

  • Duration

  • GPU types

  • Business justification

Monitoring and Debugging

How do I check the status of my jobs?

View your jobs:

squeue -u $USER

Detailed job information:

scontrol show job <job_id>

Job history:

sacct -u $USER
sacct -j <job_id> --format=JobID,JobName,Partition,State,ExitCode,Start,End

How do I view Slurm logs for debugging?

For job output:

  • Standard output: slurm-<jobid>.out (written to the job's submission directory by default)

  • Standard error: combined into the same file unless you request a separate one with --error=<file>
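
If you prefer separate or relocated log files, sbatch directives along these lines work; the directory is a placeholder:

#SBATCH --output=/data/<username>/logs/%x-%j.out   # %x = job name, %j = job ID
#SBATCH --error=/data/<username>/logs/%x-%j.err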

For system-level Slurm issues: Contact support for access to:

  • Slurm controller logs

  • Slurmd (node daemon) logs

  • Slurmstepd (job step daemon) logs

What information should I include when reporting Slurm issues?

To help support resolve issues quickly, provide:

Required:

  • Cluster name/ID

  • Username

  • Exact error message

  • Timestamp

  • Commands you ran

Helpful:

  • Job ID (if applicable)

  • Output of sinfo

  • Output of scontrol show config | grep <relevant-setting>

  • Whether issue affects multiple users

  • Recent changes to your workflow

Known Issues and Workarounds

Slurm controller becomes unreachable periodically

If you experience periodic Slurm controller outages:

  • This may indicate infrastructure issues requiring investigation

  • Workaround: Wait for controller to recover (usually 5-15 minutes)

  • Long-term fix: Contact support to investigate root cause

Login node SSH disconnects frequently

For unstable login node connections:

  • Use tmux or screen to maintain sessions across disconnects (see the example after this list)

  • Keep critical work in version control or persistent storage

  • Report frequent disconnections to support for investigation
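
A typical tmux workflow for surviving disconnects (the session name is arbitrary):

tmux new -s work      # start a named session on the login node
# ...connection drops; reconnect and run:
tmux attach -t work   # reattach to the same session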

PyTorch distributed jobs fail with import errors

Common error:

ModuleNotFoundError: No module named 'torch.distributed'

Causes:

  • Python environment not consistent across nodes

  • Missing packages on compute nodes

Solutions:

  • Use containerized environments (Docker/Singularity)

  • Ensure packages are installed on a shared filesystem

  • Use Pyxis for containerized Slurm jobs
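
If the Pyxis plugin is enabled on your cluster, a containerized step looks roughly like this; the image tag is only an example:

srun --container-image=nvcr.io/nvidia/pytorch:24.05-py3 --gres=gpu:1 \
     python -c "import torch; print(torch.__version__)"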

Jobs fail with library version conflicts

For library conflicts (e.g., CUDA, NCCL, PyTorch versions):

  • Use containers to ensure consistent environments

  • Check that library versions are compatible with GPU hardware

  • Contact support if you need specific library versions installed cluster-wide

Best Practices

Efficient job submission

  1. Test interactively first: Use srun for small tests before sbatch for large jobs

  2. Request only needed resources: Don't over-allocate CPUs, memory, or GPUs

  3. Use job arrays: For many similar jobs, use array jobs instead of individual submissions

  4. Set appropriate time limits: Use --time to help the scheduler optimize allocation

Data management

  1. Use shared filesystems: Store data on /data or other mounted shared storage

  2. Don't use local node storage: Data on local disks is lost when jobs end

  3. Clean up temporary files: Remove job outputs and logs periodically

  4. Back up critical data: Don't rely solely on cluster storage

Debugging strategies

  1. Start small: Debug with single-node, single-GPU jobs before scaling

  2. Check logs immediately: Review error messages before jobs are purged

  3. Use interactive sessions: srun --pty bash for hands-on debugging

  4. Test network connectivity: Verify inter-node communication for multi-node jobs