Slurm FAQ

Last updated: March 10, 2026

Getting Started with Slurm

What is Slurm and how do I use it?

Slurm (Simple Linux Utility for Resource Management) is a workload manager that schedules and manages jobs on Together AI's GPU clusters. Common commands include:

  • srun - Run commands on compute nodes

  • sbatch - Submit batch job scripts

  • sinfo - View cluster and partition information

  • squeue - View job queue status

  • scancel - Cancel jobs

How do I submit a job to Slurm?

For interactive sessions:

srun --time=05:00:00 --cpus-per-task=10 --gres=gpu:1 --mem=90G --pty bash -l

For batch jobs, create a script and submit with:

sbatch myjob.sh
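
A minimal batch script might look like the sketch below; the resource values are placeholders to adjust for your workload, and train.py stands in for whatever you actually run:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=10
#SBATCH --mem=90G
#SBATCH --time=05:00:00
#SBATCH --output=slurm-%j.out

srun python train.py  # replace with your actual command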

How do I specify GPU resources in my Slurm jobs?

Use the --gres flag to request GPUs:

srun --gres=gpu:1  # Request 1 GPU
srun --gres=gpu:8  # Request 8 GPUs

For multi-node jobs with GPU binding:

srun --ntasks-per-node=1 --nodes=2 --gres=gpu:8 --gpu-bind=closest your_command

Connection and Authentication Issues

I'm getting "Unable to contact slurm controller (connect failure)"

This error indicates the Slurm controller is unreachable. Common causes:

  1. Controller downtime: The Slurm controller may be down or restarting

  2. Network issues: Connectivity problems between nodes

  3. Maintenance window: Scheduled or emergency maintenance

What to do:

  • Check if other team members are experiencing the same issue

  • Wait 5-10 minutes and retry

  • Contact support if the issue persists for more than 30 minutes, providing:

    • Cluster name

    • Exact error message

    • Timestamp when the issue started

SSH connection keeps failing or getting "Connection refused"

If you're seeing errors like:

channel 0: open failed: connect failed: Connection refused
stdio forwarding failed

Troubleshooting steps:

  1. Verify your SSH proxy command is correct (a sample ~/.ssh/config follows this list)

  2. Check if the login node is accessible

  3. Ensure your SSH keys are properly configured

  4. Try connecting to a different login node if available
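
For step 1, a ~/.ssh/config sketch along these lines mirrors the jump-host connection string used elsewhere in this FAQ; the hostnames, username, and key path are placeholders:

# Placeholders: substitute your cluster, region, username, and key path
Host together-login
    HostName ssh.<cluster>.<region>.cloud.together.ai
    User <username>
    IdentityFile ~/.ssh/id_ed25519

Host *.slurm-compute.slurm
    User <username>
    ProxyJump together-login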

For persistent connection issues: Contact support with your SSH command and full error output.

None of us can log in / Login authentication failing

If multiple users cannot authenticate:

  • This usually indicates a cluster-wide authentication service issue

  • Contact support immediately as this requires infrastructure team intervention

  • The issue may affect Slurm job submission and control

My environment keeps getting corrupted or zsh is missing after reconnecting

If your login node environment becomes unstable:

  • Cause: Login node pod may be restarting or experiencing issues

  • Symptoms: Missing packages, broken shell, frequent disconnections

  • Solution: Contact support to investigate login node stability

Report issues with:

  • Frequency of crashes

  • What you were doing when it occurred (e.g., using Claude Code, running specific commands)

  • Whether environment packages disappear after reconnection

Job Submission and Scheduling Issues

My jobs are stuck in pending state for a long time

Common reasons for pending jobs:

  1. Insufficient resources: All GPUs are currently allocated

    squeue  # Check current job queue
    sinfo   # Check node availability
    
  2. Resource request exceeds limits: Requesting more resources than available per node

  3. Partition configuration: Job may be targeting a partition with limited capacity

Check job status:

squeue -u $USER
scontrol show job <job_id>
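
To see the scheduler's stated reason for a pending job (for example, Resources or Priority), a custom squeue format prints it in the last column:

squeue -u $USER -t PENDING -o "%.18i %.20j %.10M %R"  # %R shows the pending reason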

How do I run a Jupyter notebook on a GPU node?

Reserve the node through Slurm with an interactive session:

  1. Start an interactive session on your target node:

    srun --pty --nodes=1 --nodelist=<node-name> --ntasks-per-node=1 --gres=gpu:8 /bin/bash

  2. Launch Jupyter from that session:

jupyter lab --no-browser --port=8888 --ip=0.0.0.0

  3. Create an SSH tunnel to the worker node:

ssh -N -L 8888:localhost:8888 -J $USER@ssh.<cluster>.<region>.cloud.together.ai $USER@<node-name>.slurm-compute.slurm

  4. Open http://localhost:8888 in your browser and authenticate with the token printed by Jupyter in step 2.

How do I increase the MaxArraySize for array jobs?

If you need to submit array jobs with more than 1000 tasks:

Error you'll see:

sbatch: error: Batch job submission failed: Invalid job array specification
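
You can confirm the limit currently configured on your cluster before opening a ticket:

scontrol show config | grep -i maxarraysize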

Solution: Contact support to request increasing MaxArraySize. Specify:

  • Current limit (usually 1000)

  • Desired limit (e.g., 8192)

  • Use case for large array jobs

Can I adjust the KillWait timeout in Slurm?

The default KillWait setting (usually 30 seconds) may be too short for some workloads that need graceful shutdown time.

To check current setting:

scontrol show config | grep -i killwait

To request an increase (e.g., to 10 minutes), contact support with:

  • Current KillWait value

  • Desired timeout

  • Justification (e.g., checkpoint saving, cleanup operations)

Job Failures and Errors

Why is a node drained with reason "KillTaskFailure"?

You may see nodes in a drained state with this reason. To check the drain state and reason:

sinfo -R
# or
scontrol show node <node>

What it means

  • Slurm tried to cleanly terminate one or more job steps on that node, but the termination did not complete successfully (or within the expected time). As a safety measure, Slurm drains the node so no new work is scheduled there until the issue is investigated.

Common causes

  1. Unkillable processes: A process stuck in kernel I/O (D-state), hung GPU driver call, or similar condition that ignores signals.

  2. Container/runtime issues: The job is running in a container (e.g., enroot/pyxis) and the container teardown fails or leaves processes behind.

  3. cgroup / process tracking problems: Slurmd cannot reliably enumerate or kill all descendant processes (often shows up as lingering processes after job end).

  4. Filesystem / NFS hangs: Jobs blocked on a shared filesystem can become hard to kill.

  5. Short KillWait / slow cleanup: The workload needs more time to exit (checkpointing, large teardown), but Slurm is configured to move on and then considers the kill a failure.

How to rectify (user checklist)

  1. Confirm the reason and impacted jobs

    scontrol show node <node> | egrep -i "State=|Reason="
    squeue -w <node>
    sacct -j <job_id> --format=JobID,State,ExitCode,NodeList,Elapsed
    
  2. Try resuming the node

    scontrol update NodeName=<node> State=RESUME
    
  3. If this is your job, reduce “hard to kill” behavior next time

    • Add signal handling in your training script so it exits quickly on SIGTERM (see the sketch after this list).

    • Save checkpoints periodically so you can exit quickly.

    • Avoid doing long blocking work in an exit handler.

    • If using distributed training, ensure all ranks exit when one rank is terminated.
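
As a sketch of the signal-handling pattern mentioned above: ask Slurm to deliver SIGTERM to the batch shell shortly before the job is killed, trap it, and stop the workload promptly. The script name and timings are placeholders, not a cluster-specific recommendation:

#!/bin/bash
#SBATCH --job-name=graceful-example
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
# Deliver SIGTERM to this batch shell 120 seconds before the time limit
#SBATCH --signal=B:TERM@120

# On SIGTERM, stop the (placeholder) training process and exit quickly
cleanup() {
    echo "SIGTERM received, shutting down"
    kill -TERM "$TRAIN_PID" 2>/dev/null
    wait "$TRAIN_PID"
    exit 0
}
trap cleanup TERM

srun python train.py &   # train.py is a placeholder; it should checkpoint and exit on SIGTERM
TRAIN_PID=$!
wait "$TRAIN_PID"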

When to contact support (recommended)

Because draining/undraining and process cleanup are typically admin actions, contact Together support with:

  • Cluster name

  • Node name

  • Drain reason string (e.g., KillTaskFailure)

  • Job ID(s) that were running

  • Approximate time the drain happened

  • Any relevant job logs (especially shutdown/teardown)

What support may do (FYI)

  • Identify and kill orphaned processes, or reboot the node if processes are stuck.

  • Verify slurmd health, cgroups configuration, and container runtime state.

  • Adjust relevant settings (e.g., KillWait) when appropriate.

Jobs are failing with "couldn't chdir to home directory" error

Error example:

slurmstepd-b65c909e-26: error: couldn't chdir to `/home/username': No such file or directory: going to /tmp instead

Cause: Home directory is not mounted or accessible on compute nodes

Solutions:

  • Use /data/username or a shared filesystem that's mounted on all nodes

  • Ensure your job script sets the working directory explicitly (see the example after this list)

  • Contact support if home directories should be available but aren't
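
A minimal way to pin the working directory in a batch script, assuming /data/<username>/project exists on all compute nodes (the paths are placeholders):

#!/bin/bash
#SBATCH --chdir=/data/<username>/project                 # working directory on shared storage
#SBATCH --output=/data/<username>/project/slurm-%j.out   # keep logs off the (possibly missing) home

srun ./run_experiment.sh   # placeholder for your actual command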

I'm seeing CrashLoopBackOff for Slurm worker pods

If Kubernetes shows Slurm workers in CrashLoopBackOff:

kubectl get pods | grep slurm-worker

This indicates:

  • Init containers are failing

  • Configuration issues with the Slurm worker setup

  • Node-level problems

Action: Contact support immediately with the output of:

kubectl get pods | grep CrashLoopBackOff
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous

Distributed training jobs are timing out

For PyTorch distributed training with Slurm:

Common timeout causes:

  • Network connectivity between nodes

  • Incorrect NCCL settings

  • Firewall blocking required ports

Debug steps:

  1. Test inter-node communication

  2. Check NCCL environment variables (see the example after this list)

  3. Verify InfiniBand/RDMA is working (if applicable)

  4. Review logs for specific timeout errors
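
For step 2, these commonly used NCCL variables increase logging so the failing phase is easier to spot; the values and script name are illustrative, not cluster-specific recommendations:

export NCCL_DEBUG=INFO              # verbose NCCL initialization and transport logging
export NCCL_DEBUG_SUBSYS=INIT,NET   # limit the extra output to init and networking
srun python train_distributed.py    # placeholder for your launcher/script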

Contact support if you suspect infrastructure issues.

Configuration and Customization

How do I set up custom Slurm partitions?

Partitions allow you to organize nodes by priority, preemption policies, or resource types.

Example partition configuration:

PartitionName=batch Default=YES MaxTime=UNLIMITED PriorityTier=30 Nodes=node-01,node-02
PartitionName=urgent Default=NO MaxTime=UNLIMITED PriorityTier=40 Nodes=node-03,node-04
PartitionName=low Default=NO MaxTime=UNLIMITED PriorityTier=20 PreemptMode=cancel Nodes=node-01,node-02,node-03,node-04

To request partition changes: Contact support with:

  • Partition name

  • Node list

  • Priority tier

  • Preemption mode (if applicable)

  • Default status

How do I install packages that persist on login nodes?

Issue: Packages installed with apt install or pip install may not persist after pod restarts.

Solutions:

  1. For system packages: Contact support to have them added to the base image

  2. For Python packages: Use a virtual environment or conda environment stored on persistent storage

    # Create venv on persistent storage
    python -m venv /data/username/.venv
    source /data/username/.venv/bin/activate
    pip install <packages>
    
  3. For user-level tools: Install to your home directory or /data

Can I customize Slurm configuration settings?

Some Slurm settings can be customized per customer:

  • MaxArraySize

  • KillWait

  • Partition configuration

  • Priority tiers

  • Preemption policies

Contact support with your requirements and we'll assess feasibility.

Resource Limits and Quotas

What are the default resource limits per job?

Default limits vary by cluster but typically include:

  • Max CPUs per job

  • Max memory per job

  • Max GPUs per job

  • Max job runtime (often UNLIMITED)

Check current limits:

scontrol show config | grep -i max
sinfo -o "%20P %5a %.10l %16F"

How do I request more resources for my jobs?

If you need resources beyond default limits:

  1. Temporary increase: Contact support for specific jobs

  2. Permanent increase: Discuss with your account team

Provide:

  • Current limits

  • Desired limits

  • Use case and justification

Can I reserve nodes for exclusive use?

For dedicated node access:

srun --exclusive <other-options> your_command

For longer-term reservations, contact support to discuss:

  • Number of nodes needed

  • Duration

  • GPU types

  • Business justification

Monitoring and Debugging

How do I check the status of my jobs?

View your jobs:

squeue -u $USER

Detailed job information:

scontrol show job <job_id>

Job history:

sacct -u $USER
sacct -j <job_id> --format=JobID,JobName,Partition,State,ExitCode,Start,End

How do I view Slurm logs for debugging?

For job output:

  • Standard output: slurm-<jobid>.out (written to the job's submission directory by default)

  • Standard error: combined into the same file unless you request a separate one with --error=<file>
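
If you prefer separate or relocated log files, sbatch directives along these lines work; the directory is a placeholder:

#SBATCH --output=/data/<username>/logs/%x-%j.out   # %x = job name, %j = job ID
#SBATCH --error=/data/<username>/logs/%x-%j.err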

For system-level Slurm issues: Contact support for access to:

  • Slurm controller logs

  • Slurmd (node daemon) logs

  • Slurmstepd (job step daemon) logs

What information should I include when reporting Slurm issues?

To help support resolve issues quickly, provide:

Required:

  • Cluster name/ID

  • Username

  • Exact error message

  • Timestamp

  • Commands you ran

Helpful:

  • Job ID (if applicable)

  • Output of sinfo

  • Output of scontrol show config | grep <relevant-setting>

  • Whether issue affects multiple users

  • Recent changes to your workflow

Known Issues and Workarounds

Slurm controller becomes unreachable periodically

If you experience periodic Slurm controller outages:

  • This may indicate infrastructure issues requiring investigation

  • Workaround: Wait for controller to recover (usually 5-15 minutes)

  • Long-term fix: Contact support to investigate root cause

Login node SSH disconnects frequently

For unstable login node connections:

  • Use tmux or screen to maintain sessions across disconnects (see the example after this list)

  • Keep critical work in version control or persistent storage

  • Report frequent disconnections to support for investigation
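
A typical tmux workflow for surviving disconnects (the session name is arbitrary):

tmux new -s work      # start a named session on the login node
# ...connection drops; reconnect and run:
tmux attach -t work   # reattach to the same session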

PyTorch distributed jobs fail with import errors

Common error:

ModuleNotFoundError: No module named 'torch.distributed'

Causes:

  • Python environment not consistent across nodes

  • Missing packages on compute nodes

Solutions:

  • Use containerized environments (Docker/Singularity)

  • Ensure packages are installed on a shared filesystem

  • Use Pyxis for containerized Slurm jobs
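
If the Pyxis plugin is enabled on your cluster, a containerized step looks roughly like this; the image tag is only an example:

srun --container-image=nvcr.io/nvidia/pytorch:24.05-py3 --gres=gpu:1 \
     python -c "import torch; print(torch.__version__)"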

Jobs fail with library version conflicts

For library conflicts (e.g., CUDA, NCCL, PyTorch versions):

  • Use containers to ensure consistent environments

  • Check that library versions are compatible with GPU hardware

  • Contact support if you need specific library versions installed cluster-wide

Best Practices

Efficient job submission

  1. Test interactively first: Use srun for small tests before sbatch for large jobs

  2. Request only needed resources: Don't over-allocate CPUs, memory, or GPUs

  3. Use job arrays: For many similar jobs, use array jobs instead of individual submissions

  4. Set appropriate time limits: Use --time to help the scheduler optimize allocation

Data management

  1. Use shared filesystems: Store data on /data or other mounted shared storage

  2. Don't use local node storage: Data on local disks is lost when jobs end

  3. Clean up temporary files: Remove job outputs and logs periodically

  4. Back up critical data: Don't rely solely on cluster storage

Debugging strategies

  1. Start small: Debug with single-node, single-GPU jobs before scaling

  2. Check logs immediately: Review error messages before jobs are purged

  3. Use interactive sessions: srun --pty bash for hands-on debugging

  4. Test network connectivity: Verify inter-node communication for multi-node jobs