slotd

slotd is a Rust-based, single-node, single-user scheduler with a Slurm-style command surface.

It is intended for one workstation, not for a cluster. The goal is to keep common Slurm command names and familiar options while simplifying the runtime model:

  • one local daemon
  • one SQLite database
  • one execution host
  • one local user workflow

You use slotd through the same command names you would expect in Slurm:

  • sbatch
  • srun
  • salloc
  • squeue
  • sacct
  • scontrol
  • scancel
  • sinfo

What It Is Good For

slotd works well for:

  • local experiment queues
  • long-running CPU or GPU jobs
  • one-machine batch pipelines
  • interactive work with resource reservation
  • a lightweight Slurm-like interface on a workstation

It is not trying to provide:

  • multi-node scheduling
  • cluster administration
  • accounts, QoS, or fairshare
  • federation or reservations across hosts

Main Characteristics

  • Built as a single Rust binary
  • Uses a daemon plus a Unix domain socket
  • Persists state in SQLite
  • Schedules CPU, memory, and GPU reservations
  • Supports batch jobs, arrays, interactive runs, allocations, and steps
  • Supports delayed start, requeue-once, dependencies, and local feature constraints

Installation

Requirements

  • Linux or WSL
  • Rust toolchain with cargo
  • systemd --user if you want automatic daemon management
  • nvidia-smi if you want automatic GPU detection

Clone the Repository

git clone https://github.com/ymgaq/slotd.git
cd slotd

Install with the Provided Script

From the repository root:

./scripts/install.sh

By default this will:

  • build slotd in release mode
  • install slotd under ~/.local/bin
  • create command aliases such as sbatch and squeue
  • create a runtime root under ~/.local/share/slotd
  • write ~/.config/slotd/slotd.env
  • install and start a systemd --user service

Installer Options

Option | Description | Default
--repo-root PATH | Build from a different repository root | current repo
--profile NAME | Cargo profile to build | release
--install-bin-dir PATH | Install binary and alias directory | ~/.local/bin
--runtime-root PATH | Runtime root used as SLOTD_ROOT | ~/.local/share/slotd
--config-dir PATH | Configuration directory | ~/.config/slotd
--systemd-user-dir PATH | User unit directory | ~/.config/systemd/user
--cpu-partitions VALUE | Value for SLOTD_CPU_PARTITIONS | cpu
--gpu-partitions VALUE | Value for SLOTD_GPU_PARTITIONS | gpu
--features VALUE | Value for SLOTD_FEATURES | unset
--notify-cmd VALUE | Value for SLOTD_NOTIFY_CMD | unset
--cgroup-base PATH | Value for SLOTD_CGROUP_BASE | unset
--skip-build | Reuse an existing build output | off
--skip-systemd | Do not install or start a user service | off
--uninstall | Remove the installed setup | off
--purge-runtime | Remove persisted state during uninstall | off

Example:

./scripts/install.sh \
  --features cpu,gpu \
  --notify-cmd 'notify-send "slotd" "$SLOTD_JOB_ID $SLOTD_JOB_STATE"'

If you set --cgroup-base, use a writable cgroup v2 subtree. Leaving it unset keeps CPU and memory as reservation-only scheduling values.

Uninstall

Remove the installation:

./scripts/install.sh --uninstall

Remove installation and runtime state:

./scripts/install.sh --uninstall --purge-runtime

Manual Setup

If you do not want to use the installer, you can still build and run slotd directly:

cargo build --release
SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd daemon

Then use the same SLOTD_ROOT in another shell:

SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd sbatch --wrap 'echo hello'

Runtime Files

The default runtime root is:

~/.local/share/slotd

Important files and directories:

  • run/slotd.sock
  • lib/state.db
  • lib/jobs/<job_id>/

The client and the daemon must use the same SLOTD_ROOT.

Quick Start

1. Verify the Daemon

If you installed with the script and did not use --skip-systemd, the daemon should already be running.

Check the basic commands:

sinfo
squeue
sacct

Typical first-run output:

  • sinfo shows one row per configured partition
  • squeue is empty
  • sacct is empty

2. Submit a Simple Batch Job

sbatch --wrap 'echo hello from slotd'

Typical output:

Submitted batch job 1

3. Inspect the Queue

squeue

Typical output while a job is active:

JOBID | PARTITION | NAME | USER | ST | TIME | NODELIST(REASON)
1     | cpu       | wrap | ...  | R  | 0:00 | localhost

4. Inspect Completed Jobs

sacct

Typical output after the job finishes:

JobID | Partition | JobName | User | State     | ExitCode
1     | cpu       | wrap    | ...  | COMPLETED | 0:0

5. Show Detailed Job Information

scontrol show job 1

This shows:

  • job identity
  • job state and reason
  • requested resources
  • output paths
  • working directory
  • timestamps

6. Try an Interactive Run

srun --label --unbuffered -- echo hello

Typical output:

0: hello

Runtime Model

High-Level Model

slotd is a single-host scheduler. The runtime model is intentionally simple:

  • one local daemon
  • one local SQLite database
  • one local execution host
  • one local user workflow

There is no controller/worker split and no remote node launch protocol.

Core Resources

slotd schedules three resource types:

  • CPU
  • memory
  • GPU

Current behavior:

  • CPU reservation is ntasks * cpus-per-task
  • ntasks launches one local process per task rank for batch and foreground execution
  • total memory defaults to host-detected MemTotal from /proc/meminfo, with a 16384 MB fallback
  • memory is stored in MB
  • GPUs are integer slots
  • admission is reservation-based, not usage-based
  • if SLOTD_CGROUP_BASE is unset, CPU and memory remain reservation-only
  • if SLOTD_CGROUP_BASE is set to a writable cgroup v2 subtree, slotd writes memory.max and cpu.max
  • if cgroup setup fails after explicit configuration, launch fails instead of silently skipping enforcement
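The host-memory default described above can be sketched in a few lines of shell. This is an illustrative sketch, not slotd's actual code; `detect_mem_total_mb` is a hypothetical helper name.

```shell
# Sketch: detect total host memory in MB from /proc/meminfo,
# falling back to 16384 MB when detection fails.
detect_mem_total_mb() {
  local kb
  kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
  if [ -n "$kb" ]; then
    echo $(( kb / 1024 ))   # meminfo reports kB; slotd stores MB
  else
    echo 16384              # fallback when /proc/meminfo is unavailable
  fi
}

detect_mem_total_mb
```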

Partitions

Configured by environment:

  • SLOTD_CPU_PARTITIONS
  • SLOTD_GPU_PARTITIONS

Rules:

  • only configured partition names are accepted
  • if there are no GPUs, no GPU partition is exposed
  • if a GPU partition is selected and --gpus is omitted, the default GPU request is 1
  • otherwise the default GPU request is 0
  • CPU and GPU partitions are virtual views over one local host
  • CPU and memory capacity stay shared across partitions; only GPU visibility/defaults differ by partition

GPU Detection

If SLOTD_GPU_COUNT is not set, slotd tries to detect GPUs from nvidia-smi.

The current implementation checks:

  • nvidia-smi
  • /usr/bin/nvidia-smi
  • /usr/lib/wsl/lib/nvidia-smi
  • /bin/nvidia-smi
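The fallback search could look roughly like the sketch below, where `detect_gpu_count` is a hypothetical helper that probes each candidate path in order; `--list-gpus` is used here as a stand-in for whatever query the daemon actually issues.

```shell
# Sketch: return the GPU count from the first candidate nvidia-smi
# path that runs successfully; 0 if none of them work.
detect_gpu_count() {
  local cand
  for cand in "$@"; do
    if "$cand" --list-gpus >/dev/null 2>&1; then
      "$cand" --list-gpus | grep -c '^GPU'
      return
    fi
  done
  echo 0
}

# The candidate list mirrors the paths the daemon checks.
detect_gpu_count nvidia-smi /usr/bin/nvidia-smi /usr/lib/wsl/lib/nvidia-smi /bin/nvidia-smi
```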

Job Types

Persisted records are one of:

  • top-level batch jobs
  • allocation-only jobs
  • array tasks
  • steps under allocations

Job States

Implemented states:

  • PENDING
  • RUNNING
  • COMPLETING
  • COMPLETED
  • FAILED
  • CANCELLED
  • TIMEOUT
  • OUT_OF_MEMORY

Terminal states:

  • COMPLETED
  • FAILED
  • CANCELLED
  • TIMEOUT
  • OUT_OF_MEMORY

Scheduling Rules

The daemon loop runs every 300ms.

Pending jobs are blocked by:

  • dependencies
  • array concurrency limits
  • delayed start time
  • exclusive host use
  • insufficient reserved resources
  • user hold state

Ordering:

  • submission order is the base rule
  • explicit job priority can override pure submission order
  • array tasks are interleaved by array group

Runtime Files

Within SLOTD_ROOT:

  • run/slotd.sock: daemon socket
  • lib/state.db: SQLite state
  • lib/jobs/<job_id>/script.sh: batch script
  • lib/jobs/<job_id>/runner.sh: daemon wrapper
  • lib/jobs/<job_id>/exit_status: wrapper exit status

Notifications

If SLOTD_NOTIFY_CMD is set, slotd runs it whenever a top-level job reaches a terminal state.

Exported variables:

  • SLOTD_JOB_ID
  • SLOTD_JOB_NAME
  • SLOTD_JOB_STATE
  • SLOTD_JOB_PARTITION
  • SLOTD_JOB_REASON

Batch Jobs with sbatch

Forms

sbatch [options] <script>
sbatch [options] --wrap '<command>'

What sbatch Does

sbatch creates a persisted batch job record and submits it to the local daemon.

In script mode:

  • it reads the script from disk
  • stores the body in the job directory
  • parses leading #SBATCH directives

In --wrap mode:

  • it creates an internal shell script around the command
  • it launches one local process per task rank when --ntasks is greater than 1

Typical output:

Submitted batch job 1

With --parsable:

1

Main Options

Option | Meaning
--wrap <command> | Submit an inline shell command
-J, --job-name <name> | Set the job name
-p, --partition <partition> | Choose a partition
-c, --cpus-per-task <n> | CPUs per task
-n, --ntasks <n> | Number of concurrently launched local tasks
--mem <size> | Requested memory, such as 512M or 8G
-t, --time <time> | Time limit
-G, --gpus <n> | Requested GPU slots
-o, --output <path> | Stdout path pattern
-e, --error <path> | Stderr path pattern
-D, --chdir <path> | Working directory
--constraint <feature> | Require matching local features
-d, --dependency <spec> | Dependency expression
-a, --array <spec> | Array specification
--export <spec> | Export environment values into the job
--export-file <path> | Load environment variables from a file
--open-mode append|truncate | Append to or truncate output files
--signal <spec> | Send a warning signal before timeout
--begin <time> | Delay job eligibility
--exclusive | Do not share the host with other top-level jobs
--requeue | Requeue once after certain failure states
--parsable | Print only the job ID
-W, --wait | Wait for job completion

Defaults

When not specified:

  • cpus-per-task = 1
  • ntasks = 1
  • mem = 512M
  • partition = configured default partition
  • GPUs default to 1 for GPU partitions and 0 otherwise
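Since memory is stored in MB, a --mem string such as 512M or 8G has to be normalized on submission. A minimal sketch of that conversion, using a hypothetical `mem_to_mb` helper:

```shell
# Sketch: normalize a --mem value such as 512M or 8G into MB.
mem_to_mb() {
  case "$1" in
    *G) echo $(( ${1%G} * 1024 )) ;;   # gigabytes -> MB
    *M) echo "${1%M}" ;;               # already MB
    *)  echo "$1" ;;                   # bare numbers treated as MB here
  esac
}
```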

#SBATCH Support

Supported directives:

  • -J, --job-name
  • -p, --partition
  • -c, --cpus-per-task
  • -n, --ntasks
  • --mem
  • -t, --time
  • -G, --gpus
  • -o, --output
  • -e, --error
  • -D, --chdir
  • --constraint
  • --begin
  • --exclusive
  • --requeue
  • -d, --dependency
  • -a, --array

Precedence:

  1. command-line options
  2. SBATCH_* environment variables
  3. #SBATCH directives
  4. built-in defaults
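The precedence chain amounts to a first-non-empty lookup across the four sources. A sketch, where `resolve_option` is a hypothetical name used only for illustration:

```shell
# Sketch: resolve one option value using the documented precedence:
# command line > SBATCH_* environment > #SBATCH directive > default.
resolve_option() {
  local cli="$1" env_val="$2" directive="$3" default="$4" v
  for v in "$cli" "$env_val" "$directive" "$default"; do
    if [ -n "$v" ]; then
      echo "$v"
      return
    fi
  done
}
```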

Example batch script:

#!/usr/bin/env bash
#SBATCH -J script-demo
#SBATCH -p cpu
#SBATCH -c 2
#SBATCH --mem 1G
#SBATCH -t 00:05:00
#SBATCH -o logs/%j.out

echo "hello from script mode"
echo "job=$SLURM_JOB_ID cpus=$SLURM_CPUS_PER_TASK"

Submit it with:

sbatch ./script-demo.sh

Expected result:

  • sbatch reads the script from disk and applies the leading #SBATCH directives
  • the job runs with the requested name, partition, CPU count, memory, and output path
  • logs/<jobid>.out contains the echoed lines from the script body

Dependencies

Supported dependency expressions:

  • after:<jobid>[,<jobid>...]
  • afterany:<jobid>[,<jobid>...]
  • afterok:<jobid>[,<jobid>...]
  • afternotok:<jobid>[,<jobid>...]
  • singleton

Arrays

Supported array forms:

  • single IDs
  • ranges, such as 0-7
  • stepped ranges, such as 0-15:2
  • concurrency limits, such as 0-31%4

Example:

sbatch -a 0-9%2 --wrap 'echo task=$SLURM_ARRAY_TASK_ID'

Expected result:

  • multiple persisted task records
  • at most two running at the same time for that array
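The supported spec forms can be illustrated with a small expansion sketch; `expand_array_spec` is a hypothetical helper, not slotd's parser:

```shell
# Sketch: expand an --array spec such as 0-15:2 or 0-31%4 into task
# IDs plus an optional concurrency limit.
expand_array_spec() {
  local spec="$1" limit="" step=1 start end
  case "$spec" in *%*) limit="${spec#*%}"; spec="${spec%\%*}" ;; esac  # %N limit
  case "$spec" in *:*) step="${spec#*:}"; spec="${spec%:*}" ;; esac    # :N step
  case "$spec" in
    *-*) start="${spec%-*}"; end="${spec#*-}" ;;
    *)   start="$spec"; end="$spec" ;;                                 # single ID
  esac
  seq "$start" "$step" "$end"
  if [ -n "$limit" ]; then echo "limit=$limit"; fi
}
```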

Delayed Start

--begin supports:

  • epoch seconds
  • YYYY-MM-DD
  • YYYY-MM-DDTHH:MM:SS
  • now+<duration>

Example:

sbatch --begin now+00:10:00 --wrap 'echo delayed'
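For the now+<duration> form, the conversion to an absolute eligibility time can be sketched as below. This illustrative helper handles only the now+HH:MM:SS and epoch-seconds forms, not the date forms.

```shell
# Sketch: turn a --begin value of the form now+HH:MM:SS into an
# absolute epoch timestamp; bare epoch seconds pass through.
begin_to_epoch() {
  case "$1" in
    now+*)
      local dur="${1#now+}"
      local h="${dur%%:*}" rest="${dur#*:}"
      local m="${rest%%:*}" s="${rest#*:}"
      # 10# forces base-10 so components like 08 are not read as octal
      echo $(( $(date +%s) + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
      ;;
    *) echo "$1" ;;
  esac
}
```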

Requeue Once

--requeue changes failure handling:

  • FAILED requeues once
  • TIMEOUT requeues once
  • OUT_OF_MEMORY requeues once
  • COMPLETED does not requeue
  • CANCELLED does not requeue

Example:

sbatch --requeue --wrap 'exit 1'

Output Paths

Pattern tokens:

  • %j: job ID
  • %A: array job ID
  • %a: array task ID
  • %x: job name
  • %u: user name
  • %N: hostname
  • %%: literal %

Defaults:

  • non-array stdout: slurm-%j.out
  • array stdout: slurm-%A_%a.out
  • stderr defaults to stdout unless --error is set
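Token expansion for a non-array job can be sketched as follows. `expand_output_path` is illustrative; the sentinel substitution just keeps a literal %% from being re-expanded by the later token passes.

```shell
# Sketch: expand the stdout pattern tokens for a non-array job.
expand_output_path() {
  local pattern="$1" job_id="$2" name="$3"
  pattern="${pattern//%%/$'\x01'}"        # protect literal % first
  pattern="${pattern//%j/$job_id}"        # %j: job ID
  pattern="${pattern//%x/$name}"          # %x: job name
  pattern="${pattern//%u/$(id -un)}"      # %u: user name
  pattern="${pattern//%N/$(hostname)}"    # %N: hostname
  printf '%s\n' "${pattern//$'\x01'/%}"   # restore literal %
}
```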

Environment Export

--export supports:

  • ALL
  • NONE
  • KEY=VALUE,...

Example:

sbatch --export FOO=bar,HELLO=world --wrap 'echo "$FOO $HELLO"'

Expected result:

  • the output contains bar world

Interactive Execution with srun

Form

srun [options] -- <command...>

What srun Does

srun runs a command in the foreground by default.

Behavior depends on whether you are already inside an allocation:

  • inside an allocation:
    • creates a step record
    • runs the command directly in the foreground
  • outside an allocation:
    • creates an allocation-like top-level record
    • waits for it to run
    • creates a step record
    • runs the command in the foreground

Only --no-wait submits a daemon-managed run job.

When --ntasks is greater than 1, foreground srun launches one local process per task rank on the same host and exports task-local ranks through SLURM_PROCID and SLURM_LOCALID.
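The per-rank launch behavior can be approximated in shell. This is a sketch of the described semantics, not slotd's runner:

```shell
# Sketch: launch one local process per task rank, exporting the
# rank variables described above, then wait for all of them.
launch_tasks() {
  local ntasks="$1"; shift
  local rank
  for rank in $(seq 0 $(( ntasks - 1 ))); do
    SLURM_PROCID="$rank" SLURM_LOCALID="$rank" "$@" &
  done
  wait   # foreground srun waits for every task rank
}

launch_tasks 3 sh -c 'echo "rank=$SLURM_PROCID"'
```

Output order is not deterministic, since the ranks run concurrently.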

Main Options

Option | Meaning
-J, --job-name <name> | Set the job name
-p, --partition <partition> | Choose a partition
-c, --cpus-per-task <n> | CPUs per task
-n, --ntasks <n> | Number of concurrently launched local tasks
--mem <size> | Requested memory
-t, --time <time> | Time limit
-G, --gpus <n> | Requested GPU slots
-o, --output <path> | Foreground stdout path
-e, --error <path> | Foreground stderr path
-D, --chdir <path> | Working directory
--immediate | Fail if resources are not available immediately
--pty | Reserved for PTY support; currently rejected
--constraint <feature> | Require matching local features
--cpu-bind <mode> | Bind CPU affinity
--label | Prefix output with <task_id>:
--unbuffered | Flush forwarded output eagerly
--no-wait | Submit a daemon-managed run job

Output Behavior

Example:

srun --label --unbuffered -- echo hello

Typical output:

0: hello

CPU Binding

Supported values:

  • none
  • cores
  • map_cpu:<id,id,...>

Example:

srun --cpu-bind map_cpu:0,2 -- python train.py

Immediate Mode

--immediate fails instead of waiting if resources are not available right away.

Example:

srun --immediate -p gpu -G 1 -- nvidia-smi

--no-wait

--no-wait submits a run job to the daemon instead of waiting in the foreground.

Typical output:

Submitted run job 12

Restrictions:

  • --label and --unbuffered are not supported together with --no-wait
  • --pty is parsed for compatibility but currently exits with a clear “not implemented yet” error until a real PTY path exists

Allocations with salloc

Form

salloc [options] [command...]

What salloc Does

salloc creates an allocation-only top-level job, waits for it to become runnable, and then starts a foreground command inside that allocation.

If no command is given, it starts your shell.

When --ntasks is greater than 1, the foreground command launches one local process per task rank on the same host.

Typical output:

Granted job allocation 4

Main Options

Option | Meaning
-J, --job-name <name> | Set the allocation name
-p, --partition <partition> | Choose a partition
-c, --cpus-per-task <n> | CPUs per task
-n, --ntasks <n> | Number of concurrently launched local tasks
--mem <size> | Requested memory
-t, --time <time> | Time limit
-G, --gpus <n> | Requested GPU slots
-D, --chdir <path> | Working directory
--constraint <feature> | Require matching local features
--immediate | Fail if the allocation cannot start immediately

Example

salloc -p gpu -c 4 --mem 8G -G 1 -t 00:30:00

Expected result:

  • an allocation record is created
  • the command waits until the allocation is running
  • your shell starts inside the allocation
  • later srun commands become steps under that allocation
  • the allocation command uses the allocation task count for local multi-task execution

Queue and Accounting

squeue

squeue shows top-level queued and running jobs.

Common Options

Option | Meaning
--all | Show all states
-t, --states | Filter by state
-j, --jobs | Filter by job IDs
-u, --user | Filter by user
-p, --partition | Filter by partition
-o, --format | Select output fields
-S, --sort | Sort rows
-l, --long | Long default view
--start | Show estimated start times
--array | Show array-style job IDs
--noheader | Omit the header

Default View

JOBID | PARTITION | NAME | USER | ST | TIME | NODELIST(REASON)

Long View

JOBID | PARTITION | NAME | USER | ST | TIME | TIME_LIMIT | NTASKS | CPUS | REQ_MEM | REQ_GPU | NODELIST(REASON)

Start-Time View

With --start and no explicit format:

JOBID | PARTITION | NAME | USER | ST | START_TIME | NODELIST(REASON)

Format Fields

Supported field names:

  • JobID
  • Partition
  • Name, JobName
  • User
  • ST, State
  • Time, Elapsed
  • TimeLimit, Time_Limit
  • NTasks
  • CPUS, ReqCPUS
  • ReqMem
  • ReqGPU, ReqGPUS
  • Start, StartTime
  • NodeList(Reason), NodeListReason, Reason, NodeList

Supported % codes:

  • %i
  • %P
  • %j
  • %u
  • %t, %T
  • %M
  • %S
  • %R, %N

sacct

sacct shows persisted accounting data, including completed jobs and steps.

Common Options

Option | Meaning
-j, --jobs | Filter by job IDs
-s, --state | Filter by state
-S, --starttime | Filter by start time
-E, --endtime | Filter by end time
-u, --user | Filter by user
-p, --partition | Filter by partition
-o, --format | Select output fields
-P, --parsable2 | Use pipe-delimited output
-n, --noheader | Omit the header

Default View

JobID | Partition | JobName | User | State | ExitCode

Record Types

sacct includes:

  • top-level jobs
  • allocation records
  • step records
  • completed records

ID rendering rules:

  • step IDs appear as <job_id>.<step_id>
  • array tasks appear as <array_job_id>_<task_id>
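The rendering rules above amount to a small conditional; `render_sacct_id` is a hypothetical helper used only for illustration:

```shell
# Sketch: render an accounting ID. Steps win over array metadata,
# and plain jobs fall through to the bare job ID.
render_sacct_id() {
  local job_id="$1" step_id="$2" array_job_id="$3" task_id="$4"
  if [ -n "$step_id" ]; then
    echo "${job_id}.${step_id}"          # step under an allocation
  elif [ -n "$array_job_id" ]; then
    echo "${array_job_id}_${task_id}"    # array task
  else
    echo "$job_id"                       # plain top-level job
  fi
}
```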

Format Fields

Supported field names:

  • JobID
  • ArrayJobID
  • ArrayTaskID
  • JobName
  • Partition
  • User
  • State
  • Reason
  • ExitCode
  • Elapsed
  • AllocCPUS
  • ReqMem
  • ReqTRES
  • AllocTRES
  • NodeList
  • Submit
  • Start
  • End
  • WorkDir
  • BatchFlag
  • MaxRSS

Supported % codes:

  • %i
  • %F
  • %K
  • %j
  • %P
  • %u
  • %t, %T
  • %R
  • %X
  • %M
  • %C
  • %m
  • %b
  • %B
  • %N
  • %V
  • %S
  • %E
  • %Z

Job Control

scontrol

Supported forms:

scontrol show job <job_id>
scontrol hold job <job_id>
scontrol release job <job_id>
scontrol update job <job_id> KEY=VALUE...

show job

Shows detailed job information, including:

  • job identity and ownership
  • state and reason
  • requested resources
  • time limits
  • dependency string
  • submit, start, and end timestamps
  • command and working directory
  • stdout and stderr paths
  • array metadata
  • ReqTRES
  • AllocTRES
  • MaxRSS
  • step summary

hold job

Moves a pending job into a held state.

Result:

  • the job stays PENDING
  • the reason becomes JobHeldUser

release job

Releases a held pending job back into normal scheduling.

update job

Supported update keys:

Key | Rule
JobName / Name | Can be changed only while the job is PENDING
Partition | Can be changed only while the job is PENDING
TimeLimit / Time | Can be changed until the job reaches a terminal state
Priority | Can be changed only while the job is PENDING

Example:

scontrol update job 10 TimeLimit=02:00:00

scancel

Supported forms:

scancel <job_id>
scancel <job_id.step_id>
scancel --signal <sig> <job_id>
scancel --signal <sig> <job_id.step_id>

Default Cancel Behavior

  • pending jobs become CANCELLED immediately
  • running jobs transition through COMPLETING
  • the runner sends SIGTERM
  • after the grace period it sends SIGKILL if needed

The recorded cancel reason is:

CancelledByUser
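The TERM-then-KILL sequence can be sketched as follows. The 2-second default grace period here is illustrative, since the document does not state the actual value.

```shell
# Sketch: cancel a running job's process the way the runner is
# described to: SIGTERM first, SIGKILL after the grace period.
terminate_with_grace() {
  local pid="$1" grace="${2:-2}" waited=0
  kill -TERM "$pid" 2>/dev/null          # polite stop first
  while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
    sleep 1
    waited=$(( waited + 1 ))
  done
  if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid" 2>/dev/null        # force after the grace period
  fi
  return 0
}
```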

Signal Mode

--signal sends a specific signal instead of performing normal cancellation.

Example:

scancel --signal TERM 12

Node and Partition View

sinfo

sinfo shows partition and host state for the local node.

Common Options

Option | Meaning
-p, --partition | Filter partitions
-N, --Node | Switch to the single-node summary view
-l, --long | Use the long default view
-o, --format | Select output fields
--noheader | Omit the header

Default View

Typical output:

PARTITION | HOSTNAMES | STATE | FEATURES | GRES_USED
cpu*      | localhost | idle  | cpu      | N/A
gpu       | localhost | idle  | cpu,generic_gpu | gpu:0

Notes:

  • one row is shown per configured partition
  • the default partition is marked with *
  • CPU partitions show only cpu in FEATURES
  • GPU partitions show cpu plus detected GPU model features, and do not include the generic gpu label in FEATURES
  • CPU and GPU partitions are virtual views for convenience on the same host
  • CPU and memory capacity are shared across those partitions rather than split into separate pools

Node View

sinfo -N collapses the partition rows into a single local node summary.

Typical output:

PARTITION | HOSTNAMES | STATE
cpu,gpu*  | localhost | idle

Notes:

  • the PARTITION column becomes a comma-joined partition list
  • the default partition in that list keeps the * marker
  • capacity and allocation fields in long or formatted node views are aggregated across the visible partitions

Long View

sinfo -l adds capacity and allocation details:

PARTITION | HOSTNAMES | STATE | FEATURES | CPUS | CPU_ALLOC | MEMORY | MEM_ALLOC | GPUS | GPU_ALLOC | RUNNING | PENDING | GRES_USED

Supported Format Fields

Field names:

  • Partition
  • Hostnames, Hostname, NodeList
  • State
  • Features
  • CPUS
  • CPU_ALLOC, CPUSLOAD, CPUALLOC
  • Memory, Mem
  • MEM_ALLOC, MemoryAllocated, MemAlloc
  • GPUS
  • GPU_ALLOC, GpusAllocated, GpuAlloc
  • Running, RunningJobs
  • Pending, PendingJobs
  • GRES_USED, GresUsed

Supported % codes:

  • %P
  • %N
  • %t, %T
  • %f
  • %G

Testing

slotd is mainly verified with Rust integration tests in tests/. Each test creates an isolated temporary runtime, launches its own daemon, and drives the public Slurm-style commands through the compiled slotd binary. That keeps test state separate from your normal local SLOTD_ROOT.

How to Run the Tests

Run the full suite:

cargo test

Run one integration test file:

cargo test --test scheduling

Run one named test:

cargo test dependency_job_waits_for_prerequisite_before_running --test scheduling

What the Tests Cover

The current suite focuses on behavior that matters to end users:

  • core command flows for sbatch, srun, salloc, sinfo, squeue, sacct, and scontrol
  • scheduling rules such as dependencies, job arrays, delayed start, constraints, resource flags, and requeue behavior
  • interactive and foreground execution paths including srun --pty, --label, --unbuffered, and allocation or step handling
  • output, recovery, and lifecycle behavior including cancellation, warning signals, update processing, and output file placement
  • notification and reporting paths such as SLOTD_NOTIFY_CMD and parsable query output

Representative files in tests/ include:

  • cli_basic.rs, sbatch_options.rs, srun.rs, srun_options.rs, srun_modes.rs, salloc.rs, sinfo.rs, control.rs, query_squeue.rs, query_sacct.rs
  • scheduling.rs, compound_scheduling.rs, dependency_variants.rs, array.rs, begin.rs, constraint.rs, resource_flags.rs, requeue.rs, timeout.rs
  • srun_interactive.rs, srun_allocation.rs, cpu_bind.rs, output_files.rs, cancellation.rs, recovery.rs, update.rs, warning_signal.rs, notify.rs

Manual Smoke Testing

For a quick manual check, start the daemon:

cargo run -- daemon

Then submit a small job from another shell using the same SLOTD_ROOT:

cargo run -- sbatch --wrap 'echo hello'

Examples

CPU Batch Job

sbatch \
  -J hello \
  -p cpu \
  -c 1 \
  --mem 512M \
  -t 00:05:00 \
  -o logs/%j.out \
  --wrap 'echo hello'

Expected result:

  • Submitted batch job <id>
  • logs/<id>.out contains hello

GPU Batch Job

sbatch \
  -J gpu-demo \
  -p gpu \
  -c 4 \
  --mem 8G \
  -G 1 \
  -t 01:00:00 \
  -o logs/%j.out \
  --wrap 'nvidia-smi'

Expected result:

  • the job runs on the GPU partition
  • output contains nvidia-smi data

Interactive Foreground Run

srun --label --unbuffered -- echo hello

Expected result:

0: hello

Interactive Allocation

salloc -p gpu -c 4 --mem 8G -G 1 -t 00:30:00

Expected result:

  • Granted job allocation <id>
  • a shell starts inside the allocation

Array Job

sbatch \
  -J array-demo \
  -a 0-9%2 \
  -o logs/%A_%a.out \
  --wrap 'echo task=$SLURM_ARRAY_TASK_ID'

Expected result:

  • multiple task records
  • logs such as logs/<array_id>_0.out

Requeue Once

sbatch --requeue --wrap 'exit 1'

Expected result:

  • the first failure returns the job to PENDING
  • the second failure leaves the final state as FAILED

Delayed Start

sbatch --begin now+00:10:00 --wrap 'echo delayed'

Expected result:

  • the job remains pending until the begin time
  • squeue --start shows an estimated future start time

Explicit Export

sbatch \
  --export FOO=bar,HELLO=world \
  --wrap 'echo "$FOO $HELLO"'

Expected result:

  • output contains bar world

Manual Daemon Run

SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd daemon

Then from another shell:

SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd sbatch --wrap 'echo hello'

Troubleshooting

Connection refused

Symptom:

error: io error: Connection refused (os error 111)

Meaning:

  • the client found a socket path
  • but no daemon is currently accepting connections there

Check:

  • the daemon is actually running
  • the client and the daemon use the same SLOTD_ROOT
  • you do not have a stale socket file from an older run

Useful checks:

ls -l "$SLOTD_ROOT/run/slotd.sock"
sinfo

Wrong Runtime Root

If the daemon uses one runtime root and the client uses another, commands appear to fail randomly.

Common cause:

  • daemon started by systemd --user
  • client launched from a shell without the same SLOTD_ROOT

Best fix:

  • install with scripts/install.sh
  • use the wrapper commands from ~/.local/bin

No GPU Partition Appears

If sinfo only shows cpu, check:

  • nvidia-smi works in the daemon environment
  • SLOTD_GPU_PARTITIONS is configured as expected
  • the daemon has been restarted after GPU configuration changes

Manual check:

nvidia-smi
sinfo

Text file busy During Reinstall

The installer now replaces binaries atomically. If you still hit issues:

  • stop the daemon
  • run the installer again

Commands:

systemctl --user restart slotd.service
./scripts/install.sh

Jobs Stay Pending

Possible causes:

  • insufficient resources
  • dependency not satisfied
  • array concurrency limit
  • delayed start time not reached
  • job is held
  • an exclusive job is already running

Inspect with:

squeue
scontrol show job <job_id>

cgroup Setup Fails

If SLOTD_CGROUP_BASE is set but does not point at a writable cgroup v2 subtree, job launch fails with an explicit cgroup error.

Check:

  • the path exists in the daemon environment
  • the path is a cgroup v2 subtree, not a regular directory or file
  • the daemon user can create per-job subdirectories and write control files there
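These checks can be scripted as a quick preflight. `check_cgroup_base` is an illustrative helper; it relies on `cgroup.controllers`, the standard marker file that a cgroup v2 directory exposes.

```shell
# Sketch: preflight a candidate SLOTD_CGROUP_BASE value before
# pointing the daemon at it.
check_cgroup_base() {
  local base="$1"
  [ -d "$base" ] || { echo "not a directory"; return 1; }
  [ -f "$base/cgroup.controllers" ] || { echo "not a cgroup v2 subtree"; return 1; }
  [ -w "$base" ] || { echo "not writable"; return 1; }
  echo ok
}
```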

A Job Was Cancelled but Ended as OUT_OF_MEMORY

This can happen if cgroup memory events indicate an OOM during termination. In that case the final state prefers OUT_OF_MEMORY.

Output Files Are Missing

Check:

  • the job’s working directory
  • the -o/--output and -e/--error paths
  • pattern expansion such as %j or %A_%a

Inspect with:

scontrol show job <job_id>