slotd

slotd is a Rust-based, single-node, single-user scheduler with a Slurm-style command surface.

It is intended for one workstation, not for a cluster. The goal is to keep common Slurm command names and familiar options while simplifying the runtime model:

  • one local daemon
  • one SQLite database
  • one execution host
  • one local user workflow

You use slotd through the same command names you would expect in Slurm:

  • sbatch
  • srun
  • salloc
  • squeue
  • sacct
  • scontrol
  • scancel
  • sinfo

What It Is Good For

slotd works well for:

  • local experiment queues
  • long-running CPU or GPU jobs
  • one-machine batch pipelines
  • interactive work with resource reservation
  • a lightweight Slurm-like interface on a workstation

It is not trying to provide:

  • multi-node scheduling
  • cluster administration
  • accounts, QoS, or fairshare
  • federation or reservations across hosts

Main Characteristics

  • Built as a single Rust binary
  • Uses a daemon plus a Unix domain socket
  • Persists state in SQLite
  • Schedules CPU, memory, and GPU reservations
  • Supports batch jobs, arrays, interactive runs, allocations, and steps
  • Supports delayed start, requeue-once, dependencies, and local feature constraints

Installation

Requirements

  • Linux or WSL
  • Rust toolchain with cargo
  • systemd --user if you want automatic daemon management
  • nvidia-smi if you want automatic GPU detection

Clone the Repository

git clone https://github.com/ymgaq/slotd.git
cd slotd

Install with the Provided Script

From the repository root:

./scripts/install.sh

By default this will:

  • build slotd in release mode
  • install slotd under ~/.local/bin
  • create command aliases such as sbatch and squeue
  • create a runtime root under ~/.local/share/slotd
  • write ~/.config/slotd/slotd.env
  • install and start a systemd --user service

Installer Options

Option | Description | Default
--repo-root PATH | Build from a different repository root | current repo
--profile NAME | Cargo profile to build | release
--install-bin-dir PATH | Install binary and alias directory | ~/.local/bin
--runtime-root PATH | Runtime root used as SLOTD_ROOT | ~/.local/share/slotd
--config-dir PATH | Configuration directory | ~/.config/slotd
--systemd-user-dir PATH | User unit directory | ~/.config/systemd/user
--cpu-partitions VALUE | Value for SLOTD_CPU_PARTITIONS | cpu
--gpu-partitions VALUE | Value for SLOTD_GPU_PARTITIONS | gpu
--features VALUE | Value for SLOTD_FEATURES | unset
--notify-cmd VALUE | Value for SLOTD_NOTIFY_CMD | unset
--cgroup-base PATH | Value for SLOTD_CGROUP_BASE | unset
--skip-build | Reuse an existing build output | off
--skip-systemd | Do not install or start a user service | off
--uninstall | Remove the installed setup | off
--purge-runtime | Remove persisted state during uninstall | off

Example:

./scripts/install.sh \
  --features cpu,gpu \
  --notify-cmd 'notify-send "slotd" "$SLOTD_JOB_ID $SLOTD_JOB_STATE"'

If you set --cgroup-base, use a writable cgroup v2 subtree. Leaving it unset keeps CPU and memory as reservation-only scheduling values.

Uninstall

Remove the installation:

./scripts/install.sh --uninstall

Remove installation and runtime state:

./scripts/install.sh --uninstall --purge-runtime

Manual Setup

If you do not want to use the installer, you can still build and run slotd directly:

cargo build --release
SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd daemon

Then use the same SLOTD_ROOT in another shell:

SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd sbatch --wrap 'echo hello'

Runtime Files

The default runtime root is:

~/.local/share/slotd

Important files and directories:

  • run/slotd.sock
  • lib/state.db
  • lib/jobs/<job_id>/

The client and the daemon must use the same SLOTD_ROOT.

Quick Start

1. Verify the Daemon

If you installed with the script and did not use --skip-systemd, the daemon should already be running.

Check the basic commands:

sinfo
squeue
sacct

Typical first-run output:

  • sinfo shows one row per configured partition
  • squeue is empty
  • sacct is empty

2. Submit a Simple Batch Job

sbatch --wrap 'echo hello from slotd'

Typical output:

Submitted batch job 1

3. Inspect the Queue

squeue

Typical output while a job is active:

JOBID | PARTITION | NAME | USER | ST | TIME | NODELIST(REASON)
1     | cpu       | wrap | ...  | R  | 0:00 | localhost

4. Inspect Completed Jobs

sacct

Typical output after the job finishes:

JobID | Partition | JobName | User | State     | ExitCode
1     | cpu       | wrap    | ...  | COMPLETED | 0:0

5. Show Detailed Job Information

scontrol show job 1

This shows:

  • job identity
  • job state and reason
  • requested resources
  • output paths
  • working directory
  • timestamps

6. Try an Interactive Run

srun --label --unbuffered -- echo hello

Typical output:

0: hello

Runtime Model

High-Level Model

slotd is a single-host scheduler. The runtime model is intentionally simple:

  • one local daemon
  • one local SQLite database
  • one local execution host
  • one local user workflow

There is no controller/worker split and no remote node launch protocol.

Core Resources

slotd schedules three resource types:

  • CPU
  • memory
  • GPU

Current behavior:

  • CPU reservation is ntasks * cpus-per-task
  • ntasks launches one local process per task rank for batch and foreground execution
  • total memory defaults to host-detected MemTotal from /proc/meminfo, with a 16384 MB fallback
  • memory is stored in MB
  • GPUs are integer slots
  • admission is reservation-based, not usage-based
  • if SLOTD_CGROUP_BASE is unset, CPU and memory remain reservation-only
  • if SLOTD_CGROUP_BASE is set to a writable cgroup v2 subtree, slotd writes memory.max and cpu.max
  • if cgroup setup fails after explicit configuration, launch fails instead of silently skipping enforcement
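The host-memory default described above can be sketched in a few lines of shell. This is an illustrative sketch, not slotd's actual code; `detect_mem_total_mb` is a hypothetical helper name.

```shell
# Sketch: detect total host memory in MB from /proc/meminfo,
# falling back to 16384 MB when detection fails.
detect_mem_total_mb() {
  local kb
  kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
  if [ -n "$kb" ]; then
    echo $(( kb / 1024 ))   # meminfo reports kB; slotd stores MB
  else
    echo 16384              # fallback when /proc/meminfo is unavailable
  fi
}

detect_mem_total_mb
```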

Partitions

Configured by environment:

  • SLOTD_CPU_PARTITIONS
  • SLOTD_GPU_PARTITIONS

Rules:

  • only configured partition names are accepted
  • if there are no GPUs, no GPU partition is exposed
  • if a GPU partition is selected and --gpus is omitted, the default GPU request is 1
  • otherwise the default GPU request is 0
  • CPU and GPU partitions are virtual views over one local host
  • CPU and memory capacity stay shared across partitions; only GPU visibility/defaults differ by partition

GPU Detection

If SLOTD_GPU_COUNT is not set, slotd tries to detect GPUs from nvidia-smi.

The current implementation checks:

  • nvidia-smi
  • /usr/bin/nvidia-smi
  • /usr/lib/wsl/lib/nvidia-smi
  • /bin/nvidia-smi
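The fallback search could look roughly like the sketch below, where `detect_gpu_count` is a hypothetical helper that probes each candidate path in order; `--list-gpus` is used here as a stand-in for whatever query the daemon actually issues.

```shell
# Sketch: return the GPU count from the first candidate nvidia-smi
# path that runs successfully; 0 if none of them work.
detect_gpu_count() {
  local cand
  for cand in "$@"; do
    if "$cand" --list-gpus >/dev/null 2>&1; then
      "$cand" --list-gpus | grep -c '^GPU'
      return
    fi
  done
  echo 0
}

# The candidate list mirrors the paths the daemon checks.
detect_gpu_count nvidia-smi /usr/bin/nvidia-smi /usr/lib/wsl/lib/nvidia-smi /bin/nvidia-smi
```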

Job Types

Persisted records are one of:

  • top-level batch jobs
  • allocation-only jobs
  • array tasks
  • steps under allocations

Job States

Implemented states:

  • PENDING
  • RUNNING
  • COMPLETING
  • COMPLETED
  • FAILED
  • CANCELLED
  • TIMEOUT
  • OUT_OF_MEMORY

Terminal states:

  • COMPLETED
  • FAILED
  • CANCELLED
  • TIMEOUT
  • OUT_OF_MEMORY

Scheduling Rules

The daemon loop runs every 300ms.

Pending jobs are blocked by:

  • dependencies
  • array concurrency limits
  • delayed start time
  • exclusive host use
  • insufficient reserved resources
  • user hold state

Ordering:

  • submission order is the base rule
  • explicit job priority can override pure submission order
  • array tasks are interleaved by array group

Runtime Files

Within SLOTD_ROOT:

  • run/slotd.sock: daemon socket
  • lib/state.db: SQLite state
  • lib/jobs/<job_id>/script.sh: batch script
  • lib/jobs/<job_id>/runner.sh: daemon wrapper
  • lib/jobs/<job_id>/exit_status: wrapper exit status

Notifications

If SLOTD_NOTIFY_CMD is set, slotd runs it whenever a top-level job reaches a terminal state.

Exported variables:

  • SLOTD_JOB_ID
  • SLOTD_JOB_NAME
  • SLOTD_JOB_STATE
  • SLOTD_JOB_PARTITION
  • SLOTD_JOB_REASON

Batch Jobs with sbatch

Forms

sbatch [options] <script>
sbatch [options] --wrap '<command>'

What sbatch Does

sbatch creates a persisted batch job record and submits it to the local daemon.

In script mode:

  • it reads the script from disk
  • stores the body in the job directory
  • parses leading #SBATCH directives

In --wrap mode:

  • it creates an internal shell script around the command
  • it launches one local process per task rank when --ntasks is greater than 1

Typical output:

Submitted batch job 1

With --parsable:

1

Main Options

Option | Meaning
--wrap <command> | Submit an inline shell command
-J, --job-name <name> | Set the job name
-p, --partition <partition> | Choose a partition
-c, --cpus-per-task <n> | CPUs per task
-n, --ntasks <n> | Number of concurrently launched local tasks
--mem <size> | Requested memory, such as 512M or 8G
-t, --time <time> | Time limit
-G, --gpus <n> | Requested GPU slots
-o, --output <path> | Stdout path pattern
-e, --error <path> | Stderr path pattern
-D, --chdir <path> | Working directory
--constraint <feature> | Require matching local features
-d, --dependency <spec> | Dependency expression
-a, --array <spec> | Array specification
--export <spec> | Export environment values into the job
--export-file <path> | Load environment variables from a file
--open-mode append|truncate | Append to or truncate output files
--signal <spec> | Send a warning signal before timeout
--begin <time> | Delay job eligibility
--exclusive | Do not share the host with other top-level jobs
--requeue | Requeue once after certain failure states
--parsable | Print only the job ID
-W, --wait | Wait for job completion

Defaults

When not specified:

  • cpus-per-task = 1
  • ntasks = 1
  • mem = 512M
  • partition = configured default partition
  • GPUs default to 1 for GPU partitions and 0 otherwise
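Since memory is stored in MB, a --mem string such as 512M or 8G has to be normalized on submission. A minimal sketch of that conversion, using a hypothetical `mem_to_mb` helper:

```shell
# Sketch: normalize a --mem value such as 512M or 8G into MB.
mem_to_mb() {
  case "$1" in
    *G) echo $(( ${1%G} * 1024 )) ;;   # gigabytes -> MB
    *M) echo "${1%M}" ;;               # already MB
    *)  echo "$1" ;;                   # bare numbers treated as MB here
  esac
}
```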

#SBATCH Support

Supported directives:

  • -J, --job-name
  • -p, --partition
  • -c, --cpus-per-task
  • -n, --ntasks
  • --mem
  • -t, --time
  • -G, --gpus
  • -o, --output
  • -e, --error
  • -D, --chdir
  • --constraint
  • --begin
  • --exclusive
  • --requeue
  • -d, --dependency
  • -a, --array

Precedence:

  1. command-line options
  2. SBATCH_* environment variables
  3. #SBATCH directives
  4. built-in defaults
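The precedence chain amounts to a first-non-empty lookup across the four sources. A sketch, where `resolve_option` is a hypothetical name used only for illustration:

```shell
# Sketch: resolve one option value using the documented precedence:
# command line > SBATCH_* environment > #SBATCH directive > default.
resolve_option() {
  local cli="$1" env_val="$2" directive="$3" default="$4" v
  for v in "$cli" "$env_val" "$directive" "$default"; do
    if [ -n "$v" ]; then
      echo "$v"
      return
    fi
  done
}
```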

Example batch script:

#!/usr/bin/env bash
#SBATCH -J script-demo
#SBATCH -p cpu
#SBATCH -c 2
#SBATCH --mem 1G
#SBATCH -t 00:05:00
#SBATCH -o logs/%j.out

echo "hello from script mode"
echo "job=$SLURM_JOB_ID cpus=$SLURM_CPUS_PER_TASK"

Submit it with:

sbatch ./script-demo.sh

Expected result:

  • sbatch reads the script from disk and applies the leading #SBATCH directives
  • the job runs with the requested name, partition, CPU count, memory, and output path
  • logs/<jobid>.out contains the echoed lines from the script body

Dependencies

Supported dependency expressions:

  • after:<jobid>[,<jobid>...]
  • afterany:<jobid>[,<jobid>...]
  • afterok:<jobid>[,<jobid>...]
  • afternotok:<jobid>[,<jobid>...]
  • singleton

Arrays

Supported array forms:

  • single IDs
  • ranges, such as 0-7
  • stepped ranges, such as 0-15:2
  • concurrency limits, such as 0-31%4

Example:

sbatch -a 0-9%2 --wrap 'echo task=$SLURM_ARRAY_TASK_ID'

Expected result:

  • multiple persisted task records
  • at most two running at the same time for that array
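The supported spec forms can be illustrated with a small expansion sketch; `expand_array_spec` is a hypothetical helper, not slotd's parser:

```shell
# Sketch: expand an --array spec such as 0-15:2 or 0-31%4 into task
# IDs plus an optional concurrency limit.
expand_array_spec() {
  local spec="$1" limit="" step=1 start end
  case "$spec" in *%*) limit="${spec#*%}"; spec="${spec%\%*}" ;; esac  # %N limit
  case "$spec" in *:*) step="${spec#*:}"; spec="${spec%:*}" ;; esac    # :N step
  case "$spec" in
    *-*) start="${spec%-*}"; end="${spec#*-}" ;;
    *)   start="$spec"; end="$spec" ;;                                 # single ID
  esac
  seq "$start" "$step" "$end"
  if [ -n "$limit" ]; then echo "limit=$limit"; fi
}
```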

Delayed Start

--begin supports:

  • epoch seconds
  • YYYY-MM-DD
  • YYYY-MM-DDTHH:MM:SS
  • now+<duration>

Example:

sbatch --begin now+00:10:00 --wrap 'echo delayed'
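For the now+<duration> form, the conversion to an absolute eligibility time can be sketched as below. This illustrative helper handles only the now+HH:MM:SS and epoch-seconds forms, not the date forms.

```shell
# Sketch: turn a --begin value of the form now+HH:MM:SS into an
# absolute epoch timestamp; bare epoch seconds pass through.
begin_to_epoch() {
  case "$1" in
    now+*)
      local dur="${1#now+}"
      local h="${dur%%:*}" rest="${dur#*:}"
      local m="${rest%%:*}" s="${rest#*:}"
      # 10# forces base-10 so components like 08 are not read as octal
      echo $(( $(date +%s) + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
      ;;
    *) echo "$1" ;;
  esac
}
```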

Requeue Once

--requeue changes failure handling:

  • FAILED requeues once
  • TIMEOUT requeues once
  • OUT_OF_MEMORY requeues once
  • COMPLETED does not requeue
  • CANCELLED does not requeue

Example:

sbatch --requeue --wrap 'exit 1'

Output Paths

Pattern tokens:

  • %j: job ID
  • %A: array job ID
  • %a: array task ID
  • %x: job name
  • %u: user name
  • %N: hostname
  • %%: literal %

Defaults:

  • non-array stdout: slurm-%j.out
  • array stdout: slurm-%A_%a.out
  • stderr defaults to stdout unless --error is set
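Token expansion for a non-array job can be sketched as follows. `expand_output_path` is illustrative; the sentinel substitution just keeps a literal %% from being re-expanded by the later token passes.

```shell
# Sketch: expand the stdout pattern tokens for a non-array job.
expand_output_path() {
  local pattern="$1" job_id="$2" name="$3"
  pattern="${pattern//%%/$'\x01'}"        # protect literal % first
  pattern="${pattern//%j/$job_id}"        # %j: job ID
  pattern="${pattern//%x/$name}"          # %x: job name
  pattern="${pattern//%u/$(id -un)}"      # %u: user name
  pattern="${pattern//%N/$(hostname)}"    # %N: hostname
  printf '%s\n' "${pattern//$'\x01'/%}"   # restore literal %
}
```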

Environment Export

--export supports:

  • ALL
  • NONE
  • KEY=VALUE,...

Example:

sbatch --export FOO=bar,HELLO=world --wrap 'echo "$FOO $HELLO"'

Expected result:

  • the output contains bar world

Interactive Execution with srun

Form

srun [options] -- <command...>

What srun Does

srun runs a command in the foreground by default.

Behavior depends on whether you are already inside an allocation:

  • inside an allocation:
    • creates a step record
    • runs the command directly in the foreground
  • outside an allocation:
    • creates an allocation-like top-level record
    • waits for it to run
    • creates a step record
    • runs the command in the foreground

Only --no-wait submits a daemon-managed run job.

When --ntasks is greater than 1, foreground srun launches one local process per task rank on the same host and exports task-local ranks through SLURM_PROCID and SLURM_LOCALID.
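The per-rank launch behavior can be approximated in shell. This is a sketch of the described semantics, not slotd's runner:

```shell
# Sketch: launch one local process per task rank, exporting the
# rank variables described above, then wait for all of them.
launch_tasks() {
  local ntasks="$1"; shift
  local rank
  for rank in $(seq 0 $(( ntasks - 1 ))); do
    SLURM_PROCID="$rank" SLURM_LOCALID="$rank" "$@" &
  done
  wait   # foreground srun waits for every task rank
}

launch_tasks 3 sh -c 'echo "rank=$SLURM_PROCID"'
```

Output order is not deterministic, since the ranks run concurrently.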

Main Options

Option | Meaning
-J, --job-name <name> | Set the job name
-p, --partition <partition> | Choose a partition
-c, --cpus-per-task <n> | CPUs per task
-n, --ntasks <n> | Number of concurrently launched local tasks
--mem <size> | Requested memory
-t, --time <time> | Time limit
-G, --gpus <n> | Requested GPU slots
-o, --output <path> | Foreground stdout path
-e, --error <path> | Foreground stderr path
-D, --chdir <path> | Working directory
--immediate | Fail if resources are not available immediately
--pty | Reserved for PTY support; currently rejected
--constraint <feature> | Require matching local features
--cpu-bind <mode> | Bind CPU affinity
--label | Prefix output with <task_id>:
--unbuffered | Flush forwarded output eagerly
--no-wait | Submit a daemon-managed run job

Output Behavior

Example:

srun --label --unbuffered -- echo hello

Typical output:

0: hello

CPU Binding

Supported values:

  • none
  • cores
  • map_cpu:<id,id,...>

Example:

srun --cpu-bind map_cpu:0,2 -- python train.py

Immediate Mode

--immediate fails instead of waiting if resources are not available right away.

Example:

srun --immediate -p gpu -G 1 -- nvidia-smi

--no-wait

--no-wait submits a run job to the daemon instead of waiting in the foreground.

Typical output:

Submitted run job 12

Restrictions:

  • --label and --unbuffered are not supported together with --no-wait
  • --pty is parsed for compatibility but currently exits with a clear “not implemented yet” error until a real PTY path exists

Allocations with salloc

Form

salloc [options] [command...]

What salloc Does

salloc creates an allocation-only top-level job, waits for it to become runnable, and then starts a foreground command inside that allocation.

If no command is given, it starts your shell.

When --ntasks is greater than 1, the foreground command launches one local process per task rank on the same host.

Typical output:

Granted job allocation 4

Main Options

Option | Meaning
-J, --job-name <name> | Set the allocation name
-p, --partition <partition> | Choose a partition
-c, --cpus-per-task <n> | CPUs per task
-n, --ntasks <n> | Number of concurrently launched local tasks
--mem <size> | Requested memory
-t, --time <time> | Time limit
-G, --gpus <n> | Requested GPU slots
-D, --chdir <path> | Working directory
--constraint <feature> | Require matching local features
--immediate | Fail if the allocation cannot start immediately

Example

salloc -p gpu -c 4 --mem 8G -G 1 -t 00:30:00

Expected result:

  • an allocation record is created
  • the command waits until the allocation is running
  • your shell starts inside the allocation
  • later srun commands become steps under that allocation
  • the allocation command uses the allocation task count for local multi-task execution

Queue and Accounting

squeue

squeue shows top-level queued and running jobs.

Common Options

Option | Meaning
--all | Show all states
-t, --states | Filter by state
-j, --jobs | Filter by job IDs
-u, --user | Filter by user
-p, --partition | Filter by partition
-o, --format | Select output fields
-S, --sort | Sort rows
-l, --long | Long default view
--start | Show estimated start times
--array | Show array-style job IDs
--noheader | Omit the header

Default View

JOBID | PARTITION | NAME | USER | ST | TIME | NODELIST(REASON)

Long View

JOBID | PARTITION | NAME | USER | ST | TIME | TIME_LIMIT | NTASKS | CPUS | REQ_MEM | REQ_GPU | NODELIST(REASON)

Start-Time View

With --start and no explicit format:

JOBID | PARTITION | NAME | USER | ST | START_TIME | NODELIST(REASON)

Format Fields

Supported field names:

  • JobID
  • Partition
  • Name, JobName
  • User
  • ST, State
  • Time, Elapsed
  • TimeLimit, Time_Limit
  • NTasks
  • CPUS, ReqCPUS
  • ReqMem
  • ReqGPU, ReqGPUS
  • Start, StartTime
  • NodeList(Reason), NodeListReason, Reason, NodeList

Supported % codes:

  • %i
  • %P
  • %j
  • %u
  • %t, %T
  • %M
  • %S
  • %R, %N

sacct

sacct shows persisted accounting data, including completed jobs and steps.

Common Options

Option | Meaning
-j, --jobs | Filter by job IDs
-s, --state | Filter by state
-S, --starttime | Filter by start time
-E, --endtime | Filter by end time
-u, --user | Filter by user
-p, --partition | Filter by partition
-o, --format | Select output fields
-P, --parsable2 | Use pipe-delimited output
-n, --noheader | Omit the header

Default View

JobID | Partition | JobName | User | State | ExitCode

Record Types

sacct includes:

  • top-level jobs
  • allocation records
  • step records
  • completed records

ID rendering rules:

  • step IDs appear as <job_id>.<step_id>
  • array tasks appear as <array_job_id>_<task_id>
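The rendering rules above amount to a small conditional; `render_sacct_id` is a hypothetical helper used only for illustration:

```shell
# Sketch: render an accounting ID. Steps win over array metadata,
# and plain jobs fall through to the bare job ID.
render_sacct_id() {
  local job_id="$1" step_id="$2" array_job_id="$3" task_id="$4"
  if [ -n "$step_id" ]; then
    echo "${job_id}.${step_id}"          # step under an allocation
  elif [ -n "$array_job_id" ]; then
    echo "${array_job_id}_${task_id}"    # array task
  else
    echo "$job_id"                       # plain top-level job
  fi
}
```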

Format Fields

Supported field names:

  • JobID
  • ArrayJobID
  • ArrayTaskID
  • JobName
  • Partition
  • User
  • State
  • Reason
  • ExitCode
  • Elapsed
  • AllocCPUS
  • ReqMem
  • ReqTRES
  • AllocTRES
  • NodeList
  • Submit
  • Start
  • End
  • WorkDir
  • BatchFlag
  • MaxRSS

Supported % codes:

  • %i
  • %F
  • %K
  • %j
  • %P
  • %u
  • %t, %T
  • %R
  • %X
  • %M
  • %C
  • %m
  • %b
  • %B
  • %N
  • %V
  • %S
  • %E
  • %Z

Job Control

scontrol

Supported forms:

scontrol show job <job_id>
scontrol hold job <job_id>
scontrol release job <job_id>
scontrol update job <job_id> KEY=VALUE...

show job

Shows detailed job information, including:

  • job identity and ownership
  • state and reason
  • requested resources
  • time limits
  • dependency string
  • submit, start, and end timestamps
  • command and working directory
  • stdout and stderr paths
  • array metadata
  • ReqTRES
  • AllocTRES
  • MaxRSS
  • step summary

hold job

Moves a pending job into a held state.

Result:

  • the job stays PENDING
  • the reason becomes JobHeldUser

release job

Releases a held pending job back into normal scheduling.

update job

Supported update keys:

Key | Rule
JobName / Name | Can be changed only while the job is PENDING
Partition | Can be changed only while the job is PENDING
TimeLimit / Time | Can be changed until the job reaches a terminal state
Priority | Can be changed only while the job is PENDING

Example:

scontrol update job 10 TimeLimit=02:00:00

scancel

Supported forms:

scancel <job_id>
scancel <job_id.step_id>
scancel --signal <sig> <job_id>
scancel --signal <sig> <job_id.step_id>

Default Cancel Behavior

  • pending jobs become CANCELLED immediately
  • running jobs transition through COMPLETING
  • the runner sends SIGTERM
  • after the grace period it sends SIGKILL if needed

The recorded cancel reason is:

CancelledByUser
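The TERM-then-KILL sequence can be sketched as follows. The 2-second default grace period here is illustrative, since the document does not state the actual value.

```shell
# Sketch: cancel a running job's process the way the runner is
# described to: SIGTERM first, SIGKILL after the grace period.
terminate_with_grace() {
  local pid="$1" grace="${2:-2}" waited=0
  kill -TERM "$pid" 2>/dev/null          # polite stop first
  while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
    sleep 1
    waited=$(( waited + 1 ))
  done
  if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid" 2>/dev/null        # force after the grace period
  fi
  return 0
}
```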

Signal Mode

--signal sends a specific signal instead of performing normal cancellation.

Example:

scancel --signal TERM 12

Node and Partition View

sinfo

sinfo shows partition and host state for the local node.

Common Options

Option | Meaning
-p, --partition | Filter partitions
-N, --Node | Switch to the single-node summary view
-l, --long | Use the long default view
-o, --format | Select output fields
--noheader | Omit the header

Default View

Typical output:

PARTITION | HOSTNAMES | STATE | FEATURES | GRES_USED
cpu*      | localhost | idle  | cpu      | N/A
gpu       | localhost | idle  | cpu,generic_gpu | gpu:0

Notes:

  • one row is shown per configured partition
  • the default partition is marked with *
  • CPU partitions show only cpu in FEATURES
  • GPU partitions show cpu plus detected GPU model features, and do not include the generic gpu label in FEATURES
  • CPU and GPU partitions are virtual views for convenience on the same host
  • CPU and memory capacity are shared across those partitions rather than split into separate pools

Node View

sinfo -N collapses the partition rows into a single local node summary.

Typical output:

PARTITION | HOSTNAMES | STATE
cpu,gpu*  | localhost | idle

Notes:

  • the PARTITION column becomes a comma-joined partition list
  • the default partition in that list keeps the * marker
  • capacity and allocation fields in long or formatted node views are aggregated across the visible partitions

Long View

sinfo -l adds capacity and allocation details:

PARTITION | HOSTNAMES | STATE | FEATURES | CPUS | CPU_ALLOC | MEMORY | MEM_ALLOC | GPUS | GPU_ALLOC | RUNNING | PENDING | GRES_USED

Supported Format Fields

Field names:

  • Partition
  • Hostnames, Hostname, NodeList
  • State
  • Features
  • CPUS
  • CPU_ALLOC, CPUSLOAD, CPUALLOC
  • Memory, Mem
  • MEM_ALLOC, MemoryAllocated, MemAlloc
  • GPUS
  • GPU_ALLOC, GpusAllocated, GpuAlloc
  • Running, RunningJobs
  • Pending, PendingJobs
  • GRES_USED, GresUsed

Supported % codes:

  • %P
  • %N
  • %t, %T
  • %f
  • %G

Testing

slotd is mainly verified with Rust integration tests in tests/. Each test creates an isolated temporary runtime, launches its own daemon, and drives the public Slurm-style commands through the compiled slotd binary. That keeps test state separate from your normal local SLOTD_ROOT.

How to Run the Tests

Run the full suite:

cargo test

Run one integration test file:

cargo test --test scheduling

Run one named test:

cargo test dependency_job_waits_for_prerequisite_before_running --test scheduling

What the Tests Cover

The current suite focuses on behavior that matters to end users:

  • core command flows for sbatch, srun, salloc, sinfo, squeue, sacct, and scontrol
  • scheduling rules such as dependencies, job arrays, delayed start, constraints, resource flags, and requeue behavior
  • interactive and foreground execution paths including srun --pty, --label, --unbuffered, and allocation or step handling
  • output, recovery, and lifecycle behavior including cancellation, warning signals, update processing, and output file placement
  • notification and reporting paths such as SLOTD_NOTIFY_CMD and parsable query output

Representative files in tests/ include:

  • cli_basic.rs, sbatch_options.rs, srun.rs, srun_options.rs, srun_modes.rs, salloc.rs, sinfo.rs, control.rs, query_squeue.rs, query_sacct.rs
  • scheduling.rs, compound_scheduling.rs, dependency_variants.rs, array.rs, begin.rs, constraint.rs, resource_flags.rs, requeue.rs, timeout.rs
  • srun_interactive.rs, srun_allocation.rs, cpu_bind.rs, output_files.rs, cancellation.rs, recovery.rs, update.rs, warning_signal.rs, notify.rs

Manual Smoke Testing

For a quick manual check, start the daemon:

cargo run -- daemon

Then submit a small job from another shell using the same SLOTD_ROOT:

cargo run -- sbatch --wrap 'echo hello'

Examples

CPU Batch Job

sbatch \
  -J hello \
  -p cpu \
  -c 1 \
  --mem 512M \
  -t 00:05:00 \
  -o logs/%j.out \
  --wrap 'echo hello'

Expected result:

  • Submitted batch job <id>
  • logs/<id>.out contains hello

GPU Batch Job

sbatch \
  -J gpu-demo \
  -p gpu \
  -c 4 \
  --mem 8G \
  -G 1 \
  -t 01:00:00 \
  -o logs/%j.out \
  --wrap 'nvidia-smi'

Expected result:

  • the job runs on the GPU partition
  • output contains nvidia-smi data

Interactive Foreground Run

srun --label --unbuffered -- echo hello

Expected result:

0: hello

Interactive Allocation

salloc -p gpu -c 4 --mem 8G -G 1 -t 00:30:00

Expected result:

  • Granted job allocation <id>
  • a shell starts inside the allocation

Array Job

sbatch \
  -J array-demo \
  -a 0-9%2 \
  -o logs/%A_%a.out \
  --wrap 'echo task=$SLURM_ARRAY_TASK_ID'

Expected result:

  • multiple task records
  • logs such as logs/<array_id>_0.out

Requeue Once

sbatch --requeue --wrap 'exit 1'

Expected result:

  • the first failure returns the job to PENDING
  • the second failure leaves the final state as FAILED

Delayed Start

sbatch --begin now+00:10:00 --wrap 'echo delayed'

Expected result:

  • the job remains pending until the begin time
  • squeue --start shows an estimated future start time

Explicit Export

sbatch \
  --export FOO=bar,HELLO=world \
  --wrap 'echo "$FOO $HELLO"'

Expected result:

  • output contains bar world

Manual Daemon Run

SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd daemon

Then from another shell:

SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd sbatch --wrap 'echo hello'

Troubleshooting

Connection refused

Symptom:

error: io error: Connection refused (os error 111)

Meaning:

  • the client found a socket path
  • but no daemon is currently accepting connections there

Check:

  • the daemon is actually running
  • the client and the daemon use the same SLOTD_ROOT
  • you do not have a stale socket file from an older run

Useful checks:

ls -l "$SLOTD_ROOT/run/slotd.sock"
sinfo

Wrong Runtime Root

If the daemon uses one runtime root and the client uses another, commands appear to fail randomly.

Common cause:

  • daemon started by systemd --user
  • client launched from a shell without the same SLOTD_ROOT

Best fix:

  • install with scripts/install.sh
  • use the wrapper commands from ~/.local/bin

No GPU Partition Appears

If sinfo only shows cpu, check:

  • nvidia-smi works in the daemon environment
  • SLOTD_GPU_PARTITIONS is configured as expected
  • the daemon has been restarted after GPU configuration changes

Manual check:

nvidia-smi
sinfo

Text file busy During Reinstall

The installer now replaces binaries atomically. If you still hit issues:

  • stop the daemon
  • run the installer again

Commands:

systemctl --user restart slotd.service
./scripts/install.sh

Jobs Stay Pending

Possible causes:

  • insufficient resources
  • dependency not satisfied
  • array concurrency limit
  • delayed start time not reached
  • job is held
  • an exclusive job is already running

Inspect with:

squeue
scontrol show job <job_id>

cgroup Setup Fails

If SLOTD_CGROUP_BASE is set but does not point at a writable cgroup v2 subtree, job launch fails with an explicit cgroup error.

Check:

  • the path exists in the daemon environment
  • the path is a cgroup v2 subtree, not a regular directory or file
  • the daemon user can create per-job subdirectories and write control files there
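These checks can be scripted as a quick preflight. `check_cgroup_base` is an illustrative helper; it relies on `cgroup.controllers`, the standard marker file that a cgroup v2 directory exposes.

```shell
# Sketch: preflight a candidate SLOTD_CGROUP_BASE value before
# pointing the daemon at it.
check_cgroup_base() {
  local base="$1"
  [ -d "$base" ] || { echo "not a directory"; return 1; }
  [ -f "$base/cgroup.controllers" ] || { echo "not a cgroup v2 subtree"; return 1; }
  [ -w "$base" ] || { echo "not writable"; return 1; }
  echo ok
}
```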

A Job Was Cancelled but Ended as OUT_OF_MEMORY

This can happen if cgroup memory events indicate an OOM during termination. In that case the final state prefers OUT_OF_MEMORY.

Output Files Are Missing

Check:

  • the job’s working directory
  • the -o/--output and -e/--error paths
  • pattern expansion such as %j or %A_%a

Inspect with:

scontrol show job <job_id>