slotd

slotd is a Rust-based, single-node, single-user scheduler with a Slurm-style command surface.
It is intended for one workstation, not for a cluster. The goal is to keep common Slurm command names and familiar options while simplifying the runtime model:
- one local daemon
- one SQLite database
- one execution host
- one local user workflow
You use slotd through the same command names you would expect in Slurm:
`sbatch`, `srun`, `salloc`, `squeue`, `sacct`, `scontrol`, `scancel`, `sinfo`
What It Is Good For
slotd works well for:
- local experiment queues
- long-running CPU or GPU jobs
- one-machine batch pipelines
- interactive work with resource reservation
- a lightweight Slurm-like interface on a workstation
It is not trying to provide:
- multi-node scheduling
- cluster administration
- accounts, QoS, or fairshare
- federation or reservations across hosts
Main Characteristics
- Built as a single Rust binary
- Uses a daemon plus a Unix domain socket
- Persists state in SQLite
- Schedules CPU, memory, and GPU reservations
- Supports batch jobs, arrays, interactive runs, allocations, and steps
- Supports delayed start, requeue-once, dependencies, and local feature constraints
Documentation Map
- Installation
- Quick Start
- Runtime Model
- Batch Jobs with `sbatch`
- Interactive Execution with `srun`
- Allocations with `salloc`
- Queue and Accounting
- Job Control
- Node and Partition View
- Testing
- Examples
- Troubleshooting
Installation
Requirements
- Linux or WSL
- Rust toolchain with `cargo`
- `systemd --user` if you want automatic daemon management
- `nvidia-smi` if you want automatic GPU detection
Clone the Repository
git clone https://github.com/ymgaq/slotd.git
cd slotd
Install with the Provided Script
From the repository root:
./scripts/install.sh
By default this will:
- build `slotd` in release mode
- install `slotd` under `~/.local/bin`
- create command aliases such as `sbatch` and `squeue`
- create a runtime root under `~/.local/share/slotd`
- write `~/.config/slotd/slotd.env`
- install and start a `systemd --user` service
Installer Options
| Option | Description | Default |
|---|---|---|
| `--repo-root PATH` | Build from a different repository root | current repo |
| `--profile NAME` | Cargo profile to build | release |
| `--install-bin-dir PATH` | Install binary and alias directory | `~/.local/bin` |
| `--runtime-root PATH` | Runtime root used as `SLOTD_ROOT` | `~/.local/share/slotd` |
| `--config-dir PATH` | Configuration directory | `~/.config/slotd` |
| `--systemd-user-dir PATH` | User unit directory | `~/.config/systemd/user` |
| `--cpu-partitions VALUE` | Value for `SLOTD_CPU_PARTITIONS` | cpu |
| `--gpu-partitions VALUE` | Value for `SLOTD_GPU_PARTITIONS` | gpu |
| `--features VALUE` | Value for `SLOTD_FEATURES` | unset |
| `--notify-cmd VALUE` | Value for `SLOTD_NOTIFY_CMD` | unset |
| `--cgroup-base PATH` | Value for `SLOTD_CGROUP_BASE` | unset |
| `--skip-build` | Reuse an existing build output | off |
| `--skip-systemd` | Do not install or start a user service | off |
| `--uninstall` | Remove the installed setup | off |
| `--purge-runtime` | Remove persisted state during uninstall | off |
Example:
./scripts/install.sh \
--features cpu,gpu \
--notify-cmd 'notify-send "slotd" "$SLOTD_JOB_ID $SLOTD_JOB_STATE"'
If you set --cgroup-base, use a writable cgroup v2 subtree. Leaving it unset
keeps CPU and memory as reservation-only scheduling values.
Uninstall
Remove the installation:
./scripts/install.sh --uninstall
Remove installation and runtime state:
./scripts/install.sh --uninstall --purge-runtime
Manual Setup
If you do not want to use the installer, you can still build and run slotd directly:
cargo build --release
SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd daemon
Then use the same SLOTD_ROOT in another shell:
SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd sbatch --wrap 'echo hello'
Runtime Files
The default runtime root is:
~/.local/share/slotd
Important files and directories:
- `run/slotd.sock`
- `lib/state.db`
- `lib/jobs/<job_id>/`
The client and the daemon must use the same SLOTD_ROOT.
Quick Start
1. Verify the Daemon
If you installed with the script and did not use --skip-systemd, the daemon should already be running.
Check the basic commands:
sinfo
squeue
sacct
Typical first-run output:
- `sinfo` shows one row per configured partition
- `squeue` is empty
- `sacct` is empty
2. Submit a Simple Batch Job
sbatch --wrap 'echo hello from slotd'
Typical output:
Submitted batch job 1
3. Inspect the Queue
squeue
Typical output while a job is active:
JOBID | PARTITION | NAME | USER | ST | TIME | NODELIST(REASON)
1 | cpu | wrap | ... | R | 0:00 | localhost
4. Inspect Completed Jobs
sacct
Typical output after the job finishes:
JobID | Partition | JobName | User | State | ExitCode
1 | cpu | wrap | ... | COMPLETED | 0:0
5. Show Detailed Job Information
scontrol show job 1
This shows:
- job identity
- job state and reason
- requested resources
- output paths
- working directory
- timestamps
6. Try an Interactive Run
srun --label --unbuffered -- echo hello
Typical output:
0: hello
Runtime Model
High-Level Model
slotd is a single-host scheduler. The runtime model is intentionally simple:
- one local daemon
- one local SQLite database
- one local execution host
- one local user workflow
There is no controller/worker split and no remote node launch protocol.
Core Resources
slotd schedules three resource types:
- CPU
- memory
- GPU
Current behavior:
- CPU reservation is `ntasks * cpus-per-task`
- `ntasks` launches one local process per task rank for batch and foreground execution
- total memory defaults to host-detected `MemTotal` from `/proc/meminfo`, with a `16384 MB` fallback
- memory is stored in MB
- GPUs are integer slots
- admission is reservation-based, not usage-based
- if `SLOTD_CGROUP_BASE` is unset, CPU and memory remain reservation-only
- if `SLOTD_CGROUP_BASE` is set to a writable cgroup v2 subtree, `slotd` writes `memory.max` and `cpu.max`
- if cgroup setup fails after explicit configuration, launch fails instead of silently skipping enforcement
Partitions
Configured by environment:
- `SLOTD_CPU_PARTITIONS`
- `SLOTD_GPU_PARTITIONS`
Rules:
- only configured partition names are accepted
- if there are no GPUs, no GPU partition is exposed
- if a GPU partition is selected and `--gpus` is omitted, the default GPU request is `1`
- otherwise the default GPU request is `0`
- CPU and GPU partitions are virtual views over one local host
- CPU and memory capacity stay shared across partitions; only GPU visibility/defaults differ by partition
GPU Detection
If SLOTD_GPU_COUNT is not set, slotd tries to detect GPUs from nvidia-smi.
The current implementation checks:
- `nvidia-smi`
- `/usr/bin/nvidia-smi`
- `/usr/lib/wsl/lib/nvidia-smi`
- `/bin/nvidia-smi`
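The probe order can be reproduced as a small shell loop for manual checking; `command -v` covers the bare `nvidia-smi` lookup on `PATH`, and `--list-gpus` is the stock `nvidia-smi` flag that prints one line per GPU. This is a hand-rolled check, not slotd's implementation:

```shell
#!/usr/bin/env sh
# Try each candidate location in order and count GPUs with the first one
# that exists and is executable; report zero GPUs if none is found.
detect_gpus() {
  for cand in "$(command -v nvidia-smi 2>/dev/null)" \
              /usr/bin/nvidia-smi \
              /usr/lib/wsl/lib/nvidia-smi \
              /bin/nvidia-smi; do
    if [ -n "$cand" ] && [ -x "$cand" ]; then
      "$cand" --list-gpus 2>/dev/null | wc -l | tr -d ' '
      return 0
    fi
  done
  echo 0
}

detect_gpus
```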
Job Types
Persisted records are one of:
- top-level batch jobs
- allocation-only jobs
- array tasks
- steps under allocations
Job States
Implemented states:
- `PENDING`
- `RUNNING`
- `COMPLETING`
- `COMPLETED`
- `FAILED`
- `CANCELLED`
- `TIMEOUT`
- `OUT_OF_MEMORY`
Terminal states:
- `COMPLETED`
- `FAILED`
- `CANCELLED`
- `TIMEOUT`
- `OUT_OF_MEMORY`
Scheduling Rules
The daemon loop runs every 300ms.
Pending jobs are blocked by:
- dependencies
- array concurrency limits
- delayed start time
- exclusive host use
- insufficient reserved resources
- user hold state
Ordering:
- submission order is the base rule
- explicit job priority can override pure submission order
- array tasks are interleaved by array group
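On a made-up pending list, the first two ordering rules reduce to a two-key sort: explicit priority first (higher wins), then submission order as the tie-breaker. A sketch with `sort`, where job IDs stand in for submission order and the data is invented:

```shell
#!/usr/bin/env sh
# Pending jobs as "job_id,priority" rows. Job 3 has an explicit priority,
# so it schedules first; the rest keep pure submission order.
pending='1,0
2,0
3,100
4,0'

ordered=$(printf '%s\n' "$pending" | sort -t, -k2,2nr -k1,1n | cut -d, -f1)
echo "$ordered" | xargs   # 3 1 2 4
```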
Runtime Files
Within SLOTD_ROOT:
- `run/slotd.sock`: daemon socket
- `lib/state.db`: SQLite state
- `lib/jobs/<job_id>/script.sh`: batch script
- `lib/jobs/<job_id>/runner.sh`: daemon wrapper
- `lib/jobs/<job_id>/exit_status`: wrapper exit status
Notifications
If SLOTD_NOTIFY_CMD is set, slotd runs it for terminal top-level jobs.
Exported variables:
- `SLOTD_JOB_ID`
- `SLOTD_JOB_NAME`
- `SLOTD_JOB_STATE`
- `SLOTD_JOB_PARTITION`
- `SLOTD_JOB_REASON`
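A notify command sees these variables in its environment. The handler below is a hypothetical stand-in for something like `notify-send`; it simply formats the five variables into one line, with the environment faked for illustration:

```shell
#!/usr/bin/env sh
# Hypothetical SLOTD_NOTIFY_CMD handler: format the exported job variables.
notify() {
  printf '[slotd] job %s (%s) on %s -> %s (%s)\n' \
    "$SLOTD_JOB_ID" "$SLOTD_JOB_NAME" "$SLOTD_JOB_PARTITION" \
    "$SLOTD_JOB_STATE" "$SLOTD_JOB_REASON"
}

# Simulate the environment slotd would provide for a finished job.
export SLOTD_JOB_ID=7 SLOTD_JOB_NAME=train SLOTD_JOB_PARTITION=gpu
export SLOTD_JOB_STATE=COMPLETED SLOTD_JOB_REASON=None
line=$(notify)
echo "$line"
```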
Batch Jobs with sbatch
Forms
sbatch [options] <script>
sbatch [options] --wrap '<command>'
What sbatch Does
sbatch creates a persisted batch job record and submits it to the local daemon.
In script mode:
- it reads the script from disk
- stores the body in the job directory
- parses leading `#SBATCH` directives
In --wrap mode:
- it creates an internal shell script around the command
- it launches one local process per task rank when `--ntasks` is greater than `1`
Typical output:
Submitted batch job 1
With --parsable:
1
Main Options
| Option | Meaning |
|---|---|
| `--wrap <command>` | Submit an inline shell command |
| `-J, --job-name <name>` | Set the job name |
| `-p, --partition <partition>` | Choose a partition |
| `-c, --cpus-per-task <n>` | CPUs per task |
| `-n, --ntasks <n>` | Number of concurrently launched local tasks |
| `--mem <size>` | Requested memory, such as `512M` or `8G` |
| `-t, --time <time>` | Time limit |
| `-G, --gpus <n>` | Requested GPU slots |
| `-o, --output <path>` | Stdout path pattern |
| `-e, --error <path>` | Stderr path pattern |
| `-D, --chdir <path>` | Working directory |
| `--constraint <feature>` | Require matching local features |
| `-d, --dependency <spec>` | Dependency expression |
| `-a, --array <spec>` | Array specification |
| `--export <spec>` | Export environment values into the job |
| `--export-file <path>` | Load environment variables from a file |
| `--open-mode append\|truncate` | Append to or truncate output files |
| `--signal <spec>` | Send a warning signal before timeout |
| `--begin <time>` | Delay job eligibility |
| `--exclusive` | Do not share the host with other top-level jobs |
| `--requeue` | Requeue once after certain failure states |
| `--parsable` | Print only the job ID |
| `-W, --wait` | Wait for job completion |
Defaults
When not specified:
- `cpus-per-task = 1`
- `ntasks = 1`
- `mem = 512M`
- partition = configured default partition
- GPUs default to `1` for GPU partitions and `0` otherwise
#SBATCH Support
Supported directives:
- `-J`, `--job-name`
- `-p`, `--partition`
- `-c`, `--cpus-per-task`
- `-n`, `--ntasks`
- `--mem`
- `-t`, `--time`
- `-G`, `--gpus`
- `-o`, `--output`
- `-e`, `--error`
- `-D`, `--chdir`
- `--constraint`
- `--begin`
- `--exclusive`
- `--requeue`
- `-d`, `--dependency`
- `-a`, `--array`
Precedence:
- command-line options
- `SBATCH_*` environment variables
- `#SBATCH` directives
- built-in defaults
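The chain amounts to "first source that sets a value wins, highest precedence first". A sketch of that lookup for a single option, with invented values; this is the shape of the rule, not slotd's actual resolver:

```shell
#!/usr/bin/env sh
# Resolve one option from four sources, highest precedence first:
# CLI flag, SBATCH_* environment, #SBATCH directive, built-in default.
resolve() {
  for v in "$1" "$2" "$3" "$4"; do
    if [ -n "$v" ]; then echo "$v"; return 0; fi
  done
}

# No CLI flag given; the environment value beats the script directive.
part=$(resolve "" "gpu" "cpu" "cpu")
echo "partition=$part"
```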
Example batch script:
#!/usr/bin/env bash
#SBATCH -J script-demo
#SBATCH -p cpu
#SBATCH -c 2
#SBATCH --mem 1G
#SBATCH -t 00:05:00
#SBATCH -o logs/%j.out
echo "hello from script mode"
echo "job=$SLURM_JOB_ID cpus=$SLURM_CPUS_PER_TASK"
Submit it with:
sbatch ./script-demo.sh
Expected result:
- `sbatch` reads the script from disk and applies the leading `#SBATCH` directives
- the job runs with the requested name, partition, CPU count, memory, and output path
- `logs/<jobid>.out` contains the echoed lines from the script body
Dependencies
Supported dependency expressions:
- `after:<jobid>[,<jobid>...]`
- `afterany:<jobid>[,<jobid>...]`
- `afterok:<jobid>[,<jobid>...]`
- `afternotok:<jobid>[,<jobid>...]`
- `singleton`
Arrays
Supported array forms:
- single IDs
- ranges, such as `0-7`
- stepped ranges, such as `0-15:2`
- concurrency limits, such as `0-31%4`
Example:
sbatch -a 0-9%2 --wrap 'echo task=$SLURM_ARRAY_TASK_ID'
Expected result:
- multiple persisted task records
- at most two running at the same time for that array
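An array spec is a task-ID range plus an optional `%` concurrency cap. A rough parse of `0-9%2` with plain shell string operations; this is illustrative only, not slotd's parser:

```shell
#!/usr/bin/env sh
# Split '0-9%2' into start/end/step plus a concurrency limit.
spec='0-9%2'

limit=${spec#*%}                                   # text after %, if any
if [ "$limit" = "$spec" ]; then limit=''; fi
range=${spec%%\%*}                                 # text before %
start=${range%%-*}
rest=${range#*-}
end=${rest%%:*}
step=${rest#*:}                                    # ':step' is optional
if [ "$step" = "$rest" ]; then step=1; fi

ids=$(seq "$start" "$step" "$end" | xargs)
echo "tasks: $ids"
echo "max concurrent: ${limit:-unlimited}"
```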
Delayed Start
--begin supports:
- epoch seconds
- `YYYY-MM-DD`
- `YYYY-MM-DDTHH:MM:SS`
- `now+<duration>`
Example:
sbatch --begin now+00:10:00 --wrap 'echo delayed'
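The `now+<duration>` form means the current time plus an offset. Assuming the `HH:MM:SS` duration shape from the example above, the arithmetic is just this (slotd parses `--begin` internally; the script is only illustrative):

```shell
#!/usr/bin/env sh
# Convert an HH:MM:SS duration to seconds and add it to the current
# epoch time.
dur='00:10:00'

offset=$(printf '%s' "$dur" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')
begin=$(( $(date +%s) + offset ))

echo "eligible at epoch $begin (in ${offset}s)"
```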
Requeue Once
--requeue changes failure handling:
- `FAILED` requeues once
- `TIMEOUT` requeues once
- `OUT_OF_MEMORY` requeues once
- `COMPLETED` does not requeue
- `CANCELLED` does not requeue
Example:
sbatch --requeue --wrap 'exit 1'
Output Paths
Pattern tokens:
- `%j`: job ID
- `%A`: array job ID
- `%a`: array task ID
- `%x`: job name
- `%u`: user name
- `%N`: hostname
- `%%`: literal `%`
Defaults:
- non-array stdout: `slurm-%j.out`
- array stdout: `slurm-%A_%a.out`
- stderr defaults to stdout unless `--error` is set
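Token expansion can be mimicked with `sed` for a quick mental model. The sketch handles only `%j`, `%x`, and `%%` (the remaining tokens behave the same way) and is illustrative rather than slotd's actual code; `%%` is swapped out first so a literal percent never collides with the other tokens:

```shell
#!/usr/bin/env sh
# Expand a subset of the output-path tokens: %j (job ID), %x (job name),
# and %% (literal percent), using a placeholder to protect %%.
expand() {
  # $1=pattern $2=job_id $3=job_name
  printf '%s\n' "$1" | sed \
    -e 's/%%/__PCT__/g' \
    -e "s/%j/$2/g" \
    -e "s/%x/$3/g" \
    -e 's/__PCT__/%/g'
}

expand 'logs/%x-%j.out' 42 train   # logs/train-42.out
expand '100%%-%j.log' 7 wrap       # 100%-7.log
```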
Environment Export
--export supports:
- `ALL`
- `NONE`
- `KEY=VALUE,...`
Example:
sbatch --export FOO=bar,HELLO=world --wrap 'echo "$FOO $HELLO"'
Expected result:
- the output contains `bar world`
Interactive Execution with srun
Form
srun [options] -- <command...>
What srun Does
srun runs a command in the foreground by default.
Behavior depends on whether you are already inside an allocation:
- inside an allocation:
- creates a step record
- runs the command directly in the foreground
- outside an allocation:
- creates an allocation-like top-level record
- waits for it to run
- creates a step record
- runs the command in the foreground
Only --no-wait submits a daemon-managed run job.
When --ntasks is greater than 1, foreground srun launches one local
process per task rank on the same host and exports task-local ranks through
SLURM_PROCID and SLURM_LOCALID.
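The fan-out can be pictured as a plain shell loop: one background process per rank, each with its own rank variables, joined with `wait`. This is a sketch of the shape, not slotd's launcher:

```shell
#!/usr/bin/env sh
# Launch one local process per task rank with per-rank environment, then
# wait for all of them, like a foreground multi-task srun would.
ntasks=3
out=$(
  for rank in $(seq 0 $((ntasks - 1))); do
    SLURM_PROCID=$rank SLURM_LOCALID=$rank \
      sh -c 'printf "%s: hello\n" "$SLURM_PROCID"' &
  done
  wait
)
echo "$out"
```

The lines may interleave in any order, which is also why `--label` prefixes each line with its task rank.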
Main Options
| Option | Meaning |
|---|---|
| `-J, --job-name <name>` | Set the job name |
| `-p, --partition <partition>` | Choose a partition |
| `-c, --cpus-per-task <n>` | CPUs per task |
| `-n, --ntasks <n>` | Number of concurrently launched local tasks |
| `--mem <size>` | Requested memory |
| `-t, --time <time>` | Time limit |
| `-G, --gpus <n>` | Requested GPU slots |
| `-o, --output <path>` | Foreground stdout path |
| `-e, --error <path>` | Foreground stderr path |
| `-D, --chdir <path>` | Working directory |
| `--immediate` | Fail if resources are not available immediately |
| `--pty` | Reserved for PTY support; currently rejected |
| `--constraint <feature>` | Require matching local features |
| `--cpu-bind <mode>` | Bind CPU affinity |
| `--label` | Prefix output with `<task_id>:` |
| `--unbuffered` | Flush forwarded output eagerly |
| `--no-wait` | Submit a daemon-managed run job |
Output Behavior
Example:
srun --label --unbuffered -- echo hello
Typical output:
0: hello
CPU Binding
Supported values:
- `none`
- `cores`
- `map_cpu:<id,id,...>`
Example:
srun --cpu-bind map_cpu:0,2 -- python train.py
Immediate Mode
--immediate fails instead of waiting if resources are not available right away.
Example:
srun --immediate -p gpu -G 1 -- nvidia-smi
--no-wait
--no-wait submits a run job to the daemon instead of waiting in the foreground.
Typical output:
Submitted run job 12
Restrictions:
- `--label` and `--unbuffered` are not supported together with `--no-wait`
- `--pty` is parsed for compatibility but currently exits with a clear “not implemented yet” error until a real PTY path exists
Allocations with salloc
Form
salloc [options] [command...]
What salloc Does
salloc creates an allocation-only top-level job, waits for it to become runnable, and then starts a foreground command inside that allocation.
If no command is given, it starts your shell.
When --ntasks is greater than 1, the foreground command launches one local
process per task rank on the same host.
Typical output:
Granted job allocation 4
Main Options
| Option | Meaning |
|---|---|
| `-J, --job-name <name>` | Set the allocation name |
| `-p, --partition <partition>` | Choose a partition |
| `-c, --cpus-per-task <n>` | CPUs per task |
| `-n, --ntasks <n>` | Number of concurrently launched local tasks |
| `--mem <size>` | Requested memory |
| `-t, --time <time>` | Time limit |
| `-G, --gpus <n>` | Requested GPU slots |
| `-D, --chdir <path>` | Working directory |
| `--constraint <feature>` | Require matching local features |
| `--immediate` | Fail if the allocation cannot start immediately |
Example
salloc -p gpu -c 4 --mem 8G -G 1 -t 00:30:00
Expected result:
- an allocation record is created
- the command waits until the allocation is running
- your shell starts inside the allocation
- later `srun` commands become steps under that allocation
- the allocation command uses the allocation task count for local multi-task execution
Queue and Accounting
squeue
squeue shows top-level queued and running jobs.
Common Options
| Option | Meaning |
|---|---|
| `--all` | Show all states |
| `-t, --states` | Filter by state |
| `-j, --jobs` | Filter by job IDs |
| `-u, --user` | Filter by user |
| `-p, --partition` | Filter by partition |
| `-o, --format` | Select output fields |
| `-S, --sort` | Sort rows |
| `-l, --long` | Long default view |
| `--start` | Show estimated start times |
| `--array` | Show array-style job IDs |
| `--noheader` | Omit the header |
Default View
JOBID | PARTITION | NAME | USER | ST | TIME | NODELIST(REASON)
Long View
JOBID | PARTITION | NAME | USER | ST | TIME | TIME_LIMIT | NTASKS | CPUS | REQ_MEM | REQ_GPU | NODELIST(REASON)
Start-Time View
With --start and no explicit format:
JOBID | PARTITION | NAME | USER | ST | START_TIME | NODELIST(REASON)
Format Fields
Supported field names:
- `JobID`
- `Partition`
- `Name`, `JobName`
- `User`
- `ST`, `State`
- `Time`, `Elapsed`
- `TimeLimit`, `Time_Limit`
- `NTasks`
- `CPUS`, `ReqCPUS`
- `ReqMem`
- `ReqGPU`, `ReqGPUS`
- `Start`, `StartTime`
- `NodeList(Reason)`, `NodeListReason`, `Reason`, `NodeList`
Supported % codes:
- `%i`
- `%P`
- `%j`
- `%u`
- `%t`, `%T`
- `%M`
- `%S`
- `%R`, `%N`
sacct
sacct shows persisted accounting data, including completed jobs and steps.
Common Options
| Option | Meaning |
|---|---|
| `-j, --jobs` | Filter by job IDs |
| `-s, --state` | Filter by state |
| `-S, --starttime` | Filter by start time |
| `-E, --endtime` | Filter by end time |
| `-u, --user` | Filter by user |
| `-p, --partition` | Filter by partition |
| `-o, --format` | Select output fields |
| `-P, --parsable2` | Pipe-separated parsable output |
| `-n, --noheader` | Omit the header |
Default View
JobID | Partition | JobName | User | State | ExitCode
Record Types
sacct includes:
- top-level jobs
- allocation records
- step records
- completed records
ID rendering rules:
- step IDs appear as `<job_id>.<step_id>`
- array tasks appear as `<array_job_id>_<task_id>`
Format Fields
Supported field names:
- `JobID`
- `ArrayJobID`
- `ArrayTaskID`
- `JobName`
- `Partition`
- `User`
- `State`
- `Reason`
- `ExitCode`
- `Elapsed`
- `AllocCPUS`
- `ReqMem`
- `ReqTRES`
- `AllocTRES`
- `NodeList`
- `Submit`
- `Start`
- `End`
- `WorkDir`
- `BatchFlag`
- `MaxRSS`
Supported % codes:
- `%i`
- `%F`
- `%K`
- `%j`
- `%P`
- `%u`
- `%t`, `%T`
- `%R`
- `%X`
- `%M`
- `%C`
- `%m`
- `%b`
- `%B`
- `%N`
- `%V`
- `%S`
- `%E`
- `%Z`
Job Control
scontrol
Supported forms:
scontrol show job <job_id>
scontrol hold job <job_id>
scontrol release job <job_id>
scontrol update job <job_id> KEY=VALUE...
show job
Shows detailed job information, including:
- job identity and ownership
- state and reason
- requested resources
- time limits
- dependency string
- submit, start, and end timestamps
- command and working directory
- stdout and stderr paths
- array metadata
- `ReqTRES`, `AllocTRES`, `MaxRSS`
- step summary
hold job
Moves a pending job into a held state.
Result:
- the job stays `PENDING`
- the reason becomes `JobHeldUser`
release job
Releases a held pending job back into normal scheduling.
update job
Supported update keys:
| Key | Rule |
|---|---|
| `JobName` / `Name` | Can be changed only while the job is `PENDING` |
| `Partition` | Can be changed only while the job is `PENDING` |
| `TimeLimit` / `Time` | Can be changed until the job reaches a terminal state |
| `Priority` | Can be changed only while the job is `PENDING` |
Example:
scontrol update job 10 TimeLimit=02:00:00
scancel
Supported forms:
scancel <job_id>
scancel <job_id.step_id>
scancel --signal <sig> <job_id>
scancel --signal <sig> <job_id.step_id>
Default Cancel Behavior
- pending jobs become `CANCELLED` immediately
- running jobs transition through `COMPLETING`
- the runner sends `SIGTERM`
- after the grace period it sends `SIGKILL` if needed
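The same TERM-then-KILL pattern looks like this as a standalone script; the 2-second grace period here is illustrative, not slotd's configured value:

```shell
#!/usr/bin/env sh
# Cancel sequence sketch: SIGTERM first, poll through a grace period,
# SIGKILL only if the process is still alive afterwards.
sleep 60 &
pid=$!

kill -TERM "$pid" 2>/dev/null
grace=2
while [ "$grace" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
  sleep 1
  grace=$((grace - 1))
done
if kill -0 "$pid" 2>/dev/null; then
  kill -KILL "$pid" 2>/dev/null
fi
wait "$pid" 2>/dev/null || true
echo "job process terminated"
```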
The recorded cancel reason is:
CancelledByUser
Signal Mode
--signal sends a specific signal instead of performing normal cancellation.
Example:
scancel --signal TERM 12
Node and Partition View
sinfo
sinfo shows partition and host state for the local node.
Common Options
| Option | Meaning |
|---|---|
| `-p, --partition` | Filter partitions |
| `-N, --Node` | Switch to the single-node summary view |
| `-l, --long` | Use the long default view |
| `-o, --format` | Select output fields |
| `--noheader` | Omit the header |
Default View
Typical output:
PARTITION | HOSTNAMES | STATE | FEATURES | GRES_USED
cpu* | localhost | idle | cpu | N/A
gpu | localhost | idle | cpu,generic_gpu | gpu:0
Notes:
- one row is shown per configured partition
- the default partition is marked with `*`
- CPU partitions show only `cpu` in `FEATURES`
- GPU partitions show `cpu` plus detected GPU model features, and do not include the generic `gpu` label in `FEATURES`
- CPU and GPU partitions are virtual views for convenience on the same host
- CPU and memory capacity are shared across those partitions rather than split into separate pools
Node View
sinfo -N collapses the partition rows into a single local node summary.
Typical output:
PARTITION | HOSTNAMES | STATE
cpu,gpu* | localhost | idle
Notes:
- the `PARTITION` column becomes a comma-joined partition list
- the default partition in that list keeps the `*` marker
- capacity and allocation fields in long or formatted node views are aggregated across the visible partitions
Long View
sinfo -l adds capacity and allocation details:
PARTITION | HOSTNAMES | STATE | FEATURES | CPUS | CPU_ALLOC | MEMORY | MEM_ALLOC | GPUS | GPU_ALLOC | RUNNING | PENDING | GRES_USED
Supported Format Fields
Field names:
- `Partition`
- `Hostnames`, `Hostname`, `NodeList`
- `State`
- `Features`
- `CPUS`
- `CPU_ALLOC`, `CPUSLOAD`, `CPUALLOC`
- `Memory`, `Mem`
- `MEM_ALLOC`, `MemoryAllocated`, `MemAlloc`
- `GPUS`
- `GPU_ALLOC`, `GpusAllocated`, `GpuAlloc`
- `Running`, `RunningJobs`
- `Pending`, `PendingJobs`
- `GRES_USED`, `GresUsed`
Supported % codes:
- `%P`
- `%N`
- `%t`, `%T`
- `%f`
- `%G`
Testing
slotd is mainly verified with Rust integration tests in tests/.
Each test creates an isolated temporary runtime, launches its own daemon, and drives the public Slurm-style commands through the compiled slotd binary.
That keeps test state separate from your normal local SLOTD_ROOT.
How to Run the Tests
Run the full suite:
cargo test
Run one integration test file:
cargo test --test scheduling
Run one named test:
cargo test dependency_job_waits_for_prerequisite_before_running --test scheduling
What the Tests Cover
The current suite focuses on behavior that matters to end users:
- core command flows for `sbatch`, `srun`, `salloc`, `sinfo`, `squeue`, `sacct`, and `scontrol`
- scheduling rules such as dependencies, job arrays, delayed start, constraints, resource flags, and requeue behavior
- interactive and foreground execution paths including `srun --pty`, `--label`, `--unbuffered`, and allocation or step handling
- output, recovery, and lifecycle behavior including cancellation, warning signals, update processing, and output file placement
- notification and reporting paths such as `SLOTD_NOTIFY_CMD` and parsable query output
Representative files in tests/ include:
- `cli_basic.rs`, `sbatch_options.rs`, `srun.rs`, `srun_options.rs`, `srun_modes.rs`, `salloc.rs`, `sinfo.rs`, `control.rs`, `query_squeue.rs`, `query_sacct.rs`
- `scheduling.rs`, `compound_scheduling.rs`, `dependency_variants.rs`, `array.rs`, `begin.rs`, `constraint.rs`, `resource_flags.rs`, `requeue.rs`, `timeout.rs`
- `srun_interactive.rs`, `srun_allocation.rs`, `cpu_bind.rs`, `output_files.rs`, `cancellation.rs`, `recovery.rs`, `update.rs`, `warning_signal.rs`, `notify.rs`
Manual Smoke Testing
For a quick manual check, start the daemon:
cargo run -- daemon
Then submit a small job from another shell using the same SLOTD_ROOT:
cargo run -- sbatch --wrap 'echo hello'
Examples
CPU Batch Job
sbatch \
-J hello \
-p cpu \
-c 1 \
--mem 512M \
-t 00:05:00 \
-o logs/%j.out \
--wrap 'echo hello'
Expected result:
- `Submitted batch job <id>`
- `logs/<id>.out` contains `hello`
GPU Batch Job
sbatch \
-J gpu-demo \
-p gpu \
-c 4 \
--mem 8G \
-G 1 \
-t 01:00:00 \
-o logs/%j.out \
--wrap 'nvidia-smi'
Expected result:
- the job runs on the GPU partition
- output contains `nvidia-smi` data
Interactive Foreground Run
srun --label --unbuffered -- echo hello
Expected result:
0: hello
Interactive Allocation
salloc -p gpu -c 4 --mem 8G -G 1 -t 00:30:00
Expected result:
- `Granted job allocation <id>`
- a shell starts inside the allocation
Array Job
sbatch \
-J array-demo \
-a 0-9%2 \
-o logs/%A_%a.out \
--wrap 'echo task=$SLURM_ARRAY_TASK_ID'
Expected result:
- multiple task records
- logs such as `logs/<array_id>_0.out`
Requeue Once
sbatch --requeue --wrap 'exit 1'
Expected result:
- the first failure returns the job to `PENDING`
- the second failure leaves the final state as `FAILED`
Delayed Start
sbatch --begin now+00:10:00 --wrap 'echo delayed'
Expected result:
- the job remains pending until the begin time
- `squeue --start` shows an estimated future start time
Explicit Export
sbatch \
--export FOO=bar,HELLO=world \
--wrap 'echo "$FOO $HELLO"'
Expected result:
- output contains `bar world`
Manual Daemon Run
SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd daemon
Then from another shell:
SLOTD_ROOT="$HOME/.local/share/slotd" ./target/release/slotd sbatch --wrap 'echo hello'
Troubleshooting
Connection refused
Symptom:
error: io error: Connection refused (os error 111)
Meaning:
- the client found a socket path
- but no daemon is currently accepting connections there
Check:
- the daemon is actually running
- the client and the daemon use the same `SLOTD_ROOT`
- you do not have a stale socket file from an older run
Useful checks:
ls -l "$SLOTD_ROOT/run/slotd.sock"
sinfo
Wrong Runtime Root
If the daemon uses one runtime root and the client uses another, commands appear to fail randomly.
Common cause:
- daemon started by `systemd --user`
- client launched from a shell without the same `SLOTD_ROOT`
Best fix:
- install with `scripts/install.sh`
- use the wrapper commands from `~/.local/bin`
No GPU Partition Appears
If sinfo only shows cpu, check:
- `nvidia-smi` works in the daemon environment
- `SLOTD_GPU_PARTITIONS` is configured as expected
- the daemon has been restarted after GPU configuration changes
Manual check:
nvidia-smi
sinfo
Text file busy During Reinstall
The installer now replaces binaries atomically. If you still hit issues:
- stop the daemon
- run the installer again
Commands:
systemctl --user restart slotd.service
./scripts/install.sh
Jobs Stay Pending
Possible causes:
- insufficient resources
- dependency not satisfied
- array concurrency limit
- delayed start time not reached
- job is held
- an exclusive job is already running
Inspect with:
squeue
scontrol show job <job_id>
cgroup Setup Fails
If SLOTD_CGROUP_BASE is set but does not point at a writable cgroup v2
subtree, job launch fails with an explicit cgroup error.
Check:
- the path exists in the daemon environment
- the path is a cgroup v2 subtree, not a regular directory or file
- the daemon user can create per-job subdirectories and write control files there
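A manual pre-flight check for those conditions can look like this. It assumes the usual way to recognize a cgroup v2 subtree, namely the presence of a `cgroup.controllers` file:

```shell
#!/usr/bin/env sh
# Check that a path looks like a writable cgroup v2 subtree before
# pointing SLOTD_CGROUP_BASE at it.
check_cgroup_base() {
  base=$1
  if [ ! -d "$base" ]; then echo "missing: $base"; return 1; fi
  if [ ! -f "$base/cgroup.controllers" ]; then
    echo "not a cgroup v2 subtree: $base"; return 1
  fi
  if [ ! -w "$base" ]; then echo "not writable: $base"; return 1; fi
  echo "ok: $base"
}

# A plain temporary directory must fail the subtree check.
tmp=$(mktemp -d)
check_cgroup_base "$tmp" || true
rm -rf "$tmp"
```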
A Job Was Cancelled but Ended as OUT_OF_MEMORY
This can happen if cgroup memory events indicate an OOM during termination. In that case the final state prefers OUT_OF_MEMORY.
Output Files Are Missing
Check:
- the job’s working directory
- the `-o`/`--output` and `-e`/`--error` paths
- pattern expansion such as `%j` or `%A_%a`
Inspect with:
scontrol show job <job_id>