
Command Run

A run represents a single attempt at running a definition for the time period represented by its slot. A slot can have multiple runs associated with it, usually indicating that the slot has been retried.

Attempts

It would be helpful to have another field or table to group a set of runs together (e.g. command_run_attempt).

This table could hold metadata, such as whether a particular attempt was created by the engine or manually by an operator.

Database Structure

| Label | Type | Description | Notes |
| --- | --- | --- | --- |
| id | UUID | | |
| slot | CommandSlot | Defines which slot this run relates to | |
| status | CommandRunStatus | See run status | |
| waiting_at | DateTime | Timestamp of when this run transitioned to Waiting status | Essentially the same as a "created" timestamp, as all runs are created with the Waiting status |
| running_at | DateTime | Timestamp of when this run transitioned to Running status | |
| timeout_at | DateTime | Timestamp of when this run transitioned to Timeout status | |
| failed_at | DateTime | Timestamp of when this run transitioned to Failed status | |
| success_at | DateTime | Timestamp of when this run transitioned to Success status | |
| heartbeat_at | DateTime | Updated by the cron agent while it is executing the run | |

Run Status

| Type | Database Value | Description | Notes |
| --- | --- | --- | --- |
| Waiting | 0 | The run was just created, and is waiting to be executed by an agent | |
| Running | 1 | The run is currently being executed | |
| Timeout | 2 | The agent has stopped updating this run's heartbeat_at and the engine has timed it out | |
| Failed | 3 | The run has finished, but failed | |
| Success | 4 | The run has finished successfully | |
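
As a rough illustration of how the two tables above fit together, here is a minimal Python sketch of the status values and the run record. The class layout, the `slot_id` field (standing in for the CommandSlot reference) and the `new_run` helper are assumptions made for the example, not the actual schema or model code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional
from uuid import UUID, uuid4


class CommandRunStatus(IntEnum):
    """Status values as stored in the database (see the Run Status table)."""
    WAITING = 0
    RUNNING = 1
    TIMEOUT = 2
    FAILED = 3
    SUCCESS = 4


@dataclass
class CommandRun:
    id: UUID
    slot_id: UUID  # hypothetical: reference to the CommandSlot this run relates to
    status: CommandRunStatus
    waiting_at: datetime  # effectively the "created" timestamp
    running_at: Optional[datetime] = None
    timeout_at: Optional[datetime] = None
    failed_at: Optional[datetime] = None
    success_at: Optional[datetime] = None
    heartbeat_at: Optional[datetime] = None


def new_run(slot_id: UUID, now: Optional[datetime] = None) -> CommandRun:
    """Hypothetical helper: every run is created with the Waiting status."""
    if now is None:
        now = datetime.now(timezone.utc)
    return CommandRun(
        id=uuid4(),
        slot_id=slot_id,
        status=CommandRunStatus.WAITING,
        waiting_at=now,
    )
```

Using an `IntEnum` keeps the in-memory value identical to the stored integer, so `CommandRunStatus(3)` maps a database value straight back to `FAILED`.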

Status transitions

Here is a diagram indicating the various transitions the status can take, and when the various timestamps are changed (waiting_at, running_at etc):

[Diagram: Status transitions]

A status can only ever move forward, i.e. it cannot be changed from Waiting to Running and then back to Waiting.
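
The forward-only rule could be enforced with a small guard like the sketch below. The set of allowed transitions is an assumption inferred from the status descriptions (Waiting to Running, then Running to Timeout, Failed or Success); the `transition` function and the dict-based run are hypothetical, used only to keep the example self-contained.

```python
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional


class CommandRunStatus(IntEnum):
    WAITING = 0
    RUNNING = 1
    TIMEOUT = 2
    FAILED = 3
    SUCCESS = 4


# Assumed transition map: terminal statuses have no outgoing transitions.
ALLOWED_TRANSITIONS = {
    CommandRunStatus.WAITING: {CommandRunStatus.RUNNING},
    CommandRunStatus.RUNNING: {
        CommandRunStatus.TIMEOUT,
        CommandRunStatus.FAILED,
        CommandRunStatus.SUCCESS,
    },
    CommandRunStatus.TIMEOUT: set(),
    CommandRunStatus.FAILED: set(),
    CommandRunStatus.SUCCESS: set(),
}


def transition(run: dict, new_status: CommandRunStatus,
               now: Optional[datetime] = None) -> dict:
    """Return a copy of `run` moved to `new_status`, stamping the matching *_at field."""
    current = CommandRunStatus(run["status"])
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {new_status.name}")
    if now is None:
        now = datetime.now(timezone.utc)
    updated = dict(run)
    updated["status"] = new_status
    # e.g. RUNNING -> "running_at", matching the column names in the table
    updated[f"{new_status.name.lower()}_at"] = now
    return updated
```

Attempting `transition(run, CommandRunStatus.WAITING)` on a run that is already Running raises `ValueError`, which is the behaviour the rule describes.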

No heartbeat?

If the agent executing a run were to crash, we would lose all information about that process, and the run would never transition to Success or Failed.

This is why the heartbeat_at timestamp exists. If the agent stops updating it for a certain period of time, the engine transitions the run to Timeout so that it can be retried later.
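
A heartbeat check along these lines might look like the following sketch. The five-minute threshold, the `is_timed_out` function and the fallback to running_at when no heartbeat has been recorded yet are all assumptions for illustration; the actual timeout period and engine logic are not specified here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed threshold; the real period is not stated in this document.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)


def is_timed_out(running_at: datetime,
                 heartbeat_at: Optional[datetime],
                 now: datetime,
                 timeout: timedelta = HEARTBEAT_TIMEOUT) -> bool:
    """Return True if a Running run has gone too long without a heartbeat.

    If the agent never wrote a heartbeat, fall back to running_at so that a
    crash immediately after start is still detected (an assumption).
    """
    last_seen = heartbeat_at if heartbeat_at is not None else running_at
    return now - last_seen > timeout
```

The engine would evaluate this only for runs in the Running status and, when it returns true, move the run to Timeout and set timeout_at.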