Command Run

A run represents a single attempt at running a definition for the time period represented by its slot. A slot can have multiple runs associated with it, which usually indicates that the slot has been retried.

Attempts

It would be helpful to have another field or table that groups a set of runs together (e.g. `command_run_attempt`).

This table could hold metadata such as whether a particular attempt was created by the engine or manually by an operator.

Database Structure

| Label | Type | Description | Notes |
| --- | --- | --- | --- |
| id | `UUID` | | |
| slot | `CommandSlot` | Defines which slot this run relates to | |
| status | `CommandRunStatus` | See Run Status | |
| waiting_at | `DateTime` | Timestamp of when this run transitioned to `Waiting` status | Essentially the same as a "created" timestamp, as all runs are created with the `Waiting` status |
| running_at | `DateTime` | Timestamp of when this run transitioned to `Running` status | |
| timeout_at | `DateTime` | Timestamp of when this run transitioned to `Timeout` status | |
| failed_at | `DateTime` | Timestamp of when this run transitioned to `Failed` status | |
| success_at | `DateTime` | Timestamp of when this run transitioned to `Success` status | |
| heartbeat_at | `DateTime` | Updated by the cron agent while it is executing the run | |
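
As an illustration only, the columns above could be mirrored by an in-memory record like the sketch below. The class name `CommandRun`, the `slot_id` field (which assumes the slot is referenced by a UUID) and the `new_run` helper are hypothetical and not part of the documented schema. The sketch relies on the documented fact that every run starts in the `Waiting` status (database value `0`), so `waiting_at` doubles as a creation timestamp.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

WAITING = 0  # database value of the Waiting status


@dataclass
class CommandRun:
    """Hypothetical in-memory mirror of one row of the run table."""

    slot_id: uuid.UUID  # assumption: the slot is referenced by its id
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    status: int = WAITING
    waiting_at: Optional[datetime] = None
    running_at: Optional[datetime] = None
    timeout_at: Optional[datetime] = None
    failed_at: Optional[datetime] = None
    success_at: Optional[datetime] = None
    heartbeat_at: Optional[datetime] = None


def new_run(slot_id: uuid.UUID, now: Optional[datetime] = None) -> CommandRun:
    """Create a run in the Waiting status and stamp waiting_at."""
    now = now or datetime.now(timezone.utc)
    return CommandRun(slot_id=slot_id, waiting_at=now)
```

All other timestamps stay unset until the run reaches the corresponding status.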

Run Status

| Type | Database Value | Description |
| --- | --- | --- |
| Waiting | `0` | The run was just created, and is waiting to be executed by an agent |
| Running | `1` | The run is currently being executed |
| Timeout | `2` | The agent has stopped updating this run's `heartbeat_at` and the engine has timed it out |
| Failed | `3` | The run has finished, but failed |
| Success | `4` | The run has finished successfully |
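
One possible way to express these values in code is an integer enum, sketched here in Python. The member values are taken from the table above; the `status_from_db` helper is a hypothetical addition showing how a stored integer might be validated when it is read back.

```python
from enum import IntEnum


class CommandRunStatus(IntEnum):
    """Statuses and their database values, as listed in the table."""

    WAITING = 0
    RUNNING = 1
    TIMEOUT = 2
    FAILED = 3
    SUCCESS = 4


def status_from_db(value: int) -> CommandRunStatus:
    """Convert a stored integer into a status, rejecting unknown values."""
    try:
        return CommandRunStatus(value)
    except ValueError:
        raise ValueError(f"unknown CommandRunStatus database value: {value}") from None
```

Using an `IntEnum` keeps the members comparable with the raw integers stored in the database.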

Status transitions

Here is a diagram indicating the transitions the status can take, and when each timestamp (`waiting_at`, `running_at`, etc.) is set:

A status can only ever move forward, i.e. it cannot be changed from `Waiting` to `Running` and back to `Waiting` again.
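
The forward-only rule could be enforced with a small guard like the sketch below. The transition graph in `ALLOWED` is an assumption inferred from the status descriptions (`Waiting` to `Running`, then `Running` to `Timeout`, `Failed` or `Success`) and may not match the diagram exactly. The names `Status`, `Run` and `transition` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional


class Status(IntEnum):
    WAITING = 0
    RUNNING = 1
    TIMEOUT = 2
    FAILED = 3
    SUCCESS = 4


# Assumed transition graph; statuses missing from the keys have no way out.
ALLOWED = {
    Status.WAITING: {Status.RUNNING},
    Status.RUNNING: {Status.TIMEOUT, Status.FAILED, Status.SUCCESS},
}


@dataclass
class Run:
    status: Status = Status.WAITING
    waiting_at: Optional[datetime] = None
    running_at: Optional[datetime] = None
    timeout_at: Optional[datetime] = None
    failed_at: Optional[datetime] = None
    success_at: Optional[datetime] = None


def transition(run: Run, new_status: Status, now: Optional[datetime] = None) -> None:
    """Move the run forward and stamp the timestamp matching the new status."""
    if new_status not in ALLOWED.get(run.status, set()):
        raise ValueError(f"cannot go from {run.status.name} to {new_status.name}")
    run.status = new_status
    # e.g. Status.RUNNING -> "running_at"
    setattr(run, f"{new_status.name.lower()}_at", now or datetime.now(timezone.utc))
```

Because backward moves raise an error, each timestamp is written at most once per run.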

No heartbeat?

If the agent executing a run were to crash, we would lose all information about that process, and the run would never transition to `Success` or `Failed`.

This is why the `heartbeat_at` timestamp exists. If the agent stops updating it for a certain period of time, the engine transitions the run to `Timeout` so that it can be retried at a later date.
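
A minimal sketch of such a heartbeat check follows. The five-minute `HEARTBEAT_TIMEOUT` is a hypothetical value, since the actual period is not stated here. The `Run` class and `time_out_stale_runs` function are also illustrative names, and falling back to `running_at` when no heartbeat has been recorded yet is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional

RUNNING, TIMEOUT = 1, 2  # database values from the Run Status table

HEARTBEAT_TIMEOUT = timedelta(minutes=5)  # hypothetical threshold


@dataclass
class Run:
    status: int
    running_at: datetime
    heartbeat_at: Optional[datetime] = None
    timeout_at: Optional[datetime] = None


def time_out_stale_runs(
    runs: Iterable[Run],
    now: datetime,
    limit: timedelta = HEARTBEAT_TIMEOUT,
) -> List[Run]:
    """Mark Running runs whose heartbeat is older than `limit` as Timeout."""
    timed_out = []
    for run in runs:
        if run.status != RUNNING:
            continue  # only runs that are executing can time out
        # Assumption: before the first heartbeat, measure from running_at.
        last_seen = run.heartbeat_at or run.running_at
        if now - last_seen > limit:
            run.status = TIMEOUT
            run.timeout_at = now
            timed_out.append(run)
    return timed_out
```

The engine could call a function like this periodically and hand the returned runs to whatever logic schedules retries for their slots.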