Command Run
A run represents a single attempt at running a definition for the time period represented by its slot. A slot can have multiple runs associated with it, usually indicating that the slot has been retried.
Attempts
It would be helpful to have another field or table to group a set of runs together (e.g. command_run_attempt).
This table could include metadata, such as whether a particular attempt was created by the engine or manually by an operator.
Database Structure
Label | Type | Description | Notes |
---|---|---|---|
id | UUID | | |
slot | CommandSlot | Defines which slot this run relates to | |
status | CommandRunStatus | See run status | |
waiting_at | DateTime | Timestamp of when this run transitioned to Waiting status | Essentially the same as a "created" timestamp, as all runs are created with the Waiting status |
running_at | DateTime | Timestamp of when this run transitioned to Running status | |
timeout_at | DateTime | Timestamp of when this run transitioned to Timeout status | |
failed_at | DateTime | Timestamp of when this run transitioned to Failed status | |
success_at | DateTime | Timestamp of when this run transitioned to Success status | |
heartbeat_at | DateTime | This is updated by the cron agent while it is executing the run | |
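
As a rough illustration of the structure above, the record could be modelled as follows. This is a minimal sketch, not the project's actual model: the class name `CommandRun`, the `slot_id` reference, and the choice to make every timestamp except `waiting_at` optional are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID, uuid4


def _now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


@dataclass
class CommandRun:
    # Assumption: the slot is referenced by its id rather than embedded.
    slot_id: UUID
    id: UUID = field(default_factory=uuid4)
    # 0 is the database value for Waiting; all runs are created in this status.
    status: int = 0
    # Set at creation, so it doubles as a "created" timestamp.
    waiting_at: datetime = field(default_factory=_now)
    # Assumption: the remaining timestamps stay empty until the
    # corresponding transition (or first heartbeat) happens.
    running_at: Optional[datetime] = None
    timeout_at: Optional[datetime] = None
    failed_at: Optional[datetime] = None
    success_at: Optional[datetime] = None
    heartbeat_at: Optional[datetime] = None
```

A freshly created run therefore has `status == 0`, a populated `waiting_at`, and no other timestamps.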
Run Status
Type | Database Value | Description | Notes |
---|---|---|---|
Waiting | 0 | The run was just created, and is waiting to be executed by an agent | |
Running | 1 | The run is currently being executed | |
Timeout | 2 | The agent has stopped updating this run's heartbeat_at and the engine has timed it out | |
Failed | 3 | The run has finished, but failed | |
Success | 4 | The run has finished successfully | |
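
The mapping between status names and database values could be expressed as an integer enum. The sketch below assumes Python and uses the name `CommandRunStatus` from the structure table; the member names are an illustrative choice.

```python
from enum import IntEnum


class CommandRunStatus(IntEnum):
    """Run statuses with the integer values stored in the database."""

    WAITING = 0
    RUNNING = 1
    TIMEOUT = 2
    FAILED = 3
    SUCCESS = 4


# Convert a raw database value back into a named status.
status_from_db = CommandRunStatus(2)
```

Because `IntEnum` members compare equal to plain integers, a value read from the database can be compared against a member directly, and an unknown value such as `CommandRunStatus(9)` raises `ValueError`.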
Status transitions
Here is a diagram indicating the various transitions the status can take, and when the various timestamps are changed (waiting_at, running_at etc):
A status can only ever go forward, i.e. it cannot be changed from Waiting to Running and back to Waiting again.
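
The forward-only rule can be sketched as a small guard that also stamps the matching timestamp. The allowed transitions here (Waiting to Running, then Running to Timeout, Failed or Success) are an assumption inferred from the status descriptions, not taken from the diagram; the `transition` helper and the dictionary-based run are hypothetical.

```python
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional


class CommandRunStatus(IntEnum):
    WAITING = 0
    RUNNING = 1
    TIMEOUT = 2
    FAILED = 3
    SUCCESS = 4


# Assumed transition table: each status maps to the statuses it may move to.
ALLOWED = {
    CommandRunStatus.WAITING: {CommandRunStatus.RUNNING},
    CommandRunStatus.RUNNING: {
        CommandRunStatus.TIMEOUT,
        CommandRunStatus.FAILED,
        CommandRunStatus.SUCCESS,
    },
    CommandRunStatus.TIMEOUT: set(),
    CommandRunStatus.FAILED: set(),
    CommandRunStatus.SUCCESS: set(),
}


def transition(run: dict, new_status: CommandRunStatus,
               now: Optional[datetime] = None) -> None:
    """Move a run to new_status and record the matching *_at timestamp."""
    current = CommandRunStatus(run["status"])
    if new_status not in ALLOWED[current]:
        raise ValueError(
            f"cannot move from {current.name} to {new_status.name}"
        )
    run["status"] = int(new_status)
    # e.g. RUNNING -> "running_at", SUCCESS -> "success_at"
    run[f"{new_status.name.lower()}_at"] = now or datetime.now(timezone.utc)
```

With this guard, moving a run from Running back to Waiting raises `ValueError` instead of silently rewriting its history.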
No heartbeat?
If the agent executing a run were to crash, we would lose all information about that process, and the run would never transition to Success or Failed.
It is for this reason that we have the heartbeat_at timestamp. If the agent stops updating it for a certain period of time, we transition the run to Timeout, so it can be retried at a later date.
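
The staleness check the engine performs might look something like the sketch below. The five-minute threshold and the `has_timed_out` helper are hypothetical; the source only says "a certain period of time".

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the real period is not specified in this document.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)


def has_timed_out(heartbeat_at: datetime, now: datetime,
                  limit: timedelta = HEARTBEAT_TIMEOUT) -> bool:
    """Return True if the last heartbeat is older than the allowed limit."""
    return now - heartbeat_at > limit


# Example: a heartbeat six minutes old exceeds the five-minute limit.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = has_timed_out(now - timedelta(minutes=6), now)
fresh = has_timed_out(now - timedelta(minutes=1), now)
```

An engine loop would run a check like this over all runs in the Running status and transition the stale ones to Timeout.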