
Network Layout

Each server has roughly the same layout:

(Diagram: network layout)

Active vs Failover

Generally, for every service there is a 🟢 active instance and a 🔴 failover instance (suffixed with either -a or -b). This is done for high availability: if there are any issues with the 🟢 active server, we can switch over to the 🔴 failover instance.

The two instances are installed on separate hardware, so even in the event of hardware failure, we should still be able to switch. However, both are still within the same data centre.

Switching to the failover is either automatically triggered (by monitoring software that Positive use), or manually triggered by a Positive operator.

Some services, such as the DCS and DPS, are configured so that every instance is treated as active.


Server: cubed-vip

Nginx is installed on these servers, and handles routing to each individual service within CUBED.

Note

We do not have SSH access to these servers; they are wholly managed by Positive.


Server: cubed-nfs

Instead of the codebase for each service (DCS, dashboard, etc.) being stored locally on each server, the codebases are stored in a central location on the cubed-nfs servers and shared using NFS.

See the documentation on NFS mounts and the utility server for more information on inspecting these mounts.
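If you just want a quick look at which NFS shares a given server has mounted, one option (a minimal sketch; run it on the server in question) is to read /proc/mounts and filter for NFS filesystems:

# List NFS mounts on this server, e.g. the shared codebase from cubed-nfs.
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, *_ = line.split()
        if fstype.startswith("nfs"):
            print(f"{device} -> {mountpoint} ({fstype})")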


Server: cubed-grafana

This server is home to our Grafana instance.

It is publicly available at https://grafana.cubed.engineering/. Root sign-in credentials can be found in AWS Secrets Manager under Prod/Login/Grafana.


Server: cubed-metrics

The cubed-metrics servers house two services required for monitoring: Loki and InfluxDB.

Loki

Loki is a log aggregation system designed to store and query logs from applications and infrastructure. It is made by Grafana Labs, the same company that makes Grafana.

We use it to capture and search the logs produced by each of our servers and services.

Loki does not capture the logs itself; that is instead done by an agent installed on each server. See Promtail for more details.

Common labels

We have a set of common labels, that each server should be configured to have:

Label         Value              Description
environment   stge               Is a staging server
              prod               Is a production server
service       control            Is a control server
              dash               Is a dashboard server
              dcs                Is a DCS server
              pydps-prediction   Is a DPS prediction/visscore server
              pydps-segmenter    Is a DPS segmenter server

You can use these labels to filter logs for a specific service. For example, if we wanted to find logs by production DCS servers that contained the text "requeued", we could use a query like this:

{service="dcs", environment="prod"} |= `requeued`
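Queries like this are normally run from Grafana, but they can also be run directly against Loki's HTTP API. A minimal sketch in Python, assuming Loki listens on its default port 3100 and that cubed-metrics-a is the hostname (substitute the real address for your environment):

import json
import time
import urllib.parse
import urllib.request

# Assumed address: Loki's default HTTP port is 3100.
LOKI = "http://cubed-metrics-a:3100"

params = urllib.parse.urlencode({
    "query": '{service="dcs", environment="prod"} |= `requeued`',
    "start": int((time.time() - 3600) * 1e9),  # last hour, as nanosecond epochs
    "end": int(time.time() * 1e9),
    "limit": 50,
})

with urllib.request.urlopen(f"{LOKI}/loki/api/v1/query_range?{params}") as resp:
    data = json.load(resp)

# Each result is one log stream (a unique label set) with its matching lines.
for stream in data["data"]["result"]:
    for _timestamp, line in stream["values"]:
        print(line)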

InfluxDB

InfluxDB is an open-source time series database. We use it to store metrics captured by Telegraf (e.g. CPU, memory) as well as custom metrics exposed by the DCS/DPS.

Web Interface

The InfluxDB web interface is available publicly at https://influx.cubed.picl.co.uk/. You can use this interface for inspecting the metrics we capture. Sign-in credentials can be found in AWS Secrets Manager under Prod/Login/InfluxDB.
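Metrics can also be queried programmatically with a Flux query. A minimal sketch using the influxdb-client Python package, assuming you have an API token with read access; the organisation and bucket names match the Telegraf configuration shown later on this page:

from influxdb_client import InfluxDBClient

client = InfluxDBClient(
    url="https://influx.cubed.picl.co.uk",
    token="<read-access token>",
    org="Cubed",
)

# Mean user CPU usage per host over the last hour.
flux = """
from(bucket: "DCS_Servers")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_user")
  |> group(columns: ["host"])
  |> mean()
"""

for table in client.query_api().query(flux):
    for record in table.records:
        print(record.values.get("host"), record.get_value())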

Configuring Telegraf

We use the Telegraf agent to ship metrics to InfluxDB. The configuration for Telegraf is stored remotely within InfluxDB.

There is a separate configuration for each service, as well as each environment. You can view the configurations in the InfluxDB web interface by selecting Data in the left navigation, followed by the Telegraf tab.

An example of Telegraf's config might look like this:

[global_tags]
  service = "pydcs"
  environment = "stge"

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.influxdb_v2]]
  urls = ["https://influx.cubed.picl.co.uk"]
  token = "$INFLUX_TOKEN"
  organization = "Cubed"
  bucket = "DCS_Servers"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs", "nfs", "nfs4"]

[[inputs.diskio]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

[[inputs.http]]
  interval = "2s"
  urls = [
    "http://127.0.0.1:8000/@server"
  ]
  method = "GET"
  data_format = "json"

The [[inputs.http]] block at the end of this example shows how DCS metrics are captured from the /@server endpoint, using Telegraf's HTTP input plugin.
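If you need to see exactly what Telegraf is ingesting from the DCS, you can fetch the endpoint yourself from the DCS server. A minimal sketch (the JSON fields returned are specific to the DCS and not documented here):

import json
import urllib.request

# Fetch the same JSON document that Telegraf's HTTP input plugin scrapes.
with urllib.request.urlopen("http://127.0.0.1:8000/@server", timeout=5) as resp:
    print(json.dumps(json.load(resp), indent=2))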


Server: cubed-util

See the section on the Utility Server for more information.