Indepth¶

Cubed is a data collection platform built for website traffic. Below is the table design and how the data is structured.

We treat each user on site as a Visitor who has many Visit(s). This is important to differentiate, and you could think of a Visit as a session. We generate a unique 32 character token for each new Visitor and for each of their subsequent Visit(s). The token is stored in a cookie on the user's browser, while we store it in the DB. We drop 2 cookies to facilitate this vscr_vid (visitor id) and vscr_sid (session id). The vscr_vid cookie attempts to live forever (2 years max), while the vscr_sid will expire after 30mins of inactivity, or when the browser is closed. When someone arrives on site our Cubed tag can fire many pieces of data including: where the visitor came from via document.referer the page the tag fired on window.location.href cubed events - as defined in the account config section dash.withcubed.com

While processing this request we can read the headers and gain some extra information including: Ip address - for geo-location purposes OS details - name, version, etc.. Browser details - name, version, etc...

Once we've got this data validated and ready to be inserted we first check if the document.referer string matches any patterns that the user has created in the account config section (or the default Cubed patterns). All patterns are a simple regex string and must belong to 1 "Channel" (which in our system can be known as a "referer" due to legacy code/table names). For example we have a default pattern of "google.com", that belongs to the channel SEO. If a Visitor arrives on site and their document.referer is "https://www.google.com" - meaning they clicked a link from google's search page, then we would pattern match that and thus assign that pattern's Id to that visit. With this in mind you can now see how we can build a picture of a Visitor, arriving on site from different places across many days/weeks/months. It is not impossible for a Visitor to have a handful of Visits that first arrived via SEO, then Email, then Direct.

Finally to calculate "attribution" we need an "end point" in a Visitor's journey. What we call in the frontend a "Goal". Similar to "channel" and "referer" being inter-changeable, "Goal" and "Product" can be due to legacy code/table names. A Goal is basically a collection of "events" where one is flagged as a "sale" event. When processing all our data we use this event as a marker to say "the visitor's journey has finished, lets calculate how important all the visits were leading up to it".

Everything described above is stored in our "raw" tables all prefixed as attrib_. We started as an "attribution platform" and so we our naming convention follows suit. The "raw" tables also includes the configuration tables too. As mentioned above we can fire "cubed events" if they are defined in the account config section.

Config Tables¶

To configure a Cubed account we need Channels for Patterns to belong to, we need Events to fire, and we need to have a Goal with a designated "sale event".

table name	description	foreign key(s)
attrib_referer	The names of all the account's Referers/Channels/Buckets.
attrib_pattern	Regex patterns and the channels that they belong to	referer_id to attrib_referer.id
attrib_event	All created events for the account
attrib_product	Goals to hold events against
attrib_product_events	A link of event(s) to Goals with 1 (or more) flagged as a Sale	event_id to attrib_event.id, product_id to attrib_product.id

Raw data collection tables¶

Now lets look at the raw tables we insert into during data collection, including visitor with visits, the pages they saw, and the events they fired - along with any extra pieces of information.

table name	description	unique key(s)	foreign key(s)	extra details
attrib_visitor	The top most data point for all visitor data	token
attrib_visit	The main data point we process and aggregate to, must belong to a visitor	token	visitor_id to attrib_visitor.id, pattern_id to attrib_pattern.id
attrib_visit_details	Holds all extra info we get from the request headers including country, city, browser, and OS etc..		visit_id to attrib_visit.id
attrib_event_item	Store event(s) to a visit		visit_id to attrib_visit.id, event_id to attrib_event.id	This can also store `revenue` and `transaction_id` for any event fired
attrib_event_item_details	This will hold extra pieces of information for the fired event including `name`, `category`, `sku`.		event_item_id to attrib_event_item.id
attrib_visitor_details	This is a high level piece of data that allows us to join cubed data with external data via `client_customer_id`	client_customer_id, client_customer_type	visitor_id to attrib_visitor.id
attrib_page	Store the raw URL collected, and it's separate components (host, port etc...)		visit_id to attrib_visit.id
attrib_label	Labels are a way to colour in a visit with bespoke meta data. We allow mutliple "types" - as dictated by the columns		visit_id to attrib_visit.id, page_id to attrib_page.id, event_item_id to attrib_event_item.id, event_item_details_id to attrib_event_item.id

Validation¶

If there is an issue with the data collection (whether its the Config, the tag, or on site) we will return a message to the browser and insert a validation message. This uses 4 tables to hold the data including message, and that message's page and details (exact same as visit page and visit detail - see above.)

table name	description	foreign key(s)
attrib_validation	Base table to store a validation Id against a visitor and visit	visit_id to attrib_visit.id, visitor_id to attrib_visitor.id
attrib_validation_message	This holds the main details including `name`, `message`, and `value`	validation_id to attrib_validation.id
attrib_validation_page	Store a stripped down verison of `attrib_page` for the validation item	validation_id to attrib_validation.id
attrib_validation_details	Store a stripped down version of `attrib_visit_details` for the validation item	validation_id to attrib_validation.id

The idea here is you can check for validation items for any client and check for trends. Are more validation messages coming from a certain page? Or perhaps a specific browser/device. This is where you can find that information out.

Syncs¶

We are now tracking a Visitor on a website on their device. The next question is: what if that same Visitor accesses the website via another device? Some companies attempt to do "finger printing" by comparing device details+ip+user behaviour, but we have opted for a simple and gauranteed method.

The Cubed tag can fire a sync_token which we store per Visitor. A sync_token is an identifiable piece of information as specified by the client. Usually this is something like a username, email address, account id or customer id.

Once we have a sync we check every request against our sync table, and if we find a matching token we now know that Visitor is actually another Visitor and we nove those Visit(s) over.

All sync_tokens are hashed at the point of collection and thus we NEVER insert the actual sync_token. Though this can make it difficult to debug sync issues, this is a security and privacy protocol that we take seriously.

table name	description	unique key(s)	foreign key(s)
attrib_sync	Store the hashed sync token against a visit	token	visit_id to attrib_visit.id
attrib_sync_point	The point where the sync was created		sync_id to attrib_sync.id
attrib_sync_map	Quick lookup table of current visitor to previous visitor id	old_visitor_id	old_visitor_id to attrib_visitor.id, new_visitor_id to attrib_visitor.id
attrib_sync_visit	Link a sync_point.id with an attrib_visit.id		point_id to attrib_sync_point.id, visit_id to attrib_visit.id

Raw tables post crons¶

Below are tables that are very close to the raw data collection tables, but are currently updated from the crons.

table name	description	unique key(s)	foreign key(s)	extra details
attrib_path	Keep a row per URL, using a hashing of that string as the "token".	token		This table is currently inserted into during the crons, but can/will be moved to the DCS
attrib_page_item	Store a reference to each path seen by each visit and their original attrib_page	page_id	visit_id to attrib_visit.id, path_id to attrib_path.id, page_id to attrib_page.id	This table is currently inserted into during the crons, but can/will be moved to the DCS