Indepth¶
Cubed is a data collection platform built for website traffic. Below is the table design and how the data is structured.
We treat each user on site as a Visitor who has many Visit(s). This is important to differentiate, and you could think of a Visit as a session. We generate a unique 32 character token for each new Visitor and for each of their subsequent Visit(s). The token is stored in a cookie on the user's browser, while we store it in the DB.
We drop 2 cookies to facilitate this vscr_vid
(visitor id) and vscr_sid
(session id). The vscr_vid
cookie attempts to live forever (2 years max), while the vscr_sid
will expire after 30mins of inactivity, or when the browser is closed.
When someone arrives on site our Cubed tag can fire many pieces of data including:
where the visitor came from via document.referer
the page the tag fired on window.location.href
cubed events - as defined in the account config section dash.withcubed.com
While processing this request we can read the headers and gain some extra information including: Ip address - for geo-location purposes OS details - name, version, etc.. Browser details - name, version, etc...
Once we've got this data validated and ready to be inserted we first check if the document.referer
string matches any patterns
that the user has created in the account config section (or the default Cubed patterns). All patterns
are a simple regex string and must belong to 1 "Channel" (which in our system can be known as a "referer" due to legacy code/table names).
For example we have a default pattern of "google.com", that belongs to the channel SEO. If a Visitor arrives on site and their document.referer
is "https://www.google.com" - meaning they clicked a link from google's search page, then we would pattern match that and thus assign that pattern's Id to that visit.
With this in mind you can now see how we can build a picture of a Visitor, arriving on site from different places across many days/weeks/months. It is not impossible for a Visitor to have a handful of Visits that first arrived via SEO, then Email, then Direct.
Finally to calculate "attribution" we need an "end point" in a Visitor's journey. What we call in the frontend a "Goal". Similar to "channel" and "referer" being inter-changeable, "Goal" and "Product" can be due to legacy code/table names. A Goal is basically a collection of "events" where one is flagged as a "sale" event. When processing all our data we use this event as a marker to say "the visitor's journey has finished, lets calculate how important all the visits were leading up to it".
Everything described above is stored in our "raw" tables all prefixed as attrib_
. We started as an "attribution platform" and so we our naming convention follows suit.
The "raw" tables also includes the configuration tables too. As mentioned above we can fire "cubed events" if they are defined in the account config section.
Config Tables¶
To configure a Cubed account we need Channels for Patterns to belong to, we need Events to fire, and we need to have a Goal with a designated "sale event".
table name | description | unique key(s) | foreign key(s) | extra details |
---|---|---|---|---|
attrib_referer | The names of all the account's Referers/Channels/Buckets. | |||
attrib_pattern | Regex patterns and the channels that they belong to | referer_id to attrib_referer.id | ||
attrib_event | All created events for the account | |||
attrib_product | Goals to hold events against | |||
attrib_product_events | A link of event(s) to Goals with 1 (or more) flagged as a Sale | event_id to attrib_event.id, product_id to attrib_product.id |
Raw data collection tables¶
Now lets look at the raw tables we insert into during data collection, including visitor with visits, the pages they saw, and the events they fired - along with any extra pieces of information.
table name | description | unique key(s) | foreign key(s) | extra details |
---|---|---|---|---|
attrib_visitor | The top most data point for all visitor data | token | ||
attrib_visit | The main data point we process and aggregate to, must belong to a visitor | token | visitor_id to attrib_visitor.id, pattern_id to attrib_pattern.id | |
attrib_visit_details | Holds all extra info we get from the request headers including country, city, browser, and OS etc.. | visit_id to attrib_visit.id | ||
attrib_event_item | Store event(s) to a visit | visit_id to attrib_visit.id, event_id to attrib_event.id | This can also store revenue and transaction_id for any event fired |
|
attrib_event_item_details | This will hold extra pieces of information for the fired event including name , category , sku . |
event_item_id to attrib_event_item.id | ||
attrib_visitor_details | This is a high level piece of data that allows us to join cubed data with external data via client_customer_id |
client_customer_id, client_customer_type | visitor_id to attrib_visitor.id | |
attrib_page | Store the raw URL collected, and it's separate components (host, port etc...) | visit_id to attrib_visit.id | ||
attrib_label | Labels are a way to colour in a visit with bespoke meta data. We allow mutliple "types" - as dictated by the columns | visit_id to attrib_visit.id, page_id to attrib_page.id, event_item_id to attrib_event_item.id, event_item_details_id to attrib_event_item.id |
Validation¶
If there is an issue with the data collection (whether its the Config, the tag, or on site) we will return a message to the browser and insert a validation message. This uses 4 tables to hold the data including message
, and that message's page
and details
(exact same as visit page
and visit detail
- see above.)
table name | description | unique key(s) | foreign key(s) | extra details |
---|---|---|---|---|
attrib_validation | Base table to store a validation Id against a visitor and visit | visit_id to attrib_visit.id, visitor_id to attrib_visitor.id | ||
attrib_validation_message | This holds the main details including name , message , and value |
validation_id to attrib_validation.id | ||
attrib_validation_page | Store a stripped down verison of attrib_page for the validation item |
validation_id to attrib_validation.id | ||
attrib_validation_details | Store a stripped down version of attrib_visit_details for the validation item |
validation_id to attrib_validation.id |
The idea here is you can check for validation items for any client and check for trends. Are more validation messages coming from a certain page? Or perhaps a specific browser/device. This is where you can find that information out.
Syncs¶
We are now tracking a Visitor on a website on their device. The next question is: what if that same Visitor accesses the website via another device? Some companies attempt to do "finger printing" by comparing device details+ip+user behaviour, but we have opted for a simple and gauranteed method.
The Cubed tag can fire a sync_token
which we store per Visitor. A sync_token
is an identifiable piece of information as specified by the client.
Usually this is something like a username, email address, account id or customer id.
Once we have a sync we check every request against our sync table, and if we find a matching token we now know that Visitor is actually another Visitor and we nove those Visit(s) over.
All sync_tokens
are hashed at the point of collection and thus we NEVER insert the actual sync_token
. Though this can make it difficult to debug sync issues, this is a security and privacy protocol that we take seriously.
table name | description | unique key(s) | foreign key(s) | extra details |
---|---|---|---|---|
attrib_sync | Store the hashed sync token against a visit | token | visit_id to attrib_visit.id | |
attrib_sync_point | The point where the sync was created | sync_id to attrib_sync.id | ||
attrib_sync_map | Quick lookup table of current visitor to previous visitor id | old_visitor_id | old_visitor_id to attrib_visitor.id, new_visitor_id to attrib_visitor.id | |
attrib_sync_visit | Link a sync_point.id with an attrib_visit.id | point_id to attrib_sync_point.id, visit_id to attrib_visit.id |
Raw tables post crons¶
Below are tables that are very close to the raw data collection tables, but are currently updated from the crons.
table name | description | unique key(s) | foreign key(s) | extra details |
---|---|---|---|---|
attrib_path | Keep a row per URL, using a hashing of that string as the "token". | token | This table is currently inserted into during the crons, but can/will be moved to the DCS | |
attrib_page_item | Store a reference to each path seen by each visit and their original attrib_page | page_id | visit_id to attrib_visit.id, path_id to attrib_path.id, page_id to attrib_page.id | This table is currently inserted into during the crons, but can/will be moved to the DCS |