Lifecycle of a request

Introduction

This page tracks the lifecycle of a request from a tagged page to the database. The flowchart below shows the journey of a visitor's data.

DCS Flowchart

Tag

The visscore tag has its own docs, which can be found here. Once the tag fires on a tagged site, the request is caught by a load balancer and redirected to one of the DCS servers, which are automatically scaled to handle load.

DCS

The DCS constantly keeps itself up to date, so any new account config added by clients is taken into account when processing requests. See here for more information on this process; in brief, every client has an associated Account object that stores all of the config required to process their hits from the tag. Each DCS has a queue of visits to be inserted into the database, and a worker process (see below) constantly serves the queue to handle inserts.

Main.py

handle()

The data from the tag is passed directly, in its raw state, to the handle() function in main.py.

Before validation, the VisitSchema is prepared by binding additional data to it: the request itself, the Flask server processing the data, and the account for which the data is being processed. This happens at this stage because the information cannot be known at the time the VisitSchema is created; the VisitSchema could be held on any one of a large number of servers, for example.

Thereafter, the tag data is validated by passing it through the VisitSchema. If validation by the VisitSchema class fails, the request falls back to the ValidationSchema, a truncated version of the VisitSchema that validates only essential data. This is so that we throw away as few hits as possible; we always want to keep a hit from our tag where we can. Where validation fails we also want to know why, so a set of tables (attrib_validation and attrib_validation_*) stores information about each failure, including the page that caused it, the browser and OS used, and a human-readable validation message generated by Colander.
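The fallback flow above can be sketched as follows. Plain Python callables stand in here for the real Colander schemas, and the field names are illustrative, not the actual VisitSchema fields:

```python
# Sketch of the validation fallback in handle(): try the full schema
# first, then the truncated one, recording why full validation failed.

def full_schema(data):
    # Stand-in for VisitSchema.deserialize(): requires every field.
    for field in ("page", "browser", "os"):
        if field not in data:
            raise ValueError(f"{field!r} is missing")
    return data

def essential_schema(data):
    # Stand-in for ValidationSchema: only the essentials survive.
    if "page" not in data:
        raise ValueError("'page' is missing")
    return {"page": data["page"]}

def validate_hit(data, failure_log):
    try:
        return full_schema(data)
    except ValueError as exc:
        # Record why full validation failed (mirrors the
        # attrib_validation* tables described above).
        failure_log.append({"page": data.get("page"), "reason": str(exc)})
    try:
        return essential_schema(data)
    except ValueError:
        return None  # thrown away entirely

log = []
print(validate_hit({"page": "/home"}, log))  # truncated hit kept: {'page': '/home'}
print(log[0]["reason"])                      # 'browser' is missing
```

A hit that passes the full schema is kept whole; one that only passes the essential schema is kept in truncated form with the failure reason logged; one that fails both returns None and is discarded.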

If the request data fails validation by both the VisitSchema and the ValidationSchema, it is thrown away and an exception is logged. A response is also sent to the tag to make clear that the data was invalid or malformed, which is vital for testing purposes.

If validation is successful, a Visit object is created. Before this visit is queued for insertion into the database, however, we first determine whether we are able to set a cookie on the browser that sent the request.

Cookies - first party vs third party

Generally, cookies are necessary for a number of reasons - for example tracking items in a basket or whether a user is logged in. For our purposes, we are primarily interested in tracking this visitor if they log in across multiple tabs, browsers or locations. Hence, we try to set a cookie on the visitor's browser in most circumstances.

There are two situations in which a cookie is never set: when the browser has sent the Do Not Track request header (i.e. the visitor has specifically set their preferences so they are not tracked), and when the request was generated by our @simulate service. Otherwise, each time we successfully create a visit we attempt to set a cookie on the browser that sent the request, so we can use the information again in future.
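Those two checks can be sketched like this. The DNT header value is the standard one, but the user-agent check for @simulate traffic is a hypothetical stand-in for however the service actually marks its requests:

```python
def should_set_cookie(headers, user_agent):
    """Decide whether a cookie may be set for this request (sketch)."""
    if headers.get("DNT") == "1":          # visitor opted out of tracking
        return False
    if "simulate" in user_agent.lower():   # hypothetical @simulate marker
        return False
    return True

print(should_set_cookie({"DNT": "1"}, "Mozilla/5.0"))  # → False
print(should_set_cookie({}, "Mozilla/5.0"))            # → True
```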

The DCS has two cookie objects, VisitCookie and VisitThirdPartyCookie.

The two types of cookie relevant here are first-party and third-party. First-party cookies are set with the domain attribute matching the client site: for example, if a client's site is at www.client.com, we might set a cookie using the value cubed.client.com. Because this cookie is set using the same domain as the client's site, most browsers allow it automatically. When onboarding a new client, we ask if we can set cookies using this configuration. If permitted, we set two cookies on the visitor's browser: a visitor cookie with a long-term expiry of two years, and a session cookie with an expiry of 30 minutes. These store visitor_id (vid) and session_id (sid) respectively, allowing us to more easily identify repeat visitors.

If, however, we are not able to set a cookie via the client's domain, we can still attempt to set a cookie on the visitor's browser, but we cannot set the domain attribute using the client's own domain. The domain is instead set to data.withcubed.com, which makes it a third-party cookie, and as a result the cookie will be blocked by default in a variety of newer browsers, particularly those that use WebKit. This cookie is the same as the visitor cookie above and explicitly sets a visitor_id as the value for this visitor.
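The two configurations can be sketched with the standard library's http.cookies module. The cookie names vid and sid and the expiries come from the description above; the helper name and its arguments are illustrative:

```python
from http.cookies import SimpleCookie

def build_cookies(visitor_id, session_id, first_party_domain=None):
    """Build the cookies for a visit response (sketch).

    first_party_domain is the client-approved domain (e.g. cubed.client.com);
    when absent we fall back to the third-party data.withcubed.com domain.
    """
    jar = SimpleCookie()
    if first_party_domain:
        jar["vid"] = visitor_id
        jar["vid"]["domain"] = first_party_domain
        jar["vid"]["max-age"] = 2 * 365 * 24 * 3600   # two years
        jar["sid"] = session_id
        jar["sid"]["domain"] = first_party_domain
        jar["sid"]["max-age"] = 30 * 60               # thirty minutes
    else:
        # Third-party fallback: blocked by default in WebKit browsers.
        jar["vid"] = visitor_id
        jar["vid"]["domain"] = "data.withcubed.com"
        jar["vid"]["max-age"] = 2 * 365 * 24 * 3600
    return jar

cookies = build_cookies("v123", "s456", "cubed.client.com")
print(cookies["sid"]["max-age"])  # → 1800
```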

Another common method of setting cookies worth mentioning is via banner adverts for our client on another company's site. In this scenario, www.notourclient.com would be the site domain and www.client.com the domain associated with the banner. As a result, any cookie we set on the visitor's browser will be a third-party cookie, even if the client allows us to use the domain cubed.client.com, because the cookie is still set by a domain that doesn't match the URL of the site the visitor is actually using (www.notourclient.com vs cubed.client.com). This makes it difficult to track things like impressions for advertising campaigns, as third-party cookies are disallowed by default.

Once the appropriate type of cookie is determined and generated, the cookies are inserted into a JSON response and sent back to the browser to be set, using a built-in set_on_response method from the Python requests library.

Queuing for insertion

After all of this is done, we know that the visit is valid and contains enough information to be inserted into our database, and we should have some manner of tracking the visitor and their browser. Finally, the visit is added to the worker queue for processing and insertion into the database. If for some reason the database is unavailable, visits accumulate in the queue until it reaches maximum capacity; once the queue is full, an exception is raised and the visit is lost.
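A minimal sketch of that bounded-queue behaviour, using Python's standard queue module (the capacity of 3 is illustrative; the real capacity is configured on the DCS):

```python
import queue

# Bounded visit queue: when the database is down the queue fills up,
# and further puts raise queue.Full (the "visit is lost" case above).
visit_queue = queue.Queue(maxsize=3)

lost = 0
for visit_id in range(5):
    try:
        visit_queue.put_nowait({"visit_id": visit_id})
    except queue.Full:
        lost += 1   # in the DCS an exception is logged here

print(visit_queue.qsize(), lost)  # → 3 2
```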

Worker.py

loop()

Much like the AccountList class, the Worker extends Runner, giving it start() and stop() functions and a loop() function that runs in perpetuity until the pydcs service is halted. At its top level, loop() is quite simple: it gets the next visit in the queue and attempts to insert its data into the various tables. Where this causes an exception, a flag is set to show that the visit had to be requeued, and it is placed back into the queue for future processing. This cycle continues until the request is more than 7 days old, at which point it is finally thrown away. As above, this is to ensure we keep as much of our collected data as possible.
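One iteration of that cycle might be sketched as follows. The process_once helper and the field names are illustrative, not the real Worker API:

```python
import queue
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=7)

def process_once(visit_queue, insert, now):
    """One iteration of the worker loop (sketch).

    insert is the function that writes the visit to the database; a visit
    that fails is flagged as requeued and put back, unless it is more
    than seven days old, in which case it is dropped.
    """
    visit = visit_queue.get_nowait()
    try:
        insert(visit)
    except Exception:
        if now - visit["created"] > MAX_AGE:
            return "dropped"
        visit["requeued"] = True
        visit_queue.put_nowait(visit)
        return "requeued"
    return "inserted"

q = queue.Queue()
q.put({"created": datetime(2024, 1, 1), "requeued": False})

def failing_insert(visit):
    raise RuntimeError("database unavailable")

# Within the 7-day window the visit goes back on the queue...
print(process_once(q, failing_insert, datetime(2024, 1, 3)))   # → requeued
# ...but once it is more than 7 days old it is thrown away.
print(process_once(q, failing_insert, datetime(2024, 1, 9)))   # → dropped
```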

insert_visit()

This is a simple process that nevertheless handles a lot of data. insert_visit() splits the visit into its component parts, inserts them into the relevant attrib tables in the database, and updates stats on some common metrics (number of insertions, how many failed, how many were requeued, etc.) for each client account. Before this happens, however, we must confirm that the originator of the visit is not blacklisted in some way, either by visitor_id or by IP. The worker combines information from the Account model and the visit itself to make these checks, and throws the visit away if it is blacklisted.
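The blacklist check might look something like this sketch (the field and key names are assumptions, not the real Account model attributes):

```python
def is_blacklisted(visit, account):
    """Check the visit originator against the account's blacklists (sketch)."""
    return (visit.get("visitor_id") in account["blacklisted_visitors"]
            or visit.get("ip") in account["blacklisted_ips"])

account = {"blacklisted_visitors": {"v-bad"}, "blacklisted_ips": {"10.0.0.9"}}
print(is_blacklisted({"visitor_id": "v-bad", "ip": "1.2.3.4"}, account))  # → True
print(is_blacklisted({"visitor_id": "v-ok", "ip": "1.2.3.4"}, account))   # → False
```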

Pages, events, labels, redirects, endpoints, visitor details and the visit itself are all inserted by this function. Impressions are also inserted, but slightly differently: if a visit is tagged as an impression, it is inserted as an impression rather than a visit, and the rest of the above information, if passed, is ignored.
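The routing described above can be sketched as a single branch (table and flag names here are illustrative):

```python
def split_visit(visit):
    """Route a validated visit to its insert targets (sketch).

    If the visit is tagged as an impression, only the impression is
    inserted and the remaining attribute data is ignored.
    """
    if visit.get("is_impression"):
        return ["impressions"]
    targets = ["visits", "visitors"]
    for part in ("pages", "events", "labels", "redirects", "endpoints"):
        if visit.get(part):
            targets.append(part)
    return targets

print(split_visit({"is_impression": True, "pages": ["/x"]}))  # → ['impressions']
print(split_visit({"pages": ["/home"], "events": ["click"]}))
# → ['visits', 'visitors', 'pages', 'events']
```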

All of the attempted inserts are protected by try/except blocks which will roll back attempted insertions if necessary. If a visitor with a matching visitor_id already exists in the database, a new visitor is not created and all data from this visit is instead associated with the existing visitor.
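A rough illustration of the protected, get-or-create insert pattern, using sqlite3 in place of the real database and an invented single-column visitor table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visitor (visitor_id TEXT PRIMARY KEY)")

def get_or_create_visitor(con, visitor_id):
    """Reuse an existing visitor row, rolling back on a failed insert.

    A stand-in for the try/except-protected inserts in insert_visit().
    """
    try:
        with con:  # commits on success, rolls back on exception
            con.execute("INSERT INTO visitor (visitor_id) VALUES (?)",
                        (visitor_id,))
        return "created"
    except sqlite3.IntegrityError:
        return "existing"  # associate this visit with the existing visitor

print(get_or_create_visitor(con, "v123"))  # → created
print(get_or_create_visitor(con, "v123"))  # → existing
```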

After all of this raw data is inserted, the worker performs some more resource-intensive processing. Non-exhaustively, this includes: syncing data from this visitor with data from another "visitor" who is deemed to be the same person; updating their sales journey to ensure that the path the visitor took through the tagged site is correct; and passing the visitor ID to SQS for prediction modelling.

All of the above functions take place as part of the same worker "job", so we know that each visit is (successfully) processed only once and then deleted from memory. Even when passing data to SQS, the worker waits for a successful acknowledgement from SQS confirming that processing is underway before moving on. This helps keep the buffer as free of visits as possible and ensures that each DCS instance is working at capacity.
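The acknowledgement wait can be sketched against a boto3-style client; boto3's SQS send_message returns a MessageId once the message has been accepted. FakeSQS here is a stand-in for the real boto3 client, and the queue URL is hypothetical:

```python
def send_for_prediction(sqs_client, queue_url, visitor_id):
    """Send the visitor ID to SQS and confirm the acknowledgement (sketch).

    Takes any client with a boto3-style send_message method, so the
    worker only moves on once SQS has accepted the message.
    """
    response = sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=visitor_id,
    )
    if "MessageId" not in response:
        raise RuntimeError("SQS did not acknowledge the message")
    return response["MessageId"]

class FakeSQS:
    # Stand-in client for this sketch; the real worker would use boto3.
    def send_message(self, QueueUrl, MessageBody):
        return {"MessageId": "msg-1"}

print(send_for_prediction(FakeSQS(), "https://sqs.example/queue", "v123"))
# → msg-1
```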