
Model Training (rsrv)

Bi-weekly, the backend command fabric_train_attribution_model calls the script srv/rsrv/attribution/train.R, which is located on the rsrv. The main objective of this script is to pull data from the MySQL database, train a propensity model using Keras (predicting the likelihood of defined goals), and push the resulting optimal model to the S3 bucket on AWS. The script requires seven parameters, which are passed through when calling the command with fabric.

Script Arguments

The script expects the following arguments:

| Argument | Description |
| --- | --- |
| aws_access_key | AWS access key. |
| aws_secret_key | AWS secret key. |
| s3_url | S3 bucket URL. |
| prodid | Product ID. |
| full_con | Full database connection string. |
| start_date | Start date for data retrieval. |
| end_date | End date for data retrieval. |
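
A minimal sketch of how train.R might read these arguments with commandArgs; the positional order shown is an assumption rather than necessarily how the production script parses them.

```r
# Sketch only: assumes the seven arguments are passed positionally
# in the order listed above.
args <- commandArgs(trailingOnly = TRUE)

aws_access_key <- args[1]
aws_secret_key <- args[2]
s3_url         <- args[3]
prodid         <- as.integer(args[4])
full_con       <- args[5]
start_date     <- args[6]
end_date       <- args[7]
```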

Database Connection

The script parses the full_con string to extract database connection details and establishes a connection using RMySQL.
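
A minimal sketch of this step, assuming full_con follows a user:password@host:port/dbname format (the real format may differ):

```r
library(RMySQL)

# Assumed connection string format: "user:password@host:port/dbname".
con_parts <- regmatches(full_con,
                        regexec("^(.+):(.+)@(.+):([0-9]+)/(.+)$", full_con))[[1]]

vsdb <- dbConnect(
  MySQL(),
  user     = con_parts[2],
  password = con_parts[3],
  host     = con_parts[4],
  port     = as.integer(con_parts[5]),
  dbname   = con_parts[6]
)
```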

Product ID Handling

  • If prodid is not 0, it uses the provided product ID.
  • Otherwise, it retrieves active product IDs from the database (see the sketch below).
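
A sketch of this branch; the table and column names (product, active) are assumptions rather than the production schema:

```r
# Hypothetical table/column names (product, active) -- illustrative only.
if (prodid != 0) {
  product_ids <- prodid
} else {
  product_ids <- DBI::dbGetQuery(vsdb, "SELECT id FROM product WHERE active = 1")$id
}
```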

Once the product_id has been established, we find the optimal date range for pulling sales data based on sales volume: larger clients need fewer days of data to train a model, while smaller clients may require more days of data. We then pull the data from MySQL.

Function Explanation: find_dates

Purpose

The find_dates function determines the optimal date range for pulling sales data based on the sales volume. For larger clients, fewer days of data are needed to train a model, while smaller clients may require more days of data.

Parameters

  • vsdb: Database connection object.
  • product_id: ID of the product for which sales data is being retrieved.
  • start_date: Initial start date for data retrieval.
  • end_date: End date for data retrieval.

Process

| Step | Description |
| --- | --- |
| Initialize Variables | Convert end_date and start_date to POSIXct format; set start_date to one week before end_date; initialize an empty data frame sale_volume_date_df. |
| Data Retrieval Loop | Loop backwards one week at a time from end_date until reaching the minimum start date (min_start_date). For each week, query the database for the cumulative sales volume and append the result to sale_volume_date_df. |
| Determine Optimal Date Range | Sort sale_volume_date_df by the number of days, then iterate through the data frame and stop at the first date range that meets a sales volume threshold: more than 30 days with sales volume > 10,000, or more than 14 days with sales volume > 50,000. |
| Return Results | Return the start date, end date, and sales volume for the optimal date range. |
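
The loop can be sketched roughly as follows. This is a simplified illustration of the logic rather than the production implementation; the query, table, and column names (agg_sales, sales_date) are assumptions.

```r
# Simplified sketch of find_dates -- not the production code.
find_dates <- function(vsdb, product_id, start_date, end_date) {
  end_date       <- as.POSIXct(end_date)
  min_start_date <- as.POSIXct(start_date)
  start_date     <- end_date - 7 * 24 * 60 * 60   # one week before end_date

  sale_volume_date_df <- data.frame()

  # Loop backwards one week at a time, recording the cumulative sales volume
  while (start_date >= min_start_date) {
    sales_vol <- DBI::dbGetQuery(vsdb, sprintf(
      "SELECT COUNT(*) AS sales_vol FROM agg_sales
        WHERE product_id = %d AND sales_date BETWEEN '%s' AND '%s'",
      as.integer(product_id), start_date, end_date))$sales_vol

    sale_volume_date_df <- rbind(sale_volume_date_df, data.frame(
      start_date = start_date,
      days       = as.numeric(difftime(end_date, start_date, units = "days")),
      sales_vol  = sales_vol))

    start_date <- start_date - 7 * 24 * 60 * 60
  }

  # Shortest window first, then stop at the first range meeting a threshold
  sale_volume_date_df <- sale_volume_date_df[order(sale_volume_date_df$days), ]
  chosen <- sale_volume_date_df[nrow(sale_volume_date_df), ]  # fall back to the longest range
  for (i in seq_len(nrow(sale_volume_date_df))) {
    row <- sale_volume_date_df[i, ]
    if ((row$days > 30 && row$sales_vol > 10000) ||
        (row$days > 14 && row$sales_vol > 50000)) {
      chosen <- row
      break
    }
  }

  list(chosen$start_date, end_date, chosen$sales_vol)
}
```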

Example

result <- find_dates(vsdb, product_id, "2023-01-01 00:00:00", "2023-03-31 23:59:59")
start_date <- result[[1]]
end_date <- result[[2]]
sales_volume <- result[[3]]

Function Explanation: load_data

Purpose

Once we have worked out our start_date and sales_volume using the logic above, we pass these parameters into the load_data function.

The load_data function requires five parameters:

  1. Start Date (calculated above).
  2. End Date (passed in as an initial parameter).
  3. Sales Volume (calculated above).
  4. Product ID (passed in as an initial parameter).
  5. VSDB Connection (MySQL connection).

The aim of the load_data function is to gather all the data required to train a model from multiple tables in our database. It also creates a temporary table, tmp_train_visits, made up of a random sample of sale visit IDs (visit_id) with a sample size equal to our sales volume limit (parameter sales_vol), and a random sample of non-sale visit IDs (visit_id) with a sample size equal to eight times the sales volume limit, restricted to the date range passed in and to the given product. We select eight times as many non-sale visits as sale visits because of model performance: the sales proportion for a client typically sits at around 1%, which causes the model to be overly cautious in assigning a high propensity to visits and sales. This temporary table is joined onto every other aggregate table we select from, ensuring we only pull our sample of visits.
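
A rough illustration of how the tmp_train_visits sample might be built; the table and column names (agg_visits, is_sale, visit_date) are assumptions rather than the production schema.

```r
# Illustrative only: agg_visits / is_sale / visit_date are assumed names.
DBI::dbExecute(vsdb, sprintf(
  "CREATE TEMPORARY TABLE tmp_train_visits AS
   SELECT visit_id FROM (
     (SELECT visit_id FROM agg_visits
       WHERE product_id = %d AND is_sale = 1
         AND visit_date BETWEEN '%s' AND '%s'
       ORDER BY RAND() LIMIT %d)
     UNION ALL
     (SELECT visit_id FROM agg_visits
       WHERE product_id = %d AND is_sale = 0
         AND visit_date BETWEEN '%s' AND '%s'
       ORDER BY RAND() LIMIT %d)
   ) AS sampled_visits",
  as.integer(product_id), start_date, end_date, as.integer(sales_vol),
  as.integer(product_id), start_date, end_date, as.integer(sales_vol) * 8L))
```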

Model Training (Keras)

Once the data has been fetched and merged from the database, we start modelling. Our models are built using Keras on a TensorFlow backend. We test different combinations of hyperparameters (nodes in each layer, dropout rates, learning rates, optimisers, etc.) in order to find the optimal set of hyperparameters on which to train our model (see below).

We loop through each combination of hyperparameters, passing each combination into the model_tune function, as sketched below.
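
A sketch of the tuning loop; the candidate values in the grid are illustrative rather than the production search space, and we assume model_tune returns a one-row data frame of metrics.

```r
# Illustrative hyperparameter grid -- not the production search space.
hyper_grid <- expand.grid(
  node1        = c(64, 128),
  node2        = c(32, 64),
  dropout1     = c(0.3, 0.5),
  dropout2     = c(0.3, 0.5),
  optimizer    = c("adam", "rmsprop"),
  lr_annealing = c(0.1, 0.5),
  stringsAsFactors = FALSE
)

results <- list()
for (i in seq_len(nrow(hyper_grid))) {
  h <- hyper_grid[i, ]
  results[[i]] <- model_tune(cubed_data,
                             node1 = h$node1, node2 = h$node2,
                             dropout1 = h$dropout1, dropout2 = h$dropout2,
                             optimizer = h$optimizer,
                             lr_annealing = h$lr_annealing,
                             batch_size = 32, epochs = 50)
}

# Collect the metrics of every run into the result table
result <- do.call(rbind, results)
```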

Function Documentation: model_tune

Purpose

The model_tune function trains a neural network model on sales data, normalizes the data, splits it into training and testing sets, and evaluates the model's performance.

Parameters

  • cubed_data: Data frame containing the sales data.
  • node1: Number of nodes in the first hidden layer.
  • node2: Number of nodes in the second hidden layer.
  • dropout1: Dropout rate for the first hidden layer.
  • dropout2: Dropout rate for the second hidden layer.
  • optimizer: Optimizer to use for training the model.
  • lr_annealing: Learning rate annealing factor.
  • batch_size: Batch size for training.
  • epochs: Number of epochs for training.

Process

| Step | Description |
| --- | --- |
| Data Preparation | Remove the visit_id, sales_date, and product_id columns. Normalise the data by removing outliers and scaling features using the mean and standard deviation. Outliers are removed with the IQR only when computing the mean/sd (they are not removed from the training data), since a mean computed with outliers included can be heavily skewed. |
| Train/Test Split | Split the data into training (80%) and testing (20%) sets. |
| Recipe Creation | Create a recipe (from the recipes package) which one-hot encodes any categorical columns. |
| Model Building | Build a sequential neural network model with the specified layers and dropout rates, then compile it with the specified optimizer and loss function. |
| Model Training | Train the model on the training data, using early stopping and learning rate reduction callbacks. |
| Model Evaluation | Calculate loss and accuracy metrics for the training and validation sets; predict probabilities and classes for the test set; calculate precision and recall. |
| Return Results | Once the Keras model has been trained on the training data, evaluate it on the remaining 20% of unseen data (the test set). The resulting metrics of each model (accuracy, loss, precision) are stored, alongside the combination of hyperparameters, in a table named result; the results of each model are used to choose the optimal combination of hyperparameters. |
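
A condensed sketch of the model-building and training steps in R Keras. The layer sizes, dropout rates, optimizer, and learning rate annealing come from the parameters above; x_train/y_train and the callback patience values are assumptions for illustration.

```r
library(keras)

# x_train / y_train: normalised features and binary sale labels from the
# 80/20 split above (names assumed for illustration).
model <- keras_model_sequential() %>%
  layer_dense(units = node1, activation = "relu",
              input_shape = ncol(x_train)) %>%
  layer_dropout(rate = dropout1) %>%
  layer_dense(units = node2, activation = "relu") %>%
  layer_dropout(rate = dropout2) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  loss      = "binary_crossentropy",
  optimizer = optimizer,
  metrics   = "accuracy"
)

history <- model %>% fit(
  x_train, y_train,
  epochs           = epochs,
  batch_size       = batch_size,
  validation_split = 0.2,
  callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = 5),
    callback_reduce_lr_on_plateau(factor = lr_annealing, patience = 3)
  )
)
```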
Example

result <- model_tune(cubed_data, node1 = 64, node2 = 32, dropout1 = 0.5, dropout2 = 0.5, optimizer = "adam", lr_annealing = 0.1, batch_size = 32, epochs = 50)

Model Save (S3 Bucket Push)

Finally, now that we have the optimal combination of hyperparameters (the optimal structure on which to build a model with our data), and hence our optimal model, we can save (push) the model to the S3 bucket for use in real-time propensity predictions via the DPS.

The model saved in S3 is then used for real-time predictions (visscore).
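
A sketch of the save-and-push step, assuming the keras and aws.s3 packages; the local file name, object key, and the optimal_model variable are illustrative.

```r
library(keras)
library(aws.s3)

# Save the optimal model locally, then push it to the S3 bucket.
save_model_hdf5(optimal_model, "propensity_model.h5")

put_object(
  file   = "propensity_model.h5",
  object = sprintf("models/%s/propensity_model.h5", product_id),  # illustrative key
  bucket = s3_url,
  key    = aws_access_key,
  secret = aws_secret_key
)
```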

Database Insertions

  • Inserts or updates model metadata in attrib_propensity_model.
  • Inserts model run details in attrib_propensity_model_run.
  • Inserts hyperparameter and performance metrics in agg_propensity_model_train_hyperparameter (see the sketch below).
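
As a rough illustration of the last of these insertions, writing the tuning result table out in one call (this assumes the columns of result line up with the table's schema):

```r
# Illustrative only: assumes `result` (hyperparameters plus accuracy, loss,
# precision) matches the columns of the target table.
DBI::dbWriteTable(vsdb, "agg_propensity_model_train_hyperparameter",
                  result, append = TRUE, row.names = FALSE)
```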

Deployment

For deployment of this tool/model, please see deployment.