
Update Keyword Relevance

Keyword relevance is the relevance of a keyword to a given webpage. It is a key SEO metric because it is a factor in determining the quality of a webpage: the higher the keyword relevance, the more likely the webpage is to rank for that keyword. It is therefore important to map each keyword to its most relevant webpage, and this is what the keyword relevance tool is for. The tool uses term frequency-inverse document frequency (TF-IDF) to determine the relevance of a keyword to a webpage. The tool is run on the rsrv.

TF-IDF

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. (Source: Wikipedia)
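To make the definition concrete, the sketch below computes one common TF-IDF variant (length-normalised term frequency multiplied by the natural log of the inverse document frequency) in plain Python. The function name `tf_idf` and the toy documents are illustrative assumptions; the tool itself may use a different weighting or a library implementation.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus_tokens):
    """One common TF-IDF variant: normalised term frequency * log(N / df).

    Illustrative only; the tool's exact weighting is not documented here.
    """
    if not doc_tokens:
        return 0.0
    # Term frequency: share of the document's tokens equal to the term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of documents containing the term.
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    if df == 0:
        return 0.0
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(len(corpus_tokens) / df)
    return tf * idf


# Toy corpus (hypothetical page contents, already tokenised).
docs = [
    "cheap flights to paris".split(),
    "cheap hotels in paris".split(),
    "train tickets to berlin".split(),
]

flights_score = tf_idf("flights", docs[0], docs)  # appears in 1 of 3 documents
cheap_score = tf_idf("cheap", docs[0], docs)      # appears in 2 of 3 documents
```

Both words occur once in the first document, but "flights" scores higher than "cheap" because it appears in fewer documents across the corpus.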

Code Structure

The keyword relevance tool is made up of three classes: Corpus, Keywords, and TfidfKeywordScorer. The Corpus class is responsible for creating a corpus of webpages and their content. The Keywords class is responsible for creating the set of keywords whose relevance we want to categorise. The TfidfKeywordScorer class is responsible for scoring keywords based on their relevance to a webpage. An illustration of the pipeline can be found below.

Corpus

This class is responsible for creating a generator of webpages and their content. The corpus is created by extracting the content from the attrib_page_content table. The class itself is an exhaustive generator, meaning that it yields every webpage in the database. The class also has an attribute called path_ids, which returns a list of the path_ids from attrib_path whose content exists in attrib_page_content.
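The behaviour described above can be sketched as follows. This is a minimal illustration, not the actual class: the name `CorpusSketch`, the use of an in-memory SQLite database, and the column names `path_id` and `content` are all assumptions. Only the table names attrib_page_content and attrib_path come from this page.

```python
import sqlite3


class CorpusSketch:
    """Hypothetical stand-in for Corpus: iterates over every stored page."""

    def __init__(self, conn):
        self.conn = conn

    def __iter__(self):
        # Exhaustive generator: yields (path_id, content) for every row.
        # Column names are assumed for illustration.
        cursor = self.conn.execute(
            "SELECT path_id, content FROM attrib_page_content ORDER BY path_id"
        )
        for path_id, content in cursor:
            yield path_id, content

    @property
    def path_ids(self):
        # Paths from attrib_path that have content in attrib_page_content.
        cursor = self.conn.execute(
            "SELECT p.path_id FROM attrib_path AS p "
            "WHERE EXISTS (SELECT 1 FROM attrib_page_content AS c "
            "              WHERE c.path_id = p.path_id) "
            "ORDER BY p.path_id"
        )
        return [row[0] for row in cursor]


# Toy database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attrib_path (path_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE attrib_page_content (path_id INTEGER, content TEXT)")
conn.executemany("INSERT INTO attrib_path VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO attrib_page_content VALUES (?, ?)",
    [(1, "cheap flights to paris"), (2, "cheap hotels in paris")],
)

corpus = CorpusSketch(conn)
pages = list(corpus)
ids_with_content = corpus.path_ids
```

In this toy data, path 3 exists in attrib_path but has no stored content, so it is neither yielded by the generator nor listed in `path_ids`.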

Keywords

This class is responsible for creating the set of keywords whose relevance we want to categorise. The existing_keywords parameter controls where the keywords are extracted from. If existing_keywords is set to True, the keywords are extracted from the seogd_market_keyword table (where active=True). If existing_keywords is set to False, the keywords are extracted from the kwr_semrush_volume table. After processing, the class is also responsible for inserting the categorised keywords into both kwr_keyword_relevance and kwr_keyword_relevance_path.
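The effect of the existing_keywords parameter can be pictured with a small sketch. The helper name `keyword_source_query` and the `SELECT *` column list are hypothetical; only the table names and the active=True filter come from the description above, and the real class may build its queries differently.

```python
def keyword_source_query(existing_keywords: bool) -> str:
    """Return an illustrative query for the keyword source table.

    Hypothetical helper: the column list is a placeholder.
    """
    if existing_keywords:
        # Existing keywords: only active rows from seogd_market_keyword.
        return "SELECT * FROM seogd_market_keyword WHERE active = TRUE"
    # Otherwise the keywords come from kwr_semrush_volume.
    return "SELECT * FROM kwr_semrush_volume"


existing_query = keyword_source_query(True)
new_query = keyword_source_query(False)
```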

TfidfKeywordScorer

This class is responsible for scoring keywords based on their relevance to a webpage. The class is initialised with a Corpus object and a list of keywords. It then iterates through each keyword and webpage and calculates the TF-IDF score. It returns both scored_queries and ranking_paths, which are nested tuples whose entries have the form (keyword_market_id, total_relevance_score) and (keyword_market_id, path_id, path_relevance_score) respectively.
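The shape of the two return values can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `score_keywords`, whitespace tokenisation, the TF-IDF variant, and in particular the choice to define total_relevance_score as the sum of a keyword's per-path scores. The actual class may compute these differently.

```python
import math
from collections import Counter


def score_keywords(corpus, keywords):
    """Hypothetical scorer returning (scored_queries, ranking_paths).

    corpus:   iterable of (path_id, content)
    keywords: iterable of (keyword_market_id, keyword_text)
    """
    # Token counts per page, using naive whitespace tokenisation.
    docs = {pid: Counter(text.lower().split()) for pid, text in corpus}
    n_docs = len(docs)

    # Document frequency: in how many pages each term occurs.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())

    ranking_paths = []
    totals = {}
    for keyword_market_id, keyword_text in keywords:
        terms = keyword_text.lower().split()
        total = 0.0
        for path_id, counts in docs.items():
            length = sum(counts.values())
            # Path score: sum of TF-IDF over the keyword's terms.
            score = sum(
                (counts[t] / length) * math.log(n_docs / df[t])
                for t in terms
                if counts[t]
            )
            if score > 0:
                ranking_paths.append((keyword_market_id, path_id, score))
            total += score
        # Assumption: total relevance is the sum of the path scores.
        totals[keyword_market_id] = total

    return tuple(totals.items()), tuple(ranking_paths)


corpus = [
    (1, "cheap flights to paris"),
    (2, "cheap hotels in paris"),
    (3, "train tickets to berlin"),
]
keywords = [(10, "paris flights"), (11, "berlin")]

scored_queries, ranking_paths = score_keywords(corpus, keywords)
# Most relevant path for keyword 10, taken from ranking_paths.
best_for_10 = max((r for r in ranking_paths if r[0] == 10), key=lambda r: r[2])
```

With this toy data, keyword 10 ("paris flights") matches paths 1 and 2, and path 1 ranks highest because it contains both terms.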

The following diagram illustrates the overall pipeline of the keyword relevance tool.

[Figure: Keyword Relevance Pipeline]