Update Keyword Relevance
Keyword relevance is the relevance of a keyword to a given webpage. It is a key SEO metric because it is one of the factors that determine the quality of a webpage: the higher the keyword relevance, the more likely the webpage is to rank for that keyword. It is therefore important to map each keyword to its most relevant webpage, which is the job of the keyword relevance tool. The tool uses term frequency-inverse document frequency (TF-IDF) to determine the relevance of a keyword to a webpage. The tool is run on the rsrv.
TF-IDF
TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. (Source: Wikipedia)
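To make the definition concrete, here is a minimal sketch of one common TF-IDF variant (term frequency normalised by document length, multiplied by the natural log of the inverse document frequency). The function name tf_idf and the toy corpus are hypothetical and for illustration only; the tool itself may use a different weighting or smoothing scheme.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Score `term` for one tokenised `document` within a tokenised `corpus`.

    Assumed variant: tf = count / document length, idf = ln(N / document frequency).
    """
    if not document:
        return 0.0
    # Term frequency: share of the document's tokens that are `term`.
    tf = Counter(document)[term] / len(document)
    # Document frequency: number of documents containing `term` at least once.
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(len(corpus) / containing)
    return tf * idf


# Toy corpus of three tokenised pages (illustrative data only).
demo_corpus = [
    ["cheap", "flights", "to", "paris"],
    ["paris", "hotel", "deals"],
    ["cheap", "car", "hire"],
]

flights_score = tf_idf("flights", demo_corpus[0], demo_corpus)
paris_score = tf_idf("paris", demo_corpus[0], demo_corpus)
```

In this toy corpus "flights" appears on one page while "paris" appears on two, so "flights" receives the higher score for the first page even though both words occur once in it.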
Code Structure
The keyword relevance tool is made up of three classes: Corpus, Keywords, and TfidfKeywordScorer. The Corpus class is responsible for creating a corpus of webpages and their content. The Keywords class is responsible for creating the set of keywords whose relevance we want to categorise. The TfidfKeywordScorer class is responsible for scoring keywords based on their relevance to a webpage. An illustration of the pipeline can be found below.
Corpus
This class is responsible for creating a generator of webpages and their content. The corpus is created by extracting the content from the attrib_page_content table. The class itself is an exhaustive generator, meaning that it will yield every webpage in the database. The class also has an attribute called path_ids, which returns a list of the path_ids from attrib_path whose content exists in attrib_page_content.
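The behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the class name CorpusSketch, the column names path_id and content, and the use of an in-memory SQLite database in place of the real one are all hypothetical.

```python
import sqlite3


class CorpusSketch:
    """Illustrative stand-in for Corpus; schema details are assumed."""

    def __init__(self, connection):
        self.connection = connection

    @property
    def path_ids(self):
        # path_ids from attrib_path whose content exists in attrib_page_content.
        rows = self.connection.execute(
            "SELECT p.path_id FROM attrib_path AS p "
            "WHERE EXISTS (SELECT 1 FROM attrib_page_content AS c "
            "              WHERE c.path_id = p.path_id) "
            "ORDER BY p.path_id"
        )
        return [row[0] for row in rows]

    def __iter__(self):
        # Exhaustive generator: yields every stored page, one row at a time.
        cursor = self.connection.execute(
            "SELECT path_id, content FROM attrib_page_content ORDER BY path_id"
        )
        for path_id, content in cursor:
            yield path_id, content


def build_corpus_demo():
    """Create a tiny in-memory database with made-up rows."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE attrib_path (path_id INTEGER PRIMARY KEY);
        CREATE TABLE attrib_page_content (path_id INTEGER, content TEXT);
        INSERT INTO attrib_path VALUES (1), (2), (3);
        INSERT INTO attrib_page_content VALUES
            (1, 'cheap flights to paris'),
            (2, 'paris hotel deals');
        """
    )
    return connection


demo_pages = list(CorpusSketch(build_corpus_demo()))
```

Iterating lazily keeps memory use flat when the content table is large, which is one plausible reason for exposing the corpus as a generator rather than a list.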
Keywords
This class is responsible for creating the set of keywords whose relevance we want to categorise. The keywords can be extracted from the database in two ways, selected by the existing_keywords parameter. If existing_keywords is set to True, the keywords are extracted from the seogd_market_keyword table (where active=True). If existing_keywords is set to False, the keywords are extracted from the kwr_semrush_volume table. After processing, the class is also responsible for inserting the categorised keywords into both kwr_keyword_relevance and kwr_keyword_relevance_path.
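The effect of the existing_keywords flag can be sketched as below. Only the table names, the active filter, and the flag come from the description above; the function name load_keywords, the columns keyword_market_id and keyword on both tables, and the SQLite stand-in database are assumptions made for illustration.

```python
import sqlite3


def load_keywords(connection, existing_keywords):
    """Return (keyword_market_id, keyword) rows from the table chosen by the flag."""
    if existing_keywords:
        # Existing keywords: only rows flagged as active (stored as 1 in SQLite).
        query = (
            "SELECT keyword_market_id, keyword FROM seogd_market_keyword "
            "WHERE active = 1 ORDER BY keyword_market_id"
        )
    else:
        # Otherwise, read the keywords from the SEMrush volume table.
        query = (
            "SELECT keyword_market_id, keyword FROM kwr_semrush_volume "
            "ORDER BY keyword_market_id"
        )
    return connection.execute(query).fetchall()


def build_keywords_demo():
    """Create a tiny in-memory database with made-up rows (assumed schema)."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE seogd_market_keyword (
            keyword_market_id INTEGER, keyword TEXT, active INTEGER);
        CREATE TABLE kwr_semrush_volume (
            keyword_market_id INTEGER, keyword TEXT);
        INSERT INTO seogd_market_keyword VALUES
            (1, 'cheap flights', 1),
            (2, 'old campaign', 0);
        INSERT INTO kwr_semrush_volume VALUES
            (3, 'paris hotel'),
            (4, 'car hire');
        """
    )
    return connection


demo_connection = build_keywords_demo()
existing = load_keywords(demo_connection, existing_keywords=True)
candidates = load_keywords(demo_connection, existing_keywords=False)
```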
TfidfKeywordScorer
This class is responsible for scoring keywords based on their relevance to a webpage. The class is initialised with a Corpus object and a list of keywords. It then iterates through each keyword and webpage and calculates the TF-IDF score. It returns both scored_queries and ranking_paths, which are nested tuples of the form (keyword_market_id, total_relevance_score) and (keyword_market_id, path_id, path_relevance_score) respectively.
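As a rough sketch of the shape of these outputs, the function below scores each keyword against each page and builds the two nested tuples. Several details are assumptions rather than facts about the tool: the function name score_keywords, whitespace tokenisation, scoring a multi-word keyword as the sum of its terms' TF-IDF values, taking total_relevance_score as the sum of the per-path scores, and omitting paths with a zero score from ranking_paths.

```python
import math
from collections import Counter


def score_keywords(corpus, keywords):
    """corpus: iterable of (path_id, content); keywords: (keyword_market_id, keyword) pairs."""
    # Tokenise every page once and count its terms (naive whitespace split).
    documents = [(path_id, Counter(content.lower().split())) for path_id, content in corpus]
    n_docs = len(documents)

    # Document frequency: on how many pages does each term occur?
    doc_freq = Counter()
    for _, counts in documents:
        doc_freq.update(counts.keys())

    scored_queries = []
    ranking_paths = []
    for keyword_market_id, keyword in keywords:
        terms = keyword.lower().split()
        total = 0.0
        for path_id, counts in documents:
            length = sum(counts.values())
            score = 0.0
            for term in terms:
                if counts[term]:
                    tf = counts[term] / length
                    idf = math.log(n_docs / doc_freq[term])
                    score += tf * idf
            if score > 0:
                ranking_paths.append((keyword_market_id, path_id, score))
            total += score  # assumed aggregation: sum over all paths
        scored_queries.append((keyword_market_id, total))
    return tuple(scored_queries), tuple(ranking_paths)


# Illustrative data only.
demo_pages = [
    (1, "cheap flights to paris"),
    (2, "paris hotel deals"),
    (3, "cheap car hire"),
]
demo_keywords = [(10, "paris flights")]
scored_queries, ranking_paths = score_keywords(demo_pages, demo_keywords)
```

With this toy data the keyword matches pages 1 and 2 only, and page 1 scores higher because it contains both terms, so it would be the page the keyword is mapped to.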
The following diagram illustrates the overall pipeline of the keyword relevance tool.