Update Fabric Automated Keyword Research

We currently run a web scraping service via OxyLabs for all of our clients; it runs fortnightly, on the 1st and 14th of every month. The scraped content is inserted into one of our attrib tables, attrib_page_content. Now that we have page content for every path on a given domain, what next? Using KeyBERT - a keyword extraction tool that leverages BERT embeddings - we analyse each page from attrib_page_content to extract the keywords and keyphrases most similar to the page. That last point cannot be overstated: this keyword extraction tool cannot semantically infer new words based on what it has analysed; it can only extract words that already exist within the content. This is probably the model's biggest limitation.

From this you should now be able to see why the term automation is used here: an SEO specialist no longer has to go through each page of a domain and pull out relevant, meaningful keywords by hand; our tool does all of that for you.

So what's going on under the hood?

KeyBERT

Let's start with KeyBERT. The beauty of this package is its simplicity: we can instantiate the model and call the extract_keywords function in one line to get all the functionality we require. Below are all the mandatory and optional parameters needed to get started with KeyBERT.

KeyBERT(model='all-MiniLM-L6-v2').extract_keywords(docs, candidates=None, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=5, min_df=1, use_maxsum=False, use_mmr=False, diversity=0.5, nr_candidates=20, vectorizer=None, highlight=False, seed_keywords=None)

In our case, we need only a subset of these parameters to perform the analysis we require. For clarity, the parameters we utilise and their definitions will be tabulated below.

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| docs | Union[str, List[str]] | The document(s) for which to extract keywords/keyphrases | required |
| keyphrase_ngram_range | Tuple[int, int] | Length, in words, of the extracted keywords/keyphrases | (1, 1) |
| stop_words | Union[str, List[str]] | Stopwords to remove from the document | 'english' |
| top_n | int | Return the top n keywords/keyphrases | 5 |
| seed_keywords | List[str] | Seed keywords that may guide the extraction by steering the similarities towards the seeded keywords | None |

Unsurprisingly, we do not use the default values shown above. As of writing, keyphrase_ngram_range=(1, 3); that is to say, we pull keywords and keyphrases of up to and including three words. Without external input from the client we use typical English stop words (the default value, stop_words='english'); there is scope, however, for the client to supply blacklisted keywords, which we append to this stop-word list. Finally, we use top_n=15 as a baseline, but again, the client can adjust this number as they see fit. It should be noted that the higher top_n is, the lower the quality of the output.
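Putting these parameter choices together, a call in our script might look like the following sketch. The helper and argument names (build_stop_words, client_blacklist, extract_page_keywords) are illustrative assumptions, not the actual implementation; the parameter values are the baselines described above.

```python
def build_stop_words(client_blacklist=None):
    """Return a stop_words value for KeyBERT: the default 'english' string,
    or the full English list with client-blacklisted terms appended."""
    if not client_blacklist:
        return "english"
    # The vectorizer used internally by KeyBERT also accepts an explicit list,
    # so we can extend scikit-learn's English stop words with the blacklist.
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    return list(ENGLISH_STOP_WORDS) + [w.lower() for w in client_blacklist]

def extract_page_keywords(page_content, client_blacklist=None, top_n=15):
    """Run KeyBERT with our baseline configuration (requires `pip install keybert`)."""
    from keybert import KeyBERT
    kw_model = KeyBERT(model="all-MiniLM-L6-v2")
    return kw_model.extract_keywords(
        page_content,
        keyphrase_ngram_range=(1, 3),  # keywords/keyphrases up to three words
        stop_words=build_stop_words(client_blacklist),
        top_n=top_n,                   # baseline of 15; client-adjustable
    )
```

Because extract_keywords returns (keyword, similarity score) pairs, downstream code can filter on the score as well as the rank.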

Lastly, the model is instantiated by default with model='all-MiniLM-L6-v2'. This embedding model is targeted at English documents and as such does not perform optimally on multilingual pages. We therefore have scope within our script to use a multilingual model, 'paraphrase-multilingual-MiniLM-L12-v2'; using this second embedding requires us to install the sentence-transformers package before passing the model to KeyBERT.
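As a sketch, the model selection could be expressed as a small helper. The function name and language flag are assumptions; the two model identifiers are the ones named above, and KeyBERT loads either of them via the sentence-transformers package.

```python
def choose_embedding_model(language: str = "en") -> str:
    """Pick the embedding model name to hand to KeyBERT(model=...).

    The default model performs best on English pages; non-English or
    mixed-language domains fall back to the multilingual model, which
    requires the sentence-transformers package to be installed.
    """
    if language == "en":
        return "all-MiniLM-L6-v2"
    return "paraphrase-multilingual-MiniLM-L12-v2"
```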

Code Structure

The code for this particular piece is split across two files: update_automated_keyword_research.py and update_automated_keyword_research_topic.py. Because we need to handle different use cases for automated keyword research, we have a base class, BaseAutomatedKWRHandler, from which all other handler classes inherit. These classes are AutomatedKWRHandlerAll, AutomatedKWRHandlerSelected, and AutomatedKWRHandlerTopic. The first two are responsible for extracting keywords from a client's website; the last extracts keywords from a client's website under a specific topic. The following diagram illustrates the code pipeline.

Class Hierarchy

The only difference between the classes is the set of pages they pull from the database: perhaps unsurprisingly, AutomatedKWRHandlerAll pulls all pages, AutomatedKWRHandlerSelected pulls a subset of pages, and AutomatedKWRHandlerTopic pulls a subset of pages under a specific topic. Each class uses the run_automated_keyword_research() method inherited from BaseAutomatedKWRHandler to instantiate the worker class AutomatedKWR with its particular configuration. This method is called iteratively over each path and its corresponding market.
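The class and method names below come from the description above, but the constructor arguments and method bodies are illustrative assumptions about how the hierarchy fits together, not the production code:

```python
class BaseAutomatedKWRHandler:
    """Base handler; subclasses decide which (path, market) pairs to analyse."""

    def get_paths(self):
        raise NotImplementedError

    def run_automated_keyword_research(self):
        # In production each iteration instantiates the AutomatedKWR worker
        # with the handler's configuration; here we simply collect the
        # per-path configuration to show the control flow.
        return [(path, market) for path, market in self.get_paths()]


class AutomatedKWRHandlerAll(BaseAutomatedKWRHandler):
    def __init__(self, pages):
        self.pages = pages  # all scraped (path, market) pairs for the domain

    def get_paths(self):
        return self.pages


class AutomatedKWRHandlerSelected(BaseAutomatedKWRHandler):
    def __init__(self, pages, selected_paths):
        self.pages = pages
        self.selected = set(selected_paths)  # client-chosen subset of paths

    def get_paths(self):
        return [(p, m) for p, m in self.pages if p in self.selected]
```

AutomatedKWRHandlerTopic would follow the same pattern, filtering pages to those under a given topic.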

Note

Each path is mapped to a particular market via update_kwr_mapped_market. This script pulls configurations from kwr_market_lookup to determine which market a given path belongs to. The lookup works by us (or the client) manually adding regular-expression patterns to the kwr_market_lookup table; these patterns are then matched against all paths to determine which market each path belongs to.
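The pattern-matching step can be sketched as follows. The function name and the example patterns/markets are illustrative assumptions; only the idea of matching kwr_market_lookup regex rows against paths comes from the text.

```python
import re

def map_path_to_market(path, market_patterns):
    """Return the first market whose regex pattern matches the path.

    market_patterns mimics rows from kwr_market_lookup: (pattern, market).
    """
    for pattern, market in market_patterns:
        if re.search(pattern, path):
            return market
    return None  # no rule matched; path stays unmapped

# Hypothetical lookup rows: a catch-all last rule acts as the default market.
lookup = [(r"^/fr/", "France"), (r"^/de/", "Germany"), (r".*", "UK")]
```

Row order matters with this first-match approach, so broader patterns belong after more specific ones.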

As mentioned in the AKWR overview, AutomatedKWR inserts the newly generated keywords into the following tables: seogd_keyword, seogd_market_keyword, kwr_keyword, and seogd_market_keyword_source.