During my Christmas vacation last year, I converted this site from WordPress to Hugo; while I’ve been happy with the change, a couple of features are missing. One of these is the related content section that appeared at the bottom of each post, and I wanted to get it back.
Thankfully, Hugo has native support for Related Content, so I was hoping this would be a simple task; unfortunately, there’s a note in the documentation that made things substantially more complicated:
We currently do not index Page content.
You see, Hugo uses the front matter of each post to determine which posts are related; this includes keywords, tags, and the published date, but not the content of the post itself. So if you’ve carefully added keywords to all of your posts, there’s no issue, and you can stop reading now.
For me, this is a challenge. Over the years, this site has been manually curated HTML, WordPress (multiple times), Octopress, Jekyll, and now Hugo. While I maintained keywords in the past, they’ve not made it through all of the conversions, and I’ve stopped adding them to new posts. The prospect of manually adding keywords to all of my posts going back to 2003 wasn’t exactly exciting.
After a bit of research, and being disappointed that I couldn’t find anyone who had already solved this, I set out to find a viable solution. Thankfully, academia has largely addressed this already with keyword extraction, a branch of Natural Language Processing.
There are a variety of methods and techniques available, as well as libraries in a wide range of languages. Given that Python is popular in the academic data science community, I focused on that. I started by looking into TF-IDF (Term Frequency-Inverse Document Frequency), which initially appeared to be a viable choice. However, after finding a useful comparison of the results, I looked at RAKE (a popular library, but abandoned, and thanks to a bug it does better in short demos than on real tasks) and BERT (another popular option, which I moved on from after digging into its test cases), and finally settled on TextRank (PDF) as implemented in Gensim.
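For a sense of what this looks like in practice, here’s a minimal sketch using Gensim’s keywords() function. This assumes the 3.8.x series of Gensim, which still ships the summarization module; the sample text is made up purely for illustration, and the exact output will vary with the input:

from gensim.summarization import keywords

# hypothetical sample text, just to illustrate the call
text = (
    "Hugo is a static site generator written in Go. It builds pages quickly "
    "and supports related content based on front matter such as keywords and "
    "tags. Generating keywords automatically from the post content lets the "
    "related content feature work without manual tagging."
)

# ask TextRank for the top terms; the result is a newline-separated string
print(keywords(text, words=5, lemmatize=True).split('\n'))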
Using TextRank, I could read all of my posts, generate keywords, and then, using python-frontmatter, write the updated front matter back to each file; from there, Hugo’s native Related Content feature takes care of the rest.
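python-frontmatter handles the round trip of reading and rewriting the front matter. As a rough sketch (the file path and keyword list below are made up purely for illustration), updating a single post looks something like this:

import frontmatter

# hypothetical post path and keywords, for illustration only
path = "content/posts/example-post.md"

post = frontmatter.load(path)
post["keywords"] = ["hugo", "related content", "textrank"]

with open(path, "w", encoding="utf8") as f:
    f.write(frontmatter.dumps(post))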
Hugo’s Related Content feature is relatively simple and adds little time to the build process, so there’s little impact on workflow, and the live reload feature works just as well. While it isn’t particularly sophisticated, it works well enough to get the job done.
Using it can be as simple as adding a related.html file to your partials directory with a bit of code:
{{ $related := .Site.RegularPages.Related . | first 5 }}
{{ with $related }}
  <h3>See Also</h3>
  <ul>
    {{ range . }}
      <li><a href="{{ .RelPermalink }}">{{ .Title }}</a></li>
    {{ end }}
  </ul>
{{ end }}
Then include it with {{ partial "related" . }} in your partials/article.html file. Simple as can be.
I wrote a relatively simple Python script to generate keywords for all posts, and this can be run before building the site (or committing changes, if you have an automated workflow). This script allows me to have exactly what I wanted: content-based related content, without manually managing keywords.
Important: this script replaces all keywords; if you already have keywords on your posts, they will be overwritten.
This script is set up for my needs and my workflow; if you wish to use it, you may need to make some changes to ensure that it works correctly for you.
#!/usr/bin/env python3
###############################################################################
# build_keywords.py
# Copyright (c) 2021 Adam Caudill <[email protected]>
# Released under the MIT license.
#
# This script will scan for *.md files and generate a list of likely keywords.
# It is intended to be used for automatic keyword generation for Hugo blogs,
# and makes certain assumptions based on that use case. It may need to be
# modified for other use cases.
#
# To install requirements:
# pip3 install --user numpy scipy gensim==3.8.3 markdown beautifulsoup4 \
#   requests python-frontmatter
#
###############################################################################
import os, glob, re, requests, frontmatter
from gensim.summarization import keywords
from markdown import markdown
from bs4 import BeautifulSoup


def get_keywords(file):
    with open(file, encoding='utf8') as f:
        content = f.read()

    # strip the front matter - it doesn't help us
    content = re.sub(r'---\n(.*?)---\n', '', content,
                     flags=re.MULTILINE | re.DOTALL)

    # cleanup the markdown, otherwise https will be a popular keyword
    content = markdown_to_text(content)

    kw = keywords(content, words=20, lemmatize=True).split('\n')

    # remove some additional stop words that add noise to the results
    kw = strip_stop_words(kw)

    # very short articles produce junk matches, so don't give them keywords
    # using split() this way isn't quite accurate, but it's close enough
    if len(content.split()) < 250:
        kw = None

    update_yaml(file, kw)


def markdown_to_text(content):
    # this is a fairly bad hack, but works
    html = markdown(content)

    # get rid of code blocks, as it's noise
    html = re.sub(r'<code>(.*?)</code>', '', html,
                  flags=re.MULTILINE | re.DOTALL)

    soup = BeautifulSoup(html, "html.parser")
    text = ''.join(soup.findAll(text=True))

    return text


def update_yaml(file, keywords):
    # sort the list of keywords; this loses the order of what's most
    # important, but it makes the output deterministic, so there's less
    # noise when the script is run again in the future.
    if keywords:
        keywords.sort()

    with open(file, 'r', encoding='utf8') as f:
        post = frontmatter.load(f)

        # set new keywords, overwriting any that are already present
        post['keywords'] = keywords

        update = frontmatter.dumps(post)

    with open(file, 'w', encoding='utf8') as f:
        f.write(update)


def strip_stop_words(words):
    # first, split the keywords, as this yields better results
    words = [word for keyword in words for word in keyword.split()]
    words = list(dict.fromkeys(words))

    return [x for x in words if x not in stopwords]


if __name__ == "__main__":
    # source: https://gist.github.com/sebleier/554280
    stopwords_list = requests.get("https://web.archive.org/web/20210906074043/https://gist.githubusercontent.com/rg089" +
                                  "/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657" +
                                  "330b024b959c/stopwords.txt").content
    stopwords = set(stopwords_list.decode().splitlines())

    # add any custom keywords to be stripped here
    stopwords.add("caudill")
    stopwords.add("adam")
    stopwords.add("adamcaudill")

    for file in glob.glob(os.path.join(os.getcwd(), "content/posts/*.md")):
        get_keywords(file)
Hopefully, others will find this helpful in working around this limitation in Hugo.
