During my Christmas vacation last year, I converted this site from WordPress to Hugo; while I’ve been happy with the change, a couple of features are missing. One of these is the related content section that appeared at the bottom of each post, and I wanted to get it back.
Thankfully, Hugo has native support for Related Content, so I was hoping this would be a simple task; unfortunately, there’s a note in the documentation that made things substantially more complicated:
We currently do not index Page content.
You see, Hugo uses the front matter of each post to determine which posts are related; this includes keywords, tags, and the published date, but not the content of the post itself. So if you’ve carefully added keywords to all of your posts, there’s no issue, and you can stop reading now.
For me, this is a challenge. Over the years, this site has been manually curated HTML, WordPress (multiple times), Octopress, Jekyll, and now Hugo. While I maintained keywords in the past, they’ve not made it through all of the conversions, and I’ve stopped adding them to new posts. The prospect of manually adding keywords to all of my posts going back to 2003 wasn’t exactly exciting.
After a bit of research, and being disappointed that I couldn’t find anyone who had already solved this, I set out to find a viable solution. Thankfully, academia has largely addressed this already with keyword extraction, a branch of Natural Language Processing.
There are a variety of methods and techniques available, as well as libraries in a wide range of languages. Given that Python is popular in the academic data science community, I focused on that. I started by looking into TF-IDF (Term Frequency-Inverse Document Frequency), which initially appeared to be a viable choice. However, after finding a useful comparison of the results, I looked at RAKE (a popular library, but abandoned, and thanks to a bug it does better in short demos than on real tasks) and BERT (another popular option, which I moved on from after digging into its test cases), and finally settled on TextRank (PDF) as implemented in Gensim.
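For a sense of what this looks like in practice, here’s a minimal sketch using Gensim’s keywords() function. This assumes the 3.8.x series of Gensim, which still ships the summarization module; the sample text is made up purely for illustration, and the exact output will vary with the input:

from gensim.summarization import keywords

# hypothetical sample text, just to illustrate the call
text = (
    "Hugo is a static site generator written in Go. It builds pages quickly "
    "and supports related content based on front matter such as keywords and "
    "tags. Generating keywords automatically from the post content lets the "
    "related content feature work without manual tagging."
)

# ask TextRank for the top terms; the result is a newline-separated string
print(keywords(text, words=5, lemmatize=True).split('\n'))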
Using TextRank, I could read all of my posts, generate keywords, and then, using python-frontmatter, write the updated front matter back to each file; from there, Hugo’s native Related Content feature takes care of the rest.
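python-frontmatter handles the round trip of reading and rewriting the front matter. As a rough sketch (the file path and keyword list below are made up purely for illustration), updating a single post looks something like this:

import frontmatter

# hypothetical post path and keywords, for illustration only
path = "content/posts/example-post.md"

post = frontmatter.load(path)
post["keywords"] = ["hugo", "related content", "textrank"]

with open(path, "w", encoding="utf8") as f:
    f.write(frontmatter.dumps(post))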
Hugo’s Related Content feature is relatively simple and adds little time to the build process, so there’s little impact on workflow, and the live reload feature works just as well. While it isn’t particularly sophisticated, it works well enough to get the job done.
Using it can be as simple as adding a related.html file to your partials directory with a bit of code:
{{ $related := .Site.RegularPages.Related . | first 5 }}
{{ with $related }}
  <h3>See Also</h3>
  <ul>
    {{ range . }}
      <li><a href="{{ .RelPermalink }}">{{ .Title }}</a></li>
    {{ end }}
  </ul>
{{ end }}
Then include it with {{ partial "related" . }} in your partials/article.html file. Simple as can be.
I wrote a relatively simple Python script to generate keywords for all posts, and this can be run before building the site (or committing changes, if you have an automated workflow). This script allows me to have exactly what I wanted: content-based related content, without manually managing keywords.
Important: this script replaces all keywords; if you already have keywords on your posts, they will be overwritten.
This script is set up for my needs and my workflow; if you wish to use it, you may need to make some changes to ensure that it works correctly for you.
#!/usr/bin/env python3
###############################################################################
# build_keywords.py
# Copyright (c) 2021 Adam Caudill <[email protected]>
# Released under the MIT license.
#
# This script will scan for *.md files and generate a list of likely keywords.
# It is intended to be used for automatic keyword generation for Hugo blogs,
# and makes certain assumptions based on that use case. It may need to be
# modified for other use cases.
#
# To install requirements:
# pip3 install --user numpy scipy gensim==3.8.3 markdown beautifulsoup4 \
#   requests python-frontmatter
#
###############################################################################
import os, glob, re, requests, frontmatter
from gensim.summarization import keywords
from markdown import markdown
from bs4 import BeautifulSoup


def get_keywords(file):
    with open(file, encoding='utf8') as f:
        content = f.read()

    # strip the front matter - it doesn't help us
    content = re.sub(r'---\n(.*?)---\n', '', content,
                     flags=re.MULTILINE | re.DOTALL)

    # cleanup the markdown, otherwise https will be a popular keyword
    content = markdown_to_text(content)

    kw = keywords(content, words=20, lemmatize=True).split('\n')

    # remove some additional stop words that add noise to the results
    kw = strip_stop_words(kw)

    # very short articles produce junk matches, so don't give them keywords
    # using split() this way isn't quite accurate, but it's close enough
    if len(content.split()) < 250:
        kw = None

    update_yaml(file, kw)


def markdown_to_text(content):
    # this is a fairly bad hack, but works
    html = markdown(content)

    # get rid of code blocks, as it's noise
    html = re.sub(r'<code>(.*?)</code>', '', html,
                  flags=re.MULTILINE | re.DOTALL)

    soup = BeautifulSoup(html, "html.parser")
    text = ''.join(soup.findAll(text=True))

    return text


def update_yaml(file, keywords):
    # sort the list of keywords; this loses the order of what's most
    # important, but it makes the output deterministic, so there's less
    # noise when the script is run again in the future.
    if keywords:
        keywords.sort()

    with open(file, 'r', encoding='utf8') as f:
        post = frontmatter.load(f)

        # set new keywords, overwriting any that are already present
        post['keywords'] = keywords

        update = frontmatter.dumps(post)

    with open(file, 'w', encoding='utf8') as f:
        f.write(update)


def strip_stop_words(words):
    # first, split the keywords, as this yields better results
    words = [word for keyword in words for word in keyword.split()]
    words = list(dict.fromkeys(words))

    return [x for x in words if x not in stopwords]


if __name__ == "__main__":
    # source: https://gist.github.com/sebleier/554280
    stopwords_list = requests.get("https://web.archive.org/web/20210906074043/https://gist.githubusercontent.com/rg089" +
                                  "/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657" +
                                  "330b024b959c/stopwords.txt").content
    stopwords = set(stopwords_list.decode().splitlines())

    # add any custom keywords to be stripped here
    stopwords.add("caudill")
    stopwords.add("adam")
    stopwords.add("adamcaudill")

    for file in glob.glob(os.path.join(os.getcwd(), "content/posts/*.md")):
        get_keywords(file)
Hopefully, others will find this helpful in working around this limitation in Hugo.
