How we use GitHub Pages as a backend without a CNAME

We built this blog on GitHub Pages, but wanted it to live at wistia.com/engineering. This is how we made it work.

Max Schnur

Engineering

But I want our blog at wistia.com/engineering!

If you’re not familiar with GitHub Pages, go have a look; it’s pretty sweet. It’s super convenient for developers because you don’t need to maintain your own servers. If you’re using Jekyll, you just push your site up to the gh-pages branch, and it’s deployed.

For these reasons, we really wanted to use it for engineering blogging at Wistia. That part was easy: I got a blog set up at wistia.github.io/engineering with little effort.

But to go live, we had a few constraints that made it slightly more difficult:

  1. We want to serve the blog via our wistia.com domain, not a github.io subdomain.
  2. To follow the pattern of our other properties, it should be scoped by path, not by subdomain.
  3. All the content on wistia.com is served via Fastly’s CDN, and we need to make sure our posts are updated when we push.

Serving on a non-github.io domain isn’t unheard of, and the solution is usually to use a CNAME. But a CNAME maps an entire hostname, which breaks condition #2: we want the blog at wistia.com/engineering, a path on a domain that already serves other content. I’ll come back to condition #3, but it turns out we can use HAProxy to solve the first two.

HAProxy Config

The basic front-end config is super simple:

acl url_engineering path_beg /engineering
use_backend engineering if url_engineering

And the backend config is also quite straightforward, though that Host header requires some explaining:

backend engineering
  http-request set-header Host wistia.github.io
  server s1 wistia.github.io:80

Normally an HAProxy server line points at an IP address rather than a hostname, though hostnames are supported. In that vein, HAProxy will resolve wistia.github.io to an IP and forward traffic to that IP directly. But if you were to get the IP for wistia.github.io and try accessing http://the.ip.add.ress/engineering, it would block you. I’m not sure how GitHub has implemented that (probably with HAProxy!), but it’s pretty common; you want people visiting your site via the hostname, not the IP.

As you can see, it’s pretty easy to get around this. Just set the Host header so it looks to GitHub like we’re hitting their domain.
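To make the trick concrete, here is a minimal Python sketch of the kind of request the backend ends up sending: it targets a bare IP but carries the Host header GitHub expects. The function name build_backend_request and the IP 192.0.2.10 (a documentation-only address) are placeholders of mine, not anything from our setup, and the request is only built, never sent.

```python
import urllib.request


def build_backend_request(ip: str, path: str,
                          host: str = "wistia.github.io") -> urllib.request.Request:
    """Build (but do not send) a request to a bare IP with an explicit Host header.

    This mirrors what `http-request set-header Host wistia.github.io` does
    for traffic HAProxy forwards to the resolved IP.
    """
    url = f"http://{ip}{path}"
    return urllib.request.Request(url, headers={"Host": host})


# 192.0.2.10 is a placeholder from the documentation address range.
req = build_backend_request("192.0.2.10", "/engineering/")
print(req.full_url)            # http://192.0.2.10/engineering/
print(req.get_header("Host"))  # wistia.github.io
```

The connection goes to the IP, but the server on the other end sees a request for wistia.github.io and can pick the right site.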

Trailing slashes are the worst

After I got HAProxy configured, I felt pretty good. But then I started clicking around on the blog and noticed something funny. If I clicked any link on the blog, even if it was an absolute URL, I’d end up on the wistia.github.io domain! What the heck!

To cut to the chase, I traced this back to a foible of Jekyll. That is, since Jekyll is so simple and just serves static HTML files, any path you visit must end in a trailing /. That’s how it knows to load the corresponding index.html.

GitHub handles this sanely by sending back a 301 Moved Permanently redirect with the trailing slash if you visit the page without it. This makes total sense, but because the domain in our backend request is wistia.github.io, not wistia.com, it redirects to that domain. Not good!

The solution I arrived at was to handle the trailing slash redirect on our side. If we can perform the redirect on our side before it goes back to GitHub, then GitHub will never have reason to hand us back these off-domain redirects.

Here’s the HAProxy front-end config that handles that:

acl url_engineering path_beg /engineering
acl path_ends_in_slash path_end /
acl path_has_dot path_sub .
redirect code 301 prefix / drop-query append-slash if url_engineering !path_ends_in_slash !path_has_dot

Note the path_has_dot ACL is a quick and dirty way of testing if the path has an extension. Since we don’t expect any parts of the path to have a dot unless it’s, say, a .css or .js file, this makes sense.
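If the ACL logic is hard to read at a glance, the same decision can be sketched as a small Python function. This is only a model of the three ACLs above, not something we run; the function name trailing_slash_redirect is made up for illustration, and it takes the path without the query string, since that is what the path ACLs match against.

```python
from typing import Optional


def trailing_slash_redirect(path: str) -> Optional[str]:
    """Return the Location for a 301, or None if the request passes through."""
    url_engineering = path.startswith("/engineering")  # path_beg /engineering
    path_ends_in_slash = path.endswith("/")            # path_end /
    path_has_dot = "." in path                         # path_sub .

    if url_engineering and not path_ends_in_slash and not path_has_dot:
        # drop-query + append-slash: same path, no query string, plus "/"
        return path + "/"
    return None


print(trailing_slash_redirect("/engineering/my-post"))       # /engineering/my-post/
print(trailing_slash_redirect("/engineering/my-post/"))      # None
print(trailing_slash_redirect("/engineering/css/site.css"))  # None
print(trailing_slash_redirect("/engineering/v1.2-notes"))    # None
```

The last call shows the cost of the quick and dirty dot check: a post slug containing a dot would skip our redirect and fall through to GitHub's.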

CDN Purging

We can now update our blog just by committing and pushing to the gh-pages branch, and it lives at wistia.com/engineering. Looking pretty good. But I need immediate satisfaction! I don’t want to wait forever for old posts to fall out of cache!

At this point, you might think we should just turn off CDN caching on our blog. After all, GitHub uses Fastly for GitHub Pages too. But without a CDN fronting our posts, every single request to our blog would actually be going into our load balancer data center, out to GitHub’s, back to ours, and finally back to the viewer. It’s probably fast enough, but the blog is static content, so it’ll be way faster if we just cache it.

Fortunately, Fastly has a feature where you can assign several keys to a request, which you can then use later to perform bulk purges. Again, we have no dynamic backend, but we can use HAProxy to add those keys to all requests. I modified the backend config to look like this:

backend engineering
  http-request set-header Host wistia.github.io
  http-response set-header Surrogate-Key wistia_engineering
  http-response set-header Server GitHub.com_via_wistia.com
  server s1 wistia.github.io:80
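For anyone new to surrogate keys, here is a toy Python model of the idea: the cache remembers which URLs were stored under each key, so one purge call can drop them all. It is a sketch of the concept only and says nothing about how Fastly actually implements it; ToyCache and its methods are invented for this example.

```python
from collections import defaultdict


class ToyCache:
    """A toy cache that indexes objects by surrogate key for bulk purging."""

    def __init__(self) -> None:
        self.objects = {}               # url -> cached body
        self.by_key = defaultdict(set)  # surrogate key -> set of urls

    def store(self, url: str, body: str, surrogate_key_header: str) -> None:
        """Cache a response and index it under each space-separated key."""
        self.objects[url] = body
        for key in surrogate_key_header.split():
            self.by_key[key].add(url)

    def purge(self, key: str) -> int:
        """Drop every object tagged with `key`; return how many were removed."""
        urls = self.by_key.pop(key, set())
        for url in urls:
            self.objects.pop(url, None)
        return len(urls)


cache = ToyCache()
cache.store("/engineering/", "<html>index</html>", "wistia_engineering")
cache.store("/engineering/my-post/", "<html>post</html>", "wistia_engineering")
cache.store("/pricing", "<html>pricing</html>", "other_key")

print(cache.purge("wistia_engineering"))  # 2
print(sorted(cache.objects))              # ['/pricing']
```

Because every blog response carries the same key, a single purge clears the whole blog without touching anything else on the domain.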

Finally, when GitHub Pages builds the site, I want to issue this purge automatically. There is a convenient hook called “Page Build” for this in the GitHub repo’s Settings > Web Hooks & Services section. For that hook, I added a private POST URL which will trigger a Fastly purge of wistia_engineering.
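The private endpoint itself isn't shown here, but its core could look roughly like the Python sketch below. It assumes Fastly's purge-by-key API (a POST to /service/<service id>/purge/<key>, authenticated with a Fastly-Key header) and GitHub's X-GitHub-Event header value of page_build; check both against the current docs before relying on them. The function names, service ID, and token are placeholders, and the request is only built, not sent.

```python
import urllib.request
from typing import Optional

FASTLY_API = "https://api.fastly.com"


def build_purge_request(service_id: str, api_token: str,
                        surrogate_key: str = "wistia_engineering") -> urllib.request.Request:
    """Build (but do not send) a purge-by-surrogate-key request."""
    url = f"{FASTLY_API}/service/{service_id}/purge/{surrogate_key}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Fastly-Key": api_token, "Accept": "application/json"},
    )


def purge_request_for_event(github_event: str, service_id: str,
                            api_token: str) -> Optional[urllib.request.Request]:
    """Only react to page builds; ignore any other webhook event."""
    if github_event != "page_build":
        return None
    return build_purge_request(service_id, api_token)


# SERVICE_ID and TOKEN are placeholders, not real credentials.
req = purge_request_for_event("page_build", "SERVICE_ID", "TOKEN")
print(req.get_method())  # POST
print(req.full_url)      # https://api.fastly.com/service/SERVICE_ID/purge/wistia_engineering
```

Actually sending it would be a call to urllib.request.urlopen(req) inside whatever handler receives the webhook.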

Idiosyncrasies using Fastly as a backend and a frontend

One more interesting note to add. Because GitHub uses Fastly for github.io and we use Fastly for wistia.com, we are using Fastly as both a frontend and a backend. I wonder how they feel about that. :) Therefore, that Server header above is actually very important. By default, it would return GitHub.com, but we change it to GitHub.com_via_wistia.com.

Without that header change, Fastly sees no discernible difference between the two requests, and therefore will not add a new index for our surrogate key in their cache. This puzzled me at first because it manifested as purges having no effect. Once I modified the Server header, everything started working as it should.

And with that… the blog is up!