Touhou-Project.com

Seek and Ye Shall Maybe Find

Added 2021-12-26 11:05:05 +0000 UTC

Hey all, hope you’ve been well. In my last post I alluded to things that were half-finished or in-progress, many of which affected the user experience of the site. Those remain in-progress for the most part. However, I have bothered to do something arguably more important than all that, though less glamorous: rework how information is presented to bots.

If you’re one step ahead of me, well, then you’ll know that I mean web crawlers and indexers. These are part of the silent majority of web traffic, countless programs that scrape data that is then used by other companies and services. I mean to say that I’m talking about search engines and their results. Specifically, how they sift data and return meaningful results for humans.

While the field of SEO is a very competitive one with plenty of advice on the internet as well as premium services that promise better rankings (a lot of snake oil abounds therein), the basic idea is that when you have a website you can structure your data to make it easier for the various search algorithms to figure out whether a user query is related to your page. I’m sure every one of you has used a search engine and has seen that the order of the results changes depending on the keywords. Results also include a bit of information on the page so you can make up your mind if that’s what you are looking for.

The original board software did very little to provide meaningful data. It may be due to the supposed transitory nature of image boards; threads typically last only a few hours on bigger boards at best. Once they were gone, they were gone. Sometimes another would be made with a similar topic. There was, therefore, no need to make it friendly to search engines.

Of course, THP has a lot of rather long-form fanfiction that is meant to stick around perpetually. In that case, having something more than a bit of text randomly chosen by a confused bot is probably good. That way you have more relevant data that can be found easily. Or suggested automatically. That’s why many years ago I added proper title elements to each page. Typically these were the subject of the first post in a thread or, if there was nothing to go on, a thread number and the board name. Slap a little universal title about THP on all pages and you get something rudimentary that positively contributes to page rankings.

Still, this was not without its limitations. It was difficult to tell what THP was about from a title and, often enough, mashed up bits from posts didn’t say much about the content. Even searching for words like “touhou” and “fanfiction” wouldn’t necessarily get you one of those stories. So the basic idea of the latest changes was to expand this system as much as possible as well as give consistent results.

This is something that has been on my task list for quite some time and depended on other components as well. The story list overhaul with the new tag system was fundamental. Why? Well, things like tags and synopses are pretty handy to refer to programmatically when you need to present information. But I’m getting ahead of myself.

Here’s a quick recap of how THP works: whenever a post is made or deleted, the board software triggers a bunch of commands that ultimately query the database, get relevant data about the board and thread, pass it through more filters and transform it in places, then pass all that to a bunch of templates that hold the page structure. The result is ultimately spat out as a static HTML file.

I’ve talked about the various entangled systems in the board software and even in the generation steps this holds true. A story thread passes through various templates and functions that it might share with, say, an archived thread or a board page, but it will differ in key places on occasion. Bringing it back to inserting relevant information for search engines, there was no necessary guarantee that I could set a value somewhere and that it would always produce the results I wanted.

So, first things first, I had to overhaul some of the generation process. Specifically, bring together the archival process with regular thread building. Archived threads used their own header elements (which largely include the data that’s useful to bots) in their templates whereas most everything else shares at least one part. So, as I had already simplified templates recently, I was able to cleanly take away the relevant parts and include the global header template in the archival process as well. A new parameter in a class method makes sure that other parts of code needed by most of the other templates but not relevant to archiving are not executed. Relatively simple thanks to previous groundwork.

With that out of the way, it was time for the real work.

There are several tags that can be set in HTML that a web browser (and bots) understand to be page information. I aimed to fill at least three for every thread. First and most obvious: the title tag, which is self-explanatory and shows up on your browser’s title bar, on a tab, and as the title of a link in search results. Then there’s the meta description tag, which is normally invisible but is what search engine bots prefer to include in the description of a result if it exists. Unique ones help get your pages more results. Finally, the meta author tag, which isn’t really used that much but I figured that it couldn’t hurt to include.
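To make those three tags concrete, here is a minimal sketch of how they could be rendered for a page. It is written in Python purely for illustration; the function name `render_head_tags` and its parameters are hypothetical and not taken from the actual board software.

```python
from html import escape


def render_head_tags(title: str, description: str, author: str) -> str:
    """Build the title, meta description and meta author elements.

    Hypothetical helper: all text is escaped so that quotes or angle
    brackets in a story title cannot break the surrounding markup.
    """
    return "\n".join([
        f"<title>{escape(title)}</title>",
        f'<meta name="description" content="{escape(description)}">',
        f'<meta name="author" content="{escape(author)}">',
    ])


if __name__ == "__main__":
    print(render_head_tags("Some Story", 'A story with "quotes"', "Some Author"))
```

The escaping matters more than it looks: subjects and synopses are user input, so they can contain characters that would otherwise end an attribute early.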

I created a new method for the class that runs the page generation process and called that in the header section, basically meaning that I had to create a new block of code with instructions to run. Keeping in mind the sort of information I wanted, it makes sense to consult the database data on various stories. Starting from a thread id, querying up if it’s part of a story and then fetching other bits of info like synopses and tags. A casual look at the story list will reveal that not every story has an author or a proper title, much less tags nor a synopsis. So half of my work was therefore to cobble together what I wanted in a generic way, so that every story would have “something” unique and meaningful for those three fields.

In the case of a description, a synopsis will be used if it exists. Since there’s a 300 character limit on Google, this synopsis is first cut with a small function I made that looks for the last bit of punctuation before the limit, so things aren’t left mid-sentence. Add a generic “Touhou fanfiction about:” before it and you’ve got a meaningful search description in my opinion.
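The cutting step can be sketched roughly like this. This is an illustration in Python rather than the site's real function; the names are made up, and two behaviours are my own assumptions: falling back to the last whole word when no punctuation is found, and reserving room for the prefix inside the 300 characters.

```python
import re

LIMIT = 300
PREFIX = "Touhou fanfiction about: "


def cut_at_punctuation(text: str, limit: int = LIMIT) -> str:
    """Trim text to at most `limit` characters, ending on a full sentence."""
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    if len(text) <= limit:
        return text
    window = text[:limit]
    # Look for the last sentence-ending mark inside the allowed window.
    marks = list(re.finditer(r"[.!?]", window))
    if marks:
        return window[: marks[-1].end()]
    # Assumption: with no punctuation at all, cut at the last whole word.
    return window.rsplit(" ", 1)[0]


def synopsis_description(synopsis: str, limit: int = LIMIT) -> str:
    """Prefix the trimmed synopsis; the prefix counts toward the limit here."""
    return PREFIX + cut_at_punctuation(synopsis, limit - len(PREFIX))
```

For example, `cut_at_punctuation("First sentence. Second sentence goes on and on", 20)` returns `"First sentence."` instead of stopping in the middle of the second sentence.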

If a synopsis doesn’t exist, tags will be used instead. If tags exist, up to two character tags and two genre tags will be output with different variations (accounting for the number of tags/grammar) so that the description of a story (like say, this one) will read a little something like this: “A touhou fanfiction action story featuring Kasodani Kyouko by Clear Your Sights.”

And if there’s no info at all, it’ll simply say that it’s a fanfic on a specific board by a specific author. Both of these may not be as ideal as a manually-input description but I think it does a good job of getting the point across a range of different scenarios.
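Put together, the order of preference for a story's description could look something like the sketch below. It is a Python illustration with hypothetical names; the exact wording of each variation, the way two genres are joined, the board notation and the "Anonymous" placeholder for a missing author are all assumptions on my part.

```python
def describe_from_tags(genres, characters, author):
    """Build a sentence from up to two genre and two character tags."""
    genres, characters = genres[:2], characters[:2]
    parts = ["A touhou fanfiction"]
    if genres:
        parts.append("/".join(g.lower() for g in genres))
    parts.append("story")
    if characters:
        parts.append("featuring " + " and ".join(characters))
    if author:
        parts.append("by " + author)
    return " ".join(parts) + "."


def story_description(synopsis, genres, characters, author, board):
    """Prefer the synopsis, then tags, then a generic board/author line."""
    if synopsis:
        return "Touhou fanfiction about: " + synopsis
    if genres or characters:
        return describe_from_tags(genres, characters, author)
    # Assumption: stories without a known author get a placeholder name.
    return f"A touhou fanfiction story on /{board}/ by {author or 'Anonymous'}."


if __name__ == "__main__":
    print(story_description("", ["Action"], ["Kasodani Kyouko"],
                            "Clear Your Sights", "th"))
```

Running it prints the example sentence quoted earlier, and dropping the tags falls through to the generic board and author line.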

Now what if the thread isn’t in the story list for some reason or isn’t a story at all? It’s still desirable to have as much data about it served up. So we salvage the old system, taking the title from the OP if it has a subject, otherwise just noting a thread number. A description is forcefully taken from the OP message, trimming it down to 300 characters. Slightly better than the smushed up data including posting time that tended to automatically be used by search engines, I feel.
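For threads outside the story list, the fallback amounts to something like this sketch. It is Python for illustration only; the function names and the exact wording of the numbered title are hypothetical.

```python
def thread_title(subject, thread_id, board):
    """Use the OP's subject if there is one, else thread number and board."""
    subject = (subject or "").strip()
    if subject:
        return subject
    return f"Thread {thread_id} on /{board}/"


def thread_description(message, limit=300):
    """Flatten the OP message to one line and hard-trim it to the limit."""
    return " ".join(message.split())[:limit]
```

Unlike the synopsis handling, this cut is blunt and may stop mid-sentence, which matches the "forcefully taken" behaviour described above.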

Oh, and as an aside: to make things more standardized, titles on the various threads are changed to be the story’s proper name instead of the thread’s subject line, with a little number indicating which thread it is if applicable. The reason for this is both organization and to serve as a stepping stone for something else that may be done in the future.

I could have just left it there and I think that it would have improved THP’s appearance to the wider web. But I went one step further and soldiered on.

I decided to include robot-exclusive vocabulary in the pages. Basically hidden data that’s formatted in a specific way so that machines have a greater “understanding” of what a page is about. There’s many different ways to do this but I decided to tentatively go with JSON-LD as it can be applied generally to a page, as opposed to things like RDFa or Microdata which apply more to specific elements. A little confusing, trust me I know, but the gist is that you can describe a page in a way that may help pages get ranked by algorithms favorably.

The syntax isn’t so hard to learn but the sheer number of possibilities and combinations out there is incredibly overwhelming. I have no idea what would be best to include so I limited myself to mirroring the same elements as before. In short, I wanted to have a title, description, information about THP, an author and tags. Even so, I wasn’t sure how to best organize things and, for the time being, split the schema between the front page and threads.

On the front page, I added data pretty much manually, filling in information on THP as a whole. In the other threads, I reused a lot of the same information I was using in that aforementioned new class’ method. I created another function to process the data into the very strict format required by machines while making sure to include things like tags as keywords. In addition to keywords like “fanfiction”, the schema will include several character names (currently up to 4) and genres (3) but I’m not sure if I should include more tags or the other kinds of tags (the technical ones).
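As a rough idea of what the thread schema could look like, here is a minimal Python sketch that builds a JSON-LD block. The choice of the schema.org `CreativeWork` type and the property layout are my assumptions for illustration, not necessarily what the site emits, and the function name is hypothetical.

```python
import json


def story_json_ld(title, description, author, characters, genres):
    """Return a JSON-LD script element describing one story thread."""
    # "fanfiction" plus up to 4 character names and 3 genres, as described.
    keywords = ["fanfiction"] + list(characters[:4]) + list(genres[:3])
    data = {
        "@context": "https://schema.org",
        "@type": "CreativeWork",  # assumed type for this sketch
        "name": title,
        "description": description,
        "keywords": ", ".join(keywords),
        "isPartOf": {"@type": "WebSite", "name": "Touhou-Project.com"},
    }
    if author:
        data["author"] = {"@type": "Person", "name": author}
    # Escape "</" so user text can never close the script element early.
    payload = json.dumps(data, ensure_ascii=False).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'
```

Building the structure as a dictionary and letting `json.dumps` serialize it takes care of the strict formatting, which is easier than assembling the JSON text by hand in a template.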

I haven’t found reliable info on what these keywords actually accomplish in a practical sense nor if there’s a limit beyond which they’re ignored or, worse, decrease the relevance of the page in search results. In the late 90s and early 00s, webmasters abused the meta keywords tag to such an extent that the major search engines now largely ignore it.

To be completely honest, I don’t know if all of this additional work will make much of a practical difference but seeing as I was already in for a penny…

With a few days of testing things out, it seemed that I got things working in reasonable order. It’s not an ideal system and there’s much room for improvement. There’s a lot more search engine optimization that can be done. I just think that it may be a little pointless to go all-in because of how the site is currently structured. It would mean a lot of work done in a hacky fashion. And, possibly, work that would need to be redone later no matter what. That’s why I’m taking things one step at a time and splitting my attention between different areas of work.

All that I’ve mentioned here will have been live on the site for at least a few days by the time you read this. I’ve further tweaked code and fixed bugs in the interim and will likely continue to do so as more oversights are spotted. Further, the system will work better the more stories get synopses and tags.

That’s it for 2021. I want to thank everyone for their continuing support and wish you all the best. I hope that 2022 will be a great year for THP and our community.

Until next time, take it easy!