I have been wondering for a while why my jokes pages aren’t listed in search engines. They are WordPress pages which have some PHP in them to read from a separate database table containing my jokes. Every now and then I’ve tried to fix the problem, and I think I’ve finally found the issue. I’d also like to explain what didn’t solve the problem, so you can see the troubleshooting steps I took.
First of all, I wondered if the content on the pages was too similar. I have a page that lists the joke categories, and each link on it leads to a page listing the jokes in that category. I thought maybe the list of links wasn’t search engine friendly enough. So I added a bit of introductory text, and changed the <title> of each category list page to include the name of the category. I also added links to the next joke in the category on each joke page and moved the breadcrumbs (e.g. Jokes by Category > True Stories jokes) to after the joke so the top of the page wouldn’t always contain very similar data. That didn’t work, but the pages are now a bit better to read and each joke in a category links to another joke, which allows for better navigation. Perhaps people will read two or three jokes rather than being stuck in a dead end, especially if they land on the site on a specific joke page.
Then I wondered about submitting a sitemap of the joke category pages so that Google, Yahoo and Bing would know about them. It didn’t work either, but having a sitemap listed in Google’s webmaster tools allowed me to see that none of the category pages were actually being listed.
I scratched my head for a while on this one, before realising that perhaps the use of a querystring (e.g. in the URL the only thing that changes is the bit after ?CAT= or ?JOKEID=) as the only difference in URLs was a problem. I even went so far as to set up a .htaccess file to rewrite URLs from ?JOKEID=123 to /joke/123/ where 123 would be the actual number the joke is referenced as in the database. Funnily enough, that made no difference either – and it was difficult to get it to cohabit with WordPress because there are internal rewrites within WordPress and my custom joke pages are kind of sidestepping some of that.
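To make the mapping concrete, here is a minimal PHP sketch of what that rewrite was meant to express: pulling the joke number out of a pretty URL so it can be treated like ?JOKEID=123. The function name joke_id_from_path is made up for illustration, and this is not the .htaccess rule itself, just the same pattern written in PHP.

```php
<?php
// Hypothetical helper (not part of WordPress): extract the joke number
// from a path such as /joke/123/ so it can stand in for ?JOKEID=123.
function joke_id_from_path(string $path): ?int
{
    // Accept /joke/<digits> with or without a trailing slash.
    if (preg_match('#^/joke/(\d+)/?$#', $path, $matches) === 1) {
        return (int) $matches[1];
    }

    // Anything else is not a joke URL.
    return null;
}
```

Whatever does the rewriting, the part worth getting right is the pattern: digits only, and an optional trailing slash.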
Webmaster tools in Google were still saying my pages weren’t listed. I tried another content update, going so far as to add a paragraph describing each category on the initial list page, and then re-using the category description on the pages that show jokes from only one category. This too was in vain, though the pages now look a lot more interesting. Perhaps they’re slightly busier and less easy to read but they have extra unique content which is a bonus for those two people who want to work out how I categorised the jokes.
Somewhere along these last two steps, I also noticed that Google lets you specify which URL parameters it should not ignore. It’s in a section called parameter handling, and the parameters in question are the querystring values I mentioned before. So I set up CAT and JOKEID not to be ignored. Still no joy.
In a final step, within which my eureka moment was to come, I thought that there was perhaps too much guff in the <head> section of my joke pages. The standard stuff that WordPress adds to each header is useful for the homepage and perhaps posts, but not for my custom pages. So I set most of it to display only on the homepage (a lot of it isn’t of enormous use elsewhere) by changing the template files to include a condition to add that data only for the homepage. When I was testing that on my jokes pages, I suddenly noticed something in the source code which made me shake my head in disbelief. How could I have missed it? There was a line in the <head> which said:
<link rel='canonical' href='https://www.caperet.com/joke-database/' />
WordPress automatically adds a canonical reference to all pages, so that there is one master URL which the search engines will use to index the page. This makes sense to avoid having the same page listed multiple times. Except that my custom pages all display different content depending on the value of CAT or JOKEID in the URL, while WordPress was effectively telling search engines that they were all the same page. I soon found a page on how to disable canonical links but I didn’t want to apply it to my whole blog, just the jokes section. Since my jokes pages are all “Pages” in WordPress and not “Posts”, I added the code at the top of my Page template, and was pleased to note that it works just fine there.
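For anyone wanting to try the same thing, a minimal sketch of the idea looks something like this. remove_action(), the wp_head hook and the rel_canonical callback are standard WordPress; the helper joke_request_has_params is a made-up name, and limiting the change to requests that carry CAT or JOKEID is an assumption of this sketch (it is narrower than switching the canonical link off for every Page). It has to run before get_header(), because that is where wp_head() fires.

```php
<?php
// Hypothetical helper: true when the request carries one of the joke
// parameters (CAT or JOKEID) that change what the page displays.
function joke_request_has_params(array $query): bool
{
    return isset($query['CAT']) || isset($query['JOKEID']);
}

// Place this at the very top of the Page template, before get_header().
// The function_exists() guard only lets the file load outside WordPress.
if (function_exists('remove_action') && joke_request_has_params($_GET)) {
    // Stop WordPress printing <link rel='canonical' ...> for this request.
    remove_action('wp_head', 'rel_canonical');
}
```

With this in place, a request such as /joke-database/?JOKEID=123 should no longer carry the canonical line pointing back at /joke-database/, while ordinary Pages keep theirs.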
So now my jokes pages should, next time they are crawled, finally all be listed. All that work so that a few people each month might find a joke in my database. At least I’ve learned something, and perhaps given you an idea of how I troubleshot this issue.
[ED] The pages were finally listed in Google on the 19th October, about 3 weeks after the initial modifications.