Don’t forget that Robots.txt Sitemap entries need to be absolute

A robots.txt file tells visiting robots – such as Google’s crawler – what they should and should not crawl. Part of this should be a Sitemap directive:

(Image: a robots.txt Sitemap directive, using a relative URL)
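A minimal robots.txt along those lines might look like this (the relative sitemap path is purely illustrative):

User-agent: *
Disallow:

Sitemap: sitemap.xml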

The Sitemap entry (or entries) points to an XML file listing all your pages of content and when they were last updated. That way, the crawler can quickly and efficiently find new content.

Technically, the URL to the sitemap(s) must be absolute. Yes, there’s no earthly reason why this must be so, but that’s what the specification says. Fortunately, Google will handle relative URLs – so the definition shown above should work for Google – but it might not work for other robots.
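For comparison, a strictly correct entry gives the full URL of the sitemap – something like the line below, with example.com standing in for your own host:

Sitemap: http://www.example.com/sitemap.xml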


Problems with SitemapXML module for Sitecore

Previously, I’ve looked at some of the problems with the SitemapXML module for Sitecore. Well, here are another couple that have caught me a few times…


Awkward Sitemap XML module

So, I was reviewing a few Sitecore log files for a customer of ours, and I kept coming across the following entries.

ManagedPoolThread #15 10:26:56 WARN The serachengine "Http://google.com/webmasters/sitemaps/ping?sitemap=sitemap.xml" returns an 404 error
ManagedPoolThread #15 10:26:57 WARN The serachengine "http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=sitemap.xml" returns an 404 error

This struck me as interesting. These are calls to different search providers to tell them that your site’s sitemap file has been updated and should be read again (so that new content can be indexed). This is all from a nice little module on the Sitecore Marketplace. We use it quite a lot. However, I spotted a few issues…
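For context, these ping endpoints expect the sitemap parameter to be the absolute, URL-encoded address of your sitemap – so a well-formed request to Google would presumably look more like the line below, with example.com as a placeholder:

http://google.com/webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.example.com%2Fsitemap.xml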


Configure Sitecore to update Sitemaps.xml on ALL servers

In a robust web-based system one would expect to have multiple servers serving content. In Sitecore, this typically means having a content management server and one or more content delivery servers.

Sitecore also offers a ‘Sitemap XML module’, which is pretty neat. When you publish content, it will generate an updated Sitemap.xml file – basically, a listing of pages that a web-crawler like Googlebot can use to crawl and index a site. It can also ping the major search engines, telling them that the sitemap has been updated, and that they should recrawl it at their leisure.


However, things get trickier when you have content delivery servers.
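As a rough sketch of the kind of configuration involved – and not necessarily the approach the full post takes – one option is to also wire the module’s refresh handler up to the publish:end:remote event, which is the event that delivery servers receive when a publish completes elsewhere. The handler type and method below are assumptions based on the module’s standard publish:end wiring, so check them against your installed version:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <events>
      <event name="publish:end:remote">
        <!-- Assumed handler type/method - verify against the Sitemap XML module's own include file -->
        <handler type="Sitecore.Modules.SitemapXML.SitemapHandler, Sitemap.XML" method="RefreshSitemap" />
      </event>
    </events>
  </sitecore>
</configuration>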
