A robots.txt file tells visiting robots – such as Google’s crawler – what they should and should not crawl. Part of this should be a Sitemap directive:
The Sitemap entry (or entries) points to an XML file listing all your content pages and when they were last updated. That way, a crawler can quickly and efficiently find new content.
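For illustration, a Sitemap directive in robots.txt can be written either way (the example.com domain here is hypothetical):

```text
User-agent: *
Disallow:

# Absolute URL – the form the sitemap protocol requires:
Sitemap: https://www.example.com/sitemap.xml

# Relative URL – Google tolerates this, but other robots may not:
Sitemap: /sitemap.xml
```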
Technically, the URL to the sitemap(s) must be absolute. Yes, there’s no earthly reason why this must be so, but that’s what the specification says. Fortunately, Google will handle relative URLs – so a relative definition should work for Google – but it might not work for other robots. Continue reading “Don’t forget that Robots.txt Sitemap entries need to be absolute”
Just a note on the example robots.txt file that I’m using in Sitecore:
Disallow: /sitecore modules/
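The full file isn’t reproduced here, but a Sitecore robots.txt along those lines might look like this sketch – the paths are the common Sitecore webroot folders and the domain is hypothetical, so adjust both to your own installation:

```text
User-agent: *
Disallow: /sitecore/
Disallow: /sitecore modules/
Disallow: /App_Config/
Disallow: /temp/
Disallow: /upload/

Sitemap: https://www.example.com/sitemap.xml
```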
Also, don’t forget to set up your Sitecore.Analytics.ExcludeRobots.config
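That file patches the list of user agents (and IP addresses) whose visits are excluded from analytics. As a sketch only – the element names below are from memory and may differ between Sitecore versions, so compare against the stock file shipped in App_Config/Include:

```xml
<!-- Sketch only: element names are assumptions; check the stock
     Sitecore.Analytics.ExcludeRobots.config for your version. -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <analyticsExcludeRobots>
      <excludedUserAgents>
        <!-- Requests with this user agent string are not tracked -->
        <agent>Googlebot</agent>
      </excludedUserAgents>
    </analyticsExcludeRobots>
  </sitecore>
</configuration>
```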
This one was a real Sherlock Holmes case. The site I’ve been working on has Azure App Insights monitoring it, and on the live site we started to see lots of exceptions that hadn’t appeared in testing.
Digging into the exceptions, I noticed that:
- The exception was System.Web.HttpUnhandledException with an inner System.NullReferenceException
- The method throwing the exception was Sitecore.Form.Core.Ascx.Controls.SimpleForm.OnAddInitOnClient, which is part of the Web Forms for Marketers suite.
- The user agent was always Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) – so it was Googlebot making the request. There’s no reason that should cause an exception though!
After a while I realised that Googlebot is a robot, and so is excluded from analytics – which meant this problem probably had the same cause as the exceptions our testing was starting to throw up. The OnAddInitOnClient method isn’t one of mine, though – it’s part of Sitecore’s Web Forms for Marketers – so I opened up Reflector and down the rabbit hole I went… Continue reading “Web Forms For Marketers blows up if Analytics disabled”
Right, so this is a lesson of “It’s obvious when you think about it”.
The new company that I work for has had a policy of adding the IP addresses of their gateways to the Sitecore.Analytics.ExcludeRobots.config file. The idea behind this is that we don’t want to pollute customers’ analytics data with traffic from ourselves, particularly if we are running a health-monitoring service that occasionally polls their website. That all seems sensible enough.
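A gateway exclusion of that sort might be patched in roughly like this – a sketch only, where the IP address is a made-up example from the TEST-NET-3 documentation range and the element names may vary by Sitecore version:

```xml
<!-- Sketch only: hypothetical gateway address; verify element names
     against the stock config for your Sitecore version. -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <analyticsExcludeRobots>
      <excludedIPAddresses>
        <!-- Office gateway: all traffic from this address is
             treated as a robot and excluded from analytics -->
        <ip>203.0.113.10</ip>
      </excludedIPAddresses>
    </analyticsExcludeRobots>
  </sitecore>
</configuration>
```

Note the trade-off the post goes on to describe: anything else that happens to arrive from that fixed IP address is also treated as a robot.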
Here’s the problem: our tester was trying to test some of the pages we’d built on a live site. There was a problem with that – more in a later post – but the upshot was that, on two apparently identical devices (iPhones), one would work and the other would receive the yellow screen of death. A straw poll of iPhones in the office showed 3 working and 3 failures. Curious… Continue reading “Take care with Sitecore.Analytics.ExcludeRobots.config and fixed IP addresses”