Asked by Jim Mischel on February 23, 2021
I run a fairly large-scale Web crawler. We try very hard to operate the crawler within accepted community standards, and that includes respecting robots.txt. We get very few complaints about the crawler, but when we do, the majority are about our handling of robots.txt. Most often the Webmaster made a mistake in his robots.txt and we kindly point out the error. But periodically we run into grey areas that involve the handling of Allow and Disallow.
The robots.txt page doesn't cover Allow. I've seen other pages, some of which say that crawlers use a "first matching" rule, and others that don't specify. That leads to some confusion. For example, Google's page about robots.txt used to have this example:
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
Obviously, a "first matching" rule here wouldn't work because the crawler would see the Disallow and go away, never crawling the file that was specifically allowed.
We're in the clear if we ignore all Allow lines, but then we might not crawl something that we're allowed to crawl. We'll miss things.
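As an illustration only (not part of the original question), here is a minimal Python sketch of a strict "first matching" evaluator, assuming plain prefix matching and rules kept in file order; the function name and helpers are mine. It shows why Google's old example breaks under that rule.

# First-match evaluation: scan rules in file order, the first prefix match decides.
def first_match_allowed(path, rules):
    # rules is a list of (directive, prefix) tuples in the order they appear
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True  # nothing matched: crawling is allowed by default

# Google's old example, with the Disallow line listed first
rules = [("disallow", "/folder1/"), ("allow", "/folder1/myfile.html")]
print(first_match_allowed("/folder1/myfile.html", rules))
# False: the Disallow matches first, so the explicitly allowed file is never crawled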
We've had great success by checking Allow first, and then checking Disallow, the idea being that Allow was intended to be more specific than Disallow. That's because, by default (i.e. in the absence of instructions to the contrary), all access is allowed. But then we run across something like this:
User-agent: *
Disallow: /norobots/
Allow: /
The intent here is obvious, but that Allow: / will cause a bot that checks Allow first to think it can crawl anything on the site.
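For comparison, here is a minimal sketch (again mine, not from the question) of the "check Allow first" heuristic, assuming plain prefix matching; it shows how the blanket Allow: / defeats the intent of this robots.txt.

# Allow-first evaluation: if any Allow rule matches, crawl; otherwise consult Disallow.
def allow_first_allowed(path, allows, disallows):
    if any(path.startswith(prefix) for prefix in allows):
        return True
    return not any(path.startswith(prefix) for prefix in disallows)

allows = ["/"]              # Allow: /
disallows = ["/norobots/"]  # Disallow: /norobots/
print(allow_first_allowed("/norobots/secret.html", allows, disallows))
# True: the blanket Allow matches everything, even though /norobots/ was meant to be blocked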
Even that can be worked around in this case. We can compare the matching Allow with the matching Disallow and determine that we're not allowed to crawl anything in /norobots/. But that breaks down in the face of wildcards:
User-agent: *
Disallow: /norobots/
Allow: /*.html$
The question, then: is the bot allowed to crawl /norobots/index.html?
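To show why this is ambiguous, here is a minimal sketch (my own, with the usual assumption that * matches any run of characters and a trailing $ anchors the pattern at the end of the URL path) of translating those wildcard patterns into regular expressions. Both rules match the URL, so the pattern syntax alone doesn't say which one should win.

import re

# Translate a robots.txt path pattern into an anchored regular expression.
def pattern_to_regex(pattern):
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

path = "/norobots/index.html"
print(bool(pattern_to_regex("/norobots/").match(path)))  # True: the Disallow rule matches
print(bool(pattern_to_regex("/*.html$").match(path)))    # True: the Allow rule also matches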
The "first matching" rule eliminates all ambiguity, but I often see sites that show something like the old Google example, putting the more specific Allow after the Disallow. That syntax requires more processing by the bot and leads to ambiguities that can't be resolved.
My question, then, is what’s the right way to do things? What do Webmasters expect from a well-behaved bot when it comes to robots.txt handling?
One very important note: the Allow statement should come before the Disallow statement, no matter how specific your statements are.
So in your third example: no, the bots won't crawl /norobots/index.html.
Generally, as a personal rule, I put allow statements first and then I list the disallowed pages and folders.
Correct answer by Vergil Penkov on February 23, 2021
Google expanded its robots.txt documentation to cover user agents that support the Allow directive. The rule that Googlebot uses (and what Google is trying to make standard) is that the longest matching rule wins.
So when you have:
Disallow: /norobots/
Disallow: /nobot/
Allow: /*.html$
Allow: /****.gif$
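As a side note that is not part of the original answer, here is a minimal Python sketch of that longest-match evaluation, assuming the usual wildcard translation (* matches any run of characters, a trailing $ anchors the end of the path); the helper names are mine. The cases below walk through the same rules.

import re

def to_regex(pattern):
    # Assumed translation: '*' matches any characters, trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def longest_match_allowed(path, rules):
    # Collect every matching rule, keeping its pattern length and whether it allows.
    matches = [(len(pattern), directive == "allow")
               for directive, pattern in rules
               if to_regex(pattern).match(path)]
    if not matches:
        return True  # no rule matched: allowed by default
    longest = max(length for length, _ in matches)
    # Longest pattern wins; on a tie, the less restrictive (Allow) rule wins.
    return any(is_allow for length, is_allow in matches if length == longest)

rules = [("disallow", "/norobots/"), ("disallow", "/nobot/"),
         ("allow", "/*.html$"), ("allow", "/****.gif$")]
print(longest_match_allowed("/norobots/index.html", rules))  # False: 10 beats 8
print(longest_match_allowed("/nobot/index.html", rules))     # True: 8 beats 7
print(longest_match_allowed("/norobots/pic.gif", rules))     # True: 10 ties 10, Allow wins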
/norobots/index.html is blocked because it matches two rules and /norobots/ is longer (10 characters) than /*.html$ (8 characters).
/nobot/index.html is allowed because it matches two rules and /nobot/ is shorter (7 characters) than /*.html$ (8 characters).
/norobots/pic.gif is allowed because it matches two rules and /norobots/ is equal in length (10 characters) to /****.gif$ (10 characters). Google's spec says that the "less restrictive" rule should be used for rules of equal length, i.e. the one that allows crawling.
Answered by Stephen Ostermiller on February 23, 2021
Here's my take on what I see in those three examples.
Example 1
I would ignore the entire /folder1/ directory except the myfile.html file. Since they explicitly allow it, I would assume it was simply easier to block the entire directory and explicitly allow that one file as opposed to listing every file they wanted to have blocked. If that directory contained a lot of files and subdirectories, that robots.txt file could get unwieldy fast.
Example 2
I would assume the /norobots/ directory is off limits and everything else is available to be crawled. I read this as "crawl everything except the /norobots/ directory".
Example 3
Similar to example 2, I would assume the /norobots/ directory is off limits and all .html files not in that directory are available to be crawled. I read this as "crawl all .html files but do not crawl any content in the /norobots/ directory".
Hopefully your bot's user-agent contains a URL where they can find out more information about your crawling habits and make removal requests or give you feedback about how they want their robots.txt interpreted.
Answered by John Conde on February 23, 2021