Asked by Quillion on November 6, 2021
I have a project and I would like to disallow everything starting with root.
From what I understand, I think I can do so by doing this:
Disallow: /
Disallow: /*
However, I would like to allow 4 subdirectories and everything under those subdirectories.
This is how I think it should be done:
Allow: /directory_one/
Allow: /directory_one/*
Allow: /directory_two/
Allow: /directory_two/*
Allow: /directory_six/
Allow: /directory_six/*
Allow: /about/
Allow: /about/*
So how would I go about disallowing everything starting from root but allowing only those 4 directories and everything under them?
Also, if I want to allow a specific directory and everything under it, do I have to declare it twice?
Will a web crawler be able to navigate to those subdirectories if root is disallowed?
I would not recommend trying to set up your site to disallow everything except certain directories. Not all crawlers support Allow directives. Most crawlers will be disallowed from crawling your entire site. Only a few crawlers will know they are allowed to crawl those subdirectories. Luckily, the major search engine bots process Allow directives.

I would recommend moving all disallowed content into its own subdirectory and disallowing it. For example, put all your blocked content into /private/ and use the following in robots.txt:
User-Agent: *
Disallow: /private
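As a quick local sanity check of that setup, here is a minimal sketch using Python's standard-library robots.txt parser; it is not part of the answer, and the example.com URLs are placeholders:

from urllib import robotparser

# Hypothetical check of the suggested rules; example.com is a placeholder domain.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /private",
])

# robots.txt rules are "starts with" matches, so /private covers everything under it.
print(rp.can_fetch("*", "https://example.com/"))                     # True
print(rp.can_fetch("*", "https://example.com/private/secret.html"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))            # True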
If you are comfortable blocking most bots other than search engine crawlers, and with having worse SEO because your home page is not crawled, your idea of using Allow for specific directories could work. There is no need to use the wildcard * at the end of any directive. All robots.txt rules without a wildcard are "starts with" rules. There is an implied * at the end of each and every one. All you would need would be:
User-Agent: *
Disallow: /
Allow: /directory_one
Allow: /directory_two
Allow: /directory_six
Allow: /about
The order of the rules shouldn't matter. When multiple rules match, the longest rule should apply. So:

- / (the home page) matches Disallow: / and crawling is not allowed.
- /foo matches Disallow: / and crawling is not allowed.
- /about/bob matches both Disallow: / and Allow: /about. The longer rule will apply and crawling would be allowed.

You can test this with Google's robots.txt testing tool that is part of Search Console.
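For readers who want to see that longest-match evaluation concretely, here is a minimal Python sketch (not from the answer, and not Google's actual implementation; it ignores wildcard and $ handling) that reproduces the three cases above:

RULES = [
    ("disallow", "/"),
    ("allow", "/directory_one"),
    ("allow", "/directory_two"),
    ("allow", "/directory_six"),
    ("allow", "/about"),
]

def is_allowed(path):
    # Every rule is a "starts with" prefix; keep only the rules that match this path.
    matches = [(kind, rule) for kind, rule in RULES if path.startswith(rule)]
    if not matches:
        return True  # no rule matches: crawling is allowed by default
    # The most specific (longest) matching rule decides the outcome.
    kind, _ = max(matches, key=lambda m: len(m[1]))
    return kind == "allow"

print(is_allowed("/"))           # False: only Disallow: / matches
print(is_allowed("/foo"))        # False: only Disallow: / matches
print(is_allowed("/about/bob"))  # True: Allow: /about is longer than Disallow: /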
Answered by Stephen Ostermiller on November 6, 2021
What you have would seem to be "about" correct, assuming you have the appropriate User-agent directive that precedes this?
Disallow: /
Disallow: /*
However, you don't need to repeat the same directive, one with a trailing * and one without. robots.txt is prefix matching. The trailing * is superfluous and these match the same URLs.
User-agent: *
Allow: /directory_one/
Allow: /directory_two/
Allow: /directory_six/
Allow: /about/
Disallow: /
Whilst Google (and the big search engines) match using the longest-path-matched method, I believe you should put the Allow directives first for those search engines that use the first-path-matched method.
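As a rough illustration of that ordering point, here is a small Python sketch (not from the answer); Python's standard-library parser happens to check rules in the order they appear, so it can stand in for a first-path-matched crawler, and example.com is a placeholder:

from urllib import robotparser

def allowed(robots_lines, url):
    # Parse robots.txt rules given as a list of lines and test one URL against them.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch("*", url)

disallow_first = ["User-agent: *", "Disallow: /", "Allow: /about/"]
allow_first    = ["User-agent: *", "Allow: /about/", "Disallow: /"]

# A first-match parser hits Disallow: / before it ever sees the Allow rule.
print(allowed(disallow_first, "https://example.com/about/bob"))  # False
# With the Allow rule listed first, the same URL is permitted.
print(allowed(allow_first, "https://example.com/about/bob"))     # True

A longest-path-matched crawler such as Googlebot would allow /about/bob in both cases, which is why the order only matters for the first-match engines.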
Will a web crawler be able to navigate to those subdirectories if root is disallowed?
Yes, that is the reason for the overriding Allow directives. Strictly speaking, the Allow directive is a more recent addition to the "standard", but all main search engines support it.
Answered by DocRoot on November 6, 2021