Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

There appear to be two mainstream ways to tell crawlers what to index and what not to index: adding an X-Robot-Tag HTTP header, or providing a robots.txt file.

Is there any advantage to using the former?

1 Answer

With robots.txt you cannot disallow indexing of your documents.

They have different purposes:

  • robots.txt can disallow crawling (with Disallow)
  • X-Robots-Tag¹ can disallow indexing (with noindex)

(And both offer additional different features, e.g., linking to your Sitemap in robots.txt, disallowing following links in X-Robots-Tag, and many more.)
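As a rough illustration of the two mechanisms, here is a minimal Python sketch. The robots.txt content, the example.com URLs, the ExampleBot user agent, and the helper names crawl_allowed and index_allowed are all assumptions made up for this example. The header check is deliberately simplified: it only looks for a plain noindex directive and ignores user-agent prefixes.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """robots.txt side: may this bot fetch the URL at all?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def index_allowed(x_robots_tag: Optional[str]) -> bool:
    """X-Robots-Tag side: may a fetched document be indexed?

    Simplified: only checks for a bare "noindex" directive.
    """
    if x_robots_tag is None:
        return True
    directives = {d.strip().lower() for d in x_robots_tag.split(",")}
    return "noindex" not in directives


if __name__ == "__main__":
    print(crawl_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/report.html"))
    print(crawl_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/public/page.html"))
    print(index_allowed("noindex, nofollow"))
```

Note that crawl_allowed can be answered before any request to the document is made, while index_allowed needs the response headers, which only exist once the document has been fetched.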

Crawling means accessing the document. Indexing means providing a link to (and possibly metadata from or about) the document in an index. In the typical case, a bot indexes a document after having crawled it, but that’s not necessary.

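A bot that isn’t allowed to crawl a document may still index it (without ever accessing it). A bot that isn’t allowed to index a document may still crawl it. You can’t disallow both for the same document: a bot that is blocked from crawling never requests the document, so it never sees a noindex sent with it.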
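The interaction can be sketched as a small decision function. This is only a model of how a well-behaved bot might act under the rules described here, not the behavior of any particular crawler; the Policy type and bot_outcome function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Policy:
    crawl_disallowed: bool  # a robots.txt Disallow rule matches the URL
    noindex_header: bool    # the response would carry X-Robots-Tag: noindex


def bot_outcome(policy: Policy) -> Tuple[bool, bool]:
    """Return (crawled, may_be_indexed) for a well-behaved bot."""
    if policy.crawl_disallowed:
        # The bot never requests the document, so it never sees noindex.
        # It may still index the bare URL, e.g. from links on other pages.
        return (False, True)
    # The document is fetched, so the noindex header is seen and honored.
    return (True, not policy.noindex_header)


if __name__ == "__main__":
    for crawl_disallowed in (False, True):
        for noindex_header in (False, True):
            policy = Policy(crawl_disallowed, noindex_header)
            print(policy, bot_outcome(policy))
```

In this model, setting both flags gives the same result as setting only crawl_disallowed, which is why keeping a document out of an index means leaving it crawlable and sending noindex.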

¹ Note that the header is called X-Robots-Tag, not X-Robot-Tag. By the way, the metadata name robots (for the HTML meta element) is an alternative to the HTTP header.
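For the meta element alternative, a bot has to read the directives out of the HTML itself. Below is a minimal sketch using Python's standard html.parser; the sample page and the names RobotsMetaParser and meta_robots_directives are assumptions for illustration.

```python
from html.parser import HTMLParser
from typing import Set


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def meta_robots_directives(html: str) -> Set[str]:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this example.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body></body></html>"
)

if __name__ == "__main__":
    print(meta_robots_directives(PAGE))
```

The meta element only works for HTML documents, whereas the HTTP header can be sent with any kind of response.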

