What is robots.txt?
The Robots Exclusion Protocol (REP) is a group of web standards that regulates how web robots behave and how search engines index content.
This file is placed at the root of a website and tells crawlers which pages of the site they may access and which they may not. We can see a website's robots file by appending /robots.txt to the site's domain name. For example, see the following image:
We can decide which content to allow or disallow for Google and other crawlers. There are a few common rules:
By writing:
User-agent: *
Disallow: /
we block all web crawlers from all content.
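As a quick sanity check, rules like these can be tested with Python's standard-library robots.txt parser. The sketch below assumes the hypothetical domain www.example.com used throughout this post.

```python
# Sketch: verifying the "block everything" rules above with
# Python's standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Every crawler is disallowed from every path.
print(rp.can_fetch("Googlebot", "http://www.example.com/"))        # False
print(rp.can_fetch("AnyBot", "http://www.example.com/page.html"))  # False
```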
User-agent: Googlebot
Disallow: /no-google/
This blocks a specific web crawler (here, Googlebot) from a specific folder.
User-agent: Googlebot
Disallow: /no-google/blocked-page.html
This blocks a specific web crawler from a specific web page.
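The page-level rule above only affects the named crawler and the named path, which can be confirmed with the same standard-library parser (again using the hypothetical www.example.com domain and /no-google/ path from the example).

```python
# Sketch: only Googlebot is blocked, and only from the listed page.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /no-google/blocked-page.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot may not fetch the blocked page...
print(rp.can_fetch("Googlebot", "http://www.example.com/no-google/blocked-page.html"))  # False
# ...but other pages, and other crawlers, are unaffected.
print(rp.can_fetch("Googlebot", "http://www.example.com/other.html"))                   # True
print(rp.can_fetch("Bingbot", "http://www.example.com/no-google/blocked-page.html"))    # True
```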
Sitemap Parameter
User-agent: *
Disallow:
Sitemap: http://www.example.com/none-standard-location/sitemap.xml
This allows all crawlers everywhere and points them to a sitemap at a non-default location.
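An empty Disallow value means nothing is blocked, and the Sitemap line is picked up separately; both behaviors can be checked with the standard-library parser (Python 3.8+ for `site_maps()`).

```python
# Sketch: an empty "Disallow:" allows everything, and the sitemap
# URL is exposed by the parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow:
Sitemap: http://www.example.com/none-standard-location/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "http://www.example.com/any-page.html"))  # True
print(rp.site_maps())
# ['http://www.example.com/none-standard-location/sitemap.xml']
```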
Important Rules:
- In most cases, a meta robots tag with the "noindex, follow" directives should be used to restrict indexing while still letting crawlers follow links, rather than robots.txt.
- It is to be noted that malicious crawlers completely ignore robots.txt, so this protocol is not a good security mechanism.
- Only one "Disallow:" line is allowed for each URL.
- The filename "robots.txt" is case-sensitive: use "robots.txt", not "Robots.TXT".
- Spaces are not accepted as separators between parameters.