Example robots.txt for Yandex. Recommendations for setting up the robots.txt file. "Host:" and "Sitemap:" directives

The modern reality is that in the Runet no self-respecting site can do without a file called robots.txt. Even if you have nothing to prohibit from indexing (although almost every site has technical pages and duplicate content that should be closed from indexing), at a minimum it is worth specifying for Yandex which mirror - with www or without www - is the main one. That is what the rules for writing robots.txt, discussed below, are for.

What is robots.txt?

A file with this name dates back to 1994, when the developers of the first search robots agreed on a common standard (the Robots Exclusion Standard) so that sites could provide search engines with indexing instructions.

A file with this name must be saved in the root directory of the site; placing it in any other folders is not allowed.

The file performs the following functions:

  1. prohibits any pages or groups of pages from indexing
  2. allows any pages or groups of pages to be indexed
  3. indicates to the Yandex robot which site mirror is the main one (with www or without www)
  4. shows the location of the sitemap file

All four points are extremely important for the search engine optimization of a site. Blocking indexing allows you to hide pages that contain duplicate content - for example, tag pages, archives, search results, pages with printable versions, and so on. The presence of duplicate content (when the same text, even a few sentences long, appears on two or more pages) is a minus for the site in search engine rankings, so there should be as few duplicates as possible.
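
As a rough illustration (the paths below are hypothetical and depend on the CMS), the kind of rules this usually leads to looks like this:

User-agent: *
Disallow: /tag/      # tag pages duplicating article texts
Disallow: /archive/  # date archives
Disallow: /search/   # internal search results
Disallow: /print/    # printable versions of pages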

The allow directive has no independent meaning, since by default all pages are already available for indexing. It works in conjunction with disallow - when, for example, a certain category is completely closed from search engines, but you would like to open a particular page inside it.

Pointing to the main mirror of the site is also one of the most important elements in optimization: search engines view the sites www.yoursite.ru and yoursite.ru as two different resources, unless you directly tell them otherwise. The result is a doubling of content - the appearance of duplicates, a decrease in the strength of external links (external links can be placed both with www and without www) and as a result, this can lead to lower ranking in search results.

For Google, the main mirror is specified in the Webmaster tools (http://www.google.ru/webmasters/), but for Yandex these instructions can only be given in that same robots.txt.

Pointing to an xml file with a sitemap (for example, sitemap.xml) allows search engines to detect this file.

Rules for specifying User-agent

The user-agent in this case is the search engine. When writing instructions, you must indicate whether they will apply to all search engines (in which case an asterisk is indicated - *) or whether they are intended for a specific search engine - for example, Yandex or Google.

In order to set a User-agent indicating all robots, write the following line in your file:

User-agent: *

For Yandex:

User-agent: Yandex

For Google:

User-agent: GoogleBot

Rules for specifying disallow and allow

First, it should be noted that the robots.txt file must contain at least one disallow directive to be valid. Now let's look at the application of these directives using specific examples.

Using this code, you allow indexing of all pages of the site:

User-agent: *
Disallow:

And with this code, on the contrary, all pages will be closed:

User-agent: *
Disallow: /

To prohibit indexing of a specific directory called folder, specify:

User-agent: *
Disallow: /folder

You can also use asterisks to substitute an arbitrary name:

User-agent: *
Disallow: *.php

Important: here the asterisk stands in for the whole file name, that is, you do not write file*.php, only *.php (but then all pages with the .php extension will be prohibited; to avoid this, disallow the address of a specific page instead).

The allow directive, as stated above, is used to create exceptions in disallow (otherwise it has no meaning, since pages are already open by default).

For example, we will prohibit pages in the archive folder from being indexed, but we will leave the index.html page from this directory open:

Allow: /archive/index.html
Disallow: /archive/

Specify the host and sitemap

The host is the main mirror of the site (that is, the domain name plus www or the domain name without this prefix). The host is specified only for the Yandex robot (in this case, there must be at least one disallow command).

To specify a host, robots.txt must contain the following entry:

User-agent: Yandex
Disallow:
Host: www.yoursite.ru

As for the sitemap, in robots.txt the sitemap is indicated by simply writing the full path to the corresponding file, indicating the domain name:

Sitemap: http://yoursite.ru/sitemap.xml

How to make a sitemap for WordPress is described in a separate article.

Example robots.txt for WordPress

For WordPress, instructions must be specified in such a way as to close all technical directories (wp-admin, wp-includes, etc.) for indexing, as well as duplicate pages created by tags, RSS files, comments, and search.

As an example of robots.txt for WordPress, you can take the file from our website:

User-agent: Yandex
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /xmlrpc.php
Disallow: /search
Disallow: */trackback
Disallow: */feed/
Disallow: */feed
Disallow: */comments/
Disallow: /?feed=
Disallow: /?s=
Disallow: */page/*
Disallow: */comment
Disallow: */tag/*
Disallow: */attachment/*
Allow: /wp-content/uploads/
Host: www.yoursite.ru

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /xmlrpc.php
Disallow: /search
Disallow: */trackback
Disallow: */feed/
Disallow: */feed
Disallow: */comments/
Disallow: /?feed=
Disallow: /?s=
Disallow: */page/*
Disallow: */comment
Disallow: */tag/*
Disallow: */attachment/*
Allow: /wp-content/uploads/

Sitemap: https://www.yoursite.ru/sitemap.xml

You can download the robots.txt file from our website.

If after reading this article you still have any questions, ask in the comments!

1) What is a search robot?
2) What is robots.txt?
3) How to create robots.txt?
4) What and why can be written to this file?
5) Examples of robot names
6) Example of finished robots.txt
7) How can I check if my file is working?

1. What is a search robot?

A robot (crawler) keeps a list of URLs that it can index and regularly downloads the documents corresponding to them. If, while analyzing a document, the robot finds a new link, it adds it to its list. Thus, any document or site that is linked to can be found by the robot, and therefore by Yandex search.

2. What is robots.txt?

Search robots look for the robots.txt file on a website first of all. If your site has directories, content, etc. that you would like to hide from indexing (for example, the admin panel or other service pages that should not appear in search), then you should carefully study the instructions for working with this file.

robots.txt is a text file (.txt) located in the root (root directory) of your site. It contains instructions for search robots. These instructions may prohibit certain sections or pages of the site from being indexed, point to the correct "mirroring" of the domain, recommend that the search robot observe a certain time interval between downloading documents from the server, and so on.
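
To give a feel for what such instructions look like, here is a minimal sketch (the path and domain are made up) combining the three kinds of rules just mentioned:

User-agent: Yandex
Disallow: /admin/      # close a service section from indexing
Crawl-delay: 3         # ask for a pause of 3 seconds between downloads
Host: www.yoursite.ru  # point to the main mirror of the domain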

3. How to create robots.txt?

Creating robots.txt is very simple. Open a regular text editor (or right-click - create - text document), for example Notepad. Then create a text file and name it robots.txt.

4. What and why can be written to the robots.txt file?

Before you give a command to a search engine, you need to decide which bot it will be addressed to. The User-agent directive exists for this purpose.
Below are examples:

User-agent: * # the command written after this line will be addressed to all search robots
User-agent: YandexBot # access to the main Yandex indexing robot
User-agent: Googlebot # access to the main Google indexing robot

Allowing and prohibiting indexing
To allow and prohibit indexing there are two corresponding commands - Allow (allowed) and Disallow (prohibited).

User-agent: *
Disallow: /adminka/ # prohibits all robots from indexing the adminka directory, which supposedly contains the admin panel

User-agent: YandexBot # the command below will be addressed to Yandex
Disallow: / # we prohibit indexing of the entire site by the Yandex robot

User-agent: Googlebot # the command below will be addressed to Google
Allow: /images # allow all contents of the images directory to be indexed
Disallow: / # and everything else is prohibited

The order of the Allow and Disallow directives does not matter:

User-agent: *
Allow: /images
Disallow: /

User-agent: *
Disallow: /
Allow: /images
# in both cases files whose paths start with "/images"
# are allowed to be indexed

Sitemap Directive
This command specifies the address of your sitemap:

Sitemap: http://yoursite.ru/structure/my_sitemaps.xml # Indicates the sitemap address

Host directive
This command is inserted AT THE END of your file and denotes the main mirror:
1) it is written AT THE END of your file;
2) it is specified only once, otherwise only the first Host line is accepted;
3) it is indicated after the Allow or Disallow directives.

Host: www.yoursite.ru # mirror of your site

#If www.yoursite.ru is the main mirror of the site, then
#robots.txt for all mirror sites looks like this
User-Agent: *
Disallow: /images
Disallow: /include
Host: www.yoursite.ru

# by default Google ignores Host, you need to do this
User-Agent: * # index all
Disallow: /admin/ # disallow admin index
Host: www.mainsite.ru # indicate the main mirror
User-Agent: Googlebot # now commands for Google
Disallow: /admin/ # ban for Google

5. Examples of robot names

Yandex robots
Yandex has several types of robots that solve a variety of tasks: one is responsible for indexing images, another for indexing RSS data to collect information on blogs, yet another for multimedia data. The most important is YandexBot, which indexes the site to build the general search database (headings, links, text, etc.). There is also a robot for fast indexing (news and the like). The main names are listed below; a short sketch of addressing a specific robot follows the list.

YandexBot -- main indexing robot;
YandexMedia -- robot that indexes multimedia data;
YandexImages -- Yandex.Images indexer;
YandexCatalog -- "tapping" robot of Yandex.Catalog, used for temporarily removing inaccessible sites from publication in the Catalog;
YandexDirect -- Yandex.Direct robot, interprets robots.txt in a special way;
YandexBlogs -- blog search robot that indexes posts and comments;
YandexNews -- Yandex.News robot;
YandexPagechecker -- micro-markup validator;
YandexMetrika -- Yandex.Metrica robot;
YandexMarket -- Yandex.Market robot;
YandexCalendar -- Yandex.Calendar robot.
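
For example, a minimal sketch (the directory name is made up) that hides part of the pictures from the Yandex.Images indexer while leaving the main robot unrestricted could look like this:

User-agent: YandexImages
Disallow: /images/private/  # keep these pictures out of Yandex.Images

User-agent: YandexBot
Disallow:                   # the main indexing robot is not restricted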

6. Example of finished robots.txt

And so we come to the example of a finished file. I hope that after the examples above everything will be clear to you.

User-agent: *
Disallow: /admin/
Disallow: /cache/
Disallow: /components/

User-agent: Yandex
Disallow: /admin/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/

Sitemap: http://yoursite.ru/structure/my_sitemaps.xml

Robots.txt is a text file (a document in .txt format) containing clear instructions for indexing a specific site. In other words, this file tells search engines which pages of a web resource need to be indexed and which should be prohibited from indexing.

It would seem, why prohibit the indexing of some site content? Let the search robot index everything indiscriminately, guided by the principle: the more pages, the better! Only an amateur SEO specialist would reason this way.

Not all the content that makes up a website is needed by search robots. There are system files, there are duplicate pages, there are keyword rubrics, and there is much more that does not necessarily need to be indexed. Otherwise, the following situation cannot be ruled out.

When a search robot comes to your site, the first thing it does is try to find the notorious robots.txt. If this file is not found, or is found but compiled incorrectly (without the necessary prohibitions), the search engine's "messenger" begins to study the site at its own discretion.

In the process of such study, it indexes everything, and it is far from a fact that it starts with the pages that need to get into the search first (new articles, reviews, photo reports, and so on). Naturally, in this case, the indexing of a new site may take a while.

In order to avoid such an unenviable fate, the webmaster needs to take care of creating a correct robots.txt file.

“User-agent:” is the main directive of robots.txt

In practice, directives (commands) are written in robots.txt using special terms, the main of which can be considered the "User-agent:" directive. It is used to specify the search robot that will be given certain instructions further on. For example:

  • User-agent: Googlebot - all commands that follow this basic directive will relate exclusively to the Google search engine (its indexing robot);
  • User-agent: Yandex - the addressee in this case is the domestic search engine Yandex.

The robots.txt file can also be used to address all other search engines at once. The command in this case will look like this: User-agent: *. The special character "*" usually means "any text"; in our case, any search engine other than Yandex. Google, by the way, also takes this directive as addressed to itself, unless you address it by name separately. A sketch of how this works is shown below.
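
A minimal sketch (the blocked paths are made up) of how such addressing combines in one file: each robot follows only the group written for it, and the "*" group covers everyone else.

User-agent: *        # all robots that have no group of their own
Disallow: /search/

User-agent: Yandex   # Yandex skips the group above and follows only this one
Disallow: /search/
Disallow: /tag/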

“Disallow:” command – prohibiting indexing in robots.txt

The main "User-agent:" directive addressed to search engines can be followed by specific commands. Among them, the most common is the "Disallow:" directive. Using this command, you can prohibit the search robot from indexing the whole web resource or some part of it. It all depends on what extension this directive is given. Let's look at examples:

User-agent: Yandex
Disallow: /

This kind of entry in the robots.txt file means that the Yandex search robot is not allowed to index this site at all, since the prohibitory sign “/” stands alone and is not accompanied by any clarifications.

User-agent: Yandex
Disallow: /wp-admin

As you can see, this time there are clarifications, and they concern the wp-admin system folder in WordPress. That is, the indexing robot, following this command (the path specified in it), will refuse to index this entire folder.

User-agent: Yandex
Disallow: /wp-content/themes

Such an instruction to the Yandex robot admits it into the large "wp-content" directory, in which it may index all contents except "themes".

Let’s explore the “forbidden” capabilities of the robots.txt text document further:

User-agent: Yandex
Disallow: /index$

In this command, as the example shows, another special sign, "$", is used. It tells the robot that it may not index the page whose address ends exactly in "index"; at the same time, a separate file of the site with a similar name, such as "index.php", is not prohibited. Thus, the "$" symbol is used when a selective approach to prohibiting indexing is needed.

Also, in the robots.txt file, you can prohibit indexing of individual resource pages that contain certain characters. It might look like this:

User-agent: Yandex
Disallow: *&*

This command tells the Yandex search robot not to index all those pages of the website whose URLs contain the "&" character appearing between any other symbols. However, there may be another situation:

User-agent: Yandex
Disallow: *&

Here the indexing ban applies to all those pages whose links end in “&”.

If the ban on indexing a site's system files raises no questions, questions may arise about banning the indexing of individual pages of the resource. Why is this necessary at all? An experienced webmaster may have many considerations here, but the main one is the need to get rid of duplicate pages in the search. Using the "Disallow:" command and the group of special characters discussed above, you can deal with "undesirable" pages quite simply.
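
For instance, a sketch (the URL patterns are made up and depend on the CMS) that removes typical duplicate pages from indexing:

User-agent: Yandex
Disallow: /*?s=      # internal search results
Disallow: /*&sort=   # the same listings re-ordered by a sort parameter
Disallow: /*/print/  # printable versions of pages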

“Allow:” command – allowing indexing in robots.txt

The antipode of the previous directive is the "Allow:" command. Using the same clarifying elements, but with this command in the robots.txt file, you can allow the indexing robot to add the site elements you need to the search database. To confirm this, here is another example:

User-agent: Yandex
Allow: /wp-admin

For some reason the webmaster changed his mind and made the corresponding adjustments to robots.txt. As a consequence, from now on the contents of the wp-admin folder are officially allowed for indexing by Yandex.

Even though the Allow: command exists, it is not used very often in practice. By and large, there is no need for it, since it is applied automatically. The site owner just needs to use the “Disallow:” directive, prohibiting this or that content from being indexed. After this, all other content of the resource that is not prohibited in the robots.txt file is perceived by the search robot as something that can and should be indexed. Everything is like in jurisprudence: “Everything that is not prohibited by law is permitted.”

"Host:" and "Sitemap:" directives

The overview of important directives in robots.txt is completed by the "Host:" and "Sitemap:" commands. As for the first, it is intended exclusively for Yandex and indicates to it which site mirror (with or without www) is considered the main one. For example, it might look like this:

User-agent: Yandex
Host: website

User-agent: Yandex
Host: www.site

Using this command also avoids unnecessary duplication of site content.

In turn, the "Sitemap:" directive points the indexing robot to the correct path to the so-called Site Map - the files sitemap.xml and sitemap.xml.gz (in the case of the WordPress CMS). A hypothetical example might be:

User-agent: *
Sitemap: http://site/sitemap.xml
Sitemap: http://site/sitemap.xml.gz

Writing this command in the robots.txt file will help the search robot index the Site Map faster. This, in turn, will also speed up the process of getting web resource pages into search results.

The robots.txt file is ready - what next?

Let's assume that you, as a novice webmaster, have mastered the entire array of information given above. What to do next? Create a robots.txt text document taking into account the characteristics of your site. To do this you need to:

  • use a text editor (for example, Notepad) to compile the robots.txt you need;
  • check the correctness of the created document, for example, using the robots.txt analysis tool in Yandex.Webmaster;
  • using an FTP client, upload the finished file to the root folder of your site (in the case of WordPress this is usually the public_html folder).

Yes, we almost forgot. A novice webmaster will no doubt want to first look at ready-made examples of this file made by others. Nothing could be simpler. To do this, just enter site.ru/robots.txt in the address bar of your browser, where "site.ru" is the name of the resource you are interested in. That's all.

Happy experimenting and thanks for reading!

Hello! There was a time in my life when I knew absolutely nothing about creating websites, and certainly had no idea about the existence of the robots.txt file.

When a simple interest grew into a serious hobby, the strength and desire to study all the intricacies appeared. On the forums you can find many topics related to this file. Why? It's simple: robots.txt regulates search engines' access to the site and manages indexing, and this is very important!

Robots.txt is a text file designed to limit search robots’ access to sections and pages of the site that need to be excluded from crawling and search results.

Why hide certain website content? It is unlikely that you will be happy if a search robot indexes site administration files, which may contain passwords or other sensitive information.

There are various directives to regulate access:

  • User-agent - user agent for which access rules are specified,
  • Disallow - denies access to the URL,
  • Allow - allows access to the URL,
  • Sitemap - indicates the path to the sitemap file,
  • Crawl-delay - sets the URL crawling interval (only for Yandex),
  • Clean-param - ignores dynamic URL parameters (only for Yandex),
  • Host - indicates the main mirror of the site (only for Yandex).

Please note that as of March 20, 2018, Yandex officially stopped supporting the Host directive. It can be removed from robots.txt, and if left, the robot will simply ignore it.

The file must be located in the root directory of the site. If the site has subdomains, then its own robots.txt is compiled for each subdomain.

You should always remember security. This file can be viewed by anyone, so there is no need to spell out an explicit path to administrative resources (control panels, etc.) in it. As they say, the less you know, the better you sleep. Therefore, if there are no links to a page and you do not want it indexed, then you do not need to list it in robots.txt - no one will find it anyway, not even spider robots.
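
For example (the path is made up), a single careless line in a public robots.txt tells the whole world exactly where the control panel lives:

Disallow: /my-secret-admin-2017/  # anyone who opens robots.txt now knows this path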

When a search robot crawls a site, it first checks for the presence of the robots.txt file on the site and then follows its directives when crawling pages.

I would like to note right away that search engines treat this file differently. For example, Yandex unconditionally follows its rules and excludes prohibited pages from indexing, while Google perceives this file as a recommendation and nothing more.

To prohibit indexing of pages, it is possible to use other means:

  • a redirect, or password protection of a directory, using the .htaccess file,
  • the noindex meta tag (not to be confused with the <noindex> tag used to prohibit indexing of part of the text),
  • the rel="nofollow" attribute for links, as well as removing links to unnecessary pages.

At the same time, Google can successfully add pages that are prohibited from indexing to search results, despite all the restrictions. Its main argument is that if a page is linked to, then it can appear in search results. In this case, it is recommended not to link to such pages - but excuse me, the robots.txt file is precisely intended to exclude such pages from the search results... In my opinion, there is no logic here 🙄

Removing pages from search

If the prohibited pages were indexed anyway, then you need to use Google Search Console and its URL removal tool:

A similar tool is available in Yandex Webmaster. Read more about removing pages from the search engine index in a separate article.

Checking robots.txt

Continuing with Google, you can use another Search Console tool to check whether the robots.txt file is compiled correctly to prevent certain pages from being indexed:

To do this, simply enter the URLs that need to be checked in the text field and click the Check button - as a result of the check, it will be revealed whether this page is prohibited from indexing or whether its content is accessible to search robots.

Yandex also has a similar tool located in Webmaster, the check is carried out in a similar way:

If you don’t know how to create a file correctly, then simply create an empty text document with the name robots.txt, and as you study the features of the CMS and site structure, supplement it with the necessary directives.
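
A minimal starting point (the domain is only a placeholder) that blocks nothing and only declares the sitemap could look like this:

User-agent: *
Disallow:

Sitemap: http://yoursite.ru/sitemap.xml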

How to compile the file properly is covered in a separate article. See you!

Fill in all the required fields one by one. As you enter your settings, you will see your robots.txt being filled with directives. All the directives of the robots.txt file are described in detail below.

Select the text, copy it and paste it into a text editor. Save the file as "robots.txt" in the root directory of your site.

Description of the robots.txt file format

The robots.txt file consists of records, each of which consists of two fields: a line with the name of the client application (user-agent), and one or more lines starting with the Disallow directive, written in the format:

directive ":" value

Robots.txt must be created in Unix text format. Most good text editors can already convert Windows line endings to Unix ones, or your FTP client should be able to do this. Do not try to use an HTML editor for editing, especially one that does not have a plain-text code mode.

Directive User-agent:

For Rambler:
User-agent: StackRambler

For Yandex:
User-agent: Yandex

For Google:
User-Agent: googlebot

You can create instructions for all robots:

User-agent: *

Directive Disallow:

The second part of the record consists of Disallow lines. These lines are directives (instructions, commands) for the given robot. Each group introduced by a User-agent line must have at least one Disallow instruction. The number of Disallow instructions is unlimited. They tell the robot which files and/or directories it is not allowed to index. You can prevent either a file or a directory from being indexed.

The following directive prohibits indexing of the /cgi-bin/ directory:

Disallow: /cgi-bin/

Note the / at the end of the directory name! To prohibit visiting specifically the directory "/dir", the instruction should look like "Disallow: /dir/". The line "Disallow: /dir" prohibits visiting all server pages whose full name (from the server root) begins with "/dir", for example: "/dir.html", "/dir/index.html", "/directory.html".

The directive written as follows prohibits indexing of the index.htm file located in the root:

Disallow: /index.htm

Directive Allow:

The Allow directive is not part of the original standard; Yandex understands it, but some other robots may not.

User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# prohibits downloading everything except pages starting with "/cgi-bin"

For robots that do not understand Allow you will have to list all closed documents explicitly. Think over the structure of the site so that documents closed for indexing are, if possible, collected in one place.

If the Disallow directive is empty, this means that the robot can index ALL files. At least one Disallow directive must be present for each User-agent field for robots.txt to be considered valid. A completely empty robots.txt means the same as if it didn’t exist at all.

The Rambler robot understands * as any symbol, so the Disallow: * instruction means prohibiting indexing of the entire site.

Allow and Disallow directives without parameters. The absence of parameters for the Allow and Disallow directives is interpreted as follows:

User-agent: Yandex
Disallow: # same as Allow: /

User-agent: Yandex
Allow: # same as Disallow: /

Using special characters "*" and "$".
When specifying the paths of the Allow-Disallow directives, you can use the special characters "*" and "$", thus specifying certain regular expressions. The special character "*" means any (including empty) sequence of characters. Examples:

User-agent: Yandex
Disallow: /cgi-bin/*.aspx # prohibits "/cgi-bin/example.aspx" and "/cgi-bin/private/test.aspx"
Disallow: /*private # prohibits not only "/private", but also "/cgi-bin/private"

Special character "$".
By default, a "*" is appended to the end of each rule described in robots.txt, for example:

User-agent: Yandex
Disallow: /cgi-bin* # blocks access to pages starting with "/cgi-bin"
Disallow: /cgi-bin # the same thing

To cancel the "*" at the end of a rule, you can use the special character "$", for example:

User-agent: Yandex
Disallow: /example$ # prohibits "/example", but does not prohibit "/example.html"

User-agent: Yandex
Disallow: /example # prohibits both "/example" and "/example.html"

User-agent: Yandex
Disallow: /example$ # prohibits only "/example"
Disallow: /example*$ # the same as "Disallow: /example": prohibits both /example.html and /example

Directive Host.

If your site has mirrors, a special mirror robot will identify them and form a group of mirrors for your site. Only the main mirror will participate in the search. You can specify it in robots.txt using the "Host" directive, giving the name of the main mirror as its parameter. The "Host" directive does not guarantee the selection of the specified main mirror, but the algorithm takes it into account with high priority when making a decision. Example:

# If www.glavnoye-zerkalo.ru is the main mirror of the site, then
# robots.txt for www.neglavnoye-zerkalo.ru looks like this
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: www.glavnoye-zerkalo.ru

For compatibility with robots that do not fully follow the standard when processing robots.txt, the "Host" directive must be added in the group starting with the "User-Agent" record, immediately after the "Disallow" ("Allow") directives. The argument of the "Host" directive is a domain name followed by a port number (80 by default), separated by a colon. The Host directive parameter must consist of one valid host name (that is, one that complies with RFC 952 and is not an IP address) and a valid port number. Incorrectly composed "Host:" lines are ignored.

Examples of ignored Host directives:

Host: www.myhost-.ru
Host: www.-myhost.ru
Host: www.myhost.ru:100000
Host: www.my_host.ru
Host: .my-host.ru:8000
Host: my-host.ru.
Host: my..host.ru
Host: www.myhost.ru/
Host: www.myhost.ru:8080/
Host: 213.180.194.129
Host: www.firsthost.ru,www.secondhost.ru # one line - one domain!
Host: www.firsthost.ru www.secondhost.ru # one line - one domain!
Host: crew-communication.rf # you need to use punycode

Directive Crawl-delay

Sets the timeout in seconds with which the search robot downloads pages from your server (Crawl-delay).

If the server is heavily loaded and does not have time to process download requests, use the "Crawl-delay" directive. It allows you to set the search robot a minimum period of time (in seconds) between the end of downloading one page and the start of downloading the next. For compatibility with robots that do not fully follow the standard when processing robots.txt, the "Crawl-delay" directive must be added in the group starting with the "User-Agent" entry, immediately after the "Disallow" ("Allow") directives.

The Yandex search robot supports fractional Crawl-Delay values, for example, 0.5. This does not guarantee that the search robot will visit your site every half second, but it gives the robot more freedom and allows it to crawl the site faster.

User-agent: Yandex
Crawl-delay: 2 # sets the timeout to 2 seconds

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets the timeout to 4.5 seconds

Directive Clean-param

A directive for excluding parameters from the address bar, i.e. requests containing such a parameter and requests not containing it will be considered identical.
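
A minimal sketch (the parameter name and path are made up): if /catalog/page.html?ref=site_1 and /catalog/page.html?ref=site_2 show the same content, the ref parameter can be declared insignificant so that the robot treats these URLs as one page:

User-agent: Yandex
Clean-param: ref /catalog/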

Blank lines and comments

Blank lines are allowed between groups of instructions entered by the User-agent.

The Disallow statement is only taken into account if it is subordinate to any User-agent line - that is, if there is a User-agent line above it.

Any text from the hash sign "#" to the end of the line is considered a comment and is ignored.

Example:

The following simple robots.txt file prohibits all robots from indexing all pages of the site, except for the Rambler robot, which, on the contrary, is allowed to index all pages of the site.

# Instructions for all robots
User-agent: *
Disallow: /

# Instructions for the Rambler robot
User-agent: StackRambler
Disallow:

Common mistakes:

Inverted syntax:

User-agent: /
Disallow: StackRambler

It should be like this:

User-agent: StackRambler
Disallow: /

Several Disallow directives in one line:

Disallow: /css/ /cgi-bin/ /images/

Correct:

Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
    Notes:
  1. It is unacceptable to have empty line breaks between the "User-agent" and "Disallow" ("Allow") directives, as well as between the "Disallow" ("Allow") directives themselves.
  2. According to the standard, it is recommended to insert an empty line feed before each "User-agent" directive.