What is a robots.txt file? - Sourabh Nagori

Monday, 4 July 2016

What is a robots.txt file?

What is robots.txt file by theseofeed

  • The robots.txt file is a simple text file placed on your web server that tells web crawlers like Googlebot whether or not they should access a file.
  • Basic robots.txt examples

    Here are some common robots.txt setups (they will be explained in detail below).
    Allow full access     
    User-agent: *
    Disallow:
    Block all access
    User-agent: *
    Disallow: /
    Block one folder
    User-agent: *
    Disallow: /folder/
    Block one file
    User-agent: *
    Disallow: /file.html
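The four setups above can be tried out without touching a live site. Here is a minimal sketch of one way to do that, assuming Python's standard urllib.robotparser module. The helper name is_allowed, the robot name ExampleBot and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The four setups from the list above, keyed by their labels.
SETUPS = {
    "Allow full access": "User-agent: *\nDisallow:",
    "Block all access": "User-agent: *\nDisallow: /",
    "Block one folder": "User-agent: *\nDisallow: /folder/",
    "Block one file": "User-agent: *\nDisallow: /file.html",
}


def is_allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Print, for each setup, which of the sample paths may be fetched.
    for label, text in SETUPS.items():
        results = {p: is_allowed(text, p) for p in ("/", "/folder/page.html", "/file.html")}
        print(label, results)
```

Python's parser is only a stand-in for a real search engine spider here, so treat the results as a quick sanity check rather than the final word.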

    Why should you learn about robots.txt?

    • Improper usage of the robots.txt file can hurt your ranking
    • The robots.txt file controls how search engine spiders see and interact with your webpages
    • This file is mentioned in several of the Google guidelines
    • This file, and the bots it interacts with, are fundamental parts of how search engines work
    Tip: To see if your robots.txt is blocking any important files used by Google, use the Google guidelines tool.

    Search engine spiders

    Before a search engine spider like Googlebot crawls the pages of a site, the first thing it looks at is the robots.txt file.
    It does this because it wants to know if it has permission to access that page or file. If the robots.txt file says it can enter, the search engine spider then continues on to the page files.
    If you have instructions for a search engine robot, the robots.txt file is where you give them.

    Priorities for your website

    There are three important things that any webmaster should do when it comes to the robots.txt file.
    • Determine if you have a robots.txt file
    • If you have one, make sure it is not harming your ranking or blocking content you don't want blocked
    • Determine if you need a robots.txt file

    Determining if you have a robots.txt

    You can check from any browser. The robots.txt file is always located in the same place on any website, so it is easy to determine if a site has one. Just add "/robots.txt" to the end of a domain name as shown below.
    www.yourwebsite.com/robots.txt
    If you have a file there, it is your robots.txt file. You will either find a file with words in it, find a file with no words in it, or not find a file at all.
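The same check can be scripted. The sketch below only builds the robots.txt address from a page URL and names the three outcomes described above. It does not make a network request. The function names robots_url and describe_result are made up, the page URL must include its scheme (http:// or https://), and treating an HTTP 404 as "no file" is an assumption about how the server answers.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves `page_url`."""
    parts = urlsplit(page_url)
    # robots.txt always sits at the root of the host, whatever the page path is.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def describe_result(status: int, body: Optional[str]) -> str:
    """Map an HTTP status and response body to one of the three outcomes."""
    if status == 404 or body is None:
        return "no file"
    if body.strip() == "":
        return "file with no words in it"
    return "file with words in it"


if __name__ == "__main__":
    print(robots_url("https://www.yourwebsite.com/blog/post.html"))
```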

    Determine if your robots.txt is blocking important files

    You can use the Google guidelines tool, which will warn you if you are blocking certain page resources that Google needs to understand your pages.
    If you have access and permission, you can use Google Search Console to test your robots.txt file. Instructions to do so are found here (tool not public - requires login).
    To be sure your robots.txt file is not blocking anything you want crawled, you need to understand what it is saying. We cover that below.

    Do you need a robots.txt file?

    You may not even need to have a robots.txt file on your site. In fact it is often the case you do not need one.
    Reasons you may want to have a robots.txt file:
    • You have content you want blocked from search engines
    • You are using paid links or advertisements that need special instructions for robots
    • You want to fine tune access to your site from reputable robots
    • You are developing a site that is live, but you do not want search engines to index it yet
    • It helps you follow some Google guidelines in certain situations
    • You need some or all of the above, but do not have full access to your webserver and how it is configured
    Each of the above situations can be controlled by other methods, however the robots.txt file is a good central place to take care of them and most webmasters have the ability and access required to create and use a robots.txt file.
    Reasons you may not want to have a robots.txt file:
    • Not having one is simple and error free
    • You do not have any files you want or need to be blocked from search engines
    • You do not find yourself in any of the situations listed in the above reasons to have a robots.txt file
    It is okay to not have a robots.txt file.
    When you do not have a robots.txt file the search engine robots like Googlebot will have full access to your site. This is a normal and simple method that is very common.

    How to make a robots.txt file

    If you can type or copy and paste, you can also make a robots.txt file.
    The file is just a text file, which means that you can use notepad or any other plain text editor to make one. You can also make them in a code editor. You can even "copy and paste" them.
    Instead of thinking "I am making a robots.txt file", just think "I am writing a note". They are pretty much the same process.
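For anyone who would rather script the "note", here is a minimal sketch that writes the two-line allow-everything file used in this article to disk. The function name write_robots_txt and the target folder are assumptions. On a real site the file has to end up in the web root so that it is served at /robots.txt.

```python
from pathlib import Path

# The allow-everything rules used as an example in this article.
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def write_robots_txt(folder: str, rules: str = ALLOW_ALL) -> Path:
    """Write `rules` as plain text to <folder>/robots.txt and return the path."""
    target = Path(folder) / "robots.txt"
    target.write_text(rules, encoding="utf-8")
    return target
```

Calling write_robots_txt(".") would create the file in the current folder. Uploading it to the server is a separate step that depends on your hosting.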

    What should the robots.txt say?

    That depends on what you want it to do.
    All robots.txt instructions result in one of the following three outcomes:
    • Full allow: All content may be crawled.
    • Full disallow: No content may be crawled.
    • Conditional allow: The directives in the robots.txt determine the ability to crawl certain content.
    Let's explain each one.

    Full allow - all content may be crawled

    Most people want robots to visit everything in their website. If this is the case with you, and you want the robot to index all parts of your site, there are three options to let the robots know that they are welcome.
    1) Do not have a robots.txt file
    If your website does not have a robots.txt file then this is what happens...
    A robot like Googlebot comes to visit. It looks for the robots.txt file. It does not find it because it isn't there. The robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.
    2) Make an empty file and call it robots.txt
    If your website has a robots.txt file that has nothing in it then this is what happens...
    A robot like Googlebot comes to visit. It looks for the robots.txt file. It finds the file and reads it. There is nothing to read, so the robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.
    3) Make a file called robots.txt and write the following two lines in it...
    User-agent: *
    Disallow:
    If your website has a robots.txt with these instructions in it then this is what happens...
    A robot like Googlebot comes to visit. It looks for the robots.txt file. It finds the file and reads it. It reads the first line. Then it reads the second line. The robot then feels free to visit all your web pages and content because this is what you told it to do (I explain this below).
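The three options above should all lead to the same answer. The sketch below illustrates that with Python's standard urllib.robotparser module standing in for a robot. The function name may_crawl, the robot name ExampleBot and the use of None to mean "the site has no robots.txt file" are assumptions made for this example.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: Optional[str], path: str, agent: str = "ExampleBot") -> bool:
    """Decide whether `agent` may crawl `path`.

    `robots_txt` is None when the site has no robots.txt file at all.
    """
    if robots_txt is None:
        # Option 1: no file, so there are no restrictions to apply.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The three "full allow" options described above.
OPTIONS = {
    "1) no file": None,
    "2) empty file": "",
    "3) explicit allow": "User-agent: *\nDisallow:",
}

if __name__ == "__main__":
    for label, text in OPTIONS.items():
        print(label, may_crawl(text, "/any/page.html"))
```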

    Full disallow - no content may be crawled

    Warning: This means that Google and other search engines will not crawl your webpages, so their content will not be indexed or displayed in search results.
    To block all reputable search engines spiders from your site you would have these instructions in your robots.txt:
    User-agent: *
    Disallow: /
    It is not recommended to do this as it will result in none of your web pages being indexed.

    The robots.txt instructions and their meanings

    Here is an explanation of what the different words mean in a robots.txt file:

    User-agent

    User-agent:
    The "User-agent" part is there to specify directions to a specific robot if needed. There are two ways to use this in your file.
    If you want to tell all robots the same thing, you put a " * " after the "User-agent". It would look like this...
    User-agent: *
    The above line is saying "these directions apply to all robots".
    If you want to tell a specific robot something (in this example Googlebot) it would look like this...
    User-agent: Googlebot
    The above line is saying "these directions apply to just Googlebot".
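Both kinds of "User-agent" line can appear in the same file. The sketch below shows how a parser might pick the right set of directions for each robot, using Python's standard urllib.robotparser module. The file contents, the /drafts/ folder and the robot name ExampleBot are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for Googlebot, one for every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot follows its own group, so only /drafts/ is off limits for it.
googlebot_home = parser.can_fetch("Googlebot", "/index.html")
googlebot_drafts = parser.can_fetch("Googlebot", "/drafts/post.html")
# A robot without its own group falls back to the "*" group.
other_home = parser.can_fetch("ExampleBot", "/index.html")

print(googlebot_home, googlebot_drafts, other_home)
```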

    Disallow:

    The "Disallow" part is there to tell the robots what folders they should not look at. This means that if, for example, you do not want search engines to index the photos on your site, then you can place those photos into one folder and exclude it.
    Let's say that you have put all these photos into a folder called "photos". Now you want to tell search engines not to index that folder.
    Here is what your robots.txt file should look like in that scenario:
    User-agent: *
    Disallow: /photos
    The above two lines of text in your robots.txt file would keep robots from visiting your photos folder. The "User-agent: *" part is saying "this applies to all robots". The "Disallow: /photos" part is saying "don't visit or index my photos folder".
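One detail worth knowing is that a Disallow value is compared against the start of the path, so "/photos" without a trailing slash also catches other paths that merely begin with those characters. The sketch below illustrates this with Python's standard urllib.robotparser module. The helper name blocked, the robot name ExampleBot and the file names are made up.

```python
from urllib.robotparser import RobotFileParser


def blocked(disallow_value: str, path: str) -> bool:
    """Return True if `path` is blocked for all robots by one Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return not parser.can_fetch("ExampleBot", path)


# "/photos" is matched as a prefix of the path...
print(blocked("/photos", "/photos/beach.jpg"))      # True
print(blocked("/photos", "/photoshop-tips.html"))   # True
# ...while "/photos/" only matches paths inside the folder.
print(blocked("/photos/", "/photos/beach.jpg"))     # True
print(blocked("/photos/", "/photoshop-tips.html"))  # False
```

If only the folder should be excluded, writing the rule with the trailing slash is the safer choice.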

    Googlebot specific instructions

    The robot that Google uses to index their search engine is called Googlebot. It understands a few more instructions than other robots.
    In addition to "User-agent" and "Disallow", Googlebot also uses the Allow instruction.

    Allow

    Allow:
    The "Allow:" instruction lets you tell a robot that it is okay to see a file in a folder that has been "Disallowed" by other instructions. To illustrate this, let's take the above example of telling the robot not to visit or index your photos. We put all the photos into one folder called "photos" and we made a robots.txt file that looked like this...
    User-agent: *
    Disallow: /photos
    Now let's say there was a photo called mycar.jpg in that folder that you want Googlebot to index. With the Allow: instruction, we can tell Googlebot to do so. It would look like this...
    User-agent: *
    Disallow: /photos
    Allow: /photos/mycar.jpg
    This would tell Googlebot that it can visit "mycar.jpg" in the "photos" folder, even though the "photos" folder is otherwise excluded.
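How a robot settles the clash between the Disallow and Allow lines depends on the robot. Google documents that the most specific (longest) matching rule wins, and that Allow wins a tie. The sketch below is a simplified illustration of that idea for plain path prefixes only. It ignores wildcards such as * and $, the function name is made up, and it is not Googlebot's actual code.

```python
from typing import List, Tuple

# (directive, path) pairs from the example above.
RULES: List[Tuple[str, str]] = [
    ("disallow", "/photos"),
    ("allow", "/photos/mycar.jpg"),
]


def is_allowed_longest_match(rules: List[Tuple[str, str]], path: str) -> bool:
    """Apply the longest matching rule; on a tie, prefer Allow."""
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules do not restrict anything
        is_allow = directive == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


if __name__ == "__main__":
    for path in ("/photos/mycar.jpg", "/photos/other.jpg", "/about.html"):
        print(path, is_allowed_longest_match(RULES, path))
```

Not every parser works this way. As far as I can tell, Python's own urllib.robotparser applies the first matching line instead, so the order of the lines can matter for other robots.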

    Testing your robots.txt file

    To find out if an individual page is blocked by robots.txt, you can use the Webmaster tool, which will tell you if files important to Google are being blocked and will also display the content of the robots.txt file.
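For a quick local check before reaching for Google's tools, a short script can list which of your important files a given robots.txt would block. This is only a sketch under stated assumptions: it uses Python's standard urllib.robotparser module, which may not match Google's own parsing in every case, and the function name blocked_paths, the sample rules and the file list are hypothetical.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, paths: Iterable[str], agent: str = "Googlebot") -> List[str]:
    """Return the paths from `paths` that `agent` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


# Hypothetical robots.txt and page resources.
ROBOTS_TXT = "User-agent: *\nDisallow: /assets/\n"
IMPORTANT = ["/index.html", "/assets/site.css", "/assets/app.js"]

print(blocked_paths(ROBOTS_TXT, IMPORTANT))
```

Here the stylesheet and script would be reported as blocked, which is the kind of problem the tools above are meant to catch.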