What is Robots.txt? How is it used for search engines?
“Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it’s current for your site so that you don’t accidentally block the Googlebot crawler.” – from the Google webmaster guidelines
Web Robots (also affectionately known as Web Wanderers, Crawlers and Spiders) are programs that ‘crawl’ the interwebs automatically. Search engines like Google use them to index content from websites. Generally, this is okay as it allows search engines to find your webpage but, like all things in life, there is a dark side. Spammers (ew!) can also use them to harvest email addresses and carry out other malicious activities. Use robots.txt for good and you won’t be disappointed.
You can edit robots.txt to allow or disallow certain robots from crawling your site. Website owners (like yourself) can use the /robots.txt file to give web robots instructions about the website. In technical terms, this is called the Robots Exclusion Protocol. To you and me, it just means that this little file helps search engines crawl the right parts of your website so I can find your business and buy your products/services, which makes my life better/easier and makes you money. Win!
The /robots.txt file started out as a publicly accepted de facto standard that was not owned by any standards body (the IETF only formalised it as RFC 9309 in 2022). Here are a few historical descriptions:
The original 1994 document A Standard for Robot Exclusion
How Robots Work
It works like this: a robot wants to visit a website URL, something like http://www.example.com/oh-hey.html. Before the robot does this, though, it first checks for http://www.example.com/robots.txt and finds this snippet of info:
User-agent: *
Disallow: /
The “User-agent: *” means that this section of code applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the website.
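If you want to see that rule in action, here is a minimal sketch using Python’s standard urllib.robotparser module. The is_allowed helper and the “ExampleBot” robot name are made up for illustration; a real crawler may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The two-line file described above: applies to every robot, blocks every page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network request
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a hypothetical robot name used only for this sketch.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/oh-hey.html"))
# prints False
```

Because “Disallow: /” matches every path on the site, any page you pass in should come back as not allowed.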
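To allow robots to crawl everything on your site, you would use:
User-agent: *
Disallow:
To stop robots from crawling a specific folder, you would use (here “/private/” is just a placeholder folder name):
User-agent: *
Disallow: /private/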
And so on.
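The two cases above (allow everything, block one folder) can be checked the same way. This is only a sketch: the “/private/” folder, the can_crawl helper and the “ExampleBot” name are assumptions for illustration, and the rule text is the conventional form of each case rather than anything specific to your site.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow line means nothing is off limits.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# "/private/" is a placeholder folder name for this sketch.
BLOCK_FOLDER = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Check a URL against robots.txt text using the standard library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for name, rules in (("allow all", ALLOW_ALL), ("block folder", BLOCK_FOLDER)):
    for url in ("http://www.example.com/store/index.html",
                "http://www.example.com/private/page.html"):
        print(name, url, can_crawl(rules, url))
```

Only the page under the blocked folder should be refused, and only when the second set of rules is in force.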
Robots.txt essentially tells search engine spiders how to interact with your website, which is why well-behaved spiders are so easy to control.
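Rules can also target one robot by name while leaving everyone else alone. The sketch below assumes a made-up robot called “BadBot” and again leans on Python’s urllib.robotparser; treat it as an illustration of the idea, not a guarantee of how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# One section for a single named robot, one catch-all section for the rest.
# "BadBot" is a hypothetical robot name used only for this example.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow:
"""


def allowed_for(user_agent: str, url: str) -> bool:
    """Return True if the named robot may fetch url under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://www.example.com/oh-hey.html"
print("BadBot:", allowed_for("BadBot", page))        # prints BadBot: False
print("Googlebot:", allowed_for("Googlebot", page))  # prints Googlebot: True
```

The named section wins for the robot it names, and every other robot falls through to the “*” section.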
If you do not have a robots.txt file on your server, a spider’s request for /robots.txt will come back with a 404 error. Most crawlers treat that as “no restrictions” and crawl everything they can reach, which may not be what you want. You want to keep your website tight and right, so make sure you include a robots.txt file that says exactly what you want crawled so people can find and access your website.
Note: Robots can ignore your /robots.txt file, especially malware robots that scan the interwebs for security weaknesses and spammers on the hunt for email addresses. The /robots.txt file is also publicly available, so anyone can see which sections of your server you don’t want robots to visit. Therefore, you cannot use /robots.txt to hide info.
Not that you would need to, of course. 🙂
Where to put it
The robots.txt file goes in the top-level section of your web server, usually in the same place where you put your website’s main “index.html” welcome page.
When a robot looks for the “/robots.txt” file for your URL, it strips everything from the first single slash of the URL (that is, the path after the domain name) and puts “/robots.txt” in its place.
For example, “http://www.example.com/store/index.html” will become “http://www.example.com/robots.txt”.
So, as the owner of your glorious website, you need to put the robots.txt file in the right place on your web server for that “http://www.example.com/robots.txt” to work.
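The URL rewriting described above can be sketched in a few lines of Python. The robots_txt_url function name is made up for this example; it keeps the scheme and host and swaps the rest for “/robots.txt”.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL a robot would check before fetching page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop any query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/store/index.html"))
# prints http://www.example.com/robots.txt
```

However deep the page sits in your folders, the answer is always the same file at the top level of the site, which is why it has to live there.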
Things to Remember
Robots.txt controls how search engines interact with your site, so it can be a close ally for you.
Always use lower case letters for the filename “robots.txt”. Otherwise it won’t work!
Using robots.txt incorrectly can hurt your ranking
This is a fundamental part of how search engines work so it’s important to include robots.txt
Without a correct robots.txt, search engines may crawl the wrong parts of your site or skip the parts you want people to find!
To learn more, check out these super informative articles all about robots.txt:
Want a company to take care of it all? Give us a call or click below to contact our team. We won’t bite, promise. 🙂