How to hide your beta site from Google and not shoot yourself in the foot

When we work on a new application for a client, we give them a place such as beta.project.com, where they can follow the project's progress. This beta site should remain hidden from crawlers, so it doesn't accidentally appear in Google before the application has launched.

The standard way to turn crawlers away is robots.txt. The problem with robots.txt arises when you forget to remove it as the site launches. Since both the beta and the production site are often deployed from the same repository, this can happen easily during the excitement surrounding a product launch. Now you've screwed up big time: the production site will remain hidden from Google until someone notices the missing traffic. After that, it can take weeks to get back into Google's index.

Here is a suggestion for how to fail less. Instead of adding a robots.txt to your document root, name the file robots.beta.txt. To turn away all crawlers, the file should look like this:

User-Agent: *
Disallow: /
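
If you want to see how a well-behaved crawler reads these two lines, Python's standard-library `urllib.robotparser` can evaluate them. This is only a minimal sketch: the helper name `is_allowed` and the sample user agents and path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The exact contents of robots.beta.txt from above.
ROBOTS_BETA_TXT = """\
User-Agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt lets user_agent fetch path.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Every agent is matched by the wildcard rule, so nothing may be fetched.
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_BETA_TXT, agent, "/any/page"))
```

An empty robots.txt, by contrast, places no restriction on any crawler.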

We can now tell our web server to answer requests for robots.txt with the contents of robots.beta.txt, but only for requests addressed to the beta site. The production site should rightfully return a 404 Not Found error when asked for robots.txt. To achieve this using Apache, add the following lines to the virtual host of the beta site:

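RewriteEngine On
# Only rewrite requests addressed to the beta host (dots escaped so they match literally)
RewriteCond %{HTTP_HOST} ^beta\.project\.com$ [NC]
# Serve robots.beta.txt internally whenever robots.txt is requested
RewriteRule ^/robots\.txt$ /robots.beta.txt [L]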

Now you won't have to remember to remove the robots.txt file when the product launches.
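
If you would rather not take the setup on faith, a small check after each deploy can confirm that both hosts behave as intended. The following is a rough sketch, not part of the setup above: the function names are hypothetical, `www.project.com` is an assumed name for the production host, and the script assumes both sites are reachable over plain HTTP.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Optional
from urllib.robotparser import RobotFileParser


def fetch_robots(host: str) -> Optional[str]:
    """Return the robots.txt body served by host, or None on a 404."""
    try:
        with urllib.request.urlopen(f"http://{host}/robots.txt", timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def blocks_all_crawlers(robots_txt: Optional[str]) -> bool:
    """Return True if the given robots.txt forbids every crawler from '/'."""
    if robots_txt is None:
        return False  # no robots.txt at all means nothing is blocked
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", "/")


def mismatched_hosts(
    expected: Dict[str, bool],
    fetch: Callable[[str], Optional[str]] = fetch_robots,
) -> List[str]:
    """Return the hosts whose crawler policy differs from the expected one.

    expected maps a host name to True (should block crawlers) or False.
    """
    return [
        host
        for host, should_block in expected.items()
        if blocks_all_crawlers(fetch(host)) != should_block
    ]
```

Calling `mismatched_hosts({"beta.project.com": True, "www.project.com": False})` returns an empty list when the beta site is hidden and the production site is open to crawlers. Any host name in the result deserves a look before someone notices the missing traffic.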

Sat, 10 Apr 2010 23:30:00 GMT

by henning

  • Matt said about 1 month later:

    My preference is to use simple HTTP authenticate in the vhost configuration (AuthType Basic, etc.) and give my testers one ID/password to use. That way not even an end user on the production site could guess the beta subdomain and try it out. Since the vhost is per-machine it never would accidentally go live with code.

  • Henning said about 1 month later:

    Yes, an HTTP login will also work great for this!

    We have clients who want to show their beta site to friends, selected contacts and potential investors. They prefer to not make their product-in-making appear more complicated than it is by adding a layer of authentication. That’s why we often opt for the robots.txt as a good compromise.

  • Patricio Mg said 2 months later:

    thanks for the hint :)

    @Matt aside from Henning’s answer, you just can’t use a simple HTTP login if you have an e-commerce site: it will block the payment data flow between you and your business partner, and in the end you will have to open up access to your site anyway ;-)

  • PavelT said 12 months later:

    As for me, I prefer to do HTTP basic auth in Rails. That way I can control which user agents to allow access to, disable it in some controllers, etc. Here is my code snippet for this: http://code-snippets.paveltyk.info/snippets/58
