Marrying systems the hard way

Occasionally you will feel pressed to connect your code to a closed system that offers no API for machine-to-machine interaction. Inevitably a plan will hatch, proposing to solve the problem with brute force: "We are going to poll the closed system every minute and screenscrape their HTML."
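To make the fragility concrete, such an adapter usually boils down to a sketch like the following (Nokogiri, the URL and the CSS selector are made up for illustration; the selector is exactly the part that silently breaks when the vendor changes their markup):

require "open-uri"
require "nokogiri"

# Poll the closed system and scrape order numbers out of its HTML.
# The selector encodes guesses about the vendor's markup.
def scrape_orders
  doc = Nokogiri::HTML(open("http://closed-system.example.com/orders"))
  doc.css("table#orders td.order-number").map(&:text)
end

loop do
  puts scrape_orders.inspect
  sleep 60 # "every minute"
end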

Doubts regarding the solution's fragility are swiftly dispelled: "The adapter code will be easy enough. We will simply patch our code whenever they update theirs. This should take us a few minutes per month tops!"

While going down that road might be worth the rewards, doing so requires substantial commitment on your part. The web is riddled with abandoned projects promising to bolt an API onto a closed system. For a while they delivered on their promise, but eventually the maintainer lost interest. And when that monthly patch didn't come, the adapter broke forever.

The next time you feel tempted to marry two systems the hard way, try asking your software vendor for a proper API. Using tools like HTTP, JSON or Atom, a simple API can often be hacked together in an afternoon. Your vendor might happily accommodate you in exchange for your continued business.
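To give an idea of the scale involved, here is a minimal sketch of a read-only JSON endpoint in Sinatra (Sinatra and the Job model are assumptions for illustration, not something from this post):

require "sinatra"
require "json"

# Expose the records the other system needs as plain JSON.
get "/jobs.json" do
  content_type :json
  Job.all.map { |job| { :title => job.title, :url => job.url } }.to_json
end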

When we launched our job board for German web developers last year, several employment sites asked for a machine-readable interface to our data. Building the requested feeds would only take us a few minutes, so we agreed to do it. Today these sites are our number one source of traffic.
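In Rails, such a feed boils down to little more than a builder template; a sketch, assuming a Job model with title, description and created_at (the names are illustrative):

# app/views/jobs/index.atom.builder
atom_feed do |feed|
  feed.title "Latest job offers"
  feed.updated @jobs.first.created_at unless @jobs.empty?

  @jobs.each do |job|
    feed.entry(job) do |entry|
      entry.title job.title
      entry.content job.description, :type => "html"
    end
  end
end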

Sun, 06 Jun 2010 13:56:00 GMT

by henning

Speeding up website load times: Using a public CDN for Javascript libraries such as jQuery

As you might have noticed, Google keeps repeating that it wants fast-loading websites.

One way to achieve that is to deliver Javascript libraries such as jQuery from servers at Google, Microsoft or the like. Technically this means you put the URL offered by one of these so-called Content Delivery Networks (CDNs) into the head of your website (see the snippet after the list below). Clients visiting the site will then fetch the library from the CDN instead of your server. This is said to decrease your website's load time, which has several benefits:

  • Browsers limit the number of concurrent connections to one server. RFC 2616, which specifies HTTP, says: "A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy." SeaMonkey 2.0.3 and Firefox 3.6.2 seem to open up to 15 connections to one host, and Opera 10.54 opens up to 16 by default. With the arrival of Internet Explorer 8, Microsoft raised that number from 2 connections in IE 7 to 6 connections in IE 8 (both for HTTP 1.1). Serving the library from a CDN host therefore gives the browser additional connections to fetch your own content in parallel.
  • As your servers no longer deliver the library, they are under less load and you consume less bandwidth.
  • You can benefit from improved caching when using CDN-hosted versions of Javascript libraries: say a client visits example.com, which links the library from the Google CDN. If that client visits your site afterwards and you use the Google CDN as well, the browser will notice that it has already downloaded the file for example.com and will not fetch the library again. As even the minified version of jQuery is roughly 70K, there is some benefit for sure.
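For reference, this is all it takes in a Rails layout to link the Google-hosted copy instead of a local file (the helper accepts a full URL and emits a plain script tag in the page head; the jQuery version number is only an example):

<%= javascript_include_tag "http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js" %>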

But there are some drawbacks, too: you're dependent on the CDN infrastructure. If it is down, your website will be hardly usable either. You cannot influence the bandwidth and carriers of the delivery provider's network, although the whole point of a CDN is to have good connectivity and only a few hops to clients.

Independent of the drawbacks (which are minor in my opinion), I wondered about the effectiveness of the caching bonus. There are many people out there who don't want a CDN to deliver arbitrary code to their clients and therefore still host the libraries themselves. Another problem is that there are many versions of such libraries: if example.com uses `lib-1.1.js` and you linked `lib-1.2.js`, the caching will fail.

Facts about public CDN usage

In order to shed light on this, I checked the top 500 websites globally as well as the top 220 websites in Germany according to Alexa. I wrote one script that fetches the domains listed at Alexa and another that parses those websites, because I wanted to know which of the top sites use a hosted version of jQuery (a rough sketch of that check follows below). Here are the results:

Out of the German sites merely six use a CDN, and all of them use Google. Looking at the top 500 sites globally, I found that 13 use Google, while Edgecast is used by two sites. The Microsoft CDN is not used at all.
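The check itself is not much more than a pattern match on the homepage HTML. A rough sketch of the idea (open-uri and the regular expression are simplifications for illustration, not the original script):

require "open-uri"

# Script sources that indicate a CDN-hosted copy of jQuery.
CDN_PATTERN = %r{(ajax\.googleapis\.com|ajax\.microsoft\.com|edgecastcdn\.net)[^"']*jquery[^"']*\.js}i

def cdn_jquery_source(domain)
  html = open("http://#{domain}/").read
  html[CDN_PATTERN]   # the matching script URL fragment, or nil
rescue StandardError
  nil
end

puts cdn_jquery_source("example.com").inspect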

Open questions

It surprised me that only a few sites make use of the free CDN services. Maybe the top websites tune their homepages for faster load times and therefore exclude the libraries there? Maybe I should not only crawl the homepage? Or is jQuery not used at all?

The sites I got from Alexa are a nice starting point, but in order to get more precise numbers we should probably crawl more websites. In Germany an association called IVW measures traffic for many German websites. I don't know if there is something similar for your country or even on a global scale. If someone has a good list: just drop me a line.

Conclusion

It seems that even though Edgecast might load a bit faster, Google is more common among the sites I crawled. Maybe you should pay special attention to your audience: if you run a regional website, look at popular, relevant sites that are likely to be in your visitors' browsing history, and thus in their cache.

Unless you run a high-security (Ajax) banking website or have other reasons not to use public CDNs, the Google one seems to be quite a good choice.

  • James Skemp said 17 days later:

    (Saw one of your comments on StackOverflow.)

    “Is jQuery not used at all?”

    This seems like something rather easy to check for, since you’re already parsing the content returned from the sites.

  • Joshua Wedekind said 7 months later:

    @James Skemp,

    LOL. I was thinking the exact same thing!

Fri, 11 Jun 2010 12:05:00 GMT

by thomas

How to create an image from a running Amazon EC2-Instance

Maybe you have already joined the cool kids out there and utilize cloud computing to scale your application to infinity while reducing your IT budget? Just kidding.

Using the (increasingly hyped) cloud infrastructure does make sense for some applications out there, and some of our customers use it. A common task when working on Amazon EC2 is to launch instances (virtual machines, for the old-fashioned) in order to get more computing power, to add another slave for whatever, and so on. Of course you need some kind of image to boot your virtual machine - and you can’t just drive past the Amazon datacenter in Dublin to insert your Ubuntu CD somewhere.

Ideally this image already includes the basic stuff you usually need. As we work with Ruby on Rails most of the time, that means a Ruby interpreter, a bunch of gems we usually use, maybe a database server or Memcached, and a Java runtime environment for Solr. Additionally you might have some basic configuration like security settings, monitoring or SSH keys for your team.

In order to create such an image including your individual stuff, boot one of the basic AMIs offered by Amazon, do your configuration homework and follow the next steps to create your own private AMI:

First of all you need the X.509 keys. Go to the AWS Management Console, click “Account” in the topmost menu bar, then “Security Credentials”, enter your login data and click on “X.509 Certificates” in the “Access Credentials” section. Copy these certificates to your instance:

$ scp -i your_keyfile_for_the_instance.pem directory_where_both_x509_certs_are/*.pem root@$hostname.compute.amazonaws.com:

Log into the instance and move the keys to a separate directory, as you want to exclude those keys from the image to be created:

$ ssh -i your_keyfile_for_the_instance.pem root@$hostname.compute.amazonaws.com
$ mkdir x509_certs
$ mv *.pem x509_certs
$ cd x509_certs

Hints:

  • ec2-bundle-image is very likely not what you want if you intend to create an AMI from a running instance; use ec2-bundle-vol (as described in the following) instead!
  • To ensure you get a clean state of your machine, try to disable as many services (read: database, application server, etc.) as possible.
  • Keep in mind that if you connect to an instance through an elastic IP, your SSH-connection will die if you disassociate the IP.

You need your Amazon account number for the next step. To find it, click Account -> Personal Information within the AWS Management Console. Your account number is on the upper right side, separated by dashes: 2342-4242-1234

Remove the dashes from your account number and run ec2-bundle-vol with the following (necessary) parameters; the second invocation additionally excludes the certificate directory and writes the bundle to /mnt:

$ ec2-bundle-vol -k ./pk-$pk-keyfile.pem -c ./cert-$cert.pem -u 234242421234
$ ec2-bundle-vol -k ./pk-$pk-keyfile.pem -c ./cert-$cert.pem -u 234242421234 -e /root/x509_certs/ -d /mnt/

I ran into trouble with Ruby Enterprise Edition installed on the machine while running ec2-bundle-vol:

root@hostname:~# ec2-bundle-vol 
/usr/lib/site_ruby/ec2/amitools/bundlevol.rb:11:in `require': no such file to load -- ec2/amitools/bundle (LoadError)
        from /usr/lib/site_ruby/ec2/amitools/bundlevol.rb:11

Putting the system Ruby in /usr/bin first on the PATH does the trick:

root@hostname:~# PATH=/usr/bin:$PATH

Take two more minutes to think about whether you want to use the optional parameters:

  • -s size in MB (have a look at `df -h` to see what your instance consumes at the moment)
  • -e directory excludes directories from your image. AT LEAST exclude the directory your keys were copied to! (-e ~/x509_certs/ in our example)

ec2-bundle-vol asks for the architecture you are running. Have a look at `uname -a` if in doubt. Just hit enter if you’re running an i386 instance.

To keep the image persistent, upload it to S3:

ec2-upload-bundle -b $some_name -m /tmp/image.manifest.xml -a $AWS_access_ID -s $AWS_secret_key --location EU

Replace $some_name with something that identifies and describes the image. It will be used as the bucket (newfangled “directory”) on S3. Replace $AWS_access_ID and $AWS_secret_key accordingly.

To make the image accessible from within the Management Console, click “Register new AMI” and enter the path to the manifest on S3, which should be $some_name/image.manifest.xml.

That was easy, wasn’t it?

  • Aleksandar said about 1 month later:

    The most frustrating thing is using 15 user identification strings/numbers IMHO

  • evan said 5 months later:

    Thanks! Great post. I’m using the Amazon Linux AMI (their new flavor), and I needed to run the following shell script to execute ec2-bundle-vol as ec2-user (using sudo).

    #!/bin/sh
    export EC2_HOME=/opt/aws/bin
    export EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
    /opt/aws/bin/ec2-bundle-vol -d /mnt -k /home/ec2-user/.ec2/pk-XXX.pem -c /home/ec2-user/.ec2/cert-XXX -u XXX -e /home/ec2-user/.ec2
    exit 0

  • Igor Ganapolsky said 7 months later:

    When I do this, there is no image.manifest.xml in my tmp directory. I don’t understand how that file gets generated.

Thu, 29 Apr 2010 09:49:00 GMT

by thomas

How to hide your beta site from Google and not shoot yourself in the foot

When we work on a new application for a client, we give them a place such as beta.project.com, where they can follow the project's progress. This beta site should remain hidden from crawlers, so it doesn't accidentally appear in Google before the application has launched.

The standard way to turn crawlers away is a robots.txt file. The problem with robots.txt arises when you forget to remove it once the site launches. Since the beta and the production site are often deployed from the same repository, this can easily happen during the excitement surrounding a product launch. Now you have screwed up big time: the production site will remain hidden from Google until someone notices the missing traffic, and it can then take weeks to get back into Google's index.

Here is a suggestion for how to fail less. Instead of adding a robots.txt to your document root, name it robots.beta.txt. In order to turn away all crawlers, the file should look like this:

User-Agent: *
Disallow: /

We can now tell our web server to rewrite requests for robots.txt to robots.beta.txt, but only on the beta site. The production site will rightfully return a 404 Not Found error when asked for robots.txt. To achieve this with Apache, add the following lines to the virtual host of the beta site:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^beta\.project\.com$
RewriteRule ^/robots\.txt$ /robots.beta.txt

Now you won't have to remember to remove the robots.txt file when the product launches.

  • Henning said about 1 month later:

    Yes, an HTTP login will also work great for this!

    We have clients who want to show their beta site to friends, selected contacts and potential investors. They prefer not to make their product-in-the-making appear more complicated than it is by adding a layer of authentication. That’s why we often opt for robots.txt as a good compromise.

  • Patricio Mg said 2 months later:

    thanks for the hint :)

    @Matt: aside from Henning’s answer, you just can’t use a simple HTTP login if you have an e-commerce site - it will block the payment dataflow between you and your business partner, and in the end you will have to open up access to your site anyway ;-)

  • PavelT said 12 months later:

    As for me, I prefer to do HTTP basic auth in Rails. That way I can control which user agents to allow access, disable it in some controllers, etc. Here is my code snippet for this: http://code-snippets.paveltyk.info/snippets/58

Sat, 10 Apr 2010 23:30:00 GMT

by henning
