As my final installment in the Nuts & Bolts series, I want to hit a few of the questions that were sent in that I didn’t get a chance to get to earlier in the week. I hope you’ve enjoyed reading these as much as I’ve enjoyed writing them.
What colocation provider did you choose, and why?
After an exhaustive (and exhausting!) selection process, we chose ServerCentral to host our infrastructure. They have an awesome facility with some of the most thoughtful and redundant datacenter design I’ve ever seen. On top of the top-notch facilities, they have a great network via their sister company, nLayer.
Finding a partner who could manage the hardware for us without us having to be onsite was a big deal for us too. The quality of “remote hands” support from datacenter to datacenter is, well, let’s just call it inconsistent and be generous. ServerCentral has a great reputation with its customers in that regard and we’ve found their support to be excellent. They manage all of the physical installations, hardware troubleshooting, and maintenance for us.
They do a mean cabling job too.
While my wife was driving home this afternoon, her car was rear-ended. The car that she was following stopped suddenly, forcing her to stop suddenly. The next driver in the chain wasn’t quite able to stop in time. Fortunately, nobody in either car was hurt, but it was pretty traumatic for my wife and kids, and my 6-year-old son, Noah, was crying.
Enter the North Carolina Highway Patrol. Police officers often get a reputation for being cold or unsympathetic, and I’ve certainly met some of that type. The officer that helped my wife today, though, was the exact opposite. Very kind and patient, particularly with my boys. After the paperwork was completed, she went to her car and returned with a stuffed puppy that she gave to Noah. She explained that she’d been carrying it around in her car for a while but wanted him to have it because he’d had a rough day.
As simple as that, a single small act of kindness completely changed the complexion of the afternoon, at least for one little boy. Tears were replaced with a smile by applying a little empathy to the situation. Sure, it will be annoying to go through the process of getting the car repaired, but my lasting memory won’t be of the accident. It will be of the compassionate police officer who made my son’s day just a little bit better.
Next up in the Nuts & Bolts series, I want to cover storage. After my new datacenter post, there were a number of questions about our storage infrastructure, particularly the Isilon storage cluster pictured there.
To set the stage, I’ll share some file statistics from Basecamp. On an average weekday, there are around 100,000 files uploaded to Basecamp with an average file size that is currently 2MB, for a total of about 200GB per day of uploaded content in Basecamp. And that’s just Basecamp! We have a number of other apps that handle tens of thousands of uploaded files per day as well. Based on that, you’d expect we’d need to handle maybe 60TB of uploaded files over the next 12 months, but those numbers don’t take into account the acceleration in the amount of data uploaded. Just since January we’ve seen an increase in the average file size uploaded from 1.88MB to 2MB, and our overall storage consumption rate has increased by 50% with no signs of slowing down.
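To give a sense of how those numbers turn into a capacity plan, here’s a back-of-the-envelope projection. The daily volume comes from the figures above; the compounded growth rate is an assumption loosely based on the ~50% increase in consumption we’ve seen, so treat this as a sketch rather than our actual planning model:

```ruby
# Back-of-the-envelope storage projection for Basecamp uploads alone.
# Daily volume comes from the numbers above (100,000 files/day * 2MB);
# the 50% annual growth rate is an assumption for illustration.
daily_uploads_gb  = 100_000 * 2.0 / 1024   # ~195GB of uploads per weekday
weekdays_per_year = 260

# Naive flat projection, ignoring growth: roughly 50TB/year.
flat_projection_tb = daily_uploads_gb * weekdays_per_year / 1024
puts "Flat projection: #{flat_projection_tb.round(1)}TB/year"

# Apply a hypothetical 50% annual growth rate, compounded monthly.
monthly_growth = 1.5**(1.0 / 12)
total_tb = (0...12).sum do |month|
  daily_uploads_gb * (weekdays_per_year / 12.0) * monthly_growth**month
end / 1024
puts "With 50% annual growth: #{total_tb.round(1)}TB/year"
```

Even this rough math shows why the acceleration matters: the growth assumption adds roughly ten terabytes to the first year alone.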
When I sat down to begin planning our move from Rackspace to our new environment, I looked at a variety of options. Our previous environment consisted of a mix of MogileFS and Amazon S3. When a customer uploaded a file to one of our applications we would immediately store the file in our local MogileFS cluster and it would be immediately available for download. Asynchronously, we would upload the file to S3, and after around 20 minutes, we would begin serving it directly from S3. The staging of files in MogileFS was necessary to account for the eventually consistent nature of S3.
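The two-stage flow described above can be sketched roughly like this. The class and method names are invented for illustration (the real logic lived inside our applications), but the idea is the same: serve from MogileFS until the file has had time to settle in S3:

```ruby
# Hypothetical sketch of the old MogileFS + S3 staging flow. Names are
# invented for illustration; the ~20-minute window accounts for S3's
# eventually consistent reads at the time.
S3_CONSISTENCY_WINDOW = 20 * 60 # seconds

Upload = Struct.new(:key, :uploaded_at) do
  # Serve from the local MogileFS cluster until the consistency window
  # has passed, then flip over to serving directly from S3.
  def storage_backend(now = Time.now)
    now - uploaded_at >= S3_CONSISTENCY_WINDOW ? :s3 : :mogilefs
  end
end

upload = Upload.new("basecamp/attachments/123", Time.now)
upload.storage_backend                     # => :mogilefs right after upload
upload.storage_backend(Time.now + 30 * 60) # => :s3 once the window passes
```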
While we’ve been generally happy with that configuration, I thought that we could save money over the long term by moving our data out of S3 and onto local storage. S3 is a phenomenal product, and it allows you to expand storage without having to worry much about capacity planning or redundancy, but it is priced at a comparative premium. With that premise in mind I crunched some numbers and was even more convinced that we could save money on our storage needs without sacrificing reliability and while reducing the complexity of our file workflow at the same time.
The main contenders for our new storage platform were an expanded MogileFS cluster or a commercial NAS. We knew that we did not want to juggle LUNs or a layer like GFS to manage our storage, so we were able to eliminate traditional SAN storage as a contender fairly early on. We have had generally good luck with MogileFS, but we have had some ongoing issues with memory growth on some of our nodes and at least a couple of storage-related outages over the past couple of years. While the user community around MogileFS is great, the lack of commercial support options really makes itself felt when you have an outage.
After weighing all of the options, we decided to purchase a commercial solution, and we settled on Isilon as the vendor for our storage platform. Protecting our customers’ data is our most important job and we wanted a system that we could be confident in over the long term. We initially purchased a 4 node cluster of their 36NL nodes, each with a raw capacity of 36TB. The usable capacity of our current cluster with the redundancy level we have set is 108TB. We’ve already ordered another node to expand our usable space to 144TB in order to keep pace with the storage growth that took place between the time we planned the move and when we implemented it.
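The capacity math works out roughly like this. This is a simplification that assumes our protection level costs about one node’s worth of raw capacity, which happens to be consistent with the 4-node/108TB and 5-node/144TB figures above:

```ruby
# Rough usable-capacity math for the cluster, assuming the protection
# level costs roughly one node's worth of raw capacity. This is a
# simplified model, not Isilon's actual protection accounting.
NODE_RAW_TB = 36

def usable_tb(nodes)
  (nodes - 1) * NODE_RAW_TB
end

usable_tb(4) # => 108, our current usable capacity
usable_tb(5) # => 144, after the node we've ordered arrives
```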
The architecture of the Isilon system is very interesting. The individual nodes interconnect with one another over an InfiniBand network (SDR or 10 Gbps right now) to form a cluster. With the consistency level we chose, each block of data that is written to the cluster is stored on a minimum of two nodes in the cluster. This means that we’re able to lose an entire node without affecting the operation of our systems. In addition, the nodes cooperate with one another to present the pooled storage to our clients as a single very large filesystem over NFS. Isilon also has all the features like snapshots, replication, quotas, and so on that you would expect from a commercial NAS vendor. These weren’t absolute requirements, but they certainly make management simpler for us and are a welcome addition to the toolbox.
As we grow, it’s very simple to expand the capacity of the cluster. You just rack up another node, connect it to the InfiniBand backend network and to the network your NFS clients are connected to and push a button. The node configures itself into the existing cluster, its internal storage is added to the global OneFS filesystem, its onboard memory is added to the globally coherent cache, and its CPU is available to help process I/O operations. All in about a minute. It’s pretty awesome stuff, and we had fun testing these features in our datacenter when we were deploying it.
For now, we continue to use Amazon S3 as a backup, but we intend to replace it with a second Isilon cluster in a secondary datacenter which we’ll keep in sync via replication within the next several months.
As a part of our ongoing Nuts & Bolts series I asked for questions from readers about the kinds of things they’d like to see covered. One of the topics that came up several times was how we manage our database servers.
All of our applications, with the exception of Basecamp, follow a pretty similar model: We take a pair of Dell R710 servers, load them up with memory and disks, and set up a master/slave pair of MySQL servers. We use the excellent Percona Server for all of our MySQL instances and couldn’t be happier with it.
Here’s an example of one of our MySQL servers. In this case, the Highrise master database server:
- Dell R710
- 2 x Intel Xeon E5530 Processors
- 96GB RAM
- 6 x 146GB 15,000 RPM SAS drives
For the disk configuration we take the first two drives and put them into a RAID1 volume that is shared between the root filesystem and MySQL binary logs. The remaining drives are placed into a RAID10 volume which is used for the InnoDB data files.
We only use RAID controllers that have a battery backup for the cache, disable read-ahead caching, and turn on write-back caching. With this setup we’re able to configure MySQL to immediately flush all writes to the disk rather than relying on the operating system to periodically write the data to the drives. In reality, the writes will be staged to the controller’s cache, but with the battery backup we are protected from unexpected power outages which could otherwise cause data loss. In addition, since the controller is caching the writes in memory, it can optimize the order and number of writes that it makes to the physical disks to dramatically improve performance.
As far as MySQL configuration is concerned, our configuration is pretty standard. The most important tips are to maximize the InnoDB buffer pool and make sure that you have a BBU-enabled RAID card for writes. There are other important configuration options, but if you do those two things you’re probably 75% of the way to having a performant MySQL server.
Here are some of the most important configuration options in the Highrise MySQL config file:
sync_binlog = 1
innodb_file_per_table
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 80G
I’m not going to talk much about backups other than to say you should be using XtraBackup, also from our friends at Percona. It is far and away the best way to do backups of MySQL.
For Basecamp, we take a somewhat different path. We are on record about our feelings on sharding. We prefer to scale our databases with hardware for as long as we can, deferring the complexity of partitioning them for as long as possible—with any luck, indefinitely.
With that in mind, we went looking for an option to host the Basecamp database, which is becoming a monster. As of this writing, the database is 325GB and handles several thousand queries per second at peak times. At Rackspace, we ran this database on a Dell R900 server with 128GB of RAM and 15×15,000 RPM SAS drives in a Dell MD3000 storage array.
We considered building a similar configuration in the new datacenter, but were concerned that we were hitting the limits of I/O performance with this type of configuration. We could add additional storage arrays or even consider SAN options, but SSD storage seemed like a much better long term answer.
We explored a variety of options, from commodity SSD drives to PCI Express-based flash memory cards. In the end, we decided to purchase a pair of MySQL appliances produced by Schooner Information Technology. They produce a pretty awesome appliance that is packed with a pair of Intel Nehalem processors, 64GB of RAM, 4 x 300GB SAS drives, and 8 x Intel X25-E SSD drives. Beyond the hardware, Schooner has done considerable work optimizing the I/O path from InnoDB all the way down through the system device drivers. The appliances went into production a few weeks ago and the performance has been great.
I sat down with Jeremy Cole of Schooner a few weeks ago and recorded a couple of videos that go into considerably more detail about our evaluation process and some thoughts on MySQL scaling. You can check them out here and here.
With all the recent talk about the fabulous new office space that the Chicago crew just moved into, I wanted to share a little bit about another long term move that is nearing completion. For the last four years our infrastructure has been hosted with Rackspace. As of last weekend, the vast majority of our traffic is now being served out of our own colocated server cluster.
Some of our Dell R710 servers. We have a bunch of these.
We’re excited to welcome John Williams to the operations team at 37signals this week. John will be joining Joshua and me to make sure our servers and infrastructure are as reliable as possible. John impressed us with a job application similar to the one that Craig submitted and went the extra mile by driving 5 hours to Chicago to meet with us face to face for a couple of hours.
John has past experience doing Rails development and as a systems engineer at Contegix. He’s also a pretty good cyclist, and was part of the National Collegiate Cycling Association Track Championship Team in 2003 and 2006 while at Marian College.
He’s a great addition and we’re thrilled to have him join the team. Welcome, John!
While working on a new project, we came across some compatibility problems in a plugin that we want to use. We have a known solution that works and doesn’t require the plugin, but if we can make the plugin work without too much additional effort, it’s worth using. We have a limited amount of time to finish this project due to our new iteration system, so we’re feeling some time pressure, but we don’t really have enough information to make a good decision yet.
Rather than making an immediate decision, Jeff decided to spend another 30 minutes with the plugin to see if he could make it work. If we can solve the compatibility problems in those 30 minutes, it will be a nice win and we can make use of the plugin that we want to. On the flip side, we already have a known solution to the problem. Even if we’re not able to solve the problems we’ll only lose half an hour, so it’s worth the time to do a very short spike to see if we can fix it.
Some of you may have noticed over the past week or so that Basecamp has felt a bit zippier. Good news: it wasn’t your imagination.
Let’s set the stage. Below, I’ve included a chart showing our performance numbers for Monday from four weeks ago. We can see that at the peak usage period between 11 AM and Noon Eastern, Basecamp was handling around 9,000 requests per minute. In the same time period, it was responding in around 320 ms on average or roughly 1/3 of a second. I know quite a few people who would be very pleased with a 320ms average response time, but I’m not one of them.
June 29, 2009 – 09:00 – 21:00 EDT
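For a sense of what those two numbers mean together, Little’s law gives the average number of requests in flight at peak. This is just illustrative arithmetic on the figures above, not a measurement from our systems:

```ruby
# Average concurrency via Little's law: L = arrival rate * response time.
# Both inputs are the peak figures quoted above.
requests_per_minute = 9_000
avg_response_sec    = 0.320

arrivals_per_sec = requests_per_minute / 60.0   # 150 requests/second
in_flight = arrivals_per_sec * avg_response_sec # requests being processed at once

puts in_flight.round
```

In other words, at peak Basecamp had on the order of 48 requests in flight at any given moment, which is why shaving milliseconds off the average response time pays off across the whole fleet.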
I have a couple of technical presentations coming up that I wanted to let everyone know about. I’ll be talking about a couple of things that have been holding my interest for the past few months: Chef and Erlang.
For the first talk, I’ll be giving an overview of Chef, a tool that we’ve been using for a few months to automate our system administration tasks. If you’re in or around the Raleigh, NC area tomorrow night (June 16th), come out to the Raleigh.rb meetup and join us.
The second talk will be at Erlang Factory in London next week. I’ll be talking about our use of the Erlang programming language in the Campfire poll server that we recently discussed right here on SvN. I had the good fortune to attend the first Erlang Factory in the United States at the beginning of May in Palo Alto and it was one of the best conferences I’ve been to in years. If you’re at all interested in Erlang, it would be worth your while to think about attending the conference.