Today I made intermission public. As I mentioned in my post about mysql_role_swap, we’ve been working hard to limit or eliminate the impact our operations maintenance tasks have on our customers’ experience.
A few people noticed the /tmp/hold “leftover” in the mysql_role_swap script. intermission is a product of that early exploration into coordinating database maintenance with request pausing in the web application tier. I’ve done a good bit of non-production testing with intermission, but only limited production testing.
Last Friday we used intermission with mysql_role_swap to move Writeboard’s database to a new server. We had a single user-facing exception, and we think it was likely caused by something other than the maintenance. For Friday’s maintenance we enabled request pausing via intermission, ran mysql_role_swap, restarted the Unicorn (Rails) processes, and then unpaused the requests. Total maintenance time was just a few seconds!
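For the curious, the core idea is simple. This isn’t intermission’s actual code, just a minimal sketch of the request-pausing idea written as a Rack middleware; the hold-file path and timeout are assumptions for illustration:

```ruby
# Hypothetical sketch, not the real intermission: while a hold file exists
# (think of the /tmp/hold that mysql_role_swap looked for), incoming requests
# wait briefly instead of hitting a database that's mid-swap.
class HoldRequests
  HOLD_FILE = "/tmp/hold"
  MAX_WAIT  = 30 # seconds; give up and proceed rather than queue forever

  def initialize(app)
    @app = app
  end

  def call(env)
    waited = 0.0
    while File.exist?(HOLD_FILE) && waited < MAX_WAIT
      sleep 0.5
      waited += 0.5
    end
    @app.call(env)
  end
end

# config.ru: `use HoldRequests` ahead of the Rails application.
```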
We’ve come a long way in the last year in the way we operate our sites. We’ve stabilized our applications, improved their response time, and increased their availability.
To accomplish these improvements we’ve done a series of database maintenances, ranging from hardware upgrades to new database servers to configuration changes that required a restart. In each of these operations we had one common goal: minimize the interruption to our customers.
Today we are releasing a small script that has made our lives, and our customers’ lives, a whole lot better. We use this script to change the roles of our databases from replication masters to slaves, and vice versa. The gain comes from the script performing every step a human used to do, but faster and more consistently.
Without this script we used to spend minutes on these maintenance tasks. With it, we’ve swapped databases under production load with no interruption noticeable to users!
The script has lots of hard-coded paths, users, and other assumptions. But this is too good to keep to ourselves. We’re sharing it with you in the hope that it will improve your operations experience, and that you will contribute back changes that make it even better.
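To give a feel for what the script automates, here is a rough, hypothetical sketch of the core swap sequence using the mysql2 gem; the real mysql_role_swap handles many more checks and edge cases, and the hostnames and credentials below are made up:

```ruby
# Rough sketch of a master/slave role swap, not the actual script.
require "mysql2"

master = Mysql2::Client.new(host: "db-old", username: "ops", password: "secret")
slave  = Mysql2::Client.new(host: "db-new", username: "ops", password: "secret")

# 1. Stop writes on the current master so the slave can fully catch up.
master.query("SET GLOBAL read_only = 1")

# 2. Wait until the slave has applied everything it has received.
loop do
  status = slave.query("SHOW SLAVE STATUS").first
  break if status["Seconds_Behind_Master"].to_i == 0
  sleep 0.1
end

# 3. Promote the slave: stop replication and open it up for writes.
slave.query("STOP SLAVE")
slave.query("RESET SLAVE")
slave.query("SET GLOBAL read_only = 0")

# 4. Repoint the application (and, optionally, the old master as a new slave).
```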
Since I joined 37signals, I have been working to improve our monitoring infrastructure. We use Nagios for the majority of our monitoring. Nagios is like an old Volvo – it might not be the prettiest or the fastest, but it’s easy to work on and it won’t leave you stranded.
To give you some context, in January 2009 we had 350 Nagios services. By September of 2010 that had grown to 797, and currently we are up to 7,566. In the process of growing that number, we have also drastically reduced the number of alerts that have escalated to page someone in the middle of the night. There have certainly been some bumps along the road to better monitoring, and in this post I hope to provide some insight into how we use Nagios and some helpful hints for folks out there who want to expand and improve their monitoring systems.
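For anyone unfamiliar with how checks plug in: a Nagios plugin is just a program that prints a single status line and exits 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. Here’s a toy disk-space check in Ruby to show the contract; the mount point and thresholds are made up, and this isn’t one of our actual checks:

```ruby
#!/usr/bin/env ruby
# Toy Nagios-style check: one line of output, exit code carries the state.
mount    = ARGV[0] || "/u/apps"
pct_used = `df -P #{mount} | tail -1`.split[4].to_i

if pct_used >= 95
  puts "CRITICAL - #{mount} is #{pct_used}% full"
  exit 2
elsif pct_used >= 85
  puts "WARNING - #{mount} is #{pct_used}% full"
  exit 1
else
  puts "OK - #{mount} is #{pct_used}% full"
  exit 0
end
```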
We launched the new Basecamp on March 6. Since then we’ve deployed 891 new versions with all sorts of new features, bug fixes, and tweaks. Through all of that we’ve had just six incidents of either scheduled or unscheduled downtime for a total of 19 minutes offline.
Today, that means we’ve been available 99.99% of the time since launch. That’s worth celebrating! Our fantastic operations team of Anton, Eron, John, Matt, Will, and Taylor has worked tirelessly to eliminate interruptions, and they deserve our applause.
Since we count “scheduled” downtime the same as “unscheduled” (have you ever met a customer who cared about the difference?), that has meant making good progress on stuff like database migrations.
In the past, when we focused mainly on unscheduled downtime as a measure of success, we wouldn’t think too much of taking a 30-minute window to push a major new feature. Not so these days. Thanks to Percona’s pt-online-schema-change, we’re able to migrate the database much more easily, without any downtime or master-slave swapperoo.
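For the curious, a pt-online-schema-change run looks roughly like the sketch below: the tool copies the table in chunks, keeps the copy in sync with triggers, and then swaps it in, so the original table stays available the whole time. The database, table, and column names are made up for illustration:

```ruby
# Hypothetical example of driving pt-online-schema-change from a tiny
# Ruby wrapper; schema, table, and column names are invented.
alter = "ADD COLUMN starred TINYINT(1) NOT NULL DEFAULT 0"

cmd = [
  "pt-online-schema-change",
  "--alter", alter,
  "D=basecamp_production,t=todos", # hypothetical schema and table
  "--execute"                      # run with --dry-run first to sanity check
]

system(*cmd) or abort "online schema change failed"
```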
So three cheers to the four 9’s! Our next target is five 9’s, but that only allows for 5 minutes of downtime in a whole year, so we have our work cut out for us.
You can follow along and see how we’re doing on basecamp.com/uptime.
A few of us recently attended Velocity Conference in San Jose, CA. In the “hallway track” sessions, a number of people asked about the hardware that powers Basecamp, Campfire and Highrise.
Application Servers
All of our Ruby/Rails application roles run on Dell C Series 5220 servers. We chose the C5220 because it provides high-density, high-performance compute sleds at a decent cost point. The C5220 sleds replaced individual Dell R710 servers, which consumed more power and rack space while offering expandability we weren’t using.
We use an 8-sled configuration with E3-1270 3.40GHz processors, 32 or 16GB of RAM, an LSI RAID card, and 2 non-Dell SSDs. (For those of you thinking of ordering these … get the LSI RAID card. The built-in Intel RAID is unreliable.) Each chassis with 8 sleds takes up 4U of rack space: 3U for the chassis and 1U for cabling.
Job / Utility Servers
We use a combination of C6100 and C6220 servers to power our utility/jobs and API roles. We exclusively use the 4-sled version (of each), which means we get 4 “servers” in 2U. Each sled has 2x X5650 processors, 48-96GB of RAM, 2-6 SSDs, and 4×1G or 1×10G network interfaces. This design allows us to have up to 24 disks in a single chassis while consuming the same space as a single R710 server (which holds 8 disks max).
Search Servers
For Solr we run R710s filled with SSDs. Each instance varies, but a common configuration is 2x E5530 processors, 48GB of RAM, 4-8 SSDs, and 4×1G network interfaces. For Elasticsearch we run a mix of PowerEdge 2950 servers and C5220 sleds with 12-16GB of RAM and 2×400GB SSDs in a RAID 1.
Database and Memcache/Redis Servers
For database roles we use R710s with 2x X5670 processors, 1.2TB Fusion-io Duo cards, and varying amounts of memory, depending on the database size. We also have a number of older R710s powering Memcache and Redis instances. Each of these has 2x E5530 processors and 2-4 disks with 4×1G network interfaces.
Storage
We have around 400TB of Isilon 36NL and 72NL storage across 9 nodes. We serve all user-uploaded content from this storage, with backups to S3.
Database servers run RHEL or CentOS 6, while application and utility servers run Ubuntu LTS.
A common obstacle we face when releasing new features is making production schema changes in MySQL. Many new features require additional columns or indexes. Running an “ALTER TABLE” in MySQL to add the needed columns and indexes locks the table, hanging the application. We need a better solution.
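To make the problem concrete, here’s the sort of migration we’re talking about; the table and column names are hypothetical, but on a large MySQL table each of these statements triggers an ALTER TABLE that rebuilds the table and locks it for the duration:

```ruby
# Hypothetical Rails migration illustrating the kind of change that
# requires a blocking ALTER TABLE on MySQL.
class AddStarredToTodos < ActiveRecord::Migration
  def up
    add_column :todos, :starred, :boolean, default: false # ALTER TABLE todos ADD COLUMN ...
    add_index  :todos, :starred                           # ALTER TABLE todos ADD INDEX ...
  end

  def down
    remove_index  :todos, :starred
    remove_column :todos, :starred
  end
end
```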
Option 1: Schema Change in Downtime
This is the simplest option: put the application into downtime and perform the schema change. It requires us to have the application down for the duration of the “ALTER TABLE”. We’ve successfully used this option for smaller tables that can be altered in seconds or minutes. However, for large tables the alter can take hours, making it less than desirable.
I’ve been with 37signals as a remote system administrator for a month now and thought I would share some observations on Operations:
I’ve been thoroughly impressed by how much everyone genuinely cares about the user experience with the applications they maintain. Everything from page response times increasing by a few milliseconds, to minimizing interruptions during deployments, to the impact of the number of HTTP redirects on load times – it’s all constantly being discussed and debated. This is often supplemented with hard data in logs and pretty graphs (thanks Noah!) from the multitude of resources made available internally to help diagnose issues.
It’s more than just observation and discussion, though – frequently these chats morph into immediate direct action in code or configuration changes.
37signals definitely embraces remote employees and treats them like any other. In my first month I’ve used Campfire, Basecamp, Jabber, Skype, Gmail, Github, Confluence, Google Docs, and more to stay connected and learn how some of the pieces fit together. Thanks to the resources made available from day one and the accessibility of the other Operations team members, I honestly haven’t felt left out in the slightest when working from home.
Operations definitely sets the bar high for accountability to users and the rest of the company. Every issue I’ve seen raised by support, programmers, or other admins is looked into and taken as seriously as any other.
When larger production issues are dealt with, detailed postmortems are made available so the same mistakes aren’t repeated. These have been great for learning what not to do.
The amount and depth of monitoring at 37signals is impressive. There are thousands of checks covering everything from hardware issues and application error rates to backups and scheduled jobs. On top of this, we have remote monitoring and systems in place for notifying users of ongoing issues.
Overall I’ve been impressed with the high standard of work in the Operations group and I’ll be doing my best to live up to it.
It’s been an interesting and enjoyable first month with 37signals, largely thanks to everyone who’s made me feel so welcome and valued, especially Taylor, John, Eron, Will and Anton. Thank you!
As we announced at the beginning of the month, we’re always on a mission to improve our uptime. Inaccessible apps cause a lot of frustration, and users don’t care whether the downtime is scheduled or not.
While publishing our own uptime has been a great step towards getting everyone in the company focused on improving, we also wanted to compare ourselves to others in the industry. So since December 16, we’ve been tracking five other applications through Pingdom to compare and contrast.
The goal is to have the least amount of downtime, and here are the results for the period December 16 to January 31:
- GitHub, down for 6 minutes
- Freshbooks, down for 14 minutes
- Basecamp, down for 16 minutes
- Campaign Monitor, down for 21 minutes
- Shopify, down for 1 hour and 53 minutes
- Assistly (now Desk), down for 6 hours and 46 minutes
Congratulations to GitHub for the number one spot on the list. We are definitely going to be gunning for them! We’ll publish another edition of this list in a month or so.
From the very start, we wanted Basecamp Next to be fast. Really, really fast. To do so we built a Russian-doll architecture of nested caching that I’ll write up in detail soon. But for now I just wanted to share where all this caching is going to live, as we just installed it at the hosting center.
It kinda reminds me of what pictures of a drug raid look like when they lay out all the coke and cash on the table, but this is what 864GB of RAM looks like:
Cost of the loot was $12,000.
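As for what all that memory will hold: the detailed write-up is coming, but the gist of the Russian-doll approach is nesting cached fragments so that a change only refreshes the layers above it. Here is a simplified, hypothetical sketch of the idea, not Basecamp’s actual code; the model and method names are invented:

```ruby
# Nested (Russian-doll) caching sketch. Each layer's cache key includes the
# record's updated_at, and touch: true bubbles changes upward, so editing one
# todo refreshes its own fragment and its parents' fragments while untouched
# siblings keep serving straight from the cache.
class Project < ActiveRecord::Base
  has_many :todolists
end

class Todolist < ActiveRecord::Base
  belongs_to :project, touch: true # changing a list also bumps its project
  has_many   :todos
end

class Todo < ActiveRecord::Base
  belongs_to :todolist, touch: true # changing a todo bumps its list (and project)
end

def render_project(project)
  Rails.cache.fetch([project, "card"]) do            # outer doll
    project.todolists.map { |list|
      Rails.cache.fetch([list, "card"]) do           # middle doll
        list.todos.map { |todo|
          Rails.cache.fetch([todo, "row"]) { todo.description } # inner doll
        }.join("\n")
      end
    }.join("\n")
  end
end
```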
In 2011:
- Our support team responded to 100,000 cases
- Our syslog server logged an average of 1,500,000,000+ messages each day
- Our Solr indexer processed 428,000,000 indexing requests for Highrise alone
- We hit our 100,000,000th person/company creation in Highrise
- And a Basecamp user uploaded the 100,000,000th file (It was a picture of a cat!)
In the first 10 business days of 2012:
- We stored 100,000,000 unique statsd measurements
- Our syslog server logged over 20,000,000,000 messages
- Our applications sent more than 2,000 email notifications per minute
- And Basecamp accepted an average of 75,000 file uploads from users per day
Interested in numbers other than revenue and profit? Leave a comment and we (mostly Noah) will dig them up for a future post.