This is Signal vs. Noise, a weblog by 37signals about design, business, experience, simplicity, the web, culture, and more. Established 1999 in Chicago. Follow us on Twitter for more information on our products.
As we announced at the beginning of the month, we’re always on a mission to improve our uptime. Inaccessible apps are the cause of much frustration and users don’t care whether that’s because they’re scheduled or not.
While publishing our own uptimes have been a great step towards getting everyone in the company focused on improving, we also wanted to compare ourselves to others in the industry. So since December 16, we’ve been tracking five other applications through Pingdom to compare and contrast.
The goal is to have the least amount of downtime and here are the results from the period December 16 to January 31:
Congratulations to Github for the number one spot on the list. We are definitely going to be gunning for them! We’ll publish another edition of this list in a month or so.
From the very start, we wanted Basecamp Next to be fast. Really, really fast. To do so we built a russian-doll architecture of nested caching that I’ll write up in detail soon. But for now I just wanted to share where all this caching is going to live as we just installed it at the hosting center.
It kinda reminds me of what pictures of a drug raid look like when they lay out all the coke and cash on the table, but this is what 864GB of RAM looks like:
Every week, a few of our programmers focus on responding to customer problems that might indicate a bigger issue with one of our applications. In addition, we’re constantly looking at issues (Rails exceptions, performance “hotspots”, etc.) that don’t bubble up to the customer level, but that are just “good housekeeping”. Taking care of these issues can have both real and measurable impacts — for example, we reduced application exceptions by 43% in December, allowing us to proactively fix even more problems before customers noticed.
Lets enter the house of Campfire, our real time chat product. Campfire uses memcache in a number of ways, one of which is to store fragments of messages that have been “spoken” in a room. When a user requests the transcript for that room we attempt to build a list of these messages from the cache before we hit the database. To do this, we build an array of message keys which are dispatched to a memcache-client multi_get request.
For a long time Campfire has generated exceptions showing multi_get failures, but the errors aren’t bringing down the site or causing users problems. When I first came to 37signals the exceptions were described as being caused by a bug in older versions of memcached (server), but no one knew “for sure”. Lacking any open issues or similar bug reports, as a work around our developers had patched the memcache-client to rescue ENOPIPE errors, raise by these failed get requests. The patch didn’t make the problem go away, but it kept it from being customer impacting…
Identifying the Problem
While testing a new branch of Campfire in staging, one of our developers was repeatedly encountering this error to the point where an exception was being raised for almost every page. Remember how I said these things pop up at the most inconvenient times? I decided to try and figure out these exceptions once and for all. (Ultimately, it turns out we hadn’t bundled the patched gem and that’s why the exception was being raised so often.)
After reviewing the code for memcache-client, I decided there wasn’t anything obviously wrong with its implementation for multi_get requests. To start my debugging, I fired up telnet, set two keys, and issued a “get keyone keytwo”. This worked fine. I had established that multi gets in the most basic form do work.
Next I used strace to attach to the unicorn (Ruby on Rails) process and see what the client was sending:
Here’s a basic run down of what you see above: The unicorn process (Rails) opens a socket to our memcache server in staging, and then issues a get command which is broken up into multiple 4096 byte packets. However, instead of receiving a reply from the server, our client socket gets a Broken Pipe error and it’s closed.
“OK…” I thought. I need to reference the memcache protocol specification and make sure this request conforms to the “get” protocol command. Was this request within in the protocol description? Yes.
At this point I knew the client was sending a legitimate request, so I needed to debug the server. Since our application uses multiple instances, I downed all but one instance, and fired up strace (strace -s 64000 -fp 7133).
(Strace Tip: I used -f to follow the forked processes, and -s 64000 to make sure I got enough string output to see the full read/write data sent to the socket.)
You can see that the memcache server reads the first 2048 bytes, then reads another 2048 bytes, then reads another 4096 bytes, then reads 8192 bytes, and finally reads another 16384 bytes. So after the first read, it doubles the amount of data it’s reading for each iteration. However, after the fourth read, the epoll call returns 0, and the client’s connection is closed.
Identifying the Root Cause
In this case, I knew Jeremy had worked on this bug before and I thought he would understand the all the data I had gathered. So with Jeremy’s help, I started looking at the memcache code on Github. Within minutes Jeremy pinged me saying he had noticed this code that’s designed to nuke junk connections. Which makes sense, but also doesn’t help us here!
This code path does exactly what I had seen in the strace above: It reads from buffer 4 times, while doubling the amount read each time. After 4 times, it stops reading and returns.
Conclusion and Resolution
Although memcache’s behavior is by design, it’s not compatible with the expected behavior of the protocol. In the issue I opened, I recommended a configuration option to set the amount of data to read, or the number of times to attempt reading more data. I’m not sure if this is a realistic solution though.
Hypothetically we could work around this by increasing the initial buffer size and recompiling memcache so that in 4 iterations the server would receive all of the get command. We could also patch the memcache-client multi_get method to break the multi_get requests into multiple requests with fewer keys. For now we’ve left our patched client in place, with hope that the memcache team will be able to get this resolved soon.
As I said previously, the goal is to identify a root cause. And, with Jeremy’s help, I’ve done just that. It feels great to have a complete understanding of this exception, and where the break down is in our stack.
We now know, we don’t have to be afraid of upgrading memcache with a fear of causing more multi_get problems. We also know that this is a legitimate problem that’s causing a significant performance penalty in our application. (Since large multi_gets fail, that means we aren’t getting the data back from memcache and hitting the database more than we need to.) Finally, for me, it shows that I don’t have to be bothered by debugging with strace. I’m no C code expert, but even with my limited understanding, strace provided enough clues for me to follow this all the way to the source in less than one hour. And since you all have an extra hour to spare, I hope this encourages you to take a look at that problem that’s been lingering in your stack.
Today, we are announcing Anton Koldaev as the newest member of our operations team! Anton hails from Moscow and recently worked at the first Russian-based cloud hosting company.
In David’s recent post on remote working, there were a number of comments about investing in more junior or less experienced individuals. Anton is 24 and a great example of that and quickly won us over during his 30 day trial as a junior member of our team.
How we hired Anton
Anton saw our job board post via an old colleague’s tweet and responded with a well-written email to us. We reviewed his resume and gave a traditional operations task to complete: Deploy Redmine to EC2 using Chef, and document the process.
Anton not only did an excellent job from the operational side of things, he was the only applicant to use the Redmine instance to document his work. In addition, he was the only applicant to take a page from tally and build a little Campfire integration for code pushes, deployment notifications, etc.
Impressed with Anton’s work on the trial project, each member of the team had a chat with him on Skype. (We also talked to his references who all had great things to say.) As a team we agreed that Anton would be a good contractor, and offered him a 30 day trial.
Anton started with documentation, and worked his way through projects such as upgrading our Chef server, deploying new hardware for Basecamp and rebuilding our staging database servers. Near the end of his trial, Anton was able to get a visa to visit Will in the UK for a week of co-working and in depth learning about our operations.
Ma Bell engineered their phone system to have 99.999% reliability. Just 5 minutes of downtime per year. We’re pretty far off that for most internet services.
Sometimes that’s acceptable. Twitter was fail whaling for months on end and that hardly seem to put a dent in their growth. But if Gmail is down for even 5 minutes, I start getting sweaty palms. The same is true for many customers of our applications.
These days most savvy companies have gotten pretty good about keeping a status page updated during outages, but it’s much harder to get a sense of how they’re doing over the long run. The Amazon Web Services Health Dashboard only lets you look at a week at the time. It’s the same thing with the Google Apps Status Dashboard.
Zooming in like that is a great way to make things look peachy most of the time, but to anyone looking to make a decision about the service, it’s a lie by omission.
Since I would love to be able to evaluate other services by their long-term uptime record, I thought it only fair that we allow others to do the same with us. So starting today we’re producing uptime records going back 12 months for our four major applications:
Backpack: 99.98% or just under two hours of downtime.
Note that we’re not juking the stats here by omitting “scheduled” downtime. If you’re a customer and you need a file on Basecamp, do you really care whether we told you that we were going to be offline a couple of days in advance? No you don’t.
While we, and everyone else, strive to be online 100%, we’re still pretty proud of our uptime record. We hope that this level of transparency will force us to do even better in 2012. If we could hit just 4 nines for a start, I’d be really happy.
I hope this encourages others to present their long-term uptime record in an easily digestible format.
Last year, we suffered a number of service outages due to network problems upstream. In the past 9 months we have diligently worked to install service from additional providers and expand both our redundancy and capacity. This week we turned up our third Internet provider, accomplishing our goals of circuit diversity, latency reduction and increased network capacity.
We now have service from Server Central / Nlayer Networks, Internap and Level 3 Communications. Our total network capacity is in excess of 1.5 gigabits per second, while our mean customer facing bandwidth utilization is between 500 megabits and 1 gigabit. In addition, we’ve deployed two Cisco ASR 1001 routers which aggregate our circuits and allow us to announce our /24 netblock (our own IP address space) via each provider.
Keeping Basecamp, Highrise, Backpack, and Campfire available to you at all times is our top priority, and we’re always looking for ways to increase redundancy and service performance. This setup has prevented at least 4 significant upstream network issues from becoming customer impacting… which we can all agree is great!
We are looking for two more people to join our operations team. We would prefer individuals interested in both application development and systems engineering. We’ve got 100’s of servers running Ubuntu, 400+ terabytes of Isilon storage, and we’re building out a second site. We use lots of “bare metal” in addition to VMware virtualization and some of the Amazon and Rackspace Cloud services.
Come work with us at 37signals and do the best work of your career.
Want more information? Check out this post on the Job Board.
Yesterday Eron Nicholson joined John, Will, and me on our operations team!
Recently Eron’s work included building and maintaining several multi-data center installations for a green energy startup. In addition to wearing the hardware hat, Eron was responsible for networking, monitoring and systems administration. As if those duties weren’t enough, Eron also developed internal tools and did other software design (including embedded systems).
We are really looking forward to Eron’s work on our monitoring and high availability systems.
Welcome Eron!
PS Hat tip to Nic at Newrelic for introducing us to Eron.
PPS Eron hails from North Carolina and enjoys racing … of a different variety (24 Hours of Lemons).
We’re looking for another person to join our operations team. We would prefer someone in the United States who is interested in both application development and systems engineering.
You will automate Sysadmin tasks with Chef, troubleshoot application performance issues with New Relic, and create new tools to make things run more smoothly and efficiently. You’ll also be expected to participate in an on-call rotation after you’re fully up to speed (traditionally 1 to 2 months).
Want more information? Check out this post on the Job Board.
We’re excited to announce another new addition to our team: Will Jessop, who pings from Manchester in the UK, is now the second EU engineer on our Operations team.
Will comes from Engine Yard where he provided scaling and support expertise for Rails sites such as Seeking Alpha, Groupon and KgbKgb. Recently Will worked on a senior team tasked with live migrating more than half of Engine Yard’s private cloud customers to a new cloud platform. It’s going to be great to have another member of the team with such deep understanding of application development and systems implementation practices.
With the addition of Will, we’ll have a lot more ability to conduct off-hours maintenance and scalability testing. He will also be helping us improve deployment practices, standardize our applications on the latest Ruby, and improve performance of our applications in Europe and the rest of the non-US world.
As my final installment in the Nuts & Bolts series, I want to hit a few of the questions that were sent in that I didn’t get a chance to get to earlier in the week. I hope you’ve enjoyed reading these as much as I’ve enjoyed writing them.
What colocation provider did you choose, and why?
After an exhaustive (and exhausting!) selection process, we chose ServerCentral to host our infrastructure. They have an awesome facility that has some of the most thoughtful and redundant datacenter design I’ve ever seen. On top of top notch facilities they have a great network via their sister company nLayer.
Finding a partner who could manage the hardware for us without us having to be onsite was a big deal for us too. The quality of “remote hands” support from datacenter to datacenter is, well let’s just call it inconsistent and be generous. ServerCentral has a great reputation with its customers in that regard and we’ve found their support to be excellent. They manage all of the physical installations, hardware troubleshooting, and maintenance for us.
Next up in the Nuts & Bolts series, I want to cover storage. There were a number of questions about our storage infrastructure after my new datacenter post asking about the Isilon storage cluster that is pictured.
To set the stage, I’ll share some file statistics from Basecamp. On an average week day, there are around 100,000 files uploaded to Basecamp with an average file size that is currently 2MB for a total of about 200GB per day of uploaded content in Basecamp. And that’s just Basecamp! We have a number of other apps that handle tens of thousands of uploaded files per day as well. Based on that, you’d expect we’d need to handle maybe 60TB of uploaded files over the next 12 months, but those numbers don’t take into account the acceleration in the amount of data uploaded. Just since January we’ve seen an increase in the average file size uploaded from 1.88MB to 2MB and our overall storage consumption rate has increased by 50% with no signs of slowing down.
When I sat down to begin planning our move from Rackspace to our new environment, I looked at a variety of options. Our previous environment consisted of a mix of MogileFS and Amazon S3. When a customer uploaded a file to one of our applications we would immediately store the file in our local MogileFS cluster and it would be immediately available for download. Asynchronously, we would upload the file to S3, and after around 20 minutes, we would begin serving it directly from S3. The staging of files in MogileFS was necessary to account for the eventually consistent nature of S3.
While we’ve been generally happy with that configuration, I thought that we could save money over the long term by moving our data out of S3 and onto local storage. S3 is a phenomenal product, and it allows you to expand storage without having to worry much about capacity planning or redundancy, but it is priced at a comparative premium. With that premise in mind I crunched some numbers and was even more convinced that we could save money on our storage needs without sacrificing reliability and while reducing the complexity of our file workflow at the same time.
The main contenders for our new storage platform were either an expanded MogileFS cluster or a commercial NAS. We knew that we did not want to have to juggle LUNs or a layer like GFS to manage our storage, so we were able to eliminate traditional SAN storage as a contender fairly early on. We have had generally good luck with MogileFS, but have had some ongoing issues with memory growth on some of our nodes and have had at least a couple of storage related outages over the past couple of years. While the user community around MogileFS is great, the lack of commercial support options raises its head when you have an outage.
After weighing all of the options, we decided to purchase a commercial solution and we settled on Isilon as the vendor for our storage platform. Protecting our customer’s data is our most important job and we wanted a system that we could be confident in over the long term. We initially purchased a 4 node cluster of their 36NL nodes, each with a raw capacity of 36TB. The usable capacity of our current cluster with the redundancy level we have set is 108TB. We’ve already ordered another node to expand our usable space to 144TB in order to keep pace with the storage growth that took place between the time we planned the move and when we implemented it.
The architecture of the Isilon system is very interesting. The individual nodes interconnect with one another over an InfiniBand network (SDR or 10 Gbps right now) to form a cluster. With the consistency level we chose, each block of data that is written to the cluster is stored on a minimum of two nodes in the cluster. This means that we’re able to lose an entire node without affecting the operation of our systems. In addition, the nodes cooperate with one another to present the pooled storage to our clients as a single very large filesystem over NFS. Isilon also has all the features like snapshots, replication, quotas, and so on that you would expect from a commercial NAS vendor. These weren’t absolute requirements, but they certainly make management simpler for us and are a welcome addition to the toolbox.
As we grow, it’s very simple to expand the capacity of the cluster. You just rack up another node, connect it to the InfiniBand backend network and to the network your NFS clients are connected to and push a button. The node configures itself into the existing cluster, its internal storage is added to the global OneFS filesystem, its onboard memory is added to the globally coherent cache, and its CPU is available to help process I/O operations. All in about a minute. It’s pretty awesome stuff, and we had fun testing these features in our datacenter when we were deploying it.
For now, we continue to use Amazon S3 as a backup, but we intend to replace it with a second Isilon cluster in a secondary datacenter which we’ll keep in sync via replication within the next several months.
As a part of our ongoing Nuts & Bolts series I asked for questions from readers about the kinds of things they’d like to see covered. One of the topics that came up several times was how we manage our database servers.
All of our applications, with the exception of Basecamp, follow a pretty similar model: We take a pair of Dell R710 servers, load them up with memory and disks, and setup a master/slave pair of MySQL servers. We use the excellent Percona Server for all of our MySQL instances and couldn’t be happier with it.
Here’s an example of one of our MySQL servers. In this case, the Highrise master database server:
Dell R710
2 x Intel Xeon E5530 Processors
96GB RAM
6×146GB 15,000 RPM SAS drives
For the disk configuration we take the first two drives and put them into a RAID1 volume that is shared between the root filesystem and MySQL binary logs. The remaining drives are placed into a RAID10 volume which is used for the InnoDB data files.
We only use RAID controllers that have a battery backup for the cache, disable read-ahead caching, and turn on write-back caching. With this setup we’re able to configure MySQL to immediately flush all writes to the disk rather than relying on the operating system to periodically write the data to the drives. In reality, the writes will be staged to the controller’s cache, but with the battery backup we are protected from unexpected power outages which could otherwise cause data loss. In addition, since the controller is caching the writes in memory, it can optimize the order and number of writes that it makes to the physical disks to dramatically improve performance.
As far as MySQL configuration is concerned, our configuration is pretty standard. The most important tips are to maximize the InnoDB buffer pool and make sure that you have a BBU enabled RAID card for writes. There are other important configuration options, but if you do those two things you’re probably 75% of the way to having a performant MySQL server.
Here are some of the most important configuration options in the Highrise MySQL config file:
I’m not going to talk much about backups other than to say you should be using XtraBackup, also from our friends at Percona. It is far and away the best way to do backups of MySQL.
For Basecamp, we take a somewhat different path.
We are on record about our feelings about sharding. We prefer to use hardware to scale our databases as long as we can, in order to defer the complexity that is involved in partitioning them as long as possible—with any luck, indefinitely.
With that in mind, we went looking for an option to host the Basecamp database, which is becoming a monster. As of this writing, the database is 325GB and handles several thousand queries per second at peak times. At Rackspace, we ran this database on a Dell R900 server with 128GB of RAM and 15×15,000 RPM SAS drives in a Dell MD3000 storage array.
We considered building a similar configuration in the new datacenter, but were concerned that we were hitting the limits of I/O performance with this type of configuration. We could add additional storage arrays or even consider SAN options, but SSD storage seemed like a much better long term answer.
We explored a variety of options from commodity SSD drives to PCI-express based flash memory cards. In the end, we decided to purchase a pair of MySQL appliances produced by Schooner Information Technology. They produce a pretty awesome appliance that is packed with a pair of Intel Nehalem processors, 64GB of RAM, 4×300GB SAS drives, 8 x Intel X25-E SSD drives. Beyond the hardware, Schooner has done considerable work optimizing the I/O path from InnoDB all the way down through the system device drivers. The appliances went into production a few weeks ago and the performance has been great.
I sat down with Jeremy Cole of Schooner a few weeks ago and recorded a couple of videos that go into considerably more detail about our evaluation process and some thoughts on MySQL scaling. You can check them out here and here.
With all the recent talk about the fabulous new office space that the Chicago crew just moved into, I wanted to share a little bit about another long term move that is nearing completion. For the last four years our infrastructure has been hosted with Rackspace. As of last weekend, the vast majority of our traffic is now being served out of our own colocated server cluster.
Some of our Dell R710 Servers, we have a bunch of these.
Mark, Joshua, and John on life as a 37signals Sys Admin
The Sys Admin team discusses hosting the 37signals apps, working with programmers, helping support, telecommuting, dealing with vendors, improving speeds in Europe, and more.
More episodes
Subscribe to the podcast via iTunes or RSS. Related links and previous episodes available at 37signals.com/podcast.
Spread the word
Like this episode? Please share it with your friends:
Configuration management doesn’t sound sexy, but it’s the single most important thing we do as sysadmins at 37signals. It’s about documenting an entire infrastructure setup in a single code base, rather than a set of disparate files, scripts and commands. This has been our biggest sysadmin pain point.
Recently we hit a milestone of easing this pain across our infrastructure by adopting Chef, the latest in a long line of configuration management tools.
We struggled with a few other tools before settling on Chef. We love it. It’s open source, easy to hack on, opinionated, written in Ruby and has a great community behind it. It’s really changed the way we work. I think of it as a Rails for Sysadmins.
Here’s a snippet of all the data required to make Chef configure a bare Ubuntu Linux install as a Basecamp application server.
As an early adopter, we’ve helped Chef grow and opened our repository of Chef recipes. If you’re interested in using Chef, take a look there for some example uses. Please fork and provide feedback on Github.
A couple of years ago a lot of buzz started in the Ruby community about Erlang, a functional programming language developed by Ericsson originally for use in telecommunications systems. I was intrigued by the talk of fault tolerance and concurrency, two of the cornerstones that Erlang was built on, so I ordered the Programming Erlang book written by Joe Armstrong and published by the Pragmatic Programmers and spent a couple of weeks working through it.
A year later, Kevin Smith began producing his excellent Erlang in Practice screencast series in partnership with the Pragmatic Programmers. It’s amazing how much difference it made for me to be able to watch someone develop Erlang applications while talking through his thought process along the way.
As I was learning Erlang, I kept threatening to rewrite the poller service that handles updating Campfire chat rooms when someone speaks in room. At some point my threats motivated Jamis, who was also playing with Erlang in his free time, to port our C based polling service to Erlang. Jamis invited me to look at the code and I couldn’t help myself from refactoring it within an inch of it’s life.
The code that Jamis wrote worked fine, but it was not very idiomatic Erlang. While I didn’t have much more experience developing Erlang code than Jamis, I had definitely seen more real Erlang code. I tried to pattern our work after what I had been exposed to, making improvements along the way. We ended up with 283 lines of pretty decent Erlang code.
parameter(Parameter, Parameters) ->
case lists:keysearch(Parameter, 1, Parameters) of
{value, {Parameter, Key}} -> Key;
false -> undefined
end.
For the curious, here’s a very simple example function from the real Campfire poller service. This function takes two arguments, the name of a parameter to search for, and the list of parameters. If it finds a matching parameter it returns the associated value, otherwise it returns the atom undefined. Atoms are like symbols if you’re a Ruby programmer.
Last Friday we rolled out the Erlang based poller service into production. There are three virtual instances running a total of three Erlang processes. Since Friday, those 3 processes have returned more than 240 million HTTP responses to Campfire users, averaging 1200-1500 requests per second at peak times. The average response time is hovering at around 2.8ms from the time the request gets to the Erlang process to the time we’ve performed the necessary MySQL queries and returned a response to our proxy servers. We don’t have any numbers to compare this with the C program that it replaced, but It’s safe to say the Erlang poller is pretty fast. It’s also much easier to manage 3 Erlang processes than it was the 240 processes that our C poller required.
Erlang definitely isn’t a replacement for Rails, but it is a fantastic addition to our collective toolbox for problems that Rails wasn’t designed to address. It’s always easier to work with the grain than against it, and adding more tools makes that more likely.
A common request we get from readers is to describe in more detail how our server infrastructure is setup. That question is so incredibly broad that it’s hard to answer it in any kind of comprehensive way, so I’m not going to try to. Instead, I’m keeping the general desire for more technical details in mind as I work through day-to-day issues with our configuration, and I’ll try to occasionally write about things that I think might be of interest. The topic for today is HAproxy.
Our awesome systems administrator Mark Imbriaco participated in the first episode of the SA Pro podcast recently. You’ll hear Mark talk about how formal education is the only way to become a good sys admin and that Perl is better than Rails.
This is the third in a series of posts showing how we use Campfire as our virtual office. All screenshots shown are from real usage and were taken during one week in September.
This time we’ll take a look at how Campfire is an integral part of our sysadmin and development efforts.
Discover and fix a code failure
Whenever someone checks in a piece of code, CIA (Continous Integration Agent) automatically runs our test suites and reports on any failed tests.
Analyze a server problem
David and Mark discuss a server issue.
Subversion shows changes to the code Subversion tracks changes Ryan recently uploaded. Jason offers kudos on the copy edit made.
Tell everyone about a server change
Sam deploys changes to Backpack and details what was changed.