SEASON 2 - EPISODE 0070 - AUG 9, 2023

Software Has Bugs

Bugs are an inevitable part of complex software and aiming for complete bug-free perfection is not only unrealistic, but it hinders progress and product delivery.

In this episode of REWORK, host Kimberly Rhodes sits down with 37signals founders, Jason Fried and David Heinemeier Hansson to discuss why you need to be realistic about bugs in software development.

Listen in as David and Jason offer a behind-the-scenes look at the two-tiered approach to handling bugs in their software at 37signals and their triage plan for determining which should be fixed, when, and by whom.

Tune in to uncover strategies to manage your customers’ expectations while dealing with bugs WITHOUT deviating from your product roadmap.

Check out the full video episode on YouTube.

Show Notes

00:00 - Kimberly opens the show and shares the topic for discussion, that bugs in software are normal.
00:27 - David shares why you need to be realistic about bugs in software development.
01:18 - Bugs, a natural side effect of software.
02:08 - What makes users abandon a piece of software (hint: it’s usually not a few bugs)?
05:26 - The two-tiered (non-emotional) approach to handling the vast spectrum of bugs in complicated pieces of software.
06:13 - When is a “bug” not a “bug”?
07:00 - Handling customer expectations without screwing up your product roadmap.
07:52 - You need a filter: the double-edged sword of founders operating in customer support.
08:32 - David shares a behind-the-scenes look at triage at 37signals.
09:31 - The novel QA approach of the Toyota production line that 37signals tries to emulate.
11:37 - Jason shares the difference between the software and auto industries when fixing production problems.
12:37 - Yes, quality matters, but perfect never gets shipped.
14:30 - So how do we build useful, meaningful software?
15:21 - Breaking out of bad bug thought patterns so you can keep making software of value.
16:47 - Who oversees fixing the bugs at 37signals: the methods they use to determine what gets fixed, when, and by whom.
20:08 - Cleaning up the tech debt: the vital importance of a measured, mature way of scheduling things.
21:09 - Don’t create ****** software—it can’t be fixed.
23:55 - Jason discusses the idea that the organization itself might be a bigger problem than just a bug in the software.
24:44 - Why you should never become “too big” to listen to your customers.
26:28 - For more, check out our dev.37signals.com blog, where the 37signals developers write about some of their processes.
26:45 - REWORK is a production of 37signals. You can find show notes and transcripts on our website. Full video episodes are also available on Twitter and YouTube. If you have questions for David and Jason about a better way to work and run your business, we’d love to answer them. Leave your voicemails at 708-628-7850 or send an email.

Links & Resources

Transcript

Kimberly (00:00): Welcome to REWORK, a podcast by 37signals about the better way to work and run your business. I’m your host, Kimberly Rhodes. 37signals is in the business of software development, but their founders, Jason Fried and David Heinemeier Hansson readily admit that creating software that is completely bug free is unrealistic. Jason and David are here to talk more about this. David, I’m gonna start with you. I think some listeners might be surprised to hear the CTO of a company saying like, yeah, bugs are normal.

David (00:27): Yes, I think that is simply accepting reality, which is a really good place to start if you want to have some impact on reality and if you want to be able to be realistic about what you can do about it. I think, I’m not entirely sure where this started, but as long as I can remember, bugs have had this relationship to guilt when it comes to software development. That a bug is a representation of our human failings as programmers, as designers, as product makers, which I think is quite unfortunate when it’s so prevalent. Like if bugs were something that very rarely happened and it only happened through sheer negligence or malice, okay, fair enough. We could assign all this guilt to it because it is something that would be clearly preventable. Now, we’ve dealt with software for the better part of 50 plus years. Every piece of software in that entire time has had bugs.

(01:18): If it had any serious complexity or any users. There’s basically only two types of software that does not have bugs, either the kind of stuff that doesn’t get used by anyone or the stuff that is so simple it can fit on a single sheet of paper. Anything more complicated than that will have bugs and accepting that and dealing with it in a measured, logical, not overly, um, sort of emotional way, I think it’s just the only reasonable mature response to software development if you’re gonna be in it for the long run. Because when we assign this level of guilt to it, we just end up in a place where we’re feeling bad about a natural side effect of what we’re doing. We could stop making bugs, but that would also mean we would stop making software. Is that a good trade?

(02:08): No. Most software, in fact, I would say what’s really interesting, the really interesting cases are the pieces of software that are written with bugs and people continue to use because the value of that piece of software is higher than the annoyance with the bugs, right? So there’s, I’ve had pieces of software in my history where I’ve been really annoyed with some bugs that seem glaring and obvious, which is this other telltale, and I kept using it because it was just so valuable. But what an individual user often misses is that what makes a bug glaring and obvious to them is how they use that software. And there’s a ton of other people who don’t see those same glaring bugs because they don’t use the software in the same way. They don’t have the same kind of data or they, they input it in that way or they didn’t use that combination of features in the way that triggers that bug.

(02:59): Um, but as we always are, we’re so myopic, we think the way we use something is the same way that everyone else uses it and like, we just can’t believe why couldn’t they see this? Right? So if you realize that most people use any serious piece of software, any complicated piece of software in very many different ways, it’s a little easier to understand. Now, all that being said, people will absolutely stop using software that’s of low quality and low value. So this isn’t like permission just to push in all the bugs or not care about bugs or never fix bugs. Um, we will literally drop everything at 37signals. We have two levels of dropping everything. We have what we call code red, that is, there’s a really serious bug that might be causing data loss or, uh, presenting as data loss. That’s usually the kind of things that we have.

(03:49): It looks like the customer lost some data, but actually it was just not visible. Um, or prevent someone from accessing data that they need. That’s the kind of stuff we just drop everything, go straight to the bug, pull as many people into dealing with it as possible, and then we work on it until it’s fixed. We actually just had one of these cases on, uh, the HEY email system where we had this bug where people using the iOS app on a slower network connection would end up seeing this, um, configuration error. Those are the most annoying bugs by the way, when it’s misdirection. It’s telling you that the problem with you using it is something that makes no sense. I think anyone who has used a piece of software has seen, oh yeah, this is, um, error code minus 100. Basically like it doesn’t make any sense, right?

(04:35): Users justifiably get really frustrated with stuff like that. We had one of those cases and for a while we thought like it was on a slow burner and then there was just enough: A, it had been going on for too long, B, it had affected enough people. We were like, no, actually this has just graduated. This went from one of those bugs where bugs are not special, we’re just gonna prioritize this maybe the next cycle or whatever, to, nope, this was first a code yellow and then we couldn’t quite fix it in a week and then we made it a code red. So thinking of the graduation with bugs and the fact that there’s a whole spectrum of them, there’s the fact that any complicated piece of software will have probably hundreds of open bugs. Many of them will never be fixed. Never, right? It’s not like we work on a first in, first out order here where you see a bug, then you log it and then, you know, eventually someone’s gonna fix it.

(05:26): No, I guarantee you we have bugs in Basecamp that have been there for 10 years. Absolutely. I mean, we have prior versions of Basecamp that we’re still selling that just have the original bugs in them and it will have those bugs until the end of time. They’re like these, uh, actual physical bugs that were captured in amber. And you can still see, oh, this is a mosquito that’s 2.4 million years old. So you have that and you have that in every piece of, um, complicated software, but then you also have the other kind, right? So a non-emotional approach to handling bugs is to A, accept that they will always happen. B, realize that there are different kinds of bugs. We call everything bugs. When there’s like a piece of graphical jitter, it doesn’t align right, the line breaking is wrong. Oh, there’s a bug. Yeah, okay, that’s a bug.

(06:13): And then there’s the bug that like, I just lost your important emails. I mean calling both of those things bugs is doing a disservice to the latter. Like the latter stuff is material impact, whatever, loss of money, so on and so forth. The first thing is a minor annoyance at best. So separating just those two things up and then just realizing that the vast majority of the bugs will be of the former kind. They will be low impact, minor annoyances. They should just be scheduled along with everything else and some of them will never be fixed. And setting those expectations with customers in particular is sometimes difficult. Any customer we have, including when I’m a customer of someone else’s software, will encounter a bug and then will think that is the most important thing you could be working on because I just saw it. I just saw it and it either annoyed me or it was more serious than that.

(07:00): You should just drop everything and fix it. Um, and if you let that sentiment translate directly into your organization, you’re gonna get a very screwed up product roadmap because you will never be able to finish anything of note if you’re constantly just dealing with that on an interruption basis. This is actually one of the reasons I think it’s such a double-edged sword to have founders do, um, customer support in the beginning. Because on the one hand it’s really great that you get all this unfiltered feedback and you can instantly plow it into development. You can instantly micro adjust where you’re going. But on the other hand, it is so easy to confuse the enthusiasm a customer might have with their annoyance with, this is something we need to fix right now. Customer’s very upset. I’m getting that upset. I’m passing that upset straight on to the development team.

(07:52): You’re like, eh, maybe you should be acting as a filter. And sometimes that’s easier when you have actual customer service professionals, as we do now, they can act as kind of a filter. They record all the bugs, they will make sure that things get, um, properly escalated if it’s really serious. But a lot of the minor stuff won’t reach developers with the intensity that causes sort of these existential guilt, um, issues of, oh yeah, I didn’t do a good job. I’m actually a terrible programmer. Eh, I don’t know. I mean, again, all the best programmers in the world, they make bugs. I make bugs all the time. Everyone makes bugs all the time. Realize that first. Then, um, treat things in a measured way as to how you try to fix it.
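
The two levels of "dropping everything" described above can be sketched roughly in code. This is only an illustration under stated assumptions, not 37signals' actual tooling: the names `Severity`, `Bug`, `classify`, and `escalate`, the field names, and the numeric thresholds for "too long" and "enough people" are all hypothetical. The only details taken from the episode are that data loss (real or apparent) and blocked access to data warrant a code red, and that a code yellow that couldn't be fixed in a week was made a code red.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ROUTINE = "routine"          # "bugs are not special": scheduled like other work
    CODE_YELLOW = "code yellow"  # the first level of dropping everything
    CODE_RED = "code red"        # other work stops until it is fixed


@dataclass
class Bug:
    title: str
    causes_data_loss: bool = False    # real data loss, or something presenting as it
    blocks_data_access: bool = False  # customer cannot reach data they need
    days_open: int = 0
    customers_affected: int = 0


# Hypothetical thresholds; the episode gives no numbers for these.
LONG_RUNNING_DAYS = 14
MANY_CUSTOMERS = 50
# From the episode: a code yellow not fixed within a week became a code red.
YELLOW_DEADLINE_DAYS = 7


def classify(bug: Bug) -> Severity:
    """Pick an initial severity for a newly triaged bug."""
    if bug.causes_data_loss or bug.blocks_data_access:
        return Severity.CODE_RED
    if bug.days_open >= LONG_RUNNING_DAYS and bug.customers_affected >= MANY_CUSTOMERS:
        return Severity.CODE_YELLOW
    return Severity.ROUTINE


def escalate(severity: Severity, days_at_severity: int) -> Severity:
    """Graduate a code yellow to a code red once it has outlived its deadline."""
    if severity is Severity.CODE_YELLOW and days_at_severity > YELLOW_DEADLINE_DAYS:
        return Severity.CODE_RED
    return severity
```

A wrapped label or a misaligned line comes out as `ROUTINE` here, which matches the point that most bugs are minor annoyances that get scheduled along with everything else.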

Kimberly (08:32): Okay, so David, you mentioned a code yellow and a code red. I’m curious if you could kind of talk us through what does that triage process look like? Like can anyone declare something a yellow versus a red? Like how does that filtering look amongst our team?

David (08:45): Yeah, that’s a really good point and something I think we still work on because there’s a sense I think for most employees inside of a company that they perhaps haven’t been at for a million years that like calling something, for example, at us, code red means other work stops. So that’s kind of like a high order bit to pull. You don’t wanna just do that willy-nilly. If you’re constantly calling code red, how are we gonna get anything else done, right? So I think the impulse is to do it too late. In fact, I’d say in at least half the cases that I can immediately recall, it was me or maybe Jason or someone else quite senior who stepped in and went like, no, no. Do you know what? This is just not acceptable. We have to call a code red such that we deal with it right now.

(09:31): I think we try to, I try to at least convince people that code red and code yellow are kinda like the Andon cord. This was a concept from the Toyota production line, which when they introduced this was a completely novel idea. The Toyota production line got so good and Toyota cars so reliable in part because of the way they approached quality assurance. And the quality assurance process was, we will give every worker on the assembly line the power to stop the entire line if they notice problems that should be fixed at the root. And the wisdom and the insight here is that the people closest to the work will notice when things aren’t right. So they have the most information. The Toyota managers, they don’t have the information of whether, whatever, the air conditioner doesn’t fit quite right, so it has to be jammed in a little bit and that sometimes clips a thing and that then shows up in customer recalls or returns or something.

(10:27): But the person doing that work actually does. So if we give the people the power to stop the assembly line where they notice material issues like that, we will improve the assembly line. Now the hard part here is that when you first do this, at least as the story is told in Toyota lore, is that at first this assembly line would stop all the time, constantly stop and, and you’d go like, okay, well maybe you can deal with one day of that. But if you’re in week two or month three, you might start getting a little nervous that all this work stops. You’re not producing the cars. Who cares if they have issues? But the cumulative effect of that is eventually you will be producing fault free cars, which Toyota essentially does, right? I don’t think there’s anyone who produces internal combustion engine cars with as low a defect rate as Toyota does today. And it goes exactly to this. So that’s the spirit that I try to reinforce and I don’t think we do nearly as good a job of this as Toyota has done, but thinking of it in these terms, anyone should be able to stop the assembly line if they notice something like that that’s just out of alignment. That if we keep doing that, we’re just gonna keep producing issues that are gonna affect a majority of, uh, of customers in that regard.

Jason (11:37): I was gonna add one quick thing to that too. One of the things that’s hard about it in the software world compared to the car world is that when a car goes off the production line and goes onto the lot and it’s sold to a customer, this isn’t always true, it’s less true than it used to be before, but like that’s a shipped product that cannot be fixed unless someone brings it back to the dealership. Now there’s some over-the-air updates and certain brands and whatever where that’s happening, but in software, in the world that we work, um, I think people are a little bit more afraid to pull the cord because like software can be fixed later in a sense where like a car cannot be fixed later. So I think there’s something about like once this thing is off the line, that problem is shipped and it’s different in our world, at least software as a service. It wasn’t traditionally the way in shipped software where people had to, you know, press something onto a CD or onto floppies and send it off. But I do think there’s a little bit, people are a little bit more reticent to, to pull for that reason. Uh, that like, well someone else will see it or we can just deal with it later and that’s a hard thing to combat, I think, that human instinct.

David (12:37): And I think that’s perhaps one of those areas where, I mean I go most aggressive on I think the problem being here, there’s too much guilt. But you can certainly also flip over in the other direction where you’re just like, bugs just don’t matter at all. I mean, no, they totally do matter and you do have to have some processes and some encouragement that someone is going to pull the cord to stop it. And you have to have a culture that like quality matters. But I think the sense of quality is one where like we build good stuff. We don’t build perfect stuff because perfect is the enemy of good. Perfect is the stuff that takes forever to build. Perfect is the stuff that never gets shipped. Perfect is just a minor subset of what’s valuable. When you have a commitment to building good solid stuff that inherently includes the willingness to make trade-offs that you know what, we could fix a bunch of very low priority bugs where things are wrapping wrong or colors are wrong or whatever.

(13:38): Or we could work on something new and valuable, a feature that would benefit everyone, right? And you have to constantly trade those things off. And I think that is the art and science of software development, to be able to hold that conflict. No, bugs are not great, but they will always happen if we’re trying to create useful, meaningful software in a reasonable amount of time on a complicated system, right? So that’s the world we live in. Um, let’s try to make it as good as possible. And I think this is one of the reasons why focusing on these processes of what can you do when there are material bugs? How do you prevent like, oh, bugs are not special from becoming, we build shitty software. As soon as you tip into we build shitty software, you are off. What I like to, um, think about in that regard is focusing a bit on the habits and the process of building software, not just the outcome.

(14:30): So how do we build software? Um, well we build software where we do automated testing. That’s just one thing that just comes with the territory. We build software where the software itself, like its internal properties feel like they’re of high quality. We will spend time polishing the appearance, the aesthetics, the beauty of the software that we are building because in part one of the derivative effects is that when you care a lot about how something looks and simplifying it, it gets easier to spot the obvious bugs. It gets easier to actually attain a certain level of quality. So when you have the habits of like, we build good stuff of high quality, I think you, you’re already quite well underway to, um, ending up in a reasonable place where you accept the fact that there are some bugs, we fix the important or difficult ones or the most impactful ones right away.

(15:21): And most everything else we, we leave to the side. And most of all, you have the, I wanna say almost moral stamina to keep going. I’ve seen people get absolutely destroyed by guilt over bugs. This often happens in open source development where the customers aren’t even paying you if you wanna put it in that way, right? Like they’re benefiting from your part-time gifts and then they show up, oh, it has a bug and rah, and then you go like, oh man, I’m also just such a bad programmer because I didn’t see this already now. No, no, please stop. Right? Um, just stop. First of all, you’re, you’re making something of value. That’s why they’re complaining and this should already dilute the level of complaint that they’re giving you. Um, but also if you’re focusing on just building stuff to the best of your ability and you’re constantly striving to improve, you know it’s not gonna be any better. This is one of these things that sometimes does get me frustrated when good software developers are kind of ridden by guilt over the occasional bug they put in. Like what’s the alternative? If you are already putting in the best of your ability? Do you have another 20% of capacity? If so, why weren’t you just using it all the time? Right? So there’s just some, um, bad thought patterns that I think programmers in particular are susceptible to falling into that you should break out of while still maintaining we build good stuff.

Kimberly (16:47): Okay. I know a lot of people like to hear how 37signals is organized. So tell me a little bit about who is fixing bugs. Is it just the on-call team? Are designers and developers fixing bugs? Like how does that look structurally at the company?

David (17:01): So the way we do it generally is if it’s a serious bug that warrants the call of a code, particularly a code red, we will pull in whoever worked on that feature because they have the most information about how that works. And then we also have this sense of if we’re launching something within that same cycle, especially during cool down, so we’ll work for six weeks and then we take two weeks of cool down, you’ll work on something, let’s say for six weeks you’ll push it out and then in the first week you’ll find 80% of the issues if you’re pushing to a large enough customer base. The team who just worked on that is responsible for cleaning it up. But after that phase, it is everyone’s job to ensure that we have software of high quality and we have a couple of different methods of doing this.

(17:45): One is we have what’s called the on-call rotation where we have two programmers who will work on just fixing issues and answering questions and correcting data for a week at a time. And it is their responsibility to fix minor things that can be fixed within the scope of that week. So let’s say a customer found some edge case, they report that to support, support goes like, oh yeah, that’s really an issue. They’ll bubble it up to the on-call team. The on-call team will take a look and if it’s something small they can fix right away, they’ll fix it right away. If it’s something that’s urgent and important, they might call a code. If it’s not something they can fix right away and it’s not urgent and important to put a code on, they will put it on what people sometimes call the backlog. I like to call it like a, a forget drawer because it essentially gets put into a place where like, do you know what, there’s a fairly high likelihood this is never gonna get fixed.
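
The on-call decision path described above (fix it now, call a code, or put it in the drawer) can be sketched as a small function. This is a hedged illustration only: `Action` and `on_call_triage` are hypothetical names, and the two boolean inputs stand in for judgment calls that the episode leaves to the on-call programmers.

```python
from enum import Enum


class Action(Enum):
    FIX_NOW = "fix within the on-call week"
    CALL_CODE = "call a code (yellow or red)"
    FORGET_DRAWER = "put on the backlog, a.k.a. the forget drawer"


def on_call_triage(small_enough_to_fix_now: bool, urgent_and_important: bool) -> Action:
    """Mirror the order of checks described for the on-call team.

    Both arguments are assumptions standing in for human judgment.
    """
    if small_enough_to_fix_now:
        return Action.FIX_NOW
    if urgent_and_important:
        return Action.CALL_CODE
    # Not quickly fixable and not urgent: there is a fairly high
    # likelihood this never gets fixed.
    return Action.FORGET_DRAWER


if __name__ == "__main__":
    # A rare edge case that cannot be fixed this week and is not urgent.
    print(on_call_triage(small_enough_to_fix_now=False, urgent_and_important=False).value)
```

The order of the checks follows the order in which they are described in the episode; a team could reasonably check urgency first instead.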

(18:35): That is the category of bugs, that it’s not urgent, that it’s been, or it’s a result of interactions in software that are rare and that it’s lived there for a long time without someone else reporting it too. Those are three attributes of bugs that are highly unlikely to be fixed, um, unless, again, they have that high criticality because it means that, you know what, if it really was important, someone else would’ve found it sooner. If it was really dire we would’ve bubbled it up to a code and then we put it on these long lists and we do have some quite long lists and I’m sure you can find lists for previous versions of Basecamp still in Basecamp somewhere detailing, oh here’s like, uh, 80 bugs related to something never gonna get fixed, right? Um, and then occasionally what we also have is that we will dive into that, um, drawer of probably never gonna be fixed bugs and fix them occasionally.

(19:24): We have done two things. We’ve run spring cleaning. I think that just sounds good and it, it has a nice metaphor to it. So in the spring, occasionally we will dedicate either a whole cycle or part of a cycle to fix something. And then at the end of the year, the way our six cycles a year usually line up is that we will finish our last cycle in the beginning of December and then December is already this kind of, I don’t know if it’s a messy month but it, it’s a month that you have a lot of people out and so forth. So we will dedicate the rest of that month to just fixing issues that otherwise wouldn’t merit full prioritization. So you can have all these interleaving and overlapping ways of doing it, but I think ultimately you have to think of it in terms of like what, where’s our overall quality?

(20:08): Like if I had a sense, if Jason had a sense, do you know what, customers are not picking Basecamp or they’re stopping to use Basecamp because they just keep hitting issues, that it’s a low quality product? Absolutely I would stop all feature development and I will get to the root of that. But if you have a diligent approach to building software, hopefully that’s not where you are. I mean it does happen all the time that people dig themselves into this hole. It usually happens when they don’t have a measured, mature way of scheduling things. They’re constantly thrashing, they’re constantly pushing things out, they’re constantly multitasking. You end up with massive amounts of tech debt. Um, you never go back to clean it up, okay? You’re gonna have a low quality product and the most important thing you could work on is cleaning up that product quality. But if you are where we are, and I think also where the best software companies are, your product is of pretty good quality, treat the bugs as any other thing that needs to be prioritized, trade it off against other value you could deliver to more users more of the time.

Jason (21:09): And to add something to that too, I think it’s easy to think that bugs are the things that need fixing, but really it could be the user experience isn’t very good or something isn’t very clear or this copy, it’s not, the copy isn’t good or doesn’t make sense or the instructions aren’t clear or this flow takes too many steps and these aren’t bugs. I mean, you could classify them as such I suppose, but they’re really just flaws in the product or maybe inefficient approaches to something. And that might be far more important than something that’s quote broken. But if you just focus on brokenness and you just assume everything else that works works well, then you’re never gonna really improve the product in other ways that probably matter more than plugging the holes that people, you know, might fall into. So I think that’s a really important thing. You can get sort of lost in fixing things that are broken and then not attributing any brokenness to anything else because it works, but works how well? That’s the real question.

David (22:09): I think there’s a great way of thinking of this is that there exists a lot of extremely high quality software measured in the sense of it has no bugs or very few bugs that no one gives two hoots about and are not using and are not paying for.

Jason (22:24): Perfectly bad software basically. Perfectly bad. Yeah.

David (22:27): Exactly. But just creating high quality, quote unquote high quality software, bug-free software is not nearly sufficient for you to have a successful business, right? So weigh those two things up. Don’t create shitty software. Although I will actually say, to put it to a point, I would rather you create low quality software that people want than you create high quality software that no one buys. Low quality software that people want, that can be fixed. You can fix those bugs if you have customers and revenue and whatever coming in. You can’t fix a bad product in the sense of a fault free one that no one cares about. That’s a much harder thing to correct for. And I think the myopia that Jason identifies here is very prevalent amongst the engineering types. We have this sense of narrowing in on bugs because they can fit on a checklist, they can be checked off. Checking off, like our onboarding process sucks. That’s very fuzzy, right? But it may very well be worth a thousand bug fixes to correct what the onboarding is like.

Kimberly (23:34): Well I also can’t help but think that customer support and bugs kind of go hand in hand because you want a company where you feel like you can reach out and say, here’s the issue that I’m having and I’m having this problem. Versus like, you know, there’s so many companies that I’m like, I’m never gonna reach out 'cause they’re never gonna respond to me and I’m just not gonna use the product.

Jason (23:55): That’s a, to me, like a bug in the company too. So this is something we talked about in, uh, It Doesn’t Have To Be Crazy At Work, which is that your company is a product as well. And we actually reference this, that what are the bugs in your company? Like what doesn’t run well in your company, um, is something that you should really look at. And that would be an example. Like if you already feel like you’ll never hear from this company, you’ve already written 'em off. Like it doesn’t matter how good or bad their software is, but if their service is terrible or they don’t feel accessible or reachable or whatever, or they don’t feel like they have your back, then you’re out no matter how damn good it is. So that, in my opinion, would be a bug. We’re gonna call it that. We’re kind of classifying that as a bug in the company itself that needs to be fixed far more, I would say faster perhaps than, than some obscure thing that’s in the software that is easy to track as a checkbox.

David (24:44): This is one of these emotions I get all the time when I deal with the big tech conglomerates. Like I never even factor in that I could write Gmail to report an issue because I just have no confidence whatsoever that anyone will even look at it, right? So I think when we say this is one of the more nuanced uh, takes, we need to sort of process here because like we actually do look at this stuff. It’s not that no one will ever look at a bug report coming in. People will absolutely look at it, it will get filtered through. It’s just that having the humility to accept reality that A, the software is gonna have bugs and B, you can’t fix all of them and C, you shouldn’t, um, does not undermine at all that every customer out there who reports an issue deserves to have someone legitimately look at it.

(25:34): Because if there weren’t, how do you even know whether it’s a big issue or a small issue, right? You absolutely need humans to be part of that process and you need to have faith or I prefer to have faith in the companies I buy from that, you know what, they’re gonna take me seriously. That’s really the thing with something like Gmail, which I used for, I don’t know, 15 years, I found all sorts of stuff where I went like, this is not quite right or this is not quite right. And you’re like, never once did it occur to me to write Google to talk about it. Right? And thankfully we have tons of people writing us all the time to report little things and sometimes big things and sometimes, and this is where the hard thing is, it’s a trickle of little things that actually unearth the fact that there is a big thing. The first report you go like, oh, maybe this is a minority issue or whatever, weird combination. Then you hear the second one, then you hear the third one and you go like, uh-oh, pull the cord.
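
The "trickle of little things" pattern described above, where the first report looks like a fluke and the third one means pull the cord, can be sketched as a simple counter. This is a minimal sketch under stated assumptions: `ReportTracker`, `record`, the idea of grouping reports under an `issue_key`, and the default threshold of three are all hypothetical and only loosely based on the "then you hear the third one" remark.

```python
from collections import Counter

# Hypothetical default, loosely based on "then you hear the third one".
PULL_THE_CORD_AT = 3


class ReportTracker:
    """Count customer reports per issue and flag when a trickle adds up."""

    def __init__(self, threshold: int = PULL_THE_CORD_AT) -> None:
        self.threshold = threshold
        self._counts: Counter = Counter()

    def record(self, issue_key: str) -> bool:
        """Log one report; return True once the issue deserves escalation."""
        self._counts[issue_key] += 1
        return self._counts[issue_key] >= self.threshold


if __name__ == "__main__":
    tracker = ReportTracker()
    for _ in range(3):
        should_escalate = tracker.record("ios-config-error")
    print(should_escalate)
```

In practice the hard part is the step this sketch assumes away: a human in support has to recognize that three differently worded reports are the same underlying issue.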

Kimberly (26:28): Okay, well with that we’re gonna wrap it up. I’m also gonna link in our show notes to our developers’ blog, which is at dev.37signals.com. That’s where some of our developers write about some of their processes. You can check that out as well. REWORK is a production of 37signals. You can find show notes and transcripts on our website at 37signals.com/podcast. Full video episodes are also available on Twitter and YouTube. And if you have a question for Jason or David about a better way to work and run your business, leave us a voicemail at 708-628-7850 or you can send us an email at rework@thirtysevensignals.com.