What Software Development Can Learn from Aviation
- Good craftsmanship is the basis for a sustainable and efficient software development process. The groundwork is laid if the different pieces making up your application are of high quality.
- Fostering a culture where everyone follows well-established practices prevents big surprises and creates a cohesive team. Knowing that your colleagues follow high quality standards will make you less likely to take a shortcut when things seem heated. Nobody wants to be the guy constantly being reminded to do his homework.
- Thinking about and preparing in advance for things that could go wrong will not only reduce the time to recover, but also let us sleep better knowing we’ve covered all the bases. For example, practicing how to recover from a database outage might not prevent the outage itself but will make the restoration a lot easier.
- Looking outside of our own software development bubble and finding similarities with others allows us to apply similar and already well proven solutions.
- Taking inspiration from aviation, like focusing on clear and precise language, can help manage the growing complexity of software development.
Whether it’s a personal pet project, an open source project on GitHub, or a large enterprise application, an important part of the software development process is good craftsmanship.
However, a lot of things that are non-negotiable for a good craftsman are often ignored or interpreted in a creative way when dealing with software.
A lot of professions have been around far longer than software development and have developed “best practices” to handle typical problems and challenges. Aviation puts emphasis on precision and reliability – who wants to hear that their flight has been cancelled because of missing or poor-quality maintenance?
So we as software developers can definitely benefit from taking a closer look at aircraft maintenance or a pilot’s processes to learn from them, optimize our own processes, and last but not least, try to reduce some of the stress that we experience over and over again.
Learning from pilots or airline mechanics
In a difficult situation, people tend to do what they are familiar with and what they are good at – not necessarily what they should do to tackle the problem at hand. So we need to align these two things; we have to be familiar with doing the right thing in an emergency.
Pilots do this by practicing these things over and over again. Handling a burning engine will always be an emergency. But if a pilot is familiar with the procedures and knows exactly what to do when it happens, then he will do the right thing because it’s something that he is familiar with, having done it over and over in the simulator.
As software engineers we are often very well aware of the things that could go wrong: a crashed database, a compromised password, or a network error. But we seldom really prepare for this by practicing how to restore a database or how to rotate our security credentials throughout all systems. By doing this, we can not only reduce the time it takes us to recover from such scenarios, but also have some peace of mind because we know that if something does happen we’ll be very well prepared.
Airline mechanics are aware that they are working on an immensely complex machine. After all, a large modern airplane is made up of several million pieces that all have to work together. Maintenance on an airplane doesn’t simply happen; it is planned, checked and double checked. People take their time to prepare the part of the airplane they’re working on.
As software engineers we often start coding immediately, because “code is where the fun happens”. We don’t take the time to actually think about “What are we trying to accomplish? How does this change relate to the overall application?” Taking a step back to get a clear idea of what should happen and how to approach the problem is often a lot more useful than rushing to action, or as an unknown developer famously said, “Weeks of coding can save hours of planning”.
Whenever you build something, you always leave a mark of yourself on it. A good craftsman takes ownership of what he does, because he feels responsible for making sure the result is as good as it can get, not just because a bad result will reflect badly on him, but also because it’s his inner attitude and work ethic.
Deviating from the right path, like not working thoroughly or taking a shortcut, may sometimes seem like a quicker way to reach a goal. But like a boomerang, it will come back and hurt you – sooner rather than later. Sloppy work will yield sloppy results. Sloppy results will force us to go back and spend time on fixing things. Fixing things takes time away from building new features and from producing actual value for our users. It will increase our frustration (“I thought I fixed this yesterday. Why is this happening?”) and will slow us down.
So good craftsmanship is the basis for a sustainable and efficient software development process.
My team and I used to work with a difficult client who was fresh out of a bad experience with another vendor. So every result was checked twice, and we had to explain even our smallest decisions in great detail. But by doing our homework and delivering high quality work, we were able to, step by step, convince her that we paid attention to all the details and covered all the bases. Trust grew, the review meetings became shorter and shorter, and eventually stopped completely. By demonstrating a culture of high quality and good craftsmanship, we were not only able to deliver the product on time and on budget, but also got an internal referral to work on a different application as well.
How aviation works
Aviation is a very complex environment with a lot of parties involved, and every single one of these parties is aware that their actions have consequences for all others. An inoperative runway at a major airport like London Heathrow might cause delays more or less all over the world, so everyone tries to make sure that all the parts of this global machinery stay well oiled.
This includes things like always thinking a couple of steps ahead. When a flight is delayed and passengers miss their connections, airlines don’t wait until the plane has landed and the passengers start to approach the customer service agents. Instead, they proactively rebook these passengers on alternative flights to ensure they end up where they need to be.
Although an airplane only makes money when it’s in the air and actually transporting passengers from A to B, every airline is aware that maintenance is a crucial part of their overall operation. Skipping a minor check or two may work for a short period of time, but eventually will make future maintenance operations take longer and become more expensive. This doesn’t only apply to the hardware – the aircraft – but also to the software – the processes and procedures. Investing time into “what-if” scenarios may seem useless at first, but when a snowstorm suddenly appears it’s comforting to know that the procedures for what to do already exist and just need to be activated.
Another important aspect is safety. Aviation prides itself on having the highest safety standards. A large part of the training, as well as the daily business, is preparing and being ready for emergencies. Not only pilots, but also ground staff and air traffic control train for emergency procedures over and over again so that they are intimately aware of what to do and how to behave if something unexpected happens.
Nobody wants them to happen, but the cruel truth is that accidents will happen. We can work to reduce them, but it’s just a question of when, not if, they will happen. Aviation has again and again demonstrated that what matters is how you react to them. Every accident is ruthlessly analyzed and the findings are always used to optimize existing hardware and processes. Data is openly shared between otherwise competing airlines and other parties because here they all have a common goal: making sure that an accident whose cause is known will, at the very least, not happen again.
What stops us from following “good practices”
I may experience pressure from my manager to finish a feature earlier or add additional scope items. The right thing to do would be to explain to him that delivering the feature with good quality simply needs time. But that’s a difficult discussion to have and something that I don’t want to do.
So a very natural reaction is to cut corners and try to drop tasks that don’t seem that urgent to do what is asked of me: work faster. Maybe I can drop a couple of tests, as they don’t really block releasing the feature to production. Of course, that will reduce the quality of the codebase and will result in a lot more work further down the road, but for now I can say “Alright, I’m done!”
That’s just a simple example, but it already illustrates how to get on that slippery slope. That’s why it’s so important to foster a culture where everyone follows well-established practices so that cutting corners simply isn’t an option.
Nobody starts out by saying “I don’t follow good practices because I want to create a crappy application”. Most of the time it’s simply taking a shortcut to avoid confrontation.
One of the ways we raise awareness of the importance of following good practices in our team is to invite everyone to look at pull requests of projects they’re interested in or want to know more about. These reviewers usually aren’t directly involved in implementing the feature, and thus provide a fresh perspective on the code. It’s easy to become lost in one’s own thoughts and ideas, but getting a friendly reminder like, “This looks good, but don’t you think it’s a bit too much code duplication and not enough clarity on what the code intends to do?” helps to get back on track.
Our team also regularly discusses what we have experienced in the past and what conclusions we can draw from that. Hearing from a colleague who has spent hours debugging because the data flow wasn’t traceable leads to empathy from others and better code in the future. No one wants to see others suffer. It also helps to be reminded of decisions the whole team agreed upon (manifesting itself in review comments like, “Remember, we didn’t want to fall back to generic ‘execute’ methods any more? This may actually be a good opportunity to refactor this module to use more meaningful method names”).
How I applied learnings from aviation
Just a couple of months ago we introduced the concept of database outage drills. We are a data-driven company so our data is our most valuable asset. Making sure that it is safe is one of our highest priorities.
As with pilots practicing how to react to an engine outage, we regularly practice how to react to a database outage. Once a month two of our engineers are randomly selected to run a database outage drill. We present them with the scenario that one of the databases on our staging system has crashed and needs to be restored from a backup. In this scenario they are the only people available and need to get the database up and running as soon as possible.
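The random selection described above could be sketched roughly as follows. This is only an illustration, not our actual tooling: the function name `pick_drill_team` and the example names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def pick_drill_team(
    engineers: Sequence[str],
    size: int = 2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick `size` distinct engineers at random for this month's drill.

    `rng` can be passed in to make the selection reproducible in tests.
    """
    if len(engineers) < size:
        raise ValueError("not enough engineers for a drill team")
    rng = rng or random.Random()
    # sample() never returns the same person twice
    return rng.sample(list(engineers), size)


if __name__ == "__main__":
    # Hypothetical team members, for illustration only
    print(pick_drill_team(["Alex", "Bo", "Chris", "Dana"]))
```

Picking at random, rather than asking for volunteers, means that over time everyone on the team gets the practice, not only the people who already feel comfortable with the database.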
We learned pretty quickly that these drills are enormously helpful. They give our people the confidence that if something like this actually happens, they won’t have to guess (or find some documentation on) what the next move could be, but can rely on their experience. The drills also greatly improved our documentation and tooling, which, apart from being helpful in an emergency, has given us a better overview of our system landscape.
We can already see that when performing the drill for the second or third time, our engineers are a lot more relaxed. They know what to do and what to expect. We’re pretty confident that if something like this should ever happen in production, we’ll be able to handle this a lot better and a lot quicker than if we had not practiced these drills.
An interesting side effect of these backup drills was that we also detected an error in our backup procedures. Under certain circumstances the backups themselves weren’t written correctly. During one drill our engineers detected that the latest backup was older than it was supposed to be. It turned out that an error had occurred while creating the backup which we didn’t notice. So one of the outcomes was to fix our backup creation and set up a notification to make us aware if backups haven’t been written correctly, something we may not have caught without these drills.
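A check of this kind can be quite small. The following is a minimal sketch under assumptions, not the notification we actually built: the function name `backup_is_stale` and the 24-hour threshold are hypothetical, and how the timestamp of the latest backup is obtained (and how the alert is sent) depends entirely on the backup tooling in use.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(
    latest_backup_at: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed threshold
) -> bool:
    """Return True if there is no backup or the newest one is older than max_age."""
    if latest_backup_at is None:
        # A missing backup is the worst case and should always alert
        return True
    return now - latest_backup_at > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    two_days_ago = now - timedelta(days=2)
    if backup_is_stale(two_days_ago, now):
        print("ALERT: latest backup is older than expected")
```

The important design choice is that the check looks at the result (is there a recent backup?) rather than at whether the backup job reported success, since in our case the error during backup creation went unnoticed.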
Another best practice from aviation that helped us immensely is the use of checklists. Pilots use checklists all the time – shortly before taking off, before landing, at certain other important steps in the journey, and of course in emergency situations. Checklists avoid ambiguity. They tell us exactly what to do – we just need to follow them. They are ideal for tasks that require the same sequence of actions over and over again.
For our team, one of the places where this comes in handy is when onboarding or offboarding people. A new colleague always needs a certain set of credentials for our systems: task management, log aggregators, databases and so on. For a colleague leaving, we want to ensure that their credentials are removed and shared tokens are recreated. Before introducing checklists, there was always something that we forgot. This took time to notice and to fix (“Hey, it seems like I have access to system A but cannot log in to system B”). These were no big issues, but things that took time away from other tasks.
By collecting all the different tasks in checklists we can be sure that nothing is forgotten, and when all the boxes are checked we can be sure that there is nothing left to do. Having the different tasks visualized in a checklist also helped us to automate most of them. When we know exactly what to do, it becomes almost a no-brainer to automate these things and have a machine do every little thing.
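To illustrate how a checklist turns into something a machine can track, here is a small sketch. It is an assumption-laden example rather than our real implementation: the `Checklist` class and the step names are hypothetical, loosely based on the systems mentioned above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Checklist:
    """An ordered list of steps plus the set of steps already completed."""

    name: str
    steps: List[str]
    done: Set[str] = field(default_factory=set)

    def check(self, step: str) -> None:
        # Reject typos or steps that are not part of this checklist
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        self.done.add(step)

    def remaining(self) -> List[str]:
        # Preserve the original order so the next step is always first
        return [s for s in self.steps if s not in self.done]

    def is_complete(self) -> bool:
        return not self.remaining()


if __name__ == "__main__":
    # Hypothetical onboarding steps, for illustration only
    onboarding = Checklist(
        name="onboarding",
        steps=[
            "create task management account",
            "grant log aggregator access",
            "create database credentials",
        ],
    )
    onboarding.check("create task management account")
    print(onboarding.remaining())
```

Once the steps exist as data like this, each one becomes a candidate for a script that performs it and then ticks the box itself.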
Increasing the visibility of ongoing work
Ongoing work should be tracked at a single location. This can be a board with post-its on a wall, or a software tool like Jira or Trello, but there needs to be one single source of truth that everyone uses. Establishing transparency by having all the ongoing work being tracked like this will give you a realistic picture of what is actually happening.
We have all heard status reports like “Almost done”. What exactly does “almost” mean? Are we talking about a single detail or about a whole list of things that all need to come together? In aviation it’s important to know when a job is done and what is left to do. Nobody wants to fly on an aircraft whose maintenance is only half finished because someone simply didn’t know there were still some open issues.
Having all work visible also allows us to detect dependencies or overlaps; if a couple of people are working on the same issue, it makes sense for them to tell each other what they are doing so that their efforts don’t collide or get duplicated.
It all starts with knowing what is currently happening.
Stress often stems from uncertainty: uncertainty of what to do, uncertainty of how to do things, uncertainty of when to do things, and uncertainty of how things will be perceived. Reducing these different uncertainties will reduce the stress that they’re causing.
Take the example of practicing for emergencies: a big contributor to the stress we feel during an emergency is the uncertainty of what to do. How do I fix the outage? What do I have to do to get the database back up and running? Having the knowledge and experience gives me confidence and reduces the stress level.
When I had a chance to actually visit an aircraft maintenance hangar, I was surprised by the relatively small number of people actually working on the aircraft. After all, an aircraft doesn’t earn money while being serviced, so I assumed that in the hangar you would want to do as many things in parallel as possible. But in talking to one of the mechanics, I was told that having too many people working on the same components at the same time heightens the stress level. So their solution was to have a smaller number of people working concurrently, while giving them the chance to fully concentrate on what they are doing. In the end, the number of mistakes, big or small, is reduced, so no additional time needs to be spent fixing them.
This also applies to building software; most of us probably have been on a tight schedule or in a difficult situation where the pressure and stress levels were already high. One option in these situations is to bring additional people into the project to finish it more quickly. While this may work in some situations, it often increases the stress level and the overall load. The new people will need to be ramped up and the overall communication within the team needs to increase. So while it may seem counterintuitive at first, it may actually be a better solution to take people off the project and give the remaining team a chance to concentrate on the really important things.
Learning about how aviation works
Luckily aviation is a topic that many people are interested in, so there are a lot of resources on the internet. First, there is Wikipedia’s Aviation page. It provides a good starting point for branching out into different aspects, be it technology or logistics. If you’re more interested in the personal stories of pilots or flight attendants, aviation blogs provide a good window into “life in the air” (for example askthepilot.com or flightattendantlife.com).
I would also recommend visiting YouTube. A lot of excellent creators have produced hours and hours of material covering aviation from every perspective imaginable. Wendover Productions is just one example with some great videos on different aspects of the aviation industry. As with blogs, you can also find several pilots and flight attendants regularly vlogging about the airline industry. Mentour Pilot and Captain Joe cover a wide range of topics, from explaining how turbulence builds or how many landings airplane tires can withstand, to answering really important questions like “Can a pilot have a beard?”
Aviation is fascinating, not only because it works with modern and impressive technologies, but also because it’s an enormously complex environment with many participants all needing to work seamlessly together to make traveling from A to B work.
Software development is becoming more and more complex as well, not only on a technical level, but also on an organizational level. Hardly anything these days works without any kind of software.
Taking inspiration from aviation is one factor that can help us improve our day-to-day work and offer solutions to some of the problems we’re facing. It will not magically solve all of our problems, but can help in our quest to become better at what we’re doing.
Applying some of aviation’s principles is pretty easy. Building a checklist for example doesn’t take a lot of effort, but the additional security it provides is felt from day one. So let’s be the captain of our own project instead of merely having a back seat, watching everything fly by.
About the Author
Christian Seifert has been busy writing software for 20 years. He is currently working as team lead software development at BetterDoc in Cologne, Germany, where he helps to match patients’ needs with the right doctors. Having experienced a wide range of projects and requirements, he is constantly asking himself: how can we do things better while keeping the fun in what we’re doing – even in stressful situations? Although originally fascinated by working with machines, these days he also enjoys interacting with people, trying to push software craftsmanship ideas and help other developers to realize their full potential. He will speak at Agile Testing Days 2021 where he will explain in more detail how applying ideas and principles from aviation can help in software development and software testing.