- RUN! I’ll explain later (2022-04-18)
- What is this “AI” you speak of? (2022-04-19)
- Certainty and Uncertainty (2022-04-20)
- Traceability: Beware of what you wish for (2022-04-21)
- The tragic case that was entirely explainable (2022-04-22)
- General Izability to save the day (2022-04-23)
1. RUN! I’ll explain later
Apparently, the most common phrase uttered in the long-running science fiction series Dr. Who is “I’ll explain later.” In a universe where a sympathetic immortal alien saves humanity from fantastic monsters and cosmic disasters on a weekly basis, all the while traveling through time and space in a blue box that’s much bigger on the inside, only meeting friends who speak English and foes who can’t climb stairs but can build epic fleets of space warcraft, the audience may be forgiven for having the desire to be reassured that there exists, somewhere, a short set of statements that causally connects all the bewildering fictional developments presented in the latest episode, and helps to make perfect coherent sense of them all, even if we never actually get to hear them. I’m a big fan.
Explanations are short stories that are meant to reassure. When building automated systems that guard life or whose failure can cause death and destruction, for example, in medical or aerospace applications, assuring and re-assuring are called "certification." In aerospace, for example, rules and regulations with uninspiring names like ‘CFR part 25.1309’ prescribe that such systems come with evidence that (a) they do what they are supposed to do, and (b) that they don’t do things they are not supposed to do that might be dangerous. For software-based systems, in particular, standards like “DO-178C, Software Considerations in Airborne Systems and Equipment Certification” further specify the extra work you have to do besides making the system actually work, to provide this evidence to the satisfaction of the regulatory authorities.
A key concept in DO-178C and related standards is ‘traceability of requirements’. First, you write down very precisely what the system is actually supposed to do and what it is certainly not supposed to do. These are the functional and safety requirements at the aircraft and system level, which software engineers then get to split out further into design requirements, which in turn are implemented in software. By tracing the links up and down this hierarchy, we have an explanation for every bit of code (it is there to meet this or that design requirement, which in turn is there to meet such and so system requirement), and conversely, we can “prove” why we think every requirement is indeed met, namely because of these pieces of code here, assuming all the reasoning is valid and the code is free of bugs. When the “proof”* is to the satisfaction of the regulator, certification follows.
When we founded Daedalean in 2016, the discussion around Artificial Intelligence in the aerospace industry and other safety-critical applications was based on a widespread belief that nobody knows how “AI” works, so you can’t explain why the system does one thing or another, and, moreover, that such systems are non-deterministic, so you can’t even predict what they’ll do in a given circumstance, and therefore, expert opinion had it, AI could never be certified for safety-critical applications. That seems bad indeed, so if only we had “explainable AI” where we could trace the internal structure of the AI component to requirements, as well as the system outputs to those internals, then we could use the methods of DO-178C to certify our system, and we’d be done. Cue a host of ‘Explainable AI’ startups and unlimited funds being funneled into this research direction.
There are a couple of layers of fallacy and misconception in this reasoning, leading to the false conclusion that some magic concept of explainability would lead to provable safety. These require careful unwrapping, so we don’t add to the confusion. We’re going to need a bigger blog post. I’ll explain later.
* I write “proof” in quotes because the bar of a mathematical proof is not quite met. Despite the popularity of all kinds of “formal methods”, the requirements remain a story by humans for humans.
2. What is this “AI” you speak of?
Before we dive into the fallacy of explainability, we first have to define what we mean by “Artificial Intelligence”, which is mostly a marketing term with a shifting definition. When we talk about the kinds of systems that are all the rage in the self-driving community, and that Daedalean is building for flight control applications, we mostly mean Machine Learning, a field in which so-called Deep Convolutional Neural Networks trained on so-called Big Data have taken the spotlight.
Systems built with this technology can answer questions that until about 15 years ago were too hard for computer programs, like “is there a cat in this picture” or “what bird is this” or “where is the runway in this image”, to name some relevant ones for our purpose.
In machine learning, instead of locking a team of software engineers in a room with the requirements and sufficient coffee and chocolate, you twiddle the knobs on a computer program called “the model” until you find an acceptable solution to your problem. Neural networks, a family of such models, have a simple basic structure, but they can have many millions of knobs, and the twiddling is done by another computer program called “the machine learning algorithm” that uses large datasets of labeled examples together with some known recipe to construct ever better settings of all the knobs until it finds a member of the family that meets the requirements you set out to meet.**
This may sound like magic, but it is an extension of statistical techniques, very much like the one you may have studied in high school: Linear Regression. Imagine you are given a sample of people’s heights and weights in a spreadsheet, which we can visualize in a scatter plot:
We can try to draw a line through the data to fit the points best. The model family takes the form “weight = alpha x height + beta”, where alpha and beta are the parameters. The recipe to fiddle alpha and beta to get the best fit follows from defining the error as the sum of the squared differences between predicted and measured values. Your spreadsheet software can machine-do this for you. Now we have machine-learned what we call a “model” of height-vs-weight in this dataset, which we can use to help with all kinds of things, like designing better chairs, or determining who is, statistically speaking, too short for their weight in order to recommend health measures.
After alpha and beta have been optimized, the residual error may say something about how well this will work, which ultimately depends on the problem you are trying to solve, which is what we would call the ‘system requirements’. Typically we would use the model to make predictions on cases that were not part of the dataset we tuned the parameters on. That is to say, we are trying to predict something based on previously unseen data.
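As a rough illustration of the recipe described above, the following Python sketch fits alpha and beta by ordinary least squares in closed form, computes the residual error, and then predicts a weight for a height that was not in the sample. The height and weight numbers are invented for the example and do not come from any real dataset; the function names `fit_line` and `sum_squared_error` are likewise hypothetical.

```python
# Toy least-squares fit of: weight = alpha * height + beta.
# All data points below are made up purely for illustration.

def fit_line(heights, weights):
    """Return (alpha, beta) minimizing the sum of squared errors."""
    n = len(heights)
    mean_h = sum(heights) / n
    mean_w = sum(weights) / n
    # Closed-form ordinary least squares for a single input variable.
    cov_hw = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
    var_h = sum((h - mean_h) ** 2 for h in heights)
    alpha = cov_hw / var_h
    beta = mean_w - alpha * mean_h
    return alpha, beta


def sum_squared_error(alpha, beta, heights, weights):
    """Residual error: squared differences between predicted and measured weights."""
    return sum((alpha * h + beta - w) ** 2 for h, w in zip(heights, weights))


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical sample
weights_kg = [52.0, 58.0, 69.0, 77.0, 84.0]       # hypothetical sample

alpha, beta = fit_line(heights_cm, weights_kg)
residual = sum_squared_error(alpha, beta, heights_cm, weights_kg)

# Apply the tuned model to a height that was not part of the dataset.
predicted_weight = alpha * 175.0 + beta
```

The residual is not zero even on the data the line was fitted to, which is the same non-zero error we should expect, at best, on unseen cases.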
In much the same way, to answer the question ‘is there an aircraft in this 32x32 8-bit pixel camera image?’ a Deep Convolutional Neural Network is tuned by taking a dataset of images where the answer is given as a ‘yes’ or ‘no’ label and having a machine learning algorithm explore a very large design space until the result is good enough for our purposes, which may or may not be possible. A more advanced class of models may even output a bounding box around the aircraft or say for each pixel in the image whether it is part of an aircraft or not.
When we feed the linear regression model some data in the form of a single observed height measurement, it very deterministically produces one single predicted weight. However, there may be nobody in the dataset who actually has that weight, or there may be multiple samples with different weights for the same height, and so when this model is applied to unseen data, we expect a non-zero error. If the dataset we used is a good sample of the unseen population, we expect this error to be distributed according to the same distribution we observed for the residual error when tuning alpha and beta. But if the sample we ‘trained’ our model on is wildly unrepresentative of the conditions during application, for example, because of all kinds of biases during sampling, say only female sumo wrestlers, all bets are off.
With the Deep Convolutional Neural Network, we get the same: we feed it one new 32x32 image, and provided we properly implement the runtime hardware and software, it will produce in hard real time a ‘yes’ or a ‘no’ output or a bounding box or a pixel mask. If we feed it pixel for pixel the same input twice, it will deterministically produce identical output, but if this is on previously unseen data, it may be wrong, and the best we can do is to be wrong as often as we were when we were training, which depends on having used the right dataset.
With this background, we can already begin to understand how and why this form of “AI” works when it works: we build models of data that we then hope will generalize to their application in the field by virtue of having captured underlying regularities of the problem domain. We still have to verify for any concrete application that it works indeed.
Next post, we’ll look into what we can already say about AI, now that we have specified what we mean by it.
**Here I am specifically talking about offline supervised machine learning. Unsupervised, reinforcement and/or online (“adaptive”) methods each come with their own can of worms which we can avoid by simply sticking to offline & supervised.
3. Certainty and Uncertainty
From the picture of machine learning that we sketched in the previous episode, we can draw two important conclusions. First of all, the requirements that your hardware and software are working as intended and are safe remain unchanged: your avionics system will have to meet the DO-254 and DO-178C standards, it will have to have sufficient computing power, it will have to be reliable so that internal memory does not spontaneously flip bits, and it will have to be CAST-32A compliant so that multicore processors don’t unpredictably stall each other, ruining any hard real-time guarantees.
We also require that the whole system is in a proper enclosure and that the power supply is reliable (DO-160) and that the ethernet cables won’t catch fire etc etc. Much hay is made of assuring – i.e. providing certainty about – these things, since aerospace engineers know very well how to certify them for a living, but they have zero bearing on whether the neural network’s prediction is correct or not. The emergent property of the model's predictions matching reality is not covered by any of these standards. They are necessary but insufficient by themselves to guarantee the system’s overall safety and fitness.
Second of all: while the system produces an output deterministically, how well that output fits reality is a matter of statistics, and this is because of the nature of the problem with its inherent uncertainties and not because of the nature of the solution. Calling a machine-learned system non-deterministic misattributes the source of uncertainty which really lies in the environment from where the system gets its inputs.
For example, one component of Daedalean’s Visual Traffic Detection is a neural network that decides if an image contains an aircraft or not. Here are some examples:
Even if our training data set is 100% accurate, there may simply not be enough information in the 32x32 pixels of 8 bits each to make the call, just like height does not uniquely determine a person's weight, even though there is a clear statistical relationship. At a great distance, an aircraft may be indistinguishable from an equally shiny ground vehicle. There are 2^(32x32x8) = 2^8192 (approximately equal to a 1 followed by 2466 zeroes) possible images in our input space.
Between any image that we would clearly label ‘yes’ and one we would clearly label ‘no’, there are insanely many images that differ only by one bit in one pixel which are on a decision boundary, and to test exhaustively that we get them all right is completely impossible. It is not even possible to determine what ‘right’ is in this case: the system requirements will have to be statistical in nature.
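The size of the input space is easy to check, since Python integers have arbitrary precision. The short sketch below counts the possible 32x32 images with 8 bits per pixel and the number of images that differ from any given image by exactly one bit; the variable names are only illustrative.

```python
# Size of the input space for a 32x32 image with 8 bits per pixel.
WIDTH, HEIGHT, BITS_PER_PIXEL = 32, 32, 8

bits_per_image = WIDTH * HEIGHT * BITS_PER_PIXEL  # total bits in one image
num_images = 2 ** bits_per_image                  # exact count, as a big integer
num_decimal_digits = len(str(num_images))         # how many digits that count has

# Every image has exactly one neighbour for each bit that can be flipped.
single_bit_neighbours = bits_per_image
```

The count has 2467 decimal digits, which matches the "1 followed by 2466 zeroes" order of magnitude, and each individual image has 8192 one-bit neighbours.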
While this looks like a weakness of machine-learned systems, it only makes sense to apply them exactly to such problems. If a problem is easy to capture in crisp requirements, you can probably construct a traditional rule-based system with all its guarantees and translate that to code in a straightforward way. It is exactly for the kinds of problems that have this kind of unavoidable uncertainty, that machine learning is an effective, and currently the only feasible way to come to a solution.
As I have argued elsewhere, in the air, these are the kinds of tasks that are currently handled almost exclusively by human pilots, and if we want to build systems that take over risk management in flight, we are going to have to learn how to deal with uncertainty. Rather than faulting the system for acting unpredictably in an unpredictable environment, we must establish statistical bounds on how well it works given a correct statistical description of the inputs. This will be one of the pillars of certifying Machine Learned systems, to which we’ll return in the last post in this series.
But first, we’ll look at the other big objection: lack of traceability. Next post.
4. Traceability: Beware of what you wish for
With the ‘non-determinism’ out of the way, let’s have a closer look at the traceability of the system's internals to requirements and of the system’s output to its internals, the alleged major flaw of machine-learned systems.
In a neural network, these internals are made up of possibly millions of parameters, spanning a space that is once again orders and orders of magnitude larger than the already huge input space we talked about before. It is true that no particular value of any of these parameters can be easily linked to a specific example or feature from the training dataset, or given a meaning that fits a story other than ‘we tuned all these numbers with this machine-learning recipe’.
Some attempts at ‘explainable neural networks’ use the technique of running a network in reverse to find out what input pattern would maximally stimulate a particular neuron, which leads to spectacular psychedelic images that give you a clue of what goes on in the network. Other techniques identify which pixels in the input image contribute most to the outcome. You could be excused for thinking that these, the neuron weights and the pixels that matter the most in case of a particular bug, give you a way to fix the system for that bug, and you would not be entirely wrong, but perhaps in a surprising way.
With ‘traditional’ software, proving correctness when many permutations and combinations of inputs may also arrive in different orders in time is already highly non-trivial. We restrict ourselves to making our C++ code as scrutable as possible by avoiding constructs like unbounded recursion, dynamic memory allocation, and multithreading, and we use techniques like MC/DC coverage to flush out bugs in the logic. If you fix one such bug, you can be certain that you have addressed a significant sector of the input space and can ensure or at least reestablish hope that the system is working perfectly as intended (again). This is why we like the traceability property in such systems: suppose we find a combination of inputs for which the system did not meet the requirements; by fixing it, we can likely plug the gap in our reasoning and re-certify for airworthiness (although we typically face an investigation on why we didn’t find this bug earlier).
With a neural network, if we find an example of a wrong output, say the system says ‘no, there is no aircraft here’ when there clearly is one, we can run through the network and thereby reconstruct exactly why*** the output is ‘no’ instead of yes, and indeed we can use the knowledge of which neurons and pixels matter most to tweak all the weights a bit, and this, in fact, is exactly how the training algorithm works when it does its knob-fiddling. So we already possess the maximum amount of traceability of the outcome to the individual weights, as the weights are set by doing this ‘fixing’ step on each member of a sufficiently large training dataset, and as the output is determined straightforwardly by applying the mathematical operations that make up the neural net. Where human programmers have to write a little story explaining why they implemented this code to match that requirement, the machine learning algorithm is, in a very precise sense, exactly that story.
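To make the point about ‘fixing’ concrete, here is a deliberately tiny sketch: a single logistic unit standing in for a whole network, and one gradient-descent step on one misclassified example. The four ‘pixel’ values, the starting weights, and the function names are all hypothetical; a real network has millions of weights, but the update has the same form.

```python
import math


def predict(weights, bias, pixels):
    """Probability of 'yes' from a single logistic unit (a one-neuron 'network')."""
    z = bias + sum(w * x for w, x in zip(weights, pixels))
    return 1.0 / (1.0 + math.exp(-z))


def fix_one_example(weights, bias, pixels, label, learning_rate=0.1):
    """One gradient-descent step on the cross-entropy loss for a single example."""
    error = predict(weights, bias, pixels) - label  # derivative of the loss w.r.t. z
    # Each weight moves in proportion to how much its pixel contributed.
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, pixels)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


# Hypothetical input that does contain an aircraft (label 1),
# but which the current weights classify as 'no' (probability below 0.5).
pixels = [0.9, 0.1, 0.8, 0.3]
weights = [-0.5, 0.2, -0.4, 0.1]
bias = 0.0

p_before = predict(weights, bias, pixels)
weights, bias = fix_one_example(weights, bias, pixels, label=1)
p_after = predict(weights, bias, pixels)
```

After the step, the probability of ‘yes’ for this example is higher than before. Tracing the wrong output back to the weights and nudging them is exactly what the training loop already does for every example in the dataset.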
But there is one very important difference with the classical system: where fixing the one bug in a classical system gives you a reason to believe you re-established total correctness, in the machine-learned system we never had such an ideal to begin with. We rely on the machine learning algorithm to capture an underlying generality of the problem that then, by virtue of the training set being a representative sample of reality, will work with the same expected non-zero error once deployed.
The problem is that once you add a specific case to fix from your test set to your training set, you have completely destroyed the reason you had to expect this generalization to hold, and you can no longer expect your system to behave well in production at all.
The reason is that this generalization property rests on sampling from reality with independence. This concept of independence is very close to the independence that DO-178C prescribes between testing and implementation, only here it has a very precise mathematical meaning. And it is a precious property that is easily disturbed by good intentions and completely destroyed by hand-fixing particular cases. We will look into this in more detail in the last post of this series.
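A toy experiment can show one aspect of this. The sketch below uses a deliberately extreme memorizing model (a nearest-neighbour lookup, not a neural network) on synthetic data with 20% label noise. Everything in it is an assumption made for illustration: the data, the noise rate, and the function names. It compares the error estimate from an independent test set with the estimate obtained after the failing test cases have been ‘fixed’ by adding them to the training set.

```python
import random


def true_label(x, rng, noise=0.2):
    """Ground truth is 'x > 0.5', but a fraction of labels is flipped (irreducible noise)."""
    clean = x > 0.5
    return clean if rng.random() > noise else not clean


def sample(n, rng):
    """Draw n independent (input, label) pairs from the same synthetic 'reality'."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_label(x, rng)) for x in xs]


def nearest_neighbour(train, x):
    """A model that memorizes: answer with the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def error_rate(train, data):
    """Fraction of examples in data that the memorizing model gets wrong."""
    wrong = sum(nearest_neighbour(train, x) != y for x, y in data)
    return wrong / len(data)


rng = random.Random(0)
train, test, production = sample(300, rng), sample(300, rng), sample(300, rng)

honest_estimate = error_rate(train, test)                # test set kept independent
patched_estimate = error_rate(train + test, test)        # test cases added to training
production_error = error_rate(train + test, production)  # fresh, unseen data
```

Once the test cases are part of the training set, the model reproduces them perfectly and the measured error on the test set drops to zero, while the error on fresh data stays far from zero. The number measured in the lab has stopped being evidence about production.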
So to summarize for now,
- We can already trace the neural network parameters to the requirements because the machine learning algorithm tunes them so that the target performance is met on a dataset. The system requirements become requirements on the definition of the target error function, and on how the training/testing datasets are sampled representatively from a relevant operational domain. These requirements ‘explain’ the parameters, and this, in turn, explains the model performance on the dataset, including its flaws.
- We can already trace the system output to its neural network parameters because we perform a very straightforward calculation, which explains how the output is reached in great detail. Even if we had a more concise or somehow clearer (in the eye of the beholder) summary of this detailed explanation, we could not use this traceability to fix bugs or improve anything about the system beyond what the learning algorithm is already doing.
Therefore any attempt to create more rule-based, more ‘explainable’ versions of machine learning will only produce systems that work less well, as they inevitably lack the capability of dealing with input uncertainty, and will result in systems for which there is no reason to believe they will be fit for purpose and safe in production based on how they perform in the lab.
Tomorrow we’ll ask ourselves the fundamental question o̶f̶ ̶l̶i̶f̶e̶,̶ ̶t̶h̶e̶ ̶u̶n̶i̶v̶e̶r̶s̶e̶,̶ ̶a̶n̶d̶ ̶e̶v̶e̶r̶y̶t̶h̶i̶n̶g̶: What problem are we trying to solve?
*** If you don’t want to call this a valid explanation because it is somehow too detailed or ‘not human-readable’, I challenge the reader to come up with a definition of ‘explanation’ that is not just in the eye of the beholder that this example falls afoul of. I suspect that one reason the field of “explainable AI” can keep applying for research funds is that everyone carefully avoids precisely defining what constitutes a valid explanation and what not.
5. The tragic case that was entirely explainable
In the previous posts, we went into some detail on how a machine-learned system can never be perfect and how you can’t really fix that. That doesn't sound like a good idea when dealing with systems that guard life and can wreak havoc when they fail. Why bother?
The answer is that somebody has to solve these problems, and currently, it is typically a human operator who is not exactly perfect either. If we want to make progress towards systems that are safer, faster, better, cheaper, and more reliable, we have to deal with the actual tasks that are currently done by people. Whenever it comes to managing risk by keeping the operating parameters of a system, say an aircraft or a car, within safety bounds, systems will have to deal with uncertainty in the environment. The cases where you can engineer all risk away upfront by putting in safeguard upon safeguard and restricting operating conditions cover only a tiny fraction of the set of all problems worth solving. In all the other ones, you will have to make adjustments to the system dynamically, as the situation changes under external influences, subject to imperfect information. The good news is that there’s not really a ceiling on how good we can make these systems, but we have some way to go to get there.
On the evening of March 18, 2018, a vehicle struck and fatally injured 49-year-old Elaine Herzberg crossing N. Mill Avenue, outside a crosswalk, in Tempe, Arizona. This would have been ‘just’ one of 36560 traffic deaths in the USA in 2018 if not for the remarkable fact that it was a test car operated by Uber for their self-driving system. As the system was far from production-ready, it was easy to blame and convict the safety driver who wasn’t paying attention at the time, but the NTSB report goes into quite some detail on how the system worked.
As it was, the system had multiple fundamental design flaws. To suppress spurious alerts, the engineers had put in logic to wait a bit to see if a detection persisted, which cost precious reaction time. If the system hesitated between a plastic bag (‘other’) and a pedestrian, it would reset its prediction logic, and for a plastic bag, it would not even try to predict a trajectory.
The report explains the system design flaws entirely at the classical software level. At no point does the question “why did the neural network not recognize the pedestrian?” arise. We know upfront that false-negative recognition has a non-zero probability, and given the possible consequences, the rest of the system should have been engineered from the ground up to deal with this uncertainty.
If the safety driver had been paying attention, or if there had not even been a self-driving prototype system on board, the statement “I thought she was a plastic bag, and I did not expect pedestrians on this section of the road” would have been considered a perfectly valid explanation, sadly without a better outcome for the victim.
The report could come to a detailed explanation of what had happened because the system, fortunately, recorded enough data to reconstruct in great detail what had happened. Had the system had any output to the safety driver to say ‘I see an obstacle here, but I’m not sure if it is a pedestrian, so I’m going to hit the brakes now,’ then the driver would have had an explanation even if the system was wrong in its judgment. These forms of ‘explainability’ are very useful to find flaws when the system as a whole is malfunctioning or to gain trust and acceptance when it works as intended, but this is not an ‘explainability’ of the machine learning component.
As a robotics problem, flying is a lot simpler than driving, even though the stakes are typically higher. Where in driving you have to understand the difference between a plastic bag and a pedestrian, and a dog, and a traffic light, and a bike that’s standing there, and one that is being ridden, in flying you can go by the simple maxim: if you can see it, don’t fly into it – unless you are really really sure you want to land on it. But in both driving and flying, it is a misconception to think that the role of the human is merely to operate the controls. Pushing the buttons and handling the steering wheel or yoke to follow a trajectory can be automated easily, but the real and hard-to-replace task of the human in the role of driver or pilot is maintaining a safe state.
To do this, he or she must have an adequate picture of the current situation to predict the near future accurately enough to steer the vehicle away from danger and towards safety, preferably according to the travel plans. If you ask the driver or the pilot why she took one action vs. another, she’ll probably be able to come up with a plausible explanation, but we also know that humans are terrible at explaining their own behavior, and we almost never can objectively verify that any given explanation is actually true or valid. This hardly ever has any bearing on the ability of the human to deal with risk during flight or drive effectively.
Risk is inevitably expressed in terms of probabilities, and therefore if we want to build any but the most trivial risk-managing systems, we will have to learn how to use machine learning. A perceived lack of explainability need not stand in the way, but that’s not to say there aren’t any challenges. So one more post tomorrow.
6. General Izability to save the day
I just spent 5 blog posts explaining in some detail how we cannot hope to achieve certifiability of AI by adding a magic explainability sauce to make things DO-178C style traceable, but I also argued that to solve the next generation of problems in vehicle or flight control, we have no choice but to use Machine Learning based systems. So we better come up with some way to verify which ones are good enough and which ones should not be allowed into operation.
Fortunately, there is a way. It requires us to take a step back and rethink what we actually need.
While the methods of DO-178C are an “acceptable means” to demonstrate the adequacy of software, they are not the end. The end is what rules like ‘14 CFR part 25.1309’ prescribe: the systems shall be designed to perform their intended functions under any foreseeable operating condition. This is a requirement of the system as a whole. So if the system contains a component with limited accuracy, it better be designed to cope with that.
As an example, consider a neural network that, when given a single image, decides if and where there is a runway in the picture with sufficient precision only 96% of the time. If our system during the flight only ever looked at a single image and then further blindly landed on whatever it thought was a runway, we would crash or at least badly mess up one in 25 landings, which would clearly be terrible, and also a very bad design. But such a neural network component is part of a larger system that, if properly designed, can deal with this finite per-image error.
Daedalean’s Visual Landing Guidance system engages during an approach to an area where we know there should be a runway because we carry a database of runways. During the approach, the probability of having at least 1 frame with bad guidance asymptotically approaches 1 (i.e. certainty). At the same time, we expect the neural network to become more and more consistently sure of what it sees. This is monitored by a separate subsystem. Only when we have a consistent reading of where we think the runway is, does the system provide guidance; otherwise, it flags that it can’t lock on to the proper approach path, so a higher-level system can decide to abort the landing or use an alternate source of navigation.
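The claim that at least one bad frame becomes a near certainty is simple arithmetic. The sketch below assumes, purely for simplicity, that frames fail independently with the 4% per-image rate from the example; real consecutive frames are correlated, so the exact numbers will differ. The function name is hypothetical.

```python
def prob_at_least_one_bad_frame(per_frame_error, num_frames):
    """P(at least one bad frame), assuming frames fail independently (a simplification)."""
    return 1.0 - (1.0 - per_frame_error) ** num_frames


PER_FRAME_ERROR = 0.04  # the 4% single-image failure rate used in the example

# Probability of seeing at least one bad frame for approaches of various lengths.
prob_by_frames = {
    frames: prob_at_least_one_bad_frame(PER_FRAME_ERROR, frames)
    for frames in (1, 10, 100, 1000)
}
```

With a single frame the probability is just the 4% itself; by a hundred frames it is already above 98%, which is why the design cannot rely on any individual frame being right.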
By carefully analyzing the dependencies in this system, we can arrive at a design that copes with the 4% failure rate on a single image and that fails to properly identify the runway without a warning fewer than once in 10^6 landings. By combining this with other sensors, we can reduce this further.
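To get a feel for how a 4% per-image error can turn into a much smaller system-level failure rate, consider a hypothetical rule that only trusts a reading when more than half of a window of frames agree. This is not a description of Daedalean's actual monitoring logic, and it assumes independent frames, which is optimistic because real frames are correlated. Under those assumptions the failure probability is a binomial tail:

```python
from math import comb


def prob_majority_wrong(per_frame_error, num_frames):
    """P(more than half of the frames are wrong), assuming independent frames."""
    p = per_frame_error
    k_min = num_frames // 2 + 1  # smallest number of wrong frames that forms a majority
    return sum(
        comb(num_frames, k) * p**k * (1.0 - p) ** (num_frames - k)
        for k in range(k_min, num_frames + 1)
    )


single_frame = prob_majority_wrong(0.04, 1)    # no redundancy: just the per-frame error
window_of_15 = prob_majority_wrong(0.04, 15)   # hypothetical 15-frame window
```

In this idealized setting, a window of 15 frames already pushes the probability of a wrong majority below one in a million. A real dependency analysis has to account for correlated errors, so it needs more care than this, but the mechanism is the same.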
Similarly, the Traffic Detection function doesn’t get just a single chance to spot other traffic (which it would miss 3-4% of the time); instead, it has to try to detect an approaching aircraft before it is too close, for which it has a new opportunity with every frame from the camera. The probability of getting it right in each frame is not independent from one frame to the next, but it will be after some time elapses. Again, by carefully designing the system to deal with the finite failure rate per image, we can achieve acceptable performance (and definitely a lot better than human performance) for the overall system. This dependency can be analyzed completely ‘classically’, as we saw in the NTSB report on the Uber crash, which perfectly described the design flaws stacked on top of an unreliable image recognition component.
But being able to design systems to deal with finite failure rates at the single-shot image level in a component does require one very important property: that the, say, 4% failure rate we observed on our test dataset in the lab will also hold when we deploy the system in real flights. If the failure rate goes up to 40% when confronted with reality, for example, in fog or against the sun, we have a disaster in the making.
And this is the crux of the problem of certifying a machine-learned component. How can we be sure that if we measure a precision, recall, or accuracy number for our model on a dataset, it will hold “under all foreseeable operating conditions” in the operational domain?
This problem was widely studied in the 1970s in a field called ‘learning theory’, well before today’s deep neural networks existed and way before they became popular. The property we look for is ‘generalizability’, and the means to tame it is a quantifiable domain gap.
When any quantity is computed over a sample dataset, it becomes an estimate for that quantity on the broader population it was sampled from. When the quantity at hand is a distribution of an error metric, we have a probability distribution over a probability distribution of how that error metric will be in the ‘reality’ we drew the sample from, provided we drew the sample from the same distribution.
That may sound a bit abstract. Say we measure 96% recall (the number of aircraft we saw divided by how many were there) and 90% precision (the fraction of the aircraft we said we saw that were actually there) on a dataset; then there are theorems that say that in reality, we won’t be worse than, say, 90% and 89% respectively, depending on the size of our sample and the capability of our model to fit (and overfit) anything that is thrown at it.
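One of the simplest results of this kind is Hoeffding's inequality, sketched below. It is not necessarily one of the theorems meant above, since it ignores the capacity of the model entirely. It applies only to a model that is already fixed and is then evaluated on samples drawn independently from the operational distribution and never used for training. For recall, the sample size is the number of aircraft actually present in the test data. The function name and the chosen numbers are illustrative assumptions.

```python
import math


def hoeffding_lower_bound(observed_rate, num_samples, delta):
    """Lower bound on the true rate that holds with probability at least 1 - delta.

    Assumes num_samples independent held-out samples from the operational
    distribution and a model that was fixed before looking at them.
    """
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * num_samples))
    return observed_rate - margin


# 96% observed recall, with a 1-in-1000 chance that the bound does not hold.
bound_large = hoeffding_lower_bound(0.96, num_samples=10_000, delta=1e-3)
bound_small = hoeffding_lower_bound(0.96, num_samples=10, delta=1e-3)
```

With ten thousand samples the guaranteed recall stays above 94%; with only ten samples the same observed 96% guarantees less than 50%, which says almost nothing. The bound tightens only with the square root of the sample size, which is one reason dataset requirements grow so quickly.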
These theorems are not all straightforward, and may produce unusable, so-called ‘vacuous’ bounds, like ‘the probability that you are wrong is smaller than 200%’ (we knew that!) or lead to the requirement that your dataset contains 100 billion samples. Like with all fields of engineering, it helps if you know what you are doing.
All such ‘generalization bounds’ crucially depend on the dataset being sampled from the same distribution as you will find during operation; otherwise, it is impossible to make any meaningful statement at all. Conversely, when making a statement that the performance of the machine learning component is X with confidence Y, this statement is meaningless without specifying on what dataset it was measured and how this dataset was drawn from reality.
Consequently, it is not possible to build something that just “simply always” works, an illusion we may have gotten away with for the simpler avionics systems that we have today. Instead, the requirements on the system will have to be traceable to requirements on the error function the machine learning algorithm is trying to minimize and the dataset on which we evaluate it.
Where in the past our system, high-level and low-level software requirements were assuring stories on how one level explained the other, the machine learning aerospace engineer additionally will have to put the same effort into explaining why the dataset is representative and sufficiently large, which starts with as precise as possible a characterization of the system’s operational domain.
This, also, is not entirely new to safety-critical engineering. In software, running the same unit test twice should give the same result, but in any other part of the aircraft, tests are usually of a statistical nature. A strut is put on a test bench and hammered 45000 times to establish that, on average, it breaks after 35000 hammerings, and the service manual will say “replace after 25000 hammerings”. (My favorite example is the bird strike test for jet engines. Depending on the size of the inlet, there is a prescribed weight of bird that jet engine builders have to throw into a very expensive, running, engine to see if that doesn’t destroy it. In practice, these tests are not even independent: you throw in 1 bird, and if it destroys the engine, you build another one and test it until there is one that passes. But 70 years of building jet engines have apparently shown that this is a sufficiently rigorous test. Perhaps the fact that a failed test costs a complete jet engine makes the manufacturer over-engineer the machines rather than game the statistical flaw in the procedure.)
This is where ‘data science’ enters the stage, a different skill set than traditionally found in avionics. When using machine learning, we will have to come up with methods to characterize the operational domain and the datasets drawn from it. How to do that will be the subject of another series of blog posts someday.