With more than three decades of government service under his belt, Jose Orench knows a thing or two about information technology. As a former NASA Systems Engineer as well as Special Assistant to the Head of the FBI’s Engineering Research Facility, Orench spent his career learning about and utilizing technology for safety and defense. In this episode, he applies his extensive knowledge to the world of autonomous vehicles.
Hey everyone. This is No Parking: the podcast that cuts through the hype around self-driving tech and artificial intelligence. I’m Alex Roy, with my co-host, Bryan Salesky. Hey, Bryan.
So, Bryan, I don’t know if you saw it, but there was this amazing video recently making its way around the internet of a passenger on a plane. The passenger was filming on his camera phone, looking out the window from a seat over the wing, at an engine on fire. And this went on for some time. Incredibly, the plane landed safely; nobody was hurt. My friends online were totally split between the ones who were saying, “I’m never flying again. I’m totally afraid. I can’t believe that happened,” and the ones who actually knew something about engineering. And they overwhelmingly agreed with my point of view, which is: what a miracle it is that a system, after all these years of experience designing and building planes, could be so robust that a part could fail, the engine could explode, pieces could fall off, and the plane could still land safely. And everyone was okay. And this got me thinking that the whole concept of safety is never about a part; it’s about redundancy built into an entire system. We had to do an episode about big disasters, complex systems, and the idea of safety engineering. So today we’re talking to Jose Orench. He’s a retired FBI special agent with a really diverse career in security, crisis management, and technology. Straight out of college, he joined NASA; then he joined the Bureau. Today he has a business called Orench Consulting, where he helps businesses become more robust and resilient. And boy, will we get to that. Jose, welcome to the No Parking Podcast.
Thank you, Alex. And thank you for having me. Most of all, I’d also like to thank Bryan Salesky. I’ve heard a lot about you and I’m looking forward to this conversation.
Jose, you’ve had a really fascinating career. I don’t even know where to start. But of course, we’re gonna have to start with you joining NASA straight out of college. Can you tell us a little bit about what took you down that path?
Originally, when I was in high school, my career goal was to be an FBI agent after a career day where an FBI agent came in to talk to us. However, my teachers and my parents all thought that because of my math skills, I should be an engineer. So I listened to them and I became an engineer. While I was going to school to study Mechanical Engineering at the University of Puerto Rico, I became very interested in aerospace and in NASA in particular. So my goal was to work with NASA. I was lucky enough to be able to interview with them during my last semester and I got an offer. So that’s where my career with them started.
So this whole concept of systems engineering and baking safety into a complex system. I know that you worked on the investigation into the space shuttle Challenger disaster. I remember, I think it was 1986? Was that the one?
Yes, that’s correct.
So I remember standing on the campus by the field in my high school, when it happened. And it seemed impossible to me that that could happen. The notion that so many people and so much time and money and expertise could go into something like launching the shuttle and something could happen and it could just explode and crash was outside my ability to comprehend it. Can you tell us what caused it or what the point of failure was? And then walk back about what caused that?
So the point of failure was that one of the O-rings failed at the lower portion of the solid rocket motors. The solid rocket motors are actually boosters that provide additional thrust during liftoff. And because of the cold weather, the seal did not act properly, or as it was designed to, allowing a breach of hot gases to come through, and eventually causing the entire system to fail. When you say walk back, we’d have to walk back to around 1968. When you look at the shuttle system as it eventually turned out to be, you see what looks like an airplane on top of a big orange external tank with two boosters on the side. In the original design, you would be looking at two airplanes: the space shuttle orbiter on top of a much larger airplane-like vehicle that would take the orbiter to a suborbital position and then come down and land on a runway like a regular airplane. Also, that original design called for jet engines on the space shuttle, so that when it returned to Earth, it could actually maneuver to the landing site. Cost cutting and budgetary concerns changed that design, removing that first vehicle and creating a vehicle with an orbiter attached to the external tank and the solid rocket motors. The big problem with that is that solid rocket motors were never designed for human use. When you turn on a solid rocket motor, it turns off when it runs out of fuel, unlike liquid-fuel engines, which you can actually throttle, control, and turn off at will. So basically what that says is, for the first two and a half minutes of a launch, you have no control if something goes wrong. Those rocket motors are going to continue firing until they are expended.
So another problem with this design was that the solid rocket motors were built in several pieces that had to be joined together. Each one of those joints was sealed with putty and also an O-ring.
This O-ring didn’t work properly, just like any rubber sealing apparatus at temperatures below roughly 35 to 40 degrees Fahrenheit. Rubber isn’t as flexible at lower temperatures. So there was actually what NASA calls launch commit criteria, which are rules that say you cannot launch unless these conditions are met. And one of those launch commit criteria was not to launch below a certain temperature. On that day, the temperature around that O-ring was approximately 28 degrees, which was well below the launch commit criteria. The night before, it was known that the temperature was going to be outside of the limits. During several discussions between contractor management and systems engineers, management overrode decisions that should have been made by systems engineers, and they said we’re going to launch anyway.
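The launch commit criteria Orench describes work like a hard go/no-go gate: if any condition is out of bounds, you do not launch. A minimal sketch of such a gate in Python; the 40-degree limit and 28-degree reading are the approximate figures from the conversation, and the function itself is an illustrative assumption, not NASA’s actual system:

```python
# Illustrative go/no-go gate in the spirit of a launch commit criterion.
# The temperature figures echo the conversation; the structure is a sketch.

def launch_commit_check(measured_temp_f, min_temp_f=40.0):
    """Return (go, reason). A hard rule: no launch below the limit."""
    if measured_temp_f < min_temp_f:
        return False, (f"NO-GO: {measured_temp_f} F is below the "
                       f"{min_temp_f} F launch commit limit")
    return True, "GO: temperature within limits"

# Roughly the O-ring temperature that morning:
go, reason = launch_commit_check(28.0)
print(go, "-", reason)   # False - NO-GO: ...
```

The point of encoding the rule this way is that the gate has no "override" path: waiving it is a human decision made outside the system, which is exactly where the process broke down.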
So what you’re saying is that there was a suboptimal design decision made 18-20 years before the actual disaster, and that-
We call it a compromise.
I don’t know if it was suboptimal. That was a decision that was made. But the thing is, they knew what the requirement was. It was the system engineers who properly cascaded it to be part of the launch criteria. And people without the right knowledge made a really bad decision. I guess I’m curious, were the system engineers sort of scratching their heads? I mean, was there a faction of the team that was like, “You can’t do this, this is a really bad idea.” or did the engineers actually not know how catastrophic it could be? Did they not play out all the things that could happen the night before?
Yes, they did play out all the things that could happen and they did warn that it could be catastrophic.
That is insane. I actually didn’t know that. I knew a little bit about the failure mode, but I didn’t know about the management decision making. I can’t believe it. I’m a little bit in shock right now.
Me too. And so what were the lessons and takeaways from that that carried forward from the incident to prevent future problems?
So with that particular incident, the Challenger accident, the obvious one is management should never override subject matter experts in making critical decisions like that. That’s what they’re there for. And the other one is what I consider: beware of normalization of deviance. The reason for that is, before the Challenger disaster, the launch prior to that was Columbia. And up to that point, there was a lot of… let me just backtrack a little bit. In addition to the launch commit criteria, there were also these OMRSD requirements, from the Operations and Maintenance Requirements and Specifications Document. Thousands of those requirements were waived on the previous flight in order to launch on time. And prior to that, there may have been several hundred that were waived on the flight before Columbia. So by the time we got to Challenger, it had become normal to waive launch commit criteria and maintenance and specifications requirements.
You could see uninformed management basically thinking, “Well, look, if we leave it to the so-called experts, we will never launch because there will always be something that’s not within boundaries.” And that’s bad thinking because you’re not looking at likelihood and consequence. You’re not doing risk-based decision making when you do that. It’s clearly uninformed, what they did. Wow.
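Risk-based decision making, as Bryan describes it, means weighing likelihood against consequence rather than treating every out-of-bounds reading the same. A minimal sketch of the classic likelihood-consequence scoring pattern; the 1-to-5 scales and thresholds are illustrative assumptions, not any agency’s standard:

```python
# Illustrative likelihood x consequence risk scoring, a common
# safety-engineering pattern; scales and thresholds are assumptions.

def risk_score(likelihood, consequence):
    """Both inputs on a 1 (low) to 5 (high) scale."""
    return likelihood * consequence

def risk_level(score):
    if score >= 15:
        return "unacceptable: do not proceed"
    if score >= 8:
        return "elevated: requires mitigation and expert sign-off"
    return "acceptable: proceed with monitoring"

# A cold-soaked O-ring seal: even a moderate likelihood of leaking
# is intolerable when the consequence of a leak is catastrophic.
print(risk_level(risk_score(likelihood=3, consequence=5)))
```

The structure makes Bryan’s point concrete: "always something out of bounds" is not an argument for launching, because two out-of-bounds conditions with the same likelihood can sit at opposite ends of the consequence axis.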
And working there, one of the things all of us systems engineers always felt was constant schedule pressure. And during Challenger, there was a lot of schedule pressure because of all of the rescheduling of the launches in the prior mission. And then Challenger itself, I think we attempted to launch it four or five times before that day. On that particular day, President Ronald Reagan was going to give the State of the Union address, and I believe there was going to be an attempt to have him communicate with Christa McAuliffe during that State of the Union. So there was additional political pressure feeding the thought, “Hey, we have to launch today. It’s very important for NASA, it’s very important for NASA’s image.”
Bryan, you often say to me when we talk about when autonomous vehicles are gonna deploy and you’re always like, “when they’re ready.”
That’s right. We all have goals as a company. You have to drive teams towards goals but you also have to be data driven when you make determinations about are you ready or not. When you reach a milestone, you have to assess whether you’ve actually accomplished what you set out to do. Part of that is evaluating the safety criteria and seeing where you are. And I think that’s where there’s a lot of decisions that have to be made but you have to follow the data and you have to look at the facts. You have to listen to the experts.
This is a great case study in how those communications can totally break down and the consequences.
So let’s move onto the space shuttle Columbia. I remember now the Columbia was re-entering the atmosphere. I remember watching this live on television, and seeing that the shuttle had disintegrated. The narrative at the time was that a piece had fallen off the rocket booster and hit the wing and there’s nothing that could be done, the whole thing was doomed. Can you explain to us what actually happened to the space shuttle Columbia and then walk back and tell us why it happened.
Basically what happened is a piece of foam came off of the external tank, that’s the big orange tank that you see the shuttle attached to, striking the leading edge of the wing, which is made of reinforced carbon-carbon. And during re-entry it caused a breach, which allowed hot plasma at about 3,000 degrees Fahrenheit to get inside the wing and destroy the vehicle.
So how could that possibly happen? In the system engineering of the shuttle, was it not foreseen that there was a possibility that a piece of foam could come off the tank and cause such a puncture?
Going back several flights, they were tracking pieces of foam that were coming off the tank. At the time, I was no longer with NASA, but what I found out in the course of the investigation is that apparently there was a change in the propellant used to spray the foam on, a more eco-friendly one, and the foam wasn’t holding on as well as before. And it was another one of those normalization-of-deviance situations: a piece of foam broke off, maybe there was some slight damage to some of the tile that wasn’t that big. “We need to correct the problem, but we can continue flying.” It probably should have been a situation where they stop flying until they figure out how to completely avoid the problem.
So the bottom line was, the vehicle was allowed to continue flying even though they had been tracking falling debris for a while. This piece that fell off was large enough to do damage, and reinforced carbon-carbon is very strong. In fact, the space shuttle’s exterior consists of three different thermal protection systems. One is the tile, which everybody knows about. Those are very lightweight squares; they look like tiles you would use in your bathroom. They’re made of almost pure silica. Amazing heat-protection qualities, but they’re not strong at all. If you have one in your hand, you can break it by just pressing your finger on it. I once saw an eighth-grade student, during one of the demonstrations where I was letting kids hold a tile, punch his hand through it just to see, right after I told them it was very delicate.
Anyway, these are very delicate, but they have great heat protection. Other parts of the shuttle have what are called thermal blankets, on the sides of the vehicle. They also protect against heat. And then there are the areas of the shuttle that actually carry and transfer loads into the vehicle, like the nose cone and the wings. Those are made of reinforced carbon-carbon because of its strength. It can withstand all of the load that gets transferred into a vehicle that’s traveling at Mach 25.
As part of the system, is there not a protocol for: you’ve taken off and there’s an expectation that foam may come off, potentially damaging an element of the spacecraft? Is there not some protocol procedure to inspect the wing or leading edges before you return to Earth? Is that part of a redundant system?
That is an excellent question, Alex. So the way they determined that this impact occurred was in review: there’s what’s called a debris team that reviews shuttle launch videos to see if there are any anomalies during ascent. The day after the launch, they discovered this huge piece had come off and impacted the wing. A lot of discussions went on while the vehicle was in orbit, and a recommendation was made by the debris team to somehow do imaging to be able to see what the damage was to the wing. Johnson Space Center is responsible for the mission and Kennedy Space Center is responsible for launching. A decision was made, I’m pretty sure at the Johnson Space Center, that it would not be feasible to do any type of imaging. There were never any plans for doing that. Like you said, there wasn’t an established procedure saying once we’re in space, we really need to examine the vehicle to make sure there isn’t any damage to it. Anyway, they did do some computer modeling, and I guess the conclusion they drew from the computer modeling of the impact was that there was going to be some heat damage to the leading edge of the wing but no structural damage, and that it would be a turnaround issue: once the vehicle got back, it would be repaired prior to the next launch.
Is there anything they could have done had they actually gone out and looked at the damage?
Those were also part of the discussions. One point was: there is nothing we can do right now to see it, even with satellite technology or other options. And even if we know that the wing is breached, there is nothing we can do. That was the prevailing attitude at the time, that it would just be a matter of saying, “We know this vehicle is not going to make it back and there’s nothing we can do about it.”
So what was the takeaway? I mean, what was the learning for NASA and system engineering for future flights to prevent that problem?
I will render my opinion on this. Number one, a problem will not go away by ignoring it, and that covers two of the problems here. One is that the debris was coming off of the tank for a while and it should not have been ignored; the problem didn’t go away. The other is that once the vehicle was in orbit, the possibilities were ignored: both looking at the wing to see if it was breached, and pulling together, similar to Apollo 13, to see what could be done to get the astronauts back safely, whether somehow in another vehicle, from another country, or another shuttle. People were saying, “Well, there’s no way we’ll have another shuttle ready in time, based on the resources like oxygen they have up there, to be able to rescue them.” It’s my opinion that we should have given that a try; we shouldn’t have ignored it. The second lesson learned, I think, is the same one as before: beware of normalization of deviance. “We’re getting used to the strikes. Let’s just move forward, waive it, and hope that nothing happens.” And the third takeaway to me was, have more faith in the human element. Had the problem been known and understood, I think there’s a possibility that we would have figured out a way to get the astronauts back safely. And of course, I’m saying all of this from the comfort of hindsight and “Monday morning quarterbacking.”
It seems like for every disaster, there was always one or more people who projected potential issues and oftentimes warned of them, and they were ignored. And some organizations are culturally better optimized to not let that happen. So tell us about Apollo 13. What went wrong and what the learning was from that incident?
Yeah, so about 56 hours into the mission, one of the oxygen tanks in the service module exploded. The Apollo vehicles were actually made of three components: the command module, where the astronauts were; the service module, which provided the fuel, oxygen, and propulsion; and the lunar module, which was actually the part that went down onto the moon. So in the service module, one of the tanks explodes and it damages the valve to one of the other oxygen tanks, and immediately oxygen starts bleeding out. Pretty soon after that, it’s determined that it’s just going to be a race to get the astronauts back, and figuring out how they’re going to be able to do this when the vehicle that’s actually designed to bring them back to Earth is now damaged. Now the only things left are a capsule for them to be in during the re-entry and the lunar module, which is not designed for bringing humans back to Earth.
So what caused that tank to explode?
That particular tank that exploded was originally intended to be installed in Apollo 10. It was removed for some modification, and during the removal, it was dropped a couple of inches. So they decided not to put it back into Apollo 10: let’s replace the tank. At some point the tank makes it into Apollo 13. I don’t recall right now how the determination was made, but it was decided that a two- or three-inch drop would not have damaged anything, so let’s go ahead and use it.
Can you explain what two or three inches of drop means because I’m not sure that I understand it?
So when they pulled it out of the service module, somebody dropped it about two or three inches. And it was determined that this should not have affected anything when, in fact, it affected one of the lines that brings the oxygen in and also allows the oxygen to drain out. So that was one of the problems with that tank. Sometime after that, the tanks were redesigned, because originally they were designed to only use 28 volts, which is the voltage supplied by the command module to the systems in the service module. However, when the vehicle is at the launch pad, the launch pad provides 65 volts to keep systems running before launch. So everything regarding that tank was modified to use 65 volts except for these thermostatically controlled heater switches. So when the tank was being tested at the Kennedy Space Center, where you had to fill it and re-drain it, they were having trouble draining the tank. That was because, apparently, when it was dropped, one of the lines was bent and the oxygen wouldn’t drain out. So the workaround procedure they came up with for draining it was: let’s just run the heater for about eight hours, which will cause the pressure of the oxygen to rise and force it out of the tank. That worked, but apparently one of the things that did to that thermostat we talked about was burn the insulation on the wires, because it wasn’t meant to operate at 65 volts and for that length of time. So when they got up into space, there were in-flight procedures that called for running a fan inside of that tank to keep the oxygen stirred. And so when the switch was activated, with the insulation gone, it caused the explosion.
This is fascinating to me because it seems like even if individual parts or components pass muster, if they’re not tested together and retested together, things will get overlooked and the smallest thing can make all the difference and break everything. So there was a famous story about a single rivet that was incorrectly installed. Give us a short version of the rivet story because I don’t even think I know what the accurate one is.
It was during a time where I was a lead structural engineer on one of the vehicles. Basically it was about two o’clock in the morning and we were working on a modification that was sent to us from the Johnson Space Center. The shuttles had gone through structural analysis where they determined that the aft bulkhead, which is the area that separates the cargo bay from the engine compartment, was not as strong as it should be. So they wanted to beef it up by adding
aluminum doublers, basically adding additional metal to stiffen that structure. When you’re working inside the shuttle, you have to wear what we call bunny suits. You’re completely covered to make sure that you don’t introduce any contaminants or college rings or anything like that. Things like that have been found in prior flights when astronauts got on orbit and things started floating around. So anyway, you really can’t tell who’s in there. It’s just a bunch of people in bunny suits. These two technicians are drilling holes to install these doublers, and I hear one of them say, “Uh oh,” and the other technician says, “Uh oh, what?” and he said, “I drilled the hole too big.” And basically, the rivet is gonna float around in there. So one of the technicians says, “Well, if I just install the rivet anyway, NASA will not be able to tell that the hole is too big.” They kind of agree with each other and at some point, one of them looks over at me and says,
“Oh, you’re NASA, aren’t you?” And so we had to stop the job, write up what’s called a problem report, and come up with a solution to the oversized hole. The whole point behind that story is that sometimes, when you add the human element in, it doesn’t matter how well you design
a system or an improvement to the system; little things like that can still happen.
I’m taking it that the person installing the rivets did not have any clue the potential effect of having that hole too large. I’m guessing he didn’t understand the engineering behind how important that actually is.
Correct. Yeah, I don’t know. I remember that night, after I went home, trying to analyze it, thinking basically what you said: Did he not know the importance? Was he afraid of being embarrassed when his supervisors found out that he drilled a hole that was too big, when the specifications actually said how big the hole should be? Was it a schedule thing? There are just so many parameters. The bottom line is that was a decision he was contemplating: let’s just put the rivet in there and keep the job going.
But this shows you how important the supply chain integrity is and that everybody understands what they’re doing and why they’re doing it and that the details matter and are important. You can do everything right in your own company, but whoever you subcontract to they need to feel like they’re part of the family to some extent, and that they understand the big picture of how they’re contributing to the mission. It’s so important for that to get spread down to everybody.
Yeah. And just to give you additional perspective on situations like that: if we were installing any piece of equipment, let’s say a piece of equipment that required five bolts, five washers, and five nuts. When we started installing the equipment, if we opened a bag and there were only four and not five washers, we couldn’t just go to a bin and pick up a washer that was the same size. We had to pull a piece of paper, which is called a problem report. Basically, the problem report was saying we’re missing a washer. The job stops and it goes to a Materials Review Board, consisting of engineering, quality, and safety, to determine how to fix the problem. After all these people meet, the determination is to get a new washer. That washer has to come from the manufacturer, which sells us flight-certified washers. So we have to order a new washer. The washer comes in, and a lot of times the parts we ordered were from Downey, California, because Rockwell was one of our main contractors. So that just kind of gives you the perspective of how sometimes a small job could impact the schedule with a simple problem. It’s a problem you would consider simple because you could usually fix it in 10 seconds if you were at home working on something else, but when it was a space shuttle, it would stop a job.
Because a non space-certified washer can go bad. And then it’s a bad day.
That and because the system engineer working there cannot determine that the fix to that problem is another washer.
Right. So when I was a kid, I remember there were still airliners flying around with three engines. I remember, and I was obsessed with planes, that there was the L-1011. That was a Lockheed. And I think it was a Boeing… no, who made the DC-10? McDonnell Douglas? Yeah, McDonnell Douglas. And so they had a third engine on the back at the bottom, between the fuselage and the tail fin. I’m not an engineer, I wasn’t then or now, but I looked at that and I always thought to myself, “I’m not sure that’s the best idea,” and my dad said, “Well, we’re gonna fly anyway.” There was this famous crash, United 232. Can you tell us about what happened there? Because I remember when that happened, I thought to myself, that plane has three engines, and I knew that was gonna happen even when I was a kid.
Well, under normal circumstances a complete failure of that engine, let’s say it was unable to produce thrust, would not be an issue, as the two remaining engines would be able to safely bring the aircraft back. However, the way the fan blades broke apart, they impacted the lines for all three hydraulic systems, which eliminated the ability for the pilots to steer the plane. So what they noticed, when they lost that engine and they started going through the checklist, was that they could not move the throttles and they could not shut off the fuel to that particular engine. So the plane just wanted to turn right the whole time. And because of the expertise of the pilot, he was able to steer the plane by differential throttling of the engines on the left and right sides.
And there was actually another pilot on the plane who trained airline pilots, and he was able to come into the cockpit and assist. He worked the throttles while the other two pilots were trying to find workarounds for landing, because one of the other problems with the loss of total hydraulic pressure is the landing gear: you need hydraulics to lower the landing gear. Fortunately, they did have procedures for lowering the landing gear, and that’s what the other two pilots were working on. They were able to drop the gear and also activate another lever, which opened the doors to the landing gear. Once those doors were open, the landing gear would drop into the position needed for landing.
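The differential-throttling technique the crew used can be pictured with a little physics: with no hydraulic flight controls, the only steering moment available comes from a thrust difference between the wing engines acting across their lateral offset from the centerline. A toy calculation; all numbers are invented for illustration, not DC-10 figures:

```python
# Toy model of steering by differential thrust, as on United 232.
# Thrust values and the engine offset are illustrative, not real data.

def yaw_moment(thrust_left_n, thrust_right_n, engine_offset_m):
    """Yawing moment in newton-meters; positive yaws the nose right
    (more thrust on the left engine)."""
    return (thrust_left_n - thrust_right_n) * engine_offset_m

# Equal thrust: no yawing moment, the aircraft holds heading.
print(yaw_moment(100_000, 100_000, 8.0))   # 0.0
# Extra thrust on the left engine yaws the aircraft to the right.
print(yaw_moment(120_000, 100_000, 8.0))   # 160000.0
```

It is a crude control channel, which is why the crew needed a third pilot dedicated to working the throttles continuously.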
If the fan blade failed on the third engine and it cut all the hydraulic lines, isn’t that a fatal flaw? The notion of the system is that it has redundancy. And yet baked into the system was the possibility that you could cut all three lines if this engine failed. Right?
One of the things that I’ve learned in investigating different engineering disasters is you cannot plan for nor predict every possible failure scenario. When this happened, the pilots were communicating with the mechanics on the ground explaining what they were experiencing with no hydraulic pressure and they were saying that’s impossible. You have three systems, you have three hydraulic lines. It’s impossible that you have no hydraulic pressure on any of them. So again, it goes back to the point that you just can’t predict everything. That was a failure scenario that was never predicted.
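The ground crew’s disbelief makes sense under an independence assumption: if each of three hydraulic systems fails on its own with some small probability, losing all three at once looks astronomically unlikely. A common-cause event, like an uncontained engine failure severing co-located lines, defeats that arithmetic. A sketch of the difference; all probabilities here are invented for illustration:

```python
# Why 'three independent systems' is not the same as 'cannot all fail'.
# All probabilities are invented for illustration only.

p_single = 1e-4          # assumed chance one hydraulic system fails on a flight
p_common_cause = 1e-7    # assumed chance of one event taking out all three

# Under pure independence, total loss is vanishingly rare:
p_independent_total_loss = p_single ** 3          # about 1e-12

# With a common-cause term, the common cause dominates entirely:
p_total_loss = p_common_cause + p_independent_total_loss

print(p_independent_total_loss)   # ~1e-12
print(p_total_loss)               # ~1e-07, five orders of magnitude worse
```

This is the standard lesson of common-cause failure analysis: redundancy only buys you the multiplied probability if the failure modes are genuinely independent, and three lines routed past the same engine are not.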
In Cactus 1549, it was the Sully Sullenberger incident. If I recall correctly, some birds struck one of the engines. Was there never any design plan, or no one ever speculated that a bird strike could shut down engines like that?
Well, actually they have. At the time of that flight, engines needed to be certified to withstand the impact of at least a two-and-a-half-pound bird going through the engine while still being able to produce thrust. Unfortunately, on this flight, there were Canada geese. Each engine took in at least two of them, and each one could weigh up to eight pounds. So that was 16 pounds of bird going through each engine, when the engines were only rated for a two-and-a-half-pound bird.
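The mismatch Orench describes is easy to quantify from the figures in the conversation: certification assumed roughly a 2.5-pound bird per engine, while each engine ingested about two geese of up to 8 pounds each. A quick check of those numbers:

```python
# Worked arithmetic from the figures mentioned in the conversation.
cert_bird_lb = 2.5       # certification assumption per engine at the time
geese_per_engine = 2     # at least two geese per engine
goose_lb = 8.0           # upper weight of a goose, per the discussion

ingested_lb = geese_per_engine * goose_lb
print(ingested_lb)                    # 16.0 pounds per engine
print(ingested_lb / cert_bird_lb)     # 6.4x the certified bird mass
```

So each engine took more than six times the bird mass it was certified to survive, which is why both engines lost thrust.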
So what’s the takeaway from United 232 and Cactus 1549? What are the lessons learned from that to mitigate the risk of future incidents like those two?
With 1549, I believe they increased the weight capacity of birds to three or three and a half pounds or maybe even five. I know they changed that criteria. On the 232, I haven’t seen many of those aircraft anymore with the engines up on the tail.
So I wasn’t crazy.
I don’t think you’re crazy at all. But to me, the biggest takeaway from both of those have just been the human element and how the human element was able to save a lot of lives.
It seems like people always think that tech just gets invented and it just works. But it seems clear that tech, even once it’s built and designed, has to be maintained and upgraded, and that people will always be part of the story. People shouldn’t be as obsessed with the technology itself as with the people and organizations that design, build, and manage the technology. So Orench Consulting, your firm. Explain what you do.
Yes. Currently, I am an advisor for a startup in the UK, which is creating crisis management online courses. So I am providing content based on the crises that I have been involved with. Basically, that’s what I’m doing for that firm. But in addition to crisis management, I have been providing advice to companies in the United Arab Emirates on the construction of training facilities for law enforcement, particularly in the area of the use and deployment of technology for law enforcement.
You know, you talk about training and to this day, it’s always surprising to me that you can get a four year engineering degree and not have been forced to take a class around engineering disasters and the lessons learned. Similar to what we went through today, but maybe a little bit of a broader treatment on the topic. Don’t you think that that’s something that really should be considered to be part of an accredited engineering college?
I think it should be, for every field of engineering. I was involved with failure analysis, both in my Mechanical Engineering bachelor’s and in my Materials Science master’s, but I did not see courses like that in the electrical, computer, or any of those other curricula.
My first job out of college was working for a railroad company that provided control systems and control centers overseeing the operation of Class I railroads across the US. So they had CSX, Union Pacific, some of the largest railroads in the country. And the software that I wrote, that was like 5% of the job. The other 95% of the job was understanding all the ways in which it could fail. Making sure it integrated properly with all the other systems. Making sure that it was super well tested. That we forecasted, to the degree of our expertise and abilities, the ways in which it could fail, and made sure that if one element failed, another element didn’t. Because if that software shut down, the railroads would lose their eyes and ears over their entire operation, and that cost them significant money, but also there are some mission-criticality issues there as well. And it was a part of the job I didn’t expect in terms of training. I thought I was going to learn a lot about how to build software in an industrial sense. I suppose I did, but what was surprising to me was how much the most experienced architects in that company spent the majority of their time thinking: How can it fail, and how do I keep it from failing? And it was really interesting to see. Most of the architects in that company had at least 15, if not 20 to 30, years of experience doing this. You get them in a room, and there was no better schooling than just to listen to them on all the ways in which things could go wrong. You probably never wanted to fly on a plane with them, because they could probably scare you to death, but I wonder: how do we take that knowledge and make sure it gets passed down to the next generation?
I think you said it all. I think all engineering curricula should include that. What you said also reminded me of one of my favorite books in engineering, which is To Engineer Is Human. The author is Henry Petroski. It’s an excellent book. It talks about how we should design with failure as a consideration in every aspect of the design.
So we have some young people and aspiring engineers who listen to this podcast, and a big shout-out to them: read books like that one and really think about failure modes. Think about how you could make software, hardware, or whatever you’re building more resilient. And it’s actually really fun. I found it to be a fascinating, fun part of the job. The best time we had was when we were in the lab and it was sort of pull-the-cable day: we just pulled random network cables and other things, just in case your test cases didn’t cover it. Let’s see what happens, right? And sometimes the best learning came out of those types of tests. We did the same thing when we did the Urban Challenge in 2006 and 2007. We sort of called it “free-for-all testing.” We would go through, in a very principled way, all the different permutations through a four-way stop at a controlled intersection, for example. We really would plan out every different way you could go through that intersection and run those tests. You learn things. And when things broke, you fixed them. It was very helpful to understand what went wrong, obviously, when you go through the permutations of tests like that. But the best kind of testing was when it was, “All right, now let’s watch the system operate sort of organically.” Let’s not try to plan it too much. Let’s throw a bunch of traffic on the test course, and let’s see what happens. That’s when you get some really interesting things to play out that you maybe didn’t plan for or didn’t think to put into a test plan, like some real subtle variations on timing that would expose flaws. To me, that’s how you build resilient systems. It isn’t just pre-planning and trying to think everything out in a conference room. You’ve got to let it out, and you have to do controlled tests but still allow it to really interact with the environment in an organic way.
It’s got to be difficult with a shuttle because you can’t really do what I’m saying without launching. And that’s where simulation, I guess, comes in and is so important.
One of the things I just thought of when you were talking about your testing: is there ever a situation where you have a scenario of something that happens, or that you think is going to happen, and you throw it at someone who’s driving a car, say, testing it with several dozen or several hundred human beings to see how they react, and then see how an artificially intelligent system reacts to it and compare the two?
We’ve done some of that, certainly. We’ve got a pretty large test track now that allows us to really create all sorts of different variations. There are times when we’ll say, “Hey, everybody knows how to ride a bike in the company. Let’s go to the track, and let’s throw all these variations at it.” We’ve done some comparison to human testing as well. I mean, all of this helps inform the requirements at the end of the day and sort of helps you plan out: what are the behaviors you want to see from the vehicle based on those experiences?
Jose, it seems like the rules of risk mitigation in designing safe systems are: never normalize deviance, build in redundancy, and maintain the supply chain integrity that Bryan brought up. What am I missing?
While we’re thinking about that, let’s talk about supply chain integrity. That’s a huge concern for me because of things I’ve seen throughout my career and also things that I’ve read about. Are you guys familiar with the Supermicro issue in 2018?
Yeah, I was actually gonna bring that up, in case you didn’t. So Bloomberg had a huge piece in October of 2018. And I’ll let you tell the story, because you probably tell it better than I can, but what was funny was when that came out, that was something that was rumored about and was considered sort of a hoax or a myth. Somehow, that investigation, as I understand it, had been happening for a long period of time. And then there was a myth in Silicon Valley about this. I remember supply chain people telling me “Ah, that didn’t actually happen.” Tell us the real story. What happened?
Well, I read the same article you did. But based on my experience, and things I’ve seen and worked on, it’s a very real threat. The way it’s explained is not a myth. The concern I’ve had with it, when we talk about autonomous vehicles, is being able to get things into the supply chain that have a chip installed. And I guess we didn’t explain what the issue was: apparently, Supermicro computers that were used for servers had a very tiny chip, about the size of a pencil tip, installed on the boards. This chip had network capabilities, processing capabilities, and the capability to change instructions inside the board. Basically, when the board was powered up on the server, it would call home and say, “I’m waiting for instructions here.” So it was giving access wherever these boards were installed in servers. It was giving you 100% access to whatever was there.
Well, this article alleged that it was actually a subcontractor to a subcontractor of Amazon for their cloud business, and more specifically, their government cloud business. So think about that, right? You commissioned someone to create these motherboards for these servers, you created a design, you went to a manufacturer and said, “Hey, please build this,” and they built everything you asked for, except there’s an addition on the board that was not in the plans, not in the design. The article describes it as about the size of a grain of rice. So how are you going to find that, right? There are certain types of tests you can do that will absolutely find it. You can do scans of the boards afterwards, compare them back to your design and layout, and verify that. Most of the time, you’re verifying that all the chips that are supposed to be there are there, but you could also look for things that should not be there. There are also forms of these tests that do it randomly or less precisely, and those would never have caught it. There were some security engineers who were a little bit concerned. I don’t know exactly what the telltales were, but they went and looked into it and came to find out that the boards were shipping with this network component that Jose was just describing.
So this is basically a national security issue. Because if you’re deploying these things for government use, or in a vehicle or a shuttle, there may be a bad actor who wants to see a very different outcome from another actor. So what is the optimal relationship between law enforcement, government agencies and private companies to prevent this?
Well, that’s one of the things that we had talked about when I was with the Bureau’s technical liaison unit. My job was to reach out to companies to try to find partnerships with them.
First of all, to let them understand what the vulnerabilities and problems were out there. Because I know that, with civil liberties, people don’t want the government involved with private companies when asking for help and trying to form these partnerships. But it was very important, especially in cases like a kidnapping, where the simple act of getting access to a cell phone would help you find the kidnapping victim. You have the cell phone, but you don’t have the password, and you need help from the manufacturer: how do I bypass this password? Quite often we would get the answer that there’s no way to bypass it. So we had to, through the technical labs that we have, actually do chip-off forensics on a phone board: remove a data chip and reverse-engineer the data that was on it to obtain the information we needed.
Do you ever think about what the next big disaster might be? I mean, what keeps you up at night?
My biggest concern right now is with things happening similar to Stuxnet. Are you aware of that?
For those who don’t know what it is, would you please tell us?
Yeah. So basically, it was a computer worm that was introduced into the Iranian nuclear program, and it caused their centrifuges to self-destruct. So I worry, because we have so many systems in this country and worldwide – power plants, manufacturing facilities – where you can actually control what happens in them from home. I can recall earlier in my career, after I first got into NASA, I left for about a year and worked for Baxter Laboratories. I was a manufacturing engineer, and my pager was going off all the time; I kept having to go back in because a boiler went down or an automated line went down, and I would have to fix it. But fast-forward a couple of years, and now the troubleshooting can be done from home, or the systems can be analyzed from your house. So a lot of companies are putting their systems online. And to me, that’s a big recipe for trouble if you don’t have the right cybersecurity systems in place. When you think about weapons of mass destruction, you think a lot about biological, maybe even radiological, or explosives. But to create a real weapon of mass destruction out of explosives or biological agents, you have to buy ingredients that will raise flags. When you’re writing code for a digital weapon of mass destruction, you don’t have to buy those ingredients. You will not raise any flags; you’re doing this at home on a computer. So that’s a big concern that I have.
I gotta ask this. I’m a big fan of science fiction. I’ve been watching this show called For All Mankind, which imagines what would have happened had the space race continued with more Apollo missions into the ’70s. They touch a little bit on things that happened in the Soviet space program, and it seems they had a lot of issues with systems engineering and safety; their bigger rocket designs were prone to exploding, and they could never solve that. And that appears to have prevented them from executing on their lunar mission plans. Is there a single through line in all the things you’ve observed at NASA, or even at other programs, that we can carry forward into the private sector? Is there one rule, the Orench rule, that would make all systems safer?
Yeah. Even back in my early days at NASA, I learned really quickly to trust subject matter experts. I saw so many decisions being overridden based on budget or schedule, and to me, that’s not what matters. Getting this done faster is not important. So: trust the experts.
Jose Orench is a retired special agent with the FBI, a veteran of NASA, and the president of the crisis management firm Orench Consulting. Jose, thank you so much for coming on the No Parking Podcast.
Thanks for joining us.
Great, thank you.
If you enjoyed today’s episode, please connect with us on social media. We’re on Twitter at No Parking Pod. And of course, I’m everywhere, but especially on Twitter at AlexRoy144. That’s Alex Roy and the numbers 1 4 4. Please share No Parking with a friend. Like us, subscribe, give us a five-star review wherever you find your podcasts. This show is managed by the Civic Entertainment Group. Megan Harris is our awesome producer. And of course, my co-host is Bryan Salesky, the co-founder and CEO of Argo AI. Until next time, I’m Alex Roy, and this is the No Parking Podcast.