AI Digest

OCT/NOV 2026

~ ~ \\ // ~ ~

authored by

Professor Neil Siegel

15 MAY 2026

Copyright (C) SAFE AI Foundation

AI based on LLMs is unsuitable for Mission-Critical Applications

Society depends in an essential way on modern engineered systems – for example, ambulance dispatch, air traffic control, hospital operations, process control at factories and refineries, and so forth. Every human creation contains defects and errors, and years of assessment have conclusively demonstrated that testing never removes all of those defects and errors. It is of vital importance to ensure that the latent defects (Ref 2) have been reduced (in both frequency and magnitude of impact) to some pre-determined level that is considered acceptable by the intended users of the system.

Some of these societal systems generate recommendations for review by human decision-makers. But it is already the case that quite a few of these systems actually automatically take action without waiting for human approval; examples include the auto-pilot on an airplane, and many types of controls at industrial facilities. Today’s versions of these systems, however, do not employ artificial intelligence. Instead, these systems deliberately employ algorithmic techniques (usually based in physics or mathematics) that allow us to estimate and bound the frequency and magnitude of the remaining errors. It is the existence of such a bound, derived from a rigorous discipline (such as physics), that creates confidence that these systems are safe to use.

However, it seems certain that developers will consider using artificial intelligence techniques – especially the technique called large language models – in future versions of these critical societal systems. It is an established fact that these large language models make errors. The term “hallucination” has been adopted to talk about these errors; somehow, the developers think that using that term (rather than the term “error”) makes the errors seem less problematic.

As I will discuss below, even the developers of these LLMs acknowledge the existence of the errors, and also acknowledge that future improvements will not reduce the number or magnitude of these errors. I will also show that – unlike the methods used today in critical societal systems – the methods used in large language models do not allow for the establishment of a bound on either the frequency or magnitude of the remaining errors.

Because of these facts, the use of large language models in critical societal systems is certain to increase significantly the frequency and magnitude of errors in those systems. If this increased incidence of errors is not somehow corrected, those systems based on large language models will create significant hazards for their users, and for society at large.

"INTELLIGENCE" IN OUR COMPLEX SOCIETAL SYSTEMS

Many computer systems are intended automatically to provide information and assistance to human decision-makers, ranging from the President of the U.S., to the war-fighter, to physicians, and to consumers; I have built examples of all four of these. There are, as noted above, two very different types of use-cases for such “automated assistance to decision-making”:

  • Use-case I: the computer presents information – and perhaps makes recommendations – to the human, but the human makes the decision

  • Use-case II: the computer (perhaps, for example, because of reaction-time constraints) makes the actual decision, and also immediately and automatically takes action to implement that decision (Ref 3)

Note that both of these use cases are already in widespread use.

Examples of use-case I – A human makes the decision:

  • Amazon suggesting products that you might be interested in buying, based on your past purchases and browsing history within Amazon.

  • Various types of medical diagnosis. Computers already provide recommendations for medical diagnosis (such as interpreting the results of chemical blood tests, or assessing various types of medical imagery), but in the U.S., diagnoses are considered “official”, “actionable”, and “payable” only when the diagnosis is made (or at least, confirmed) by a human being with a specific license (MD, RN, PA, depending on the type of diagnosis).

  • Google suggesting a driving route.

Examples of use-case II – The computer makes the decision, and can automatically initiate &/or perform an indicated responsive action:

  • The auto-pilot and automatic landing system on an airplane. Commercial aircraft have used electronic devices (“auto-pilots”) to control the airplane during level flight at least since 1911 (Ref 4), and to perform landings (“auto-landing systems”) at least since 1937 (Ref 5). In addition, the Global Hawk (a military autonomous aircraft with a wing-span about the size of a Boeing 737) has been conducting fully-automated and fully-autonomous flight operations for nearly 20 years (Ref 6). There are data that might cause one to conclude that we have, in fact, already reached the point where such automated controls for aircraft are safer than human pilots.

  • A self-driving car. A company called Waymo (Ref 7) is conducting experimental operations of fully-automated cars in Los Angeles and other regions (Refs 8, 9, 10).

  • Credit-card purchase approvals and denials. This process is already fully automated by several different companies; humans intervene only in the case of a disputed denial. The volume of transactions involved likely precludes routine human involvement in the approval/denial process.

  • Many types of industrial automation. Many industrial facilities (such as chemical plants, oil refineries, oil production facilities, and so forth) are fully automated (Ref 11), in the sense that sensors monitor activities, and computers make and implement decisions about controlling the process (e.g., opening and controlling valves, setting temperatures, and so forth). Human beings have only a monitoring (and, potentially, an over-ride) role.

As you can see, society already does quite a bit of use-case II (fully-automated operations); this is not just some theoretical future consideration.

But it is also the case that almost none of the current generation of fully-automated systems employ anything that we would call “artificial intelligence” (Ref 12). Why is this? Because these are critical missions, whose outcomes really matter, and therefore we have used automation approaches that:

  • (a) are constrained to methods we trust,

  • (b) allow us to characterize the frequency and magnitude of errors that might be made by the automation, and

  • (c) allow us to create mechanisms intended to bound those errors (both in frequency and magnitude).

To date, we have developed methods for performing items (a), (b), and (c) only for “conventional” algorithms and computing strategies, that is, those that do not employ any form of modern artificial intelligence; in particular, we cannot bound the errors when a large language model is used.
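To make concrete what items (a) through (c) look like in practice, here is a minimal sketch in Python (my illustrative example; the text above does not discuss bisection) of a conventional algorithm with a provable error bound: after each halving of the search interval, the true answer is guaranteed to lie within half the remaining interval width.

    import math

    def bisect(f, lo, hi, tol):
        # Find a root of f in [lo, hi] by bisection.  Requires that
        # f(lo) and f(hi) have opposite signs.  After n halvings, the
        # true root provably lies within (hi - lo) / 2**n of the
        # returned value: a rigorous bound on the remaining error.
        if f(lo) * f(hi) > 0:
            raise ValueError("f(lo) and f(hi) must bracket a root")
        while (hi - lo) / 2 > tol:
            mid = (lo + hi) / 2
            if f(lo) * f(mid) <= 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    # The result is guaranteed to be within 1e-6 of a true root.
    root = bisect(lambda x: math.cos(x) - x, 0.0, 1.0, 1e-6)
    print(f"root = {root:.6f} (error provably <= 1e-6)")

No comparable guarantee is available for the output of a large language model.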

THE NATURE OF LARGE LANGUAGE MODELS

As noted above, even the developers of these large language models acknowledge the existence of the errors, and also acknowledge that future improvements will not reduce the number or magnitude of these errors.

Why is this?

  • They use the wrong assumptions for critical missions: according to a recent paper by key technical people at OpenAI (Ref 13), these large language models operate on the basis that it is better to guess than to admit uncertainty. The cited paper explicitly states that “training and evaluation procedures reward guessing over acknowledging uncertainty”. But for a critical mission, guessing is a bad strategy; decades of decision-making research says that it is better to acknowledge uncertainty than to guess. (A toy calculation after this list illustrates the incentive.)

  • The errors are inherent in their algorithms; they are not caused solely by errors in the training data, nor would the use of perfect training data prevent these errors: “We show that even if the training data were error-free, the objectives optimized during the language-model training would (still) lead to errors being generated” (Ref 14).

  • The developers of large language models are not aiming for quality; instead, they aim to do well on industry-standard benchmarks, as they believe that those drive investment and sales (“. . . can only be addressed . . . by modifying the scoring of existing benchmarks” (Ref 15)); note that they explicitly prefer such adjustments to measuring the incident rates and severity of hallucinations.
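The incentive described in the first bullet above can be seen in a toy expected-score calculation (a sketch of my own; it assumes a hypothetical benchmark that awards 1 point for a correct answer and 0 points for both wrong answers and abstentions, and the 30% accuracy figure is an arbitrary assumption):

    # Under a score-1-for-correct, 0-otherwise benchmark, guessing always
    # has (weakly) higher expected score than saying "I don't know".
    p_correct_if_guessing = 0.30   # hypothetical chance a guess is right

    expected_score_guess = p_correct_if_guessing * 1.0
    expected_score_abstain = 0.0   # an abstention earns nothing

    print(f"guessing:   {expected_score_guess:.2f}")
    print(f"abstaining: {expected_score_abstain:.2f}")

Even at 30% accuracy, guessing out-scores abstaining; a scoring rule that penalized wrong answers more heavily than abstentions would reverse the incentive.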

Furthermore, the methods employed in large language models do not permit the establishment of rigorous bounds on either the frequency or magnitude of the errors made by the model.

As noted above, we have used computers to inform (use-case I, above) or to make (use-case II, above) decisions for a long time. But the algorithms and methods that we used were selected and tailored for each specific problem domain, and we grounded those algorithms in physics or some other similarly rigorous discipline; the tailoring of the method to each specific problem domain is an approach long demonstrated to enable the reduction of the number and magnitude of the errors in a system.

In contrast, a large language model uses a single general method to address all problem domains. Such generality has long been demonstrated to result in increased errors.

In addition, the specific algorithms used in large language models (interpolation, pattern-matching, extrapolation, and induction) do not admit of bounds on either the frequency or magnitude of the errors that the method will make. These methods can make vast errors, and it is often very difficult to identify when such errors have occurred.

Here is an example of the errors that can result from extrapolation: consider an experiment about the viscosity of water. If we measure the viscosity of water at 80F, at 70F, at 60F, at 50F, and finally at 40F, we will find that the viscosity varies smoothly and modestly with temperature. If we then use those data to extrapolate in order to predict the viscosity of water at 30F, our prediction will be completely erroneous. This is because water, of course, undergoes a phase-change (“freezing”) at 32F, and extrapolation cannot produce useful results across such a phase-change. But many phenomena, both engineering and social, exhibit such phase-changes (most of which are much harder to predict than the freezing point of water), and in these cases the use of extrapolation can lead to errors of unbounded magnitude.
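A minimal sketch of this failure mode (the viscosity values below are approximate handbook figures that I have supplied for illustration; they do not appear in the text above):

    import numpy as np

    # Approximate dynamic viscosity of liquid water, in centipoise.
    temps_f = np.array([40.0, 50.0, 60.0, 70.0, 80.0])
    visc_cp = np.array([1.55, 1.31, 1.12, 0.98, 0.86])

    # Fit a straight line to the liquid-phase measurements . . .
    slope, intercept = np.polyfit(temps_f, visc_cp, 1)

    # . . . and extrapolate below the freezing point.
    prediction_at_30f = slope * 30.0 + intercept
    print(f"extrapolated viscosity at 30F: {prediction_at_30f:.2f} cP")

    # The fit reports a finite, plausible-looking number, but at 30F
    # water is ice; nothing in the extrapolation signals the phase-change.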

The use of induction is often associated with similar errors. There is a rigorous form of mathematical induction, wherein one creates a formal mathematical proof that if a proposition is true for case N, then it is also true for case N+1. One then establishes that the proposition is true for some specific N.
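In symbols, this principle reads (a standard formulation, supplied here for clarity; it is not quoted from any reference):

    \[
      \bigl( P(N_0) \;\wedge\; \forall N \ge N_0 \,(P(N) \Rightarrow P(N+1)) \bigr)
      \;\Longrightarrow\; \forall N \ge N_0 \, P(N)
    \]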

But outside of the bounds of formal mathematical proofs, we can almost never actually accomplish this level of formality and rigor. Therefore, what is called “induction” in large language models is something different, and not at all a rigorous process: what they call induction is to determine that a proposition is true for some set of cases, and then to infer that the proposition is therefore always true, or at least is true for some further specific case. This process is full of problems that lead to errors. For example, most swans are white; one might examine 1 million swans, notice that they are all white, and therefore (by the informal “induction” described above) assert that all swans are white. But there are in fact some black swans in the world, and therefore that conclusion is wrong (Ref 16). In engineered systems, and in situations that involve the behavior of groups of people, such informal “induction” almost always leads to errors, often gigantic (and unbounded) errors.

Because of these facts, the use of large language models in critical societal systems is certain to increase significantly the frequency and magnitude of errors in those systems. If this increased incidence of errors is not somehow corrected, those systems based on large language models will create significant hazards for their users, and for society at large.

What to do?

To be a little poetic, someone has to be the “adult in the room”, by ensuring that such models are not adopted for critical societal missions until the issues cited above are properly addressed.

How might this be accomplished?

There is at present no widely-accepted method for bounding the errors (both in frequency of occurrence and in severity of impact) created by the operation of a large language model. I propose the following as candidate methods to perform this important task:

  1. Never use the output of an LLM directly to inform or make a decision; instead, use the LLM only to create recommendations, parameters, and inputs for a more-traditional decision algorithm, one for which we know how to bound the frequency and magnitude of errors (Ref 17). (A minimal sketch of this idea appears after this list.)

  2. Create multiple yet truly independent ways of solving a problem (one or more of which, but not all of which, could be based on a large language model), and cross-check the solutions against each other (Ref 18) before applying the decision. (A second sketch after this list illustrates the cross-check.) Note that complex systems have used this method for decades; for example, the U.S. Space Shuttle had (I believe) five computers that in essence “voted” on every control action (Ref 19).

  3. A technique that has worked for bounding the errors of (traditional) algorithms in some settings is radically narrowing the scope of the problem addressed. Of course, current LLMs are constructed in exactly the opposite fashion; they are intended to be completely general, and are trained against vast amounts of diverse data. But it might be that specialized LLMs, trained only with data from the intended problem domain, and whose algorithms were adapted to the specialized circumstances of a specific problem domain (e.g., lung-cancer diagnosis), would make fewer and less-severe errors than today’s more-general ones, albeit at the cost of having their field-of-use restricted to the specific problem domain at hand. One would have to create such a focused LLM model for each problem domain. Some preliminary work in this area is already being undertaken (Ref 20).
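A minimal sketch of candidate method 1, the “2-step method”: an LLM only proposes, and a conventional algorithm with a provable bound (here, the bisection routine from the earlier sketch) decides. The function llm_suggest_bracket is a hypothetical stand-in for a model call, not a real API:

    import math

    def llm_suggest_bracket(problem):
        # Hypothetical stand-in: in practice this would query an LLM
        # for a plausible search interval for the given problem.
        return (0.0, 1.0)

    def solve_with_guarantee(f, problem, tol):
        lo, hi = llm_suggest_bracket(problem)
        # Step 1: validate the LLM's suggestion before trusting it.
        if not (lo < hi and f(lo) * f(hi) <= 0):
            raise ValueError("rejecting LLM suggestion: no root bracketed")
        # Step 2: the conventional algorithm carries the guarantee,
        # regardless of the quality of the suggestion.
        while (hi - lo) / 2 > tol:
            mid = (lo + hi) / 2
            lo, hi = (lo, mid) if f(lo) * f(mid) <= 0 else (mid, hi)
        return (lo + hi) / 2

    print(solve_with_guarantee(lambda x: math.cos(x) - x,
                               "find the fixed point of cos(x)", 1e-6))

The key property is that a bad LLM suggestion can cause a rejection, but never an unbounded error in the final answer.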
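And a minimal sketch of candidate method 2: run several independent solution methods, act only when they agree, and abstain (escalating to a human) when they do not. The three solver functions below are hypothetical stand-ins; a real system would use genuinely independent methods, at most one of which is an LLM:

    def cross_check(solvers, x, tolerance):
        # Run every independent solver on the same input.
        answers = [solve(x) for solve in solvers]
        spread = max(answers) - min(answers)
        if spread <= tolerance:
            return sum(answers) / len(answers)   # consensus: safe to act
        # Disagreement: abstain rather than guess.
        raise RuntimeError(f"solvers disagree (spread = {spread:.3f}); "
                           "abstaining and referring to a human")

    # Hypothetical example: three independent estimates of a setpoint.
    physics_model = lambda x: 2.0 * x + 1.00
    table_lookup  = lambda x: 2.0 * x + 1.02
    llm_estimate  = lambda x: 2.0 * x + 0.95

    print(cross_check([physics_model, table_lookup, llm_estimate],
                      3.0, tolerance=0.2))

This is essentially the Shuttle-style voting scheme the author cites (Ref 19), generalized from identical computers to deliberately diverse methods.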

                                                                                                 ~~~ end ~~~

REFERENCES & FOOTNOTES

  1. Corresponding author. Telephone: +1-213-740-0263. E-mail address: nsiegel@usc.edu.

  2. Those errors that remain in the system after testing is complete are called latent defects. In general, software has about 1 latent defect per 1,000 lines of source code (data collected by Capers Jones, among others). A typical large automation system today might contain 100,000,000 lines of software code, and therefore likely contains on the order of 100,000 latent defects.

  3. Usually, there are provisions for after-action review of such decisions, in order both to build confidence in the decision-making algorithms, and to seek for potential improvements in those algorithms.

  4. https://genesys-aerosystems.com/the-evolution-of-aircraft-autopilots-from-basic-systems-to-advanced-technology/#:~:text=For%20example%2C%20The%20first%20aircraft,course%20without%20a%20pilot's%20attention.

  5. https://www.smithsonianmag.com/air-space-magazine/the-first-autolanding-3818066/

  6. https://www.af.mil/About-Us/Fact-Sheets/Display/Article/104516/rq-4-global-hawk/

  7. Owned by Alphabet, the company that also owns Google (https://www.facebook.com/SFGate/posts/waymo-owned-by-googles-parent-company-alphabet-is-expanding-its-service-area-wit/1087764066729333/).

  8. https://www.cnet.com/roadshow/news/waymos-self-driving-service-expands-in-california-with-eyes-on-new-york-what-to-know/

  9. Self-driving cars have only reached an experimental stage; in contrast, everything else on this list of fully-automated (use-case II) examples is already in regular operations.

  10. Note that in my experience, self-driving cars are far more difficult to implement than self-driving airplanes; the ground is far more cluttered than the sky, other objects are much nearer (and therefore reaction times are much less), and one cannot see nearly as far ahead on the ground as in the air. This contradicts what I suspect is a common intuition that flying an airplane is more difficult than driving a car.

  11. For example, see https://www.jrautomation.com/blog/examples-of-industrial-automation.

  12. Self-driving cars, which have reached only an experimental stage, may be an exception, perhaps making use of artificial intelligence methods, although most likely only for less time-sensitive matters such as route-planning; collision avoidance is likely performed strictly on a physics and mechanical-dynamics basis. It may be the case that some of the credit-card purchase approval-and-denial process makes use of AI tools.

  13. “Why Language Models Hallucinate”, Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala [Georgia Tech], and Edwin Zhang, September 4, 2025.

  14. Ibid.

  15. Ibid.

  16. Nassim Nicholas Taleb discusses this example in his book Fooled by Randomness, Random House, 2001.

  17. I call this the “2-step method”.

  18. Note that some research indicates that this is likely the way that the human brain actually works; that is, the brain has multiple, independent regions that each create independent answers. We then use our emotions and our judgement to pick among those competing answers (although not always very well!). See Overlapping Solutions, by David M. Eagleman, an essay in the book This Explains Everything, edited by John Brockman, Harper, 2013. Having a computer vote in real-time on solutions created by multiple identical computers is a similar technique used for many years in control systems (e.g., the U.S. Space Shuttle; see https://space.stackexchange.com/questions/9827/if-the-space-shuttle-computers-all-output-contradictory-commands-how-is-it-chos and https://gandalfddi.z19.web.core.windows.net/Shuttle/USA007587%20Rev%20A%20-%20Shuttle%20Crew%20Operations%20Manual%2020081215.pdf, especially the data processing section); since the computers and their algorithms are identical, this is really more of a form of error-correction, rather than competition.

  19. See, for example, https://www.nytimes.com/1981/04/10/us/computers-to-have-the-last-word-on-shuttle-flight.html and https://www.si.edu/object/computer-general-purpose-space-shuttle-ibm-ap-101-processor-unit%3Anasm_A19950160000#:~:text=Summary,were%20replaced%20by%20newer%20models.

  20. Mostly, however, by training a fully-generalized large language model against only a specially-selected set of data (see, for example, AI Assisted Model-Based Systems Engineering with SysML, Doug Rosenberg et al, Lean Publishing / MBSE4U, 2024). I am not aware of significant work in trying to create a large language model for each specific user mission &/or problem domain.

About the Author

Neil Siegel, Ph.D. is the 2023 US National Medal of Technology and Innovation recipient, a member of the US National Academy of Engineering, and the IBM Professor of Engineering Management, Professor of Industrial and Systems Engineering Practice with Distinction, and Professor of Computer Science Practice with Distinction at the University of Southern California (USC). He is a recognized expert in the design and development of large and complex mission-critical intelligent systems that serve important societal needs. He is a well-respected technical leader in the defense sector, and was Sector Vice-President and CTO of Northrop Grumman Corporation. He served on the Defense Science Board, and was elected to the U.S. National Academy of Inventors and the National Academy of Artificial Intelligence (AAI). Neil received his B.A. (mathematics), M.S. (mathematics), and Ph.D. (industrial and systems engineering) degrees from USC.

Disclaimer: The information in this digest is provided “as is” by the SAFE AI FOUNDATION, USA. The use of the information provided here is subject to the user’s own risk, accountability, and responsibility. The SAFE AI FOUNDATION and the author are not responsible for the use of the information by the user or reader. The opinions expressed in this article are solely those of the author, not the SAFE AI Foundation. All copyrights related to this article are reserved by the author. Please reference this article if you wish to cite it elsewhere.

Note: The SAFE AI Foundation is a non-profit organization registered in the State of California and it welcomes inputs and feedback from readers and the public. If you have things to add concerning the impact of AI for Mission Critical Applications and would like to volunteer or donate, please email us at: contact@safeaifoundation.com