2026/05/28

Confidently Wrong - Hallucination, Bias, and the Limits of LLMs

"There were never any utilities for enlightenment"...

Today, Large Language Models (LLMs) are often presented as the promised land of computing. When ChatGPT was released in 2022, it took the IT community (and the general public) by storm. The first impression was that of a nice little toy (“oh look, it speaks my language”), but that rapidly gave way to the impression of a versatile tool that could do any task for us. It felt as if not having the right tool had been the only thing keeping everyone from enlightenment.

The LLM Revolution

With the success of ChatGPT, the term “Large Language Model” rapidly became everybody’s darling. Every IT-adjacent manager was suddenly dreaming of a chatbot to fix their business.

However, not all “LLMs” behave the same. The term bundles together many model families and deployment patterns that differ in capability and risk. Important distinctions include: base models versus instruction‑tuned models; models augmented with retrieval or tools (RAG/plug‑ins) versus closed‑domain generators; models fine‑tuned for specific tasks or companies versus general public models; and multimodal models that handle images and audio versus text‑only models. Safety filters and context windows further change behavior. Each of these choices affects hallucination rates, bias patterns, privacy exposure, and suitability for specific use cases.

See Bender et al., “On the Dangers of Stochastic Parrots” (2021) for a discussion of scale, data provenance, and ethical risks of large language models.

Treating LLMs as a single monolith leads to wrong assumptions about reliability and governance. A risk assessment should therefore consider model architecture, training, fine‑tuning history, augmentation, and deployment controls.

These differentiators affect the behavior of the model:

Instruction‑tuned GPT vs. base transformer: instruction tuning reduces irrelevant output and increases helpfulness.
Retrieval‑augmented systems: can ground answers in up‑to‑date documents and reduce hallucination, but introduce new attack surfaces (poisoned retrieval).
Fine‑tuned domain model: often much better for specific tasks (medical notes, legal text) but can amplify dataset biases.
Closed vs. open models: closed commercial APIs may offer monitoring and safety layers; open weights allow on‑premise use for privacy but require your own guardrails.

See Brown et al., “Language Models are Few-Shot Learners” (GPT-3 paper, 2020) for an explanation of base model capabilities and limits of few‑shot/generalization.

See Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (RAG, 2020) for information on how retrieval grounding works and how it can reduce hallucination.

Yet even today, we can observe a tendency to discard these differentiators and turn to the next best LLM for help: Our conversion rates are low? Let’s integrate ChatGPT into our sales routine. Our customers are unsatisfied? Let’s get Gemini into the support hotline. Our customers do not understand our product? Let’s get Claude to explain it to them.

To some, using a chatbot seems like the obvious answer to all problems.

Looking for world peace? Let’s ask ChatGPT. Starving children? ChatGPT will surely feed the world…

Use cases for LLMs

In hindsight we can tell that this was not the one solution to rule them all. Even some of the IT-adjacent managers might be able to tell by now. In other words: it is pretty obvious. Yet ChatGPT was not just a fashion, not a trend that quickly faded. It is still here, albeit no longer as the solution for everything. Asked what people most commonly use it for, ChatGPT produces the following list:

Programming & debugging
Education & homework help
Writing & rewriting (emails, essays, CVs)
Language translation
General knowledge & explanations
Data analysis & spreadsheets
Creative writing
Career advice
Math problem solving
Productivity / summarization

Some of the points on this list are obvious. LLMs are good at rewriting and creating text under constraints. At least numbers 2, 3, 4, 7 and 10 of the above list fall under that category. It gets more interesting with number 5: “General knowledge & explanations” could still count as rewriting text, as it could be rephrased as “take the Wikipedia article and rephrase/summarize it”.

This leaves us with numbers 1, 6, 8 and 9. Let’s take a more detailed look at these and start with “Career advice”. Career advice is something very personal, and it may be surprising that people ask ChatGPT for it. On a certain level, however, it makes sense: ChatGPT was trained on a large share of the public internet and can plausibly synthesize typical career paths from that material. Given a reasonable starting point, it is likely to find answers to career questions that help its users. Similar reasoning applies to “Math problem solving”: if the math problem in question has already been solved, chances are good that ChatGPT has seen a valid solution somewhere in the wealth of scientific papers in its training data.

Which leaves us with number 1. This is where it gets really tricky. In my experience, ChatGPT can be quite good at simulating understanding of a problem. It is very highly trained at giving you answers in a certain vocabulary that sound right and might even follow a certain reasoning. Programming, however, falls (at least in my opinion) squarely into the category of creative work. From a white screen with a blinking cursor, the programmer creates a series of commands that execute certain tasks given certain conditions. Often, the problems being solved in a program will be problems that have already been solved, and for these areas you are quite likely to get acceptable results. At other times, however, a programmer creates new and innovative solutions. Then your LLM will most likely not have ready solutions at hand to learn from, and its output will only be as good as the description you feed it.

Similar assumptions hold for creative writing, of course. A prompt like “write a story about a frog” without further context will most likely not give you the read you are expecting; more detail is needed if you want certain content to be in the story. But language is more forgiving than code. Your frog story might not be the read you were looking for, but it will be syntactically correct and understandable.

In a story, it does not matter whether you describe the frog first or the pond it lives in. In program code, the order of statements matters a lot: “print results” should always come after “check permissions”, otherwise the side effect might not be what you were going after.
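To make the point about statement order concrete, here is a minimal sketch. All names in it (`ALLOWED_USERS`, `check_permissions`, the two `report_*` functions and the `out` list that stands in for the screen) are made up for this illustration.

```python
ALLOWED_USERS = {"alice"}  # hypothetical allow-list, for illustration only


class PermissionDenied(Exception):
    """Raised when a user is not allowed to see the results."""


def check_permissions(user: str) -> None:
    if user not in ALLOWED_USERS:
        raise PermissionDenied(user)


def report_wrong_order(user: str, results: list[str], out: list[str]) -> None:
    out.extend(results)      # "print results" runs first: data is already exposed
    check_permissions(user)  # the check raises, but too late


def report_right_order(user: str, results: list[str], out: list[str]) -> None:
    check_permissions(user)  # fail before anything is exposed
    out.extend(results)


leaked: list[str] = []
try:
    report_wrong_order("mallory", ["quarterly numbers"], leaked)
except PermissionDenied:
    pass

safe: list[str] = []
try:
    report_right_order("mallory", ["quarterly numbers"], safe)
except PermissionDenied:
    pass

print("wrong order exposed:", leaked)
print("right order exposed:", safe)
```

Both functions contain the same two statements and both raise the same exception for an unauthorized user. Only the order decides whether the data has already left the building by then.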

Hallucination and trust issues

But back to the prompts. Apart from the syntactic and logical details that matter in programming, you should remember that LLMs can only synthesize information from their training data. For example, saying “I want a login screen written in React” will most likely yield a reasonable result, because there are millions of login screens out there that your LLM will have seen during training. Saying “I want a screen for managing my assets written in React”, however, will most likely not deliver the result you desire. A human would in this case ask “what assets do you want to manage?”, “how are they managed?” or “are your assets structured in hierarchies?” to better understand what functionality you need. An LLM will typically just assume what asset management usually looks like and produce an outcome based on these assumptions, without ever telling you what they are.

This is a common problem when working with AI and falls under the category of “hallucination”. If an AI model is uncertain of what the right answer is, it will tend to invent facts rather than state its uncertainty. That is because models are trained to maximize the likelihood of plausible continuations of the text. In many cases this works well: the model will write a story, summarize an article or translate a text passage without knowing the details of the context, and its plausible guesses are good enough. In other cases this is a real problem. Imagine a lawyer having their arguments dismissed in court because the cases they cite are made up, or a doctor misdiagnosing a rare disease because of symptoms the AI hallucinated.

As a developer, getting code fragments from your AI that reference functions which either do not exist or which the LLM “forgot” to implement can be exhausting and frustrating at the same time. In these cases, remember that the LLM simulates understanding but is not intelligent in itself: it does not know what its gaps and limitations are, or when it plainly cannot answer a question correctly. See the Wikipedia article on hallucination in AI for more details and examples.
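One cheap way to catch the “function that does not exist” case before running generated code is a static check. The following is a rough sketch using Python’s standard `ast` module; the helper name `undefined_calls` and the sample snippet are my own invention. It is a heuristic only: it looks at plain function calls and ignores methods, star imports and scoping.

```python
import ast
import builtins


def undefined_calls(source: str) -> set[str]:
    """Return names that are called as plain functions but are never defined,
    imported, assigned or passed as a parameter in `source`, and are not builtins."""
    tree = ast.parse(source)
    known = set(dir(builtins))
    called = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            known.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                known.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.arg):
            known.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            known.add(node.id)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            called.add(node.func.id)
    return called - known


# Hypothetical LLM output: normalize_prices is used but never implemented.
generated = '''
def total(prices):
    cleaned = normalize_prices(prices)
    return sum(cleaned)
'''

print(undefined_calls(generated))  # {'normalize_prices'}
```

A check like this does not tell you whether the code is correct. It only flags one very common symptom early, before you spend time debugging.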

See Ji et al., “Survey of Hallucination in Natural Language Generation” (2023/24) to learn more about hallucination in LLMs.

Bias - built into the model

A second and, in my opinion, bigger problem with AI is bias. An LLM is trained on data, so the output it generates will of course depend on that data. Assume that the internet contained only images of female doctors. If you then asked a model to produce an image of a doctor, the result would almost certainly be female, no matter how high the percentage of male doctors is in reality. There have been articles published on this phenomenon. The bias in the training data of LLMs can have several sources:

Historic bias: An LLM is trained on historic data, and older documents make up a large part of its training set. So if a certain viewpoint was strongly favored in the past, it is very likely that this viewpoint is also over-represented in the training data of your LLM. Even if the viewpoint has since been disproven, the LLM will still lean towards arguments for it.
Representation bias: Groups of people or viewpoints that are not strongly represented on the internet (or in the LLM’s training data) will be underrepresented in the model’s output. For example, a lot of the public data available on the internet comes from research, so scientists are likely to be overrepresented whereas bus drivers are likely to be underrepresented.
Measurement bias: The developers of AI models have to decide which features of the training data to use and how to weight them. As the model is usually a black box for users, we do not know how these features are weighted when answers are generated, so we cannot tell from the outside whether the outcome can be trusted.
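The doctor example from above can be reduced to a few lines of code. This is a deliberately naive toy, not how an LLM works internally: the “model” simply reproduces the label frequencies of its training data. The function name and all numbers are assumptions made up for the illustration.

```python
from collections import Counter


def output_distribution(training_labels: list[str]) -> dict[str, float]:
    """A toy 'model' that reproduces the label frequencies it was trained on."""
    counts = Counter(training_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical numbers, chosen only for illustration.
training_data = ["female doctor"] * 100                   # what the model has seen
real_world = {"female doctor": 0.5, "male doctor": 0.5}   # assumed reality

model = output_distribution(training_data)
for label, share in real_world.items():
    print(label, "| real world:", share, "| model:", model.get(label, 0.0))
```

The toy model has no way of knowing about the group it never saw. Whatever is missing from the training data is missing from the output, regardless of what reality looks like.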

There are many more details to explore regarding bias; see, for example, the white paper on bias in AI by the BSI for further explanations.

See Mehrabi et al., “A Survey on Bias and Fairness in Machine Learning” (2021) for a broad survey of bias taxonomy and mitigation strategies.

What are the consequences?

As you can see from just these two problems, LLMs should be used with care: depending on the nature of your problem, the answers might be skewed or simply untrue. And we have not even started to cover ethical topics and privacy issues. Especially the latter can be very challenging. I see developers sharing their entire workspace with third-party models without knowing where these are hosted, who can access their data, or whether the data is protected in any way. And just to emphasize why I see a problem here: their entire workspace includes settings like access to (hopefully) development databases or other third-party services. Why this might be an issue can be read e.g. here.

See Carlini et al., “Extracting Training Data from Large Language Models” (2021) for a demonstration of privacy/data extraction risks.
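One way to reduce this kind of exposure is to redact obvious secrets on your own machine before any text is sent to a third-party model. The sketch below uses two regular expressions that I made up for this purpose (`KEY_VALUE` and `URL_CREDENTIALS`). They catch only simple cases such as `PASSWORD=...` settings and credentials embedded in connection URLs, so treat them as a starting point rather than a complete filter.

```python
import re

# Hypothetical patterns; a real setup needs a list tuned to its own secrets.
# Matches settings like "DB_PASSWORD=hunter2" or "api_key: abc123".
KEY_VALUE = re.compile(r"(?i)(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+")
# Matches the password part of URLs like "postgres://user:password@host".
URL_CREDENTIALS = re.compile(r"(?i)\b([a-z][a-z0-9+.-]*://[^\s:/@]+:)[^\s@]+(@)")


def redact(text: str) -> str:
    """Replace values that look like secrets with a placeholder."""
    text = URL_CREDENTIALS.sub(r"\1[REDACTED]\2", text)
    text = KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    return text


workspace_snippet = (
    "DB_PASSWORD=hunter2\n"
    "DATABASE_URL=postgres://app:s3cr3t@db.internal:5432/dev\n"
    "LOG_LEVEL=debug\n"
)
print(redact(workspace_snippet))
```

Pattern-based redaction will miss secrets that do not look like secrets. It complements, but does not replace, keeping credentials out of the shared workspace in the first place.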

The trend towards explainable AI and a continued scientific and public debate are helping to raise awareness of the capabilities and limitations of LLMs. However, we are far from done, and many of these efforts go beyond what people are able to consider in everyday life. Explainable AI is a good example of what I mean. Having a model where you can understand how a certain output was created is nice, and important in a business setting. In other uses of AI, where accountability and auditability are not the main concern, understanding the model is far too complex for the everyday user. And yet the everyday user needs to be aware of bias, hallucination and other side effects of using an LLM, especially when working in a business context.

Many people treat LLMs as tools they can use without bothering to learn how they work or how to judge their outputs. That’s risky: even better-known evaluation concepts like accuracy, precision, and recall are technical, task-dependent, and often meaningless for open-ended generation unless you instrument and define them for the specific use case. Promises of “explainable AI” sound reassuring, but most XAI methods produce technical artifacts (feature attributions, attention maps, influence scores) that are hard to interpret unless you are an expert. These artifacts can therefore create a dangerously false sense of understanding. In practice, explainability rarely translates into actionable guidance for everyday users. It helps experts debug or audit models, not a product manager deciding whether to trust a contract drafted by an LLM. If we want responsible adoption, we need simpler user-facing indicators (confidence calibrated to task, provenance/citations, clear failure modes), basic user education about limitations, and mandatory human review for high-stakes outputs. These would be much more useful than just adding to the pile of opaque explanations aimed at researchers.

See Ribeiro et al., “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier” (LIME, 2016) and Doshi-Velez & Kim, “Towards A Rigorous Science of Interpretable Machine Learning” (2017) to read about limitations and interpretability challenges with XAI.

Spot on…

Yet, as professionals in a professional setting, we need to think about these questions. We should be aware of bias, of hallucination and of the security aspects of using LLMs in our work environment. Only if we are aware of them can we use LLMs (or any other AI model, for that matter) responsibly and reliably.

So, the goal of this article is not to tell you not to use AI in your work context, but to use it responsibly, especially by openly discussing and creating awareness of the issues outlined here.

Making it actionable

I would like to end this article with a checklist for using LLMs responsibly in daily work:

Define scope & stakes: Classify each use case by impact (low/medium/high). Low-stakes: internal drafting, creative brainstorming. Medium: customer-facing content, summaries that inform decisions. High-stakes: legal, clinical, financial advice, automated decisioning. For medium/high uses mandate human review, sign-offs, and stricter controls.
Pick the right model & deployment: Match model type to task: base models for exploration, instruction-tuned for helpful responses, retrieval-augmented or fine-tuned models for factual or domain tasks. For sensitive data prefer private hosting or on-prem deployments and ensure the vendor’s data-use policy matches your compliance needs.
Specify success metrics: Define success criteria before production: task-specific automated metrics where appropriate (e.g., ROUGE or BERTScore for summarization, BLEU for translation), plus human judgement rubrics for relevance, correctness and safety. Track hallucination rate, calibration (confidence vs. correctness), and user satisfaction metrics rather than only “fluency.”
Ground outputs and show provenance: For factual tasks use RAG, citation-enabled prompts, or deterministic APIs as the primary source. Surface provenance to users (source links, retrieval snippets, timestamps) and prefer designs that let users inspect source material.
Prompt defensibly: Standardize prompt templates that require the model to (a) state assumptions, (b) list sources for factual claims, and (c) indicate uncertainty. Use constrained prompts for structured outputs and include sanity-check instructions (e.g., “If uncertain, respond: ‘I don’t know. Verify with X.’”).
Validate systematically before deployment: Automate checks: unit tests for expected outputs, schema/type validators, and QA-based fact checks (pose model answers as questions to a verifier). Run human audits on sampled outputs and fix prompt or model configurations before scaling.
Monitor in production: Log prompts, responses, model version, and metadata. Track error rates, hallucination incidents, user corrections, and drift in performance over time. Instrument alerting for spikes in failure modes or unusual patterns.
Enforce data hygiene & privacy: Never send secrets or PII to third-party APIs without explicit approval. Redact sensitive fields client-side, apply least-privilege access to logs, enforce retention and deletion policies, and document data flows for audits.
Implement guardrails & safety layers: Apply input/output filters, refusal policies, and rate limits. Block or escalate high-risk queries by default. Use content moderation tooling and fallback behaviors (e.g., route to human agent) when the model indicates uncertainty or policy violations.
Design UX for uncertainty and verification: Don’t hide uncertainty. Surface calibrated confidence, provenance, and clear “verify this” affordances. Provide obvious ways for users to request sources, correct outputs, and escalate to humans. Make trust boundaries explicit in the UI.
Train users & assign ownership: Provide short, role-specific guidance: common failure modes, required checks, and when to escalate. Assign accountable owners for each LLM integration (product, security, legal) and require sign-offs for medium/high-risk deployments.
Prepare incident response procedures: Define how to investigate, roll back, notify, and remediate when the model produces harmful, biased, or leaked content. Keep runbooks and communication templates ready for customers and regulators.
Audit, iterate, and keep a changelog: Schedule periodic audits for bias, safety, and privacy. Re-evaluate models after data, prompt, or architecture changes. Maintain a changelog of model versions, prompt templates, and deployment changes so you can trace regressions.
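Two of the checklist items, “Prompt defensibly” and “Validate systematically”, lend themselves to a small sketch. The code below makes no call to any model. It only shows a prompt template that asks for assumptions, sources and a confidence level, plus a validator for the JSON that comes back. The key names, the three confidence levels and the rule “no sources means human review” are my own assumptions, not an established standard.

```python
import json

# Hypothetical template: it asks for assumptions, sources and uncertainty.
PROMPT_TEMPLATE = (
    "Answer the question below as JSON with exactly these keys:\n"
    '  "answer": string,\n'
    '  "assumptions": list of strings,\n'
    '  "sources": list of strings,\n'
    '  "confidence": one of "low", "medium", "high".\n'
    "If you are uncertain, set \"answer\" to \"I don't know\".\n\n"
    "Question: {question}"
)

REQUIRED = {"answer": str, "assumptions": list, "sources": list, "confidence": str}
CONFIDENCE_LEVELS = {"low", "medium", "high"}


def validate_response(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the response passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, expected in REQUIRED.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            problems.append(f"wrong type for {key}")
    confidence = data.get("confidence")
    if isinstance(confidence, str) and confidence not in CONFIDENCE_LEVELS:
        problems.append("confidence must be low, medium or high")
    if data.get("answer") != "I don't know" and not data.get("sources"):
        problems.append("answer without sources: route to human review")
    return problems


prompt = PROMPT_TEMPLATE.format(question="Which license does library X use?")
reply = '{"answer": "MIT", "assumptions": [], "sources": [], "confidence": "certain"}'
print(validate_response(reply))
```

A validator like this cannot tell whether the answer is true. It only enforces that the model has filled in the fields a human reviewer needs, and it gives you a place to hook in logging and escalation.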

And most important of all, follow this principle: Treat LLMs as powerful assistants that accelerate work but require human supervision, clear responsibility contracts, engineering and governance to make their outputs trustworthy.

Photos:

Candle: Photo by George Becker
Darts: Photo by Jeff Kweba

A German version of this post can be found on the virtual7 Blog