The Real-World Harms of LLMs, Part 1: When LLMs Don’t Work as Expected

Dive deep into the ethical risks arising from the increasing use of LLMs

In the last year, large language models (LLMs) have gone from being relatively unknown to being widely used by individuals and businesses alike. Most of this explosion is owed to the public release of ChatGPT—a free, web-accessible chatbot. Since then, multiple language models have been released, and adoption of the technology has blossomed across industries. As applications are built out, LLMs have the potential to provide a number of social and economic benefits. However, a year before the release of ChatGPT, a group of researchers at Google’s DeepMind published a paper discussing the social and ethical risks of LLMs. While at the time many of those risks were hypothetical, it is worth exploring how the problems they identified are manifesting now that LLM use is much more widespread. 

In a series of two blog posts, we will describe some of the major areas of ethical risk arising from the increasing use of LLMs. We will also highlight opportunities for risk mitigation, both at the individual and organizational level. In the first part, we will outline key requirements for safe and functional LLM applications and discuss risks that arise when LLMs do not meet these requirements. In the second part, we will describe the set of risks that may arise even when LLMs are performing as intended.

Areas of Risk

In this section, we will discuss three requirements for a safe and functional LLM application:

1. High Performance and Reliability

2. Protection of Sensitive Information

3. No Toxic, Discriminatory, or Biased Behavior

We’ll describe how LLMs fail these requirements by identifying and explaining the causes of the major problems in each section. We’ll also outline how each failure is a source of significant risk to users and organizations alike.

It is important to note that specific areas of risk will naturally depend on individual LLM use cases. For the purposes of this discussion, we will focus on use cases related to chatbots, knowledge retrieval applications, and publicly available general-use applications such as ChatGPT. Of course, the areas of risk may evolve over time as new use cases and technologies are developed. 

1. Performance and Reliability

The Problems

Poor Performance: Poor performance refers to the risk caused by models providing incorrect or misleading information, sometimes referred to as “hallucinations.” This behavior is shockingly common in language models (a recent study found that ChatGPT correctly answered fewer than 65% of the test cases it was given). For more information on LLM performance, check out Arthur’s ongoing Generative Assessment Project.

Inconsistency: Early research has indicated that commonly used LLMs show low robustness and consistency, meaning that minor changes in user-provided prompts can result in a large variance in the responses that models produce.

Performance Across Groups: While LLMs can fail for anyone, a further concern is that functionality and performance may not be uniform across users. For example, although ChatGPT officially supports over 50 languages, it performs better in English than in other languages.
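The inconsistency problem described above can be made measurable. The sketch below is only an illustration under stated assumptions: `consistency_rate` and `toy_model` are hypothetical names, and `toy_model` is a stand-in for a real LLM call whose answer deliberately depends on surface wording. The score is the share of paraphrased prompts that produce the most common answer.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_rate(model: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Share of paraphrased prompts whose normalized answer equals the most common answer."""
    if not prompts:
        raise ValueError("need at least one prompt")
    answers = [model(p).strip().lower() for p in prompts]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: the answer flips when one keyword is missing.
    return "Paris" if "capital" in prompt.lower() else "Lyon"


paraphrases = [
    "What is the capital of France?",
    "Name the capital city of France.",
    "Which city is France's seat of government?",
    "France's capital is which city?",
]
rate = consistency_rate(toy_model, paraphrases)
print(f"consistency: {rate:.2f}")
```

A score well below 1.0 on prompts that mean the same thing is a sign that small wording changes are driving the output. The same loop can be run separately per language or dialect to compare performance across user groups.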

Why It Happens

Poor performance occurs because language models are not designed to represent reality or understand concepts. Rather, models are trained to predict the most probable sequence of words based on the data they were trained on. Models can also produce incorrect outputs because they are outdated. Unless the application is specifically designed otherwise, LLMs only contain the information they were trained on. For example, at the time of writing, ChatGPT's training data extended only through September of 2021. One proposed way to prevent outdated outputs is to include current information in the prompt that is provided to the LLM. However, there is still a risk that a model will default to the data it was trained on.
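As a minimal sketch of the "include current information in the prompt" approach, the function below assembles a prompt from the current date and a list of retrieved documents. The function name, the prompt wording, and the example document are all assumptions made for illustration. How the documents are retrieved is out of scope here.

```python
from datetime import date


def build_grounded_prompt(question: str, documents: list[str], today: date) -> str:
    """Prepend the current date and numbered context passages to a user question."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        f"Today's date is {today.isoformat()}.\n"
        "Answer using only the numbered context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical example document; a real application would fetch current sources.
prompt = build_grounded_prompt(
    question="When did policy v3 take effect?",
    documents=["Policy v3 took effect on 2024-01-01."],
    today=date(2024, 1, 15),
)
print(prompt)
```

The instruction to answer only from the context reduces, but does not remove, the chance that the model falls back on what it learned during training.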

LLMs perform particularly poorly for some user groups because those groups are unevenly represented in the data the models were trained on. English is by far the dominant language on the internet, meaning that it is also overrepresented in LLM training datasets. For example, The Pile, a popular open source dataset, consists of 94% English language content. By contrast, languages with relatively low representation on the internet will have low representation in LLM training datasets as well. This means that cultural views and references specific to these languages are also underrepresented in language models, making the models less useful for people who speak these languages. LLMs have similar issues when prompted in English to answer questions about marginalized cultures. Disparate model performance can also occur due to differences in slang usage, dialect, sociolect, education, age, disability status, and other factors that influence how individuals use language.

Why It Matters

As individuals begin using LLMs such as ChatGPT and relying on their responses, they risk making poor decisions based on incorrect information. This risk is particularly high when users take high-stakes actions based on responses from LLMs. For example, a recent survey found that 47% of American respondents reported having used ChatGPT for stock recommendations. When a lawyer used ChatGPT to prepare research for a case, the chatbot provided several case examples that were not real, putting both the law firm and its client at risk. Given the frequency of hallucinations, use of tools such as ChatGPT for financial, medical, and legal advice can cause significant harm to users. In a previous blog post, we described in depth how a user’s mental model of a technology can lead to overreliance.

Differential performance of LLMs across languages and cultures risks increasing inequality both inside and outside organizations. Students, employees, and entire businesses who are able to make use of language models will reap the benefits, while those who cannot will be at a disadvantage. With broad societal adoption of LLMs, this risk may contribute to growing global inequality.

As corporations and other organizations begin to make use of LLMs internally, they will have to grapple with the risk of incorrect outputs, especially when the models are being relied on to provide factual information. Seemingly minor individual errors on the part of LLMs can lead to bad decisions at scale. In cases where LLM applications such as Google Bard and Microsoft Bing are used as search engines, unreliable information may lead users to lose confidence in publicly available information as a whole.

2. Sensitive Information

The Problems

Private Individual Data Exposure: Personally identifiable information (PII) or other private information about individuals can be unintentionally released or maliciously extracted from LLMs that have access to the information or have learned it from their training data. This is starkly evidenced by the recent news of a Federal Trade Commission investigation into OpenAI (the company that developed ChatGPT and other similar models) related to how the company handles personal data in its models. ChatGPT was temporarily banned in Italy and faces scrutiny in other European countries over its data privacy practices, which may violate Europe’s General Data Protection Regulation (GDPR).

Inappropriate Confidential or Proprietary Information Exposure: Beyond the risk to private personal information, there is an emerging risk that trade secrets, intellectual property, or government secrets could be exposed by language models. Samsung recently banned the use of ChatGPT by its employees due to information leakage, and other organizations report concerns about the release of corporate secrets through ChatGPT as well.

Why It Happens

Training datasets used for ChatGPT and other LLMs include data scraped from all across the internet, including data that is personal to individuals and may be intended to be private. A Washington Post investigation into C4, a dataset frequently used to train LLMs, uncovered data from websites hosting voter registration databases as well as from social media platforms such as Facebook and Twitter. While these sources may not explicitly contain personally identifiable data, sophisticated language models may well be capable of reconstructing or even correctly guessing personal details based on information gleaned from these sources. When asked for sensitive information in a prompt, LLMs do not have the context to know what information can or cannot be released.

Why It Matters

Release of private or personal information poses a significant threat to individual users; however, this threat is magnified as LLMs begin to be used at scale in large organizations. LLMs used at this scale without proper safeguards are at risk of causing large-scale data breaches. Researchers have also noted that the potential (though not yet observed) exposure of government or military secrets could pose a significant threat to national security. They also highlight that the ethics of secrecy in areas such as national security, scientific research, and trade secrets are far from universal. Navigating information privacy and release in these domains may create complex and nuanced scenarios that are difficult to manage through technical solutions, meaning that data privacy regulations may be needed to address these new problems.

3. Toxic, Discriminatory, and Biased Behavior

The Problems

Toxicity: Toxicity in language models relates to the use of offensive, vulgar, or otherwise inappropriate language, usually in the form of the model’s response to a prompt. Toxicity is deeply interwoven with discrimination and biased outputs, as it often comes in the form of slurs or derogatory language towards marginalized groups. 

Harmful Stereotypes: There is a risk of perpetuating harmful stereotypes which, though not explicitly toxic, are damaging to the groups they impact.

Unfair Toxicity Labeling: Researchers have noted that labeling language as “toxic” is in and of itself subjective and contextual. There is no universal definition of what constitutes toxic language. This can lead to scenarios in which language sourced from marginalized groups is unfairly labeled “toxic.”

Why It Happens

Toxic outputs occur because models are trained on vast amounts of language data. Any toxic language, or otherwise harmful biases within the training data, is learned by the model. Since LLMs work by predicting the next most likely word in a sequence, they regularly output stereotypes based on the language they are trained on. 
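To make the mechanism concrete, here is a toy bigram predictor. It is far simpler than an LLM, which conditions on much more context, and the three-sentence corpus is invented for illustration. It shows only the basic effect: whichever association is most frequent in the training text is the one the model emits.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the training sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Invented corpus with a skewed pronoun distribution.
corpus = [
    "the engineer fixed his laptop",
    "the engineer fixed his code",
    "the engineer fixed her laptop",
]
model = train_bigram(corpus)
predicted = predict_next(model, "fixed")
print(predicted)
```

Because "his" follows "fixed" twice and "her" only once, the predictor always continues with "his". The skew in the data becomes the model's default behavior.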

One common approach to handling toxicity in language models involves filtering the model’s training data to remove harmful language. Given the subjectivity of notions of toxicity, it may be impossible to perfectly filter toxic content from a dataset. Any filtering system that could be devised would leave content that is considered harmful to some, while inappropriately censoring content that others consider non-toxic. The Washington Post investigation of the C4 dataset found that the dataset contains large amounts of harmful content including swastikas, white supremacist and anti-trans content, and content about false and dangerous conspiracy theories such as “pizzagate.”
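The limits of filtering can be seen even in a very small example. The sketch below uses a word blocklist, which is only one simple filtering technique and is not claimed to be how any particular dataset was built. The blocklist entries and the sample documents are hypothetical.

```python
import re

BLOCKLIST = {"kill", "trash"}  # hypothetical blocklist, for illustration only


def is_flagged(text, blocklist=BLOCKLIST):
    """Flag a document if any whole word in it appears on the blocklist."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return not words.isdisjoint(blocklist)


def filter_corpus(docs):
    """Keep only the documents that the blocklist does not flag."""
    return [d for d in docs if not is_flagged(d)]


docs = [
    "Use kill -9 to stop the stuck process.",  # benign technical text, yet removed
    "People like you do not belong here.",     # hostile text, yet kept
    "Take the trash out on Tuesdays.",         # benign text, yet removed
    "The library opens at nine.",              # benign text, kept
]
kept = filter_corpus(docs)
print(kept)
```

The filter removes two harmless sentences and lets a hostile one through, because the harm in that sentence comes from context rather than from any single word.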

Why It Matters

Individual users of LLMs may experience psychological distress when faced with toxicity and harmful stereotypes produced by the model. At the organizational level, exposure to and tolerance of toxic language can have broader impacts on the culture of an organization as a whole. Failure to prevent toxicity produced by LLMs may contribute to degrading the culture and norms within an organization, with a particularly negative psychological impact on individuals in marginalized groups.

As LLMs become more broadly used across society, the problem of toxicity becomes more complex. Researchers in this area have noted that the concept of toxicity is ambiguous and contextual. There is a risk that attempts to mitigate toxicity without the ability to take context into account may disproportionately impact marginalized groups and reduce model performance in ambiguous contexts. Similar tensions between competing values also arise in standard machine learning models, as discussed by the Arthur research team in a previous blog post. Failure to remove social biases from LLMs when they are used at this scale will mean entrenching these biases all the more deeply in the technology we use and in society as a whole. 

Mitigation Approaches

Managing the risks of LLM failures is largely in the hands of LLM developers, researchers, and ultimately policymakers and regulators. Extensive technical work and research are needed to fully understand and address the limitations of these models. Beyond that, policy work will be required to ensure that LLMs are functional and safe both for individuals and for society as a whole. As LLMs become more widespread, transparency around the data they are trained on, their functionality, and risk factors related to privacy, fairness, and toxicity will be an essential component of any future regulations. In the meantime, there are mitigation approaches that individuals and organizations can take now.

For Individual Users

While ensuring the safety and functionality of LLMs is the responsibility of LLM developers and LLM application developers, there are some approaches users can take to avoid major LLM failures. Users of LLMs, even publicly accessible ones such as ChatGPT, should stay informed on the accuracy, reliability, and limitations of the models they are using. It is also important to stay aware of the use cases that the model is intended for, double-check any information that a model provides if it is being used to make a decision or inform a belief, and note how up-to-date the model is. Most LLMs in use today are not designed to provide information on news or current events. Some LLM-based search engines such as Google’s Bard and Microsoft’s Bing provide citations along with model outputs to improve trust in the models. However, even with citations, these models have been found to be frequently incorrect, sometimes citing sources that do not exist. Users should also avoid sending private or confidential information when prompting LLMs, as prompts may be logged.
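One way to act on the last point is to strip obvious identifiers from a prompt before it leaves the user's machine. The sketch below is a rough illustration, not a complete PII detector: the two regular expressions are assumptions that cover only email addresses and one common phone number format, and they will miss names, addresses, account numbers, and many other kinds of sensitive data.

```python
import re

# Assumed patterns for illustration; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(prompt):
    """Replace email addresses and 3-3-4 digit phone numbers with placeholders."""
    prompt = EMAIL_RE.sub("[EMAIL]", prompt)
    return PHONE_RE.sub("[PHONE]", prompt)


raw = "Email jane.doe@example.com or call 555-123-4567 about the invoice."
clean = redact(raw)
print(clean)
```

Pattern-based redaction is a backstop, so the safer habit remains leaving confidential details out of prompts in the first place.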

While LLMs that are easily accessible to the public usually have some mitigation measures in place to prevent toxic and biased language, it is possible that these models will still output language that is distressing or discriminatory against some users. Unfortunately, it is difficult for a user to completely avoid such behavior when using LLMs. Similarly, it is difficult for users to anticipate and avoid issues of unequal functionality, as this information is not generally well-documented.

For Organizations Developing LLM Applications

Organizations considering LLMs should first evaluate whether the use case they are considering can reasonably and safely be accomplished by an LLM. It may be that some use cases are too high-risk. Given the high frequency of hallucinations, an LLM should not be relied on as the sole source of factual information. If an LLM is to be used, the organization will need to ensure that the systems it is putting in place are properly designed for the specific use case the LLM is intended for. This may mean using bespoke models (rather than the general-use APIs) or models that are updated with or have access to organization-specific data. LLM systems must be designed for the specific organization and use case, including careful design of prompting systems to prevent improper use. It is also advisable to discuss performance, bias and toxicity, and information security with LLM vendors and to consider incorporating additional security, validation, and monitoring solutions into LLM systems. Information on model limitations, weaknesses, and risks should be well understood, and mitigation measures should be documented. Organizations will also need to ensure that all end users are provided with training on how to safely use the model and what its limitations are.


While this blog focuses on the risks posed by the adoption of LLMs, the aim is not to suggest that they should not be used at all. Rather, we aim to arm users and organizations with the knowledge and tools needed to use this technology safely and responsibly. Enthusiasts believe that LLMs have the potential to increase productivity across industries, support personalized education, and provide new approaches for scientific research. To achieve these lofty outcomes, it is essential that we design LLM applications that can be trusted.

So far, we’ve discussed the main types of ethical risk that arise due to functionality issues in LLMs. While there are mitigation approaches that can be taken by individuals and organizations, these problems really highlight technical and regulatory gaps in the AI space that will need to be addressed as LLMs become more widespread. In Part 2 of this series, we will discuss the ethical risks that may arise when LLMs do function as intended.

Learn more about Shield, Arthur’s firewall for LLMs, and Bench, Arthur’s open source LLM evaluation tool.

* No LLMs were used in the writing of this blog post.


What specific regulatory measures could effectively mitigate the ethical risks associated with LLMs?
Regulatory measures aimed at mitigating ethical risks in LLMs should prioritize transparency, user consent, and accountability. Clear guidelines on data usage, model decision explanation, and privacy protection are crucial for ensuring ethical compliance.

How can LLM developers ensure their models are up-to-date with the latest information and understandings?
LLM developers can keep their models up-to-date by incorporating continuous learning processes and staying informed about the latest information and understandings in their respective fields. Regular updates and integration of diverse perspectives help mitigate biases and enhance model reliability.

In what ways can organizations balance the benefits of LLM usage with the potential for perpetuating inequalities?
Organizations can balance the benefits of LLM usage with the potential for perpetuating inequalities by implementing measures that promote fairness and inclusivity. This includes ensuring diverse representation in data sources, actively addressing biases, and regularly evaluating the impact of LLM usage on marginalized communities. Ethical decision-making processes and collaboration with stakeholders are essential for addressing potential inequalities effectively.