By Miriam Fernández, CFA, Alexander Johnston, Sudeep Kesh, Svetlana Ashchepkova, and Dan Bennett
Highlights
Advancements in natural language processing (NLP) have enabled artificial intelligence, notably in the form of large language models (LLMs), to mimic and often surpass human linguistic ability, offering meaningful competitive advantages to organizations that use the technology.
These gains come with new risks. LLMs, for example, are subject to model and data bias, struggle to distinguish fact from fiction, and are vulnerable to malicious actors.
Technological improvements, such as the addition of context to data, are reducing errors, while new-generation large multimodal models (which process information in multiple formats, such as images, audio, and text) promise improved training capabilities.
Increased reliability and new applications of AI language modeling, including customization and creative exploration, promise further benefits that could be reflected in the credit quality of entities that effectively harness its potential.
Humanity had a monopoly on language for about 50,000 years. That is no longer the case.
Machines operating (and operated by) programs capable of language modeling or so-called "natural language processing" have proven to be adept at processing human-like communication and using those inputs to produce meaningful output and interactions. Indeed, the speed at which technology in this field is progressing has meant machines are not only already better than people at many language-based tasks but are increasingly so advanced that they offer a compelling (and growing) competitive advantage over humans.
That will have significant implications for many facets of society. Not the least of those will be the business world, where S&P Global expects that adoption of natural language modeling technologies will become a major driver of investment spending, competitive advantages, productivity, profitability, and ultimately creditworthiness.
Computer programs that employed the basics of natural language have been with us since the late 1960s, when Terry Winograd, then a researcher at the Massachusetts Institute of Technology (MIT), created SHRDLU. The program used basic natural language and in doing so demonstrated that machines and humans could maintain a comprehensible conversation, even if only about moving building blocks.
SHRDLU was an early example of natural language understanding (NLU) and, more broadly, natural language processing (see Figure 1). Other early efforts included ELIZA (developed at MIT in the mid-1960s by Joseph Weizenbaum) and NLP initiatives linked to the US Defense Advanced Research Projects Agency (DARPA). It wasn't until the 1970s, however, that increased computing power, coupled with innovations in algorithms related to human languages, enabled significant, albeit slow and laborious, progress in NLP.
NLP is the field of AI that enables machines to gather unstructured language, such as text or speech, and process it in a meaningful way. That ability underpins the functional applications of natural language machines by enabling human-machine interaction through communication modes traditionally associated with humans, such as speech, text, and images.
Beyond understanding language, machines that use NLP can also enable decisions by adding structure to unstructured language. They do that by converting plain text into structured data that can be better analyzed. An example of unstructured data is: "I'm looking for a restaurant that sells ham and mushroom pizza in Madrid for less than 15 euros." That same information, presented as structured data, would look like this: "Ingredients: ham, mushroom. Type of food: pizza. Region: Madrid. Price limit: 15 euros."
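To illustrate the idea (and only as an illustration; the field names and matching rules below are hypothetical, not drawn from any particular system), the conversion from free text to structured data can be approximated in a few lines of Python:

    import re

    def extract_restaurant_query(text: str) -> dict:
        """Toy rule-based extraction of structured fields from a free-text query."""
        lowered = text.lower()
        ingredients = [w for w in ("ham", "mushroom", "pepperoni") if w in lowered]
        food_type = "pizza" if "pizza" in lowered else None
        price = re.search(r"(\d+)\s*euros?", lowered)        # capture a price limit
        region = re.search(r"\bin\s+([A-Z][a-z]+)", text)     # capture a capitalized place name
        return {
            "ingredients": ingredients,
            "type_of_food": food_type,
            "region": region.group(1) if region else None,
            "price_limit_eur": int(price.group(1)) if price else None,
        }

    query = ("I'm looking for a restaurant that sells ham and mushroom pizza "
             "in Madrid for less than 15 euros.")
    print(extract_restaurant_query(query))
    # {'ingredients': ['ham', 'mushroom'], 'type_of_food': 'pizza', 'region': 'Madrid', 'price_limit_eur': 15}

Modern NLP systems learn this mapping from data rather than relying on hand-written patterns, but the result, structured fields that downstream systems can query, is the same in spirit.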
Describing such machines as "understanding" text is something of a misnomer, though it can often seem that they do. In fact, language models estimate the probability of a sequence of words and thus can generate the most likely word in a given context. They do this using deep learning models constructed as neural networks, called sequencing models, that recognize complex patterns and work particularly well with sequential data, such as audio clips, video clips, and text streams.
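As a minimal sketch of that statistical idea (using a toy corpus invented for illustration), a bigram model estimates the probability of the next word from how often word pairs co-occur:

    from collections import Counter, defaultdict

    # Toy corpus; a real language model is trained on billions of tokens.
    corpus = "the cat sat on the mat . the cat ate the fish .".split()

    # Count how often each word follows each preceding word (bigram counts).
    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[prev][nxt] += 1

    def next_word_probability(prev: str, nxt: str) -> float:
        """Estimate P(next word | previous word) from bigram frequencies."""
        total = sum(bigram_counts[prev].values())
        return bigram_counts[prev][nxt] / total if total else 0.0

    print(next_word_probability("the", "cat"))   # 0.5: "the" is followed by "cat" in 2 of its 4 occurrences

LLMs replace these simple counts with neural networks conditioned on far longer contexts, but the underlying task, predicting the most probable next word, is the same.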
Sequencing, in this context, refers to the correct ordering of steps in a process to create a sensible output. For example, to make pizza: first, dough must be shaped; second, sauce and toppings are added; and finally, it is baked. Alter any of those steps, and the result won't be a pizza.
Sequencing models pass information from one step of a neural network to the next. The neural networks used for this type of operation, called recurrent neural networks (RNN), can (in a manner of speaking) remember past steps and use that information to make logical predictions so their output makes sense. There are other types of neural networks, such as convolutional neural networks (CNN) — typically used for image processing. A more detailed explanation of neural networks is provided in our earlier article on the subject (see "Machine Learning: The Fundamentals," Nov. 29, 2023).
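A highly simplified sketch of that recurrence (random, untrained weights, purely illustrative) shows how information from earlier steps persists in the hidden state:

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_size, input_size = 4, 3

    # Randomly initialized weights; a trained RNN would learn these from data.
    W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
    W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden ("memory") weights

    def rnn_step(x, h_prev):
        """One recurrent step: the new hidden state mixes the current input with the previous state."""
        return np.tanh(W_xh @ x + W_hh @ h_prev)

    h = np.zeros(hidden_size)
    for x in rng.normal(size=(5, input_size)):  # a toy sequence of five input vectors
        h = rnn_step(x, h)                      # earlier inputs keep influencing h at every step
    print(h)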
When processing text, correct sequencing is typically insufficient to effectively assess inputs without the addition of context. This problem was addressed in 2017 by developers at Google with the creation of the transformer architecture (see Figure 2), which assesses multiple input dimensions at once, rather than just sequences. The breakthrough concept is called the "attention mechanism." It drives understanding of context by assessing how words in a sentence are related, even if they are sequentially distant from each other. The mechanism also assigns "weights" to words; the weighting corresponds to the probability of words being similar or co-occurring in a context. Words with a greater weight are considered more important to understanding the text. Weights are also applied to unseen data to make word predictions or generate text, thus enabling generative AI. ChatGPT (a generative pre-trained transformer) is the most popular example of a large language model (LLM) that uses this transformer architecture.
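The heart of that mechanism, scaled dot-product attention, can be sketched in a few lines (toy dimensions and random vectors, purely illustrative):

    import numpy as np

    def attention(Q, K, V):
        """Scaled dot-product attention: the weights reflect how strongly tokens relate to one another."""
        scores = Q @ K.T / np.sqrt(K.shape[-1])                            # pairwise similarity between tokens
        weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # softmax over each row
        return weights @ V                                                 # context-aware mixture of values

    rng = np.random.default_rng(1)
    tokens, dim = 5, 8                        # five tokens, eight-dimensional representations
    Q = K = V = rng.normal(size=(tokens, dim))
    print(attention(Q, K, V).shape)           # (5, 8): each token's output now reflects every other token

In a real transformer, Q, K, and V are learned projections of the token embeddings, and many such attention "heads" run in parallel across many layers.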
While both NLP and LLMs are tasked with understanding human language, and both may be capable of performing similar tasks, such as language translation and text summarization, LLMs are in fact a subset of NLP (see Figure 3).
NLP's umbrella status reflects its position as a broader scientific field encompassing the understanding and processing of natural languages.
NLP typically involves two techniques: rule-based methods and statistical (machine learning-based) methods, both described below.
The roots of NLP date back to early AI research in the 1950s and 1960s. At that time, statistical approaches of the sort used today could be imagined, but limited computing power, coupled with the limited availability of large volumes of machine-readable text, led research to focus on rule-based NLP, which codified language rules to enable machines to process text. An example of this approach is word stemming, which reduces words to their base form (or stem) to improve text search. For example, "painting" becomes "paint." Algorithms that do this generally use hand-crafted rules created by linguists for a particular language.
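A rule-based stemmer can be sketched with a few hand-written suffix rules (illustrative only; production stemmers such as the Porter stemmer use far richer rule sets):

    def simple_stem(word: str) -> str:
        """Strip a few common English suffixes to reduce a word to a crude stem."""
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    print([simple_stem(w) for w in ["painting", "painted", "paints", "paint"]])
    # ['paint', 'paint', 'paint', 'paint']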
Greater computing power, combined with the amount of text available on the internet, provided the means to explore statistical approaches to NLP. The effectiveness of that approach has narrowed the application of rule-based NLP, but not eliminated it. LLMs are now capable of stemming without linguists' help, but rule-based algorithms' compactness and performance mean they remain valuable for NLP.
The path from NLP's debut to modern LLMs was notably facilitated by the invention of embeddings (popularized in 2013 by Google's Word2Vec), which are numerical representations of words in a multidimensional space, arranged so that words with similar meanings sit close to each other. Further, while words vary in length, each embedding is the same size, facilitating onward processing. Modern LLMs begin by transforming input text into embeddings for further processing.
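A minimal sketch of the intuition behind embeddings (hand-assigned toy vectors, not real Word2Vec output) is that similarity of meaning becomes similarity of direction in vector space:

    import numpy as np

    # Toy three-dimensional embeddings; real models use hundreds of learned dimensions,
    # and these particular values are invented for illustration.
    embeddings = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.7, 0.2]),
        "pizza": np.array([0.1, 0.2, 0.9]),
    }

    def cosine_similarity(a, b):
        """Words with similar meanings should have vectors pointing in similar directions."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1 (related meanings)
    print(cosine_similarity(embeddings["king"], embeddings["pizza"]))  # much lower (unrelated meanings)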
Language modeling's uses were, until recently, limited principally to translation, auto-completion of words and sentences, and grammar correction. That changed with the development of large language models, which expanded NLP's abilities from around 20 common uses to over 200 possible tasks. That expansion of abilities came with a massive increase in complexity — some LLMs employ over 1 trillion parameters (effectively equating to the number of variables they use to make predictions).
LLMs have the potential to increase automation for knowledge workers, improving their productivity, saving time, and enhancing decision accuracy, according to a survey conducted by S&P Global Market Intelligence 451 Research (see "Generative AI use cases could boost document and content management software," Sept. 13, 2023). LLMs could also act as co-pilots for businesses where they can provide support services, perform a wide range of time-consuming tasks, and assist (and in some instances replace) decision-makers (see Table 1).
LLMs' risks and limitations mean they require constant human oversight. Some of these risks are inherent to models trained on large and broad data sets, known as foundation models, which power generative AI. They include hallucinations (nonsensical output due to patterns that are nonexistent or imperceptible to humans), output inaccuracies, biased responses, and data privacy issues. We explained these risks in detail in our report "Foundation models powering generative AI: The Fundamentals," Nov. 29, 2023.
Yet LLMs also have particular risks and limitations. They notably include:
A reliance on human feedback for performance improvement. Reinforcement learning with human feedback (RLHF) is a time- and resource-intensive machine learning technique in which human feedback serves as a measure of an LLM's performance and is used to optimize the model. RLHF notably was used in the transition from GPT-3 to GPT-3.5. The technique is not new: supervised learning techniques that have been used for years in machine learning (known as discriminative AI) require human involvement for data labeling, model evaluation, and feature engineering.
A lack of language diversity. So far, LLMs are fluent mostly for English speakers. Reaching comparative fluency in other languages will depend in part on the ability to collect, or find online, digital text in those languages. Efforts are underway to train LLMs on other languages and cultural nuances (including the multilingual Falcon LLM, which is being developed by Abu Dhabi's Technology Innovation Institute), but there is an increasing risk of LLMs becoming biased toward English and Anglo-Saxon linguistic and sociocultural patterns and trends.
A growing reliance on externally developed models that raises ethical and operational concerns. Many companies use commercially licensed pre-trained LLMs (see Figure 4), mainly due to budget limitations and skill shortages. Examples include Falcon, Llama2 (developed by Meta Platforms Inc.), and Bloom (developed by BigScience). This practice offers cost and environmental benefits (due to computational efficiency) compared to developing a proprietary LLM. Yet it can also introduce additional risks, given companies' limited ability to measure model performance, and can create critical dependencies on third-party infrastructure and providers.
A susceptibility to intellectual property problems. This is a complicated arena, even without considering AI, due to varying laws, standards, definitions, and protections across jurisdictions. Much of the content created by generative AI evokes a quality of "realness" due to a perception of familiarity or sense of emulation that originates in an observer's brain. A similar concept, in which a likeness between two works is found to partially depend on an observer's familiarity with the earlier work, has played a role in copyright infringement cases, notably including a lawsuit brought against, and later won by, Ed Sheeran (see "Did Ed Sheeran copy Marvin Gaye's 'Let's Get It On'?" Reed R., Harvard Law Today, May 1, 2023).
Computational power and energy consumption issues. Training an LLM from scratch requires significant power. This environmental cost can be mitigated (see Figure 5) by reusing LLMs and fine-tuning them for specific domains, which usually involves training an existing model on a smaller, labeled dataset, resulting in updated model weights. For instance, BERT (an LLM developed by Google) has been fine-tuned to generate domain-specific models, such as FinBERT for financial services. A second option, which requires an even less computationally intensive training regime, is in-context learning, where models generate context-relevant responses based on provided prompts. This is primarily used for GPT-based models and involves prompting an LLM with just a few labeled examples, without adjusting the model's weights (a minimal prompting sketch follows this list). Knowledge acquired through in-context learning is arguably not stable, because weights are not updated and output variability can be high, depending on the prompts used. Nonetheless, it can be useful for enhancing a model's accuracy on a specific task. For instance, a GPT model can be prompted with explicit guidance to improve its performance on tasks such as sentiment analysis or text classification.
The inability to differentiate between real-world knowledge (known as ground truth) and false knowledge. LLMs generate text by applying statistical patterns learned at the training stage. Unlike humans, LLMs have no real-world experience or explicit knowledge, and they lack the ability to verify new information. LLM-generated content is a superficial output of text, not an inherent statement of fact nor a recitation of knowledge in the traditional sense. This superficiality means LLMs can be manipulated through biased inputs during the training and fine-tuning stages, or through adversarial attacks (see the following section, "LLMs and adversarial attacks") that deliberately attempt to manipulate a model using input data crafted to create incorrect predictions. Generative AI technology used to create so-called "deepfake" audio or video emulations is among the most insidious examples of this false knowledge creation (see "Can generative AI create a productivity boom?", Jan. 10, 2024).
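To make the in-context learning point above concrete, the sketch below builds a few-shot prompt for sentiment classification. The examples and labels are invented, and the call to a hosted model is deliberately omitted because it is provider-specific; no model weights are changed at any point.

    # Few-shot (in-context) sentiment classification: the model is simply shown
    # labeled examples inside the prompt, and no weights are updated.
    examples = [
        ("Revenue beat expectations and guidance was raised.", "positive"),
        ("The company missed earnings and cut its dividend.", "negative"),
    ]

    def build_few_shot_prompt(new_text: str) -> str:
        lines = ["Classify the sentiment of each statement as positive or negative.", ""]
        for text, label in examples:
            lines.append(f"Statement: {text}\nSentiment: {label}\n")
        lines.append(f"Statement: {new_text}\nSentiment:")
        return "\n".join(lines)

    prompt = build_few_shot_prompt("Margins improved for the third consecutive quarter.")
    print(prompt)   # this prompt would then be sent to the chosen LLM

LLMs and adversarial attacks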
Adversarial attacks are deliberate actions meant to corrupt models, forcing them to make incorrect predictions. Such attacks can be characterized as "white box" (where attackers have access to the model parameters); "black box" (where access is limited and the attacker interacts with the model through prompt engineering, prompt injections, or adversarially crafted prompts); or "mixed" (where attackers have partial access to the model).
Through adversarial attacks, humans may influence a model's behavior to make it produce desired (and biased) content. That said, LLMs are less prone to adversarial attacks than other types of generative AI models (such as image recognition models) due to higher dimensionality (the number of unique terms in a data set) and the observability of output by humans. Furthermore, adversarial attacks can be mitigated using techniques that augment training datasets with adversarial examples and noise. Developers can also make LLMs less vulnerable to perturbations by narrowing the range of embedding options, which simplifies the way real-world words are mapped internally so the AI model can process them.
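One such data-augmentation defense can be sketched as simple perturbations of the training text (character-level noise here is a stand-in for more sophisticated adversarial example generation):

    import random

    def perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
        """Inject simple character-level noise to mimic adversarial perturbations of training text."""
        rng = random.Random(seed)
        chars = list(text)
        for i, c in enumerate(chars):
            if c.isalpha() and rng.random() < rate:
                chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
        return "".join(chars)

    clean = ["The outlook was revised to stable.", "Leverage remains elevated."]
    augmented = clean + [perturb(s) for s in clean]   # train on both clean and noisy variants
    print(augmented)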
Increasing customization: As LLMs mature, we expect a shift from very large, broad models to smaller and narrower models built for specific use cases. FinBERT is an early example of this trend, but we expect specialization to continue such that models are commonly developed for specific tasks (or designed for specific edge devices) at specific companies.
The transition from LLMs to large multimodal models (LMMs): Human learning involves many types of information, including text, images, and audio. Similarly, training AI models on several types of data (not only text) may improve performance, while additional modalities (as inputs, outputs, or both) also increase usefulness. Multimodal input remains a relatively advanced area for AI, and the generation of multimodal outputs is particularly challenging, requiring additional engineering solutions. Yet LMMs are a promising area of research. ChatGPT and Google DeepMind's Flamingo are both examples of LMMs. More recently, Google released a new multimodal model, called Gemini, which is expected to operate seamlessly across text, images, video, audio, and code.
Prompt engineering advancements: We expect model performance will continue to improve due to the use of curated and contextual prompts. For example, retrieval-augmented generation (RAG), which enables LLMs to retrieve external documents to serve as new reference points, can reduce hallucinations and improve accuracy by supplying up-to-date information that supports more relevant output (a stripped-down retrieval sketch follows this list). In commercial applications of LLMs, RAG makes it possible to avoid periodic retraining by updating a model's knowledge on a near-real-time basis.
LLMs’ augmentation by vector-enabled databases: Vector-enabled databases are an emerging class of data storage that incorporate vector embedding (data dimension information) which enables LLMs to better understand data and its relationships (see "2024 Trends in Data, AI & Analytics," Nov. 28, 2023). The use of vector-enabled databases improves LLMs’ capacity to provide context and to act as a long-term memory, and increases the speed and accuracy of semantic searches. This could lead to a host of new applications, including product recommendation systems, improved personalized search, better fraud detection, and reverse image search. We expect this nascent market (estimated by S&P Global Market Intelligence to be worth about $269 million in 2023) could grow to about $1.7 billion by 2028.
Increased use of LLMs in creative model exploration: LLMs can facilitate brainstorming by mimicking creativity. From a technical viewpoint, LLMs' creativity is a function of their "temperature," one of the hyperparameters (or configuration settings) used to regulate the diversity of outputs. A higher temperature prompts an LLM to explore options with lower probabilities (paralleling creativity), whereas a lower temperature generates more predictable (less creative) outputs. Depending on the context and use, humans can set the temperature to suit the desired outcome. For example, within healthcare, a higher temperature might be applied to the search for molecules used in drug discovery, while lower temperatures might assist in the application of treatments for disease.
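Two of the developments above lend themselves to short illustrations. First, a stripped-down sketch of retrieval-augmented generation (the documents and the bag-of-characters "embedding" are stand-ins; a production system would use a learned embedding model and a vector-enabled database):

    import numpy as np

    documents = [
        "The issuer refinanced its 2025 notes in March.",
        "Gross margin widened on lower input costs.",
        "The rating outlook was revised to positive.",
    ]

    def embed(text: str) -> np.ndarray:
        """Stand-in embedding: a normalized bag-of-characters vector (real systems use learned models)."""
        vec = np.zeros(26)
        for c in text.lower():
            if c.isalpha():
                vec[ord(c) - ord("a")] += 1
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    def retrieve(query: str, k: int = 1) -> list:
        """Return the k documents most similar to the query, to be prepended to the LLM prompt."""
        scores = [float(embed(query) @ embed(d)) for d in documents]
        return [documents[i] for i in np.argsort(scores)[::-1][:k]]

    query = "What happened to the outlook?"
    context = retrieve(query)
    prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
    print(prompt)   # the augmented prompt is then sent to the LLM

Second, temperature's effect on output diversity can be shown with a toy next-word distribution (vocabulary and scores invented for illustration):

    import numpy as np

    vocab = ["paint", "canvas", "volcano", "spreadsheet"]
    logits = np.array([3.0, 2.0, 0.5, 0.1])   # toy model scores for the next word

    def next_word_distribution(logits, temperature):
        """Softmax with temperature: low values sharpen the distribution, high values flatten it."""
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        return probs / probs.sum()

    for t in (0.2, 1.0, 2.0):
        print(t, dict(zip(vocab, next_word_distribution(logits, t).round(3))))
    # At low temperature the model almost always picks "paint"; at higher temperatures,
    # less likely words such as "volcano" gain meaningful probability, which reads as more "creative".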
Huan Zhang
Associate Director,
S&P Global Ratings
huan.zhang@spglobal.com