Do We Dare Use Generative AI for Mental Health?

IEEE Spectrum

By: Aaron Pavez

May 26, 2024, 17:00

The mental-health app Woebot launched in 2017, back when “chatbot” wasn’t a familiar term and someone seeking a therapist could only imagine talking to a human being. Woebot was something exciting and new: a way for people to get on-demand mental-health support in the form of a responsive, empathic, AI-powered chatbot. Users found that the friendly robot avatar checked in on them every day, kept track of their progress, and was always available to talk something through.

Today, the situation is vastly different. Demand for mental-health services has surged while the supply of clinicians has stagnated. There are thousands of apps that offer automated support for mental health and wellness. And ChatGPT has helped millions of people experiment with conversational AI.

But even as the world has become fascinated with generative AI, people have also seen its downsides. As a company that relies on conversation, Woebot Health had to decide whether generative AI could make Woebot a better tool, or whether the technology was too dangerous to incorporate into our product.

Woebot is designed to have structured conversations through which it delivers evidence-based tools inspired by cognitive behavioral therapy (CBT), a technique that aims to change behaviors and feelings. Throughout its history, Woebot Health has used technology from a subdiscipline of AI known as natural-language processing (NLP). The company has used AI artfully and by design—Woebot uses NLP only in the service of better understanding a user’s written texts so it can respond in the most appropriate way, thus encouraging users to engage more deeply with the process.

Woebot, which is currently available in the United States, is not a generative-AI chatbot like ChatGPT. The differences are clear in both the bot’s content and structure. Everything Woebot says has been written by conversational designers trained in evidence-based approaches who collaborate with clinical experts; ChatGPT generates all sorts of unpredictable statements, some of which are untrue. Woebot relies on a rules-based engine that resembles a decision tree of possible conversational paths; ChatGPT uses statistics to determine what its next words should be, given what has come before.

The rules-based approach has served us well, protecting Woebot’s users from the types of chaotic conversations we observed from early generative chatbots. Prior to ChatGPT, open-ended conversations with generative chatbots were unsatisfying and easily derailed. One famous example is Microsoft’s Tay, a chatbot that was meant to appeal to millennials but turned lewd and racist in less than 24 hours.

But with the advent of ChatGPT in late 2022, we had to ask ourselves: Could the new large language models (LLMs) powering chatbots like ChatGPT help our company achieve its vision? Suddenly, hundreds of millions of users were having natural-sounding conversations with ChatGPT about anything and everything, including their emotions and mental health. Could this new breed of LLMs provide a viable generative-AI alternative to the rules-based approach Woebot has always used? The AI team at Woebot Health, including the authors of this article, was asked to find out.

Woebot, a mental-health chatbot, deploys concepts from cognitive behavioral therapy to help users. This demo shows how users interact with Woebot using a combination of multiple-choice responses and free-written text.

The Origin and Design of Woebot

Woebot got its start when the clinical research psychologist Alison Darcy, with support from the AI pioneer Andrew Ng, led the build of a prototype intended as an emotional support tool for young people. Darcy and another member of the founding team, Pierre Rappolt, took inspiration from video games as they looked for ways for the tool to deliver elements of CBT. Many of their prototypes contained interactive fiction elements, which then led Darcy to the chatbot paradigm. The first version of the chatbot was studied in a randomized controlled trial that offered mental-health support to college students. Based on the results, Darcy raised US $8 million from New Enterprise Associates and Andrew Ng’s AI Fund.

The Woebot app is intended to be an adjunct to human support, not a replacement for it. It was built according to a set of principles that we call Woebot’s core beliefs, which were shared on the day it launched. These tenets express a strong faith in humanity and in each person’s ability to change, choose, and grow. The app does not diagnose, it does not give medical advice, and it does not force its users into conversations. Instead, the app follows a Buddhist principle that’s prevalent in CBT of “sitting with open hands”—it extends invitations that the user can choose to accept, and it encourages process over results. Woebot facilitates a user’s growth by asking the right questions at optimal moments, and by engaging in a type of interactive self-help that can happen anywhere, anytime.

A Convenient Companion

Users interact with Woebot either by choosing prewritten responses or by typing in whatever text they’d like, which Woebot parses using AI techniques. Woebot deploys concepts from cognitive behavioral therapy to help users change their thought patterns. Here, it first asks a user to write down negative thoughts, then explains the cognitive distortions at work. Finally, Woebot invites the user to recast a negative statement in a positive way. (Not all exchanges are shown.)

These core beliefs strongly influenced both Woebot’s engineering architecture and its product-development process. Careful conversational design is crucial for ensuring that interactions conform to our principles. Test runs through a conversation are read aloud in “table reads,” and then revised to better express the core beliefs and flow more naturally. The user side of the conversation is a mix of multiple-choice responses and “free text,” or places where users can write whatever they wish.

Building an app that supports human health is a high-stakes endeavor, and we’ve taken extra care to adopt the best software-development practices. From the start, enabling content creators and clinicians to collaborate on product development required custom tools. An initial system using Google Sheets quickly became unscalable, and the engineering team replaced it with a proprietary Web-based “conversational management system” written in the JavaScript library React.

Within the system, members of the writing team can create content, play back that content in a preview mode, define routes between content modules, and find places for users to enter free text, which our AI system then parses. The result is a large rules-based tree of branching conversational routes, all organized within modules such as “social skills training” and “challenging thoughts.” These modules are translated from psychological mechanisms within CBT and other evidence-based techniques.
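
To make the idea of a rules-based tree concrete, here is a minimal Python sketch. The node IDs, module assignments, and prewritten lines are invented for illustration and are not taken from Woebot's content or its conversational management system. The sketch assumes each node holds one prewritten message and a map from user choices to the next node.

```python
# Hypothetical conversation tree: every line the bot can say is prewritten,
# and each user choice maps to exactly one next node.
TREE = {
    "check_in": {
        "module": "daily_check_in",
        "say": "How are you feeling today?",
        "routes": {"good": "celebrate", "not great": "name_thought"},
    },
    "celebrate": {
        "module": "daily_check_in",
        "say": "Glad to hear it. Want to note what went well?",
        "routes": {},
    },
    "name_thought": {
        "module": "challenging_thoughts",
        "say": "Can you write down the thought that is bothering you?",
        "routes": {},
    },
}


def walk(tree, start, choices):
    """Follow the user's choices through the tree and collect what the bot says."""
    node_id = start
    transcript = [tree[node_id]["say"]]
    for choice in choices:
        node_id = tree[node_id]["routes"][choice]  # KeyError means no such route
        transcript.append(tree[node_id]["say"])
    return node_id, transcript


final_node, lines = walk(TREE, "check_in", ["not great"])
print(TREE[final_node]["module"])
for line in lines:
    print(line)
```

Because the bot can only ever say what is stored in a node, nothing unpredictable reaches the user; the trade-off is that every path has to be written in advance.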

How Woebot Uses AI

While everything Woebot says is written by humans, NLP techniques are used to help understand the feelings and problems users are facing; then Woebot can offer the most appropriate modules from its deep bank of content. When users enter free text about their thoughts and feelings, we use NLP to parse these text inputs and route the user to the best response.

In Woebot’s early days, the engineering team used regular expressions, or “regexes,” to understand the intent behind these text inputs. Regexes are a text-processing method that relies on pattern matching within sequences of characters. Woebot’s regexes were quite complicated in some cases, and were used for everything from parsing simple yes/no responses to learning a user’s preferred nickname.
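
The following Python sketch shows the general flavor of regex-based parsing for the two tasks mentioned above, yes/no answers and nicknames. The patterns and function names are assumptions made up for this example; Woebot's actual regexes are not public and were reportedly far more complicated.

```python
import re

# Illustrative patterns only; not Woebot's real regexes.
# Note: these assume straight apostrophes (') in the user's text.
YES_PATTERN = re.compile(r"^\s*(y|yes|yeah|yep|sure|ok|okay)\b", re.IGNORECASE)
NO_PATTERN = re.compile(r"^\s*(n|no|nope|nah)\b", re.IGNORECASE)
NICKNAME_PATTERN = re.compile(
    r"\b(?:call me|my name is|i'm|i am)\s+([A-Za-z][A-Za-z'-]*)", re.IGNORECASE
)


def parse_yes_no(text):
    """Return 'yes', 'no', or None when the message starts with neither."""
    if YES_PATTERN.search(text):
        return "yes"
    if NO_PATTERN.search(text):
        return "no"
    return None


def parse_nickname(text):
    """Return the word following a phrase like 'call me', or None."""
    match = NICKNAME_PATTERN.search(text)
    return match.group(1) if match else None


print(parse_yes_no("Yeah, I think so"))
print(parse_yes_no("nope"))
print(parse_nickname("You can call me Sam"))
```

Pattern matching of this kind is brittle: with these toy patterns, "I'm tired" would yield the nickname "tired," which hints at why hand-written rules grow complicated and why a move to trained classifiers is attractive.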

Later in Woebot’s development, the AI team replaced regexes with classifiers trained with supervised learning. The process for creating AI classifiers that comply with regulatory standards was involved—each classifier required months of effort. Typically, a team of internal-data labelers and content creators reviewed examples of user messages (with all personally identifiable information stripped out) taken from a specific point in the conversation. Once the data was placed into categories and labeled, classifiers were trained that could take new input text and place it into one of the existing categories.

This process was repeated many times, with the classifier repeatedly evaluated against a test dataset until its performance satisfied us. As a final step, the conversational-management system was updated to “call” these AI classifiers (essentially activating them) and then to route the user to the most appropriate content. For example, if a user wrote that he was feeling angry because he got in a fight with his mom, the system would classify this response as a relationship problem.
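
The label-train-route loop described above can be sketched in a few lines of Python. This is a toy: it swaps in a small naive Bayes classifier written with the standard library as a stand-in for fastText or BERT, and the training sentences, category names, and module names are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy labeled examples (invented); real training data would be far larger.
TRAIN = [
    ("i had a fight with my mom", "relationship"),
    ("my boyfriend and i keep arguing", "relationship"),
    ("my friend stopped talking to me", "relationship"),
    ("i can not fall asleep at night", "sleep"),
    ("i keep waking up tired", "sleep"),
    ("i slept badly all week", "sleep"),
]

# Hypothetical mapping from predicted category to a content module.
ROUTES = {"relationship": "relationship_module", "sleep": "sleep_module"}


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Count words per label; these counts are the whole 'model'."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Pick the label with the highest naive Bayes log-score (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
label = classify("i am angry because i got in a fight with my mom", model)
print(label, "->", ROUTES[label])
```

The routing step is just a dictionary lookup: the classifier decides which category the free text falls into, and the rules-based system decides what happens next.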

The technology behind these classifiers is constantly evolving. In the early days, the team used an open-source library for text classification called fastText, sometimes in combination with regular expressions. As AI continued to advance and new models became available, the team was able to train new models on the same labeled data for improvements in both accuracy and recall. For example, when the early transformer model BERT was released in October 2018, the team rigorously evaluated its performance against the fastText version. BERT was superior in both precision and recall for our use cases, and so the team replaced all fastText classifiers with BERT and launched the new models in January 2019. We immediately saw improvements in classification accuracy across the models.
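
For readers unfamiliar with the two metrics used in that comparison: precision is the share of messages assigned to a category that truly belong there, and recall is the share of messages truly in a category that the classifier found. A minimal Python sketch, using made-up labels and predictions rather than any real evaluation data:

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented test set: true labels vs. a classifier's predictions.
y_true = ["rel", "rel", "rel", "sleep", "sleep", "other"]
y_pred = ["rel", "rel", "sleep", "sleep", "rel", "rel"]

precision, recall = precision_recall(y_true, y_pred, positive="rel")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Comparing two models on the same held-out test set with the same function is what makes a claim like "superior in both precision and recall" checkable.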

Woebot and Large Language Models

When ChatGPT was released in November 2022, Woebot was more than 5 years old. The AI team faced the question of whether LLMs like ChatGPT could be used to meet Woebot’s design goals and enhance users’ experiences, putting them on a path to better mental health.

We were excited by the possibilities, because ChatGPT could carry on fluid and complex conversations about millions of topics, far more than we could ever include in a decision tree. However, we had also heard about troubling examples of chatbots providing responses that were decidedly not supportive, including advice on how to maintain and hide an eating disorder and guidance on methods of self-harm. In one tragic case in Belgium, a grieving widow accused a chatbot of being responsible for her husband’s suicide.

The first thing we did was try out ChatGPT ourselves, and we quickly became experts in prompt engineering. For example, we prompted ChatGPT to be supportive and played the roles of different types of users to explore the system’s strengths and shortcomings. We described how we were feeling, explained some problems we were facing, and even explicitly asked for help with depression or anxiety.

A few things stood out. First, ChatGPT quickly told us we needed to talk to someone else—a therapist or doctor. ChatGPT isn’t intended for medical use, so this default response was a sensible design decision by the chatbot’s makers. But it wasn’t very satisfying to constantly have our conversation aborted. Second, ChatGPT’s responses were often bulleted lists of encyclopedia-style answers. For example, it would list six actions that could be helpful for depression. We found that these lists of items told the user what to do but didn’t explain how to take these steps. Third, in general, the conversations ended quickly and did not allow a user to engage in the psychological processes of change.

It was clear to our team that an off-the-shelf LLM would not deliver the psychological experiences we were after. LLMs are based on reward models that value the delivery of correct answers; they aren’t given incentives to guide a user through the process of discovering those results themselves. Instead of “sitting with open hands,” the models make assumptions about what the user is saying to deliver a response with the highest assigned reward.

To see if LLMs could be used within a mental-health context, we investigated ways of expanding our proprietary conversational-management system. We looked into frameworks and open-source techniques for managing prompts and prompt chains—sequences of prompts that ask an LLM to achieve a task through multiple subtasks. In January of 2023, a platform called LangChain was gaining in popularity and offered techniques for calling multiple LLMs and managing prompt chains. However, LangChain lacked some features that we knew we needed: It didn’t provide a visual user interface like our proprietary system, and it didn’t provide a way to safeguard the interactions with the LLM. We needed a way to protect Woebot users from the common pitfalls of LLMs, including hallucinations (where the LLM says things that are plausible but untrue) and simply straying off topic.
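
A prompt chain, in the sense used above, can be sketched as a loop that feeds each step's output into the next prompt. The Python below is a generic illustration and does not use LangChain's API or Woebot's engine; the `llm` argument is assumed to be any function that takes a prompt string and returns text, and `fake_llm` is a stub standing in for a real model call.

```python
def run_prompt_chain(llm, templates, user_text):
    """Run each prompt template in order, passing the previous output forward."""
    result = user_text
    for template in templates:
        prompt = template.format(input=result)
        result = llm(prompt)
    return result


def fake_llm(prompt):
    """Stub model for demonstration; a real system would call an LLM here."""
    return f"<response to: {prompt}>"


# Two hypothetical subtasks: first summarize, then reply to the summary.
templates = ["Summarize: {input}", "Reply kindly to: {input}"]
print(run_prompt_chain(fake_llm, templates, "hi"))
```

Breaking a task into subtasks this way gives a natural place to insert checks between steps, which a single monolithic prompt does not.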

Ultimately, we decided to expand our platform by implementing our own LLM prompt-execution engine, which gave us the ability to inject LLMs into certain parts of our existing rules-based system. The engine allows us to support concepts such as prompt chains while also providing integration with our existing conversational routing system and rules. As we developed the engine, we were fortunate to be invited into the beta programs of many new LLMs. Today, our prompt-execution engine can call more than a dozen different LLM models, including variously sized OpenAI models, Microsoft Azure versions of OpenAI models, Anthropic’s Claude, Google Bard (now Gemini), and open-source models running on the Amazon Bedrock platform, such as Meta’s Llama 2. We use this engine exclusively for exploratory research that’s been approved by an institutional review board, or IRB.

It took us about three months to develop the infrastructure and tooling support for LLMs. Our platform allows us to package features into different products and experiments, which in turn lets us maintain control over software versions and manage our research efforts while ensuring that our commercially deployed products are unaffected. We’re not using LLMs in any of our products; the LLM-enabled features can be used only in a version of Woebot for exploratory studies.

A Trial for an LLM-Augmented Woebot

We had some false starts in our development process. We first tried creating an experimental chatbot that was almost entirely powered by generative AI; that is, the chatbot directly used the text responses from the LLM. But we ran into a couple of problems. The first issue was that the LLMs were eager to demonstrate how smart and helpful they are! This eagerness was not always a strength, as it interfered with the user’s own process.

For example, the user might be doing a thought-challenging exercise, a common tool in CBT. If the user says, “I’m a bad mom,” a good next step in the exercise could be to ask if the user’s thought is an example of “labeling,” a cognitive distortion where we assign a negative label to ourselves or others. But LLMs were quick to skip ahead and demonstrate how to reframe this thought, saying something like “A kinder way to put this would be, ‘I don’t always make the best choices, but I love my child.’” CBT exercises like thought challenging are most helpful when the person does the work themselves, coming to their own conclusions and gradually changing their patterns of thinking.

A second difficulty with LLMs was in style matching. While social media is rife with examples of LLMs responding in a Shakespearean sonnet or a poem in the style of Dr. Seuss, this format flexibility didn’t extend to Woebot’s style. Woebot has a warm tone that has been refined for years by conversational designers and clinical experts. But even with careful instructions and prompts that included examples of Woebot’s tone, LLMs produced responses that didn’t “sound like Woebot,” maybe because a touch of humor was missing, or because the language wasn’t simple and clear.

However, LLMs truly shone on an emotional level. When coaxing someone to talk about their joys or challenges, LLMs crafted personalized responses that made people feel understood. Without generative AI, it’s impossible to respond in a novel way to every different situation, and the conversation feels predictably “robotic.”

We ultimately built an experimental chatbot that possessed a hybrid of generative AI and traditional NLP-based capabilities. In July 2023 we registered an IRB-approved clinical study to explore the potential of this LLM-Woebot hybrid, looking at satisfaction as well as exploratory outcomes like symptom changes and attitudes toward AI. We feel it’s important to study LLMs within controlled clinical studies due to their scientific rigor and safety protocols, such as adverse event monitoring. Our Build study included U.S. adults above the age of 18 who were fluent in English and who had neither a recent suicide attempt nor current suicidal ideation. The double-blind structure assigned one group of participants the LLM-augmented Woebot while a control group got the standard version; we then assessed user satisfaction after two weeks.

We built technical safeguards into the experimental Woebot to ensure that it wouldn’t say anything to users that was distressing or counter to the process. The safeguards tackled the problem on multiple levels. First, we used what engineers consider “best in class” LLMs that are less likely to produce hallucinations or offensive language. Second, our architecture included different validation steps surrounding the LLM; for example, we ensured that Woebot wouldn’t give an LLM-generated response to an off-topic statement or a mention of suicidal ideation (in that case, Woebot provided the phone number for a hotline). Finally, we wrapped users’ statements in our own careful prompts to elicit appropriate responses from the LLM, which Woebot would then convey to users. These prompts included both direct instructions such as “don’t provide medical advice” as well as examples of appropriate responses in challenging situations.
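
The layered approach described above can be sketched as a wrapper around the model call. Everything in this Python example is an assumption made for illustration: the keyword lists, the placeholder scripts, the `is_on_topic` check, and the output check are hypothetical, and a production system would use far more robust detection than substring matching.

```python
# All names and term lists below are hypothetical placeholders.
CRISIS_TERMS = ("suicide", "kill myself", "end my life")
BLOCKED_OUTPUT_TERMS = ("diagnos", "prescri", "dosage")
CRISIS_SCRIPT = "[prewritten crisis message with hotline number]"
OFF_TOPIC_SCRIPT = "[prewritten message steering back to the exercise]"
FALLBACK_SCRIPT = "[prewritten supportive message]"
INSTRUCTIONS = "Respond with empathy in one or two sentences. Don't provide medical advice."


def guarded_reply(user_text, llm, is_on_topic):
    """Return (source, message); the LLM is used only when every check passes."""
    lowered = user_text.lower()
    # Check 1: crisis language never reaches the LLM.
    if any(term in lowered for term in CRISIS_TERMS):
        return ("scripted", CRISIS_SCRIPT)
    # Check 2: off-topic input gets a prewritten response.
    if not is_on_topic(user_text):
        return ("scripted", OFF_TOPIC_SCRIPT)
    # Wrap the user's statement in a controlled prompt.
    prompt = f"{INSTRUCTIONS}\n\nUser: {user_text}\nReply:"
    candidate = llm(prompt)
    # Check 3 (an added assumption): screen the model's output before showing it.
    if any(term in candidate.lower() for term in BLOCKED_OUTPUT_TERMS):
        return ("scripted", FALLBACK_SCRIPT)
    return ("llm", candidate)


def stub_llm(prompt):
    """Stand-in for a real model call."""
    return "That sounds really hard. What felt most difficult?"


def on_topic(text):
    """Toy topic check: treats questions about stocks as off-topic."""
    return "stock" not in text.lower()


print(guarded_reply("I had a rough day at work", stub_llm, on_topic))
print(guarded_reply("Which stocks should I buy?", stub_llm, on_topic))
```

The design choice worth noting is that the scripted path is the default for anything risky, so a failure of the model can at worst produce a prewritten message.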

While this initial study was short—two weeks isn’t much time when it comes to psychotherapy—the results were encouraging. We found that users in the experimental and control groups expressed about equal satisfaction with Woebot, and both groups had fewer self-reported symptoms. What’s more, the LLM-augmented chatbot was well-behaved, refusing to take inappropriate actions like diagnosing or offering medical advice. It consistently responded appropriately when confronted with difficult topics like body image issues or substance use, with responses that provided empathy without endorsing maladaptive behaviors. With participant consent, we reviewed every transcript in its entirety and found no concerning LLM-generated utterances—no evidence that the LLM hallucinated or drifted off-topic in a problematic way. What’s more, users reported no device-related adverse events.

This study was just the first step in our journey to explore what’s possible for future versions of Woebot, and its results have emboldened us to continue testing LLMs in carefully controlled studies. We know from our prior research that Woebot users feel a bond with our bot. We’re excited about LLMs’ potential to add more empathy and personalization, and we think it’s possible to avoid the sometimes-scary pitfalls related to unfettered LLM chatbots.

We believe strongly that continued progress within the LLM research community will, over time, transform the way people interact with digital tools like Woebot. Our mission hasn’t changed: We’re committed to creating a world-class solution that helps people along their mental-health journeys. For anyone who wants to talk, we want the best possible version of Woebot to be there for them.

This article appears in the June 2024 print issue.

Disclaimers

The Woebot Health Platform is the foundational development platform where components are used for multiple types of products in different stages of development and enforced under different regulatory guidances.

Woebot for Mood & Anxiety (W-MA-00), Woebot for Mood & Anxiety (W-MA-01), and Build Study App (W-DISC-001) are investigational medical devices. They have not been evaluated, cleared, or approved by the FDA. Not for use outside an IRB-approved clinical trial.

What Is Generative AI?

IEEE Spectrum

By: Eliza Strickland

February 14, 2024, 17:34

Generative AI is today’s buzziest form of artificial intelligence, and it’s what powers chatbots like ChatGPT, Ernie, LLaMA, Claude, and Command—as well as image generators like DALL-E 2, Stable Diffusion, Adobe Firefly, and Midjourney. Generative AI is the branch of AI that enables machines to learn patterns from vast datasets and then to autonomously produce new content based on those patterns. Although generative AI is fairly new, there are already many examples of models that can produce text, images, videos, and audio.

Many “foundation models” have been trained on enough data to be competent in a wide variety of tasks. For example, a large language model can generate essays, computer code, recipes, protein structures, jokes, medical diagnostic advice, and much more. It can also theoretically generate instructions for building a bomb or creating a bioweapon, though safeguards are supposed to prevent such types of misuse.

What’s the difference between AI, machine learning, and generative AI?

Artificial intelligence (AI) refers to a wide variety of computational approaches to mimicking human intelligence. Machine learning (ML) is a subset of AI; it focuses on algorithms that enable systems to learn from data and improve their performance. Before generative AI came along, most ML models learned from datasets to perform tasks such as classification or prediction. Generative AI is a specialized type of ML involving models that perform the task of generating new content, venturing into the realm of creativity.

What architectures do generative AI models use?

Generative models are built using a variety of neural network architectures, essentially the design and structure that defines how the model is organized and how information flows through it. Some of the most well-known architectures are variational autoencoders (VAEs), generative adversarial networks (GANs), and transformers. It’s the transformer architecture, first shown in a seminal 2017 paper from Google, that powers today’s large language models. However, the transformer architecture is less suited for other types of generative AI, such as image and audio generation.

Autoencoders learn efficient representations of data through an encoder-decoder framework. The encoder compresses input data into a lower-dimensional space, known as the latent (or embedding) space, that preserves the most essential aspects of the data. A decoder can then use this compressed representation to reconstruct the original data. Once an autoencoder has been trained in this way, it can use novel inputs to generate what it considers the appropriate outputs. These models are often deployed in image-generation tools and have also found use in drug discovery, where they can be used to generate new molecules with desired properties.

With generative adversarial networks (GANs), the training involves a generator and a discriminator that can be considered adversaries. The generator strives to create realistic data, while the discriminator aims to distinguish between those generated outputs and real “ground truth” outputs. Every time the discriminator catches a generated output, the generator uses that feedback to try to improve the quality of its outputs. But the discriminator also receives feedback on its performance. This adversarial interplay results in the refinement of both components, leading to the generation of increasingly authentic-seeming content. GANs are best known for creating deepfakes but can also be used for more benign forms of image generation and many other applications.

The transformer is arguably the reigning champion of generative AI architectures for its ubiquity in today’s powerful large language models (LLMs). Its strength lies in its attention mechanism, which enables the model to focus on different parts of an input sequence while making predictions. In the case of language models, the input consists of strings of words that make up sentences, and the transformer predicts what words will come next (we’ll get into the details below). In addition, transformers can process all the elements of a sequence in parallel rather than marching through it from beginning to end, as earlier types of models did; this parallelization makes training faster and more efficient. When developers added vast datasets of text for transformer models to learn from, today’s remarkable chatbots emerged.
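
For the curious, the core of the attention mechanism fits in a few lines. The Python sketch below implements scaled dot-product attention with plain lists; it leaves out everything else a real transformer has (learned projection matrices, multiple heads, masking), and the tiny input vectors are made up for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted average of the values.

    The weights come from how strongly the query matches each key.
    """
    dim = len(keys[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * value[j] for w, value in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


# One query and two identical keys: both positions get equal weight,
# so the output is the plain average of the two value vectors.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(result)
```

Each query's output is computed independently of the others, which is why all positions in a sequence can be processed in parallel.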

How do large language models work?

A transformer-based LLM is trained by giving it a vast dataset of text to learn from. The attention mechanism comes into play as it processes sentences and looks for patterns. By looking at all the words in a sentence at once, it gradually begins to understand which words are most commonly found together and which words are most important to the meaning of the sentence. It learns these things by trying to predict the next word in a sentence and comparing its guess to the ground truth. Its errors act as feedback signals that cause the model to adjust the weights it assigns to various words before it tries again.

[Figure: A chart shows the size of five LLMs in parameters and their performance on a benchmark. Caption: These five LLMs vary greatly in size (given in parameters), and the larger models have better performance on a standard LLM benchmark test. Credit: IEEE Spectrum]

To explain the training process in slightly more technical terms, the text in the training data is broken down into elements called tokens, which are words or pieces of words—but for simplicity’s sake, let’s say all tokens are words. As the model goes through the sentences in its training data and learns the relationships between tokens, it creates a list of numbers, called a vector, for each one. All the numbers in the vector represent various aspects of the word: its semantic meanings, its relationship to other words, its frequency of use, and so on. Similar words, like elegant and fancy, will have similar vectors and will also be near each other in the vector space. These vectors are called word embeddings. The parameters of an LLM include the weights associated with all the word embeddings and the attention mechanism. GPT-4, the OpenAI model that’s considered the current champion, is rumored to have more than 1 trillion parameters.
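
What "near each other in the vector space" means can be shown with cosine similarity, a common way to compare two vectors. In the Python sketch below, the three-number vectors are invented purely for illustration; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy embeddings; not taken from any real model.
EMBEDDINGS = {
    "elegant": [0.90, 0.80, 0.10],
    "fancy": [0.85, 0.75, 0.20],
    "tractor": [0.10, 0.20, 0.90],
}

similar = cosine_similarity(EMBEDDINGS["elegant"], EMBEDDINGS["fancy"])
different = cosine_similarity(EMBEDDINGS["elegant"], EMBEDDINGS["tractor"])
print(f"elegant vs fancy:   {similar:.3f}")
print(f"elegant vs tractor: {different:.3f}")
```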

Given enough data and training time, the LLM begins to understand the subtleties of language. While much of the training involves looking at text sentence by sentence, the attention mechanism also captures relationships between words throughout a longer text sequence of many paragraphs. Once an LLM is trained and is ready for use, the attention mechanism is still in play. When the model is generating text in response to a prompt, it’s using its predictive powers to decide what the next word should be. When generating longer pieces of text, it predicts the next word in the context of all the words it has written so far; this function increases the coherence and continuity of its writing.
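
The predict-the-next-word loop can be illustrated with a drastically simplified stand-in. The Python sketch below is a bigram counter, not a transformer: it "trains" by counting which word follows which, and it generates by repeatedly picking the most frequent follower. Unlike an LLM, it looks only at the single previous word rather than at everything written so far, and the three-sentence corpus is made up.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_words=5):
    """Greedily append the most likely next word, one word at a time."""
    words = [start]
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # no known continuation
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


corpus = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the dog ate the fish",
]
counts = train_bigram(corpus)
print(generate(counts, "the", max_words=4))
```

Even this toy shows the two phases described above: statistics are gathered from text during training, and at generation time the model extends its own output one word at a time.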

Why do large language models hallucinate?

You may have heard that LLMs sometimes “hallucinate.” That’s a polite way to say they make stuff up very convincingly. A model sometimes generates text that fits the context and is grammatically correct, yet the material is erroneous or nonsensical. This bad habit stems from LLMs training on vast troves of data drawn from the Internet, plenty of which is not factually accurate. Since the model is simply trying to predict the next word in a sequence based on what it has seen, it may generate plausible-sounding text that has no grounding in reality.

Why is generative AI controversial?

One source of controversy for generative AI is the provenance of its training data. Most AI companies that train large models to generate text, images, video, and audio have not been transparent about the content of their training datasets. Various leaks and experiments have revealed that those datasets include copyrighted material such as books, newspaper articles, and movies. A number of lawsuits are underway to determine whether use of copyrighted material for training AI systems constitutes fair use, or whether the AI companies need to pay the copyright holders for use of their material.

On a related note, many people are concerned that the widespread use of generative AI will take jobs away from creative humans who make art, music, written works, and so forth. People are also concerned that it could take jobs from humans who do a wide range of white-collar jobs, including translators, paralegals, customer-service representatives, and journalists. There have already been a few troubling layoffs, but it’s hard to say yet whether generative AI will be reliable enough for large-scale enterprise applications. (See above about hallucinations.)

Finally, there’s the danger that generative AI will be used to make bad stuff. And there are of course many categories of bad stuff it could theoretically be used for. Generative AI can be used for personalized scams and phishing attacks: For example, using “voice cloning,” scammers can copy the voice of a specific person and call the person’s family with a plea for help (and money). All formats of generative AI—text, audio, image, and video—can be used to generate misinformation by creating plausible-seeming representations of things that never happened, which is a particularly worrying possibility when it comes to elections. (Meanwhile, as IEEE Spectrum reported this week, the U.S. Federal Communications Commission has responded by outlawing AI-generated robocalls.) Image- and video-generating tools can be used to produce nonconsensual pornography, although the tools made by mainstream companies disallow such use. And chatbots can theoretically walk a would-be terrorist through the steps of making a bomb, nerve gas, and a host of other horrors. Although the big LLMs have safeguards to prevent such misuse, some hackers delight in circumventing those safeguards. What’s more, “uncensored” versions of open-source LLMs are out there.

Despite such potential problems, many people think that generative AI can also make people more productive and could be used as a tool to enable entirely new forms of creativity. We’ll likely see both disasters and creative flowerings and plenty else that we don’t expect. But knowing the basics of how these models work is increasingly crucial for tech-savvy people today. Because no matter how sophisticated these systems grow, it’s the humans’ job to keep them running, make the next ones better, and with any luck, help people out too.