Can chatbots hold meaningful conversations?

Arseny Moskvichev dreams of the day he can have a meaningful conversation with artificial intelligence. “By meaningful, I mean a conversation that has the power to change you,” says the cognitive and computer scientist. “The problem,” says Moskvichev, “is that LLMs are complete amnesiacs. They only have so much context they can attend to. If you’re out of this context, they forget everything you spoke about with them.”

Even the most advanced chatbots can process only about 16,000 words of text within a prompt when in conversation with a human user. This limit is called a “context window.” Nor can they connect the information they receive during different “conversations” with a human, or build a storyline.

To help chatbots learn to hold life-changing conversations, and to improve their comprehension of the deep complexities of context—of the webs of relationships between people, events, and timelines that govern human lives—Moskvichev and his colleagues are teaching them to read novels, the way we might learn to read them in high school literature classes.

The act of reading a novel might seem like a relaxing pastime, but it requires a nuanced intelligence. We use memory and complex, layered comprehension to follow multiple characters through twisting plots, scene changes, and narrative. And while we might not think about it, the average novel runs to around 80,000 words. The Picture of Dorian Gray, by Oscar Wilde, for instance, runs to 82,000 words, while The Souls of Black Folk, by W.E.B. Du Bois, totals around 72,000. The Little Prince, a children’s book by Antoine de Saint-Exupéry, has around 17,000 words. All those words gradually build a story we hold and examine in our minds. But such skills are currently out of reach of large language models (LLMs) like OpenAI’s ChatGPT, which can process text but cannot be said to read the way we do.
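To make the mismatch concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the word counts quoted in this article and treats the 16,000-word limit as a fixed chunk size. That is a simplifying assumption: real models measure context in tokens rather than words, and the function name `windows_needed` is purely illustrative.

```python
import math

# Context limit quoted in this article, in words (real models count tokens).
CONTEXT_WINDOW_WORDS = 16_000

# Approximate word counts quoted in this article.
BOOK_WORDS = {
    "The Picture of Dorian Gray": 82_000,
    "The Souls of Black Folk": 72_000,
    "The Little Prince": 17_000,
}


def windows_needed(word_count, window=CONTEXT_WINDOW_WORDS):
    """Return how many window-sized chunks it takes to cover a text."""
    return math.ceil(word_count / window)


for title, words in BOOK_WORDS.items():
    print(f"{title}: {windows_needed(words)} context windows")
```

By this crude measure, even The Little Prince spills over into a second window, and a model that forgets each window as it moves to the next never holds the whole story at once.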

As part of a postdoc at the Santa Fe Institute, an independent nonprofit theoretical research institute, Moskvichev worked with Ky-Vinh Mai, who studies complexity at the University of California, Irvine, to build a new database for training LLMs to read and analyze long stretches of text. The pair unveiled their new tool, NarrativeXL, at the Empirical Methods in Natural Language Processing conference in Singapore late last year. “It’s a dataset with very long contexts,” Moskvichev says. As a training tool it offers nearly a million questions for LLMs to practice on, “way above everything that was there before.” It is a “supervised” dataset, he says, which means there are gold-standard answers against which the AI’s responses can be scored, making assessment possible.

Moskvichev and Mai built NarrativeXL from 1,500 books publicly available through Project Gutenberg. They train LLMs in reading comprehension by having the AI read a book in the database, then asking it to pick the correct scene summary from a pool of options. Some summaries are accurate, while others include decoy scenes or are “corrupted” with characters from other books: say, Dracula moonlighting as a protagonist in Pride and Prejudice.
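For readers who think in code, the idea of a corrupted decoy can be sketched in a few lines of Python. This is a toy illustration, not the researchers' actual pipeline: the function names, the character lists, and the sample summary are all hypothetical, and the corruption here is a simple name swap.

```python
import random


def corrupt_summary(summary, own_characters, foreign_characters, rng):
    """Swap one character named in the summary for one from another book."""
    present = [name for name in own_characters if name in summary]
    if not present:
        return summary  # no known character to swap out
    victim = rng.choice(present)
    intruder = rng.choice(foreign_characters)
    return summary.replace(victim, intruder)


def build_question(true_summary, decoys, rng):
    """Shuffle the true summary in among the decoys and record its position."""
    options = [true_summary] + list(decoys)
    rng.shuffle(options)
    return {"options": options, "answer": options.index(true_summary)}


rng = random.Random(0)  # seeded so the example is repeatable
true_summary = "Elizabeth refuses Mr. Collins's proposal."  # invented sample
corrupted = corrupt_summary(
    true_summary, ["Elizabeth", "Mr. Collins"], ["Dracula"], rng
)
question = build_question(true_summary, [corrupted], rng)
print(question)
```

The model under training would see only the shuffled options. The stored answer index plays the role of the gold-standard label it is scored against.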

To train memory, Moskvichev and Mai compiled read-along questions, where the AI is expected to know more of the plot the further it has read into the book, but nothing beyond that point. For example, if the AI has read only halfway through Pride and Prejudice, it should still “worry” over whether Miss Bennet marries Mr. Darcy. (Spoiler alert: She does, but Austen saves the nuptials for a grand finale!) Because memory is a function of time, where the sequence of events matters, accidental knowledge of future events, even when factually correct, counts as a false memory. Any reference to the pair as married in a summary of the book should be identified by the AI as false unless it has been shown the entire text in its context window.
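The read-along rule boils down to a simple check, sketched below in Python. The scene numbering is an assumption made for illustration (a book divided into 100 hypothetical scenes, with the wedding placed last), and `SceneSummary` and `expected_label` are invented names rather than anything taken from the dataset itself.

```python
from dataclasses import dataclass


@dataclass
class SceneSummary:
    text: str
    scene_index: int          # position of the scene in the book, from 0
    is_corrupted: bool = False


def expected_label(summary, scenes_read):
    """True only if the summary is genuine AND its scene has been read."""
    return (not summary.is_corrupted) and summary.scene_index < scenes_read


# Hypothetical numbering: 100 scenes, with the wedding as the final one.
wedding = SceneSummary("Elizabeth Bennet marries Mr. Darcy.", scene_index=99)

print(expected_label(wedding, scenes_read=50))   # halfway through the book
print(expected_label(wedding, scenes_read=100))  # whole book read
```

Halfway through, the wedding summary is labeled false even though it is factually correct, which is exactly the “false memory” case. Only once every scene has been read does the label flip to true.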

As part of its training on NarrativeXL, an LLM can also be asked to identify and correct the mistakes in corrupted book scenes. So not only would Dracula have to be removed from Pride and Prejudice, but an LLM in training would also need to replace him with the correct character. This requires nuanced comprehension, an understanding of how the characters function in relation to one another.

With 1,500 books and nearly 1 million questions, NarrativeXL is at least twice the size of existing book-based training databases. BookSum, a training database released in 2021, compiles 405 books for training AIs. The average context length across all available training databases remains under 15,000 words. One of the roadblocks to developing such training tools for AI is human resources: Accurately summarizing a book requires someone to read it and design the training questions. “It’s very expensive to collect a supervised data set of this type,” Moskvichev says. It is here that the pair had a stroke of ingenuity: They built NarrativeXL by getting GPT-3.5 to summarize short book sections, a task at which LLMs excel. It is an example of a machine being used to train a more powerful machine.

While Moskvichev and Mai validated the use of NarrativeXL as a training tool and observed significant improvement in trained LLM performance versus untrained, the LLMs still make mistakes approximately half of the time. “AI models are very tricky. They’re sneaky, and they cheat when they can,” Moskvichev says.

The heart of the problem is that chatbots are designed to identify the quickest path to an answer, which rarely involves reading everything. It is never a wild guess to predict that the heroine of a 19th-century romance will get the guy, but that shortcut can land the AI in trouble. “There’s still room for improvement,” Moskvichev says.

This article originally appeared on Nautilus, a science and culture magazine for curious readers.