
Ask an Expert: Evaluating LLM “Research Assistants” and their Risks for Novice Researchers

Welcome to our new ASK AN EXPERT feature, a partnership between Critical AI and Critical AI @ Rutgers’ NEH-funded DESIGN JUSTICE LABS network. In response to an anthropology professor who asked about NotebookLM and what it portended for the function of documents like literature reviews and field statements as learning activities for graduate students, we consulted Tiffany DeRewal, Associate Teaching Professor of Writing Arts at Rowan University, who, with her colleague Leslie Allison, has been investigating LLM research tools through the lens of student learning. Her response has been peer-reviewed by members of the Critical AI editorial collective.

In recent months, a spate of LLM-based “research assistants,” including Google’s NotebookLM, has spurred enthusiasm and awe among a range of AI zealots, while inspiring concern among educators. These tools purport to “read” and “synthesize” specific texts: in doing so, products like ChatPDF and NotebookLM work with files that the user uploads, while Elicit, Research Rabbit, and SciSpace draw from uploaded documents as well as scholarly literature indexed in existing research databases. The platforms then use information retrieval systems and LLMs to generate outputs in formats including document summaries, topic overviews, study guides, and (as our Anthropology colleague noted) literature reviews.

It is not hard to see why such products have caused some alarm—particularly among educators who teach secondary research methods and see active engagement with bodies of disciplinary research (i.e., “literature”) as a crucial foundation for students’ scholarly development. As they promise gains in efficiency and productivity, LLM research assistants purport to “streamline”—or bypass—the challenging, recursive, and often messy work of identifying, evaluating, analyzing, and synthesizing the existing research on a given topic. Over decades of practice, educators have come to understand these processes as themselves integral to building students’ intellectual abilities and skills. Indeed, from the standpoint of learning, the process itself is more valuable than the final product—even though the product is what tends to be of greatest interest to students and their advisers. In fields such as anthropology and computer science, literature surveys are part of how emerging scholars establish their familiarity with the concerns of their chosen field. The process of preparing them readies students to fashion the research on which they will predicate their own original contributions.

With a variety of LLM-based tools now claiming to produce first-rate literature reviews with just a few clicks, educators must persuade some of their students that the laborious and often frustrating work of actively learning a body of knowledge is still worthwhile.  Does that mean that NotebookLM portends the end of the literature review—or perhaps the end of research-based writing as an instructional activity? As a college instructor who teaches information literacy and secondary research methods, and who has devoted some time to studying these tools, my answer is emphatically no.

First, the developers and marketers of these commercial platforms fundamentally confuse process with product. Robust and time-consuming engagement with the research “literature” is, quite simply, necessary to preserving the quality of future research and the people who will read and undertake it. At the same time, research-based writing—including rhetorical analysis, source evaluation, and the ability to grasp and analyze a set of research questions and to locate them in a larger context—remains a crucial activity. The last thing the world needs, in an online domain already teeming with AI “slop,” is more substandard output claiming to be “research.”

But this takes us to a crucial point that many educators do not yet recognize: LLM-based tools cannot actually do the work of research and research-based writing.

It is no secret that the AI industry is hungry for education sector buy-in. Ed tech companies actively target educators and students—despite the fact that most of these products have been designed with little to no consideration of their impact on learning and skills development. The advertising for AI research products typically presents these tools as strategic partners that save time and offer “personalized” writing and learning support.[i] For example, Elicit invites users to “[a]utomate time-consuming research tasks like summarizing papers, extracting data, and synthesizing your findings.”

NotebookLM, which is marketed primarily to students, encourages users to upload course texts and “[a]sk NotebookLM to explain complex concepts in simple terms.” Hence, it seems likely that many well-intentioned students turn to these tools because they are being told that they need to make strategic choices: “I care about my field and my intellectual growth,” they might reason, “but I am not good at research/reading/writing/note-taking and/or I’m afraid of doing this wrong and/or I’m pressed for time, so this tool can help me to optimize my research process!”

As many readers of this blog will know, enabling students to make informed choices is the goal of teaching critical AI literacies. Nonetheless, equipping students to recognize hype is a challenging task, as Marc Watkins among others has emphasized. Although the “GPT” architecture on which LLMs depend dates back less than a decade (to 2018), OpenAI and its competitors have released a constant stream of systems with new features since the popular success of ChatGPT in November 2022. Given the proprietary secrecy of these competing products, it takes effort to establish even a baseline understanding of how each functions. Increasingly, educators are collaborating to share the results of their own probing and evaluations. What follows is my own contribution.

Under the Hood

TL;DR: My examination emphasizes four major points about NotebookLM and LLM-based research tools more generally:

  1. They are not good at summarizing texts: they get things wrong, make things up, and do so in complex, non-obvious ways.
  2. Their outputs mimic but do not match the outcomes of human reading comprehension and source synthesis.
  3. They are proprietary black boxes: we cannot be certain what exactly is going on under the hood, what data the model has been trained on, and so on.
  4. They harm the cognitive development of their users: current research suggests that AI “assistants” have a negative effect on critical thinking, knowledge building, and skill development.

Almost every glowing review of NotebookLM misleadingly promises that it “only uses the sources you upload” to produce output. As the Chronicle of Higher Education put it, “[r]ather than training the chatbot’s output on reams of data from across the internet,” NotebookLM looks only at the documents the user uploads. That claim, if true, would help to ensure that the system’s outputs adequately reflect the relevant files and would reduce the likelihood of the errors and fabrications that are endemic to probabilistic systems.

The notion that NotebookLM “only draws from the sources you upload” refers to a heavily promoted feature that Google developers call “source-grounding,” which is achieved through a combination of retrieval augmented generation (RAG) and a large context window. (More on these terms below.) Through these processes, NotebookLM retrieves external data from uploaded sources and uses that data to generate content. However, as many current users of the tool have emphasized, NotebookLM’s “source-grounding” processes remain flawed and inadequate.

For example, I uploaded my own 100,000+ word doctoral dissertation to NotebookLM last year and used both the “Chat” and “Studio” features to generate summaries of the dissertation’s central argument as well as more specific claims. I found that answers to my Chat queries, as well as the “Study Guide” and “Briefing Doc” that the tool generated automatically, for the most part missed, or misconstrued, the dissertation’s central argument and key claims. I repeated this process after NotebookLM upgraded its LLM from Gemini 1.5 Pro to Gemini 2.0 earlier this year and found largely the same results. NotebookLM Chat provided direct, hyperlinked quotations from my uploaded document, and both the Chat and Studio features accurately identified my cornerstone texts, historical anecdotes, and the key takeaways from my literature review. While this was impressive to a point, the generated content sometimes offered contradictory conclusions and often failed to capture the actual claims and stakes of my thesis. Anyone relying on these outputs without reading my work would miss the intervention I was making. 

Is this focus on a single text—my own dissertation no less—just splitting hairs? After all, NotebookLM correctly singled out many of my claims and provided hyperlinked quotations from the relevant passages. What it often missed, however, was the context or function of those claims in the larger argument. As with summarizing tools more generally, such subtle errors or omissions are insidious because they imbue users with a false confidence in the quality of these tools. To be sure, studies suggest that expert researchers readily spot the flaws in LLM-generated summaries. But we should not expect that non-expert users, including students who have not yet developed expertise in their areas of study, will reliably be able to identify these problems.

For example, an Australian government proof-of-concept study in early 2024 found that human evaluators consistently ranked machine-generated summaries lower than those produced by human staffers; synthetic outputs “often missed emphasis, nuance and context; included incorrect information or missed relevant information; and sometimes focused on auxiliary points or introduced irrelevant information.” Philippa Hardman’s more informal assessment of ChatGPT 4o, Claude 3.5, and NotebookLM came to a similar conclusion: all three models, she found, oversimplified and missed important details. They also suffered from a tendency to “intervene” by “shaping the type and depth of information,” as well as its tone and meaning.[ii] These are complex mistakes that require close analysis and thoughtful fact-checking to spot, which means they are more likely to be missed.

By incorporating what appear to be correct source citations and assuming the stylistic and tonal markers of assured, well-researched academic prose, these tools seem correct and authoritative. In actuality, they pose a threat to both student learning and quality research. What the promotion of such tools does not acknowledge is that it takes considerable skill and background knowledge to spot LLM-based errors—capacities that novice researchers, by definition, do not yet possess and which they may never develop if they are taught to rely on automated shortcuts.

Reality Check

In the real world, understanding complex research requires careful reading and active thought. The heavy promotion of automated summaries thus fails students in two key ways: by proffering substandard summaries and by pretending that reading summaries is an adequate means of developing knowledge and research skills.

Consider NotebookLM’s user interface. In the arrangement of its vertical windows, the interface encourages users to imagine that the tool is working only with selected sources, and that it achieves a progressively deeper engagement with those sources.

The NotebookLM interface provides three windows for user engagement. In the “Sources” window, users can view the uploaded documents in HTML plain text, while the “Chat” window functions like a conventional LLM chatbot, allowing users to input questions and prompts about their documents, to which the system responds with synthetic answers, complete with hyperlinked citations that direct the user to sections of the uploaded sources. The final window, “Studio,” refers to the “Content Studio,” which is where the tool produces the outputs that ostensibly summarize the uploaded documents in a variety of modes, including “Study guides,” “Briefing docs,” “FAQs,” and even simulated conversational audio podcasts.

From “Sources” to “Chat” to “Studio,” these outputs become increasingly processed, or mediated, as they move from the plain text of uploaded inputs (“Sources”), to specific answers to questions (“Chat”) and, finally, synthetic summaries and other content (“Studio”) produced without any explicit user intervention. Observing the “Studio” content, naïve users will readily assume that the system is actually reading and understanding the uploaded documents in a human-like way and, as such, providing reliable outputs.

But as I have already suggested, NotebookLM’s facsimiles do not actually replicate human research activities and do not build a user’s agency and learning. At bottom, NotebookLM is an LLM text generator. Like all such technologies, NotebookLM can generate responses to prompts because of a pre-training process that models patterns in training data, after which intensive reinforcement by human data workers improves the accuracy of its predictions. In addition, systems like NotebookLM, Perplexity AI, or ChatGPT use processes such as retrieval augmented generation (RAG) to supply the model with access to additional information. The goal of these and many other workarounds is to make the outputs seem more trustworthy and human-like.

According to Google’s own blog, the NotebookLM platform was built on Google’s Gemini LLM, and the free version of the product now runs on Gemini 2.0 Flash, which, like all LLMs, was pre-trained on a huge corpus of existing, human-produced text (as well as images, audio, video, and computer code). In contrast to earlier LLMs, Gemini 2.0 is notable for its extremely long “context window” (up to 750,000 words)—that is, the length of uploaded text (or other tokens) that the model can process during a given exchange. This makes it possible for the system to process up to about 1,500 pages of inputted content along with relevant queries. If a user uploads fifty 20-page articles to NotebookLM, the full text of all fifty articles will be part of the input for all subsequent prompts. Nonetheless, as we have seen, NotebookLM’s supposed mastery over these documents is deceptive. The same is true of another much-hyped technology—the latest “deep research” tools, which combine long context models with computation-intensive prompts (so-called “chains of thought”). Despite the huge expense of running these systems, they have been shown to misinterpret the texts they are citing, come to inaccurate conclusions, and draw on lower quality sources.
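To make the idea of a long context window concrete, here is a minimal sketch, in Python, of what the approach amounts to. The helper names, the token limit, and the rough word-to-token conversion are my own illustrative assumptions, not Google’s implementation; the point is simply that every uploaded source travels with every prompt.

    # A rough sketch (not Google's code) of the "long context" approach:
    # every uploaded source is concatenated into one prompt alongside the
    # user's question, so the whole corpus is re-processed on each query.

    MAX_CONTEXT_TOKENS = 1_000_000  # illustrative limit, roughly 750,000 words

    def rough_token_count(text: str) -> int:
        # Crude heuristic (~4 tokens per 3 words); real systems use a tokenizer.
        return (len(text.split()) * 4) // 3

    def build_long_context_prompt(sources: list[str], question: str) -> str:
        corpus = "\n\n---\n\n".join(sources)  # e.g., fifty 20-page articles
        prompt = (
            f"Sources:\n{corpus}\n\n"
            f"Question: {question}\n"
            "Answer using only the sources above."
        )
        if rough_token_count(prompt) > MAX_CONTEXT_TOKENS:
            raise ValueError("Sources exceed the context window; retrieval (RAG) is needed.")
        return prompt

Nothing in this pipeline involves “reading” in a human sense; the documents simply become part of a very long input string.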

Note that long context models involve significantly longer processing times and thus higher computing and energy costs; in LLM research assistants like NotebookLM, the long context window is supplemented with additional, more efficient information retrieval processes such as RAG.

The Deeper Dive: RAG and System Prompts

In what follows, I unpack a few of the technical processes on which these systems rely. Readers who have not already read Matthew Stone, Lauren M.E. Goodlad, and Mark Sammons’s “The Origins of LLMs in Machine Transcription and Machine Translation and Why That Matters” may also want to look there for a complementary history of how probabilistic “GPTs” originated in techniques for improving the accuracy of voice-to-text and machine translation systems. As the authors show, the confabulations, errors, and stereotypes in GPT-3 (the model that preceded ChatGPT) were too frequent to make it a tool for widespread commercialization. Hence the need for “mitigating” this “misalignment” through industrial-scale human reinforcement.

Yet another workaround that developers devised for improving the performance of LLMs in information retrieval is RAG (retrieval augmented generation), the term for a variety of processes by which new, “external” data can be added to a pre-trained system using data from, for example, indexed sources on the web. RAG works by adding chunks of such data (selected for semantic similarity to the prompt) as additional input, thus augmenting the system’s ability to provide additional source information or context. In contrast to a much larger context window, RAG is a less computationally costly process of information retrieval. Retrieval via RAG can occur either before or after the LLM’s generation of text, adding additional context to a query or associating the generated text with a citation in the form of a supposed footnote. Such “footnotes,” however, are mere pretenses: that is, they are post-hoc guesses about what might be relevant to the generated information rather than actual sources, which (because the generated text derives from a model’s statistical representation of training data) cannot be definitively known.

In general, RAG systems utilize the following steps (sketched in code after the list):

  • First, new data is collected, “chunked,” and vectorized into a database that is searchable by the LLM. This external data may come from the internet or from other databases.
  • Next, in response to the user’s prompt, the system generates an output (such as a summary), which is then used to search the new database. Key vectors that semantically “match” the vectorized query are used to retrieve specific chunks from the external data source. As we have seen, in the case of NotebookLM that external source is primarily provided through the user’s documents.
  • The relevant chunks are then added to the initial prompt in order to generate what will presumably be a more accurate or up-to-date output.
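To make these steps more concrete, here is a toy sketch in Python. It uses simple word counts in place of a learned embedding model and a plain list in place of a vector database; the function names and parameters are my own illustrations, not the internals of NotebookLM or any other product.

    # Toy RAG pipeline: chunk the sources, "vectorize" them, retrieve the
    # chunks most similar to the query, and prepend them to the prompt.
    import math
    from collections import Counter

    def embed(text: str) -> Counter:
        # Stand-in for a learned embedding: a bag-of-words vector.
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[w] * b[w] for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def chunk(document: str, size: int = 200) -> list[str]:
        # Step 1: split each source into fixed-size chunks before vectorizing.
        words = document.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
        # Step 2: score every chunk against the vectorized query; keep the top k.
        chunks = [c for doc in documents for c in chunk(doc)]
        q_vec = embed(query)
        return sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)[:k]

    def augment_prompt(query: str, documents: list[str]) -> str:
        # Step 3: prepend the retrieved chunks to the user's prompt.
        context = "\n\n".join(retrieve(query, documents))
        return f"Context:\n{context}\n\nQuestion: {query}"

Even in this toy version, the decontextualizing move is visible: the model receives isolated excerpts ranked by surface similarity to the prompt, not the documents as wholes.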

Readers may already recognize that the use of RAG is fundamentally decontextualizing. That is, when an LLM system identifies semantically similar vectorized chunks of text that “match” the user’s prompt, it does not read or understand the texts in question in a human-like way. Moreover, even though RAG often results in something that looks like a footnote, it continues and even obscures the decontextualization and lack of reliable crediting of source materials to which all probabilistic models are, by definition, subject.

Whether an LLM-based tool is using uploaded documents, chunks retrieved from an external database, or some combination of the two, the system is instructed to identify key points in the text. These instructions typically come in the form of system prompts: behind-the-scenes prompts that developers set to make it more likely that product output will meet company and user expectations.

System prompts are used for a variety of purposes, including the specification of style, tone, and structure. When you submit a query to a chatbot, you do not necessarily instruct the model to deliver a response that is cordial and professional, or to generate content as a concise bulleted list. These seemingly autonomous choices about how to answer questions are shaped through system prompts. They may also set limits by instructing the model to avoid hate speech and to “treat controversial topics with impartiality and objectivity.”
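As a rough illustration, consider how a hidden system prompt might be prepended to a user’s query in a typical chat-style API. The wording of the prompt and the message format below are hypothetical; real products keep their actual system prompts confidential.

    # Hypothetical example: a hidden system prompt silently prepended to the
    # conversation. The user writes only the second message, but the first one
    # shapes the tone, format, and limits of every reply.

    SYSTEM_PROMPT = (
        "You are a helpful study assistant. Answer only from the provided sources. "
        "Respond in a cordial, professional tone, prefer concise bulleted lists, "
        "avoid hate speech, and treat controversial topics with impartiality and objectivity."
    )

    def build_messages(user_query: str) -> list[dict]:
        # The system message never appears in the interface, yet it conditions the output.
        return [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ]

    messages = build_messages("Summarize the key claims of my uploaded chapter.")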

System prompts are particularly effective in NotebookLM, which can produce some outputs without any explicit user prompts. The buttons in the “Studio” window automatically generate an “Audio Overview” podcast, as well as documents purporting to summarize and synthesize the notebook’s uploaded texts (e.g., “Study guide,” “FAQ,” and “Timeline”). The instructions necessary to produce these features are all supplied behind the scenes as system prompts. Google engineers tout the Content Studio as the “editorial” component, and the “real magic,” of NotebookLM.

One particularly clear illustration of how system prompts condition the tool’s underlying language model can be seen in the tool’s generation, last year, of a nine-minute podcast on art and meaning from a document consisting only of the words “poop” and “fart” repeated 1,000 times. When this story was picked up by tech news outlets, some were impressed by the synthetic podcast’s “profound” insights. To my mind, however, those supposed insights point to the formulaic instructions through which all LLM tools dress up glib bullshit in the form of familiar genres.

It is worth asking, though, how system prompts (which are concealed from users) are part of the apparatus through which generative AI shapes outputs and controls user experiences. As Eryk Salvaggio has emphasized, system prompts function as “a unique exercise of power” over the user’s own prompts. We know they are used to regulate output and deter the generation of “offensive” or otherwise “risky” language. As they do so, they create conditions that leave users with little or no information about how or why the language generated in response to their prompts is moderated. Recently it has mainly been rightwing cries over “wokeness” that grab the headlines. But people of any political persuasion need to recognize how many decisions are being made for them behind the scenes, from the choice of datasets, to the teams that enlist data workers to “align” models with perceived “values,” to the guardrails and rules imposed by invisible prompts. As is well known, the enormous costs of building and running LLM tools concentrate a huge amount of power in very few hands, thus contributing to increased censorship, surveillance, and potential authoritarian control. What are the potential implications of such control for tools that purport to help students research and craft strong arguments? What might NotebookLM or ChatGPT identify as a “controversial topic”? Will automated summaries smooth over positions and claims that have the semantic markers of more “radical” points of view, or generate false equivalences where there is actually robust, high-stakes debate?

Building Critical AI Literacies

Through probing models, researchers can gain a general sense of the techniques on which LLM-based research tools depend; but the secrecy that surrounds these models leads to other unanswered questions that should give users pause. When is a tool inputting full document texts into its LLM context window, and when is it using RAG? What chunks of which texts does the system access in producing its synthetic outputs? Which documents and databases have been included in its indexing, which have been excluded, and why? What was included in the training data for the base model? What kinds of content were human data workers enlisted to reinforce, and under what conditions did they labor? How much energy and water were used to train the system and to run it?

What we do know is that the tech companies lobby constantly against regulation while guarding their secrets as closely as possible. Research on the impact of LLM tools on the learning and cognitive development of students is just getting under way, but already there is evidence that LLM tools can diminish critical thinking, engagement in learning processes, and knowledge retention. One recent study found that students who completed a writing task with ChatGPT engaged in fewer metacognitive processes than students working without LLM support. Another found that students using ChatGPT for problem-solving exerted less effort and were more likely to overestimate their performance. A recent assessment of adults’ critical thinking skills, along with their self-reported generative AI usage habits, found that higher self-reported levels of AI tool usage corresponded with lower critical thinking scores: “as individuals increasingly offload cognitive tasks to AI tools, their ability to critically evaluate information, discern biases, and engage in reflective reasoning diminishes.” Finally, a team of Microsoft researchers found that among knowledge workers, generative AI “tools reduce the perceived effort of critical thinking while also encouraging over-reliance on AI, with confidence in the tool often diminishing independent problem-solving.” It is also worth emphasizing that interactions with generative AI tools have been shown to impact users’ social and emotional judgments, amplifying their previously held biases.

It has always been challenging to persuade busy, tired, stretched-thin students that doing the work will ultimately help them achieve their goals, particularly when they have been taught to measure success in grades and have heard much discussion of the diminishing value of higher education. Though some instructors are pursuing creative ways to implement chatbots in the classroom, I contend that for research and writing, it is crucial for students to understand what these commercial products do and (just as crucially) what they do not do. To cultivate critical AI literacies, one needs to see through the hype. Despite much blather to the contrary, LLM-based products are neither “magic” nor miraculous replicas of the human brain.

The constant onslaught of hype from AI companies, often implying that AGI is just around the corner, conceals a track record of uptake that is in reality underwhelming, given the billions these companies lose by providing these resource-intensive tools below cost. It is only by helping our students to develop practical understandings of how LLM systems generate what is called “research” that we empower them to assess the true capabilities, limitations, and consequences of these products and to contend with their social, educational, and environmental implications.

[i] Ed-tech hype about new technologies providing “personalized learning” is nothing new; this rhetoric has been employed to promote classroom adoption of tech products since the 1920s.

[ii] Other assessments of LLM-generated summaries have found similar results. A 2023 assessment of LLM web search tools determined that the systems frequently “decontextualize[d]” and misrepresented information from cited web sources and “obscure[d] the provenance of information.” When the BBC asked its journalists to assess summaries of over 300 news articles produced by ChatGPT-4o, Microsoft Copilot Pro, Google Gemini, and Perplexity, they found that over half contained significant errors or distorted content. Likewise, a recent audit of eight chatbots with web search capabilities found that these tools failed to produce accurate citations for quoted news material over 60% of the time. One particularly compelling illustration of the shortcomings of LLM text-summarizing tools comes from a recent evaluation of custom LLM legal research assistant products produced by LexisNexis and Thomson Reuters. These products purport to provide users with “hallucination-free” source summaries and legal information, grounded in authentic citations from these vendors’ massive databases of case law and legal reports, but when researchers probed these tools, they found that they produced incorrect information for between 17% and 34% of queries. The most concerning errors were those that identified real legal citations but produced inaccurate summaries: as the researchers emphasized, these errors would be difficult to spot without careful fact-checking and a strong foundation of legal expertise.