A Stanford University study found that hallucinations, the tendency of large language models (LLMs) to produce content that deviates from actual facts or from well-established legal principles and precedents, occurred in 69% to 88% of responses to specific legal queries.
The study ran 200,000 queries against each of the GPT-3.5, Llama 2, and PaLM 2 models. Although these generative AI systems have reportedly passed bar exams, they failed at basic tasks routinely performed by junior attorneys. For example, when asked to assess the precedential relationship between two cases, most LLMs performed no better than random guessing. When answering queries about a court's core ruling (its holding), the models hallucinated at least 75% of the time.
The risks of using LLMs for legal research are especially high for:
- Litigants in lower courts or in less prominent jurisdictions
- Individuals seeking detailed or complex legal information
- Users formulating questions based on incorrect premises
- Those uncertain about the reliability of LLM responses
The findings are particularly concerning given that dozens of legal tech startups and law firms claim to be using AI to deliver better, more efficient legal services. In light of such poor performance on these tests, anyone using AI or LLMs for legal work should exercise extreme caution. The law, it appears, requires more intelligence than artificial intelligence currently offers.