One of the most challenging aspects of generative AI for legal education has been its potential use in assessment. Several polls have found that large numbers of students admit to using AI in their assessments. Anecdotally I’ve heard as much from my own students: while few would admit to using it to write their essays, many are open about using it for research, to help with their structure, to bounce ideas off, and to get inspiration.

The status of AI use in education varies from one institution to another. In most UK institutions, AI use in assessment is forbidden because it is considered personation; in other words, the student didn’t write the assessment themselves. There have been various attempts to agree standards on how to handle this, and how to communicate it to students, but for the most part, if you are caught using AI you will be failed, and you may face disciplinary sanctions. Students who rely heavily on AI also face the fact that it often hallucinates and gets things wrong, which could translate into a low mark or even a fail due to errors in the assessment, even if the AI use itself is never detected.

The problem of course is detection.

So far, it has been relatively easy to detect bad AI use in essays, particularly as earlier LLMs such as ChatGPT relied heavily on specific words such as “delve”. LLMs also have a certain tone, a flat, boring voice that experienced markers have learned to identify. AIs also tend to hallucinate references, so a paper that cites articles, cases, and books that don’t exist can be a sure sign of LLM use. But more sophisticated use of AI is harder to spot: AI detectors are unreliable, and some students may use it for some tasks and not for others.

I won’t get into the more philosophical and important questions that arise from the use (and prohibition) of AI in legal education. What sort of legal professionals are we even trying to train? Won’t they be using LLMs anyway in their everyday practice? Should we even be policing this when we have no resources, and academics are famously underpaid and overworked? UK higher education institutions are already suffering greatly from a funding crisis due to visa restrictions, so enforcement places a lot of stress on already stressed professionals. Sorry, I have no answers to many of those important questions; I’m just a nerdy academic who likes technology.

Whatever side of the debate you fall on, one thing is clear: AI is getting better, all the time. My blog post on AI detection was written 11 months ago, and the models have already moved on so much that what I wrote back then is unrecognisable. The current models, particularly reasoning and research models, produce text that is better written and better structured, and even the rawest output from a research model is getting dangerously close to a pass mark. So I ran an experiment with OpenAI’s new agent mode called “Deep Research”. This is a search agent that runs on an early version of the o3 reasoning model; it is designed to conduct web searches, find authoritative sources, and use them to answer a research question. I chose it because I think a model such as this is the most likely to produce a passable essay, and because it searches for authoritative online sources, it is less likely to hallucinate references.

So I gave the model three copyright-related questions that I could ask in my class to see how it fared. I was pretty impressed by some of the results, but I’m not sure I would have passed them based on general marking standards for a 3rd year undergraduate (knowledge and understanding, research, creative and critical thinking, presentation and communication). While the answers excel in knowledge and understanding, they would be marked down for being insufficiently critical and analytical in places (too descriptive), and marked down considerably on sources (too many online sources) and on the bullet-point presentation, but they would do very well on communication.

At the very least, they are right on the edge of a pass for an undergraduate essay, making a couple of assumptions. Firstly, I am assuming that each of the links is a footnote, and that a student would use the proper referencing standard. Secondly, I’m ignoring the most obvious tell-tale sign of an AI output: the over-use of bullet points and numbered lists. These would probably fail as clearly AI-written on the formatting alone, so I am assuming that a competent student would convert all of these into paragraphs.

So again, setting aside the actual presentation and judging just what is written, I think the models are getting there. Here they are:

  1. “Write an analysis of voice copyright in the UK and Europe, paying special attention to performer’s rights.” Here’s the output. I would probably give this a 48 (with the caveats above). Good knowledge, too descriptive and pretty basic sources.

  2. “Write an analysis of the current UK government consultation on copyright and AI. What are the options on the table, and what is likely to be the end result?” Here’s the output. This is the best of the bunch, and would probably mark at 52 (with caveats). Properly formatted and with better sources, this could even go higher. The policy recommendation section at the end is another tell-tale sign of AI use.

  3. “Critically analyse whether artificial intelligence can generate works capable of copyright protection. Use UK and US law in your comparative analysis.” Here is the output. This is the poorest of the three from an analysis perspective, but the best with primary and secondary sources. I’d give it a 42. Still a pass, but just about.

Concluding

I know I’m opening myself to criticism with my marking standards. Am I being too harsh? Too lenient? Without the two concessions, these would have been clear fails, but I don’t think it’s too difficult to assume that most students would be able to turn the above text into more-than-passable essays, and that is what I am trying to get at. We’re probably at the point at which the written essay is no longer a viable mode of assessment. What next? Vivas? Written exams?

Last year we had a lot of submissions that were clearly written by AI. This year? Not so much. It could be that students aren’t using the technology, or it could be that we’ve finally crossed the “pass” boundary. So I’ll leave you with this meme.

