‘We’ve worked hard to figure out what LLMs can and can’t do,’ says Carousel founder Adam Boxer


The integration of artificial intelligence into various sectors has been a hot topic for years, and education is no exception. 

Carousel Learning, an innovative edtech company, has recently conducted a significant research project examining the potential of AI in marking students' work. 

In an exclusive chat with ETIH, Adam Boxer, founder of Carousel Learning and a science teacher, shared the motivations, findings, and implications of the groundbreaking study.

The research paper stemmed from the collaborative efforts of AI researcher Owen Henkel and EdTech funder Libby Hills, who have a well-established presence in the field.

 Henkel and Hills have previously published research on using large language models (LLMs) for marking student work in Ghana and co-host the sector-leading Ed-Technical podcast. 

Recognising the potential for further exploration, the pair saw partnering with Carousel Learning as a natural next step.

“We’ve always been fans of Owen and Libby through their fantastic podcast Ed-Technical, and they published a paper showing the promise of LLMs in marking students’ work, so we thought it would be a match made in heaven,” shared Boxer.

 The collaboration provided a strong foundation for investigating AI's capabilities in an educational context.

Hypotheses and initial testing

Before launching the full-scale trial, Carousel Learning conducted internal tests to gauge AI's effectiveness in marking. 

“We had done some internal testing ourselves before the trial, and essentially found that as the complexity of the answer increased, the effectiveness of the AI decreased. I therefore thought that the trial would show positive results, but it’s about how positive and in which contexts,” Boxer noted.

The preliminary findings set the stage for a more comprehensive study to determine not just how effective AI could be, but the specific contexts in which it could be beneficial. The goal was to see if AI could complement human markers, especially for simpler, more straightforward assessments.

Key Findings 

The central question driving this research was: Can large language models (LLMs) mark students’ work accurately, given their propensity to make mistakes? This question is crucial as many believe LLMs could potentially save significant amounts of teacher time and deliver more feedback to students than conventional methods.

To explore this, Carousel Learning selected 12 questions from history and science across key stages 2, 3, and 4, ensuring a mix of difficulty levels. They gathered 1,710 student responses and involved nearly 40 teachers to mark them "blind." These marks were then compared with the same responses fed into several LLMs.

The study revealed that GPT-4 had the strongest performance among the models tested. Although GPT-4 wasn't quite as accurate as human teachers, it was close, showing broad agreement with the teachers' assessments. 

Teachers agreed with each other 87% of the time, highlighting that even human grading isn't perfect. GPT-4 agreed with the teachers' marks 85% of the time, a comparable overall level of accuracy. While it took teachers approximately 11 hours to mark the work, GPT-4 completed the task in about 2 hours. Interestingly, GPT-4's performance was consistent across the 12 different types of questions.
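As a simple illustration of the agreement metric described above (not the study's actual analysis code, and using made-up marks), two markers' agreement rate is just the fraction of responses on which they gave the same mark:

```python
def agreement_rate(marks_a, marks_b):
    # Fraction of responses where the two markers gave the same mark.
    matches = sum(a == b for a, b in zip(marks_a, marks_b))
    return matches / len(marks_a)

# Hypothetical binary marks (1 = correct, 0 = incorrect) from two markers.
teacher = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
model   = [1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
print(f"{agreement_rate(teacher, model):.0%}")  # → 90%
```

The study's 87% teacher–teacher and 85% teacher–GPT-4 figures are percentages of exactly this kind, which is why the two-point gap reads as "close to human" rather than a large shortfall.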

However, it is worth noting that the study only investigated 12 questions, which, while varied in difficulty, key stage, and subject, were all relatively short answers. The effectiveness of GPT-4 in handling longer or more complex responses remains uncertain. Additionally, the model's "temperature" was set to zero to reduce the likelihood of errors, a setting not typically available in standard interfaces like ChatGPT.

Furthermore, each question was marked by two teachers who occasionally disagreed, often in legitimate "edge cases." This means that when GPT-4 was deemed incorrect, it often handled these edge cases differently from the teachers but could still justify its output.

Broader context in edtech

The broader EdTech landscape is witnessing a rush towards incorporating AI into educational tools. However, Boxer emphasised a cautious approach, “Everybody is building AI into their products at the moment (reflecting broader changes in the tech sector in general). I don't necessarily think jumping straight in is the right course. 

“I think we need to think carefully about the contexts in which AI will actually be useful. What specific problem are we trying to solve? How well does AI solve it? Or are we just putting it in because it's cool and sexy and everybody's doing it?”

One of the central questions surrounding AI in education is its potential impact on teachers' workloads and the overall educational process. Boxer acknowledged the uncertainties: “We don't know, and nobody knows yet. LLMs make mistakes - this is how they are designed. They almost have to make mistakes in order to be the way they are. We don't yet know if it's possible to get them to a point where they are working reliably enough to trust students' education with them.”

As for Carousel Learning, the research is part of a broader, pragmatic approach to AI integration. “At Carousel our approach to LLMs has been neither optimistic nor sceptical - we’re trying to be pragmatic,” explained Boxer. 

Future directions

The findings from this study suggest that AI has the potential to play a significant role in education, particularly in automating simpler tasks and providing additional support to teachers. However, the path forward requires careful consideration and rigorous testing. Carousel Learning's approach, focusing on gradual improvements and responsible use, shows how edtech companies can navigate the complexities of AI integration.

As Boxer aptly summarised, “We’ve worked hard to figure out what LLMs can and can’t do, and we aren’t going to just shove one into the system because it’s cool and attractive. 

“We think that LLMs could have an important role to play in gradually improving education in a general sense, and hope to be a responsible, robust and cautious part of that improvement.”
