What affects chatbots' ability to solve logical problems?

How well do large language models solve logical problems, and what affects their ability to reason? Researchers have developed a new method to better understand when and why their reasoning falls short.

A new method could offer insight into how well large language models like chatbots solve logical problems, and what affects their reasoning capabilities. Photo: Mostphotos
Bjørklund, Petter petter.bjorklund@uit.no Communications Adviser / Machine Learning
Published: 24.02.26 09:25 Updated: 24.02.26 09:35
Technology

Exercise tips, dinner suggestions, or help with school or work. Large language models (LLMs) like ChatGPT, DeepSeek and Gemini are designed to help us with many different tasks and problems.

But how good are they at solving logical problems? And what affects their reasoning capabilities?

This is what researcher Daniel Kaiser explores in a recently published study. He is researching LLMs as part of his doctoral project at UiT's Machine Learning Group and Integreat – Norwegian Centre for Knowledge-Driven Machine Learning.

In the study, he has developed a method to examine the logical problem-solving and reasoning skills of LLMs. His research is published as a conference paper for ICLR 2026.

Uncovers hidden limitations in LLMs

While LLMs have become a useful technology with several benefits, they are known to make mistakes, sometimes catastrophic ones.

Researcher Daniel Kaiser says his method, CogniLoad, can help us detect and understand the limitations LLMs face when solving logical problems. Photo: Private
In 2025, ChatGPT made up 11 of the 18 sources that Tromsø municipality used for a school structure report – a typical example of a hallucination.

"One should never blindly trust what LLMs say – even if it appears true or convincing. It's important to always double-check and verify their answers," Kaiser warns us.

He believes the method, named CogniLoad, can help detect and understand the limitations models face when solving logical problems.

"It is made to help us understand why certain LLMs excel or fall short on different tasks," he adds.

Not all LLMs are good at the same task

An LLM's design, such as its model size and training data, determines its ability to help us with a particular problem. In other words, not every LLM is equally suited for the same task.

"There are huge differences in what different LLMs are capable of. Big models like ChatGPT's GPT-5 model tend to excel at advanced problems, while smaller models like Meta's LLaMA models are more suited for easier ones," Kaiser explains.

But it isn't always obvious what certain LLMs are best or worst at. The models' complex structure also makes it difficult to understand where potential mistakes come from.

Valuable knowledge

Therefore, it's important to find out what different LLMs can and cannot do, regardless of how advanced they are. Even the most advanced models can still make mistakes, despite their confident tone.

"A test like CogniLoad can help pinpoint where a model's reasoning breaks down. This makes it possible to examine what kinds of logical mistakes LLMs make," Kaiser says.

This knowledge is valuable in many different ways.

"We can use this information to understand what these models struggle the most with. Developers can use this to adjust their models to make them better," he adds.

Kaiser's method, CogniLoad, involves giving LLMs a logical riddle to solve. Photo: Mostphotos

Logical riddle

CogniLoad involves giving LLMs a logical riddle. It starts by describing a situation with several people and facts about them, like what they are wearing or what music they last listened to.

Then the model is given a series of statements that repeatedly change the situation. At the end, the chatbot is asked one specific question about a person, such as what color their socks are.

"To get it right, the chatbot has to keep track of all these changes from start to finish without making any mistakes," Kaiser explains.

Kaiser can adjust the riddle to make it more challenging, such as by increasing its length or complexity, or by adding more irrelevant information. This tunability is designed to reveal what aspects of the riddle affect the LLM's ability to solve it.
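The structure described above, an initial situation, a stream of state-changing statements mixed with irrelevant ones, and one final question, can be sketched as a small generator. Everything in this sketch (person names, the sock-color attribute, the sentence templates, and the parameter names) is illustrative and not taken from the paper; the three knobs loosely mirror the tunable length, difficulty, and distractor density the article mentions:

```python
import random

def make_puzzle(n_people=3, n_steps=10, distractor_rate=0.3, seed=0):
    """Generate a CogniLoad-style state-tracking riddle (illustrative sketch,
    not the paper's actual templates). n_steps controls length;
    distractor_rate controls how many statements are irrelevant noise."""
    rng = random.Random(seed)
    people = [f"Person{i + 1}" for i in range(n_people)]
    colors = ["red", "blue", "green", "yellow"]

    # Initial situation: every person starts with a sock color.
    state = {p: rng.choice(colors) for p in people}
    lines = [f"{p} is wearing {c} socks." for p, c in state.items()]

    for _ in range(n_steps):
        if rng.random() < distractor_rate:
            # Distractor: a true statement that is irrelevant to the question.
            lines.append(f"{rng.choice(people)} listened to jazz this morning.")
        else:
            # State change: overwrite one person's attribute.
            p, c = rng.choice(people), rng.choice(colors)
            state[p] = c
            lines.append(f"{p} changes into {c} socks.")

    target = rng.choice(people)
    question = f"What color are {target}'s socks now?"
    return "\n".join(lines), question, state[target]  # text, question, gold answer

text, question, gold = make_puzzle()
```

Because the generator replays every update itself, it always knows the correct final answer, so a model's response can be graded automatically while the puzzle's length and noise level are varied independently.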

The method is based on Cognitive Load Theory, which states that how hard our brain has to work affects our ability to solve different tasks.

"When we have too much to keep in mind at once, it becomes harder to reason carefully and avoid mistakes. Since AI systems like LLMs are designed to imitate human intelligence, we wanted to look at how different types of cognitive load affect an LLM's reasoning abilities," he adds.

Tested on ChatGPT, DeepSeek and Gemini

Kaiser tested the method on 22 different LLMs – both open and commercial models like ChatGPT, DeepSeek, and Gemini.

"The point was to see what kinds of pressure these different models handle well, and what kinds make them struggle," Kaiser explains.

Findings show that the method can provide unique insight into how these LLMs process and solve logical problems – regardless of their size.

"They show that we can apply this method to all these different models to understand what affects their reasoning capabilities," says Kaiser.

Similarities to human intelligence

The results reveal some interesting similarities between how humans and LLMs process information.

"We see that factors such as length, complexity and noise do in fact affect the LLMs' ability to solve logical problems. Just like when humans are exposed to different forms of cognitive load," Kaiser says.

Even the biggest LLMs struggled when the task was made longer or more difficult.

"It's a reminder that even when the best chatbots sound confident and fluent, they can still lose track of important details and end up wrong," Kaiser says.

Model size plays an important role

Adjusting the riddle's length caused the most issues for the LLMs. However, the size of the models also plays an important role.

"The longer the puzzle got, the harder it became for many models to give an accurate answer. We see that smaller models tended to struggle sooner, while bigger models could follow the chain for longer," he says.

"But eventually, the best models started making more mistakes when the task became quite lengthy," Kaiser adds.

He observes a similar pattern when tuning the riddle's complexity.

"Accuracy also dropped off when the statements became more detailed and harder to follow," Kaiser says.

CogniLoad is not meant to measure what LLMs already know, Kaiser explains – but to study how well they reason when encountering new information.

"It is not a test of knowledge where we quiz the LLMs about facts they are supposed to remember. In this case, we look specifically at how well the models do when facing a problem they've never seen before," he says.

Is Artificial General Intelligence closer than we think?

AI systems develop rapidly, and some people fear they will soon match or surpass human intelligence – achieving so-called Artificial General Intelligence (AGI).

While CogniLoad doesn't provide a clear answer about the future, Kaiser's research still suggests that this imagined scenario is far beyond the horizon.

"Even puzzles that sound simple can become difficult for today's models when you make them longer and harder to follow. The riddle should actually be pretty simple for an LLM to solve, so it's quite fascinating to see that even the most advanced LLMs found it challenging when we increased the difficulty," Kaiser says.

Both small and more advanced LLMs still have plenty of room for improvement.

"It shows in a way how far away even today's AI models are from achieving AGI," Kaiser laughs.

Reference: 

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya & Benjamin Ricaud: CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density. The Fourteenth International Conference on Learning Representations (ICLR 2026) (preprint).

About Integreat – Norwegian Centre for Knowledge-Driven Machine Learning

  • A Norwegian Centre for Excellence (SFF) that aims to make machine learning more sustainable, accurate, trustworthy, and ethical.

  • By combining mathematical and computational cultures, Integreat's research seeks to solve fundamental problems in science, technology, health and society.

  • The centre is headed by the University of Oslo, UiT – The Arctic University of Norway, and the Norwegian Computing Center.

Read more about Integreat on www.integreat.no.

 
