
Reasoning in LLMs: Prompting Strategies & Dynamic Environments

This blog post summarizes a study focused on evaluating the reasoning and adaptive capabilities of Large Language Models (LLMs) in dynamic environments, moving beyond traditional static benchmarks. The research systematically examines the impact of self-reflection, heuristic mutation, and planning as prompting techniques on various LLMs.

Problem Definition

  • The study addresses the challenge of evaluating the reasoning and adaptive skills of LLMs in dynamic environments.
  • It moves beyond static benchmarks to assess how well LLMs can autonomously learn and adapt in complex, interactive text-based games.
  • The research investigates in-context mechanisms for continuous learning and multi-step reasoning in LLM agents.
  • Related work includes techniques like Chain-of-Thought (CoT), Self-Refine, ReAct, Reflexion, AutoPlan, DEPS, RCI, EvoPrompt, PLUM, and LLaMEA, which aim to improve LLM reasoning and performance through various iterative and evolutionary strategies.
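All of these related techniques are prompt-level interventions. As a point of reference, the simplest of them, Chain-of-Thought, just asks the model to lay out intermediate reasoning before committing to an answer. Here is a minimal sketch in Python, with `llm` standing in for any text-completion call (a hypothetical placeholder, not code from the study):

```python
# Minimal Chain-of-Thought sketch: the model is prompted to reason step by
# step before giving a final answer. `llm` is a hypothetical completion call.

def chain_of_thought(llm, question: str) -> str:
    prompt = (f"Question: {question}\n"
              "Let's think step by step, then state the final answer "
              "on a line beginning with 'Answer:'.")
    return llm(prompt)
```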

Proposed Solution

  • The study introduces a framework where an agent interacts with an environment, receives rewards, and transitions to the next state.
  • The agent employs three key modules (a sketch of how they fit together follows this list):
    • Reflection: Retrospective analysis of the agent’s trajectory. Inspired by Reflexion, this module reviews past sequences of state, action, reward, and next state to identify necessary adjustments.
    • Oracle: Generates heuristics from past reflections to optimize the agent’s policy using a (1+1) evolutionary strategy. This involves mutating heuristics based on performance feedback.
    • Planner: Simulates possible action sequences and selects the action with the highest expected cumulative reward.
  • The LLMs used in the study include LLAMA3-8B, MISTRAL-NEMO-12B, DEEPSEEK-R1-14B, and LLAMA3.3-70B.
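Putting the pieces together, the sketch below shows how the three modules could fit into a single agent loop. The function names (`llm`, `reflect`, `mutate_heuristic`, `plan`) and the environment interface (`reset`, `step`, `valid_actions`, `done`) are hypothetical stand-ins for the paper's implementation, not its actual code.

```python
# Minimal sketch of the agent loop described above; an illustration of the
# idea, not the paper's implementation.

def llm(prompt: str) -> str:
    """Placeholder for a call to one of the evaluated models (e.g. LLAMA3-8B)."""
    raise NotImplementedError

def reflect(trajectory) -> str:
    """Reflection: retrospective analysis of (state, action, reward, next_state) tuples."""
    history = "\n".join(f"s={s}, a={a}, r={r}, s'={s2}" for s, a, r, s2 in trajectory)
    return llm("Review this trajectory and suggest adjustments:\n" + history)

def mutate_heuristic(heuristic: str, reflection: str) -> str:
    """Oracle: mutate the current heuristic using the latest reflection."""
    return llm(f"Current heuristic:\n{heuristic}\nReflection:\n{reflection}\n"
               "Propose an improved heuristic for future episodes.")

def plan(state, heuristic: str, actions) -> str:
    """Planner: simulate candidate action sequences and return the action with
    the highest expected cumulative reward."""
    return llm(f"State: {state}\nHeuristic: {heuristic}\n"
               f"Simulate outcomes for each of {actions} over the next few steps "
               "and answer with the single best action.")

def run_episode(env, heuristic: str):
    state, trajectory, total_reward = env.reset(), [], 0.0
    while not env.done:
        action = plan(state, heuristic, env.valid_actions)
        next_state, reward = env.step(action)
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
    return trajectory, total_reward

def train(env, generations: int = 10) -> str:
    # (1+1) evolutionary strategy over heuristics: one parent and one mutated
    # child per generation; the child replaces the parent only if it scores
    # at least as well.
    parent = "No heuristic yet."
    for _ in range(generations):
        trajectory, parent_score = run_episode(env, parent)
        reflection = reflect(trajectory)
        child = mutate_heuristic(parent, reflection)
        _, child_score = run_episode(env, child)
        if child_score >= parent_score:
            parent = child
    return parent
```

The (1+1) strategy keeps exactly one incumbent heuristic and one mutated candidate per generation, which keeps the number of extra LLM calls and evaluation episodes small.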

Results

  • The study used SmartPlay environments as benchmarks, including:
    • Bandit: Tests the balance between exploration and exploitation (a toy sketch of this task appears after the findings below).
    • Rock Paper Scissors (RPS): Assesses probabilistic reasoning against an opponent.
    • Tower of Hanoi: Evaluates planning and spatial reasoning.
    • Messenger: Examines text understanding, spatial reasoning, and enemy avoidance.
  • Key findings include:
    • Larger models generally achieve higher scores, but strategic prompting can narrow the performance gap between smaller and larger models.
    • Complex prompting can lead to performance degradation compared to the baseline, especially in simpler tasks like TwoArmedBandit.
    • Advanced prompting can significantly improve performance on tasks like Tower of Hanoi but can also lead to substantial drops in scores.
    • In the Messenger environment, Reflection + Planner provides the biggest gain for smaller models, but object misidentification and poor spatial awareness frequently lead to agent failures.
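To make the benchmark interaction concrete, here is a toy sketch of what a two-armed bandit episode could look like as a text interface. The class and prompt wording are illustrative assumptions, not SmartPlay's actual API.

```python
# Illustrative sketch of a text-based two-armed bandit task, with a
# hypothetical prompt format; SmartPlay's real interface differs in detail.

import random

class TwoArmedBandit:
    """Each arm pays 1 with a fixed hidden probability, 0 otherwise."""
    def __init__(self, p_left: float = 0.3, p_right: float = 0.7, horizon: int = 20):
        self.probs = {"left": p_left, "right": p_right}
        self.horizon = horizon
        self.t = 0

    @property
    def done(self) -> bool:
        return self.t >= self.horizon

    def observation(self, last_action=None, last_reward=None) -> str:
        # The agent only ever sees text; the payout probabilities stay hidden.
        if last_action is None:
            return "You face two slot machines, 'left' and 'right'. Pull one."
        return (f"You pulled '{last_action}' and received reward {last_reward}. "
                "Pull 'left' or 'right'.")

    def step(self, action: str) -> int:
        self.t += 1
        return 1 if random.random() < self.probs[action] else 0

# The agent must balance exploration (sampling both arms) against exploitation
# (sticking with the better-paying arm) from text feedback alone.
```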

Additional Analysis

  • Modifications to the Hanoi environment, such as showing valid actions, reduce illegal moves.
  • In the Messenger environment, removing object synonyms yields marginal gains, while reward shaping consistently boosts pickup rates.
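The sketch below illustrates, under assumed helper names and bonus values, what these two modifications amount to in practice: enumerating legal Hanoi moves directly in the prompt, and adding dense shaping terms to Messenger's sparse reward.

```python
# Rough illustration of the two environment modifications, using hypothetical
# helper names and coefficients; the study's exact setup is not given in the post.

def hanoi_prompt(state: str, valid_moves) -> str:
    """Hanoi modification: list the currently legal moves in the prompt so the
    model does not have to infer them, which reduces illegal moves."""
    moves = ", ".join(f"move top disk from rod {a} to rod {b}" for a, b in valid_moves)
    return f"Current towers: {state}\nValid actions: {moves}\nChoose one valid action."

def shaped_messenger_reward(base_reward: float, picked_up_message: bool,
                            prev_distance: float, distance_to_goal: float) -> float:
    """Messenger modification: dense, task-aligned reward shaping. A small bonus
    for grabbing the message and for moving closer to the goal is added to the
    environment's sparse base reward (the coefficients here are illustrative)."""
    reward = base_reward
    if picked_up_message:
        reward += 0.5
    reward += 0.1 * (prev_distance - distance_to_goal)
    return reward
```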

Importance

  • The research highlights the limitations in general reasoning capabilities of current LLMs and the need for dynamic benchmarks.
  • It demonstrates that while strategic prompting can improve performance, excessive reasoning can harm smaller models on simple tasks.
  • The study also shows that dense, task-aligned reward signals improve an agent’s decision-making.
  • The variability in results underscores the brittleness of current techniques, with little evidence for emergent reasoning or self-learning.