
Reasoning in LLMs: Prompting Strategies & Dynamic Environments

This blog post summarizes a study focused on evaluating the reasoning and adaptive capabilities of Large Language Models (LLMs) in dynamic environments, moving beyond traditional static benchmarks. The research systematically examines the impact of self-reflection, heuristic mutation, and planning as prompting techniques on various LLMs.

Problem Definition

  • The study addresses the challenge of evaluating the reasoning and adaptive skills of LLMs in dynamic environments.
  • It moves beyond static benchmarks to assess how well LLMs can autonomously learn and adapt in complex, interactive text-based games.
  • The research investigates in-context mechanisms for continuous learning and multi-step reasoning in LLM agents.
  • Related work includes techniques like Chain-of-Thought (CoT), Self-Refine, ReAct, Reflexion, AutoPlan, DEPS, RCI, EvoPrompt, PLUM, and LLaMEA, which aim to improve LLM reasoning and performance through various iterative and evolutionary strategies.
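All of these related techniques are prompt-level interventions. As a point of reference, the simplest of them, Chain-of-Thought, just asks the model to lay out intermediate reasoning before committing to an answer. Here is a minimal sketch in Python, with `llm` standing in for any text-completion call (a hypothetical placeholder, not code from the study):

```python
# Minimal Chain-of-Thought sketch: the model is prompted to reason step by
# step before giving a final answer. `llm` is a hypothetical completion call.

def chain_of_thought(llm, question: str) -> str:
    prompt = (f"Question: {question}\n"
              "Let's think step by step, then state the final answer "
              "on a line beginning with 'Answer:'.")
    return llm(prompt)
```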

Proposed Solution

  • The study introduces a framework where an agent interacts with an environment, receives rewards, and transitions to the next state.
  • The agent employs three key modules (a sketch of how they fit together follows this list):
    • Reflection: Retrospective analysis of the agent’s trajectory. Inspired by Reflexion, this module reviews past sequences of state, action, reward, and next state to identify necessary adjustments.
    • Oracle: Generates heuristics from past reflections to optimize the agent’s policy using a (1+1) evolutionary strategy. This involves mutating heuristics based on performance feedback.
    • Planner: Simulates possible action sequences and selects the action with the highest expected cumulative reward.
  • The LLMs used in the study include LLAMA3-8B, MISTRAL-NEMO-12B, DEEPSEEK-R1-14B, and LLAMA3.3-70B.
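Putting the pieces together, the sketch below shows how the three modules could fit into a single agent loop. The function names (`llm`, `reflect`, `mutate_heuristic`, `plan`) and the environment interface (`reset`, `step`, `valid_actions`, `done`) are hypothetical stand-ins for the paper's implementation, not its actual code.

```python
# Minimal sketch of the agent loop described above; an illustration of the
# idea, not the paper's implementation.

def llm(prompt: str) -> str:
    """Placeholder for a call to one of the evaluated models (e.g. LLAMA3-8B)."""
    raise NotImplementedError

def reflect(trajectory) -> str:
    """Reflection: retrospective analysis of (state, action, reward, next_state) tuples."""
    history = "\n".join(f"s={s}, a={a}, r={r}, s'={s2}" for s, a, r, s2 in trajectory)
    return llm("Review this trajectory and suggest adjustments:\n" + history)

def mutate_heuristic(heuristic: str, reflection: str) -> str:
    """Oracle: mutate the current heuristic using the latest reflection."""
    return llm(f"Current heuristic:\n{heuristic}\nReflection:\n{reflection}\n"
               "Propose an improved heuristic for future episodes.")

def plan(state, heuristic: str, actions) -> str:
    """Planner: simulate candidate action sequences and return the action with
    the highest expected cumulative reward."""
    return llm(f"State: {state}\nHeuristic: {heuristic}\n"
               f"Simulate outcomes for each of {actions} over the next few steps "
               "and answer with the single best action.")

def run_episode(env, heuristic: str):
    state, trajectory, total_reward = env.reset(), [], 0.0
    while not env.done:
        action = plan(state, heuristic, env.valid_actions)
        next_state, reward = env.step(action)
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
    return trajectory, total_reward

def train(env, generations: int = 10) -> str:
    # (1+1) evolutionary strategy over heuristics: one parent and one mutated
    # child per generation; the child replaces the parent only if it scores
    # at least as well.
    parent = "No heuristic yet."
    for _ in range(generations):
        trajectory, parent_score = run_episode(env, parent)
        reflection = reflect(trajectory)
        child = mutate_heuristic(parent, reflection)
        _, child_score = run_episode(env, child)
        if child_score >= parent_score:
            parent = child
    return parent
```

The (1+1) strategy keeps exactly one incumbent heuristic and one mutated candidate per generation, which keeps the number of extra LLM calls and evaluation episodes small.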

Results

  • The study used SmartPlay environments as benchmarks, including:
    • Bandit: Tests the balance between exploration and exploitation (a toy sketch of this task appears after the findings below).
    • Rock Paper Scissors (RPS): Assesses probabilistic reasoning against an opponent.
    • Tower of Hanoi: Evaluates planning and spatial reasoning.
    • Messenger: Examines text understanding, spatial reasoning, and enemy avoidance.
  • Key findings include:
    • Larger models generally achieve higher scores, but strategic prompting can narrow the performance gap between smaller and larger models.
    • Complex prompting can lead to performance degradation compared to the baseline, especially in simpler tasks like TwoArmedBandit.
    • Advanced prompting can significantly improve performance on tasks like Tower of Hanoi but can also lead to substantial drops in scores.
    • In the Messenger environment, Reflection + Planner provides the biggest gain for smaller models, but object misidentification and poor spatial awareness frequently lead to agent failures.
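To make the benchmark interaction concrete, here is a toy sketch of what a two-armed bandit episode could look like as a text interface. The class and prompt wording are illustrative assumptions, not SmartPlay's actual API.

```python
# Illustrative sketch of a text-based two-armed bandit task, with a
# hypothetical prompt format; SmartPlay's real interface differs in detail.

import random

class TwoArmedBandit:
    """Each arm pays 1 with a fixed hidden probability, 0 otherwise."""
    def __init__(self, p_left: float = 0.3, p_right: float = 0.7, horizon: int = 20):
        self.probs = {"left": p_left, "right": p_right}
        self.horizon = horizon
        self.t = 0

    @property
    def done(self) -> bool:
        return self.t >= self.horizon

    def observation(self, last_action=None, last_reward=None) -> str:
        # The agent only ever sees text; the payout probabilities stay hidden.
        if last_action is None:
            return "You face two slot machines, 'left' and 'right'. Pull one."
        return (f"You pulled '{last_action}' and received reward {last_reward}. "
                "Pull 'left' or 'right'.")

    def step(self, action: str) -> int:
        self.t += 1
        return 1 if random.random() < self.probs[action] else 0

# The agent must balance exploration (sampling both arms) against exploitation
# (sticking with the better-paying arm) from text feedback alone.
```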

Additional Analysis

  • Modifications to the Hanoi environment, such as showing valid actions, reduce illegal moves.
  • In the Messenger environment, removing object synonyms yields marginal gains, while reward shaping consistently boosts pickup rates.
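The sketch below illustrates, under assumed helper names and bonus values, what these two modifications amount to in practice: enumerating legal Hanoi moves directly in the prompt, and adding dense shaping terms to Messenger's sparse reward.

```python
# Rough illustration of the two environment modifications, using hypothetical
# helper names and coefficients; the study's exact setup is not given in the post.

def hanoi_prompt(state: str, valid_moves) -> str:
    """Hanoi modification: list the currently legal moves in the prompt so the
    model does not have to infer them, which reduces illegal moves."""
    moves = ", ".join(f"move top disk from rod {a} to rod {b}" for a, b in valid_moves)
    return f"Current towers: {state}\nValid actions: {moves}\nChoose one valid action."

def shaped_messenger_reward(base_reward: float, picked_up_message: bool,
                            prev_distance: float, distance_to_goal: float) -> float:
    """Messenger modification: dense, task-aligned reward shaping. A small bonus
    for grabbing the message and for moving closer to the goal is added to the
    environment's sparse base reward (the coefficients here are illustrative)."""
    reward = base_reward
    if picked_up_message:
        reward += 0.5
    reward += 0.1 * (prev_distance - distance_to_goal)
    return reward
```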

Importance

  • The research highlights the limitations in general reasoning capabilities of current LLMs and the need for dynamic benchmarks.
  • It demonstrates that while strategic prompting can improve performance, excessive reasoning can harm smaller models on simple tasks.
  • The study also shows that dense, task-aligned reward signals improve an agent’s decision-making.
  • The variability in results underscores the brittleness of current techniques, with little evidence for emergent reasoning or self-learning.