Zero-Shot, Few-Shot, and Tool-Using Agents: Choosing the Right Prompting Strategy

How to decide between instruction-only, examples, and structured tool calls as complexity grows.

Introduction

The rapid evolution of large language models has fundamentally changed how we build intelligent systems, but it has also introduced a new engineering challenge: choosing the right prompting strategy for your use case. When working with models like GPT-4, Claude, or Llama, you face a critical decision early in your design process—should you rely on pure instructions (zero-shot), provide examples (few-shot), or equip your agent with tools? This decision affects not just your system's accuracy, but its latency, cost, maintainability, and ability to scale.

Unlike traditional software engineering where interfaces and contracts are explicit, prompt engineering exists in a space of probabilistic outputs and emergent capabilities. A prompt strategy that works beautifully for classifying customer support tickets may fail spectacularly when extracting structured data from legal documents. The difference between zero-shot and few-shot isn't just academic—it represents a fundamental trade-off between simplicity and control, between hoping the model "just gets it" and explicitly showing it what you want. As your requirements grow in complexity, understanding when to graduate from simple instructions to examples, and eventually to tool-augmented agents, becomes essential to building reliable production systems.

This article provides a practical framework for choosing among these strategies based on task complexity, consistency requirements, and available resources. We'll explore the technical underpinnings of each approach, examine real-world implementation patterns, and build a decision tree you can apply to your own projects. By the end, you'll understand not just what these strategies are, but when and why to use each one.

Understanding the Three Prompting Strategies

Zero-shot prompting represents the simplest form of interaction with language models. You provide clear instructions about what you want the model to do, and rely entirely on its pre-trained knowledge and reasoning capabilities. A zero-shot prompt might say "Classify the following customer email as urgent, normal, or low-priority" followed by the email text. There are no examples, no demonstrations—just the task description and input. This approach works remarkably well for tasks that align closely with the model's training distribution: summarization, translation, basic Q&A, and common classification tasks. The model draws on patterns it learned during training, applying its general understanding to your specific instance.

The power of zero-shot prompting lies in its simplicity and flexibility. You can deploy it immediately without curating examples or building tool infrastructure. It's particularly effective when working with frontier models that have strong instruction-following capabilities and broad knowledge bases. However, zero-shot prompting has clear limitations. When your task requires specific output formatting, domain-specific reasoning patterns, or consistency across edge cases, pure instructions often fall short. The model might understand what you want in general terms but deviate in subtle ways—using different terminology, organizing output inconsistently, or handling ambiguous cases differently from what your downstream systems expect.

Few-shot prompting addresses these limitations by providing concrete examples of correct behavior within the prompt itself. Instead of just describing what you want, you show the model several input-output pairs demonstrating the pattern. For our customer email classifier, a few-shot prompt would include three to five examples: actual emails paired with their correct classifications. This technique leverages in-context learning—the model's ability to recognize and replicate patterns from the prompt context without any parameter updates or fine-tuning. Few-shot learning emerged as one of the breakthrough capabilities highlighted in the GPT-3 paper, demonstrating that large models could learn new tasks from just a handful of demonstrations.

The effectiveness of few-shot prompting stems from how it grounds abstract instructions in concrete reality. When you show examples, you implicitly communicate nuances that are difficult to express in natural language: the appropriate level of detail, the exact output format, how to handle edge cases, and the subtle distinctions that matter for your use case. Few-shot prompting excels for tasks requiring consistent formatting, domain-specific patterns, or novel combinations of capabilities. The cost is increased token usage (examples take up context window space), the need to carefully curate representative examples, and potential brittleness if your examples don't cover the full distribution of inputs you'll encounter in production.

Tool-using agents represent a qualitatively different paradigm. Rather than relying solely on the model's parametric knowledge and reasoning, you explicitly provide functions or APIs the model can call to accomplish tasks. A tool-using agent might have access to a search_database() function, a get_current_weather() API, or a run_python_code() sandbox. The model reasons about which tools to use, generates properly formatted function calls, receives results, and incorporates that information into its response. This approach, popularized by frameworks like ReAct (Reasoning and Acting) and implemented in production systems through function calling APIs, extends what agents can reliably accomplish beyond pure language manipulation.

Tool use transforms language models from knowledge-bounded systems into orchestration layers that can interact with external resources. A zero-shot or few-shot agent is limited to information in its training data and the prompt; a tool-using agent can query databases, perform calculations, retrieve current information, or execute code with perfect accuracy. This strategy is essential when tasks require up-to-date information, precise computation, access to private data, or actions with side effects. However, it introduces significant complexity: you must design stable tool interfaces, handle tool call failures gracefully, manage multi-step reasoning chains, and deal with the reality that function calling itself isn't always reliable. The model might hallucinate function names, pass malformed arguments, or miss opportunities to use available tools.
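The call-execute-feed-back cycle just described can be sketched with a stubbed model, independent of any particular API (full implementations with real API calls appear later in this article). The names here are illustrative, not a real library interface:

```python
import json

def run_agent(model, tools: dict, user_input: str, max_turns: int = 5) -> str:
    """Generic agent loop: alternate model turns and tool executions."""
    transcript = [{"role": "user", "content": user_input}]
    for _ in range(max_turns):
        turn = model(transcript)  # model decides: tool call or final answer
        if turn["type"] == "final":
            return turn["content"]
        # Execute the requested tool and feed the result back as context
        result = tools[turn["name"]](**turn["args"])
        transcript.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("max turns exceeded")

# Stub model: requests the weather once, then answers from the tool result.
def stub_model(transcript):
    if transcript[-1]["role"] == "user":
        return {"type": "call", "name": "get_weather", "args": {"city": "Tokyo"}}
    data = json.loads(transcript[-1]["content"])
    return {"type": "final", "content": f"It is {data['temp_c']}°C in Tokyo."}

tools = {"get_weather": lambda city: {"temp_c": 18}}
print(run_agent(stub_model, tools, "Weather in Tokyo?"))  # It is 18°C in Tokyo.
```

The essential shape is the same in every production framework: the model's output is interpreted either as an action to execute or as a final answer, and tool results re-enter the context for the next turn.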

When to Use Each Strategy: A Decision Framework

The choice between zero-shot, few-shot, and tool-using agents isn't about which is "best"—it's about matching strategy to task requirements. Consider three key dimensions when making this decision: task complexity, consistency requirements, and knowledge bounds.

Task complexity refers to both the inherent difficulty of the task and how well it aligns with the model's training distribution. Use zero-shot prompting when your task is common and well-defined: summarizing articles, translating between major languages, answering general knowledge questions, or performing basic sentiment analysis. These tasks appear frequently in training data, and modern models handle them reliably without examples. Move to few-shot prompting when tasks require domain-specific patterns or unusual output structures. If you're extracting entities from medical discharge summaries, classifying legal documents by obscure jurisdictional rules, or generating responses in a specific brand voice, examples dramatically improve consistency. Graduate to tool-using agents when tasks require capabilities beyond text processing: mathematical computation, database queries, API interactions, code execution, or multi-step workflows where intermediate verification matters.
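The three dimensions above can be condensed into a rough routing heuristic. This is an illustrative sketch of the decision logic, not a library API, and the predicate names are assumptions:

```python
from enum import Enum

class Strategy(Enum):
    ZERO_SHOT = "zero-shot"
    FEW_SHOT = "few-shot"
    TOOL_USE = "tool-use"

def choose_strategy(
    needs_external_data: bool,      # current or private data, side effects
    needs_exact_computation: bool,  # arithmetic, DB queries, code execution
    needs_strict_format: bool,      # downstream systems parse the output
    task_is_common: bool,           # well-represented in training data
) -> Strategy:
    """Heuristic decision tree mirroring the three dimensions in the text."""
    # Knowledge bounds and determinism trump everything: prompting alone
    # cannot reach external data or guarantee exact computation.
    if needs_external_data or needs_exact_computation:
        return Strategy.TOOL_USE
    # Consistency requirements or unusual tasks call for examples.
    if needs_strict_format or not task_is_common:
        return Strategy.FEW_SHOT
    return Strategy.ZERO_SHOT

# Summarizing an article: common task, loose format
print(choose_strategy(False, False, False, True).value)  # zero-shot
# Resume parsing into a fixed JSON schema
print(choose_strategy(False, False, True, True).value)   # few-shot
# Checking live inventory before responding
print(choose_strategy(True, False, False, True).value)   # tool-use
```

Real decisions involve more nuance than four booleans, but the ordering matters: knowledge bounds are a hard constraint, consistency is a strong preference, and simplicity is the default.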

Consistency requirements might be the most underappreciated factor in this decision. If you can tolerate variations in output format or occasional edge case failures, zero-shot works fine. Many early prototypes and exploratory tools fit this category—a chatbot that mostly gets things right is better than no chatbot. But production systems serving critical business functions demand consistency. A financial report generator can't sometimes use USD and sometimes use "dollars." A resume parser feeding an ATS must produce the same JSON schema every time. Few-shot prompting helps here, but structured output mechanisms (available in modern API offerings) combined with examples work even better. For deterministic operations—calculations, database lookups, running predefined business logic—tools are the only reliable choice. You cannot prompt a model into perfect arithmetic, but you can give it a calculator tool.
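The calculator point can be made concrete. A tool is just a schema the model targets plus a deterministic executor your code runs; the sketch below uses the OpenAI function-calling schema shape, with a toy local executor (the character whitelist around `eval` is for illustration only — a production executor would use a proper expression parser):

```python
import json

# Tool schema in the OpenAI function-calling shape; the model emits a call,
# and your code executes it deterministically.
CALCULATOR_TOOL = {
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression exactly",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "e.g. '1234 * 5678'"}
            },
            "required": ["expression"],
        },
    },
}

def execute_calculate(arguments_json: str) -> str:
    """Deterministic executor for the model's tool call (no LLM involved)."""
    args = json.loads(arguments_json)
    expr = args["expression"]
    # Illustrative guard: only allow arithmetic characters before eval.
    if not all(c in "0123456789+-*/(). " for c in expr):
        raise ValueError("non-arithmetic expression")
    return json.dumps({"result": eval(expr)})

# The model might emit: calculate(expression="1234 * 5678")
print(execute_calculate('{"expression": "1234 * 5678"}'))  # {"result": 7006652}
```

The answer is exact every time, which no amount of prompting can guarantee for in-token arithmetic.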

Knowledge bounds determine whether the information needed to complete the task exists in the model's parameters or must be retrieved externally. Zero-shot and few-shot prompting assume the model either already knows what it needs or can reason from the information in the prompt. This works for general knowledge, language patterns, and reasoning tasks. But when tasks require current data (today's stock prices, recent news), private information (your company's customer database, internal documentation), or specialized knowledge beyond the training cutoff, tool use becomes necessary. A model cannot tell you the weather without a weather API, cannot query your database without a database tool, and cannot execute complex calculations without a computation tool.

Consider these concrete scenarios: A blog post title generator for a tech company? Start with zero-shot—it's a common task. If the titles don't match your brand voice, add 3-5 examples of previous titles (few-shot). A customer support ticket router? Few-shot prompting with 5-10 examples per category will significantly outperform zero-shot. A system that needs to check inventory levels before generating a response? That's tool territory—no amount of prompting can give the model real-time access to your database. An agent answering complex questions that might require looking up documentation, performing calculations, and reasoning through multi-step problems? You need tool use combined with careful prompting (possibly few-shot) to guide the reasoning process.

The strategies also differ in their iteration and maintenance burden. Zero-shot prompts are quick to deploy and easy to modify—change the instruction and you're done. Few-shot prompts require curating and maintaining example sets; as your task evolves, you need to update examples. Tool-using agents demand the most infrastructure: API stability, error handling, versioning, monitoring, and often complex orchestration logic. Choose the simplest strategy that meets your requirements, and plan for evolution—many production systems start zero-shot, graduate to few-shot when consistency matters, and add selective tool use for specific capabilities.

Implementation Patterns and Code Examples

Let's examine realistic implementations of each strategy, starting with zero-shot prompting for a content moderation task. The goal is to identify whether user-generated content violates community guidelines.

import OpenAI from 'openai';

interface ModerationRequest {
  content: string;
  guidelines?: string;
}

interface ModerationResult {
  isViolation: boolean;
  categories: string[];
  explanation: string;
  confidence: 'high' | 'medium' | 'low';
}

async function moderateContent(
  request: ModerationRequest,
  client: OpenAI
): Promise<ModerationResult> {
  const systemPrompt = `You are a content moderation system. Analyze the provided content and determine if it violates community guidelines.

Guidelines:
${request.guidelines || `
- No harassment or hate speech
- No explicit violence or gore
- No adult content
- No spam or misleading information
`}

Respond with a JSON object using exactly these keys:
- "isViolation": boolean
- "categories": array of violated category names (empty if none)
- "explanation": brief explanation
- "confidence": "high" | "medium" | "low"`;

  const response = await client.chat.completions.create({
    model: 'gpt-4-turbo',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: request.content }
    ],
    // json_object mode requires the prompt itself to mention JSON, as above
    response_format: { type: 'json_object' },
    temperature: 0.1
  });

  const raw = response.choices[0].message.content;
  if (!raw) {
    throw new Error('Empty moderation response');
  }
  return JSON.parse(raw) as ModerationResult;
}

// Usage
const result = await moderateContent({
  content: "Check out this amazing deal!!! Click now!!!"
}, openaiClient);

This zero-shot implementation works well for content moderation because it's a task models encounter frequently in training, and the guidelines can be expressed clearly through natural language. Note the use of structured output (response_format) and low temperature for consistency—these are essential zero-shot practices.

Now consider a more complex scenario: extracting structured data from resumes into a specific schema. This task benefits significantly from few-shot prompting because output consistency and field-level accuracy matter greatly.

from anthropic import Anthropic
from typing import List, Dict
import json

RESUME_PARSING_SYSTEM = """You are a resume parser that extracts structured information.
Extract education, work experience, and skills into the exact format shown in the examples."""

FEW_SHOT_EXAMPLES = [
    {
        "input": """John Smith
Senior Software Engineer at TechCorp (2019-2023)
- Built microservices architecture serving 10M users
- Led team of 5 engineers
Education: BS Computer Science, MIT, 2018
Skills: Python, AWS, Kubernetes""",
        "output": {
            "name": "John Smith",
            "work_experience": [
                {
                    "title": "Senior Software Engineer",
                    "company": "TechCorp",
                    "start_date": "2019",
                    "end_date": "2023",
                    "highlights": [
                        "Built microservices architecture serving 10M users",
                        "Led team of 5 engineers"
                    ]
                }
            ],
            "education": [
                {
                    "degree": "BS",
                    "field": "Computer Science",
                    "institution": "MIT",
                    "graduation_year": "2018"
                }
            ],
            "skills": ["Python", "AWS", "Kubernetes"]
        }
    },
    {
        "input": """Maria Garcia | maria@email.com
Product Manager - Acme Inc (Jan 2020 - Present)
* Launched 3 major features with 200K+ users
* Managed cross-functional teams

Education:
MBA, Stanford GSB, 2019
BA Economics, UC Berkeley, 2015

Technical: SQL, Tableau, Jira""",
        "output": {
            "name": "Maria Garcia",
            "work_experience": [
                {
                    "title": "Product Manager",
                    "company": "Acme Inc",
                    "start_date": "2020-01",
                    "end_date": "present",
                    "highlights": [
                        "Launched 3 major features with 200K+ users",
                        "Managed cross-functional teams"
                    ]
                }
            ],
            "education": [
                {
                    "degree": "MBA",
                    "field": "Business Administration",
                    "institution": "Stanford GSB",
                    "graduation_year": "2019"
                },
                {
                    "degree": "BA",
                    "field": "Economics",
                    "institution": "UC Berkeley",
                    "graduation_year": "2015"
                }
            ],
            "skills": ["SQL", "Tableau", "Jira"]
        }
    }
]

def build_few_shot_messages(resume_text: str) -> List[Dict[str, str]]:
    """Constructs the few-shot message list (user/assistant example pairs)."""
    messages = []
    
    # Add few-shot examples as user/assistant pairs
    for example in FEW_SHOT_EXAMPLES:
        messages.append({
            "role": "user",
            "content": f"Parse this resume:\n\n{example['input']}"
        })
        messages.append({
            "role": "assistant",
            "content": json.dumps(example['output'], indent=2)
        })
    
    # Add the actual resume to parse
    messages.append({
        "role": "user",
        "content": f"Parse this resume:\n\n{resume_text}"
    })
    
    return messages

def parse_resume_few_shot(resume_text: str, client: Anthropic) -> Dict:
    """Parse resume using few-shot prompting for consistency."""
    # The Anthropic Messages API takes the system prompt as a separate
    # `system` parameter, not as a message with role "system".
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        temperature=0,
        system=RESUME_PARSING_SYSTEM,
        messages=build_few_shot_messages(resume_text)
    )
    
    return json.loads(response.content[0].text)

# Usage demonstrates consistency across varied input formats
test_resume = """
DAVID CHEN | david.chen@example.com
DevOps Engineer @ CloudScale Systems, March 2021 - Current
- Reduced deployment time by 60% through CI/CD optimization
- Managed infrastructure for 50+ microservices

EDUCATION
MS Computer Science - Georgia Tech (2020)
Specialization: Distributed Systems

SKILLS: Docker, Terraform, Go, Python, AWS
"""

parsed = parse_resume_few_shot(test_resume, anthropic_client)
print(json.dumps(parsed, indent=2))

The few-shot implementation shows a critical pattern: examples are formatted as conversation history, with user messages containing inputs and assistant messages containing the exact desired output format. This teaches the model not just what to extract, but precisely how to structure the response. Note that we use temperature=0 for deterministic behavior—essential for production parsing tasks.

Tool-using agents introduce a different architectural pattern. Rather than generating final outputs directly, the model generates function calls that your system executes, returning results that inform subsequent reasoning.

import OpenAI from 'openai';

// Tool definitions following OpenAI function calling spec
const RESEARCH_TOOLS = [
  {
    type: 'function',
    function: {
      name: 'search_academic_papers',
      description: 'Search for academic papers and research articles on a topic',
      parameters: {
        type: 'object',
        properties: {
          query: { type: 'string', description: 'Search query' },
          year_from: { type: 'number', description: 'Filter by publication year' },
          limit: { type: 'number', description: 'Max results to return' }
        },
        required: ['query']
      }
    }
  },
  {
    type: 'function',
    function: {
      name: 'get_paper_citations',
      description: 'Get citation count and impact metrics for a specific paper',
      parameters: {
        type: 'object',
        properties: {
          paper_id: { type: 'string', description: 'Unique paper identifier' },
          include_recent: { type: 'boolean', description: 'Include recent citations' }
        },
        required: ['paper_id']
      }
    }
  },
  {
    type: 'function',
    function: {
      name: 'calculate_statistics',
      description: 'Perform statistical calculations on numerical data',
      parameters: {
        type: 'object',
        properties: {
          operation: { 
            type: 'string', 
            enum: ['mean', 'median', 'std_dev', 'correlation'],
            description: 'Statistical operation to perform'
          },
          values: { 
            type: 'array', 
            items: { type: 'number' },
            description: 'Numerical values to analyze'
          }
        },
        required: ['operation', 'values']
      }
    }
  }
];

// Tool implementations (simplified for demonstration)
async function executeTool(toolName: string, args: Record<string, any>): Promise<any> {
  switch (toolName) {
    case 'search_academic_papers':
      // In production: call Semantic Scholar, arXiv, or PubMed API
      return {
        results: [
          { id: 'paper_123', title: 'Attention Is All You Need', citations: 75000, year: 2017 },
          { id: 'paper_456', title: 'BERT: Pre-training of Deep Bidirectional Transformers', citations: 50000, year: 2018 }
        ]
      };
    
    case 'get_paper_citations':
      // In production: query citation database
      return { citations: 75000, recent_citations_30d: 250, h_index_contribution: 1.2 };
    
    case 'calculate_statistics': {
      // Braces create a block scope; TypeScript disallows lexical
      // declarations directly inside a case clause.
      const { operation, values } = args;
      if (operation === 'mean') {
        return { result: values.reduce((a: number, b: number) => a + b, 0) / values.length };
      }
      return { result: 0 };
    }
    
    default:
      throw new Error(`Unknown tool: ${toolName}`);
  }
}

async function runResearchAgent(query: string, client: OpenAI): Promise<string> {
  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    {
      role: 'system',
      content: `You are a research assistant that helps analyze academic papers and research trends. 
Use the available tools to gather information and perform analysis. 
Reason step-by-step and use multiple tools as needed to provide comprehensive answers.`
    },
    { role: 'user', content: query }
  ];

  let iterations = 0;
  const MAX_ITERATIONS = 10;

  while (iterations < MAX_ITERATIONS) {
    const response = await client.chat.completions.create({
      model: 'gpt-4-turbo-preview',
      messages,
      tools: RESEARCH_TOOLS,
      tool_choice: 'auto'
    });

    const message = response.choices[0].message;
    messages.push(message);

    // If no tool calls, we have the final answer
    if (!message.tool_calls || message.tool_calls.length === 0) {
      return message.content || 'No response generated';
    }

    // Execute each tool call
    for (const toolCall of message.tool_calls) {
      const functionName = toolCall.function.name;
      const functionArgs = JSON.parse(toolCall.function.arguments);
      
      console.log(`Executing tool: ${functionName}`, functionArgs);
      
      const result = await executeTool(functionName, functionArgs);
      
      messages.push({
        role: 'tool',
        tool_call_id: toolCall.id,
        content: JSON.stringify(result)
      });
    }

    iterations++;
  }

  throw new Error('Max iterations reached without final answer');
}

// Usage: complex query requiring multiple information sources
const answer = await runResearchAgent(
  "What has been the trend in citations for transformer architecture papers from 2017-2023? Compare the top 3 papers.",
  openaiClient
);
console.log(answer);

This tool-using implementation demonstrates several important patterns. First, tool definitions are explicit schemas that the model must conform to—precision matters. Second, we implement an agentic loop: the model can make multiple tool calls, receive results, reason about them, and decide whether it needs more information or can provide a final answer. Third, we include safeguards like iteration limits to prevent infinite loops. Fourth, the system prompt guides the agent's reasoning process without providing examples—often tool-using agents combine zero-shot instruction with structured function access.

The choice of strategy dramatically affects your code architecture. Zero-shot implementations are essentially stateless function calls. Few-shot systems require example management—you might store examples in configuration, version them separately from code, or even implement dynamic example selection (retrieving the most relevant examples for each input). Tool-using agents require robust orchestration, state management across multiple LLM calls, error handling for tool failures, and often persistence to allow resuming interrupted workflows.
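The dynamic example selection mentioned above can be sketched without an embedding service: score each stored example against the incoming input and keep the top k. Here a simple token-overlap (Jaccard) similarity stands in for the embedding similarity most production systems would use; the pool entries are hypothetical:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two texts (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_examples(pool: list, query: str, k: int = 3) -> list:
    """Pick the k stored examples most similar to the incoming input.

    Each pool entry is {"input": ..., "output": ...}; the chosen entries
    are then formatted as user/assistant pairs exactly as in the static case.
    """
    ranked = sorted(pool, key=lambda ex: jaccard(ex["input"], query), reverse=True)
    return ranked[:k]

pool = [
    {"input": "Refund request for damaged blender", "output": "billing"},
    {"input": "App crashes on login screen", "output": "technical"},
    {"input": "Where is my package it is late", "output": "shipping"},
]
chosen = select_examples(pool, "My package never arrived and tracking is stuck", k=1)
print(chosen[0]["output"])  # shipping
```

The payoff is that each request carries only the most relevant examples, improving accuracy on diverse inputs while keeping token overhead bounded.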

Combining Strategies: Progressive Enhancement

In production systems, you rarely use these strategies in isolation. The most effective prompt engineering combines elements of multiple approaches, layering them based on specific subtask requirements. Think of zero-shot as your foundation, few-shot as structural reinforcement for critical joints, and tools as specialized equipment for tasks that exceed the base material's capabilities.

Consider a customer service agent designed to handle complex inquiries. The overall system uses zero-shot prompting for general conversation and empathy—these are natural language capabilities the model handles well without examples. But when the conversation reaches a point where the agent needs to check order status, you inject a tool call to your order management system. When the agent generates a response about return policies, few-shot examples might guide the tone and structure to match your brand voice and legal requirements. A single conversation might flow from zero-shot dialogue, to tool use for data retrieval, back to zero-shot for reasoning, then to few-shot-informed generation for the final response that faces the customer.

This progressive enhancement approach provides flexibility and efficiency. You pay the token cost of examples only for the specific subtasks that need them. You introduce tool complexity only where deterministic operations or external data are necessary. A practical pattern is to start with a zero-shot orchestrator that routes to specialized few-shot or tool-using modules:

interface AgentMessage {
  content: string;
  metadata?: Record<string, any>;
}

interface AgentCapability {
  name: string;
  trigger: (input: string) => boolean;
  execute: (input: string) => Promise<string>;
}

class HybridAgent {
  private client: OpenAI;
  private capabilities: AgentCapability[];

  constructor(client: OpenAI) {
    this.client = client;
    this.capabilities = [
      {
        name: 'order_lookup',
        trigger: (input) => /order|tracking|shipment|delivery/i.test(input),
        execute: this.handleOrderLookup.bind(this)
      },
      {
        name: 'policy_question',
        trigger: (input) => /return|refund|warranty|policy/i.test(input),
        execute: this.handlePolicyQuestion.bind(this)
      }
    ];
  }

  async processMessage(message: AgentMessage): Promise<string> {
    // Zero-shot routing: determine if specialized capability is needed
    const capability = this.capabilities.find(cap => cap.trigger(message.content));
    
    if (capability) {
      console.log(`Routing to specialized capability: ${capability.name}`);
      return await capability.execute(message.content);
    }

    // Default: zero-shot conversational response
    return await this.generateZeroShotResponse(message.content);
  }

  private async handleOrderLookup(input: string): Promise<string> {
    // Tool-using: extract order number, call database, format response
    const orderNumberMatch = input.match(/\b[A-Z]{2}\d{6}\b/);
    
    if (!orderNumberMatch) {
      // Fall back to zero-shot to ask for order number
      return await this.generateZeroShotResponse(
        `User asked about an order but didn't provide order number. Ask for it politely.`
      );
    }

    // Simulate tool call
    const orderData = await this.lookupOrder(orderNumberMatch[0]);
    
    // Use few-shot to format response in brand voice
    return await this.formatOrderResponse(orderData);
  }

  private async handlePolicyQuestion(input: string): Promise<string> {
    // Few-shot: use examples to ensure policy information is conveyed correctly
    const policyExamples = [
      {
        user: "What's your return policy?",
        assistant: "We offer a 30-day return policy on most items. To start a return: (1) Log into your account, (2) Select the order, (3) Click 'Return Item'. Refunds are processed within 5-7 business days. Original shipping costs are non-refundable except for defective items."
      },
      {
        user: "Can I return this after 30 days if it's defective?",
        assistant: "Yes! Defective items can be returned beyond the standard 30-day window under our quality guarantee. Please contact our support team at support@example.com with your order number and photos of the defect. We'll arrange a return and full refund or replacement."
      }
    ];

    return await this.generateFewShotResponse(input, policyExamples);
  }

  private async generateZeroShotResponse(input: string): Promise<string> {
    const response = await this.client.chat.completions.create({
      model: 'gpt-4',
      messages: [
        { role: 'system', content: 'You are a helpful customer service agent.' },
        { role: 'user', content: input }
      ],
      temperature: 0.7
    });
    return response.choices[0].message.content || '';
  }

  private async generateFewShotResponse(
    input: string, 
    examples: Array<{user: string, assistant: string}>
  ): Promise<string> {
    const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
      { role: 'system', content: 'You are a helpful customer service agent.' }
    ];

    // Add examples
    examples.forEach(ex => {
      messages.push({ role: 'user', content: ex.user });
      messages.push({ role: 'assistant', content: ex.assistant });
    });

    messages.push({ role: 'user', content: input });

    const response = await this.client.chat.completions.create({
      model: 'gpt-4',
      messages,
      temperature: 0.3  // Lower temp for consistency
    });
    
    return response.choices[0].message.content || '';
  }

  private async lookupOrder(orderNumber: string): Promise<any> {
    // Tool implementation: database query
    return { 
      orderNumber, 
      status: 'shipped', 
      tracking: 'TRACK123',
      estimatedDelivery: '2026-03-16'
    };
  }

  private async formatOrderResponse(orderData: any): Promise<string> {
    // Use structured output to format tool results
    return `Your order ${orderData.orderNumber} is currently ${orderData.status}. 
Tracking number: ${orderData.tracking}
Estimated delivery: ${orderData.estimatedDelivery}`;
  }
}

This hybrid architecture uses zero-shot routing to identify when specialized capabilities are needed, then applies the appropriate strategy for each subtask. The pattern is common in production systems: a lightweight zero-shot orchestrator delegates to more expensive or complex few-shot and tool-using components only when necessary.

Another powerful combination is few-shot prompting for tool-using behavior itself. When function calling reliability is critical, you can provide examples of correct tool usage within your prompt. This teaches the model not just which tools exist, but when and how to use them effectively:

from openai import OpenAI
import json

TOOL_USAGE_EXAMPLES = """Here are examples of effective tool usage:

User: "What's the weather in Tokyo and should I bring an umbrella?"
Reasoning: Need current weather data and forecast. Should call get_weather, then reason about umbrella.
Action: get_weather(city="Tokyo")
Tool Result: {"temperature": 18, "conditions": "partly cloudy", "precipitation_chance": 20}
Response: "Tokyo is currently 18°C and partly cloudy with only 20% chance of rain. An umbrella is optional but might be good to have just in case."

User: "Compare the weather in Paris and London"
Reasoning: Need weather for two cities. Should make separate calls for each.
Action: get_weather(city="Paris")
Tool Result: {"temperature": 15, "conditions": "rainy", "precipitation_chance": 80}
Action: get_weather(city="London")
Tool Result: {"temperature": 12, "conditions": "foggy", "precipitation_chance": 30}
Response: "Paris is warmer at 15°C but currently rainy (80% precipitation), while London is cooler at 12°C and foggy but drier (30% precipitation). If choosing between them, London has better weather conditions today despite being cooler."
"""

# This example shows the model how to: use tools for current data, make multiple calls when needed, and synthesize results

This few-shot guidance for tool use is particularly valuable when you have complex tool interfaces, when tools should be used in specific sequences, or when you need to teach the agent to choose between multiple overlapping tools. It's a meta-level application of few-shot prompting—not for the final output, but for the intermediate reasoning and action selection process.

Trade-offs, Pitfalls, and Performance Considerations

Each prompting strategy carries distinct trade-offs that become critical at scale. Understanding these trade-offs helps you make informed architectural decisions and avoid common failure modes.

Zero-shot prompting offers the best latency and lowest cost per request. Without examples cluttering your context window, you minimize input tokens and allow faster processing. This matters when you're handling thousands or millions of requests. However, zero-shot prompting has the highest variance in output quality. The same prompt can produce different output structures, terminology, or reasoning approaches across invocations. This non-determinism, even at temperature=0, stems from the model's broad distribution of plausible responses. When zero-shot fails, it often fails silently—the output looks reasonable but is subtly wrong in ways that might only surface downstream. For mission-critical tasks, the cost savings of zero-shot can be overwhelmed by the expense of handling inconsistent outputs or incorrect results.

A common pitfall with zero-shot is over-relying on instruction clarity. Engineers assume that more detailed instructions will yield better results, leading to verbose prompts that paradoxically worsen performance. Models can get lost in lengthy instructions, or latch onto the wrong details. The sweet spot for zero-shot instructions is typically 2-5 sentences of clear, specific direction. Another failure mode is using zero-shot for tasks that require knowledge the model doesn't have—asking it to apply your company's specific coding standards without examples, or to classify content into categories that don't map to common taxonomies.

Few-shot prompting trades token cost and latency for consistency and control. Every request now carries the overhead of your examples—potentially hundreds or thousands of tokens before you even provide the actual input. This increases both cost (you pay for input tokens) and latency (more tokens to process). The benefit is dramatically improved consistency when examples are well-chosen. The model has concrete templates to follow, reducing variance and improving adherence to specific formats. Few-shot learning also enables capabilities the model might not exhibit zero-shot, especially for niche domains or unusual formatting requirements.

The critical pitfall in few-shot prompting is example quality and diversity. If your examples are too similar, the model may overfit to their specific patterns and fail on variations. If examples are too diverse or inconsistent with each other, the model gets confused about which pattern to follow. You need examples that span the input space—covering common cases, edge cases, and the range of valid outputs—while maintaining perfect consistency in what they demonstrate. Another subtle issue is example ordering; models can exhibit recency bias, disproportionately mimicking the last few examples. Rotating or randomizing examples can help, but adds complexity.

Few-shot prompting also struggles with scale in a different dimension: maintaining example sets. As your requirements evolve, you must update examples accordingly. If you're using few-shot in multiple places (different prompts for different tasks), example management becomes a significant engineering concern. Some teams solve this by building example libraries with versioning, testing examples against known inputs, and implementing dynamic example selection based on input similarity.

Tool-using agents offer the highest ceiling for capability—they can access external information, perform precise computations, and take actions in the world. But they introduce substantial complexity and new failure modes. Tool calls themselves can fail: the model might generate malformed JSON arguments, hallucinate function names that don't exist, or call functions in illogical sequences. Even when tool calls are structurally correct, the model might use them inappropriately—searching a database unnecessarily when the answer is in the prompt, or failing to use a tool when it clearly should.

Latency becomes multifaceted with tool-using agents. You incur the base LLM latency, plus tool execution time, plus the latency of subsequent LLM calls to process tool results. A simple question might trigger 3-5 sequential LLM calls with tool executions interleaved. If each LLM call takes 2 seconds and each tool call takes 1 second, you're looking at 10-15 seconds total latency. This makes tool-using agents unsuitable for real-time applications without careful optimization—caching, parallel tool execution, or pruning unnecessary reasoning steps.
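The arithmetic above can be made explicit with a rough latency model for a sequential agentic loop. The per-call timings here are illustrative assumptions, not measurements:

```python
# Rough latency model for a sequential agentic loop.
# The per-call timings are illustrative assumptions, not measurements.

def agent_latency(llm_calls: int, tool_calls: int,
                  llm_seconds: float = 2.0, tool_seconds: float = 1.0) -> float:
    """Total wall-clock time when LLM and tool calls run sequentially."""
    return llm_calls * llm_seconds + tool_calls * tool_seconds

# A "simple" question that triggers 5 LLM calls with 4 tool executions interleaved:
print(agent_latency(5, 4))  # 14.0 seconds
```

Parallel tool execution changes the second term from a sum to a max over each batch of concurrent calls, which is where most of the practical optimization headroom lives.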

Cost modeling is complex with tool-using agents. You pay for multiple LLM calls per user request, often using more expensive models (function calling works best with GPT-4 tier models, not smaller ones). The token costs include tool definitions (which can be substantial for complex APIs), reasoning traces, and tool results. A single user query might consume 5,000-10,000 tokens when you account for all the back-and-forth. These costs are justified when the task requires tool capabilities, but devastating if you use tools where few-shot prompting would suffice.
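A back-of-envelope cost model makes these numbers concrete. The prices and token counts below are illustrative assumptions, not current API rates:

```python
# Back-of-envelope token cost for one tool-using request.
# Prices and token counts are illustrative assumptions, not current rates.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float = 0.01,
                 output_price_per_1k: float = 0.03) -> float:
    """Dollar cost of one request at per-1K-token pricing."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# Tool definitions + reasoning traces + tool results across the full loop:
print(f"${request_cost(8000, 1500):.3f} per request")  # $0.125
```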

Error handling philosophies differ across strategies. Zero-shot and few-shot errors are typically output quality issues—you can validate outputs and fall back to rephrasing or retrying. Tool-using errors require defensive programming: validating function arguments before execution, handling tool timeouts, dealing with partial results, and deciding whether to retry with different tools or surface errors to users. Many production systems implement tool call validation layers that check argument types and ranges before executing potentially expensive or destructive operations.
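A minimal sketch of such a validation layer, assuming hypothetical tool names and parameter schemas:

```python
# Minimal tool-call validation layer: check names, types, and ranges
# before executing anything. The tool specs here are hypothetical.

TOOL_SPECS = {
    "get_weather": {"city": str},
    "set_thermostat": {"temperature": (int, 10, 30)},  # (type, min, max)
}

def validate_tool_call(name: str, args: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the call is safe."""
    spec = TOOL_SPECS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]  # the model hallucinated a function
    errors = []
    for param, rule in spec.items():
        if param not in args:
            errors.append(f"missing argument: {param}")
        elif isinstance(rule, tuple):
            typ, lo, hi = rule
            if not isinstance(args[param], typ) or not (lo <= args[param] <= hi):
                errors.append(f"{param} out of range [{lo}, {hi}]")
        elif not isinstance(args[param], rule):
            errors.append(f"{param} must be {rule.__name__}")
    # Reject unexpected extra arguments as well.
    errors += [f"unexpected argument: {k}" for k in args if k not in spec]
    return errors

print(validate_tool_call("get_weather", {"city": "Tokyo"}))       # []
print(validate_tool_call("set_thermostat", {"temperature": 45}))  # out of range
```

Only calls that validate cleanly proceed to execution; everything else is surfaced back to the model (or the user) as a correctable error.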

Best Practices for Production Systems

Deploying prompting strategies in production requires moving beyond isolated examples to systematic approaches that ensure reliability, observability, and maintainability. Here are battle-tested practices from production AI systems.

Start simple and measure before adding complexity. Begin with zero-shot prompting and establish baseline metrics: accuracy, consistency, latency, and user satisfaction. Only add few-shot examples or tools when measurements reveal specific failure modes that justify the added complexity. Many teams over-engineer their first implementation, building elaborate tool-using agents when zero-shot would handle 90% of cases. Measure your 90th and 99th percentile performance—if zero-shot handles the common cases well and only struggles with edge cases, consider whether few-shot examples targeting those edge cases would suffice before building tool infrastructure.

Implement prompt versioning and testing infrastructure. Treat prompts as code—they should live in version control, have clear change histories, and be tested systematically. For zero-shot prompts, maintain a test suite of representative inputs with expected outputs. For few-shot prompts, version your example sets separately from code and track which version is deployed. For tool-using agents, integration tests should cover the full agentic loop including tool failures. Prompt changes that seem minor can have outsized effects on model behavior; versioning enables rolling back problematic changes and understanding performance trends over time.

interface PromptTestCase {
  input: string;
  expectedOutput?: any;
  validators: Array<(output: any) => boolean>;
  metadata?: Record<string, any>;
}

interface PromptVersion {
  version: string;
  systemPrompt: string;
  examples?: Array<{input: string, output: string}>;
  tools?: any[];
}

interface TestResults {
  totalTests: number;
  passed: number;
  failed: number;
  averageLatency: number;
  results: any[];
}

class PromptTester {
  async runTestSuite(
    promptVersion: PromptVersion,
    testCases: PromptTestCase[]
  ): Promise<TestResults> {
    const results = [];
    
    for (const testCase of testCases) {
      const start = performance.now();
      const output = await this.executePrompt(promptVersion, testCase.input);
      const latency = performance.now() - start;  // elapsed time, not a timestamp
      const passed = testCase.validators.every(validator => validator(output));
      
      results.push({
        input: testCase.input,
        output,
        expected: testCase.expectedOutput,
        passed,
        latency
      });
    }
    }

    return {
      totalTests: testCases.length,
      passed: results.filter(r => r.passed).length,
      failed: results.filter(r => !r.passed).length,
      averageLatency: results.reduce((sum, r) => sum + r.latency, 0) / results.length,
      results
    };
  }

  private async executePrompt(version: PromptVersion, input: string): Promise<any> {
    // Implementation would call LLM with versioned prompt
    // This is a simplified structure
    return {};
  }
}

For few-shot prompting, implement dynamic example selection. Hard-coding examples works for simple cases, but production systems benefit from selecting the most relevant examples for each input. Store examples in a vector database, embed incoming inputs, and retrieve the k most similar examples to include in your prompt. This technique, sometimes called "dynamic few-shot" or "retrieval-augmented prompting," improves performance on diverse inputs while controlling token costs—you always send k examples, but they're the most helpful k for that specific input.
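A toy version of this retrieval step is sketched below. Production systems would use a real embedding model and a vector database; here a bag-of-words cosine similarity stands in, and the example pool is hypothetical:

```python
# Dynamic few-shot selection sketch. A bag-of-words "embedding" stands in
# for a real embedding model; the example pool is hypothetical.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; production would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_examples(query: str, examples: list[dict], k: int = 2) -> list[dict]:
    """Return the k stored examples most similar to the incoming input."""
    q = embed(query)
    ranked = sorted(examples, key=lambda ex: cosine(q, embed(ex["input"])),
                    reverse=True)
    return ranked[:k]

EXAMPLES = [
    {"input": "refund for damaged item", "output": "billing"},
    {"input": "cannot log into my account", "output": "auth"},
    {"input": "app crashes on startup", "output": "bug"},
]

print(select_examples("I was charged twice, need a refund", EXAMPLES, k=1))
```

The selected examples are then interpolated into the prompt in place of a hard-coded set, so every request carries the same token budget but a more relevant payload.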

Design tool interfaces for models, not humans. Tool definitions should be concise, unambiguous, and include explicit types and constraints. Avoid overloading single tools with multiple capabilities—it's better to have get_customer_by_email() and get_customer_by_id() as separate tools rather than get_customer(identifier: string, id_type: string). Models struggle with conditional logic in tool parameters. Similarly, tool descriptions matter immensely; they're how the model decides when to use each tool. A vague description like "gets data" is useless, while "retrieves customer profile including order history and preferences given an email address" clearly signals when and why to use the tool.
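The narrow-versus-overloaded contrast looks like this as tool definitions in the OpenAI function-calling style (the schemas themselves are illustrative):

```python
# Two narrowly-scoped tool definitions in OpenAI function-calling style,
# in place of one overloaded get_customer(identifier, id_type).
# The schemas are illustrative, not a real API.

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_customer_by_email",
            "description": ("Retrieves a customer profile, including order "
                            "history and preferences, given an email address."),
            "parameters": {
                "type": "object",
                "properties": {"email": {"type": "string"}},
                "required": ["email"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_customer_by_id",
            "description": "Retrieves a customer profile given a numeric customer ID.",
            "parameters": {
                "type": "object",
                "properties": {"customer_id": {"type": "integer", "minimum": 1}},
                "required": ["customer_id"],
            },
        },
    },
]
```

Each definition has exactly one job and a description that says when to use it, so the model's tool-selection decision reduces to matching the user's intent against two unambiguous descriptions.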

Implement observability for prompt performance. Production systems should log not just final outputs but intermediate reasoning steps, tool calls made, token usage, and latency breakdown. For tool-using agents especially, you need visibility into which tools are called, how often, and what results they return. This telemetry helps you optimize prompts, identify underutilized tools, and debug unexpected behaviors. Many teams build dashboards showing success rates by prompt strategy, allowing data-driven decisions about when to migrate from zero-shot to few-shot or add new tools.

Guard against prompt injection and adversarial inputs. When user content is interpolated into prompts, especially in few-shot examples or tool parameters, validate and sanitize inputs. An attacker might try to inject instructions that override your system prompt or cause the model to ignore your examples. For tool-using agents, validate all function arguments before execution—never trust that the model will only generate safe parameters. Implement allow-lists for sensitive operations and require human approval for high-stakes actions.

Plan for model upgrades and degradation. Prompts optimized for GPT-4 might behave differently on GPT-4-turbo or GPT-5. Few-shot examples that work perfectly today might become unnecessary as models improve, or might need updates as model capabilities shift. Build flexibility into your architecture to A/B test prompt strategies and gradually migrate between models. Similarly, prepare for model degradation—if API latency increases or quality temporarily decreases, can you gracefully fall back to cached responses, simpler prompts, or alternative models?

Measuring Success and Optimizing Over Time

Choosing a prompting strategy isn't a one-time decision—it's an ongoing optimization process that requires measurement, experimentation, and evolution as your requirements and available models change. Successful production systems implement structured evaluation frameworks that go beyond anecdotal "it seems to work" assessments.

Define task-specific metrics before implementing any strategy. For classification tasks, track precision, recall, and F1 scores against labeled test sets. For extraction tasks, measure field-level accuracy and completeness. For generation tasks, implement both automated metrics (BLEU, ROUGE, or semantic similarity against reference outputs) and human evaluation workflows. Importantly, measure not just correctness but consistency—run the same input through your system multiple times and quantify output variance. Zero-shot prompts might average high quality but have unacceptable variance, while few-shot prompts might have slightly lower average quality but much tighter distributions.

Cost and latency metrics should be first-class alongside quality metrics. Track tokens per request (input and output), wall-clock latency, and LLM API costs. For tool-using agents, break down latency into LLM reasoning time, tool execution time, and orchestration overhead. A few-shot prompt that uses 3,000 additional tokens might cost 10x more per request than zero-shot while only improving accuracy by 5%. That might be worth it for high-value tasks but not for high-volume background jobs. These economic considerations fundamentally shape which strategy is viable.

A/B testing between prompting strategies provides ground truth. When considering moving from zero-shot to few-shot, implement both, route a percentage of traffic to each, and measure differences in your metrics. This reveals whether the added complexity is justified by measurable improvements. Some surprising findings from real systems: few-shot prompting sometimes performs worse than zero-shot when examples are poorly chosen, tool-using agents sometimes have lower user satisfaction despite higher correctness (due to latency), and the "best" strategy varies by user segment or input type.

Implement fallback hierarchies for robustness. A mature system might try zero-shot first (fast and cheap), fall back to few-shot if validation fails (higher quality), and escalate to tool-using agents or humans for the remaining failures. This pattern combines the efficiency of zero-shot for common cases with the reliability of more sophisticated approaches for harder inputs. Track your fallback rates—if 30% of requests are falling back from zero-shot to few-shot, you probably should just start with few-shot.
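The fallback pattern can be sketched as a simple cheapest-first loop; the strategy callables below are hypothetical stubs standing in for real LLM calls:

```python
# Fallback hierarchy: try strategies cheapest-first, escalate on
# validation failure. The strategy callables are hypothetical stubs.

def run_with_fallbacks(task, strategies, validate):
    """Try each (name, fn) in order; return the first output that validates."""
    for name, strategy in strategies:
        output = strategy(task)
        if validate(output):
            return name, output
    return "human_review", None  # every strategy failed; escalate to a human

# Stubs standing in for real LLM calls:
zero_shot = lambda task: None             # pretend zero-shot failed validation
few_shot = lambda task: {"label": "ok"}   # pretend few-shot succeeded

name, result = run_with_fallbacks(
    "classify this ticket",
    [("zero_shot", zero_shot), ("few_shot", few_shot)],
    validate=lambda out: out is not None,
)
print(name)  # few_shot
```

Logging which rung of the ladder each request resolved on gives you the fallback-rate metric directly.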

Continuously curate and improve your prompt artifacts. For few-shot prompting, monitor which examples are most influential and which inputs remain problematic. Implement automated example mining—when human reviewers correct model outputs, add those corrections to your example candidate pool. For tool-using agents, analyze tool usage patterns to identify missing tools, overly broad tools that should be split, or tools that are never called and can be removed to reduce prompt complexity. Prompt optimization is not set-and-forget engineering; it's an ongoing practice analogous to model training in traditional ML.

Advanced Patterns: Chain-of-Thought and Structured Reasoning

Beyond the basic strategies, several advanced prompting patterns combine elements of zero-shot, few-shot, and tool use to unlock even more reliable reasoning. Understanding these patterns helps you tackle the most complex tasks where simple prompting strategies plateau in effectiveness.

Chain-of-thought (CoT) prompting asks the model to show its reasoning before producing a final answer. Rather than jumping directly to output, the model generates intermediate reasoning steps, making its logic explicit. This technique can be applied zero-shot with phrases like "Let's think step by step" or few-shot by providing examples that include reasoning chains. Chain-of-thought is particularly powerful for mathematical reasoning, multi-step logic problems, and complex analysis where showing work improves accuracy.

# Zero-shot Chain-of-Thought
ZERO_SHOT_COT = """Solve this problem. Think step by step before giving your final answer.

Problem: A company's revenue was $800K in Q1, increased by 25% in Q2, then decreased by 10% in Q3. What was Q3 revenue?

Let's think step by step:"""

# Few-shot Chain-of-Thought
FEW_SHOT_COT_EXAMPLES = [
    {
        "problem": "A store has 150 items. It sells 40% in week 1, then restocks 30 items. How many items does it have?",
        "reasoning": """Let me solve this step by step:
1. Start with 150 items
2. Sells 40% in week 1: 150 × 0.40 = 60 items sold
3. Remaining after week 1: 150 - 60 = 90 items
4. Restocks 30 items: 90 + 30 = 120 items
5. Final count: 120 items

The store has 120 items."""
    },
    {
        "problem": "If project A takes 5 days and must finish before project B (3 days) starts, and project C (4 days) can run parallel to A, what's the minimum total time?",
        "reasoning": """Let me work through the dependencies:
1. Project A: 5 days, must complete before B
2. Project C: 4 days, can run parallel to A
3. Project B: 3 days, starts after A completes

Timeline:
- Days 0-5: Projects A and C run in parallel (C finishes day 4)
- Days 5-8: Project B runs after A completes

Total time: 5 days (A/C parallel) + 3 days (B) = 8 days

The minimum total time is 8 days."""
    }
]

Chain-of-thought examples demonstrate the desired reasoning pattern, teaching the model to break down problems systematically. This technique is especially valuable when combined with tool use—the model can reason about which tool to use, call the tool, then reason about the results before generating a final answer.

Structured output modes available in modern APIs (OpenAI's JSON mode, Anthropic's structured output, or grammar-constrained generation) provide a middle path between few-shot consistency and zero-shot simplicity. You specify a JSON schema, and the model guarantees valid JSON conforming to that schema. This eliminates an entire class of few-shot use cases where you were primarily using examples to teach output format. However, schemas only enforce structure, not semantics—the model will produce valid JSON, but might populate fields incorrectly.

import OpenAI from 'openai';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

// Define your exact output schema using Zod
const ResumeSchema = z.object({
  name: z.string(),
  email: z.string().email().optional(),
  work_experience: z.array(z.object({
    title: z.string(),
    company: z.string(),
    start_date: z.string(),
    end_date: z.string(),
    highlights: z.array(z.string())
  })),
  education: z.array(z.object({
    degree: z.string(),
    field: z.string(),
    institution: z.string(),
    graduation_year: z.string()
  })),
  skills: z.array(z.string())
});

type Resume = z.infer<typeof ResumeSchema>;

async function parseResumeWithStructuredOutput(
  resumeText: string,
  client: OpenAI
): Promise<Resume> {
  const response = await client.chat.completions.create({
    model: 'gpt-4o', // json_schema structured outputs require a model that supports them
    messages: [
      {
        role: 'system',
        content: 'You are a resume parser. Extract structured information from resumes.'
      },
      { role: 'user', content: resumeText }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'resume',
        schema: zodToJsonSchema(ResumeSchema)
      }
    }
  });

  const parsed = JSON.parse(response.choices[0].message.content ?? '{}');
  
  // Validate against schema
  return ResumeSchema.parse(parsed);
}

This pattern combines zero-shot instructions with schema-enforced structure, often eliminating the need for few-shot examples purely for formatting. You still might use few-shot for semantic guidance (what counts as a "highlight" vs regular job duties), but structural consistency is guaranteed by the schema.

Progressive disclosure is an architectural pattern where you start with zero-shot for initial assessment, then conditionally add examples or tools based on confidence or task requirements. The agent makes an initial zero-shot attempt, evaluates its own confidence, and decides whether it needs examples or tools to improve its answer. This pattern optimizes for the common case (zero-shot suffices) while maintaining high quality for difficult cases.
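The escalation logic is a small wrapper around two calls; the `llm_call` and `confidence` functions below are hypothetical stubs for a real model call and a real confidence estimate:

```python
# Progressive disclosure sketch: zero-shot first, escalate to few-shot
# only on low confidence. llm_call and confidence are hypothetical stubs.

def answer(question, llm_call, confidence, examples=None, threshold=0.8):
    """Zero-shot attempt first; re-ask with few-shot examples only if
    the confidence estimate falls below the threshold."""
    draft = llm_call(question)
    if confidence(draft) >= threshold:
        return draft, "zero_shot"
    improved = llm_call(question, examples=examples)
    return improved, "few_shot"

# Stubs for illustration:
def llm_call(question, examples=None):
    return "detailed answer" if examples else "short answer"

def confidence(draft):
    return 0.5 if draft == "short answer" else 0.9

print(answer("hard question", llm_call, confidence, examples=["ex1"]))
# ('detailed answer', 'few_shot')
```

In practice the confidence signal might come from log probabilities, a self-assessment prompt, or output validation rather than a heuristic.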

Conclusion

The landscape of prompting strategies—zero-shot, few-shot, and tool-using agents—represents different points on the trade-off curve between simplicity and capability. Zero-shot prompting offers speed and elegance for well-defined tasks within the model's capability envelope. Few-shot prompting provides the consistency and specificity that production systems often demand, at the cost of token overhead and example maintenance. Tool-using agents unlock entirely new categories of tasks by connecting language models to external knowledge and computation, but require sophisticated orchestration and error handling.

The key insight is that these strategies are not mutually exclusive nor hierarchical. They're complementary techniques to apply based on the specific characteristics of each task in your system. A mature AI application likely uses all three: zero-shot for conversational elements, few-shot for business-critical formatting and domain-specific patterns, and tools for data access and deterministic operations. The craft of prompt engineering lies in recognizing which strategy fits each subtask and composing them into coherent systems.

As language models continue to evolve, the boundary lines between these strategies will shift. Models are getting better at zero-shot complex reasoning, reducing the need for few-shot examples in some domains. Function calling is becoming more reliable and lower latency, making tool use viable for a broader set of applications. Structured output modes are eliminating some traditional few-shot use cases. But the fundamental framework—understanding what each strategy offers and choosing accordingly—remains relevant regardless of model capabilities.

Start with the simplest strategy that might work. Measure its performance honestly. Add complexity only when measurements justify it. Treat prompts as critical infrastructure deserving the same engineering rigor as your core codebase. These principles will serve you regardless of which models you're using or which prompting strategies you choose.

Key Takeaways

  1. Match strategy to task requirements, not preferences. Use zero-shot for common, well-defined tasks; few-shot when consistency and domain specificity matter; tools when external data or deterministic operations are needed.

  2. Measure before optimizing. Establish baseline metrics with zero-shot before investing in few-shot example curation or tool infrastructure. Let data drive your architecture decisions.

  3. Token economics are first-class concerns. Few-shot examples and tool definitions consume context window space on every request. Optimize for the common case and use progressive enhancement for edge cases.

  4. Example quality dominates example quantity in few-shot prompting. Five excellent, diverse examples that span your input distribution outperform twenty similar or inconsistent examples.

  5. Tool-using agents require defensive engineering. Validate function arguments, implement timeouts, handle partial failures, and set iteration limits. The model will eventually generate unexpected tool calls—design for resilience.

80/20 Insight

Twenty percent of your prompting effort should focus on clearly defining success criteria and measuring baseline performance; this drives eighty percent of your architectural decisions. Most teams spend too much time optimizing prompts and too little time defining what "good" looks like with quantitative metrics. Once you have measurements, the right strategy often becomes obvious—zero-shot failure patterns point directly to which few-shot examples you need, and consistency metrics reveal whether tool use is justified. Start with measurement, let data reveal the path forward.

Analogies & Mental Models

Think of prompting strategies as cooking methods. Zero-shot prompting is like asking a professional chef to make "a pasta dish"—they'll produce something good based on their training, but you can't predict exactly what. Few-shot prompting is like showing them three photos of pasta dishes you love—now they understand your specific taste and style preferences. Tool-using is like giving them access to your kitchen, your pantry, and your recipe collection—they can now adapt to ingredients on hand and follow your exact family recipes.

Another useful mental model: zero-shot is public APIs, few-shot is configuration files, and tools are plugins. Public APIs have generic interfaces designed for broad use cases—they work out of the box for common scenarios. Configuration files let you specify exactly how you want behavior customized. Plugins extend core capabilities with new functionality that wasn't built-in. Just as you wouldn't build a plugin when a config option suffices, don't build tools when few-shot examples would work.

References

  1. Brown, T., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems, 33, 1877-1901. [GPT-3 Paper introducing few-shot learning]
  2. Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Advances in Neural Information Processing Systems, 35.
  3. Yao, S., Zhao, J., Yu, D., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv preprint arXiv:2210.03629. [Foundation for tool-using agents]
  4. OpenAI Documentation. "Function Calling." OpenAI API Reference. https://platform.openai.com/docs/guides/function-calling
  5. Anthropic Documentation. "Prompt Engineering." Claude API Documentation. https://docs.anthropic.com/claude/docs/prompt-engineering
  6. Liu, P., Yuan, W., Fu, J., et al. (2023). "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing." ACM Computing Surveys, 55(9), 1-35.
  7. Chase, H. (2022). "LangChain: Building Applications with LLMs Through Composability." GitHub. https://github.com/langchain-ai/langchain
  8. Khattab, O., Singhvi, A., Maheshwari, P., et al. (2023). "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." Stanford NLP Group.
  9. Zamfirescu-Pereira, J.D., Wong, R.Y., Hartmann, B., et al. (2023). "Why Johnny Can't Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts." CHI Conference on Human Factors in Computing Systems.
  10. Schulhoff, S., Ilie, M., Balepur, N., et al. (2024). "The Prompt Report: A Systematic Survey of Prompting Techniques." arXiv preprint arXiv:2406.06608.