Amazon Bedrock AgentCore WebSocket Limits: 5 Critical Bottlenecks You Need to Know

Understanding connection limits, payload constraints, and timeout issues in real-time AI agent communication

The release of Amazon Bedrock AgentCore's bi-directional streaming capabilities has fundamentally changed how we build voice and chat agents. By moving away from the "stop-and-start" nature of traditional request-response cycles, developers can now create agents that feel truly alive—handling interruptions, adjusting mid-sentence, and providing low-latency feedback. However, as many teams are discovering, migrating to a WebSocket-based architecture isn’t just about changing protocols; it’s about navigating a new landscape of technical constraints that can silently break your production applications if not managed carefully.

Building at scale with AgentCore requires a shift in mindset from simple API orchestration to persistent connection management. While the serverless nature of Bedrock masks much of the underlying infrastructure, the WebSocket implementation introduces specific guardrails designed to maintain service stability across multi-tenant environments. In this post, we take a deep dive into the five most critical bottlenecks in the AgentCore WebSocket interface, drawing on the latest 2026 AWS technical specifications and documentation.

1. The 100MB Payload and 10MB Chunking Ceiling

One of the most immediate hurdles developers face is the hard limit on message sizes. In the world of AgentCore, you aren't just sending text; you’re often streaming high-fidelity audio or multi-modal data. AWS mandates a maximum total payload size of 100MB for a single session’s request or response cycle. While 100MB sounds generous for text, it can vanish quickly when handling uncompressed audio or high-resolution images passed between the agent and its tools.

More importantly, the streaming chunk size is capped at 10MB. This means that if your agent is generating a massive data export or a long-form audio response, it must be programmatically sliced into these 10MB increments. Failing to adhere to this doesn't just slow down the connection; it often results in immediate session termination. This constraint forces developers to implement robust chunking logic on the client side and ensures the agent's internal memory isn't overwhelmed by bloated, monolithic messages.
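The chunking logic itself can be a small, reusable utility. Here is a minimal sketch in plain Python; the 10MB value reflects the chunk ceiling described above, and `iter_chunks` is an illustrative helper, not an AWS SDK function:

```python
CHUNK_SIZE = 10 * 1024 * 1024  # 10MB streaming chunk ceiling

def iter_chunks(payload: bytes, chunk_size: int = CHUNK_SIZE):
    """Slice a large payload into pieces that fit under the streaming cap."""
    for offset in range(0, len(payload), chunk_size):
        yield payload[offset:offset + chunk_size]

# Example: a 25MB payload yields three chunks (10MB, 10MB, 5MB)
payload = b"\x00" * (25 * 1024 * 1024)
sizes = [len(chunk) for chunk in iter_chunks(payload)]
print(sizes)  # [10485760, 10485760, 5242880]
```

Each yielded chunk would then be sent as its own WebSocket frame, keeping every individual message safely under the cap.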

2. Idle Timeouts and the 15-Minute "Execution Wall"

Managing persistent connections is a balancing act between resource availability and cost. By default, AgentCore enforces an idleRuntimeSessionTimeout of 900 seconds (15 minutes). If your WebSocket connection goes silent for more than a quarter-hour—perhaps while waiting for a user to respond or a complex backend process to finish—the session is automatically reaped. While this is adjustable between 60 seconds and 8 hours (28,800 seconds), keeping it at the 15-minute default is the standard for most interactive chat applications.

The "Execution Wall" is even more rigid. Regardless of how active the conversation is, the standard synchronous request timeout is 15 minutes. For developers building long-running research agents or data-processing bots, this necessitates a move toward asynchronous patterns. If your agent's task exceeds 15 minutes, you must utilize background task handling and status polling, as the primary WebSocket connection will not stay open indefinitely for a single unit of work.

3. Concurrency Caps: The 25-Session Bottleneck

Scalability is the hallmark of AWS, but even "serverless" has its limits. In the current 2026 implementation, AgentCore imposes a default quota of 25 concurrent sessions per account per region for specialized tools like the Code Interpreter. While the general Runtime can scale to thousands of sessions, once your agent invokes a managed sandbox for execution, you hit this specific ceiling. This is a critical consideration for B2B platforms that might experience spikes in usage where multiple users require code execution simultaneously.

For standard agent invocations, the default is often 25 invocations per second (IPS) per endpoint. While these are adjustable via a quota increase request, they represent a significant design bottleneck for high-traffic applications. If your app handles 1,000 users at once, you must implement sophisticated queueing or rate-limiting on your application tier to prevent 429 "Too Many Requests" errors from disrupting the user experience during peak loads.
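A common way to absorb throttling at the application tier is exponential backoff with jitter around each invocation. In this sketch, `ThrottledError` stands in for whatever 429/ThrottlingException your SDK raises, and `invoke_fn` is any callable wrapping the real agent call:

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for an HTTP 429 / ThrottlingException from the SDK."""

def invoke_with_backoff(invoke_fn, max_retries=5, base_delay=0.5):
    """Retry a throttled invocation with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return invoke_fn()
        except ThrottledError:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
    raise RuntimeError("Exceeded retry budget while throttled")

# Example: the first two calls are throttled, the third succeeds.
calls = {"n": 0}
def flaky_invoke():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottledError()
    return "ok"

print(invoke_with_backoff(flaky_invoke, base_delay=0.01))  # ok
```

For sustained load above the quota, pair this with a queue so bursts are smoothed rather than merely retried.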

4. The "Ping-Pong" and 30-Second Edge Timeout

When routing your WebSocket through an Amazon API Gateway (a common practice for adding WAF protection and custom auth), you encounter a hidden trap: the 30-second edge-optimized timeout. If you are using edge-optimized endpoints, your connection will drop if no data is transmitted for just 30 seconds. This is much more aggressive than the 15-minute regional endpoint timeout.

To bypass this, you must implement a "Heartbeat" or "Ping-Pong" mechanism. Your agent must either generate "thinking" tokens regularly or send empty metadata frames to keep the connection alive. Below is a simple Python example of how you might handle a basic connection to the AgentCore WebSocket endpoint, ensuring you are prepared to catch timeout exceptions.

import asyncio
import websockets
import json

async def connect_to_agent(uri, auth_header):
    try:
        # Note: In production, sign the request with SigV4 rather than
        # passing static auth headers.
        # websockets < 14 accepts extra_headers; newer releases renamed
        # the argument to additional_headers.
        async with websockets.connect(uri, extra_headers=auth_header) as websocket:
            # Illustrative message shape; consult the AgentCore protocol
            # docs for the exact schema
            initial_msg = {
                "type": "user_input",
                "text": "Analyze the quarterly report.",
                "streaming": True
            }
            await websocket.send(json.dumps(initial_msg))

            while True:
                try:
                    # Setting a timeout slightly below the edge limit (e.g., 25s)
                    response = await asyncio.wait_for(websocket.recv(), timeout=25.0)
                    data = json.loads(response)
                    print(f"Agent says: {data.get('text', '')}")
                except asyncio.TimeoutError:
                    # Send a heartbeat or 'ping' to keep the connection alive
                    await websocket.ping()
                    print("Sent heartbeat to prevent edge timeout.")
                    
    except Exception as e:
        print(f"Connection failed: {e}")

# Run the client
# asyncio.run(connect_to_agent(YOUR_WSS_URI, YOUR_AUTH_HEADERS))

5. Regional Availability and Memory Quotas

Finally, it is vital to remember that AgentCore is not yet ubiquitous. As of early 2026, bi-directional streaming is available in nine specific AWS regions, including US East (N. Virginia), Europe (Ireland), and Asia Pacific (Tokyo). If your application needs to serve users in a region where AgentCore isn't supported, the cross-region latency will significantly degrade the "real-time" feel that WebSockets are supposed to provide.

Additionally, the AgentCore Memory service has its own set of constraints. File uploads for session-based memory are currently limited to 250MB per session. If your agent is "learning" from massive documents over a long conversation, you will eventually hit a wall where the oldest context must be purged or moved to long-term storage. Understanding the interplay between memory limits and WebSocket persistence is key to building agents that don't "forget" their purpose mid-stream.
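The "purge the oldest context" strategy can be expressed as a simple eviction loop. This is an illustrative sketch over an in-memory list of (name, size) entries, with the 250MB figure used as the budget; it is not an AgentCore Memory API:

```python
MEMORY_BUDGET = 250 * 1024 * 1024  # per-session ceiling described above

def trim_context(entries, budget=MEMORY_BUDGET):
    """Evict oldest entries until the total size fits the budget.

    `entries` is a list of (name, size_in_bytes) tuples, oldest first.
    Returns (kept_entries, evicted_names).
    """
    total = sum(size for _, size in entries)
    evicted = []
    while total > budget and entries:
        name, size = entries.pop(0)  # oldest entry first
        evicted.append(name)
        total -= size
    return entries, evicted

mb = 1024 * 1024
docs = [("old.pdf", 120 * mb), ("mid.pdf", 100 * mb), ("new.pdf", 60 * mb)]
kept, evicted = trim_context(docs)
print(evicted)  # ['old.pdf'] -- 280MB total, evicting the oldest reaches 160MB
```

In practice the evicted entries would be summarized or moved to long-term storage rather than discarded outright, so the agent keeps a compressed trace of its history.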

The 80/20 Rule for AgentCore Performance

To get 80% of the results with 20% of the effort, focus on these three areas:

  1. Use Regional Endpoints: Unless you absolutely need edge-optimized endpoints, prefer regional ones for WebSockets; they give you a 15-minute idle window instead of 30 seconds.
  2. Chunk Everything: Never send a file or audio stream over 10MB in one go. Build a chunking utility into your client SDK from day one.
  3. Monitor ThrottledRequests: Set up CloudWatch alarms for ThrottledRequests and HTTP 429 errors immediately. This is the first sign that your 25 IPS limit is being breached.

Key Takeaways for Your Implementation

  • Check Your Region: Ensure your stack is deployed in one of the 9 supported regions for bi-directional streaming.
  • Adjust Lifecycle Settings: Don't stick with defaults; set your idleRuntimeSessionTimeout based on your specific use case (e.g., shorter for demos, longer for research).
  • Implement Heartbeats: Use websocket.ping() every 20-25 seconds to stay ahead of aggressive proxy timeouts.
  • Watch the Payload: Keep an eye on the 100MB session cap; use compression (like zlib) for large text or data payloads.
  • Request Quota Increases Early: If you plan to scale, don't wait for a production crash to ask for more than 25 concurrent sessions.
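The compression tip above can be as simple as deflating serialized payloads with the standard-library zlib module before they go over the socket:

```python
import json
import zlib

def compress_payload(obj) -> bytes:
    """Serialize a payload and deflate it before sending over the socket."""
    raw = json.dumps(obj).encode("utf-8")
    return zlib.compress(raw, level=6)

def decompress_payload(blob: bytes):
    """Inflate and deserialize a payload received from the socket."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Repetitive text compresses dramatically, stretching the 100MB session cap.
msg = {"type": "user_input", "text": "quarterly report " * 10_000}
blob = compress_payload(msg)
print(len(json.dumps(msg).encode("utf-8")), "->", len(blob))
assert decompress_payload(blob) == msg
```

The receiving side needs to agree on the scheme, so flag compressed frames in your own message metadata; binary media like audio is usually already compressed and gains little from this step.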

Building with Amazon Bedrock AgentCore is an exercise in managing boundaries. By respecting these five bottlenecks, you can ensure your real-time AI agents remain responsive, reliable, and ready for production-grade traffic.
