Visualizing Latency Comparisons Between LLM APIs: OpenRouter vs Bedrock

Large Language Models (LLMs) are now integral to modern software applications, powering tasks such as summarization, code generation, and technical explanations. When evaluating multiple LLM APIs, latency, response quality, and consistency are critical factors. In this post, I share a detailed latency comparison between OpenRouter and AWS Bedrock, along with the methodology, visualizations, and insights.
Experiment Overview
The primary objective of this experiment was to measure and compare response latency for OpenRouter and Bedrock across multiple prompts. The experiment was designed to capture not only the speed of each API but also its consistency across repeated queries.
Prompts Used
Three representative prompts were chosen for the comparison:
Explain Kubernetes in simple terms for a beginner.
Write a Python function to reverse a linked list.
Summarize the book Atomic Habits in three sentences.
Each prompt was sent five times to each API to generate multiple latency measurements for statistical analysis.
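In code, that setup reduces to a list of prompts and a repeat count; the variable names below are chosen to match the comparison snippet shown later in the post:
# The three test prompts, each sent five times to each API.
prompts = [
    "Explain Kubernetes in simple terms for a beginner.",
    "Write a Python function to reverse a linked list.",
    "Summarize the book Atomic Habits in three sentences.",
]
REPEATS = 5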
Data Collection Methodology
The latency comparison was conducted using Python, with the following approach:
OpenRouter API Calls:
Sent HTTP POST requests to the OpenRouter API with the prompt, specifying the gpt-oss-20b model.
Measured start and end timestamps to calculate latency.
Extracted the text response from the API JSON payload.
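For reference, here is a minimal sketch of a call_openrouter helper along these lines. It assumes OpenRouter's OpenAI-compatible chat completions endpoint and an OPENROUTER_API_KEY environment variable; the exact model slug is an assumption and may differ from the one used in the experiment:
import os
import time
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def call_openrouter(prompt):
    """Send one prompt to OpenRouter; return (response_text, latency_seconds)."""
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    payload = {
        "model": "openai/gpt-oss-20b",  # assumed slug for the gpt-oss-20b model
        "messages": [{"role": "user", "content": prompt}],
    }
    start = time.time()
    resp = requests.post(OPENROUTER_URL, headers=headers, json=payload, timeout=120)
    latency = time.time() - start
    text = resp.json()["choices"][0]["message"]["content"]
    return text, latency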
AWS Bedrock API Calls:
Used the boto3 client to invoke the Bedrock model openai.gpt-oss-20b-1.
Sent the prompt in the OpenAI-style chat format.
Measured latency from request initiation to response.
Extracted the returned text from the API payload.
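A matching sketch of call_bedrock. The invoke_model call and the OpenAI-style chat body follow the description above; the response JSON shape and the AWS region are assumptions, so adjust them for your account and model:
import json
import time
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")  # region is a placeholder

def call_bedrock(prompt):
    """Send one prompt to Bedrock; return (response_text, latency_seconds)."""
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]})
    start = time.time()
    resp = bedrock.invoke_model(
        modelId="openai.gpt-oss-20b-1",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    latency = time.time() - start
    payload = json.loads(resp["body"].read())
    # Assumed to mirror the OpenAI chat response format; verify for your model.
    text = payload["choices"][0]["message"]["content"]
    return text, latency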
Data Storage:
Each query stored the following fields: prompt, repeat number, OpenRouter response, OpenRouter latency, Bedrock response, and Bedrock latency.
All results were saved into a CSV file (llm_comparison.csv) for analysis and visualization.
This setup ensured a repeatable and reliable dataset for performance analysis and comparison.
Here is a condensed snippet showing the main idea of the comparison script:
data_rows = []
for prompt in prompts:
    for i in range(REPEATS):
        # Query both APIs with the same prompt and time each call.
        or_text, or_time = call_openrouter(prompt)
        print(f"OpenRouter [{i+1}/{REPEATS}] Latency: {or_time:.2f}s")

        br_text, br_time = call_bedrock(prompt)
        print(f"Bedrock [{i+1}/{REPEATS}] Latency: {br_time:.2f}s")

        # Record one row per prompt/repeat with both responses and latencies.
        data_rows.append({
            "prompt": prompt,
            "repeat": i + 1,
            "openrouter_response": or_text,
            "openrouter_latency": or_time,
            "bedrock_response": br_text,
            "bedrock_latency": br_time,
        })
This allowed me to build a structured dataset with both responses and latencies for each prompt and repeat.
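Persisting those rows is then a one-liner with pandas, writing the CSV file referenced above:
import pandas as pd

# Write one row per prompt/repeat, with both responses and latencies.
pd.DataFrame(data_rows).to_csv("llm_comparison.csv", index=False)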
Latency Analysis
Using the CSV data, I ran both statistical and visual analyses to compare the two APIs.
OpenRouter Latency
Minimum Latency: 2.32 seconds
Maximum Latency: 7.28 seconds
Average Latency: Approximately 4.60 seconds
Observation: OpenRouter exhibited higher variability, particularly for repeated technical explanation prompts.
Bedrock Latency
Minimum Latency: 2.00 seconds
Maximum Latency: 3.24 seconds
Average Latency: Approximately 3.05 seconds
Observation: Bedrock was consistently faster and more stable across repeats and prompt types.
Prompt-Specific Patterns
Kubernetes Explanation: Bedrock consistently responded under 3 seconds, while OpenRouter spiked to over 7 seconds in one repeat.
Python Code Reversal: Both APIs performed similarly in early repeats, but Bedrock remained slightly faster.
Book Summarization: Bedrock maintained both speed and stability, whereas OpenRouter showed variability in later repeats.
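These per-prompt figures can be reproduced from the CSV with a simple groupby over the columns stored earlier:
import pandas as pd

df = pd.read_csv("llm_comparison.csv")

# Min, max, mean, and spread of latency for each API, broken down by prompt.
stats = df.groupby("prompt")[["openrouter_latency", "bedrock_latency"]].agg(
    ["min", "max", "mean", "std"]
)
print(stats.round(2))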
Visualization Approach
To better understand latency differences, the following visualizations were created:
Boxplot: Shows overall latency distribution for each API, highlighting median, quartiles, and outliers.
Lineplot Per Prompt: Displays latency across repeats for each prompt, revealing consistency and spikes.
These visualizations make trends immediately clear, allowing developers to make informed choices between APIs.
Python Script for Plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the measurements and print summary statistics for both APIs.
df = pd.read_csv("llm_comparison.csv")
print(df[['openrouter_latency', 'bedrock_latency']].describe())

# Boxplot: overall latency distribution per API.
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[['openrouter_latency', 'bedrock_latency']])
plt.title("Latency Comparison: OpenRouter vs Bedrock")
plt.ylabel("Latency (seconds)")
plt.show()

# Line plots: latency across repeats, one pair of lines per prompt.
plt.figure(figsize=(14, 6))
for prompt in df['prompt'].unique():
    prompt_data = df[df['prompt'] == prompt]
    sns.lineplot(x='repeat', y='openrouter_latency', data=prompt_data,
                 label=f'OpenRouter: {prompt}', marker='o')
    sns.lineplot(x='repeat', y='bedrock_latency', data=prompt_data,
                 label=f'Bedrock: {prompt}', marker='o')
plt.title("Latency Trends Per Prompt Repeat")
plt.xlabel("Repeat Number")
plt.ylabel("Latency (seconds)")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
The resulting plots surfaced the patterns summarized below.
Insights From the Data

OpenRouter: latency ranged from 2.32 to 7.28 seconds, averaging roughly 4.60 seconds. Variability was significant across prompts and repeats, indicating inconsistent performance under certain queries.
Bedrock: latency ranged from 2.00 to 3.24 seconds, averaging roughly 3.05 seconds. Variability was much lower than OpenRouter's, indicating more consistent performance.
Prompt-Specific Trends:
For Kubernetes explanation prompts, OpenRouter latency increased up to 7.28 seconds in the fourth repeat, while Bedrock remained under 3 seconds.
For code generation prompts, both APIs performed similarly in early repeats, but Bedrock consistently had faster responses.
For book summarization, Bedrock was faster and more stable, with lower standard deviation.
Takeaways
Consistency Matters: Bedrock is more predictable, making it preferable for real-time applications.
Measure Repeats: Single API calls can be misleading; repeated measurements reveal stability.
Latency vs. Prompt Complexity: Certain prompts can trigger spikes in OpenRouter latency, which developers should consider for production workloads.
Data-Driven Decision Making: Structured data collection enables informed API selection.
Conclusion
This experiment shows that Bedrock provides lower and more consistent latency across prompts and repeated queries compared to OpenRouter. Collecting and visualizing latency not only reveals performance differences but also helps developers make informed choices about which API to integrate for production systems.
By sharing both the data collection and visualization workflow, I hope to provide a practical template for evaluating LLM APIs for real-world projects.