Why Raw Benchmarks Don't Tell the Full Story
The Problem with Synthetic Benchmarks
When evaluating programming languages and frameworks, we often see headlines like "X is 10x faster than Y" based on simple benchmarks like hello-world or JSON serialization. These numbers are misleading because they don't reflect real-world workloads.
What Raw Benchmarks Miss
1. Cold Start Times
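A language might handle 100k requests per second once warm, but if it takes 500ms to start, every request that lands on a fresh instance (after a deploy, a scale-up, or a serverless scale-to-zero) waits at least that long, and the warm-state number says nothing about it.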
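To make the cold-start cost concrete, here is a minimal sketch that blends warm and cold latencies into an average. Only the 500 ms start time comes from the example above; the 5 ms warm latency and the 2% share of cold requests are illustrative assumptions, not measurements.

```python
def mean_latency_ms(warm_ms: float, cold_start_ms: float, cold_fraction: float) -> float:
    """Average latency when a fraction of requests pays the cold-start cost first."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    # A cold request waits for startup, then is served at the warm speed.
    cold_ms = cold_start_ms + warm_ms
    return cold_fraction * cold_ms + (1.0 - cold_fraction) * warm_ms


# Assumed numbers: 5 ms warm latency, 2% of requests hit a cold instance.
blended = mean_latency_ms(warm_ms=5.0, cold_start_ms=500.0, cold_fraction=0.02)
print(f"mean latency: {blended:.1f} ms")
```

Under these assumptions, a 2% cold share triples the average latency (15 ms instead of 5 ms), and each cold request individually takes 505 ms.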
2. Memory Pressure
High-throughput benchmarks ignore what happens when memory is constrained. Does the framework start swapping? Crash? Or gracefully degrade?
3. Real Network Conditions
Production traffic has latency spikes, connection timeouts, and packet loss. Synthetic benchmarks run in ideal conditions.
4. Complex Interactions
A framework might serialize JSON quickly in isolation but still fail under load once it is combined with:
- Database connections
- Session management
- Authentication checks
- Rate limiting
- WebSocket connections
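One way to see why these pieces change the picture: a database connection pool puts a hard ceiling on throughput that a hello-world benchmark never runs into. The sketch below applies Little's law to estimate that ceiling. The pool size and query time are hypothetical values chosen for illustration, and the model assumes each request holds exactly one pooled connection for the duration of its query.

```python
def max_throughput_rps(pool_size: int, avg_query_seconds: float) -> float:
    """Upper bound on requests per second when each request holds one pooled
    connection for avg_query_seconds (Little's law: L = lambda * W)."""
    if pool_size <= 0 or avg_query_seconds <= 0:
        raise ValueError("pool_size and avg_query_seconds must be positive")
    return pool_size / avg_query_seconds


# Hypothetical setup: 20 pooled connections, 10 ms per query.
ceiling = max_throughput_rps(pool_size=20, avg_query_seconds=0.010)
print(f"throughput ceiling: {ceiling:.0f} requests/second")
```

With these assumed numbers, the endpoint cannot sustain more than about 2,000 requests per second, no matter how fast the framework serializes JSON.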
Enter oha: HTTP Load Testing
oha is an HTTP load testing tool written in Rust. It sends requests to actual endpoints rather than running isolated micro-benchmarks.
oha -n 10000 -c 100 https://your-endpoint
Why oha is Better
- Realistic load: Tests actual HTTP endpoints with real network stacks
- End-to-End: Includes DNS resolution, TLS handshakes, connection pooling
- Global Testing: Can be run from machines in multiple regions to simulate international users (oha itself only sends load from the host it runs on)
- Detailed Metrics: Shows latency percentiles (p50, p95, p99), throughput, and errors
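To show why percentiles matter more than an average, here is a small sketch that computes p50, p95, and p99 with the nearest-rank method. The latency samples are made up for illustration, and nearest-rank is just one common definition of a percentile; it is not necessarily the one oha uses internally.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 97 fast requests and 3 slow ones.
latencies_ms = [10.0] * 97 + [800.0] * 3
mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
print(f"mean={mean_ms:.1f} p50={p50} p95={p95} p99={p99}")
```

In this sample the mean (33.7 ms) and p95 (10 ms) both hide the fact that 3 in 100 requests take 800 ms; only p99 exposes it.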
Example: Comparing Frameworks
Instead of running:
wrk -t12 -c400 -d10s http://localhost/hello
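Run the load against a real endpoint. Note that -q10 caps the rate at 10 requests per second, so 100,000 requests take close to three hours; raise or drop -q for a shorter run: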
oha -n 100000 -c 100 -q10 https://your-production-url/api/v1/users
This tests:
- Your actual routing
- Middleware stack
- Database queries
- Response serialization
- Connection pool limits
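For repeatable comparisons, it can help to generate the oha invocation from code so every framework is tested with identical settings. This is a minimal sketch: `build_oha_command` and `run_oha` are hypothetical helpers, they use only the `-n`, `-c`, and `-q` flags that appear in the commands above, and they assume oha is installed and on the PATH before anything is actually run.

```python
import shutil
import subprocess
from typing import Optional


def build_oha_command(url: str, requests: int, concurrency: int,
                      qps: Optional[int] = None) -> list[str]:
    """Build an oha argument list from the flags used in this article."""
    if requests <= 0 or concurrency <= 0:
        raise ValueError("requests and concurrency must be positive")
    cmd = ["oha", "-n", str(requests), "-c", str(concurrency)]
    if qps is not None:
        cmd += ["-q", str(qps)]  # cap the request rate (queries per second)
    cmd.append(url)
    return cmd


def run_oha(cmd: list[str]) -> int:
    """Run the command if oha is available and return its exit code."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("oha not found on PATH")
    return subprocess.run(cmd, check=False).returncode


# Same settings for every framework under test; only the URL changes.
command = build_oha_command("https://your-production-url/api/v1/users",
                            requests=100_000, concurrency=100, qps=10)
print(" ".join(command))
```

The sketch only prints the command. Calling `run_oha(command)` would start the actual load test, so point it at an environment that is allowed to receive that traffic.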
What Matters for Production
When we evaluate Soli, we care about:
| Metric | Why It Matters |
|---|---|
| p99 Latency | Shows the tail latency that the slowest 1% of requests experience |
| Throughput under load | Can it handle traffic spikes? |
| Memory stability | Does it leak over time? |
| Cold start | Serverless deployment viability |
| Error rate | Does it stay low under partial failure conditions? |
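The memory-stability row can be turned into a simple check: sample the process's memory during a long load test and fit a line through the samples. The sketch below uses an ordinary least-squares slope. The readings and the one-sample-per-minute interval are illustrative assumptions, and a positive slope is only a hint of a leak, not proof of one.

```python
def growth_per_step(samples_mb: list[float]) -> float:
    """Least-squares slope of memory samples taken at equal intervals,
    in MB per sampling interval."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2  # mean of the sample indices 0..n-1
    mean_y = sum(samples_mb) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


# Made-up resident memory readings in MB, one per minute.
stable = [120.0, 121.0, 119.0, 120.0, 121.0, 119.0]
leaking = [120.0, 124.0, 128.0, 132.0, 136.0, 140.0]
print(f"stable:  {growth_per_step(stable):+.2f} MB/min")
print(f"leaking: {growth_per_step(leaking):+.2f} MB/min")
```

A slope near zero suggests memory is holding steady under load, while a sustained positive slope (4 MB per minute in the second series) is worth investigating with a longer run.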
Conclusion
The next time you see "X is faster than Y", ask:
- What workload was tested?
- Was it cold or warm?
- Did it include middleware?
- Was it tested in production-like conditions?
Raw benchmarks are useful for micro-optimizations, but real performance understanding comes from end-to-end testing with tools like oha.