How It Works¶
Understanding shekel's internals: monkey-patching, context isolation, and zero-config design.
Architecture Overview¶
Shekel uses three core techniques:
- Monkey-patching - Intercepts API calls at runtime
- ContextVar isolation - Thread-safe per-context tracking
- Ref-counted patching - Efficient nested context handling
Monkey-Patching¶
Shekel wraps the OpenAI and Anthropic SDK methods to intercept API calls:
# When you enter a budget context
with budget(max_usd=1.00):
# Shekel patches:
# - openai.ChatCompletions.create → shekel wrapper
# - anthropic.Messages.create → shekel wrapper
response = client.chat.completions.create(...)
# Your call goes through shekel's wrapper
# Shekel extracts tokens, calculates cost, checks budget
# Then returns the original response
# When you exit the context
# Shekel restores the original methods
What Gets Patched¶
OpenAI:
- openai.resources.chat.completions.Completions.create (sync)
- openai.resources.chat.completions.AsyncCompletions.create (async)
Anthropic:
- anthropic.resources.messages.Messages.create (sync)
- anthropic.resources.messages.AsyncMessages.create (async)
LiteLLM (when shekel[litellm] is installed):
- litellm.completion (sync)
- litellm.acompletion (async)
Patching Implementation¶
Shekel uses a pluggable ProviderAdapter pattern — each provider registers itself in ADAPTER_REGISTRY. shekel/_patch.py delegates all patching to the registry:
# shekel/_patch.py
def _install_patches():
from shekel.providers import ADAPTER_REGISTRY
ADAPTER_REGISTRY.install_all() # calls install_patches() on each adapter
def _restore_patches():
from shekel.providers import ADAPTER_REGISTRY
ADAPTER_REGISTRY.remove_all() # restores originals for each adapter
Each adapter (e.g. OpenAIAdapter) handles its own SDK:
# shekel/providers/openai.py
def install_patches(self) -> None:
from shekel import _patch
import openai.resources.chat.completions as oai
_patch._originals["openai_sync"] = oai.Completions.create
oai.Completions.create = _patch._openai_sync_wrapper
To add a new provider, implement ProviderAdapter and register it — see Extending Shekel.
ContextVar Isolation¶
Each budget() context is isolated using Python's ContextVar:
from contextvars import ContextVar
_active_budget: ContextVar[Budget | None] = ContextVar("active_budget", default=None)
def set_active_budget(budget: Budget | None):
return _active_budget.set(budget)
def get_active_budget() -> Budget | None:
return _active_budget.get()
This ensures:
- Thread safety: Each thread has its own context
- Async safety: Each async task has its own context
- No global state: Concurrent budgets never interfere
Example: Concurrent Budgets¶
import concurrent.futures
from shekel import budget
def task1():
with budget(max_usd=1.00) as b:
# b1's spend tracked here
...
def task2():
with budget(max_usd=2.00) as b:
# b2's spend tracked here (independent)
...
# Run concurrently - no interference
with concurrent.futures.ThreadPoolExecutor() as executor:
executor.submit(task1)
executor.submit(task2)
Ref-Counted Patching¶
Nested contexts only patch once:
_patch_refcount = 0
def apply_patches():
global _patch_refcount
_patch_refcount += 1
if _patch_refcount == 1: # Only patch on first entry
_install_patches()
def remove_patches():
global _patch_refcount
_patch_refcount -= 1
if _patch_refcount == 0: # Only restore on last exit
_restore_patches()
Why This Matters¶
# Outer budget
with budget(max_usd=5.00):
# Patches installed (refcount: 1)
# Inner budget
with budget(max_usd=1.00):
# No re-patching (refcount: 2)
...
# No restoration yet (refcount: 1)
...
# Patches restored (refcount: 0)
Token Extraction¶
Shekel extracts tokens from API responses:
OpenAI / LiteLLM:
def _extract_openai_tokens(response):
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
model = response.model
return input_tokens, output_tokens, model
LiteLLM uses the same OpenAI-compatible format regardless of the underlying provider, so the same extraction logic applies.
Anthropic:
def _extract_anthropic_tokens(response):
input_tokens = response.usage.input_tokens
output_tokens = response.usage.output_tokens
model = response.model
return input_tokens, output_tokens, model
Cost Calculation¶
Three-tier pricing lookup:
def calculate_cost(model, input_tokens, output_tokens, price_override):
# Tier 3: Explicit override (highest priority)
if price_override:
return calculate_with_override(...)
# Tier 1: Built-in prices.json
if model in PRICES:
return calculate_with_builtin(...)
# Tier 1b: Prefix match (for versioned models)
if prefix_match := find_prefix(model):
return calculate_with_builtin(prefix_match)
# Tier 2: tokencost (if installed)
if tokencost_available():
return calculate_with_tokencost(...)
# Fallback: warn and return 0
warn_unknown_model(model)
return 0.0
Streaming Support¶
OpenAI Streaming¶
Shekel wraps the stream and collects usage from the final chunk:
def _wrap_openai_stream(stream):
seen_usage = []
try:
for chunk in stream:
if chunk.usage: # Final chunk has usage
seen_usage.append(chunk.usage)
yield chunk
finally:
if seen_usage:
tokens = seen_usage[-1]
_record(tokens.prompt_tokens, tokens.completion_tokens, model)
Anthropic Streaming¶
Shekel tracks message events:
def _wrap_anthropic_stream(stream):
input_tokens = 0
output_tokens = 0
try:
for event in stream:
if event.type == "message_start":
input_tokens = event.message.usage.input_tokens
elif event.type == "message_delta":
output_tokens = event.usage.output_tokens
yield event
finally:
_record(input_tokens, output_tokens, model)
Fallback Implementation¶
When budget is exceeded with a fallback:
def _check_limit(self):
if self.spent > self.max_usd:
if self.fallback and not self._using_fallback:
# Activate fallback
self._using_fallback = True
self._switched_at_usd = self.spent
warn("Switching to fallback model")
return # Don't raise!
# No fallback or already using fallback
raise BudgetExceededError(...)
The next API call automatically uses the fallback:
def _apply_fallback_if_needed(budget, kwargs, provider):
if budget._using_fallback:
# Rewrite model parameter
kwargs["model"] = budget.fallback['model']
Zero Config¶
Shekel requires no configuration because:
- No API keys needed - Uses your existing OpenAI/Anthropic keys
- No external services - Everything runs locally
- No initialization - Just import and use
- No global state - Each context is independent
Performance¶
Shekel adds minimal overhead:
- Patching: Once per budget context (~1μs)
- Per-call overhead: Token extraction + cost calculation (~10μs)
- No network calls: All pricing is local
The overhead is negligible compared to API latency (100-1000ms).
Limitations¶
What Shekel Can't Do¶
- Stop mid-stream: Streaming calls complete before budget check
- Cross-provider fallback: Can't switch from OpenAI to Anthropic
- Predict costs: Only tracks actual usage
- Batch API: Doesn't support OpenAI batch endpoints
- Function calling costs: Tracks tokens but not function execution costs
Thread Safety¶
- ContextVar: ✅ Thread-safe within each thread
- Persistent budget: ❌ Not thread-safe across threads
- Patching: ✅ Thread-safe (uses lock)
Source Code¶
The implementation is ~600 lines across 6 files:
_budget.py- Budget class and context manager_patch.py- Monkey-patching logic_pricing.py- Cost calculation_context.py- ContextVar management_decorator.py- @with_budget decoratorexceptions.py- BudgetExceededError
All code is available at github.com/arieradle/shekel.
Next Steps¶
- Extending Shekel - Add new providers or models
- Contributing - Contribute to the project
- API Reference - Complete API docs