HuggingFace Integration¶
One pip install shekel[huggingface] and one with budget(): — shekel intercepts every HuggingFace InferenceClient call, enforces hard spend limits, and shows you exactly what was spent. Hard caps, fallback models, nested budgets, and BudgetExceededError all work identically to OpenAI and Anthropic.
Installation¶
Why a dedicated adapter?¶
HuggingFace's InferenceClient uses its own HTTP layer — it does not call the OpenAI SDK under the hood. Without a dedicated adapter, budget() would be completely blind to HuggingFace spend.
Shekel's HuggingFaceAdapter patches InferenceClient.chat_completion at runtime. Since client.chat.completions.create() delegates to chat_completion internally, all calls through either interface are tracked automatically.
Important: Custom Pricing Required¶
No bundled HuggingFace pricing
HuggingFace hosts thousands of models with varying pricing. Shekel has no standard pricing table for HuggingFace models.
Always pass price_per_1k_tokens to budget() so Shekel can calculate costs:
If you omit this, b.spent will always be 0.0 even though tokens were consumed.
Basic Integration¶
from huggingface_hub import InferenceClient
from shekel import budget
client = InferenceClient(token="your-hf-token")
with budget(
max_usd=1.00,
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as b:
response = client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "Explain transformers in one sentence."}],
max_tokens=50,
)
print(response.choices[0].message.content)
print(f"Cost: ${b.spent:.6f}")
Streaming¶
with budget(
max_usd=1.00,
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as b:
full_text = ""
for chunk in client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "List three ML frameworks."}],
max_tokens=60,
stream=True,
):
delta = chunk.choices[0].delta.content
if delta:
full_text += delta
print(delta, end="", flush=True)
print()
print(f"Cost: ${b.spent:.6f}")
Streaming usage availability
Many HuggingFace-hosted models do not return usage data in streaming chunks. In that case, b.spent will be 0.0 for streaming calls even if tokens were consumed. Non-streaming calls generally do return usage data.
Nested Budgets¶
with budget(
max_usd=5.00,
name="pipeline",
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as total:
with budget(
max_usd=1.00,
name="step-1",
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as step1:
client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "Summarise this document."}],
max_tokens=100,
)
print(f"Step 1: ${step1.spent:.6f}")
print(f"Total: ${total.spent:.6f}")
Budget Enforcement¶
from shekel import BudgetExceededError
try:
with budget(
max_usd=0.10,
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as b:
for _ in range(100): # Shekel stops this when budget runs out
client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "Analyse this."}],
max_tokens=50,
)
except BudgetExceededError as e:
print(f"Stopped at ${e.spent:.4f}")
Free vs Paid Models¶
HuggingFace offers two tiers for inference:
| Tier | Description | Pricing |
|---|---|---|
| Free (Serverless) | Limited RPM, shared infrastructure | Free but rate-limited |
| PRO / Inference Endpoints | Dedicated infrastructure | Pay per token / per hour |
For most chat models, use InferenceClient with an hf_* token. Free-tier models may return 503 when overloaded — add retry logic for production use.
Tips for HuggingFace + Shekel¶
- Always set
price_per_1k_tokens— there is no default pricing for HuggingFace models - Use non-streaming calls for accurate cost tracking — many models omit usage in streaming
- Check model availability — not all models are available on HuggingFace's serverless API
- Handle 503 errors — free-tier endpoints can be temporarily unavailable under load
- Use
max_tokensto cap response length and control costs