Practical use cases — security, parameters, observability

You can apply security best practices (API-key config, what NOT to log), tune LLM parameters intelligently (temperature, top_p, max_tokens, stop, presence_penalty, frequency_penalty, seed), track token usage, configure custom base URLs for proxies, and apply the chapter's production checklist.

Introduction

Engineers often ship the same chat.completions.create call from development straight to production, only to find their API key in a log aggregator, runaway token costs from a missing max_tokens, and zero visibility into spend because response.usage was never captured. A working API call and a production-ready one are not the same thing. After this lesson you will be able to keep secrets out of source code and logs, set each of the seven generation parameters — temperature, top_p, max_tokens, stop, presence_penalty, frequency_penalty, and seed — with deliberate intent, capture per-call token usage for cost tracking, and route traffic through a custom base_url proxy for observability.

Key Terminology

temperature — A float from 0.0 to 2.0 controlling output randomness by scaling the model's logits before sampling; 0.0 makes generation near-deterministic (best for extraction and classification), while higher values flatten the distribution and increase variety (best for brainstorming and creative drafting).
top_p — Nucleus-sampling parameter (0.0–1.0) that restricts sampling to the smallest set of tokens whose cumulative probability reaches p; top_p=0.1 keeps only the most likely tokens, an alternative to temperature that you tune instead of, not alongside, temperature.
stop — A string or list of up to four strings that halt generation the moment any of them is produced; the stop sequence itself is excluded from the returned text, letting you terminate cleanly at a delimiter like "\n\n" or "###" without post-processing.
presence_penalty — A float from -2.0 to 2.0 that penalizes a token once it has appeared at all, nudging the model toward introducing new topics; positive values reduce the chance the response fixates on a single subject.
frequency_penalty — A float from -2.0 to 2.0 that penalizes tokens in proportion to how often they have already appeared, suppressing verbatim repetition and looping within a single response.
seed — An integer that requests best-effort deterministic sampling: identical inputs plus an identical seed return near-identical output, making runs reproducible for testing and debugging; the response's system_fingerprint tells you whether the backend configuration changed between calls.
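
The temperature and top_p definitions above can be made concrete with a toy distribution. The sketch below is illustrative only: the logits are made up, the helper names are hypothetical, and it models the textbook versions of temperature scaling and nucleus sampling rather than the API's internal implementation (which handles temperature=0 as a special case instead of dividing by zero).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing each one by the temperature."""
    if temperature <= 0:
        raise ValueError("this sketch needs temperature > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Return indices of the smallest token set whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [4.0, 2.0, 1.0, 0.5]  # made-up scores for four candidate tokens
sharp = softmax_with_temperature(logits, 0.2)  # low temperature: mass piles onto the top token
flat = softmax_with_temperature(logits, 1.5)   # high temperature: mass spreads out
baseline = softmax_with_temperature(logits, 1.0)

print("top-token probability, T=0.2:", round(sharp[0], 3))
print("top-token probability, T=1.5:", round(flat[0], 3))
print("tokens kept at top_p=0.1:", nucleus(baseline, 0.1))
print("tokens kept at top_p=0.9:", nucleus(baseline, 0.9))
```

Lowering the temperature and lowering top_p both narrow the set of plausible next tokens, which is why adjusting them together makes the result hard to reason about.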

Concepts

Secrets In, Never Out

Security for an LLM client has two directions, and most engineers only think about one. The inbound direction — sourcing api_key from os.environ instead of a string literal — keeps the credential out of version control. The outbound direction is the one that gets missed: your key, and often user PII inside the prompt, must never reach your logs. Logging the full request object for debugging serializes the Authorization header and the entire messages array straight into your log store, where a compromised log pipeline becomes a credential-and-data breach. The rule is to log metadata about a call — model name, token counts, latency, a request ID — but never the key and never raw prompt or completion text unless it has been explicitly scrubbed.
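
As a backstop for the outbound rule above, a logging filter can scrub anything key-shaped before a record is written. This is a minimal sketch, not a complete defense: the RedactKeys class and the logger name are hypothetical, the regular expression assumes keys start with "sk-" followed by at least eight URL-safe characters, and it does nothing about PII in prompt text.

```python
import logging
import re

# Assumed key shape: "sk-" followed by 8 or more URL-safe characters.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{8,}")

class RedactKeys(logging.Filter):
    """Replace anything that looks like an API key before the record is emitted."""

    def filter(self, record):
        message = record.getMessage()  # merge args into the final string first
        record.msg = KEY_PATTERN.sub("[REDACTED]", message)
        record.args = ()  # args are already merged into msg
        return True  # keep the record, now scrubbed

# Attach the filter to the handler so it also covers records from child loggers.
handler = logging.StreamHandler()
handler.addFilter(RedactKeys())
demo_logger = logging.getLogger("llm.redaction_demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

# The fake key below is a placeholder, not a real credential.
demo_logger.info("auth header was Bearer %s", "sk-exampleexample1234")
```

The filter is a safety net for mistakes; the primary control is still never passing the request object, headers, or message bodies to the logger in the first place.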

Parameters as Intent, Not Defaults

Each generation parameter answers a specific product question, and copying defaults leaves those questions unanswered. temperature and top_p both control randomness, but through different mechanisms: tune one and leave the other at its default, because adjusting both makes the combined effect hard to predict. For a data-extraction task, temperature=0 gives the most repeatable output; for ideation, a higher value such as 0.9 adds variety. max_tokens caps the number of tokens generated per call, which bounds output cost (prompt tokens are billed separately, so it is not a ceiling on the whole request). stop gives you structural control: halting at a delimiter is cheaper and more reliable than generating extra tokens and trimming them afterward. presence_penalty and frequency_penalty are repetition controls: reach for frequency_penalty when the model loops on the same phrase, and presence_penalty when you want broader topic coverage. seed makes runs repeatable on a best-effort basis, which is what a test suite needs.
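
One way to make "parameters as intent" enforceable is to name the intent in code. The sketch below assumes a hypothetical TASK_PRESETS table and params_for helper; the specific numbers are starting points for illustration, not recommendations from the API documentation.

```python
# Hypothetical presets: each entry records a deliberate decision for one kind of task.
TASK_PRESETS = {
    "extraction": {"temperature": 0, "max_tokens": 256, "seed": 42},
    "ideation": {"temperature": 0.9, "max_tokens": 512, "presence_penalty": 0.5},
}

def params_for(task, **overrides):
    """Merge a task preset with per-call overrides and reject conflicting settings."""
    params = {**TASK_PRESETS[task], **overrides}
    if "temperature" in params and "top_p" in params:
        raise ValueError("tune temperature or top_p, not both")
    if "max_tokens" not in params:
        raise ValueError("max_tokens is required as an output ceiling")
    for name in ("presence_penalty", "frequency_penalty"):
        if not -2.0 <= params.get(name, 0.0) <= 2.0:
            raise ValueError(f"{name} must be between -2.0 and 2.0")
    return params

extraction_params = params_for("extraction", stop=["\n\n"])
print(extraction_params)
```

The result can be unpacked into a call, for example client.chat.completions.create(model=..., messages=..., **extraction_params), so that every call site states which task it is serving.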

Observability — Usage and Proxies

You cannot manage what you cannot measure, and LLM cost is measured in tokens. Every response carries response.usage with prompt_tokens, completion_tokens, and total_tokens; capturing these on every call and multiplying by the model's per-token price gives you an accurate, per-call basis for attributing spend. The second observability lever is base_url: pointing the client at a proxy routes all traffic through a single choke point that can log request metadata, enforce org-wide rate limits, redact PII before it leaves your network, and centralize cost accounting, all without changing a line of application code (see Code Walkthrough).
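
Per-call usage becomes useful once it is aggregated. The following is a rough sketch of an in-memory accumulator; UsageTracker is a hypothetical name, the rates are the example gpt-4 figures used later in this lesson rather than current prices, and a SimpleNamespace stands in for response.usage so the snippet runs without a network call.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace

@dataclass
class UsageTracker:
    """Accumulate token counts and estimated spend per model."""
    prices_per_1k: dict  # {"model": {"prompt": usd, "completion": usd}}; supply your own rates
    totals: dict = field(default_factory=dict)

    def record(self, model, usage):
        """Add one call's usage to the running totals and return its estimated cost."""
        rates = self.prices_per_1k[model]
        cost = (
            usage.prompt_tokens / 1000 * rates["prompt"]
            + usage.completion_tokens / 1000 * rates["completion"]
        )
        entry = self.totals.setdefault(
            model, {"prompt_tokens": 0, "completion_tokens": 0, "cost_usd": 0.0}
        )
        entry["prompt_tokens"] += usage.prompt_tokens
        entry["completion_tokens"] += usage.completion_tokens
        entry["cost_usd"] += cost
        return cost

tracker = UsageTracker({"gpt-4": {"prompt": 0.03, "completion": 0.06}})  # example rates only
fake_usage = SimpleNamespace(prompt_tokens=1000, completion_tokens=500, total_tokens=1500)
call_cost = tracker.record("gpt-4", fake_usage)
tracker.record("gpt-4", fake_usage)
print(f"one call: ${call_cost:.4f}; running totals: {tracker.totals['gpt-4']}")
```

Pass the model name you requested as the key: the model string echoed back on the response may be a more specific version identifier, which would miss a price table keyed by the short name.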

Code Walkthrough

Now that you have a mental model of the production checklist — secrets inbound never outbound, parameters set with intent, usage captured per call — the code below assembles all three layers into one runnable script.

The client sources both its key and proxy URL from the environment so no secret or endpoint is baked into source:

```python
import os
import logging
from openai import OpenAI

# Without a handler at INFO level, logger.info() output is silently dropped.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm")

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),     # inbound secret, never a literal
    base_url=os.environ.get("OPENAI_PROXY_URL"),  # route through the observability proxy
    timeout=60.0,
    max_retries=3,
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Extract the invoice total as JSON."},
        {"role": "user", "content": "Total due: $1,240.50 by March 3rd."},
    ],
    temperature=0,          # near-deterministic: this is extraction, not ideation
    max_tokens=256,         # hard ceiling on generated tokens per call
    stop=["\n\n"],          # halt cleanly at a blank line, no trimming needed
    presence_penalty=0.0,   # no topic-broadening for structured extraction
    frequency_penalty=0.3,  # gently suppress repeated tokens
    seed=42,                # best-effort reproducibility for the test suite
)
```

Both api_key and base_url are read from os.environ: the key never appears in source, and the proxy URL makes every request auditable at a network choke point without touching individual call sites. If OPENAI_PROXY_URL is unset, os.environ.get returns None and the client falls back to its default endpoint. Each generation parameter is set deliberately: temperature=0 paired with seed=42 makes this extraction as repeatable as the API allows; max_tokens=256 caps the output tokens billed per call; stop=["\n\n"] terminates cleanly at a blank line; frequency_penalty=0.3 suppresses repeated tokens. top_p is left at its default: with temperature=0 already pinning the sampling behavior, adjusting top_p as well is redundant, so only one randomness axis is used at a time.
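
Because seed is best-effort, a test suite benefits from telling "the backend changed" apart from "the output simply varied". The sketch below shows one possible way to classify two seeded responses using system_fingerprint; the function name and the three labels are hypothetical, and SimpleNamespace objects stand in for real responses so the snippet runs offline.

```python
from types import SimpleNamespace

def reproducibility_check(first, second):
    """Classify two seeded responses by comparing their text and system_fingerprint."""
    same_text = first.choices[0].message.content == second.choices[0].message.content
    same_backend = first.system_fingerprint == second.system_fingerprint
    if same_text:
        return "reproduced"
    if not same_backend:
        return "backend_changed"  # a differing fingerprint is a plausible cause of the diff
    return "nondeterministic"     # same reported backend, different text

def fake_response(text, fingerprint):
    """Build a stand-in with only the attributes this sketch reads."""
    return SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(content=text))],
        system_fingerprint=fingerprint,
    )

run_a = fake_response('{"total": 1240.50}', "fp_aaa")
run_b = fake_response('{"total": 1240.50}', "fp_aaa")
run_c = fake_response('{"total": "1,240.50"}', "fp_bbb")

print(reproducibility_check(run_a, run_b))
print(reproducibility_check(run_a, run_c))
```

Some responses may carry no fingerprint at all; in this sketch two missing fingerprints compare as equal, so a text difference is then reported as "nondeterministic".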

The second block extracts the reply, then logs and tracks cost using only safe metadata:

```python
reply = response.choices[0].message.content

# SAFE: log metadata only; never the key, never raw prompt/completion text
logger.info(
    "llm_call model=%s prompt_tokens=%d completion_tokens=%d total_tokens=%d",
    response.model,
    response.usage.prompt_tokens,
    response.usage.completion_tokens,
    response.usage.total_tokens,
)

PRICE_PER_1K = {"prompt": 0.03, "completion": 0.06}  # gpt-4 example rates (USD)
cost = (
    response.usage.prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
    + response.usage.completion_tokens / 1000 * PRICE_PER_1K["completion"]
)
print(f"reply={reply!r}  cost=${cost:.4f}")
```

The log line records only token counts and model name. It deliberately omits the API key and all prompt or reply text, so a compromised log store leaks neither the credential nor the invoice data. response.usage then drives a per-call cost estimate: each token count is multiplied by the per-1K rate in PRICE_PER_1K (example gpt-4 figures; substitute your model's current published prices), giving you the foundation for any cost dashboard or budget alert.

You'll know it works when the script prints a JSON-shaped reply and a non-zero dollar cost, and the log line contains three token counts but no sk- key fragment and no invoice text.
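
The acceptance criteria above can be turned into an automated check. This is a sketch under stated assumptions: log_usage and captured_log are hypothetical helper names, the logger name is arbitrary, and a SimpleNamespace stub replaces the real response so the check runs without an API key or network access.

```python
import io
import logging
from types import SimpleNamespace

def log_usage(logger, response):
    """Log model name and token counts only, mirroring the walkthrough's log line."""
    logger.info(
        "llm_call model=%s prompt_tokens=%d completion_tokens=%d total_tokens=%d",
        response.model,
        response.usage.prompt_tokens,
        response.usage.completion_tokens,
        response.usage.total_tokens,
    )

def captured_log(response):
    """Run log_usage against an in-memory handler and return what was written."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    logger = logging.getLogger("llm.safety_check")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the check isolated from other handlers
    logger.addHandler(handler)
    try:
        log_usage(logger, response)
    finally:
        logger.removeHandler(handler)
    return buffer.getvalue()

# Stub response carrying invoice text that must NOT reach the log.
stub = SimpleNamespace(
    model="gpt-4",
    usage=SimpleNamespace(prompt_tokens=31, completion_tokens=12, total_tokens=43),
    choices=[SimpleNamespace(message=SimpleNamespace(content='{"total": 1240.50}'))],
)

line = captured_log(stub)
assert "sk-" not in line            # no key fragment
assert "1240" not in line           # no invoice text
assert line.count("tokens=") == 3   # exactly three token counts
print(line.strip())
```

A check like this in the test suite catches the common regression where someone adds the reply or the request object to the log call while debugging.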

Do's and Don'ts

Having assembled the full production shape above, the following Do's and Don'ts distill it into the chapter's checklist.

Do's

✓Do log only metadata (model name, token counts, latency, request ID) and never the API key or raw prompt/completion text: a debugging log line that serializes the full request object writes your credential and any user PII straight into your log store, turning a compromised log pipeline into a data breach.
✓Do set every generation parameter with intent: temperature=0 for extraction, higher for ideation, max_tokens as a cost ceiling, stop for clean termination, and seed for reproducibility — copied defaults mean each of those product decisions went unmade, and you'll discover the wrong one in production.
✓Do capture response.usage.prompt_tokens, completion_tokens, and total_tokens on every call and multiply by the model's rate — token counts arrive free on every response and are the only accurate basis for cost attribution and budgeting.

Don'ts

✗Don't tune temperature and top_p at the same time — both control sampling randomness through different mechanisms, and adjusting them together produces compounding, unpredictable output; pick one lever, leave the other at its default.
✗Don't leave max_tokens unset when cost matters — an unbounded completion can consume far more tokens than expected before it returns; max_tokens is the only client-side hard cap, and pairing it with a stop sequence terminates generation even more precisely at a known delimiter.
✗Don't route production traffic directly to the API when you need observability — bypassing a base_url proxy forfeits centralized request logging, org-wide rate limiting, PII redaction, and unified cost accounting; drive the proxy host from OPENAI_PROXY_URL so the choke point exists without any code change.
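
To keep the key and proxy decisions in one place, client construction can be driven by a small settings helper. The sketch below assumes the same OPENAI_API_KEY and OPENAI_PROXY_URL variables used in the walkthrough; client_settings is a hypothetical helper, and the proxy hostname in the example is a placeholder.

```python
import os

def client_settings(env=os.environ):
    """Build keyword arguments for OpenAI(...) from environment variables."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Fail fast at startup, and never echo the value in the error message.
        raise RuntimeError("OPENAI_API_KEY is not set")
    settings = {"api_key": api_key, "timeout": 60.0, "max_retries": 3}
    proxy_url = env.get("OPENAI_PROXY_URL")
    if proxy_url:
        settings["base_url"] = proxy_url  # route through the proxy only when configured
    return settings

# Example with explicit mappings so the snippet does not depend on the real environment.
direct = client_settings({"OPENAI_API_KEY": "placeholder-key"})
proxied = client_settings({
    "OPENAI_API_KEY": "placeholder-key",
    "OPENAI_PROXY_URL": "https://llm-proxy.internal.example/v1",  # hypothetical proxy host
})
print(sorted(direct), sorted(proxied))
```

In application code this would be used as client = OpenAI(**client_settings()), so switching between direct and proxied traffic is a deployment setting rather than a code change.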
