I shipped a 12-question crypto security audit in 2 hours
I had ~2 hours on a Saturday. A committee of two AI models (Gemini Pro + Claude) had just argued me into building a "gamified security audit" tied to our existing in-app currency (💎). 500💎 → 12-question scan → AI report. $49 upsell to a human pentest.
Here's the technical breakdown.
The product in one sentence
User answers 12 questions about their crypto hygiene (seed storage, 2FA, DeFi approvals, etc.). They get a 0–100 score, a per-category breakdown, and a top-3 vulnerabilities report with named tools and 2024–25 incident cases. It costs 500💎 they earned through activity.
Live: askoracle.site/audit — try it.
Question schema
Every question carries a weight (the weights across all 12 questions sum to exactly 100), and each option carries a risk value that is independent of the language labels. Example:
```python
{
    "id": "seed_storage",
    "category": "wallet",
    "weight": 14,   # contributes max 14 points to score loss
    "title": {      # per-language headers
        "ru": "Где хранится твоя главная seed-фраза?",
        "en": "Where do you store your main seed phrase?",
        "es": "¿Dónde guardas tu frase semilla principal?",
    },
    "options": [
        {"v": "metal",  # language-independent enum
         "ru": "Металлический backup (Cryptosteel)",
         "en": "Metal backup (Cryptosteel / Billfodl)",
         "es": "Backup metálico (Cryptosteel / Billfodl)",
         "risk": 0},
        {"v": "cloud_notes",
         "ru": "Заметки в iCloud/Google Drive/Notion",
         "en": "Notes in iCloud / Google Drive / Notion",
         "es": "Notas en iCloud / Google Drive / Notion",
         "risk": 10},
        # ...
    ],
}
```
The same risk arithmetic runs for all 3 languages. Only the rendering and the AI prompt swap on ?lang=ru|en|es.
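Rendering then reduces to a dictionary lookup. A minimal sketch of the label picker, assuming an English fallback for missing translations (the fallback is my guess, not something the post confirms):

```python
def _label(strings: dict, lang: str) -> str:
    # strings is a per-language dict like q["title"] above;
    # falling back to English is an assumption for this sketch
    return strings.get(lang) or strings["en"]

# _label(q["title"], "es") -> "¿Dónde guardas tu frase semilla principal?"
```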
Scoring
```python
def _calc_score(answers: dict, lang: str) -> tuple[int, dict, list]:
    cat_lost = {"wallet": 0.0, "twofa": 0.0, "defi": 0.0, "operational": 0.0}
    vulns = []
    for q in QUESTIONS:
        w = q["weight"]
        max_r = max(o["risk"] for o in q["options"])
        opt = next((o for o in q["options"] if o["v"] == answers.get(q["id"])), None)
        risk = opt["risk"] if opt else max_r  # missing answer = max risk
        cat_lost[q["category"]] += (risk / max_r) * w if max_r else 0
        if max_r > 0 and risk >= max_r * 0.6:
            vulns.append({
                "qid": q["id"],
                "severity": "high" if risk >= max_r * 0.85 else "medium",
                # ...
            })
    breakdown = {cat: round(lost) for cat, lost in cat_lost.items()}  # points lost per category
    score = max(0, round(100 - sum(cat_lost.values())))
    return score, breakdown, vulns[:5]
```
If a user doesn't answer (impossible via the UI, but defensive), they take max risk. The score is floored at 0 — no negative scores.
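To make the arithmetic concrete, a hypothetical run against a QUESTIONS list trimmed to just the seed_storage question from the schema section (not the real 12-question set):

```python
# cloud_notes has risk 10 of max 10, so the full 14-point weight is lost:
score, breakdown, vulns = _calc_score({"seed_storage": "cloud_notes"}, "en")
# score == 86, breakdown["wallet"] == 14, and seed_storage lands in
# vulns with severity "high" (10 >= 10 * 0.85)

# metal has risk 0, so nothing is lost:
score, _, vulns = _calc_score({"seed_storage": "metal"}, "en")
# score == 100, vulns == []
```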
LLM fallback chain
The most useful piece. I have 5 free Groq API keys, a paid DeepSeek key, and the Vertex AI Pro CLI as a last resort. Each handoff happens on rate-limit or empty response:
```python
import re
import subprocess
import tempfile

def _generate_report(answers, score, breakdown, vulns, lang):
    # system_msg / user_msg are assembled upstream from answers, score,
    # breakdown, vulns and lang (elided here)
    text, provider = _chat_complete(system_msg, user_msg, max_tokens=2200)
    if text and len(text) > 400:
        return text, provider  # Groq1..5 or DeepSeek worked

    # Subprocess fallback to ask_pro CLI (Vertex AI)
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
            f.write(system_msg + "\n\n---\n\n" + user_msg)
        r = subprocess.run(
            ["/usr/local/bin/ask_pro", "-f", f.name, "--max", "3000"],
            capture_output=True, text=True, timeout=90,
        )
        out = re.sub(r"\n\[tokens:.*?\]\s*$", "", r.stdout.strip())
        if out and len(out) > 400:
            return out, "vertex-pro-fallback"
    except Exception:
        pass

    # Last resort: deterministic template
    return _fallback_report(score, breakdown, vulns, lang), "fallback-template"
```
The deterministic template is 30 lines of f-strings that produce a usable (if bland) report from raw answers. It cannot fail. Users always get something.
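The post doesn't show `_chat_complete` itself. Here is a minimal sketch of the key-walking part, assuming both Groq and DeepSeek are called through their OpenAI-compatible endpoints (the provider order, model names, and the `GROQ_KEYS` / `DEEPSEEK_KEY` variables are illustrative, not the actual implementation):

```python
from openai import OpenAI  # Groq and DeepSeek both expose OpenAI-compatible APIs

# (base_url, api_key, model) triples, tried in order; values are illustrative
PROVIDERS = [
    *[("https://api.groq.com/openai/v1", k, "llama-3.3-70b-versatile")
      for k in GROQ_KEYS],                                        # 5 free keys
    ("https://api.deepseek.com", DEEPSEEK_KEY, "deepseek-chat"),  # paid
]

def _chat_complete(system_msg, user_msg, max_tokens=2200):
    for i, (base_url, key, model) in enumerate(PROVIDERS):
        try:
            client = OpenAI(base_url=base_url, api_key=key)
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "system", "content": system_msg},
                          {"role": "user", "content": user_msg}],
                max_tokens=max_tokens,
            )
            text = (resp.choices[0].message.content or "").strip()
            if text:
                return text, f"provider{i + 1}"
        except Exception:
            continue  # rate limit / auth / network: hand off to the next key
    return None, None  # caller falls through to the Vertex CLI, then the template
```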
In E2E testing: Groq key1 was rate-limited that morning; the chain landed on groq3 in 3 seconds. Cost: $0.
The AI prompt — one rule that mattered
"""
You are a senior crypto-security analyst at GuardLabs. ...
IMPORTANT:
- Never mention "AI", "ChatGPT", "LLM", "model" — you are a GuardLabs analyst
- Use specific tool names: revoke.cash, Pocket Universe, YubiKey 5C,
SimpleLogin, Bitwarden, Cryptosteel Capsule
- Anchor to real incident cases when possible (Lazarus, Anubis,
Inferno Drainer, address poisoning waves)
"""
That third bullet is what makes the report not feel generic. The model knows Inferno Drainer (May 2023, $87M total), it knows address-poisoning (Q2 2024 surge), it knows SIM-swapping by Lazarus (FTX, AscendEx). Without the prompt grounding it just says "your funds could be stolen."
Sample paragraph from a test run (user said: seed in iCloud, browser-only wallet, SMS for 2FA):
> SMS codes for 2FA — severity: high
> Using SMS codes for 2FA is significantly vulnerable due to SIM-swapping attacks. In 2020 a hacker used SIM-swapping to steal over $100,000 in crypto from a victim's exchange account. SMS codes can be intercepted or bypassed, granting attackers access. Consider U2F or TOTP via SimpleLogin or Bitwarden.
>
> Quick fix: Replace SMS with a YubiKey 5C or an authenticator app, and enable 2FA on your email using a password manager like Bitwarden.
XSS defense in the report renderer
The report comes from an LLM. LLMs are influenced by user input. User input is enum values (validated server-side against an allowlist) — but I still html-escape the LLM output before re-wrapping our markdown patterns:
```python
def _report_html(audit_row, report_md, lang):
    import html as _html
    md = _html.escape(report_md or "")  # escape FIRST
    md = re.sub(r"^### (.+)$", r"<h3>\1</h3>", md, flags=re.M)
    md = re.sub(r"^## (.+)$", r"<h2>\1</h2>", md, flags=re.M)
    md = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", md)
    # ...
```
Yes, this means a literal `<script>` in the LLM's output becomes `&lt;script&gt;` — visible as text, not executed. The pattern is "escape everything, then deliberately unescape only the markdown I expect." Cheap and correct.
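The ordering is the whole trick, and it's easy to sanity-check in isolation (standalone repro, not the app's code):

```python
import html, re

md = html.escape("## Summary\n<script>alert(1)</script> and **bold**")
md = re.sub(r"^## (.+)$", r"<h2>\1</h2>", md, flags=re.M)
md = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", md)
print(md)
# <h2>Summary</h2>
# &lt;script&gt;alert(1)&lt;/script&gt; and <strong>bold</strong>
```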
Rate-limit on paid endpoints
/api/audit/start (creates a draft, costs nothing): 5/min, 20/hour.
/api/audit/submit (burns 500💎, calls LLM): 3/min, 10/hour.
Per-IP. Defends against malicious DoS of paid resources. Uses Flask-Limiter, already wired into the host app.
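In Flask-Limiter terms that's one decorator per route; roughly (handler names and app wiring are illustrative, not the app's exact code):

```python
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app)  # per-IP key function

@app.post("/api/audit/start")
@limiter.limit("5 per minute; 20 per hour")  # cheap: just creates a draft
def audit_start():
    ...

@app.post("/api/audit/submit")
@limiter.limit("3 per minute; 10 per hour")  # burns 500💎 and calls the LLM
def audit_submit():
    ...
```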
Cost to ship
| Component | Cost |
|---|---|
| Postgres schema migration | $0 |
| Routes (~750 lines initial, ~1170 after i18n) | $0 |
| Veo / image generation | not needed |
| E2E test runs (RU + EN + ES) | $0 (Groq free tier) |
| Vertex Pro fallback (not used in tests) | would be ~$0.03 |
| Total | $0 |
Time: ~2 hours active development. The pre-existing in-app currency, atomic spend SQL, and LLM fallback chain in the parent app saved most of the work.
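The "atomic spend SQL" is the classic guard-in-the-UPDATE pattern; a sketch of what it could look like with SQLAlchemy (table and column names are my guesses):

```python
from sqlalchemy import text

def spend_gems(session, user_id: int, amount: int = 500) -> bool:
    # Single statement: the WHERE guard makes check-and-debit atomic,
    # so concurrent submits can't double-spend the same balance.
    row = session.execute(
        text("UPDATE users SET gems = gems - :amt "
             "WHERE id = :uid AND gems >= :amt RETURNING gems"),
        {"amt": amount, "uid": user_id},
    ).fetchone()
    session.commit()
    return row is not None  # None -> insufficient balance, nothing debited
```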
What I'd do differently next time
1. Cache the LLM response on (answers-hash, lang) — the same user repeating shouldn't re-pay. Currently they can't repeat (one scan per user is enforced), but if I add a "refresh" feature this matters.
2. Sign the audit_id with HMAC — currently /audit/report/<uuid> is enumerable: anyone with the UUID sees the report. For free-tier scans this is fine; for paid scans I'd want a signed URL (a stdlib sketch follows this list).
3. Move the AI prompt to a version-controlled file with a build-time pin. Right now changing the prompt is a code edit. Should be a config artifact with a hash.
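For item 2, a minimal sketch using only the stdlib (the token format and secret handling are my choices, not the app's):

```python
import hmac, hashlib

SECRET = b"server-side-secret"  # illustrative; load from config in practice

def sign_audit_id(audit_id: str) -> str:
    sig = hmac.new(SECRET, audit_id.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{audit_id}.{sig}"  # served as /audit/report/<audit_id>.<sig>

def verify_audit_token(token: str) -> str | None:
    audit_id, _, sig = token.rpartition(".")
    want = hmac.new(SECRET, audit_id.encode(), hashlib.sha256).hexdigest()[:16]
    return audit_id if hmac.compare_digest(sig, want) else None
```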
Try it
Free 12-question scan: askoracle.site/audit (after earning 500💎 through in-app activity).
If you want the full manual audit (wallet forensics, phishing simulation, custom playbook), that's $49 / one-time.
The interesting bit: this whole pipeline cost less in dev hours than most teams spend writing the spec doc for an equivalent feature. The LLM fallback chain is reusable infrastructure — the 12 questions are the actual IP.
Code conventions: Postgres + SQLAlchemy + Flask, multi-agent fallback, html-escape-then-unescape rendering. All boring. The product is the work, not the framework.
— @sspoisk / GuardLabs