Kodecraft
Tech Enrichment

Lance Alexander Ventura · AI Engineer Intern
Problem Context

Tech links scroll into oblivion.

Kodecraft developers regularly post tech and AI links into Mattermost as a way to share discoveries with the team. The links live in a communication medium, not a knowledge base.

Past links become practically unfindable — they scroll out of view, mix with unrelated chat, and cannot be referenced systematically. A link posted six months ago that would solve today's problem is effectively lost.

Volume is high enough that even recently shared links get missed. There is no filter for what matters to Kodecraft, no record of past evaluations, and no signal when a tool is superseded.

What This Project Does

A continuous pipeline from Mattermost links to a curated knowledge base.

  • Ingestion — polls scoped Mattermost channels, canonicalizes URLs, dedupes
  • Two-stage LLM evaluation — cheap classifier short-circuits a capable evaluator
  • Outline knowledge base — workflow, dependency, general, and history docs
  • Mattermost discovery hub — top-5 ranking per category, edited in place
  • HITL signaling — 🟢 / 🔴 vote reactions feed lifecycle decisions
  • Lifecycle & freshness — disposition + freshness axes, scheduled re-evaluation
System Overview · Pipeline Architecture
Architecture

Six stages, two surfaces.

A periodic poll fetches new posts from scoped Mattermost channels. URLs are canonicalized and deduped, then a heuristic filter rejects obvious non-tech links before any LLM call.

A cheap classifier (Stage 1) labels surviving links and narrows project context. A capable evaluator (Stage 2) produces the structured record that drives writes to Outline (full record) and the Mattermost Hub (top-5 ranking).

[Pipeline diagram] Mattermost #tech-links → Ingest (canonicalize · dedup) → Heuristic Filter (domain · URL rules) → Stage 1 (classify · gpt-oss-120) → Stage 2 (evaluate · gpt-oss-120) → Outline KB (workflow · dep · general) + MM Hub (top-5 per category)
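A minimal orchestration sketch of that flow, with every stage passed in as a callable; all function and key names here are illustrative placeholders, not the pipeline's real module API.

```python
# Hypothetical orchestration of the six stages; every callable is a placeholder.
from typing import Callable, Iterable


def run_pipeline(
    urls: Iterable[str],
    canonicalize: Callable[[str], str],
    is_duplicate: Callable[[str], bool],
    passes_heuristics: Callable[[str], bool],
    classify: Callable[[str], dict],            # Stage 1 (cheap classifier)
    evaluate: Callable[[str, list], dict],      # Stage 2 (capable evaluator)
    write_outline: Callable[[dict], None],
    update_hub: Callable[[dict], None],
) -> None:
    for url in urls:
        link = canonicalize(url)
        if is_duplicate(link) or not passes_heuristics(link):
            continue                            # Stage 0 reject: no LLM call spent
        s1 = classify(link)
        if s1["classification"] == "general" and s1["confidence"] != "low":
            record = s1                         # medium+ confidence "general" short-circuits Stage 2
        else:
            record = evaluate(link, s1["candidate_slugs"])
        write_outline(record)                   # full record (source of truth)
        update_hub(record)                      # re-rank the top-5 pinned post
```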
Where It Starts

Kodecraft already shares links — in #tech-links.

Developers post AI tools, libraries, frameworks, and articles in Mattermost as fast as they encounter them. This is the input surface, and it already exists.

The pipeline does not change how people share. It changes what happens after a link is posted: every URL is captured, canonicalized, deduped, and routed through evaluation — without anyone needing to do anything new.

Tech Enrichment Hub · Outline side

Outline holds the full record.

Every evaluated link lives in Outline as a structured entry: classification, relevance score, pros/cons, project routing, freshness state. Multiple docs split the surface — Workflow, per-project Dependencies, General, and an append-only Evaluation History.

This is the source of truth. Search, audit, freshness — all here.

(Demo: switching to Outline)
Tech Enrichment Hub · Mattermost side

Mattermost Hub is the top-5 triage surface.

One pinned post per category in #tech-enrichment-hub: Workflow, Dependency (per project), General. Each post is edited in place on every dirty-flag dispatch — no scrollback, no growing thread.

Outline = full record. Mattermost Hub = "what should I look at next?"
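A hedged sketch of the edit-in-place update for a pinned Hub post via the Mattermost REST API (v4); the base URL, token, and Markdown payload are placeholders.

```python
# Sketch only: rewrites the pinned category post instead of appending new messages.
import requests

BASE = "https://mattermost.example.com/api/v4"       # placeholder host
HEADERS = {"Authorization": "Bearer BOT_TOKEN"}      # placeholder bot token


def update_hub_post(post_id: str, top5_markdown: str) -> None:
    # PUT /posts/{post_id} replaces the message body in place,
    # so the channel shows no scrollback and no growing thread.
    resp = requests.put(
        f"{BASE}/posts/{post_id}",
        headers=HEADERS,
        json={"id": post_id, "message": top5_markdown},
    )
    resp.raise_for_status()
```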

(Demo: switching to Mattermost)
Human-in-the-Loop · Signaling
HITL Signaling

The LLM curates. Humans signal.

Each new entry gets a threaded reply in #tech-links with two seed reactions: 🟢 ("I used this and it worked") and 🔴 ("I tried it; it didn't fit"). There is deliberately no third "haven't used it" reaction — it would create social pressure and a noisy signal.

Non-engagement is computed, not declared. Reactions are retractable; current state is authoritative.

Reaction model
  • 🟢 I used it · positive signal
  • 🔴 didn't fit · negative signal
  • silence · computed, never asked
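A hedged sketch of seeding those reactions on a threaded reply through the Mattermost REST API (v4); the host, token, IDs, and emoji names are assumptions, not confirmed values from the project.

```python
# Sketch only: post a threaded reply under the link post, then add 🟢 / 🔴 seeds.
import requests

BASE = "https://mattermost.example.com/api/v4"       # placeholder host
HEADERS = {"Authorization": "Bearer BOT_TOKEN"}      # placeholder bot token


def seed_reply(channel_id: str, root_id: str, bot_user_id: str, message: str) -> None:
    reply = requests.post(f"{BASE}/posts", headers=HEADERS, json={
        "channel_id": channel_id,
        "root_id": root_id,                          # threads the reply under the link post
        "message": message,
    }).json()
    for emoji in ("large_green_circle", "red_circle"):   # assumed emoji names for 🟢 / 🔴
        requests.post(f"{BASE}/reactions", headers=HEADERS, json={
            "user_id": bot_user_id,
            "post_id": reply["id"],
            "emoji_name": emoji,
        })
```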
Evaluation · Stage 0 · Heuristic Pre-Filter
Stage 0

Reject the obvious before spending an LLM token.

URLs are canonicalized — tracking parameters stripped, redirects resolved, GitHub URLs collapsed to owner/repo. Duplicates against the link registry are dropped here.

A small YAML rule set (config/heuristic_rules.yaml) rejects domains and URL patterns that aren't worth evaluating: pure social media, internal Kodecraft URLs, unrelated content.

Reject reasons
  1. duplicate of canonical URL
  2. blocklisted domain
  3. internal Kodecraft URL
  4. known non-artifact pattern
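A minimal sketch of the canonicalize-and-filter step described above; the rule-file keys (blocked_domains, blocked_patterns) and the tracking-parameter list are assumptions, and redirect resolution is omitted for brevity.

```python
# Stage 0 sketch: canonicalize, then apply YAML heuristic rules (keys assumed).
import re
import yaml
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "ref"}


def canonicalize(url: str) -> str:
    p = urlparse(url)
    query = urlencode([(k, v) for k, v in parse_qsl(p.query) if k not in TRACKING])
    path = p.path.rstrip("/")
    if p.netloc.lower() == "github.com":             # collapse GitHub URLs to owner/repo
        path = "/".join(path.split("/")[:3])
    return urlunparse((p.scheme, p.netloc.lower(), path, "", query, ""))


def load_rules(path: str = "config/heuristic_rules.yaml") -> dict:
    with open(path) as fh:
        return yaml.safe_load(fh) or {}


def passes_heuristics(url: str, rules: dict) -> bool:
    host = urlparse(url).netloc.lower()
    if any(host.endswith(d) for d in rules.get("blocked_domains", [])):
        return False
    return not any(re.search(pat, url) for pat in rules.get("blocked_patterns", []))
```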
Evaluation · Stage 1 · Cheap Classifier
Stage 1

Cheap model labels, narrows context, short-circuits.

A small model (gpt-oss-120) sees the link, fetched content, the project taxonomy, and all per-project PRDs. It labels the link as workflow / dependency / both / general / neither.

Output also carries candidate_slugs — projects most likely affected — which Stage 2 will use to narrow its expensive context. general with medium+ confidence short-circuits Stage 2 entirely.

Stage 1 output
  1. classification
  2. candidate_slugs[]
  3. confidence (low/med/high)
  4. rationale
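A sketch of that output as a pydantic model; the field names mirror the list above, while the exact literal strings the pipeline emits are assumptions.

```python
# Stage 1 output record (field names from the list above; value strings assumed).
from typing import List, Literal

from pydantic import BaseModel


class Stage1Output(BaseModel):
    classification: Literal["workflow", "dependency", "both", "general", "neither"]
    candidate_slugs: List[str]                 # projects most likely affected
    confidence: Literal["low", "medium", "high"]
    rationale: str


def short_circuits(out: Stage1Output) -> bool:
    # "general with medium+ confidence short-circuits Stage 2 entirely"
    return out.classification == "general" and out.confidence in {"medium", "high"}
```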
Evaluation · Stage 2 · Decision Rules
Stage 2

Full evaluation with Decision Rules.

A capable model (gpt-oss-120) consumes per-project PRD + dependency context, narrowed by Stage 1's candidate_slugs within a 4000-token budget per axis.

Output is a strict JSON schema. matches_existing filters per project while preserving the global adopted signal: an already-adopted match excludes that project, not the link. Fetched content is treated as untrusted data, not instructions.

Output schema (strict)
  1. classification
  2. relevance_score · 0.0–1.0
  3. matches_existing
  4. affected_projects[]
  5. pros / cons
  6. confidence
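A sketch of the strict schema as a pydantic model; the per-project shape of matches_existing is an assumption based on "filters per project", and the literal strings echo Stage 1.

```python
# Stage 2 output record (fields from the list above; matches_existing shape assumed).
from typing import Dict, List, Literal

from pydantic import BaseModel, Field


class Stage2Output(BaseModel):
    classification: Literal["workflow", "dependency", "both", "general", "neither"]
    relevance_score: float = Field(ge=0.0, le=1.0)
    matches_existing: Dict[str, bool]          # project slug -> already adopted there?
    affected_projects: List[str]
    pros: List[str]
    cons: List[str]
    confidence: Literal["low", "medium", "high"]
```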
Operations · Cost & Token Usage Evaluation
Cost & Token Usage

gpt-oss-120 on both stages — tokens observed, cost estimated.

Langfuse captured 950 observations across 575 traces (Apr 26 – May 8), but model pricing wasn't configured — cost_details are null for all calls.

Figures are estimated at $0.15 / 1M input tokens and $0.60 / 1M output tokens, applied to the observed token counts from 51 clean two-stage runs.

Token & cost (est.) — 51 runs
  • Stage 1 / run · ~3,330 tok · ~$0.00065
  • Stage 2 / run · ~6,043 tok · ~$0.00125
  • Pipeline / run · ~9,373 tok · ~$0.0019
  • 51-run total · ~$0.097 est.
  • Avg latency · ~5.35 s / run
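A back-of-envelope check of those numbers; the per-run input/output token split is not in the observed data, so an ~88/12 split (chosen to be consistent with the figures above) is assumed purely for illustration.

```python
# Re-deriving the estimate; the 88/12 input/output split is an assumption.
IN_RATE, OUT_RATE = 0.15 / 1e6, 0.60 / 1e6          # $ per token
tokens_per_run, runs = 9_373, 51
input_tok = int(tokens_per_run * 0.88)              # assumed split
output_tok = tokens_per_run - input_tok
cost_per_run = input_tok * IN_RATE + output_tok * OUT_RATE
print(f"per run ≈ ${cost_per_run:.4f}, {runs} runs ≈ ${runs * cost_per_run:.3f}")
# → per run ≈ $0.0019, 51 runs ≈ $0.098, in line with the ~$0.097 estimate above
```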
Lifecycle · Disposition + Freshness
Two-axis state model

Disposition answers "what's our stance?" Freshness answers "how trustworthy is the analysis?"

Disposition (mutually exclusive, pipeline-managed)
State · Meaning
🟢 active · Current recommendation or informational entry.
✅ adopted · Team signals indicate Kodecraft is using it.
🔴 rejected · Team or pipeline determined it's not useful.
⚫ obsolete · Tool itself is no longer viable (archived, deprecated).
🟡 superseded · Another candidate replaced it as the recommendation.
Freshness (overlay, separate from disposition)
State · Meaning
✓ fresh · Latest evaluation is recent enough to trust.
◷ due_for_review · Still usable, but the check window has expired.
⏳ awaiting_evaluation · Queued for a re-check.
⚠ reevaluation_failed · Last re-check failed; prior analysis remains visible.
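A minimal sketch of the two axes as independent enums; the value strings mirror the tables above, and the pairing into one entry state is illustrative.

```python
# Two-axis state model: disposition and freshness evolve independently.
from dataclasses import dataclass
from enum import Enum


class Disposition(str, Enum):
    ACTIVE = "active"
    ADOPTED = "adopted"
    REJECTED = "rejected"
    OBSOLETE = "obsolete"
    SUPERSEDED = "superseded"


class Freshness(str, Enum):
    FRESH = "fresh"
    DUE_FOR_REVIEW = "due_for_review"
    AWAITING_EVALUATION = "awaiting_evaluation"
    REEVALUATION_FAILED = "reevaluation_failed"


@dataclass
class EntryState:
    disposition: Disposition      # "what's our stance?"
    freshness: Freshness          # "how trustworthy is the analysis?"
```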
Lifecycle · Re-evaluation Cadence
Scheduled re-checks

Re-evaluation cadence by record type.

Default review windows
Record type · Review cadence
active recommendation · 30 days
alternative candidate · 14 days
adopted · 60 days
general current-state item · 60 days
rejected · no scheduled review
obsolete or superseded · no scheduled review unless manually targeted

Failed re-evaluations preserve the prior payload; obsolete requires an evidence-backed reason; superseded requires a successor reference. Age alone never makes a tool obsolete.
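A sketch of the cadence lookup implied by the table; the record-type keys are illustrative spellings, not the pipeline's actual identifiers, and None means no scheduled review.

```python
# Review-window lookup from the table above (keys are assumed identifiers).
from datetime import datetime, timedelta
from typing import Optional

REVIEW_WINDOW_DAYS = {
    "active_recommendation": 30,
    "alternative_candidate": 14,
    "adopted": 60,
    "general_current_state": 60,
    "rejected": None,
    "obsolete": None,
    "superseded": None,
}


def next_review(record_type: str, last_evaluated: datetime) -> Optional[datetime]:
    days = REVIEW_WINDOW_DAYS.get(record_type)
    return last_evaluated + timedelta(days=days) if days else None
```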

Lifecycle · Low-Engagement Follow-Up
Gentle prompting, never silent removal

Low-signal entries get a follow-up nudge, not a deletion.

Entry signal lifecycle
Phase · Trigger · System action
Newly surfaced · Entry just created · Threaded reply with 🟢 / 🔴 seed reactions
Active · Entry visible, accumulating signals · Counts tracked; no notification
Low-engagement · Age > 6 weeks, low reaction rate · Follow-up prompt in channel
Persistent silence · Long quiet period after prompts · Marked "awaiting evaluation"; remains visible
Adopted / Rejected · Sufficient 🟢 / 🔴 signals · Disposition transition; appended to history
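A sketch of the low-engagement trigger from the table; the six-week age threshold is stated above, while the specific reaction-count cutoff is an assumption standing in for "low reaction rate".

```python
# Low-engagement follow-up trigger (reaction-count cutoff assumed).
from datetime import datetime, timedelta


def needs_follow_up(created_at: datetime, green: int, red: int, now: datetime) -> bool:
    # Table trigger: age > 6 weeks and a low reaction rate; "< 2 total reactions"
    # is an assumed stand-in for "low".
    return (now - created_at) > timedelta(weeks=6) and (green + red) < 2
```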
Future Direction

RAG, when retrieval actually starts to hurt.

Today's retrieval is direct: PRD bodies and dependency-doc sections are loaded from Outline and injected into prompts within token budgets. This is simple and sufficient for the current corpus.
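A sketch of that budgeted direct injection; the 4-characters-per-token approximation stands in for whatever tokenizer the pipeline actually uses, and the default budget echoes the per-axis figure from Stage 2.

```python
# Direct retrieval sketch: concatenate Outline doc bodies until the budget runs out.
from typing import List

TOKEN_BUDGET = 4000          # per axis, echoing the Stage 2 budget above
CHARS_PER_TOKEN = 4          # rough heuristic, not the real tokenizer


def fit_to_budget(doc_bodies: List[str], budget_tokens: int = TOKEN_BUDGET) -> str:
    out, remaining = [], budget_tokens * CHARS_PER_TOKEN
    for body in doc_bodies:
        if remaining <= 0:
            break
        out.append(body[:remaining])
        remaining -= len(body)
    return "\n\n".join(out)
```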

RAG enters when two signals appear: the LLM misses semantic matches (adjacent concepts not surfacing because queries target specific names), or token cost from full-document loads becomes a material fraction of evaluation spend. Either signal earns RAG its place — neither earns it preemptively.

Parallel future input: claude-mem workflow telemetry, captured from real Claude Code / OpenCode sessions, will feed workflow-track context once that project ships.

Thank you

Questions?

Lance Alexander Ventura · AI Engineer Intern · lance@kodecraft.dev