ETCLOVG · O: Observability & Operations

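The layer that watches execution. It covers trace, cost, failure, and reliability signals, plus the operational tooling that elevates these signals from post-hoc debugging aids to first-class assets.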

Definition (Scope)

Observability & operations.

Primary project count in the survey: 15. This is one of the most important points where ETCLOVG differs from the existing 6-component framework: it promotes observability to an independent architectural concern.

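Explicit observation from the survey: observability has only a thin presence in open source. It lives mostly in commercial platforms, SDK built-ins, and engineering posts → operational control matures later than runtimes and benchmarks.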

Concern	Description
Trace capture	input/output/tool call/error of every step, recorded as a structured trace
Cost tracking	tokens, dollars, time: cost per unit of work, regression tracking
Failure signals	runtime exceptions, schema violations, eval failures, drift detection
Reliability	SLO/SLI, retry patterns, reproducibility
Trace-native diagnostics	outcome scores, trajectory quality, failure attribution, and regression tests derived automatically from traces
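
The first few rows of the table can be made concrete with a small data structure. The following is a minimal sketch, not any particular platform's schema: the `Step` and `Trace` classes, the field names, and the per-token prices are all hypothetical and chosen only for illustration. It shows one structured record per step, cost rolled up per task, the first failing step located from the trace, and a simple success-rate SLI over a batch of traces.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical prices in micro-USD per token (integers avoid float rounding).
INPUT_PRICE_MICRO_USD = 3
OUTPUT_PRICE_MICRO_USD = 15


@dataclass
class Step:
    """One structured trace record: input, output, tool call, error, and usage."""
    name: str
    input: str
    output: str
    tool_call: Optional[str] = None
    error: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0
    duration_ms: int = 0


@dataclass
class Trace:
    """All steps recorded for a single unit of work."""
    task_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def cost_micro_usd(self) -> int:
        # Cost per unit of work: sum token usage times price over all steps.
        return sum(
            s.input_tokens * INPUT_PRICE_MICRO_USD
            + s.output_tokens * OUTPUT_PRICE_MICRO_USD
            for s in self.steps
        )

    def total_duration_ms(self) -> int:
        return sum(s.duration_ms for s in self.steps)

    def first_failure(self) -> Optional[Step]:
        # Failure signal: the earliest step that recorded an error, if any.
        return next((s for s in self.steps if s.error is not None), None)


def success_rate(traces: List[Trace]) -> float:
    """A simple SLI: the fraction of traces that finished without any error."""
    if not traces:
        return 0.0
    ok = sum(1 for t in traces if t.first_failure() is None)
    return ok / len(traces)


trace = Trace(task_id="demo-1")
trace.record(Step("plan", "fix bug", "run tests first",
                  input_tokens=100, output_tokens=20, duration_ms=400))
trace.record(Step("run_tests", "pytest", "", tool_call="shell",
                  error="exit code 1", input_tokens=50, duration_ms=1200))

print(trace.cost_micro_usd(), trace.total_duration_ms(), trace.first_failure().name)
```

Keeping cost and error on the same record as the step's input and output is the design choice that matters here: a cost regression or a failure can then be attributed to a specific step instead of only to the task as a whole.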

LangChain: tracing is not mere debugging but the infrastructure that enables evaluation (2026-04-30-improving-deep-agents-with-harness-engineering)
Anthropic Managed Agents: decoupling the session log, harness loop, and sandbox = a meta-harness that separates observability out as system infrastructure (7-이-서베이가-우리-위키에서-갖는-위치)

2026-05-25-threads-geun-daeng-harness-benchmark: open-source harness benchmark claims, compared on an observability/eval basis
2026-04-19-anatomy-of-agent-harness: anatomy of a harness, and where observability sits in it
2026-03-29-agent-eval-checklist: where eval and observability meet

Trace-native failure diagnosis, treating the trace as the primary object:

Starting point of the problem: the gap between widespread observability adoption and far rarer offline evaluation.

→ Connects directly to V (Verification): observability signals should become the feedstock for offline eval.
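
One way to picture traces becoming eval feedstock is the sketch below. It rests on assumptions throughout: the dict-based trace layout, the `traces_to_eval_cases` and `run_offline_eval` helpers, and the stand-in `fixed_agent` are hypothetical, not an API from the survey or from any tool cited above. Each failed production trace becomes a regression case carrying a coarse failure attribution (the first failing step), and a candidate agent is then replayed against those cases offline.

```python
from typing import Any, Callable, Dict, List

TraceDict = Dict[str, Any]


def traces_to_eval_cases(traces: List[TraceDict]) -> List[Dict[str, Any]]:
    """Turn every failed trace into a regression case with failure attribution."""
    cases = []
    for t in traces:
        failed = [s for s in t["steps"] if s.get("error")]
        if not failed:
            continue  # successful traces are skipped in this sketch
        cases.append({
            "case_id": t["task_id"],
            "input": t["input"],
            "failed_step": failed[0]["name"],  # attribution: first failing step
            "error": failed[0]["error"],
        })
    return cases


def run_offline_eval(
    cases: List[Dict[str, Any]],
    agent: Callable[[str], TraceDict],
) -> Dict[str, bool]:
    """Replay each case against a candidate agent; True means no step errored."""
    results = {}
    for case in cases:
        new_trace = agent(case["input"])
        results[case["case_id"]] = not any(s.get("error") for s in new_trace["steps"])
    return results


# Hypothetical traces, shaped as if loaded from an observability store.
traces = [
    {"task_id": "t1", "input": "summarize report", "steps": [
        {"name": "fetch", "error": None},
        {"name": "summarize", "error": "schema violation"},
    ]},
    {"task_id": "t2", "input": "translate memo", "steps": [
        {"name": "translate", "error": None},
    ]},
]


def fixed_agent(task_input: str) -> TraceDict:
    # Stand-in for a candidate agent build that no longer fails on this input.
    return {"steps": [{"name": "fetch", "error": None},
                      {"name": "summarize", "error": None}]}


cases = traces_to_eval_cases(traces)
results = run_offline_eval(cases, fixed_agent)
print(cases[0]["failed_step"], results)
```

A real pipeline would also need output-quality checks rather than only "no step errored", but even this reduced form shows how the observability side can supply the offline eval side with cases instead of leaving the two disconnected.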