SwiftLM — Native MLX Swift LLM Inference Server

100% 네이티브 Apple Silicon MLX 추론 서버. Python 런타임 없음, Global Interpreter Lock 오버헤드 없음. SSD Expert Streaming으로 100B+ MoE 모델(M3 Max에서 Qwen3.5-397B) 실행 가능.

Key Features

100% Native Apple Silicon: Metal + Swift로 컴파일된 단일 바이너리
SSD Expert Streaming: NVMe SSD에서 GPU로 가중치를 스트리밍하여 100B+ MoE 모델 실행 (M3 Max에서 Qwen3.5-397B 가능)
TurboQuant KV Compression: V2+V3 하이브리드 아키텍처, 정확도 손실 거의 없으면서 ~3.5× KV 캐시 압축
Multimodal Support: Vision-Language (VLM) + Audio-Language (ALM) 네이티브 지원
Speculative Decoding: 소형 draft 모델로 대형 모델 가속
iOS Companion: SwiftBuddy iPhone/iPad 앱 포함
OpenAI Compatible API: Python 런타임 없이 MLX 모델 서빙

Performance Benchmarks

Gemma 4-26B (M5 Pro 64GB)

Configuration	512 ctx	100K ctx	Memory Used
Dense/Vanilla	33.0 tok/s	15.7 tok/s	56.7 GB
SSD Stream	10.8 tok/s	9.0 tok/s	27.6 GB
SSD + TurboQuant	11.4 tok/s	1.6 tok/s	22.3 GB

DeepSeek-V4-Flash (126GB Model on 64GB RAM)

SSD + TurboQuant: 40K 컨텍스트에서 4.16 tok/s (RAM 16.8GB만 사용)
TurboQuant: 롱컨텍스트에서 일반 SSD 스트리밍 대비 13× 빠른 이유 — 레이어 스트리밍 횟수 감소

Technical Deep Dives

SSD Expert Streaming (MoE 10× Speedup)

Concurrent NVMe pread (QD=24): NVMe 컨트롤러 서칭, 24개 프로젝션을 병렬로 읽음
AsyncEval Pipeline: GPU 연산과 SSD I/O 오버랩, 다음 토큰용 expert 사전 로딩 (히트율 ~70%)
Cross-Projection Batching: ~1,400회 expert 호출 → 토큰당 ~48회로 감소

TurboQuantization: KV Cache Compression

V3 품질을 V2 속도로 제공하는 하이브리드 엔진:

K-Cache (4.25 bits/dim): 3-bit PolarQuant + 1-bit QJL 잔여 보정
V-Cache (3.125 bits/dim): 3-bit PolarQuant (25% 추가 절약, QJL 비활성화)
Activation: 2048 토큰 이후 자동 활성화 (100K+ 롱컨텍스트 처리)

Speculative Decoding & Auto-Capping

--stream-experts + --draft-model 사용 시 --num-draft-tokens이 1로 자동 캡됨:

Higher draft counts create an I/O fan-out that regresses SSD throughput. Draft acceptance rate ≥ 50%이면 net positive.

Supported Models

Category	Families
Text (LLM)	Gemma (2, 3, 4), Qwen (2, 2.5, 3, 3.5), Llama (3.1-3.3), DeepSeek V3, Phi (3, 4), Mistral/Mixtral, GLM 4, OLMo, Command-R, Jamba, Falcon H1
Vision (VLM)	Qwen2-VL/Qwen3-VL, Pixtral, PaliGemma, Idefics 3, SmolVLM 2
Audio (ALM)	Gemma 4 Omni

Quick Start

# Download binary from Releases, then run:
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Key CLI Flags

Option	Description
`--stream-experts`	MoE 모델용 SSD 스트리밍 활성화 (10× 속도 향상)
`--turbo-kv`	서버 전체 3-bit KV 압축 활성화
`--vision` / `--audio`	멀티모달 처리 활성화
`--gpu-layers`	GPU에 할당할 레이어 수 수동 설정
`--draft-model`	Speculative decoding용 소형 모델 경로

Per-Request API Parameters

kv_bits: (4 or 8) MLX 네이티브 그룹 양자화 활성화
repetition_penalty: (e.g., 1.15) 루프 방지

Requirements & Build

OS: macOS 14.0+ / iOS 17.0+
Hardware: Apple Silicon (M1–M5)
Build: ./build.sh → mlx.metallib + Swift 바이너리 컴파일
Custom MLX: SwiftLM은 상류 MLX에 없는 out-of-core memory-mapped 실행을 지원하는 SharpAI/mlx 포크 사용

맥 미니 사용자를 위한 의의

사용자가 공유한 맥 미니(M4) 환경에서의 환호성:

Qwen 3 Coder Next 80B: 8GB RAM에서 50 tok/s (DFlash + Flash+MoE)
Qwen 3.6 35B (DFlash+Flash MoE): RAM 소비량의 1/3으로 AR 디코딩 능가

→ SwiftLM의 SSD Expert Streaming과 TurboQuant가 맥 미니 같은 메모리 제약 환경에서 대규모 모델 실행을 가능하게 하는 핵심 기술

2026-04-19-dflash-mlx-apple-silicon-inference — Block-Diffusion Speculative Decoding for Apple Silicon (3-4.6× 속도 향상)
2026-04-13-flash-moe-metal-inference — C/Metal 기반 Flash-MoE (Qwen3.5-397B M3 Max 실행 가능)
2026-04-22-qwen3-6-27b-open-source-agentic-coding — Qwen3.6 에이전트 코딩 관련

Sources

SharpAI/SwiftLM — GitHub

LLM Wiki

탐색기

SwiftLM — Native MLX Swift LLM Inference Server

SwiftLM — Native MLX Swift LLM Inference Server

Key Features

Performance Benchmarks

Gemma 4-26B (M5 Pro 64GB)

DeepSeek-V4-Flash (126GB Model on 64GB RAM)

Technical Deep Dives

SSD Expert Streaming (MoE 10× Speedup)

TurboQuantization: KV Cache Compression

Speculative Decoding & Auto-Capping

Supported Models

Quick Start

Key CLI Flags

Per-Request API Parameters

Requirements & Build

맥 미니 사용자를 위한 의의

Sources

그래프 뷰

목차

백링크

LLM Wiki

탐색기

SwiftLM — Native MLX Swift LLM Inference Server

SwiftLM — Native MLX Swift LLM Inference Server

Key Features

Performance Benchmarks

Gemma 4-26B (M5 Pro 64GB)

DeepSeek-V4-Flash (126GB Model on 64GB RAM)

Technical Deep Dives

SSD Expert Streaming (MoE 10× Speedup)

TurboQuantization: KV Cache Compression

Speculative Decoding & Auto-Capping

Supported Models

Quick Start

Key CLI Flags

Per-Request API Parameters

Requirements & Build

맥 미니 사용자를 위한 의의

Related Notes

Sources

그래프 뷰

목차

백링크