Insanely Fast Whisper - an ultra-fast Whisper ASR CLI

An ultra-fast Whisper command-line interface for NVIDIA GPUs and Macs: transcribes 150 minutes of audio in under 98 seconds

Key features

⚡️ Blazing speed: time to transcribe 150 minutes of audio on an A100 80GB

OpenAI Whisper Large v3 (Flash Attention 2): 1 min 38 s
Distil Whisper Large v2 (Flash Attention 2): 1 min 18 s
Roughly 19x faster than plain Transformers (fp32) for Whisper Large v3 (31 min 1 s vs 1 min 38 s)

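🚀 Multiple optimization options

Flash Attention 2 support
BetterTransformer support
Batched inference (default batch size: 24)
fp16 half precision (the 8-bit result in the benchmark is from Faster-Whisper)
CUDA and Apple Silicon (mps) support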

Installation and usage

Installation

# pipx is recommended
pipx install insanely-fast-whisper==0.0.15 --force

# If you hit the Python 3.11.x compatibility issue
pipx install insanely-fast-whisper --force --pip-args="--ignore-requires-python"

Basic usage

insanely-fast-whisper --file-name <audio-file-or-URL>

Advanced options

# Use Flash Attention 2
insanely-fast-whisper --file-name <file> --flash True

# Use Distil Whisper
insanely-fast-whisper --model-name distil-whisper/large-v2 --file-name <file>

# Mac (Apple Silicon)
insanely-fast-whisper --file-name <file> --device-id mps

# Speaker diarization
insanely-fast-whisper --file-name <file> --hf-token <HF_TOKEN> --diarization_model pyannote/speaker-diarization --num-speakers 2

Benchmark results (NVIDIA A100 80GB)

Model	Optimization	Time for 150 minutes of audio
Whisper Large v3	Transformers (fp32)	31 min 1 s
Whisper Large v3	fp16 + batching (24) + BetterTransformer	5 min 2 s
Whisper Large v3	fp16 + batching (24) + Flash Attention 2	1 min 38 s
Distil Large v2	fp16 + batching (24) + Flash Attention 2	1 min 18 s
Whisper Large v2	Faster-Whisper (fp16)	9 min 23 s
Whisper Large v2	Faster-Whisper (8-bit)	8 min 15 s
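
The speedups implied by this table can be checked with a little arithmetic. The sketch below uses only the times listed above; the dictionary labels and helper names are illustrative, not names from the project.

```python
# Benchmark times from the table above, converted to seconds.
AUDIO_SECONDS = 150 * 60  # 150 minutes of audio


def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds


timings = {
    "large-v3 / Transformers fp32": to_seconds(31, 1),
    "large-v3 / fp16 + batching + BetterTransformer": to_seconds(5, 2),
    "large-v3 / fp16 + batching + Flash Attention 2": to_seconds(1, 38),
    "distil-large-v2 / fp16 + batching + Flash Attention 2": to_seconds(1, 18),
    "large-v2 / Faster-Whisper fp16": to_seconds(9, 23),
    "large-v2 / Faster-Whisper 8-bit": to_seconds(8, 15),
}

BASELINE = timings["large-v3 / Transformers fp32"]


def speedup(label: str) -> float:
    """How many times faster than the fp32 Transformers baseline."""
    return BASELINE / timings[label]


def realtime_factor(label: str) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    return AUDIO_SECONDS / timings[label]


for label in timings:
    print(f"{label}: {speedup(label):.1f}x vs fp32, "
          f"{realtime_factor(label):.0f}x real time")
```

On these numbers, the Flash Attention 2 run of Whisper Large v3 comes out at roughly 19x the fp32 baseline, and the Distil Whisper run at roughly 24x.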

Main characteristics

🎯 Support for multiple models

OpenAI Whisper (large-v3 recommended)
Distil Whisper (smaller and faster)
Custom Hugging Face models

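⚙️ Optimization techniques

Flash Attention 2: about 3x faster than the BetterTransformer run in the benchmark (5 min 2 s vs 1 min 38 s)
BetterTransformer: the optimized inference path from Hugging Face Optimum
Batched inference: many 30-second audio chunks are decoded in parallel
Reduced precision: fp16 (the 8-bit benchmark result is from Faster-Whisper)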

🌐 Hardware support

NVIDIA GPU: CUDA
Apple Silicon: Mac M1/M2/M3 (mps)
CPU: limited (not recommended)
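
A minimal sketch of how a script might pick a device from this list. The function names and the CPU batch size of 1 are assumptions made for illustration; the CUDA and mps batch sizes (24 and 4) are the values this note uses elsewhere.

```python
def choose_device(has_cuda: bool, has_mps: bool) -> tuple[str, int]:
    """Return (device string, starting batch size) for the given hardware."""
    if has_cuda:
        return "cuda:0", 24  # default batch size used in this note
    if has_mps:
        return "mps", 4      # smaller batch recommended for Macs
    return "cpu", 1          # assumption: CPU is a last resort, keep batches tiny


def detect_hardware() -> tuple[bool, bool]:
    """Probe PyTorch if it is installed; report no accelerator otherwise."""
    try:
        import torch
    except ImportError:
        return False, False
    # Older PyTorch builds may not ship the mps backend module at all.
    mps = getattr(torch.backends, "mps", None)
    return torch.cuda.is_available(), bool(mps and mps.is_available())


device, batch_size = choose_device(*detect_hardware())
print(device, batch_size)
```

Keeping the decision in a pure function makes it easy to test without a GPU; only `detect_hardware` touches PyTorch.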

📊 Advanced features

Speaker diarization: based on PyAnnote
Timestamps: chunk or word level
Automatic language detection: built into Whisper
Task: transcribe or translate
API mode: the Python pipeline can be used directly
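
To illustrate how diarization and timestamps fit together, the sketch below labels each transcript chunk with the speaker whose segment overlaps it the most. It is an illustrative approach, not necessarily how the CLI merges the two. The segment fields (`start`, `end`, `speaker`) are assumed for this example, while the chunk shape (`text`, `timestamp`) follows the Transformers ASR pipeline output.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks, segments):
    """Attach a 'speaker' key to each chunk based on maximum time overlap."""
    timeline_end = max((seg["end"] for seg in segments), default=0.0)
    labelled = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is None:
            # The final chunk may have no end time; extend it to the
            # end of the diarization timeline.
            end = max(timeline_end, start)
        best_speaker, best_overlap = None, 0.0
        for seg in segments:
            shared = overlap_seconds(start, end, seg["start"], seg["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = seg["speaker"], shared
        labelled.append({**chunk, "speaker": best_speaker})
    return labelled


# Made-up sample data for illustration only.
chunks = [
    {"text": " Hello.", "timestamp": (0.0, 2.0)},
    {"text": " Hi there.", "timestamp": (2.0, 5.0)},
    {"text": " Bye.", "timestamp": (5.0, None)},
]
segments = [
    {"start": 0.0, "end": 2.4, "speaker": "SPEAKER_00"},
    {"start": 2.4, "end": 6.0, "speaker": "SPEAKER_01"},
]

for item in assign_speakers(chunks, segments):
    print(item["speaker"], item["text"].strip())
```

Chunks that overlap no segment keep `speaker` set to `None`, which makes gaps in the diarization easy to spot.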

Using the Python API

import torch
from transformers import pipeline
from transformers.utils import is_flash_attn_2_available
 
# Use Flash Attention 2 when it is installed; otherwise fall back to PyTorch SDPA.
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",  # or "mps" on a Mac
    model_kwargs={"attn_implementation": attn_implementation},
)
 
outputs = pipe(
    "<FILE_NAME>",
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
)
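
With `return_timestamps=True`, the pipeline result carries a `chunks` list whose items have `text` and a `(start, end)` `timestamp`. As a hedged sketch of what can be done with it, the helper below turns such chunks into SRT subtitle text. The function names are made up for this example, and the sample data stands in for `outputs["chunks"]`.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks) -> str:
    """Build SRT text from pipeline-style chunks."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # The last chunk can come back without an end time;
            # fall back to a zero-length cue rather than guessing.
            end = start
        blocks.append(
            f"{index}\n{srt_time(start)} --> {srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Stand-in for outputs["chunks"] from the pipeline call above.
sample_chunks = [
    {"text": " Hello.", "timestamp": (0.0, 2.5)},
    {"text": " Bye.", "timestamp": (2.5, None)},
]
print(chunks_to_srt(sample_chunks))
```

In a real run, `chunks_to_srt(outputs["chunks"])` would take the place of the sample data.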

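System requirements

Required

GPU: NVIDIA (CUDA) recommended, or an Apple Silicon Mac
Python: 3.10 to 3.11 (note: pipx has a version-parsing issue on 3.11.x)
Memory: tested on an A100 80GB; on other GPUs, lower --batch-size to fit the available memory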
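
One way to find a batch size that fits a smaller GPU is to start at the default and halve it after each out-of-memory error. The sketch below shows that loop with a simulated transcription function. The names are hypothetical, and `MemoryError` stands in for the real exception; with PyTorch on CUDA one would likely pass `torch.cuda.OutOfMemoryError` instead (an assumption about the caller's setup).

```python
def run_with_backoff(transcribe, batch_size=24, min_batch_size=1,
                     oom_errors=(MemoryError,)):
    """Call transcribe(batch_size), halving batch_size after each OOM error.

    Returns (result, batch size that worked). Re-raises the error if even
    min_batch_size does not fit.
    """
    while True:
        try:
            return transcribe(batch_size), batch_size
        except oom_errors:
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


def fake_transcribe(batch_size):
    """Simulated workload: pretend anything above 8 runs out of memory."""
    if batch_size > 8:
        raise MemoryError("simulated out-of-memory")
    return f"ok with batch_size={batch_size}"


result, used = run_with_backoff(fake_transcribe)
print(result, used)
```

In practice `transcribe` would wrap the pipeline call and forward `batch_size` to it.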

Installing dependencies

pip install --upgrade transformers optimum accelerate
pip install flash-attn --no-build-isolation  # Flash Attention 2

Special notes for Mac

# Installing with pipx on M1/M2/M3
pipx install insanely-fast-whisper --force --pip-args="--ignore-requires-python"

# Running
insanely-fast-whisper --file-name <file> --device-id mps --batch-size 4  # smaller batch to stay within memory

Main options

--model-name MODEL_NAME     # default: openai/whisper-large-v3
--task {transcribe,translate}
--language LANGUAGE         # auto-detected when None
--batch-size BATCH_SIZE     # default: 24 (lower it if you hit OOM)
--flash FLASH              # use Flash Attention 2 (default: False)
--timestamp {chunk,word}   # timestamp granularity
--hf-token HF_TOKEN        # Hugging Face token for PyAnnote diarization
--diarization_model        # default: pyannote/speaker-diarization
--num-speakers NUM         # exact number of speakers
--max-speakers MAX         # upper bound on the number of speakers
--device-id DEVICE_ID      # GPU index, or "mps" on a Mac
--transcript-path PATH     # output file (default: output.json)
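
When the CLI is driven from a script, it can help to assemble the command from the options above instead of concatenating strings. The sketch below uses only flags listed in this note; the `build_command` helper, its parameter names, and the file name `interview.mp3` are hypothetical.

```python
import shlex


def build_command(file_name, *, model_name=None, device_id=None,
                  batch_size=None, timestamp=None, language=None,
                  task=None, transcript_path=None, flash=False):
    """Assemble an insanely-fast-whisper argv list from optional settings."""
    cmd = ["insanely-fast-whisper", "--file-name", file_name]
    options = {
        "--model-name": model_name,
        "--device-id": device_id,
        "--batch-size": batch_size,
        "--timestamp": timestamp,
        "--language": language,
        "--task": task,
        "--transcript-path": transcript_path,
    }
    for flag, value in options.items():
        if value is not None:  # omit unset options so CLI defaults apply
            cmd += [flag, str(value)]
    if flash:
        cmd += ["--flash", "True"]
    return cmd


cmd = build_command("interview.mp3", device_id="mps", batch_size=4)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.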

Cautions

Windows OOM: reinstalling PyTorch may be required
Mac memory: the mps backend uses memory less efficiently than CUDA, so --batch-size 4 is recommended
Python version: pipx has a version issue on 3.11.x; use --ignore-requires-python
Installing Flash Attention: pipx runpip insanely-fast-whisper install flash-attn --no-build-isolation

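Project information

License: MIT
Built on: Hugging Face Transformers, Optimum, flash-attn
Community-driven: features are added in response to community requests
Original purpose: started as a Transformers benchmark and grew into a CLI tool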

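Related tool comparison

Tool	Optimization	Time (150 min of audio)	Notes
Insanely Fast Whisper	fp16 + batching + Flash Attention 2 (or BetterTransformer)	1-2 min with Flash Attention 2	CLI, latest techniques
Faster-Whisper	CTranslate2	8-9 min	Lightweight, CPU support
Whisper Large v3 via Transformers (fp32)	None	~31 min	Baseline, full precision
Whisper Large v3 via Transformers (fp16)	Batching + BetterTransformer	~5 min	Middle ground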

Use cases

Automatic transcription of podcasts and interviews
Processing large audio archives
Batch transcription of recorded streams (the CLI takes a file or URL as input)
Subtitle generation for multilingual content
Transcription of legal and medical records

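Notes

Benchmarks were run on an NVIDIA A100 80GB; consumer GPUs will perform differently
Flash Attention 2 can be tricky to install
The Mac MPS backend is still being improved
Speaker diarization requires a Hugging Face token (PyAnnote)
Default model: openai/whisper-large-v3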
