How to Convert Any Video or Audio Into a Transcription-Ready File
The most common cause of a disappointing transcript isn't the transcription engine — it's the file you fed it. A two-hour MKV with five audio tracks, a phone memo recorded at 8 kHz, a video where the speech sits under background music: each makes the model work harder than it should. Here's how to turn any recording into the clean, predictable input that speech-to-text engines are built for — with copy-paste commands, and an honest note on when you can skip all of this.
What "transcription-ready" actually means
Almost every offline speech engine wants the same thing: 16 kHz sample rate, mono, 16-bit PCM, in a WAV container. That target isn't arbitrary — each part earns its place:
16 kHz: most of the information in speech sits below about 8 kHz, and a 16 kHz sample rate captures everything up to 8 kHz (a sampled signal can represent frequencies up to half its sample rate). Modern ASR models and toolkits (Whisper, Parakeet, wav2vec2, Kaldi) are typically trained on 16 kHz audio, and tools built on them generally resample anything else to 16 kHz internally, so giving them 16 kHz directly loses nothing and skips a step.
Mono: a transcription model reads one stream of speech; folding a stereo file down to mono is what it would do anyway.
16-bit PCM in a WAV container: uncompressed, so there is no codec to decode and nothing for a tool to misread, and WAV is the one container nearly every tool opens without complaint.
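If you want to check whether a file already meets this target before converting anything, ffprobe (which ships with ffmpeg) can report the codec, sample rate, and channel count of the first audio stream. The sketch below wraps that in a small shell function; the name is_transcription_ready is made up for this example, and it inspects only the audio stream, not the container.

```shell
# is_transcription_ready is a hypothetical helper name, not an ffmpeg command.
# Returns 0 when the first audio stream is 16 kHz, mono, 16-bit PCM; 1 otherwise;
# 2 when ffprobe itself fails (for example, a missing file).
is_transcription_ready() {
  local info
  info=$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name,sample_rate,channels \
    -of default=noprint_wrappers=1 "$1") || return 2
  # ffprobe prints one key=value pair per line; require an exact match on each.
  printf '%s\n' "$info" | grep -qx 'codec_name=pcm_s16le' &&
    printf '%s\n' "$info" | grep -qx 'sample_rate=16000' &&
    printf '%s\n' "$info" | grep -qx 'channels=1'
}

# Usage (not run here):
#   is_transcription_ready input.wav && echo "ready as-is" || echo "convert first"
```

A file that fails the check is not a problem; it just means the conversion step is worth running.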
You often don't need to do any of this by hand
A good offline transcriber (ours included) accepts common video and audio formats directly and resamples internally. Reach for the commands below when you have an unusual container, a large batch, or audio that needs trimming or cleanup before it reaches the model.
The one command you'll use most
Extract the audio from a video — or convert any audio file — into a 16 kHz mono WAV with ffmpeg:
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 -c:a pcm_s16le output.wav
Reading the flags: -i input.mp4 is your source (any format), -vn drops the video, -ac 1 makes it mono, -ar 16000 sets 16 kHz, and -c:a pcm_s16le writes 16-bit PCM. The same line works whether the input is .mp4, .mkv, .mov, .mp3, .m4a, or .ogg, because ffmpeg detects the input format for you. (ffmpeg is free and open-source; on Windows, winget install ffmpeg installs it.)
For the cases that one command doesn't cover
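Just lift the audio out, no re-encoding (fastest). Copies the original audio stream untouched. Use it when your tool already accepts compressed audio and you only need to strip the video. The output container has to suit the codec being copied: .m4a works for AAC, while .mka (Matroska audio) accepts almost any codec, so it is the safer choice when you don't know what the source contains:
ffmpeg -i input.mkv -vn -c:a copy audio.mka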
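Pick the right audio track. When the source carries several audio tracks, like the five-track MKV mentioned at the top, ffmpeg converts only one of them by default, and it may not be the one with the speech you want. A minimal sketch, assuming you first list the audio streams with ffprobe and then select one with -map 0:a:N; list_audio_tracks and extract_track are hypothetical helper names.

```shell
# list_audio_tracks and extract_track are hypothetical helper names.

# Print one line per audio stream, in order: codec, channel count, language tag (if set).
# The first line is track 0, the second is track 1, and so on.
list_audio_tracks() {
  ffprobe -v error -select_streams a \
    -show_entries stream=codec_name,channels:stream_tags=language \
    -of csv=p=0 "$1"
}

# Convert audio track N (0-based, counting audio streams only) to a 16 kHz mono WAV.
# Arguments: input file, track number, output file.
extract_track() {
  ffmpeg -i "$1" -map "0:a:$2" -vn -ac 1 -ar 16000 -c:a pcm_s16le "$3"
}

# Usage (not run here):
#   list_audio_tracks input.mkv
#   extract_track input.mkv 2 commentary.wav
```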
Trim to the part that matters before transcribing a long file:
ffmpeg -i input.mp3 -ss 00:01:30 -to 00:05:00 -ac 1 -ar 16000 -c:a pcm_s16le clip.wav
Rescue quiet or uneven audio. The loudnorm filter evens out levels so a soft-spoken participant isn't lost:
ffmpeg -i input.wav -af loudnorm -ac 1 -ar 16000 -c:a pcm_s16le normalized.wav
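Two speakers recorded on separate left and right channels? Split the channels and transcribe each one on its own for clean speaker separation. (The older -map_channel option is deprecated and was removed in ffmpeg 7, so use the channelsplit filter instead.)
ffmpeg -i interview.wav -filter_complex "channelsplit=channel_layout=stereo[left][right]" -map "[left]" speakerA.wav -map "[right]" speakerB.wav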
Convert a whole folder at once.
Windows PowerShell:
Get-ChildItem *.mp4 | ForEach-Object { ffmpeg -i $_.Name -vn -ac 1 -ar 16000 -c:a pcm_s16le "$($_.BaseName).wav" }
macOS / Linux:
for f in *.mp4; do ffmpeg -i "$f" -vn -ac 1 -ar 16000 -c:a pcm_s16le "${f%.*}.wav"; done
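For a larger batch, a slightly more defensive version of the macOS / Linux loop can cover several extensions, write into a separate folder, and skip files that were already converted so an interrupted run can be resumed. This is a sketch under those assumptions, not a definitive script; batch_to_wav is a hypothetical helper name and the extension list is only an example.

```shell
# batch_to_wav is a hypothetical helper: batch_to_wav SRC_DIR OUT_DIR
batch_to_wav() {
  local src=$1 out=$2 f name dst
  mkdir -p "$out" || return 1
  for f in "$src"/*.mp4 "$src"/*.mkv "$src"/*.mov "$src"/*.mp3 "$src"/*.m4a; do
    [ -e "$f" ] || continue   # the pattern matched nothing; skip the literal glob
    name=$(basename "$f")
    dst="$out/${name%.*}.wav"
    if [ -e "$dst" ]; then
      echo "skip: $dst already exists"
      continue
    fi
    # -nostdin stops ffmpeg from reading the terminal; a failed file is reported
    # on stderr and the loop moves on to the next one.
    ffmpeg -nostdin -loglevel error -i "$f" -vn -ac 1 -ar 16000 -c:a pcm_s16le "$dst" \
      || echo "failed: $f" >&2
  done
}

# Usage (not run here):
#   batch_to_wav ./recordings ./wav16k
```

One design consequence worth knowing: two sources that share a base name (talk.mp4 and talk.mp3) map to the same output, so the second is skipped rather than overwriting the first.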
What you can feed in
ffmpeg reads essentially every container and codec you'll meet. A quick map of what's safe as a source:
| Source type | Common examples | Notes |
|---|---|---|
| Video (extract audio) | MP4, MOV, MKV, AVI, WebM | -vn drops the picture and keeps the sound |
| Lossy audio | MP3, AAC / M4A, Ogg / Opus, WMA | Fine as input — decoding to PCM loses nothing further |
| Lossless audio | WAV, FLAC, AIFF | Already ideal; just match sample rate and channels |
| Telephony / VoIP | 8 kHz WAV, AMR, Opus | Upsample to 16 kHz so the rate matches what the model expects; this adds no detail the recording never had |
One myth worth killing: converting an MP3 to WAV does not improve accuracy. The MP3 discarded data when it was encoded, and decoding it to WAV cannot bring that data back; it only hands the engine audio that needs no further decoding and, at 16 kHz mono, no resampling. You convert for compatibility and speed, not fidelity.
Or skip the terminal entirely
For most files you don't need any of this. The 360Converter Offline Transcriber takes common video and audio formats as they are, resamples to what the model needs internally, and keeps everything on your machine. The ffmpeg route earns its place for the edge cases — an exotic container, a 200-file batch, or audio you want to trim and clean before it ever reaches the model.
Try the 360Converter Offline Transcriber
Drop in MP4, MOV, MP3, M4A and more — transcription, speaker diarization, and GPU acceleration on Windows and macOS, with your audio never leaving your machine.
Learn more & download
Frequently asked questions
Does converting to WAV make the transcript more accurate?
Not by itself. Accuracy comes from the model and the quality already baked into the recording. A 16 kHz mono WAV mainly makes the job faster and removes a resampling step — it can't recover detail a lossy file has already thrown away.
Why 16 kHz and not 44.1 kHz?
Speech information sits below 8 kHz, and ASR models are trained at 16 kHz. Higher sample rates are downsampled before recognition anyway, so they only cost you time and disk space.
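The disk-space part is simple arithmetic: uncompressed 16-bit PCM takes sample rate × channels × 2 bytes for every second of audio. A rough sketch, ignoring the small WAV header; wav_bytes is a hypothetical helper name.

```shell
# wav_bytes is a hypothetical helper: wav_bytes SAMPLE_RATE CHANNELS SECONDS
# Size of the 16-bit PCM audio data (2 bytes per sample), without the WAV header.
wav_bytes() {
  echo $(( $1 * $2 * 2 * $3 ))
}

wav_bytes 16000 1 3600   # one hour, 16 kHz mono:     115200000 bytes (about 115 MB)
wav_bytes 44100 2 3600   # one hour, 44.1 kHz stereo: 635040000 bytes (about 635 MB)
```

So an hour of CD-style stereo is roughly five and a half times the size of the same hour at 16 kHz mono, with nothing extra for the recognizer to use.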
Mono or stereo?
Mono for ordinary transcription. Keep the channels separate only when each speaker is recorded on their own channel and you want a clean per-speaker transcript.
Do I even need ffmpeg?
Often not — a transcriber that accepts your format directly is simpler. Keep ffmpeg around for trimming, batch jobs, loudness fixes, and the occasional unusual file.
Commands target ffmpeg 5.x and newer and are stable across recent versions. ffmpeg is free and open-source (ffmpeg.org). 16 kHz mono PCM is the standard input for Whisper-, Parakeet-, and Kaldi-based speech engines.
