
How to Convert Any Video or Audio Into a Transcription-Ready File

360Converter Team

The most common cause of a disappointing transcript isn't the transcription engine — it's the file you fed it. A two-hour MKV with five audio tracks, a phone memo recorded at 8 kHz, a video where the speech sits under background music: each makes the model work harder than it should. Here's how to turn any recording into the clean, predictable input that speech-to-text engines are built for — with copy-paste commands, and an honest note on when you can skip all of this.

What "transcription-ready" actually means

Almost every offline speech engine wants the same thing: 16 kHz sample rate, mono, 16-bit PCM, in a WAV container. That target isn't arbitrary — each part earns its place:

16 kHz: the frequencies that carry speech sit below about 8 kHz, which is exactly the band a 16 kHz sample rate can represent, and modern ASR models (Whisper, Parakeet, wav2vec2, and many Kaldi models) are trained on 16 kHz audio. Pipelines built on them resample anything else to 16 kHz before recognition, so supplying 16 kHz directly loses nothing and skips a step.
Mono: a transcription model reads a single stream of speech, so a stereo file gets mixed down to mono anyway.
16-bit PCM in a WAV container: uncompressed, so there is no codec to decode and no codec support to worry about, and WAV is the container that virtually every tool opens without complaint.
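To check whether a file already matches this target before converting it, ffprobe (installed alongside ffmpeg) can report the properties that matter. The sketch below is one way to do it. The function names probe_audio and is_asr_ready are illustrative, not part of ffmpeg, and the check looks only at the first audio stream, not at the container.

```shell
# Print codec, sample rate, and channel count of the first audio stream,
# one "key=value" pair per line (requires ffprobe).
probe_audio() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name,sample_rate,channels \
    -of default=noprint_wrappers=1 "$1"
}

# Succeed only if the probe output describes 16 kHz mono 16-bit PCM.
# $1 is the text printed by probe_audio.
is_asr_ready() {
  printf '%s\n' "$1" | grep -qx 'codec_name=pcm_s16le' &&
  printf '%s\n' "$1" | grep -qx 'sample_rate=16000' &&
  printf '%s\n' "$1" | grep -qx 'channels=1'
}
```

A typical call would be: is_asr_ready "$(probe_audio input.wav)" && echo "already transcription-ready"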

You often don't need to do any of this by hand

A good offline transcriber (ours included) accepts common video and audio formats directly and resamples internally. Reach for the commands below when you have an unusual container, a large batch, or audio that needs trimming or cleanup before it reaches the model.

The one command you'll use most

Extract the audio from a video — or convert any audio file — into a 16 kHz mono WAV with ffmpeg:

ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 -c:a pcm_s16le output.wav

Reading the flags: -i input.mp4 is your source (any format), -vn drops the video, -ac 1 makes it mono, -ar 16000 sets 16 kHz, and -c:a pcm_s16le writes 16-bit PCM. The same line works whether the input is .mp4, .mkv, .mov, .mp3, .m4a, or .ogg — ffmpeg detects the input for you. (ffmpeg is free and open-source; on Windows, winget install ffmpeg installs it.)

For the cases that one command doesn't cover

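Just lift the audio out, no re-encoding (fastest). This copies the original audio stream untouched. Use it when your tool already accepts compressed audio and you only need to strip the video. The output container has to support the source codec: .m4a works when the audio is AAC, while .mka (Matroska audio) accepts almost anything an MKV can carry:

ffmpeg -i input.mkv -vn -c:a copy audio.mka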
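When a source carries several audio tracks, as many MKV files do, ffmpeg picks one by default (the stream with the most channels), which may not be the one with the speech you want. Here is a minimal sketch for listing the tracks and converting a specific one. list_audio_tracks and extract_track are hypothetical helper names, and the track number counts audio streams from zero.

```shell
# Print one line per audio stream: stream index, codec, language tag (if any).
list_audio_tracks() {
  ffprobe -v error -select_streams a \
    -show_entries stream=index,codec_name:stream_tags=language \
    -of csv=p=0 "$1"
}

# Convert audio track N (0 = first audio stream) to a 16 kHz mono WAV.
# Usage: extract_track input.mkv 1 track1.wav
extract_track() {
  ffmpeg -nostdin -i "$1" -map "0:a:$2" -ac 1 -ar 16000 -c:a pcm_s16le "$3"
}
```

Because -map 0:a:N selects only that audio stream, the video is left out without needing -vn.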

Trim to the part that matters before transcribing a long file:

ffmpeg -i input.mp3 -ss 00:01:30 -to 00:05:00 -ac 1 -ar 16000 -c:a pcm_s16le clip.wav

Rescue quiet or uneven audio. loudnorm evens out levels so a soft-spoken participant isn't lost:

ffmpeg -i input.wav -af loudnorm -ac 1 -ar 16000 -c:a pcm_s16le normalized.wav
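Before normalizing, it can help to see how loud the file actually is. As a sketch, loudnorm can print its measurements without writing any audio. measure_loudness is a hypothetical wrapper name, and the summary goes to ffmpeg's log on stderr.

```shell
# Analyze only: decode the input, run loudnorm, and discard the audio (-f null -).
# The summary lists input integrated loudness, true peak, and loudness range.
measure_loudness() {
  ffmpeg -hide_banner -nostdin -i "$1" -af loudnorm=print_format=summary -f null -
}
```

If the reported input integrated loudness is already close to loudnorm's default target of -24 LUFS, the normalization step is unlikely to change much.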

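Two speakers recorded on separate left and right channels? Split the channels and transcribe each one for a clean per-speaker transcript (any microphone bleed between the channels will still show up in both):

ffmpeg -i interview.wav -filter_complex "channelsplit=channel_layout=stereo[L][R]" -map "[L]" -ar 16000 -c:a pcm_s16le speakerA.wav -map "[R]" -ar 16000 -c:a pcm_s16le speakerB.wav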

Convert a whole folder at once.

Windows PowerShell:
Get-ChildItem *.mp4 | ForEach-Object { ffmpeg -i $_.Name -vn -ac 1 -ar 16000 -c:a pcm_s16le "$($_.BaseName).wav" }

macOS / Linux:
for f in *.mp4; do ffmpeg -i "$f" -vn -ac 1 -ar 16000 -c:a pcm_s16le "${f%.*}.wav"; done
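For folders with mixed formats, or batches you may need to restart, a slightly more defensive loop can be useful. This is a sketch for POSIX shells, not a definitive script. convert_dir is a hypothetical function name, and the extension list is an assumption you would adjust to your own files.

```shell
# Convert every matching file in a directory to a 16 kHz mono WAV next to it.
# Usage: convert_dir /path/to/recordings
convert_dir() {
  dir=$1
  for f in "$dir"/*.mp4 "$dir"/*.mkv "$dir"/*.mov "$dir"/*.mp3 "$dir"/*.m4a; do
    [ -e "$f" ] || continue      # an unmatched pattern stays literal; skip it
    out="${f%.*}.wav"
    [ -e "$out" ] && continue    # output already exists; do not redo it
    # -nostdin keeps ffmpeg from reading the terminal inside the loop
    ffmpeg -nostdin -loglevel error -i "$f" -vn -ac 1 -ar 16000 -c:a pcm_s16le "$out"
  done
}
```

Skipping existing outputs makes the batch safe to rerun after an interruption. It also means two sources that share a base name (talk.mp4 and talk.mp3) produce only one talk.wav, from whichever file is processed first.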

What you can feed in

ffmpeg reads essentially every container and codec you'll meet. A quick map of what's safe as a source:

Source type | Common examples | Notes
Video (extract audio) | MP4, MOV, MKV, AVI, WebM | -vn drops the picture and keeps the sound
Lossy audio | MP3, AAC / M4A, Ogg / Opus, WMA | Fine as input; decoding to PCM loses nothing further
Lossless audio | WAV, FLAC, AIFF | Already ideal; just match sample rate and channels
Telephony / VoIP | 8 kHz WAV, AMR, Opus | -ar 16000 resamples these to the model's rate, but it cannot restore frequencies the recording never captured, so expect lower accuracy than a native 16 kHz recording

One myth worth killing: converting an MP3 to WAV does not improve accuracy. The MP3 already discarded data when it was encoded; decoding it to WAV just hands the engine the same audio as PCM, which it can read without a decode or resample step. You convert for compatibility and speed, not fidelity.

Or skip the terminal entirely

For most files you don't need any of this. The 360Converter Offline Transcriber takes common video and audio formats as they are, resamples to what the model needs internally, and keeps everything on your machine. The ffmpeg route earns its place for the edge cases — an exotic container, a 200-file batch, or audio you want to trim and clean before it ever reaches the model.

Try the 360Converter Offline Transcriber

Drop in MP4, MOV, MP3, M4A and more — transcription, speaker diarization, and GPU acceleration on Windows and macOS, with your audio never leaving your machine.


Frequently asked questions

Does converting to WAV make the transcript more accurate?

Not by itself. Accuracy comes from the model and the quality already baked into the recording. A 16 kHz mono WAV mainly makes the job faster and removes a resampling step — it can't recover detail a lossy file has already thrown away.

Why 16 kHz and not 44.1 kHz?

The frequencies that matter for speech sit below about 8 kHz, which a 16 kHz sample rate fully covers, and ASR models are trained at 16 kHz. Higher sample rates are downsampled before recognition anyway, so they only cost you time and disk space.
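The disk-space cost is easy to put a number on, since uncompressed PCM size is just sample rate × channels × bytes per sample × seconds. A quick worked example in shell arithmetic, where pcm_bytes_per_hour is an illustrative helper name:

```shell
# Bytes per hour of uncompressed PCM.
# Arguments: sample rate (Hz), channel count, bytes per sample.
pcm_bytes_per_hour() {
  echo $(( $1 * $2 * $3 * 3600 ))
}

pcm_bytes_per_hour 16000 1 2   # 16 kHz mono 16-bit     -> 115200000
pcm_bytes_per_hour 44100 2 2   # 44.1 kHz stereo 16-bit -> 635040000
```

That is roughly 115 MB per hour at the transcription target versus about 635 MB per hour for CD-style audio, with none of the extra data used for recognition.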

Mono or stereo?

Mono for ordinary transcription. Keep the channels separate only when each speaker is recorded on their own channel and you want a clean per-speaker transcript.

Do I even need ffmpeg?

Often not — a transcriber that accepts your format directly is simpler. Keep ffmpeg around for trimming, batch jobs, loudness fixes, and the occasional unusual file.

Commands target ffmpeg 5.x and newer and are stable across recent versions. ffmpeg is free and open-source (ffmpeg.org). 16 kHz mono PCM is the standard input for Whisper-, Parakeet-, and Kaldi-based speech engines.