How to Transcribe Bad Audio: Clean It Up Before You Transcribe

Automatic speech recognition (ASR) tools like Otter.ai, Rev, Google Speech-to-Text, and Whisper produce dramatically different results based on audio quality. A clean recording might transcribe at 95%+ accuracy; the same words in a noisy recording might transcribe at 60-70%, requiring extensive correction that often takes longer than transcribing manually.

If you're trying to transcribe bad audio — meeting recordings, interview recordings, phone calls, archival audio — the most effective approach is often to clean up the audio first, then transcribe.

Why Audio Quality Matters for Transcription

Speech recognition models are trained on clean audio. They perform best with:

Clear, consistent audio levels
Low background noise relative to the voice
Clear consonants (particularly challenging: P, B, T, D, K, G sounds)
Single speaker or clearly separated speakers

When audio quality degrades, the model's performance drops significantly:

Background music or voices cause word substitution errors
Low signal-to-noise ratio (SNR), where the noise is loud relative to the speech, causes deletion errors (words the model can't detect)
Reverb smears consonants, causing confusion between similar-sounding phonemes
Multiple simultaneous speakers cause massive accuracy drops

Even a moderate improvement in audio quality can raise transcription accuracy from 70% to 90%. That takes the error count from roughly 30 words per hundred to roughly 10, cutting the correction work by about two-thirds.
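To put numbers on "accuracy", the sketch below treats it as 1 minus the word error rate (WER). That is an assumption, since ASR vendors do not all define accuracy the same way. The function names are illustrative and do not come from any particular tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def words_to_correct(total_words: int, accuracy: float) -> int:
    """Rough count of wrong words, assuming accuracy = 1 - WER."""
    return round(total_words * (1.0 - accuracy))


if __name__ == "__main__":
    print(word_error_rate("the quick brown fox", "the quick fox"))
    print(words_to_correct(1000, 0.70), words_to_correct(1000, 0.90))
```

Under that assumption, a 1,000-word transcript goes from about 300 words needing correction at 70% to about 100 at 90%.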

Step 1: Assess Whether Cleanup Is Worthwhile

Not every audio problem benefits equally from cleanup before transcription:

Most helped by cleanup:

Consistent background noise (HVAC, traffic) — noise reduction dramatically helps
Muffled audio — EQ and clarity processing improves ASR performance
Multiple speakers with level imbalance — normalization helps

Somewhat helped:

Room echo and reverb — de-reverb can improve some transcription accuracy
Occasional interruptions or background voices — reduce but don't eliminate errors

Human transcription may be the only option:

Heavy clipping where content is genuinely unclear
Very low SNR recordings where noise dominates
Non-standard accents combined with poor audio — human transcription is more reliable than ASR
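One rough way to judge how noisy a recording is before committing to cleanup is to compare the loudest frames (assumed to be speech) with the quietest frames (assumed to be background noise). The sketch below is a heuristic only. The frame length, the 10% fraction, and the function names are illustrative assumptions, and samples are assumed to be floats between -1.0 and 1.0.

```python
import math


def frame_rms(samples, frame_len):
    """RMS level of each complete, non-overlapping frame."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return levels


def estimate_snr_db(samples, frame_len=400, fraction=0.1):
    """Crude SNR estimate: loudest frames vs. quietest frames, in dB."""
    levels = sorted(frame_rms(samples, frame_len))
    if not levels:
        raise ValueError("recording is shorter than one frame")
    k = max(1, int(len(levels) * fraction))
    noise = sum(levels[:k]) / k    # quietest frames ~ noise floor
    speech = sum(levels[-k:]) / k  # loudest frames ~ speech plus noise
    if noise == 0:
        return float("inf")
    return 20.0 * math.log10(speech / noise)
```

A recording with no pauses in the speech will fool this estimate, so treat the result as a first impression rather than a measurement.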

Step 2: Clean the Audio for Transcription

Essential processing for transcription-quality audio:

Noise reduction: The single most important improvement. Profile and remove consistent background noise. For transcription purposes, you can apply more aggressive noise reduction than for listening purposes — slight artifacts are acceptable if intelligibility improves.

High-pass filter: Remove low-frequency rumble below 80-100 Hz. Rumble confuses ASR models.
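As a minimal sketch of the high-pass step, the code below builds a second-order Butterworth-style high-pass filter from the widely used "Audio EQ Cookbook" biquad formulas and applies it sample by sample. The 80 Hz default comes from the range above. The function names and the choice of a single second-order stage are assumptions; an audio editor's built-in filter will usually be steeper.

```python
import math


def highpass_coefficients(cutoff_hz, sample_rate, q=1 / math.sqrt(2)):
    """Normalized biquad high-pass coefficients (b, a) with a[0] == 1."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b = [(1.0 + cos_w0) / 2.0 / a0, -(1.0 + cos_w0) / a0, (1.0 + cos_w0) / 2.0 / a0]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a


def apply_biquad(samples, b, a):
    """Run a direct-form I biquad over a list of float samples."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def highpass(samples, sample_rate, cutoff_hz=80.0):
    """Attenuate rumble below cutoff_hz while leaving speech frequencies intact."""
    b, a = highpass_coefficients(cutoff_hz, sample_rate)
    return apply_biquad(samples, b, a)
```

With a 16 kHz sample rate, a 20 Hz rumble tone comes out at well under a tenth of its original level, while a 1 kHz tone passes essentially unchanged.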

Level normalization: Ensure all speech sections are at consistent, moderate levels. ASR models perform worse on audio that's too quiet or inconsistent.

Compression: Reduce level variation so the ASR model has consistent input throughout.

For multi-speaker recordings: Bring all speakers to approximately equal levels before transcription. Quieter speakers are disproportionately transcribed incorrectly.
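The leveling idea can be sketched as follows, assuming the recording has already been split into one list of float samples (between -1.0 and 1.0) per speaker turn. The -20 dBFS target, the peak ceiling, and the function names are illustrative assumptions, not settings recommended by any particular ASR vendor.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def normalize_segment(samples, target_dbfs=-20.0, peak_ceiling=0.99):
    """Scale one segment to the target RMS level without letting peaks clip."""
    level = rms_dbfs(samples)
    if level == float("-inf"):
        return list(samples)  # silence: nothing to scale
    gain = 10.0 ** ((target_dbfs - level) / 20.0)
    peak = max(abs(x) for x in samples)
    gain = min(gain, peak_ceiling / peak)  # back off if peaks would clip
    return [x * gain for x in samples]


def level_speakers(segments, target_dbfs=-20.0):
    """Bring every speaker segment to roughly the same level."""
    return [normalize_segment(seg, target_dbfs) for seg in segments]
```

A loud segment and a quiet segment passed through level_speakers both come out at the same RMS level, which is the effect the normalization and multi-speaker steps are after.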

iZotope RX for transcription prep:
RX's Dialogue Isolate module is designed to separate speech from background noise, which fits transcription preparation well. Dialogue Contour, by contrast, reshapes the intonation of speech and is less relevant to intelligibility.

Step 3: Choose the Right Transcription Tool

For clean audio after restoration:

Whisper (OpenAI): Extremely accurate, free/open source, handles multiple languages
Otter.ai: Good accuracy, real-time and upload, speaker identification
Rev AI: High accuracy, API access for workflow integration
Google Speech-to-Text: Good for Google ecosystem users
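For Whisper, a minimal sketch of transcribing a cleaned file with the open-source openai-whisper Python package might look like this. It assumes the package and ffmpeg are installed. The "base" model size and the helper names are illustrative choices, and larger models are generally more accurate but slower.

```python
def transcribe_file(path, model_name="base"):
    """Transcribe an audio file with Whisper and return its timed segments."""
    import whisper  # third-party package: pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["segments"]  # each segment has "start", "end", and "text"


def format_timestamp(seconds):
    """Format a time offset in seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_segments(segments):
    """Turn Whisper-style segments into a timestamped plain-text transcript."""
    lines = []
    for seg in segments:
        lines.append(f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}")
    return "\n".join(lines)
```

Calling format_segments(transcribe_file("cleaned_interview.wav")) would then produce a transcript with one timestamped line per segment. The file name here is hypothetical.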

For audio you can't fully clean up:

Rev (human transcription): Most expensive but most accurate for difficult audio
Temi: Low-cost human-assisted transcription
Scribie: Manual transcription service for complex audio

For historical or legal audio, human transcription is appropriate for:

Legal proceedings where accuracy is critical
Historical recordings with significant background noise
Non-standard accents or vocabulary that defeats ASR models
Audio that has been through extensive restoration (document the restoration process)

Common Transcription Audio Scenarios

Meeting Recordings (Zoom, Teams, Phone)

Meeting recordings frequently transcribe poorly due to:

Multiple speakers talking simultaneously
Varied microphone quality between participants
Background noise from home offices
Compression artifacts from video conferencing platforms

Cleanup approach:

Noise reduction for consistent background noise
Level normalization to equalize speaker volumes
De-reverb for participants with obvious room echo
Speaker separation processing if available

Realistic accuracy improvement: From 65-75% with no cleanup to 85-90% after professional cleanup — typically worth the processing time for important meetings.

Phone Recording Transcription

Phone recordings have inherent bandwidth limitations (300 Hz - 3400 Hz) but respond well to:

Presence EQ boost in the 2-3 kHz range
Noise reduction for line noise
Level normalization
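The presence boost can be sketched as a single peaking EQ band, again using the "Audio EQ Cookbook" biquad formulas. The 2.5 kHz center sits inside the 2-3 kHz range above. The +4 dB gain, the Q of 1.0, and the function names are assumptions to adjust by ear.

```python
import cmath
import math


def peaking_eq_coefficients(center_hz, gain_db, sample_rate, q=1.0):
    """Normalized biquad peaking-EQ coefficients (b, a) with a[0] == 1."""
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha / amp
    b = [(1.0 + alpha * amp) / a0, -2.0 * cos_w0 / a0, (1.0 - alpha * amp) / a0]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha / amp) / a0]
    return b, a


def gain_at(freq_hz, b, a, sample_rate):
    """Filter gain in dB at one frequency, from the transfer function."""
    z_inv = cmath.exp(-1j * 2.0 * math.pi * freq_hz / sample_rate)
    num = b[0] + b[1] * z_inv + b[2] * z_inv * z_inv
    den = a[0] + a[1] * z_inv + a[2] * z_inv * z_inv
    return 20.0 * math.log10(abs(num / den))


def presence_boost(samples, sample_rate, center_hz=2500.0, gain_db=4.0):
    """Apply the presence band to float samples (direct-form I biquad)."""
    b, a = peaking_eq_coefficients(center_hz, gain_db, sample_rate)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out
```

At an 8 kHz sample rate, which is an assumed but common rate for narrowband phone audio, the filter lifts 2.5 kHz by the requested gain and leaves the lowest frequencies untouched.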

For legal/compliance phone recording transcription, maximum accuracy requires both audio cleanup and human review of the ASR output.

Archival and Historical Recording Transcription

Old recordings on tape or early digital formats often need significant restoration before any transcription is attempted.

WefixSound provides audio restoration specifically prepared for transcription use cases — optimizing the processing chain for maximum speech intelligibility rather than just listening quality. We work with:

Oral history projects transcribing decades of interview recordings
Legal professionals needing historical deposition transcription
Journalists working with archival interview footage
Corporate clients needing meeting archives made text-searchable

Our free 60-second sample demonstrates what speech clarity is achievable from your specific recording. Better cleanup = better transcription accuracy = less correction time.

After Transcription: Correction Workflow

Even with clean audio and good ASR, correction is usually required:

Technical jargon and proper nouns often need correction
Speaker labels may need adjustment
Numbers, dates, and specific names need verification

Efficient correction workflow:

Listen at 1.25x speed while following the transcript
Stop and correct at each error
Focus on meaning-critical errors first; stylistic corrections can come later
Use a transcript editor that syncs text to audio (Otter, Descript, oTranscribe)

When Bad Audio Makes Human Transcription Unavoidable

Some recordings simply require human transcription regardless of cleanup:

Extreme background noise situations (live event recordings, street interviews)
Heavy accents combined with poor recording quality
Technical or specialized vocabulary outside ASR training data
Legal/forensic standards requiring human attestation

For these cases, professional transcription services (Rev, Scribie, Verbit) have human transcriptionists experienced in difficult audio.

Cleaning audio before transcription is often the most cost-effective investment in transcription quality. The processing time is small compared to the correction time saved by going from 70% to 90% accuracy. For important recordings, WefixSound's transcription-optimized audio restoration delivers the clarity that makes both ASR and human transcription more accurate and efficient.
