How to Transcribe Bad Audio: Clean It Up Before You Transcribe

Bad audio makes automatic transcription unreliable and manual transcription exhausting. Learn how to improve audio quality before transcription for faster, more accurate results.

November 23, 2025 | 5 min read | By WefixSound Engineers

Ready to restore your audio?

Free sample within 24–48 h. You only pay if you're happy.

Automatic speech recognition (ASR) tools like Otter.ai, Rev, Google Speech-to-Text, and Whisper produce dramatically different results depending on audio quality. A clean recording might transcribe at 95%+ accuracy; the same words in a noisy recording might transcribe at 60-70%, requiring extensive correction that can take longer than transcribing manually.

If you're trying to transcribe bad audio — meeting recordings, interview recordings, phone calls, archival audio — the most effective approach is often to clean up the audio first, then transcribe.

Why Audio Quality Matters for Transcription

Speech recognition models are trained largely on clear speech. They perform best with:

  • Clear, consistent audio levels
  • Low background noise relative to the voice
  • Clear consonants (the plosives P, B, T, D, K, and G are particularly easy to lose)
  • Single speaker or clearly separated speakers

When audio quality degrades, the model's performance drops significantly:

  • Background music or voices cause word substitution errors
  • Low signal-to-noise ratio (SNR), where noise is loud relative to speech, causes deletion errors (words the model fails to detect)
  • Reverb smears consonants, causing confusion between similar-sounding phonemes
  • Multiple simultaneous speakers cause massive accuracy drops

Even a moderate improvement in audio quality can raise transcription accuracy from 70% to 90%, which cuts the errors to correct from about 30 per 100 words to about 10, roughly a third of the original correction work.
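Accuracy figures like these are usually the complement of word error rate (WER), which counts substitutions, deletions, and insertions against a trusted reference transcript. The sketch below is one minimal way to measure it on a short hand-corrected excerpt; the function names are illustrative and not taken from any particular tool.

```python
def wer_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for the cheapest word alignment."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # cost[i][j] = (total edits, subs, dels, ins) to turn ref[:i] into hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)          # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                candidates = [(t, s, d, n)]            # match, no edit
            else:
                candidates = [(t + 1, s + 1, d, n)]    # substitution
            t, s, d, n = cost[i - 1][j]
            candidates.append((t + 1, s, d + 1, n))    # deletion
            t, s, d, n = cost[i][j - 1]
            candidates.append((t + 1, s, d, n + 1))    # insertion
            cost[i][j] = min(candidates)
    _, subs, dels, ins = cost[len(ref)][len(hyp)]
    return subs, dels, ins


def wer(reference, hypothesis):
    """Word error rate: total edits divided by the number of reference words."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return sum(wer_counts(reference, hypothesis)) / ref_len
```

Running this on the same one-minute excerpt before and after cleanup gives a quick, comparable number for whether the processing actually helped.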

Step 1: Assess Whether Cleanup Is Worthwhile

Not every audio problem benefits equally from cleanup before transcription:

Most helped by cleanup:

  • Consistent background noise (HVAC, traffic): noise reduction helps dramatically
  • Muffled audio: EQ and clarity processing improve ASR performance
  • Multiple speakers with level imbalance: normalization helps

Somewhat helped:

  • Room echo and reverb: de-reverb can improve transcription accuracy somewhat
  • Occasional interruptions or background voices: cleanup reduces but doesn't eliminate errors

Human transcription may be the only option:

  • Heavy clipping where content is genuinely unclear
  • Very low SNR recordings where noise dominates
  • Non-standard accents combined with poor audio — human transcription is more reliable than ASR
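One rough way to put a number on this assessment is to compare a stretch of speech against a stretch where nobody is talking. The sketch below assumes you have already loaded both segments as lists of float samples between -1.0 and 1.0; the triage thresholds are illustrative assumptions, not values from this article or from any ASR vendor.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a segment."""
    return sum(x * x for x in samples) / len(samples)


def estimate_snr_db(speech_plus_noise, noise_only):
    """Estimate SNR in dB from a segment containing speech and a noise-only segment."""
    noise_power = max(mean_power(noise_only), 1e-12)
    # The speech segment also contains the noise, so subtract it out.
    speech_power = max(mean_power(speech_plus_noise) - noise_power, 1e-12)
    return 10 * math.log10(speech_power / noise_power)


def triage(snr_db, good_db=20.0, poor_db=5.0):
    """Map an SNR estimate to a next step. Thresholds are hypothetical starting points."""
    if snr_db >= good_db:
        return "transcribe directly"
    if snr_db >= poor_db:
        return "clean up first"
    return "consider human transcription"
```

The estimate only reflects steady background noise. Clipping, reverb, and overlapping speakers need to be judged by ear.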

Step 2: Clean the Audio for Transcription

Essential processing for transcription-quality audio:

Noise reduction: The single most important improvement. Profile and remove consistent background noise. For transcription purposes, you can apply more aggressive noise reduction than for listening purposes — slight artifacts are acceptable if intelligibility improves.

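High-pass filter: Remove low-frequency rumble below 80-100 Hz. Rumble at these frequencies carries almost no speech information, but it adds energy that can throw off level normalization and degrade recognition.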
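To illustrate what a high-pass filter does, here is a minimal first-order (6 dB per octave) version in plain Python. It is a teaching sketch that assumes samples are floats between -1.0 and 1.0; audio editors and restoration tools typically use steeper filters than this.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz=80.0):
    """First-order RC high-pass filter: attenuates content below cutoff_hz."""
    if not samples:
        return []
    rc = 1.0 / (2 * math.pi * cutoff_hz)   # time constant for the chosen cutoff
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for n in range(1, len(samples)):
        # Pass changes between samples, let slow drifts decay away.
        out.append(alpha * (out[-1] + samples[n] - samples[n - 1]))
    return out
```

With an 80 Hz cutoff, a 20 Hz rumble comes out at roughly a quarter of its original amplitude, while speech content around 1 kHz passes almost unchanged.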

Level normalization: Ensure all speech sections are at consistent, moderate levels. ASR models perform worse on audio that's too quiet or inconsistent.
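As a sketch of level normalization, the function below scales a recording to a target RMS level while keeping peaks under a ceiling. The -20 dBFS target and 0.99 ceiling are assumed starting points, not requirements of any ASR tool, and samples are assumed to be floats between -1.0 and 1.0.

```python
import math


def normalize_rms(samples, target_dbfs=-20.0, peak_ceiling=0.99):
    """Scale samples toward a target RMS level without letting peaks clip."""
    current_rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if current_rms == 0:
        return list(samples)               # silence: nothing to scale
    gain = 10 ** (target_dbfs / 20.0) / current_rms
    peak = max(abs(x) for x in samples)
    gain = min(gain, peak_ceiling / peak)  # back off if the loudest peak would clip
    return [x * gain for x in samples]
```

Applying this per speech section rather than to the whole file gets closer to the "consistent levels" goal when some sections are much quieter than others.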

Compression: Reduce level variation so the ASR model has consistent input throughout.
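The following is a simplified compressor to show the idea: anything above a threshold is turned down by a fixed ratio, so loud and quiet passages end up closer together. The threshold, ratio, and release values are assumptions for illustration, and real compressors add attack smoothing and make-up gain that this sketch omits.

```python
import math


def compress(samples, sample_rate, threshold_db=-20.0, ratio=3.0, release_ms=100.0):
    """Reduce level above threshold_db by the given ratio (instant attack, smooth release)."""
    release = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    envelope = 0.0
    out = []
    for x in samples:
        # Track the signal level: jump up immediately, fall back gradually.
        envelope = max(abs(x), envelope * release)
        env_db = 20 * math.log10(max(envelope, 1e-9))
        over_db = env_db - threshold_db
        gain_db = -over_db * (1 - 1 / ratio) if over_db > 0 else 0.0
        out.append(x * 10 ** (gain_db / 20.0))
    return out
```

With these settings, a passage peaking at 0.5 and one peaking at 0.05 start 20 dB apart and end up about 10.7 dB apart, because only the louder one is above the threshold.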

For multi-speaker recordings: Bring all speakers to approximately equal levels before transcription. Quieter speakers are disproportionately transcribed incorrectly.

iZotope RX for transcription prep:
RX's Dialogue Isolate, Voice De-noise, and Dialogue De-reverb modules are designed to separate speech from background noise and room reflections. Making speech more intelligible is exactly what transcription preparation needs. (Dialogue Contour, by contrast, reshapes pitch and intonation and is not a cleanup tool.)

Step 3: Choose the Right Transcription Tool

For clean audio after restoration:

  • Whisper (OpenAI): Highly accurate on clean audio, free and open source, handles many languages
  • Otter.ai: Good accuracy, real-time and upload, speaker identification
  • Rev AI: High accuracy, API access for workflow integration
  • Google Speech-to-Text: Good for Google ecosystem users
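For readers who want to try the open-source route, here is a minimal sketch using the `openai-whisper` Python package. It assumes the package and ffmpeg are installed; the `"base"` model and the helper names `format_segments` and `transcribe_file` are illustrative choices. The timestamps make it easier to find each passage in the audio during correction.

```python
def format_timestamp(seconds):
    """Render seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_segments(segments):
    """Turn Whisper-style segments (dicts with 'start' and 'text') into readable lines."""
    return [f"[{format_timestamp(s['start'])}] {s['text'].strip()}" for s in segments]


def transcribe_file(path, model_name="base"):
    """Transcribe a cleaned audio file and return timestamped lines."""
    import whisper  # assumption: installed via `pip install openai-whisper`, ffmpeg on PATH

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return format_segments(result["segments"])
```

Larger models are generally more accurate but slower, so it can be worth testing a short excerpt with a small model before committing to a long recording.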

For audio you can't fully clean up:

  • Rev (human transcription): Most expensive but most accurate for difficult audio
  • Temi: Low-cost automated transcription with a built-in editor for manual correction
  • Scribie: Manual transcription service for complex audio

For historical or legal audio, human transcription is appropriate for:

  • Legal proceedings where accuracy is critical
  • Historical recordings with significant background noise
  • Non-standard accents or vocabulary that defeats ASR models
  • Audio that has been through extensive restoration (document the restoration process)

Common Transcription Audio Scenarios

Meeting Recordings (Zoom, Teams, Phone)

Meeting recordings frequently transcribe poorly due to:

  • Multiple speakers talking simultaneously
  • Varied microphone quality between participants
  • Background noise from home offices
  • Compression artifacts from video conferencing platforms

Cleanup approach:

  1. Noise reduction for consistent background noise
  2. Level normalization to equalize speaker volumes
  3. De-reverb for participants with obvious room echo
  4. Speaker separation processing if available

Realistic accuracy improvement: From 65-75% with no cleanup to 85-90% after professional cleanup — typically worth the processing time for important meetings.

Phone Recording Transcription

Phone recordings have inherent bandwidth limitations (roughly 300-3400 Hz) but respond well to:

  • Presence EQ boost in the 2-3 kHz range
  • Noise reduction for line noise
  • Level normalization
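A presence boost is typically a peaking EQ band. The sketch below builds one from the widely used Audio EQ Cookbook biquad formulas; the 2.5 kHz center, +4 dB gain, and Q of 1.0 are assumed starting values to adjust by ear, and the 8 kHz sample rate in the usage note reflects typical narrowband phone audio.

```python
import cmath
import math


def peaking_eq_coeffs(sample_rate, center_hz=2500.0, gain_db=4.0, q=1.0):
    """Normalized biquad coefficients (b0, b1, b2, a1, a2) for a peaking EQ band."""
    a_lin = 10 ** (gain_db / 40.0)
    w0 = 2 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = 1 + alpha * a_lin, -2 * math.cos(w0), 1 - alpha * a_lin
    a0, a1, a2 = 1 + alpha / a_lin, -2 * math.cos(w0), 1 - alpha / a_lin
    return (b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0)


def apply_biquad(samples, coeffs):
    """Run samples through the filter (direct form I)."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def gain_db_at(coeffs, freq_hz, sample_rate):
    """Filter gain in dB at a given frequency, computed from the transfer function."""
    b0, b1, b2, a1, a2 = coeffs
    z1 = cmath.exp(-1j * 2 * math.pi * freq_hz / sample_rate)
    h = (b0 + b1 * z1 + b2 * z1 * z1) / (1 + a1 * z1 + a2 * z1 * z1)
    return 20 * math.log10(abs(h))
```

For example, `peaking_eq_coeffs(8000)` gives a band that lifts 2.5 kHz by 4 dB while leaving the lowest frequencies untouched, which `gain_db_at` can confirm before any audio is processed.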

For legal/compliance phone recording transcription, maximum accuracy requires both audio cleanup and human review of the ASR output.

Archival and Historical Recording Transcription

Old recordings on tape or early digital formats often need significant restoration before any transcription is attempted.

WefixSound provides audio restoration specifically prepared for transcription use cases — optimizing the processing chain for maximum speech intelligibility rather than just listening quality. We work with:

  • Oral history projects transcribing decades of interview recordings
  • Legal professionals needing historical deposition transcription
  • Journalists working with archival interview footage
  • Corporate clients needing meeting archives made text-searchable

Our free 60-second sample demonstrates what speech clarity is achievable from your specific recording. Better cleanup = better transcription accuracy = less correction time.

After Transcription: Correction Workflow

Even with clean audio and good ASR, correction is usually required:

  • Technical jargon and proper nouns often need correction
  • Speaker labels may need adjustment
  • Numbers, dates, and specific names need verification
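A simple script can pre-mark the items most likely to need verification so they are not missed during review. This is a crude heuristic sketch, not part of any transcription tool: it flags anything containing a digit and any capitalized word that does not start a sentence, so it will also flag words like "I" and will miss names at the start of a sentence.

```python
import re

HAS_DIGIT = re.compile(r"\d")


def flag_for_review(transcript):
    """Return (token, reason) pairs worth checking against the audio."""
    flags = []
    # Split into sentences so sentence-initial capitals are not treated as names.
    for sentence in re.split(r"(?<=[.!?])\s+", transcript.strip()):
        for index, word in enumerate(sentence.split()):
            token = word.strip(".,;:!?\"'()")
            if not token:
                continue
            if HAS_DIGIT.search(token):
                flags.append((token, "number or date"))
            elif index > 0 and token[0].isupper():
                flags.append((token, "possible proper noun"))
    return flags
```

The output works as a checklist: each flagged token is a spot to replay in the audio before signing off on the transcript.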

Efficient correction workflow:

  1. Listen at 1.25x speed while following the transcript
  2. Stop and correct at each error
  3. Focus on meaning-critical errors first; stylistic corrections can come later
  4. Use a transcript editor that syncs text to audio (Otter, Descript, oTranscribe)

When Bad Audio Makes Human Transcription Unavoidable

Some recordings simply require human transcription regardless of cleanup:

  • Extreme background noise situations (live event recordings, street interviews)
  • Heavy accents combined with poor recording quality
  • Technical or specialized vocabulary outside ASR training data
  • Legal/forensic standards requiring human attestation

For these cases, professional transcription services (Rev, Scribie, Verbit) have human transcriptionists experienced in difficult audio.

Cleaning audio before transcription is often the most cost-effective investment in transcription quality. The processing time is small compared to the correction time saved by going from 70% to 90% accuracy. For important recordings, WefixSound's transcription-optimized audio restoration delivers the clarity that makes both ASR and human transcription more accurate and efficient.
