In brief, the difficulty and quality of a transcription depend on the quality of the audio recording, the noise level, and the number of speakers. It’s hardly a surprise that transcribing a low-quality audio recording with loads of background noise and several speakers is more demanding (and thus more expensive).

The better the recording, the easier it is to transcribe, so you might want to read my blog post on how to improve audio recordings and make transcription cheaper and easier.

Contributing factors that complicate transcription include, but are not limited to:

  • Background noise
  • Loud noises
  • Bad acoustics
  • Multiple speakers speaking simultaneously or interrupting each other
  • Low volume or inaudible speech
  • Low recording quality (e.g. distorted audio, missing audio, reverberation)
  • Speech featuring jargon, slang, idiolect, heavy accents, dialects, etc.
  • Verbatim transcription (includes every single word exactly as spoken, even “um”, “hm”, and “like”)
  • Need for research on subjects not considered common knowledge, e.g. technical or medical topics
  • Need for timestamps


Although you can take certain precautions to eliminate noise pollution, recording quality also depends on your recording equipment, among other things. Using a professional microphone enhances the quality, but it isn’t an option for everyone or in every setting. However, even if you use your laptop or phone for recording audio (and video), you can optimise settings such as framerate, resolution, compression (when exporting/converting), and format, and you can keep your laptop drivers updated. Doing a soundcheck before recording can also reveal potential problems in advance.

Noise level

Noise is any unwanted sound that interferes with hearing the speaker(s) in an audio recording. Your hardware (microphone) limits the quality of the recording itself, and some noises are unavoidable, but you can still reduce the noise level effectively by thinking ahead and taking precautions to eliminate potential noise pollution. Read my blog post on easy steps to make better recordings and easier and cheaper transcriptions.

Number of speakers

Usually, an interview involves two people: an interviewer and an interviewee. Because the purpose of an interview is to hear the interviewee, the interviewer talks less.

The more participants in a conversation, the more voices, and thus the higher the risk of interruptions and people speaking over each other. That makes transcription considerably more complicated (time-consuming and expensive), especially if other factors weigh in at the same time, e.g. dialect, noise, or bad recording quality. With more than two speakers, keeping track of who is speaking becomes harder, or at least more time-consuming, and formatting becomes more laborious with each added participant.