In the game I’m working on I have to play several short consecutive voice-clips to form a complete sentence. Example (each bracket is a different voice clip):
[Bob here,] [we're at] [some town] [and are on our way to] [some city].
Stitching together different voice-clips like this makes the sentence sound stilted and disconnected: there are unnatural pauses when switching clips, and the pitch and tone of the speaker change between clips.
My current efforts include two methods for removing the unnatural pauses:
- starting the next clip early if a silence is detected at the end of the preceding clip
- skipping the first few milliseconds of the new clip up to the first detected ‘sound’.
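To make the two steps above concrete, here is a minimal sketch of threshold-based silence trimming (Python/NumPy rather than Unity C#, purely for illustration). The `threshold` and `window_ms` values are assumptions, not values from my project; as noted below, a single fixed threshold is exactly what breaks down across different voice-actors and microphones:

```python
import numpy as np

def trim_silence(samples, sample_rate, threshold=0.02, window_ms=10):
    """Trim leading and trailing 'silence' from a mono float clip.

    threshold: RMS amplitude below which a window counts as silent
               (an assumed value; in practice it varies per actor/mic).
    window_ms: size of the analysis window in milliseconds.
    """
    window = max(1, int(sample_rate * window_ms / 1000))
    n = len(samples)
    # RMS energy per fixed-size window
    rms = [np.sqrt(np.mean(samples[i:i + window] ** 2))
           for i in range(0, n, window)]
    # First window from the front that is above the threshold
    first = next((i for i, r in enumerate(rms) if r > threshold), 0)
    # First window from the back that is above the threshold
    last = next((i for i, r in enumerate(reversed(rms)) if r > threshold), 0)
    start = first * window
    end = n - last * window
    return samples[start:end]
```

Trimming the trailing silence of clip N and the leading silence of clip N+1 before queueing them is equivalent to the two bullet points above, just done on the sample buffers instead of at playback time.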
These work OK at removing the unnatural pauses, but deciding what counts as ‘silence’ is difficult, especially when dealing with multiple voice-actors and microphones.
How could I make stitching together voice-clips sound more natural? Any advice would be appreciated. This has to be done in real-time inside the game (I’m using Unity), and can’t be pre-processed or done ahead of time.