How to get consecutive voice clips to sound natural

In the game I’m working on I have to play several short consecutive voice-clips to form a complete sentence. Example (each bracket is a different voice clip):

[Bob here,] [we're at] [some town] [and are on our way to] [some city].

Stitching together different voice-clips like this makes the result sound stilted and disconnected: there are unnatural pauses when switching clips, and the pitch and tone of the speaker change between them.

My current efforts include two methods for removing the unnatural pauses:

  1. starting the next clip early if a silence is detected at the end of the preceding clip
  2. skipping the first few milliseconds of the new clip up to the first detected ‘sound’.

These work OK at removing the unnatural pauses, but deciding what counts as ‘silence’ is difficult, especially when dealing with multiple voice-actors and microphones.
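For illustration, the two trimming steps above could be sketched roughly as follows. This is a minimal Python sketch working on a raw mono sample array, not Unity code; the `threshold` and `window` values are made-up placeholders that would need tuning per voice actor and microphone (which is exactly the difficulty described):

```python
import numpy as np

def trim_silence(samples, threshold=0.01, window=441):
    """Trim leading and trailing silence from a mono float sample array.

    `threshold` is an RMS amplitude below which a window counts as
    silence; `window` is the analysis length in samples (10 ms at
    44.1 kHz). Both are illustrative guesses, not tuned values.
    """
    n = len(samples)
    start, end = 0, n
    # Step 2 from the list: skip leading silent windows of the new clip.
    while start + window <= n:
        rms = np.sqrt(np.mean(samples[start:start + window] ** 2))
        if rms >= threshold:
            break
        start += window
    # Step 1 from the list: drop trailing silent windows of the
    # preceding clip, so the next clip can start early.
    while end - window >= start + window:
        rms = np.sqrt(np.mean(samples[end - window:end] ** 2))
        if rms >= threshold:
            break
        end -= window
    return samples[start:end]
```

One common workaround for the fixed-threshold problem is to make the threshold relative to each clip's own level (e.g. a small fraction of its peak amplitude or estimated noise floor) instead of an absolute constant.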

How could I make stitching together voice-clips sound more natural? Any advice would be appreciated. This has to be done in real-time inside the game (I’m using Unity), and can’t be pre-processed or done ahead of time.

You should look into Markov chain models (Wikipedia: Markov chain) and Hidden Markov Models. Since I don’t know enough to summarize them properly, I can only point you to this article: An introduction to part-of-speech tagging and the Hidden Markov Model