Synchronized text with voice like karaoke

Hello,
I am trying to make a book for children. The idea is that a voice reads the tale to them and at the same time they can see the text, with the word they are hearing highlighted (like karaoke).
The voice and the highlighted word must be synchronized word by word (not line by line or sentence by sentence). Does anyone have an idea of how I can achieve this effect efficiently?
My idea: for each book page, create a layer with all the text for that page (the base layer), plus many transparent layers, each one containing a single highlighted word. I would show the base layer all the time and synchronize with the voice so that only the layer containing the word being spoken is shown in front of the base layer. But I think this will consume a lot of resources and will not be efficient or fast enough. What do you think?
Any other idea?
Thanks in advance!!

Unfortunately I imagine that a manual timestamp editing process will be required. Any automated method you could devise would likely involve an equal-or-greater amount of work.

Obviously that depends upon the total length of the spoken dialog. If you’ve dictated a novella, an automated analysis program begins to make a lot more sense.

Assuming you want the entirety of each word to highlight the instant the word is spoken, and it should return to normal the instant that piece of audio stops, each word will require two timecodes. If the next spoken word can trigger the de-highlight event of the previous word, you cut your timestamping work in half.

If it were me, I’d use the following structure:

Your application plays through a collection of design-time constructed Sentence objects.

A Sentence object contains:

A string representing the spoken dialog

A reference to a single Text object which is part of a canvas. When the sentence is loaded, the text changes to reflect the new string. By enabling “rich text” you will be able to colorize a given word independently with a teensy bit of coding. You can write a handy WrapStringInColorTags method like mine.

An audio clip of the specific sentence or a timestamp and duration for accessing that segment from a larger track, whichever is more convenient.

A List of floats as the timecodes for each word – more on this…
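Put together, that Sentence structure might look something like this sketch in C# (field names are illustrative, not a definitive implementation):

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI;

// Hypothetical sketch of the Sentence object described above.
[System.Serializable]
public class Sentence
{
    public string dialog;          // the spoken text, e.g. "The cat sat on the mat"
    public Text textObject;        // UI Text on the canvas, with rich text enabled
    public AudioClip clip;         // a per-sentence clip, or alternatively...
    public float startTime;        // ...a timestamp into a larger track
    public float duration;         // ...and the segment's duration
    public List<float> timecodes;  // one entry per word: when that word begins
}
```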

The method you’ll be invoking at each timecode need only advance an indexer and colorize the corresponding word in the Text object’s string. Each call nullifies the previous highlighting, and highlights the appropriate word.
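A minimal version of that advance-and-colorize step could look like the following, assuming the dialog splits cleanly on spaces and the Text component has rich text enabled (WrapStringInColorTags here is my own trivial version of the helper mentioned above):

```csharp
int wordIndex = -1;

// Wraps a word in Unity rich-text color tags.
string WrapStringInColorTags(string word, string color)
{
    return "<color=" + color + ">" + word + "</color>";
}

// Rebuilds the sentence text with only the word at `index` colorized.
// Because the string is rebuilt from scratch each time, the previous
// word's highlighting is nullified automatically.
void HighlightWord(Sentence s, int index)
{
    string[] words = s.dialog.Split(' ');
    if (index >= 0 && index < words.Length)
        words[index] = WrapStringInColorTags(words[index], "yellow");
    s.textObject.text = string.Join(" ", words);
}

// Invoke this at each timecode.
void AdvanceHighlight(Sentence s)
{
    wordIndex++;
    HighlightWord(s, wordIndex);
}
```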

Write a helper script for yourself which holds and plays Sentences at runtime. Play the clip source at a reduced pitch if necessary for greater precision, but note you’ll have to convert the recorded timecodes back relative to a normal pitch if you do this, since pitch control in Unity affects playback time.
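The conversion back from a slowed recording is just a multiply, since a pitch of 0.5 makes the clip take twice as long in real time. A sketch, assuming you recorded elapsed wall-clock times while the AudioSource played at `recordingPitch`:

```csharp
using System.Collections.Generic;

// Rescales timecodes captured at a reduced pitch (e.g. 0.5f) back to
// normal-speed (pitch = 1.0) playback positions.
List<float> ConvertToNormalPitch(List<float> recorded, float recordingPitch)
{
    var converted = new List<float>(recorded.Count);
    foreach (float t in recorded)
        converted.Add(t * recordingPitch);
    return converted;
}
```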

As the audio plays, just as each spoken word begins, tap the space bar to make a note of the elapsed time in a List of floats. Save this list of floats as the Sentence’s timecodes. This will require some serialization trickery, but you could have a function which saves this list of recorded floats to the corresponding Sentence object. Gotta be the best way, I’d think.
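That tap-to-timestamp loop could be a small MonoBehaviour along these lines (illustrative only; the save step depends entirely on how you serialize your Sentence objects):

```csharp
using System.Collections.Generic;
using UnityEngine;

// Press Space as each spoken word begins; the elapsed time since playback
// started is appended to the list. If you slowed the pitch while recording,
// convert the results back before saving them to the Sentence.
public class TimecodeRecorder : MonoBehaviour
{
    public AudioSource source;   // plays the current Sentence's clip
    private float playbackStart;
    private List<float> timecodes = new List<float>();

    public void BeginRecording()
    {
        timecodes.Clear();
        playbackStart = Time.time;
        source.Play();
    }

    void Update()
    {
        if (source.isPlaying && Input.GetKeyDown(KeyCode.Space))
            timecodes.Add(Time.time - playbackStart);
    }

    // Hypothetical hand-off to whatever saves the Sentence asset.
    public List<float> GetTimecodes() { return timecodes; }
}
```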