[Proof of Concept] Auto-adjusting subtitle timings, using Speech to Text recognition

(starting with) The problem

      We have all tried to watch a movie or a TV show, downloaded from the Internet, that had mismatched subtitle timings. And what was the solution? At least for me, I tried to find other subtitles that fit the bill or, worst case scenario, I had to readjust the timings with a program such as TimeAdjuster, etc. Of course, there was always the possibility that the subtitles seemed just fine in the beginning but got worse later on, and I had to "play live" with the subtitle delay within BSPlayer or XBMC :(

(why not) The idea

      First of all, I should say that I didn't want to use a Speech to Text (STT) system just to create new subtitles from scratch, and there are at least two reasons I can think of why:

  • Creating new subtitles from scratch using STT means somebody has to correct the recognition errors afterwards, which is quite superfluous when good subtitles for the same show already exist out there.
  • Using STT won't help if we want subtitles in a different language than the movie's. Of course, Google Translate or any other automatic translation system may do the dirty work for a while; however, this is not at all a decent solution (at least for Greek).

(probably) The algorithm

      Let's assume we have a movie in English to fix, The-A-Movie. For our movie, we have downloaded some subtitles that we found, English-Subtitles-A. At this point, we care only about English subtitles; we will deal with non-English ones later. First of all, we should extract the audio track from the movie, The-Audio-Track-1, just to play around with it more easily. Afterwards, we should feed the STT system with The-Audio-Track-1, extract as much text as we can, and save the resulting subtitles in a separate file, English-Subtitles-B. Of course these two files, English-Subtitles-A and English-Subtitles-B, both contain text with timestamps but in different formats, so extra care should be taken to support every format we need. Next, in order to make any time adjustments, we should compute a matching score between these two files, using a text distance metric like the Levenshtein distance (1). Also, the minimum-cost path has to be noted down, so as to identify which subtitle phrases in English-Subtitles-A correspond to which ones in English-Subtitles-B, without caring about the vice versa direction at all. Somebody could point out secondary minimum paths as a possible problem, and (2) is a possible solution; however, the texts will be so lengthy that I imagine this won't be a problem at all. At this point, we can save English-Subtitles-A readjusted, using the timings of English-Subtitles-B.
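      To make the matching and readjustment steps more concrete, here is a minimal Python sketch of them. It is only a sketch under a few assumptions: both files are plain SRT (a real tool would have to support more formats, as noted above), and difflib.SequenceMatcher from the standard library stands in for a hand-rolled Levenshtein table - its matching blocks play the role of the minimum path described above. The names parse_srt and align_subtitles are made up for illustration.

    import difflib
    import re

    # Matches "HH:MM:SS,mmm --> ..." lines followed by the cue text of an SRT file.
    SRT_CUE = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> [^\n]*\n(.*?)(?:\n\n|\Z)", re.S)

    def parse_srt(path):
        """Return a list of (start_seconds, text) pairs from an SRT file."""
        with open(path, encoding="utf-8") as f:
            data = f.read().replace("\r\n", "\n")  # normalize line endings
        cues = []
        for h, m, s, ms, text in SRT_CUE.findall(data):
            start = int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0
            cues.append((start, " ".join(text.split())))
        return cues

    def align_subtitles(subs_a, subs_b):
        """Give each cue of subs_a the start time of its best match in subs_b."""
        # Flatten both files into word sequences, remembering the owner cue
        # of every word in A and the start time of every word in B.
        words_a, owner_a = [], []
        for i, (_, text) in enumerate(subs_a):
            for w in text.lower().split():
                words_a.append(w)
                owner_a.append(i)
        words_b, start_b = [], []
        for start, text in subs_b:
            for w in text.lower().split():
                words_b.append(w)
                start_b.append(start)
        # The matching blocks approximate the minimum-cost alignment path.
        matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
        new_start = {}
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                # Keep the earliest matched word's timing for each cue of A.
                new_start.setdefault(owner_a[block.a + k], start_b[block.b + k])
        # Unmatched cues keep their old timing; a real tool would rather
        # interpolate between their matched neighbours.
        return [(new_start.get(i, start), text)
                for i, (start, text) in enumerate(subs_a)]

      A real implementation would then write the result back out in SRT form and probably smooth the new timings a bit, but the core of the alignment is just this.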

      The same procedure can also be used for non-English subtitles. Let's say we have also downloaded Greek-Subtitles-A for the above movie, The-A-Movie. In order to proceed, we have to match Greek-Subtitles-A against English-Subtitles-A using their timestamps. If we see each timestamp as a number (for example, seconds from the beginning of the file), then what we have is just two one-dimensional vectors. Even though this is going to sound quite fuzzy, using a distance metric (again, something like the Levenshtein distance) we can match which point of the first vector corresponds to which point of the second vector. In this way, we have a mechanism to hop between the two vectors and, equivalently, a mechanism to hop between English-Subtitles-A and Greek-Subtitles-A. Then, we "just" follow the algorithm in the previous paragraph to adjust the English subtitles (adjust English-Subtitles-A from English-Subtitles-B) and then adjust the non-English subtitles (adjust Greek-Subtitles-A from English-Subtitles-A).
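      The timestamp matching can be sketched too, under the assumption that both files are already parsed into sorted vectors of start times in seconds. Instead of a full edit distance over the two vectors, the sketch simply pairs every Greek cue with the nearest English cue in time, which should behave similarly when the two files were cut at roughly the same points; match_by_timestamp is again a made-up name.

    import bisect

    def match_by_timestamp(greek_starts, english_starts):
        """For each Greek start time, return the index of the closest
        English start time. Both inputs are sorted lists of seconds."""
        mapping = []
        for t in greek_starts:
            i = bisect.bisect_left(english_starts, t)
            # Look at the neighbour on each side and keep the closer one.
            near = [j for j in (i - 1, i) if 0 <= j < len(english_starts)]
            mapping.append(min(near, key=lambda j: abs(english_starts[j] - t)))
        return mapping

      With this mapping in hand, the readjusted timings of English-Subtitles-A can be copied over to the corresponding cues of Greek-Subtitles-A.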

(definitely) The conclusion

      Anyway, this is just an idea for a virtual system that could help with this problem by utilizing both preexisting subtitles and an open source STT system. I'll try to elaborate on the above idea with a figure or two - but I have to construct them first :D
If somebody likes the idea of building it, or of helping to build it, don't hesitate to send me a message! :)

(sort of) References

  1. Levenshtein distance - http://en.wikipedia.org/wiki/Levenshtein_distance
  2. Edit/Levenstein distance with time stamp - different paths with similar (minimal) cost - http://stackoverflow.com/questions/7027444/edit-levenstein-distance-with...


Comments

Obtaining subtitle timings by speech recognition is definitely a good idea. Measuring timings does not require precise recognition, and entering them by hand requires watching the video, hence consumes a lot of time. This is in contrast to creating subtitles by speech recognition, which is not that useful because it produces text with too many mistakes. Did you consider how a transcript is split into subtitles? This process is based on the grammatical structure of sentences, but also on word timings, and may be done by hand. Knowing the timings is necessary because no subtitle is allowed to hang on the screen for too long.
