I tinkered a bit with making an almost fully automated Korean translation & caption syncing for the Eyes on the Prize documentary for today’s staff meeting (which apparently was cancelled), but the results are not very good. At the same time, though, it’s not too much text – only 1,000 lines (not sentences) in Episode 3.
1. Extract the English captions from YouTube.
2. Parse the text out of the caption file's timecode syntax.
3. If a line is shorter than the median line length, assume there is a paragraph break at the end of that line. (The captions include no punctuation.)
4. Merge the paragraph lines into single paragraphs. (I used MediaWiki’s behavior of merging lines into one to do this)
5. Review the paragraphed text to ensure the punctuation and paragraph separation make sense. I noticed that even though there is no sentence punctuation, the captions do include periods in middle-name initials and things like “Mr.”
6. Run the text through Google Translate. It can only handle about one letter-size page of text at a time.
7. Removed all punctuation from the Korean and merged all the paragraphs.
8. Now to time-sync the Korean. For each caption, I divided the time length between caption points by the total runtime (1 hour), then multiplied that by the total character length of the Korean text. Since each caption's share gets way too close to zero, I buffed them up a bit by giving them 40% more than what they are supposed to get (so they gain extra characters).
9. Create the caption file and run. Results were pretty disappointing. Because captioning density throughout the film is extremely irregular, the Korean caption was almost never in sync with the English lines being spoken at that moment.
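For reference, here is a rough Python sketch of steps 3-4 and step 8. It is not the exact procedure I ran, just one way to write it down. The function names, the `boost` parameter, and the choice to let boosted slices overlap their neighbors (instead of shifting everything later) are assumptions made for illustration.

```python
from statistics import median


def merge_paragraphs(lines):
    """Steps 3-4: treat any line shorter than the median length as a paragraph end."""
    cutoff = median(len(line) for line in lines)
    paragraphs, current = [], []
    for line in lines:
        current.append(line)
        if len(line) < cutoff:
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing lines that never hit a short line
        paragraphs.append(" ".join(current))
    return paragraphs


def allocate_by_time(start_times, total_runtime, korean_text, boost=1.4):
    """Step 8: give each caption a slice of Korean text proportional to its duration.

    start_times are caption start points in seconds, sorted ascending.
    Assumption: each slice starts at its un-boosted position and is
    lengthened by `boost`, so neighboring slices overlap.
    """
    ends = list(start_times[1:]) + [total_runtime]
    total_chars = len(korean_text)
    slices = []
    for start, end in zip(start_times, ends):
        share = (end - start) / total_runtime
        begin = round(start / total_runtime * total_chars)
        length = round(share * total_chars * boost)
        slices.append(korean_text[begin:begin + length])
    return slices
```

With `boost=1.0` the slices tile the Korean text exactly once. The weakness noted in step 9 is visible in the formula: a long silent stretch between two caption points still receives a large share of the text.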
Another approach could be making the Korean lines proportional to the length of the English captions, instead of the length of time. Yeah.. actually that may not be too bad.
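A minimal sketch of that idea, untested on the real episode: each caption gets a slice of the Korean text whose size matches that English line's share of the total English characters, and the existing English timecodes are reused as-is. The function name is made up for illustration, and it assumes the Korean text has already been merged into one string with no punctuation.

```python
def allocate_by_english_length(english_lines, korean_text):
    """Split korean_text so each caption's share matches its English line's share."""
    total_en = sum(len(line) for line in english_lines)
    if total_en == 0:
        return ["" for _ in english_lines]
    slices = []
    consumed_en = 0
    pos = 0
    for line in english_lines:
        consumed_en += len(line)
        # Cut at the cumulative proportion so rounding errors do not pile up.
        end = round(consumed_en * len(korean_text) / total_en)
        slices.append(korean_text[pos:end])
        pos = end
    return slices


if __name__ == "__main__":
    demo = allocate_by_english_length(["aaaa", "bb", "cccccc"], "가나다라마바")
    print(demo)
```

This still cuts in the middle of Korean words, and since Korean word order differs from English, a slice will not always match what is being said in that caption. It should at least follow the caption density of the film, which the time-based split could not.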