closed caption extraction
FrooBrar
12-24-2006, 05:40 PM
I have seen some info about extracting closed captions from ttg files (t2sami), but only for certain models. It has been said that for 540 models it is currently impossible. In writing tivodecode, I have gotten a report about "creepy crawly pixels" along the top of the screen, which someone in the sourceforge forum has identified as the line 21 caption info. Since the caption info for these models apparently has been located, does anyone know of free/open source software to extract it into some usable format? If not, perhaps someone could figure out how to write such a thing. It could be either a stand-alone post-conversion program, or I could possibly integrate such a thing into tivodecode if it is not too complex.
jmemmott
12-24-2006, 07:22 PM
This is probably not the time to go into the gory technical details but decoding the programs has never been the problem. As has been noted in other threads on tivodecode, within a short period of time after TivoToGo was available, it became apparent how to achieve everything tivodecode can accomplish using other means as long as one worked in Windows.
The real issue is that the hardware in the 240 does all the heavy lifting with respect to extracting the CC information – it digitizes the VBI waveform, bit slices line 21, extracts the CC packet and stores it in the PES stream in the clear. The 540 doesn’t do one or more of these things. To make matters worse, the technical details of what is being done are kept pretty tightly under wraps by Tivo and Broadcom. I suspect this is because it overlaps with other uses of the VBI such as onscreen thumbs, replacement commercials, etc., that these companies consider important to their business model. As a result even when you know where the VBI data resides in the 540 and can even extract it on a frame by frame basis, there are still plenty of blind alleys to get lost in.
I would be more than happy to compare notes on those alleys with anyone interested enough to pursue this path but I would also reiterate that these details are a little too gory and tedious for most people.
FrooBrar
12-26-2006, 05:00 PM
I am interested. Don't know what the appropriate forum for such discussions is, however. Also, I had some success with getting captions out of my converted files using this tool: http://www.geocities.com/mcpoodle43/SCC_TOOLS/DOCS/SCC_TOOLS.HTML#CCExtract
Maybe this will help you see how to grab the captions, I am still way over my head...
EDIT: I guess the code is written in some sort of language for something called General Parser. The code looks fairly straightforward to port to C, should make for a nice project for me to try...
jmemmott
12-27-2006, 01:59 PM
I am interested. Don't know what the appropriate forum for such discussions is, however. Also, I had some success with getting captions out of my converted files using this tool
Might as well leave it here for now. I don’t know of a better venue than this to find other people that might be able to help. Who knows, there might even be someone in Alviso that would listen to the case for exchanging minimal assistance for a cost effective means to satisfy one of the needs of their hearing impaired customers while protecting their own interests…
The technologies described in the links you point to are closest to the model 240 situation. The main difference is that these links start from the authoring point of view, which implies you already have all the timing data. In the 240 we are going the opposite direction. The user data ( 0x1b2 ) packets exist and contain 608 CC encoded data but they do not contain timing information. For this reason, other mpeg2 PES headers need to be tracked and decoded as well or you cannot build a CC stream that can be synchronized with the visual images. This is what T2Sami does.
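To illustrate the kind of timing information that has to come from the PES headers, here is a minimal Python sketch that decodes the 33-bit presentation time stamp (PTS) from its 5-byte field and converts it to milliseconds. The function names are made up for this example, and nothing here is taken from T2Sami itself; it only shows the standard MPEG-2 PTS bit layout and the 90 kHz clock.

```python
def decode_pts(field: bytes) -> int:
    """Decode a 33-bit PTS from the 5-byte PTS field of a PES header.

    Layout: 4 prefix bits, PTS[32..30], marker bit,
            PTS[29..15], marker bit, PTS[14..0], marker bit.
    """
    if len(field) != 5:
        raise ValueError("the PTS field is exactly 5 bytes long")
    return (
        (((field[0] >> 1) & 0x07) << 30)
        | (field[1] << 22)
        | ((field[2] >> 1) << 15)
        | (field[3] << 7)
        | (field[4] >> 1)
    )


def pts_to_ms(pts: int) -> int:
    """PTS values count ticks of a 90 kHz clock, so 90 ticks = 1 ms."""
    return pts // 90
```

A caption pair found in a user data packet could then be stamped with the time of the picture it travelled with, for example `pts_to_ms(decode_pts(header_bytes))`, where `header_bytes` is a hypothetical slice of the PES header.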
The 540 does not use this mechanism. Since I have not unraveled this technology yet, I am speculating, but the 540 user data packets do not appear to contain CC information. They do not contain 608 information and if you examine the way they change from frame to frame, they do not appear to have enough variable information content to account for the amount of information in CC data. Instead, the line of static that people have noticed across the top of the 540 image during playback leads me to believe that the VBI data is being encoded using an alternate technique that I have seen used with PC video capture cards.
This mechanism is more flexible, allowing arbitrary and unknown types of VBI information to be preserved along with the CC information. Using this mechanism, the incoming analog VBI signal is digitized and inserted into each frame as if it were a normal line of video. It is then compressed along with the normal video and stored in the mpeg stream for each frame. Upon playback, this line is decompressed, pulled out of the video frame and used to regenerate the appropriate VBI line. Since this process simply stores and regenerates the analog VBI signal, the 540 does not need to interpret or extract any of the information during the storage and retrieval process. This means that if we want the information that is stored in that VBI line, we must undertake the entire task of extracting it ourselves.
There are open source libraries originally developed for use with video capture cards that can assist but to use them, the incoming data must be remapped into a format they understand. The Tivo data is similar but not the same as any of the formats these libraries are designed to use. In an attempt to bridge the gap I have put together decoders and filters that allow me to dump and view the luminance signal in these scan lines (the black and white static at the top of the 540 frame). So far, it looks promising but I am still trying to determine if they contain line 21 data, something else or both. If I get through that successfully, I will still have to locate the data bits within the analog signal so they can be bitsliced accurately. Then and only then will there be a chance to actually see the 608 CC stream. Finally, all of this would then plug into the existing T2Sami code...
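The bit-slicing step described above can be sketched roughly as follows in Python. This is only an illustration under strong assumptions: the luminance samples for one scan line are already available as a list, and the position of the first data bit (`first_bit_center`) and the bit width in samples (`samples_per_bit`) are already known. Finding those two values is exactly the hard part mentioned in the post, and all of the names here are hypothetical.

```python
def slice_bits(luma, first_bit_center, samples_per_bit, nbits=16):
    """Sample the line at each bit center and threshold at mid-level."""
    threshold = (min(luma) + max(luma)) / 2
    bits = []
    for i in range(nbits):
        pos = int(round(first_bit_center + i * samples_per_bit))
        bits.append(1 if luma[pos] > threshold else 0)
    return bits


def bits_to_byte(bits):
    """Line 21 data is transmitted least-significant bit first."""
    value = 0
    for i, bit in enumerate(bits):
        value |= bit << i
    return value


def has_odd_parity(byte):
    """Each 608 byte is 7 data bits plus a parity bit (odd parity)."""
    return bin(byte).count("1") % 2 == 1


def decode_pair(luma, first_bit_center, samples_per_bit):
    """Return the two 7-bit caption bytes, or None on a parity error."""
    bits = slice_bits(luma, first_bit_center, samples_per_bit)
    pair = []
    for chunk in (bits[:8], bits[8:]):
        byte = bits_to_byte(chunk)
        if not has_odd_parity(byte):
            return None
        pair.append(byte & 0x7F)  # strip the parity bit
    return tuple(pair)
```

A parity failure on most lines would be one hint that the scan line holds something other than line 21 data, or that the assumed bit positions are off.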
FrooBrar
12-27-2006, 03:37 PM
The technologies described in the links you point to are closest to the model 240 situation. The main difference is that these links start from the authoring point of view, which implies you already have all the timing data. In the 240 we are going the opposite direction. The user data ( 0x1b2 ) packets exist and contain 608 CC encoded data but they do not contain timing information. For this reason, other mpeg2 PES headers need to be tracked and decoded as well or you cannot build a CC stream that can be synchronized with the visual images. This is what T2Sami does.
What I was trying to say was that the CCExtract tool on that page, which runs in an app called General Parser, is actually successful in retrieving the captions from an mpeg file from tivodecode. It needs to use the CCExtract_VES.gp file rather than the regular one, but it does get captions. Whatever that code does appears to know how to extract the caption data. I intend to try to port this into C. It does not look to be too horribly difficult; it is already mostly C-like syntax with only a couple of features that C does not provide. If you are interested in what this code does, to try to include into t2sami, take a look at http://www.geocities.com/mcpoodle43/SCC_TOOLS/CCExtract.bdl which is the source for the CCExtract tool. It is fairly straightforward to understand if you are familiar with C. I will try to put printfs or something in there and see what paths the code takes, if I can...
jmemmott
12-27-2006, 05:39 PM
What I was trying to say was that the CCExtract tool on that page, which runs in an app called General Parser, is actually successful in retrieving the captions from an mpeg file from tivodecode.
Sometimes I miss the point - thanks for taking the time to clarify.
You are right, it appears the logic in the code will allow me to extract the closed captions from both 240's and 540's. I stole the basic idea from the code you are pointing to (only the ATSC caption logic is needed) and rewrote the T2Sami code to incorporate it. I have run it against files from both models and am getting data out. It is somewhat scrambled because I haven't added the reordering logic for I, B and P frames yet but that should not take too long.
I should be able to release a new version of T2Sami shortly. It already incorporates a download desktop and automatic conversion logic to bring the files from the Tivo and automatically extract the captions as part of the download. Automatic transcoding to DVD Author format with subtitles is also well under way but in view of this ability to support the 540, I don't think I will wait to finish that part before I put out another version. DVDs will have to follow later.
I and a lot of others owe you another one.
Thanks!
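The I, B and P frame reordering mentioned above comes from the fact that MPEG-2 stores pictures in coded order, not display order, so caption pairs attached to pictures come out shuffled. A minimal Python sketch, assuming each caption pair has already been tagged with the `temporal_reference` of its picture and grouped by GOP (the function name and data shape are assumptions for illustration, not the T2Sami implementation):

```python
def reorder_gop(coded_frames):
    """Return caption pairs in display order for a single GOP.

    coded_frames: list of (temporal_reference, cc_pair) tuples in the
    order the pictures appear in the stream (coded order).
    temporal_reference is the display position of the picture
    within its GOP, taken from the MPEG-2 picture header.
    """
    return [cc for _, cc in sorted(coded_frames, key=lambda f: f[0])]


# Coded order of a typical GOP: I, B, B, P, B, B
gop = [(2, "LO"), (0, "HE"), (1, "L "), (5, "D!"), (3, " W"), (4, "OR")]
print("".join(reorder_gop(gop)))
```

Sorting per GOP keeps the work local; the temporal reference restarts at each GOP, so pairs from different GOPs must not be sorted together.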
FrooBrar
12-27-2006, 06:03 PM
You are right, it appears the logic in the code will allow me to extract the closed captions from both 240's and 540's. I stole the basic idea from the code you are pointing to (only the ATSC caption logic is needed) and rewrote the T2Sami code to incorporate it. I have run it against files from both models and am getting data out. It is somewhat scrambled because I haven't added the reordering logic for I, B and P frames yet but that should not take too long.
Do you think you could put the code up somewhere? What language is it? I don't do Windows unless absolutely necessary (which is why tivodecode happened), so I want to try to get some sort of portable caption extractor together, and it looks like it may be easier to port the guts of t2sami to POSIX C rather than starting off with the CCExtract which is what I was going to try. Looks like it can generate .srt files already, right?
PeteEMT
01-07-2007, 02:30 PM
I've tried the beta but am having an issue, it seems it reads the first caption and then stops.
The .smi file has them all extracted correctly.
I've attached it, the only line I get is the , tried this with numerous shows...Sometimes text but always just the first line.
Using Windows Media Player 11, Extracted from a S2DT with t2sami desktop.
FrooBrar
01-07-2007, 03:38 PM
Maybe you need double quotes around the attribute values? I think these are required for valid XML, and I noticed that the first line has quotes around the start attribute on the sync tag while the others do not...
EDIT: I just tried this and mplayer does not display subtitles for smi files with double quotes around the sync start attribute. Guess this is not it...
jmemmott
01-07-2007, 06:01 PM
I've tried the beta but am having an issue, it seems it reads the first caption and then stops.
The .smi file has them all extracted correctly.
I've attached it, the only line I get is the , tried this with numerous shows...Sometimes text but always just the first line.
Using Windows Media Player 11, Extracted from a S2DT with t2sami desktop.
Actually it is a problem at the bottom of the file :
<SYNC Start=2159481>
<P CLASS=ENUSCC>HEY, BUGS, YOU WOULDN'T KNOW</P>
</SYNC>
<SYNC Start=10000>
<P CLASS=ENUSCC> </P>
</SYNC>
</BODY>
</SAMI>
It doesn't like those lines where start drops to 10000. Use WordPad to edit the .smi file and delete those three lines. I believe you will get captions then. (Now to figure out why they are there???)
PS. - I found the cause. T2Sami wants to clear the last caption ~ 10 seconds after it is displayed but I had to rewrite all of the timing code when I got the 540 captioning working. This was something I missed updating so it is using 10000 ms after nothing as a value. I was planning to put out a new version tonight with updated documentation anyway. I will put a fix in for this before I do.
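For anyone who would rather script the manual WordPad edit described above, here is a small Python sketch that drops any SYNC block whose Start value goes backwards. It assumes every block is closed with an explicit `</SYNC>` tag, as in the excerpt above; the function name is made up for this example and it is not part of T2Sami.

```python
import re

# One SYNC block: opening tag with a numeric Start, body, closing tag.
SYNC_BLOCK = re.compile(
    r'<SYNC\s+Start="?(\d+)"?\s*>.*?</SYNC>\s*',
    re.IGNORECASE | re.DOTALL,
)


def drop_backward_syncs(sami_text):
    """Remove SYNC blocks whose Start is earlier than the previous one."""
    last_start = -1

    def keep_or_drop(match):
        nonlocal last_start
        start = int(match.group(1))
        if start < last_start:
            return ""  # out-of-order block, e.g. the stray Start=10000
        last_start = start
        return match.group(0)

    return SYNC_BLOCK.sub(keep_or_drop, sami_text)
```

Typical use would be to read the .smi file into a string, pass it through `drop_backward_syncs`, and write the result back out.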
FrooBrar
01-07-2007, 07:52 PM
Actually it is a problem at the bottom of the file :
<SYNC Start=2159481>
<P CLASS=ENUSCC>HEY, BUGS, YOU WOULDN'T KNOW</P>
</SYNC>
<SYNC Start=10000>
<P CLASS=ENUSCC> </P>
</SYNC>
</BODY>
</SAMI>
It doesn't like those lines where start drops to 10000. Use WordPad to edit the .smi file and delete those three lines. I believe you will get captions then. (Now to figure out why they are there???)
I have been getting these as well, all along, but I have never run into problems with it in mplayer at least.
PS. - I found the cause. T2Sami wants to clear the last caption ~ 10 seconds after it is displayed but I had to rewrite all of the timing code when I got the 540 captioning working. This was something I missed updating so it is using 10000 ms after nothing as a value. I was planning to put out a new version tonight with updated documentation anyway. I will put a fix in for this before I do.
Cool. If you send me a copy of the updates I can patch it into my unix port as well. BTW, I just updated the command line test app to set the various options which live in the options dialog in your gui code (font size/family/weight, sync bias, cutoff duration) as optional command line options in my cvs repository.
PeteEMT
01-07-2007, 09:51 PM
Yep we're good! Thanks!
(I did use a regex to add the " " around attributes but that didn't have any effect either way)
thorus
02-25-2007, 01:48 AM
Is there a linux version of this available?
FrooBrar
04-01-2007, 07:19 PM
Is there a linux version of this available?
I just put up an alpha source release on the tivodecode sourceforge project:
http://sourceforge.net/project/showfiles.php?group_id=183716
thorus
04-01-2007, 11:57 PM
Woot. You rock :D
professore
10-19-2007, 10:57 AM
I should be able to release a new version of T2Sami shortly. It already incorporates a download desktop and automatic conversion logic to bring the files from the Tivo and automatically extract the captions as part of the download.
Thanks!
Jmemmott:
I have a deaf student, and have been trying to arrange for closed-captioned videos. Have your T2Sami installed, and used the desktop to download the video from the TiVo box. It worked flawlessly, but I still don't see closed captioning when I play the video. I see the .smi and .xml files, but their functions don't seem to be incorporated into the recording. What should I do?
professore
jmemmott
10-19-2007, 12:24 PM
If the .smi file contains the text of the captions (you can open it with NotePad or WordPad to see) and you are not seeing the captions during playback, then you do not have captioning turned on in your playback software. Some players require an external filter to add the captions, but if you are using Microsoft Windows Media Player, the function is built in. For WMP 10 & 11, under the Tools menu, select Options, then the Security tab, and make sure the "Show local captions when present" check box is checked.
professore
10-19-2007, 12:54 PM
Thanks, that worked. I had previously set WMP to show closed-captioning, but for unknown reasons, it was no longer set that way. The video shows closed-captioning now.
One more question: I frequently need to edit videos, excerpt a clip, and so on, and I note from previous posts that I should edit the file before downloading it through T2sami. But the products that are commercially available for Windows will not allow me to edit TiVo files. I purchased the entire Roxio MY DVD full suite, but the program only permits me to burn entire programs to DVD. I need excerpts in digital format, not on DVD.
I also have a Mac, and have been using freeware, a temporarily available program, until Toast is redesigned, but even if it is, I fear it will operate like Roxio, not allowing me to make excerpts. Are there any other options? I use these videos for educational purposes.
professore
jmemmott
10-19-2007, 02:00 PM
I am not a Mac person so I can't help you there. On the Windows side, I have been a long time user of VideoReDo and strongly believe that it is the best choice. If you check this forum, you will find a number of discussions about its strengths, weaknesses and capabilities. The clips it creates contain the closed captioning information so you can post process them to get it out when you are done editing.
professore
10-19-2007, 07:45 PM
Thanks very much. I will check it out.
professore
neumeier
05-31-2009, 07:33 PM
I see this thread has not been updated in a while so here is a quick update for those still interested. CCExtractor works like a charm. Get the Windows installer here: ccextractor.sourceforge.net/ccextractor_for_windows.html
Good luck
Zeev
jmemmott
05-31-2009, 11:43 PM
I see this thread has not been updated in a while so here is a quick update for those still interested. CCExtractor works like a charm.
I think that the main reason this thread has not been updated in a while is that most of the discussion with respect to t2sami and closed captioning shifted into threads with a broader appeal such as the streambaby and pytivo threads where they become part of the larger solution rather than an isolated topic. As a result, this thread’s contents with respect to the capabilities and/or limitations of t2sami are well out of date.
That said, ccextractor is a good program and I have recommended it to non-Windows users as a substitute for t2sami. These two programs do have different target audiences and different purposes, however.
ccextractor excels at extracting closed captions from a broad range of video sources.
T2sami is focused on captioning support for the Tivo user and TiVoToGo. For this reason, t2sami, unlike ccextractor, is two-way. It will extract captions from .tivo and mpeg files but it can also inject captions back into mpeg files that can be viewed on a Tivo. Within this “Coming back” category of capabilities, t2sami can use a range of sources such as .srt, .mkv and DVD .vob files as the captioning source.
For example with DVDs, t2sami will process either closed captions or subtitle streams and convert them into usable formats. I believe ccextractor can only handle closed captions. T2sami also understands the DVD structure so it can correctly process a single .vob file containing multiple titles, such as background features and serial episodes. It can also handle long titles that span multiple .vob files. I believe a DVD needs to be preprocessed by additional external tools for ccextractor to handle either case correctly.
The GUI portion of t2sami is a separate issue. It supports a range of Tivo capabilities such as downloading, playback and decryption to minimize the number of different programs a hearing impaired Tivo user must learn to get full captioned TivoToGo capability. The total set of these capabilities is beyond the scope I want to address here. If you are interested, I suggest downloading it and taking a look. Like ccextractor, t2sami is free to download and use.
The primary criticism of t2sami is that it has been a Windows only program. This has been a reasonable criticism and now that t2sami has matured under Windows, I am taking steps to address this as well. I am currently compiling and running the t2sami command line utilities t2extract and t2merge using Code::Blocks and gcc under Ubuntu. I expect to be able to release these for public use in the next month. Unfortunately the GUI desktop will take longer as it will take a much more major architectural transformation to move it to Linux. I don’t have a target date for that yet.
ricksd
10-08-2009, 02:53 AM
[...] I am currently compiling and running the t2sami command line utilities t2extract and t2merge using Code::Blocks and gcc under Ubuntu. I expect to be able to release these for public use in the next month.
Hi! I would really love to have access to Linux versions of t2extract and t2merge. Any chance you could make those available (source code form would be fine as well) since ideally I would like to run them on Centos 5.3.
I have verified that the programs do what I need (by moving files to Windows, doing the processing, and then moving back to my Linux server). I tried to get the Windows versions to run on Wine, but there were some DLL problems.
Thanks!
jmemmott
10-08-2009, 11:34 AM
Hi! I would really love to have access to Linux versions of t2extract and t2merge. Any chance you could make those available (source code form would be fine as well) since ideally I would like to run them on Centos 5.3.
With respect to the source, releasing it is unlikely: first, it contains some code that I am free to use but not free to release so I would have to strip that functionality out and/or entirely rewrite it. Second, I tried it once before and it was a failure. When t2sami was much simpler, I created a source version of t2extract that was free from any code that I couldn’t release and added it to the tivodecode project on SourceForge. There was no one on the Linux side to pick it up and I found it increasingly difficult to keep it in sync with my Windows version. Eventually it was abandoned as is.
On the positive side…
Linux is important and my intention for some time has been to rectify that earlier failure and provide a suitable version. To that end, I created a single development environment that supports both Microsoft Visual Studio (the original environment) and Code::Blocks. My current sources for t2extract and t2merge build and run under Windows with either development environment. Most of the functions in t2extract build and run under Ubuntu with Code::Blocks. DVD subtitle processing is still under construction. t2merge comes next.
My development is subject to a squeaky-wheel principle, so I have been a bit distracted this year: first by significant architectural changes to the Windows code to support differences in captioning protocols with some of the all-digital transmission formats, and second by adding DFXP support for a demonstration of Silverlight captioned video streaming so I could wade into the Netflix IW controversy. That work is pretty much behind me and the current economic environment is leaving me more discretionary time to work on this, so my current plan is to release an Ubuntu version soon. After that I will try to work out a mechanism to distribute an object library that can be linked with the GCC runtime library to allow it to be ported to other Linux variants. I will be in new territory with this last task so I am open to advice.
wmcbrine
10-08-2009, 03:28 PM
As a Linux user, I have to tell you, we aren't very interested in closed source software, even if it's free (gratis). If you thought you got a poor reception before, when you did open it, I can guarantee, the reception will only be worse for binary blobs.
jmemmott
10-08-2009, 05:08 PM
Cool…
You and I have touched on this before and my goals still do not match the profile you seem to keep trying to put me into. T2Sami was written to help people I personally knew, not to claim a marginal bit of fame or attract a following for the software. I release and support it as I do because I have been lucky enough to meet additional hard of hearing people that find it useful as it is. It is something I can do and give back. That would be a satisfying place to end the story as far as I am concerned.
The Linux question keeps coming up and I do care about hearing impaired individuals that prefer Linux - so I try to find a compromise that works in my world. If there are Linux users that find it helpful when I am done – great. If not, no harm done. If that doesn’t fit the party line for Linux programmers – that’s fine. It’s not my intention to be involved in that anyway.
If you want to pick up that torch and help improve a program like ccextractor that is already open source and cross platform, I will help you. Once it meets everyone’s needs and does everything t2extract and t2merge do, you can put me out of business. I can then switch to other captioning projects under Windows that are already on my radar anyway.
bicker
10-08-2009, 05:20 PM
That sounds like a wonderful solution for all involved.
moyekj
10-08-2009, 05:55 PM
FYI, I tried ccextractor briefly and can tell you that t2sami is MUCH better from my experience. I could not get anything useful out of ccextractor for most video files I tried (originating from TiVo recordings). There was a request for a linux based solution for captions extraction as part of kmttg which is why I looked into it. Turns out however that t2sami seems to work perfectly fine using "wine" under linux and so that seems to be a better solution. The person requesting the feature settled on writing a wine-based t2extract wrapper script which he configured as the t2extract executable in kmttg and it worked just fine.
txporter
11-11-2009, 12:41 PM
Resurrecting this old thread as I don't really know where to put it. I had been using ccExtractor to pull closed captions off of DVDs with no subtitles. I actually had good success with it, but I wanted to try a bit more automated solution. I tried T2SAMI and it seemed good. But then I started noticing that fairly often 2 characters from the closed caption stream were "lost" or dropped. They simply were not extracted. I tried the same source material in ccExtractor and all of the characters are there. (I was doing this with Rescue Me Season 4, Disc 2, Ep. 6).
I like that I can convert a bunch of DVD episodes with VRD into single mpg files and then open T2SAMI and quickly batch convert those episodes. But I don't like the dropped characters. I also like that ccExtractor can remove the ALL CAPS from those captions that are written as such. I spent a couple of hours making a Names file that ccExtractor uses for capitalization rules, so I can get most of what I want from it now.
Is there some setting in T2SAMI that I can change to affect the dropped characters? I can post some snippets of .srt files between T2SAMI and ccExtractor later tonight if you want to see what I am talking about.
jmemmott
11-11-2009, 03:11 PM
I can post some snippets of .srt files between T2SAMI and ccExtractor later tonight if you want to see what I am talking about.
Unfortunately, looking at the .srt file doesn't help much - the damage is already done. To resolve anything, I would need a short piece from the .mpg file that suffers from the problem. Since you have VRD, it would be simplest to use it to create a small sample that suffers from the bug. If you want to do that and need a place to put it to get it to me, PM me and I will give you a location.
orinaccio
11-13-2009, 04:48 PM
I noticed this same problem too, which is why I'm still using ccextractor. The dropped character bug renders T2Sami useless for me, which is a shame, because great apps like iTivo rely upon it for extracting .srt files.
At the moment I use kmttg and ccextractor to generate video + .srt files.
txporter
11-13-2009, 04:54 PM
I have exchanged some emails with James on this problem. I gave him a clip to show the problem and he is working on it. Hopefully he will find a solution soon!
Jason
lrhorer
11-13-2009, 05:19 PM
You and I have touched on this before and my goals still do not match the profile you seem to keep trying to put me into.
It's your code, so you are free to limit the appeal as much or as little as you choose.
The Linux question keeps coming up and I do care about hearing impaired individuals that prefer Linux - so I try to find a compromise that works in my world. If there are Linux users that find it helpful when I am done – great. If not, no harm done. If that doesn’t fit the party line for Linux programmers – that’s fine. It’s not my intention to be involved in that anyway.
It's not Linux programmers. It's Linux, period. The entire point of Linux is to foster an open, platform independent environment. I am not a Linux developer, but I agree with William 250%, here. Closed source Linux applications are not well received. One big reason is they only work on a limited and potentially outdated set of hardware, or vice-versa may not work on legacy hardware.
jmemmott
11-13-2009, 08:40 PM
I have exchanged some emails with James on this problem. I gave him a clip to show the problem and he is working on it. Hopefully he will find a solution soon!
Jason
I have uploaded a new version, 3.2.0066, for you to try on some of your longer videos. It is now working correctly on the clip I obtained from you.
I have also added a “Sentence Case conversion” check box to the captioning options dialog. If you check it, it will monitor the captions as it generates them and convert ALL CAPS captions to simple sentence case, i.e. the first letter of the sentence is capitalized, with the rest being lower case. It is not correct English but it is the simplest to implement without language processing and achieves most of the benefit you are seeking. It will not try to convert captions if it detects a mixture of upper and lower case as that would likely remove correct capitalization from existing captions. This seems to be working with .srt and timed text captions but there are still issues with SAMI captions. I will update again as soon as I rectify this but I know the dropped characters are your main priority and did not want to wait to give you something to try out.
Let me know if it is working with the rest of your videos and/or you have any other problems. Thanks for catching this for me.
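As a rough illustration of the sentence case rule described above, here is a Python sketch. It is not the T2Sami code; the function name is made up, and the sentence boundary rule (a period, exclamation mark or question mark followed by whitespace) is an assumption.

```python
import re


def to_sentence_case(caption):
    """Convert an ALL CAPS caption to simple sentence case.

    Captions that already contain any lower case letter are returned
    unchanged so that existing correct capitalization is not destroyed.
    """
    letters = [c for c in caption if c.isalpha()]
    if not letters or any(c.islower() for c in letters):
        return caption
    lowered = caption.lower()
    # Upper-case the first letter of the caption and the first letter
    # after each sentence-ending punctuation mark.
    return re.sub(
        r"(^\s*|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )


print(to_sentence_case("HEY, BUGS, YOU WOULDN'T KNOW"))
```

As the post says, the result is not correct English: names such as "Bugs" and the pronoun "I" end up in lower case, which is what a names file in ccExtractor is meant to address.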
txporter
11-13-2009, 10:46 PM
Thanks for the quick turn, James! I tried it on the rest of the Rescue Me episode and it works great. Then I tried it on an episode of House and an episode of Mentalist. It still missed quite a bit on House, and much more rarely now on Mentalist, but the problem is still there. I have added clips to the same place as before, as well as a complete rip of the Rescue Me captions if you want to see it.
Dropped characters on Mentalist 16 and 30. House still drops characters on most caption blocks.
I haven't tried out the cap change stuff yet but will give that a whirl soon.
Jason
jdratlif
11-14-2009, 09:18 PM
At the moment I use kmttg and ccextractor to generate video + .srt files.
And you don't have any problems with ccextractor and Tivo files? It fails on more than 80% of things I get from my Tivo.
I emailed them and sent them a sample, but I never heard back. I was using ccextractor 0.55. Windows and Linux versions both failed.
t2sami isn't perfect, but it worked a lot better than ccextractor for me.
txporter
11-15-2009, 08:46 PM
And you don't have any problems with ccextractor and Tivo files? It fails on more than 80% of things I get from my Tivo.
I emailed them and sent them a sample, but I never heard back. I was using ccextractor 0.55. Windows and Linux versions both failed.
t2sami isn't perfect, but it worked a lot better than ccextractor for me.
This is interesting. I have had no issues with ccExtractor. Up until about the last week, I had been using 0.53. I have since installed 0.55 (both on Vista64). Neither one gave me any issues. What do you mean when you say failed? It doesn't even find a stream?
I am wondering if different regions of the country do things differently with Line 21 in the MPEG stream?
Jason
orinaccio
11-18-2009, 03:27 PM
I had issues with iTivo and ccextractor. When I used kmttg and ccextractor it worked perfectly, and I've been using it since.
I used the OSX build of ccextractor by the way. I cannot recall the version number but the last upgrade that I installed was a month ago.
oregonalex
12-28-2009, 01:17 AM
I too am trying to extract captions from TiVo HD recordings into .srt files using t2extractor.exe via kmttg.
On about 20% of caption lines there are character pairs either missing or not in the right place (usually 5 characters forward).
It is program independent; the same problem happens on all downloaded TiVo programs I have tried. The funny thing is that ccExtractor makes the exact same mistakes on these clips. The TiVo playback itself, however, displays the captions correctly.
I'd love to hear if anyone is getting flawless extraction and any idea on how to fix this (short of hand editing the .srt as I am forced to do now).
As I am not yet allowed to post links, the samples are in the following paths on my web site at cyber-strategy.org.
The t2extract generated sample is here:
/priv/NatureT2.srt
Subtitles 2,7,11,17 are clobbered. Here is a hand fixed version of the CC as displayed by TiVo HD:
/priv/NatureT2Cor.srt
Here is the corresponding program snippet (60.0 MB - 62,945,784 bytes). It is cut with Video Redo to manageable size, but I have confirmed that the same problem happens with the original .tivo files untouched by any editing software:
/priv/NatureT2.mpg
TIA for any input.
jmemmott
12-28-2009, 12:21 PM
I picked up the files and will look at them in depth to see if there is anything I can do but I am not hopeful.
I ran two other experiments with your clip. First, I used the closed captioning display in VideoReDo to see how the captions look in that program. Then I sent your clip to my Tivo using pyTivo and played it back. In both cases, it is showing the same results that t2sami is seeing.
I will put forth a hypothesis that it is a timing issue that is created while remuxing the program data. Captions in .tivo/.mpg files do not carry timing information directly. It has to be inferred from the location of the caption data relative to picture data, which does carry that information. If caption data isn't positioned correctly, captions come out scrambled. The Tivo does not use the .tivo or .mpg format internally. It stores the incoming digital broadcast format on disk and uses that for playback. If you take it off using TivoToGo, it is remuxed into an .mpg program stream. It doesn't appear to get all of the caption data in the right place in this video when it does this. I am not sure it will be possible to correct for this on the PC side.
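To make the hypothesis concrete, here is a toy Python model of how caption timing is inferred from position. It is a deliberately simplified sketch, not how t2sami or the Tivo actually parses a stream: the stream is assumed to be a flat list of ("pic", timestamp) and ("cc", character pair) items, and the names `assign_times` and `render` are invented for the example. Timestamps are in 90 kHz ticks, so 3003 ticks is one frame at 29.97 fps.

```python
def assign_times(stream):
    """Give each caption pair the timestamp of the picture it follows."""
    timed = []
    current_pts = None
    for kind, value in stream:
        if kind == "pic":
            current_pts = value          # picture data carries the timestamp
        elif kind == "cc" and current_pts is not None:
            timed.append((current_pts, value))
    return timed


def render(timed):
    """Join caption pairs in presentation-time order (stable sort)."""
    return "".join(pair for _, pair in sorted(timed, key=lambda t: t[0]))


# Caption pairs sitting next to the right pictures.
well_placed = [("pic", 0), ("cc", "HE"), ("pic", 3003), ("cc", "LL"),
               ("pic", 6006), ("cc", "O!")]

# Same pairs, but "LL" ended up two pictures late after remuxing.
misplaced = [("pic", 0), ("cc", "HE"), ("pic", 3003),
             ("pic", 6006), ("cc", "O!"), ("pic", 9009), ("cc", "LL")]

print(render(assign_times(well_placed)))
print(render(assign_times(misplaced)))
```

In this model a single misplaced pair is enough to put two characters in the wrong spot while everything else still reads correctly, and nothing in the caption bytes themselves reveals where the pair should have gone.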
oregonalex
12-28-2009, 11:28 PM
I will put forth a hypothesis that it is a timing issue that is created while remuxing the program data.
Thank you very much for your very educational and enlightening post. I was not aware that TiVo remuxes the program when I download it to the PC. If that's the case, then your hypothesis sounds very feasible. I will also try to pyTivo the original .tivo file back to my TiVo HD, but I am sure it will only confirm what you are seeing.
Judging from the TiVo CC treatment in general, handling captions is obviously very low priority for them so I am not holding my breath, but I will report it to them anyway. I have found another bug in CC UI on the TiVo that I want to report also.
It is too bad, as the corruption appears on ALL programs I download. I imagine everybody must see the same problems. I wonder if non-HD TiVos have the same problem.
Spell checking the .srt file catches many, but not all the glitches. Oh well...
Thanks again for taking the time to look into this.
jmemmott
12-30-2009, 03:18 PM
It is too bad, as the corruption appears on ALL programs I download. I imagine everybody must see the same problems. I wonder if non-HD TiVos have the same problem.
Fortunately for most people, this appears to be the exception rather than the rule. Doing development work on t2sami and using it in my household has allowed me to see a fair number of clips from my own provider (Santa Cruz Comcast digital) as well as a selection of problem clips from other providers. Most are reasonably free of errors. When there are problems, they are usually associated with specific networks such as TMC and MPLEX. Problems on HD are more common than on SD. Different providers and/or channels encode their programming in their own unique ways and the Tivo does not handle the conversion of all of them equally well.
oregonalex
12-31-2009, 06:30 AM
...Problems on HD are more common than on SD. Different providers and/or channels encode their programming in their own unique ways and the Tivo does not handle the conversion of all of them equally well.
We are apparently very much on the bleeding edge here. As I only work with HD clips, I may be more afflicted than others.
By the way, yesterday I encountered a clip with a different problem: the srt timing info is seriously off. This time, however, the ccExtractor generated srt is correct; only T2Sami has the problem. I have uploaded it to my server (cyber-strategy.org), so if you want to look at it, you are welcome to it.
/priv/timing.mpg 94.8 MB (99,497,988 bytes)
/priv/timingofft2sami.srt T2Sami generated
/priv/timingccext.srt ccExtractor generated
Hope it helps.
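For anyone wanting to put a number on how far off two .srt files are from each other, a small Python sketch along these lines could help. It is only an illustration: the function names are invented, and it pairs cues by position, which assumes both tools split the captions into the same cues (that may well not hold for T2Sami versus ccExtractor output).

```python
import re

# Matches the start time of an SRT cue line such as
# "00:01:02,345 --> 00:01:04,000".
TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->")


def start_times_ms(srt_text):
    """Return cue start times in milliseconds, in file order."""
    times = []
    for h, m, s, ms in TIME_RE.findall(srt_text):
        times.append(((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms))
    return times


def start_offsets_ms(srt_a, srt_b):
    """Pair cues by position and return the (b - a) start-time differences."""
    return [b - a for a, b in zip(start_times_ms(srt_a), start_times_ms(srt_b))]


a = "1\n00:00:01,000 --> 00:00:02,000\nHi\n\n2\n00:00:05,500 --> 00:00:06,000\nBye\n"
b = "1\n00:00:03,000 --> 00:00:04,000\nHi\n\n2\n00:00:07,500 --> 00:00:08,000\nBye\n"
print(start_offsets_ms(a, b))
```

A roughly constant offset would point to a fixed start-time error, while an offset that grows through the file would suggest drift.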