Open Source Transcription Software? 221
sshirley writes "I am beginning to do some interviews with family members and will do some audio journals for genealogy purposes. I would really love to be able to run the resulting MP3 or WAV files through some software and get a text file out. I know that software like this exists commercially. But does this exist in the open source world?"
CMU Sphinx (Score:3, Informative)
Looks active.
Sphinx (Score:5, Informative)
Best Idea (Score:3, Informative)
Just upload it to YouTube; its genius Google transcription technology will make sense of everything.
youtube (Score:1, Informative)
Upload to YouTube and let it create closed captions. The results won't be perfect, but they will be better than most software's.
But Windows Speech Recognition... (Score:5, Informative)
Most Windows Vista or Win7 machines come with a built-in transcribing feature that you can enable in the Control Panel (in Win7, under Ease of Access > Speech Recognition).
However, the only way it works properly is if you train it to understand you personally. You load your profile and it runs you through a whole bunch of test sentences. The full training takes about 20 minutes, I think (it's been a while since I've used it), and it actually works quite well. There's a cut-off point at about two and a half minutes if you want to stop and try it out. It can even make your machine keyboard- and mouse-less if you want: when you open a browser it highlights everything on the web page that's clickable and assigns it a number, and you simply say "Click 7" and it hits the reply button for you. Then you talk when the textbox has focus and it transcribes every word you say.
I did this for my girlfriend's paper once: I read it aloud (you have to dictate things like "comma" and "end paragraph") and put it into a Word document. Out of a 15-page single-spaced essay it got 3 sentences wrong, and that's only because I was mentioning some of the more obscure Greek names (she's a history major). It managed to get full sentences regarding Octavia and her fondness for libraries without error, which I thought was odd since that's not a name you hear every day.
Anyway, if he wants to do this, he should record the test phrases (there will be a lot, though), have each of his interviewees read the test sentences, and then play those back through the computer to train it for each person.
All in all, he may still run across a few errors, but it's not nearly as bad as, say, Google Voice Mail, which tries to figure out what you're saying without any previous knowledge of how that person speaks. Windows Speech Recognition will handle what he's after, though.
Re:I've wondered about this too (Score:4, Informative)
Re:Dear aunt, (Score:3, Informative)
It isn't comedically awful, and it likely beats typing with your stumps, or your eyelids, or whatever; but "pretty good" is being very generous.
(Again, unless things have improved markedly since then) the software works best when used interactively, which allows it to suggest corrections, and you to make them, in real time. It also helps if it has been trained to your voice beforehand. Used non-interactively, on a recording of somebody it hasn't been trained for, it will produce results error-filled enough that you might actually find manual transcription faster than manual editing (or, if you don't mind your family sounding like they've suffered head trauma or exposure to Dadaism, you can just store the recordings, make do with the text, and re-run the process in the future, when the software is better).
USB foot control (Score:2, Informative)
I looked, but still do it manually (Score:5, Informative)
I've worked on loooads of transcripts. I did most of these:
* http://wiki.fsfe.org/Transcripts [fsfe.org]
The best technique I've found is to have mplayer play the audio at 60% of normal speed and have a text editor (emacs is my preference) in another window; flick between them with Alt-Tab and hit Space to start and pause mplayer.
Re:CMU Sphinx (Score:5, Informative)
For open source you have two main options: CMU Sphinx and Julius/Julian. Both are just back-ends; you'll have to write a front-end. It shouldn't be too hard to do that, though: the source for the CMU Sphinx demos shows how to get input from a mic or WAV file (anything other than PCM will just need converting) and how to set up the various engines.
CMU Sphinx [sourceforge.net] appears to be mainly for research purposes. You can run it in a few different modes: one with a fixed grammar (for command systems; Gnome's voice control uses Sphinx in this mode), and one (the one you'd be looking for) that uses a weighted dictionary. I didn't train it to my voice (and you won't be able to train it for transcriptions), and I was getting fairly lousy recognition rates with my $20 Logitech USB microphone. It might work better with a high-quality headset, but I imagine your interviewees won't be wearing one.
Julius/Julian [sourceforge.jp] lacks a good acoustic model for English. VoxForge [voxforge.org] is working on one, but it isn't anywhere near complete.
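If you'd rather not write a real front-end, a batch pipeline is a rough sketch away: convert each recording to the 16 kHz mono PCM that Sphinx expects, then feed it to the pocketsphinx_continuous command-line decoder. This is only a sketch, and it assumes you have ffmpeg and the pocketsphinx tools installed; the flag names are taken from their docs and may differ between versions.

```python
# Sketch: batch-transcribe a recording with CMU Sphinx's pocketsphinx CLI.
# Assumes ffmpeg and pocketsphinx_continuous are installed and on $PATH.
import subprocess

def convert_cmd(src, dst="out.wav"):
    # Sphinx wants 16 kHz, 16-bit, mono PCM; ffmpeg can downmix anything.
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", dst]

def decode_cmd(wav):
    # -infile runs the decoder over a file instead of the microphone.
    return ["pocketsphinx_continuous", "-infile", wav]

def transcribe(src):
    wav = "interview_16k.wav"
    subprocess.run(convert_cmd(src, wav), check=True)
    result = subprocess.run(decode_cmd(wav), capture_output=True, text=True)
    return result.stdout
```

Don't expect miracles from the stock acoustic model on untrained voices, for the reasons above; but it gets you a rough text file per interview with no coding beyond glue.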
Here is a good article that sums up the current projects [eracc.com]
the command line (Score:5, Informative)
To play an audio file at 60% normal speed:
mplayer -af scaletempo=scale=0.6 the_file.ogg
And then to check the transcript, change the 0.6 to 1.5 (or 2.0 for someone like Richard Stallman who speaks slowly and clearly).
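If you end up doing this a lot, the invocation above is easy to wrap so the speed is a parameter (0.6 for transcribing, 1.5 to 2.0 for checking). A minimal sketch, assuming mplayer is installed; the 0.1-4.0 bounds are an arbitrary sanity check, not a documented mplayer limit:

```python
# Sketch: build the mplayer command for a given playback speed.
# scaletempo changes tempo without shifting pitch.
import subprocess

def mplayer_cmd(path, speed=0.6):
    if not 0.1 <= speed <= 4.0:
        raise ValueError("speed outside a sensible range")
    return ["mplayer", "-af", f"scaletempo=scale={speed}", path]

def play(path, speed=0.6):
    subprocess.run(mplayer_cmd(path, speed))
```

For example, `play("the_file.ogg", 1.5)` for proofreading, or 2.0 for someone like Richard Stallman.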
High quality recordings now, transcription later (Score:2, Informative)
Foot Pedal and Express Scribe best option (Score:4, Informative)
Re:Dear aunt, (Score:5, Informative)
Funny, considering my job is training doctors to use voice recognition to do all their reporting. Actually, it works fairly well. I also don't mean dictating something that goes to transcriptionists. The doctors dictate the report. The dictation is transcribed into text. They review it and sign off. We got rid of all our transcriptionists years ago. The time for a report to get done went from 24 hours with transcriptionists to 24 minutes with voice recognition. The number of errors was cut in half. The doctors' workload was also lessened, as they could check the final version while still dealing with the data rather than having to go back and review everything all over again a day or two later. Speech recognition was a problem seven years ago, but hardly at all in the last five or so. Yes, they have to go over their dictations and occasionally make some minor corrections. There's always background noise to worry about, and some people's accents are hard even for another person to get through, but for things that require quick turnaround and need to be verified by the person who is doing it, voice recognition already is the gold standard.
PS: several of the doctors like it so well they bought Dragon (pretty much everybody but Philips uses Dragon for their speech engine) for home and use it there for all their email and other writing.
Coding Horror article (Score:2, Informative)
Coding Horror recently posted an article about the current voice recognition technology.
http://www.codinghorror.com/blog/2010/06/whatever-happened-to-voice-recognition.html [codinghorror.com]
One poem got transcribed, and the title came out like this:
"a poem by Mike Bliss --> a poem by like myth"
The rest of the poem is equally funny. So basically, you'd better transcribe it manually.
Re:Dear aunt, (Score:3, Informative)
by making informed guesses based on context, which a computer program cannot do.
The Perl interpreter can.
Re:CMU Sphinx (Score:2, Informative)
Sphinx is what many companies use to get started, but it's far too raw to be useful by itself. You need to extend the HMM back-end extensively... and train it. Even then, your success rate is only around 80%, meaning 1 word in 5, even spoken slowly, will still be wrong.
Re:XTrans (Score:4, Informative)
Ubuntu uses PulseAudio on the ALSA audio subsystem, but that error message indicates XTrans is trying to use the OSS audio subsystem instead. To work around this, try using the Pulse OSS wrapper or temporarily disable Pulse. From the command line: "padsp xtrans" or "pasuspender xtrans".
Re:Dear aunt, (Score:2, Informative)
The parent's link is exactly what I set up on a client's machine. They purchased the headset and pedals but the software itself was free and worked wonderfully.
State of Speech Reco (Score:4, Informative)
Re:I looked, but still do it manually (Score:3, Informative)
I'd agree. I did some part-time work transcribing audio a while back for extra pennies. One thing I would add: instead of using Alt-Tab to switch applications and then hitting Space to start/stop, I found it less frustrating to set up global keys for the purpose (I was using KDE at the time; I expect most desktops offer this).
I assigned F12 to skip back 5 seconds and F9 to pause/restart. Using those (especially F12), it was relatively easy to keep up with what was being said without switching away from the editor.
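For mplayer specifically, global hotkeys can drive it through its slave protocol: start it with `mplayer -slave -input file=/tmp/mpfifo audio.ogg` (after `mkfifo /tmp/mpfifo`) and have each hotkey write a command into the FIFO. A minimal sketch; the FIFO path and the "back5"/"pause" names are my own, but "seek -5 0" (relative seek) and "pause" are standard slave-mode commands:

```python
# Sketch: send mplayer slave-mode commands from global hotkey handlers.
def slave_command(action):
    # "seek -5 0" = jump back 5 seconds (relative); "pause" toggles pause.
    commands = {"back5": "seek -5 0", "pause": "pause"}
    return commands[action] + "\n"

def send(action, fifo="/tmp/mpfifo"):
    # Bind e.g. F12 -> send("back5") and F9 -> send("pause") in your
    # desktop's global-shortcut settings (via a tiny wrapper script).
    with open(fifo, "w") as f:
        f.write(slave_command(action))
```

This keeps focus in the editor the whole time, same as the F12/F9 setup above.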
Re:I looked, but still do it manually (Score:3, Informative)
I've done loads of transcripts too.
The best software I found was the Olympus DSS Player 2002, which came bundled with the expensive Olympus digital recorders (the cheap ones came with bare-bones software). It was like the old mechanical tape-transcribing machines, except much better, with an adjustable back pedal, 50% slow speed, 200% fast speed, fast forward, fast back, etc. The newest version is probably better.
Problem was, it was optimized for Olympus's proprietary *.DSS format, although you could use *.WAV with some limitations on features.
Sony and the other digital recorders also had playback software; I haven't checked them out but they're probably equivalent.
NCH Scribe (free) could have been a clone of the Olympus player; it didn't work as well when I tried it, although later versions may work better.
These programs can be overkill, though; mplayer at 60% speed sounds like it would work well.
Re:CMU Sphinx (Score:5, Informative)
"... unless you're not a programmer."
Seriously, since when is "program it yourself" a solution to "are there any open source software packages that do what I want?"? The answer you're looking for is "no". That's the correct answer.
Here, here's a nice car analogy, since we're on Slashdot: when you need a car, do you buy a kit car, or do you buy one factory-built? This is like telling someone who wants a car to drive to work that they should simply buy a Chevy big-block engine and build the rest from scratch. Just because I need a car doesn't mean I must be an automotive engineer and metal fabricator. Similarly, just because I need dictation software doesn't make me a software architect or a linguist. Directing this person to program their own software is not answering the question.
Cripes. People wonder where the "open source is only free if your time has no value" line came from.
Re:Dear aunt, (Score:2, Informative)
13 years ago, when I entered the medical transcription industry, the fellow who sold us our dictation system told me that he was a dead man walking: voice recognition was going to KILL the transcription industry, and he almost felt guilty selling us the system. When we mentioned we had looked at Dragon, he practically cried. 13 years later, that salesman is now deceased, and the transcription industry is larger than ever. Voice recognition in transcription is like Linux on the desktop: every year, articles pop up saying that THIS year will be the year medical transcription dies at the merciless hands of voice recognition.
For a guy in an industry that Netcraft has confirmed is deader than FreeBSD, I'm doing pretty well.
I now own a medical services company that does transcription, so my opinion is certainly biased here, but I fail to see the economic logic in turning a physician, who makes between $120K and $250K per year, into a clerical worker editing his own files. Especially when said clerical worker can be seated in India. Time is money, and the time of physicians and surgeons is one of the most expensive line items on your medical bill. Even with transcription prices as they are today, tacking 20 minutes of extra editing time onto a doctor's already-long workday means that I can do it cheaper with manual labor. Voice recognition just means that I need one MT and a voice recognizer instead of one transcriptionist and a QA person.
Internally, we use a batch speech recognizer based on Sphinx, as the Dragon engine is too expensive to license at the volumes we do. As one of the earlier posters said, the code is the easy part... it's generating the speech corpus that's the really expensive part. Developing ours was easily a seven-figure outlay in labor, which is why you don't see any usable medical speech corpora available for free* on the Internet. You'd think with all the federal money being thrown at making medical records electronic that they could spare a few million to develop an open-source speech corpus, but that would make too much sense.
As long as physicians and surgeons are better paid than the rest of us, someone will be doing transcription.
--
* If you know of one, post a link...believe me, we've looked.