Software

Recorded Speech to Text Software?

shfted! asks: "Recently, I've been given the task of transcribing several dozen audio tapes of interviews to typed text, that is, listening for 10 seconds, writing what was said, and repeating. At around 4 hours per hour-long tape, I would like to automate the process somehow. Recording the tape into the computer is no problem, but I need some software that will do the speech recognition accurately rather than quickly -- several hours per tape is not an issue (I have access to several machines running 24/7). I will still have to go over the computer's work to correct any mistakes. A free solution for Linux would be best; non-free and Windows solutions are okay, but a working solution is the highest priority. Can anyone point me in the right direction(s)?"
This discussion has been archived. No new comments can be posted.

  • by Txiasaeia ( 581598 ) on Friday January 23, 2004 @09:27PM (#8072094)
    Several hours per tape is acceptable? Well, if you can do one tape in four hours, then two people can do one tape in two hours. In other words, hire a college student at minimum wage for a contract position (i.e., until the tapes are transcribed) and go to it.

    It's cost effective, as fast as you need it to be and best of all more accurate than any software solution to date. Most software packages are still at only about 90% accuracy, so that's still 24 minutes per four hour tape that you'll need to correct, and you'll still probably have to listen to the whole thing over again in order to verify the accuracy of any software program.

    • by Anonymous Coward

      Well, if you can do one tape in four hours, then two people can do one tape in two hours.

      More accurately, two people can do two tapes in four hours. While this is equivalent to one tape in two hours, it doesn't mean that two people can listen to two portions of the same tape at once.

      When I used to install DSL equipment, we often found that our suppliers substituted items on us, like 20 2" screws instead of 40 1" screws. I mean, it's all still 40", right?

      • Tapes can be copied on off time. If they are standard audio cassette tapes, then they are not more than 45 minutes per side anyway, so you are looking at several tapes anyway.

        Even assuming the worst case, 1 tape that is 4 hours long, you can feed the output of the player into the input of a computer, do an ogg (mp3) rip on the stream, and then fast forward to different places. There will be issues merging the copies, but still much less time per person than one person doing the entire thing. (but more work

        • by Anonymous Coward

          Tapes can be copied on off time.

          They can, but then you're just adding to the total time -- now the students still have to listen to the tapes, and someone has to copy them. Better to just give the students two separate tapes in the first place.

          It was something of a joke, of course.

    • by shfted! ( 600189 ) on Saturday January 24, 2004 @12:14AM (#8072866) Journal
      Actually, I am a college student hired to transcribe these tapes at $40 CAN a tape, which at 4 hours a tape is just a little above minimum wage where I live. I want to make more than minimum wage, thus my desire to automate things somewhat :) Again, my intent was to have the machine do the first pass, then I could listen and correct errors as I went. Why? I can type continuously at about 70 wpm, but people speak at around 150 to 200. However, if I have a 90% accurate copy, that means I only need to type 15 to 20 wpm to keep up, correcting on a single pass, thus reducing my time per tape to the duration of the tape.
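The arithmetic in this comment can be checked with a short sketch. The function name `correction_rate_wpm` is made up for illustration, and it assumes every misrecognized word costs exactly one retyped word, which is optimistic since spotting an error also takes time.

```python
def correction_rate_wpm(speech_wpm: float, accuracy: float) -> float:
    """Words per minute that must be retyped if the tape plays in real time.

    Assumption: each wrong word in the machine transcript is fixed by
    typing one word, with no extra time for locating the mistake.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return speech_wpm * (1.0 - accuracy)


# Speaking rates quoted in the comment: 150 to 200 words per minute.
for speech_wpm in (150, 200):
    needed = correction_rate_wpm(speech_wpm, 0.90)
    print(f"{speech_wpm} wpm speech -> about {needed:.0f} wpm of corrections")
```

At 90% accuracy both figures sit well under a 70 wpm typing speed, which is the point being made.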
      • LOL! Looks like somebody else thought of your question and... outsourced it to you :) I have sympathy, tho, as a fellow Canadian uni student... sucks to have to work while you're in school. Anyway. There *are* tape decks out there that allow you to modify the output speed, so maybe if you slowed it down to 70% normal you'd be able to keep up with the speech. There's also a tape playback system which has floor pedals, like a car; press the "gas" and it goes forward, press the "brakes" and it rewinds. I
      • Your comment just gave me an idea: why not slow the audio down to half speed (keeping the pitch, a common feature in most editing software) so that you don't need to pause/play? You could just keep yourself typing and finish a tape in two hours instead of four.

        Automatic silence removal would also speed things up.
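Silence removal can be approximated with a simple amplitude threshold. The sketch below works on a plain list of integer samples; the function name, the threshold, and the minimum run length are all assumptions chosen for illustration, and real audio editors use more robust detection.

```python
def drop_silence(samples, threshold=500, min_run=4):
    """Remove runs of at least min_run consecutive quiet samples.

    A sample counts as quiet when its absolute value is below threshold.
    Shorter quiet runs are kept so that normal pauses between words survive.
    """
    kept = []
    quiet_run = []
    for sample in samples:
        if abs(sample) < threshold:
            quiet_run.append(sample)
            continue
        # A loud sample ends the current quiet run.
        if len(quiet_run) < min_run:
            kept.extend(quiet_run)  # short pause: keep it
        quiet_run = []
        kept.append(sample)
    # Handle a quiet run at the very end of the recording.
    if len(quiet_run) < min_run:
        kept.extend(quiet_run)
    return kept


audio = [1000, 0, 0, 0, 0, 0, 2000, 0, 3000]
trimmed = drop_silence(audio)
print(f"{len(audio)} samples in, {len(trimmed)} samples out")
```

With real recordings, `min_run` would be on the order of the sample rate (roughly a second of quiet) rather than a handful of samples.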
      • Or you could outsource to India...
    • If slashdotters are a typical sample you'd end up "loosing data". You'll get the "you're = your" and all the other crap.

      Still I agree a software solution is unlikely to be better. What he should do is rapidly rip the tape to memory and then play it back at whatever speed he wants.

      Maybe there's some software that could guess the words for him and he just has to decide if the program is right or wrong in real time. Could cut it down to 1.5 hours to do 1 hour audio. Heck might even be faster if the speaker i
      • The problem is that he is being paid by the tape. Thus if he plays the tape at a slower speed, it will take him more time to finish the tape. This will result in fewer dollars per hour. If he can get a computer to transcribe the tape in 4 hours, and correct the computer on the fly, then he'll make $10/hour. If he has to play the tape at 70% speed, it'll take him almost 6 hours, at only $7 an hour. Splitting the work up between two people would be the same as far as payment goes, because tapes are bein
        • The tapes are 1 hour long. You didn't even need to read the article to see that. Four hours is how long it takes to use the start/stop method of transcription. Slowing the tape down to 70% speed and never start / stopping would take 1 hour and 26 minutes.
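The durations and hourly rates argued over in this subthread are easy to recompute. A rough sketch, assuming a 60-minute tape and the $40 per tape quoted by the original poster; the helper names are hypothetical:

```python
def playback_minutes(tape_minutes: float, speed: float) -> float:
    """Wall-clock minutes needed to play a tape at a fraction of normal speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return tape_minutes / speed


def hourly_rate(pay_per_tape: float, minutes_worked: float) -> float:
    """Effective pay per hour when one tape takes minutes_worked to finish."""
    if minutes_worked <= 0:
        raise ValueError("minutes_worked must be positive")
    return pay_per_tape / (minutes_worked / 60.0)


slowed = playback_minutes(60, 0.70)
print(f"70% speed: {slowed:.0f} minutes per tape")
print(f"start/stop at 4 hours per tape: ${hourly_rate(40, 240):.2f}/hour")
print(f"single pass at 70% speed: ${hourly_rate(40, slowed):.2f}/hour")
```

The 70% figure only pays off if the typist can actually keep up without stopping, which is the assumption behind the 1 hour and 26 minutes estimate.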
  • Simple Suggestion (Score:2, Insightful)

    by Anonymous Coward
    Given that half-decent speech recognition is still struggling, might I suggest:

    1) Give your neighbour's kid $10 to transcribe the tape one afternoon
    2) ...
    3) Text!
  • by bluGill ( 862 ) on Friday January 23, 2004 @09:31PM (#8072112)

    The technology to do this isn't really there. If the machine can learn how you speak, it can do it. If you limit yourself to just a few words (1000 perhaps?) it is easier. To do it in general for random speakers though?

    The problem is people are too varied. I have trouble understanding people from the "deep south". The accent is too thick for my ears. I'm sure they have the same problem with my accent.

    That isn't to say don't try it, but don't get your hopes up. Voice recognition is hard, and isn't done well. Just be glad you only have a few to do; my sister's full-time job is typing things like that. (most of less interest as she describes it)

    • To do it in general for random speakers though?

      The state of the art for arbitrary news broadcasts is about a 20% word error rate. While this isn't good enough for the poster's needs, it turns out to be almost good enough for indexing.

      Wonder when we'll start seeing Google return audio and video along with text documents? There's a research project demo of this happening here [limsi.fr].
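For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch, with an illustrative function name and made-up sentences rather than anything taken from the linked project:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the tapes are one hour"
hypothesis = "the tapes are won hour"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

One wrong word in five gives the same 20% rate quoted for news broadcasts, which shows how much cleanup such a transcript would still need.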
  • Existing software (Score:3, Interesting)

    by skinfitz ( 564041 ) on Friday January 23, 2004 @09:37PM (#8072153) Journal
    What about simply plugging the tape into a system running Dragon NaturallySpeaking [dragontalk.com] or IBM ViaVoice [ibm.com]?

    From the Dragon page:
    True Continuous Speech - Speak to your computer naturally and at a normal pace--without pausing between words. Your spoken words swiftly appear on your computer screen.
    • by Anonymous Coward

      Speak to your computer naturally and at a normal pace

      Yeah, drag in soft wear works wwww under fully four me. I'm you sing it write now.

    • Re:Existing software (Score:2, Interesting)

      by lambent ( 234167 )
      The problem is that you have to train the software for your voice before you can obtain any high degree of accuracy. Not possible with pre-recorded speech.

      Hell, the new voice-activated voice-mail menus that have been popping up when I dial customer service sometimes force me to say my phone number out loud. And even to do that accurately, I have to speak very slowly and quite loudly. More often, I just press random buttons until I get dumped to a live operator. (Try it, it works!)

      • Random buttons can also cause some computers to dump your call.

        Some computers that ask your address only have pre-programmed values, so if your street doesn't match one of their values (a common non-matching value is "THE"), you'll have to repeat it 3 times (while in between listening to a computer voice say that it didn't understand and ask you to repeat) before it will give up and connect you to someone live.
      • I just press random buttons until I get dumped to a live operator. (Try it, it works!)

        I do that too, but last week, while calling Oracle tech support, pushing random keys and being forced to listen AGAIN to the long-winded and patronising "did I know about metalink?" message, desperately hoping to MAKE IT STOP, it dumped me out to an operator in Switzerland.
      • The # is rarely used in those systems, so I just hit that a few times. Works well.
  • You must be joking (Score:4, Insightful)

    by Radical Rad ( 138892 ) on Friday January 23, 2004 @09:38PM (#8072158) Homepage
    Just do the tapes. It will take longer to screw with software setup and cleanup than to just do it. But if you either buy or rig up a foot switch to play/rewind the tape I think it would help. Also I am assuming you are a touch typist. If not then get someone who is to do this job for you.
    • Oops. I just noticed it said several dozen, not several.
    • by splattertrousers ( 35245 ) on Friday January 23, 2004 @09:46PM (#8072203) Homepage
      If not then get someone who is to do this job for you.

      Court reporters do this kind of thing for a living and some (all?) are contract workers. They can do it in real time and would probably be quite happy to be able to do it all at home rather than in a deposition room or court room. Oh, and their accuracy would be a lot higher than if you did it yourself without checking or if you hired a student to do it.

      Though a tech solution would be cool...

      • by T-Ranger ( 10520 )
        Court reporters are not typing on a QWERTY keyboard; they use something Wikipedia calls a "syllabic chord keyboard".

        Basically, rather than typing in characters to form words, they are typing in syllables to form words. Sometime later they transcribe the shorthand into full text. So while recording speech in real time, they are not transcribing it into full text.

        And somewhere back in my brain ISTR that pretty much all US court proceedings have been recorded on audio tape for decades. I know for a fact that the loca

      • Medical transcriptionists typically have access to good equipment, and some work from home on an as-needed basis and might be able to fit you in quickly. They have the added benefit of going from tape directly to English, rather than to shorthand. And they are easy to find: put up a note on a bulletin board in your local hospital. But the message is the same: professionals will give you nice, accurate results. Especially if there might be more work in the future.
    • I needed to subtitle a movie once. I used mplayer (www.mplayerhq.hu) to listen to the audio; it was easier than other means because it has keyboard shortcuts for rewinding (the left arrow key goes back 10 seconds), and you can adjust how far it rewinds (I changed it to 4 seconds).

      I didn't know it at the time, but you can also use it to slow down the speech, like slow motion for video; I think that makes it much easier to type.

      I don't know if it was because I also needed to translate/synchronize and wasn't a native speaker, but I gave up on it. I advice
  • by billh ( 85947 ) on Friday January 23, 2004 @09:48PM (#8072220)
    Slow the playback down and type them as you listen. If you can't do this, hire someone who can. I know many people that can keep up with spoken conversations in real-time.

    Years ago, I improved my own typing speed and accuracy by transcribing phone conversations with friends. It just takes some practice.

    Of course, if you are listening to this guy [demon.co.uk], you can disregard my advice.

    • Interesting. My tape player doesn't support slow playback, but that would be trivial to do on the computer. That would take two hours a tape then, but might be viable.
      • If you can make the player run on batteries, try dropping the voltage. I know that would slow older players down. Can't speak for current devices, as I haven't used tapes in quite some time.
    • It can be easier to transcribe real time (or, rather, easier to get higher typing speeds) by getting a better keyboard, like a Natural one.. or, better, a Dvorak keyboard (on which most of the records are set).
      • So, you're saying the process is:

        1. Get hired to transcribe tapes at 40$CAN/tape.
        2. Buy a Dvorak keyboard (cost: probably $200CAN minimum)
        3. Spend a year getting really proficient with a new layout
        4. Whip through those transcriptions like it ain't nobody's business.
        5. ???
        6. Profit.

  • Sphinx (Score:5, Informative)

    by jcausey ( 253286 ) on Friday January 23, 2004 @10:03PM (#8072296) Homepage
    Give Sphinx [sourceforge.net] a try. It's pretty accurate, especially Sphinx-3. I've used v2 before for a live test, and it works great -- even with different voices.
  • by rueger ( 210566 ) on Friday January 23, 2004 @10:57PM (#8072557) Homepage
    If your hours of tape are something that has to be transcribed accurately, don't waste your time trying to do it with a computer.

    A person who does transcription for a living will do it faster, probably cheaper, and will be able to handle all of the quirks of human speech that will gum up the works of a voice to text program.

    There are still places where a machine cannot match the quality of a real live person.
    • That was what I thought immediately on reading this question. My girlfriend used to be a legal secretary, and one of the things she had to do was exactly this - transcribe dictation recorded on to audio tape. She never had to do quite so much at one time, of course, but most secretaries are trained in this skill.

      Being able to get a computer to do it would be cool, but there's a reason why companies hire people with this skill - the technology just isn't there yet. Even if it were possible to buy/build a sy
  • Here's what you do:

    1. Convert the tapes to *.mp3 files.
    2. In the xterm type:
    cat speech.mp3 | transcribe > speech.txt
    3. Write a program called transcribe that converts audio data on stdin to text on stdout.
    4. Redo step 2 now that you have a program.

    Karma: desrever
  • Can't be done in this instance. The best software solution is to dictate the tape into one of the commercial SR products. This presupposes you have already trained the software with your voice or have the time to do so. Hiring stenographers is an option, but you will still take quite a long time checking their work for accuracy and making corrections, and the less you pay them, the more work there will be to fix the transcripts up.
  • SuSE 7.3 (Score:3, Informative)

    by Anthony Boyd ( 242971 ) on Saturday January 24, 2004 @04:03AM (#8073734) Homepage

    If you can get a copy of SuSE 7.3 Professional, it comes with IBM's ViaVoice for Linux. It can take audio and turn it into text. The trick is that 7.3 came out about 2 years ago, I think. Most stores would have the newer 9.0 version, which doesn't have ViaVoice.

    I guess it is possible that IBM still sells ViaVoice for newer distros. I've never looked.

  • What people with large amounts of this sort of stuff to do tend to do is outsource it offshore.

    There are companies that do this sort of stuff but I suspect your volumes are way below what they handle.

  • by travail_jgd ( 80602 ) on Saturday January 24, 2004 @11:21AM (#8074834)
    A friend was in a similar situation -- she had recorded a phone interview [1], and needed to transcribe it. To make certain there were no technical glitches, the interview was recorded to cassette and as a WAV file on her PC.

    When the time came to transcribe the interview, she found the version on her PC more helpful -- her hands never had to leave the keyboard in order to pause or "rewind" the audio.

    If you go this route, remember that you'll need about 600 MB per hour of uncompressed audio. If space is an issue and you need to compress, don't max out the compression; saving a few megabytes here and there could result in hours of extra work due to artifacts.

    [1] With explicit permission given.
    • Well, since phone is only at about 11 kHz anyway, you could cut the sampling rate of the WAV to 22 kHz or 11 kHz and you would get the same quality but files half to 3/4 smaller.
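The storage figures in this subthread follow directly from the PCM parameters. A sketch assuming CD-style audio (44.1 kHz, 16-bit, stereo) behind the 600 MB estimate and ignoring the small WAV header; the function name is illustrative:

```python
def pcm_bytes_per_hour(sample_rate_hz: int, channels: int, bits_per_sample: int) -> int:
    """Size in bytes of one hour of uncompressed PCM audio (header ignored)."""
    bytes_per_second = sample_rate_hz * channels * (bits_per_sample // 8)
    return bytes_per_second * 3600


# Full rate, then the two reduced rates suggested in the reply.
for rate in (44100, 22050, 11025):
    size_mb = pcm_bytes_per_hour(rate, channels=2, bits_per_sample=16) / 1_000_000
    print(f"{rate} Hz stereo, 16-bit: about {size_mb:.0f} MB per hour")
```

Halving the sample rate halves the file, and recording a single-speaker interview in mono would halve it again.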
    • [1] With explicit permission given.

      In most US states, as long as at least one of the parties in a phone conversation knows about the taping, the taping doesn't break the law. This means you can tape any calls you make or receive. Basically, the relevant laws only affect an uninvolved 3rd party recording a conversation they listen in on without the main parties of the call knowing about it.

      However, I say "most states", not all, so you might want to verify this for where you live.
  • Try this (Score:1, Informative)

    by Anonymous Coward
    http://download.com.com/3000-7239-10251419.html?tag=lst-0-12
  • Interviews are not formal speech. People mumble and slur their words in interviews. They cough, rustle and make non verbal vocalisations. There is no software that can deal with this. With the best voice recognition software you would be lucky to capture 30% of what was said, and that's *after* all the fricking about with settings etc. Quicker just to type it in.
  • Interesting Idea!
    I see this as an ideal area for genetic programming. Imagine that the person transcribes, by hand, a few random places on the tape. Maybe a total of 4 random places from the recorded material, at about a sentence per spot (or a paragraph). The program then generates an initial population of genetic programs that breed based on their fitness against your initial transcription. After a few rounds of this you take the best-fit individuals or something like that. If the speaker is the same o
