A Grep-like Utility That Works on More than Text?

Nutria writes "This article got me thinking: What's a poor Unix-using guy to do when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc.?" Is there an extensible searching program for Unix that can handle a variety of different file types? Search engines like ht://Dig can accomplish part of this task; however, it currently doesn't index the whole file (just portions of the metadata). If you had to perform a substring search on a set of documents of different types, what tools would you use to accomplish this task?
  • by szyzyg ( 7313 )
    strings filename | grep pattern
    • But that doesn't pull out compressed strings -- only recognizable ASCII strings.
    • Realistically, this should operate similarly to the text filters in less, i.e., a set of filters that convert files to text for grepping.

      A more generic solution would be to have less stop using those filters and instead use strings, which would identify the file type and then subcall strings.text.plain or strings.text.html or otherwise the strings.mime.type. The fallback would be the normal strings behaviour. That way, anything that expects text (a la grep, less, etc.) can have a flag to filter through strings.

    • Because strings will do things like: uncompress ASCII text from tarballs, recognize unicode font files, work with XML document formats to ignore markup and focus on content, search within images for strings, etc, etc.

      The person is asking about a true file contents indexing scheme, where you have a database of file name, meta type, and keywords culled from inside the document -- something that'd work with PDF, JPG w/ EXIF, pictures of text, OpenOffice files, XML files, HTML files, etc!

      Strings does none of this.
    • This also won't work with MS Word documents because they store text out of order.
  • by Marxist Hacker 42 ( 638312 ) * <seebert42@gmail.com> on Thursday September 02, 2004 @02:50PM (#10141784) Homepage Journal
    The problem I see isn't searching compressed or tarballed files, where the text inside is still largely plaintext. It's just a different file format, but it's still bytes, and I've seen supergrep programs in the past for both Linux and Windows that do this (try Tucows [tucows.com] or SourceForge [sourceforge.net] before posting on Slashdot in search of freeware and shareware). The problem I see is searching *encrypted* files, especially ones with different keys. Now that would be *hard*.
  • file + [GLUE] and, optionally, grep. GNU file is a nice utility for determining file types. From there, you could use some sort of glue language, such as Python or Perl, to perform the necessary actions when certain file types are found: decompress, index, etc. I'd be surprised if there weren't a plugin-style grep utility already out there.
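    A minimal sketch of the detection half (the file names are hypothetical, and output varies by version; older versions of file spell the option -i rather than --mime-type):

      $ file -b --mime-type report.pdf
      application/pdf
      $ file -b --mime-type notes.txt.gz
      application/x-gzip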
  • Beagle [gnome.org]
  • by dougmc ( 70836 ) <dougmc+slashdot@frenzied.us> on Thursday September 02, 2004 @02:55PM (#10141838) Homepage
    Search engines like ht://Dig can accomplish part of this task, however currently it doesn't index the whole file (just portions of the metadata).
    Um, ht://Dig WILL do it. As will any of the other search engine packages out there, if you can feed them the text they work on. It's generally not difficult (on a *nix platform) to convert all of the formats you mentioned into straight text or HTML and feed those to your favorite search engine.

    Useful programs include wvWare, rtf2html, pdftotext, and yes, even strings.
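    For instance, a whole tree can be mirrored to plain text ahead of indexing with a couple of find loops (the paths and extensions here are just examples):

      # PDFs: pdftotext writes a .txt next to each .pdf
      find ~/docs -name '*.pdf' -exec sh -c 'pdftotext "$0" "${0%.pdf}.txt"' {} \;
      # Word documents: wvWare emits HTML on stdout
      find ~/docs -name '*.doc' -exec sh -c 'wvWare "$0" > "${0%.doc}.html"' {} \;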

  • Beagle (Score:3, Informative)

    by Markusis ( 46739 ) * on Thursday September 02, 2004 @02:57PM (#10141861) Homepage Journal
    Beagle will probably meet at least some of these goals. Beagle also aims to index things like your Gaim logs, browsing history, and email, as well as your files. This is closely related to the Dashboard project. Both projects can be retrieved from GNOME CVS.

    http://www.nat.org/beagle/ [nat.org]
    http://www.nat.org/dashboard/ [nat.org]
  • by ivan256 ( 17499 ) * on Thursday September 02, 2004 @03:00PM (#10141905)
    What's a poor Unix-using guy to do, when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc.

    Um, why not pipe the output of your favorite program that interacts with the file type you're interested in to grep? Isn't that the "poor unix guy" way?

    catdoc Blah.doc | grep foo
    zcat compressed.txt.gz | grep foo
    apt-cache show package | grep foo
    pdftotext Blah.pdf | grep foo

    etc...
    • Half a solution .... (Score:4, Informative)

      by gstoddart ( 321705 ) on Thursday September 02, 2004 @05:43PM (#10143625) Homepage
      Um, why not pipe the output of your favorite program that interacts with the file type you're interested in to grep? Isn't that the "poor unix guy" way?


      Because it involves the user explicitly using the right utility to decode each file type for every single file.

      So you don't get the nice simple case the user is looking for like:

      grep foo *

      or even something cool like

      find . -name "$pattern" -exec egrep -l -e "$1" {} \;

      Grep is often used to hunt through large collections of docs because you know some file contains that text, but you can't remember which.

      What the guy is asking for is something that would take my second example, figure out the file type, and then do it automatically for you.

      Cheers

  • by Otter ( 3800 ) on Thursday September 02, 2004 @03:02PM (#10141919) Journal
    The same way everything else works!

    1) Be had a half-decent version years ago.

    2) Apple will have a reasonably robust version out soon.

    3) Microsoft will have a more frustrating knock-off of Apple's version a few years later.

    4) Four competing, incompatible open-source projects will copy the Apple and Microsoft implementations. When one of those companies sends a cease and desist letter to an open-source project that has shamelessly ripped off its trademarked name, Linux zealots will complain about how "intellectual property laws stifle our innovation"!
    • Funny? Seems more "insightful" to me!

      1) grep, ht://dig, find, etc
      2) Apple's new search utility, due out with 10.4
      3) WinFS
      4) ????
      • Yeah, except MS has had this in FindFast since Office 97, and in the Microsoft Indexing Service since Windows 2000 came out. It doesn't work terribly well out of the box (idle detection sucks), but if you put a little effort into it, Indexing Service really rocks, especially in an AD environment where you can perform a search across servers.
      • Parent's (1) had nothing to do with grep, ht://Dig, or find; it was more like what will be coming in (2).

        As for (3), putting an SQL server onto each desktop to store all the metadata about all the files was the wrong idea to start with (not learning from Be's lessons, are we? It has nothing to do with computers being more powerful than they were in the age of the original Be). A much better approach would be something more like ECCO's organizer system, RDF, FOAF, what Chandler will (hopefully) one day become, etc.
      • 4) Gnome Storage [gnome.org]
  • I'm sure there are utilities to convert most of the formats to text. MS Word to text? No problem, there's a utility to handle that. PDF? Of course.
    Basically, what you might do is something like this:

    1. Figure out what kind of file it is (using 'file')
    2. Select the correct converter based on file
    3. grep
    4. Profit? -- not quite sure about this one.
  • by Matt Perry ( 793115 ) <perry.matt54@ya[ ].com ['hoo' in gap]> on Thursday September 02, 2004 @03:06PM (#10141972)
    The unix shell and pipe are your friends:
    text:
      grep text filename
    compressed tarballs:
      tar zOxf filename | grep text
    OO.o documents:
      unzip -p filename | grep text
    mime-encoded files:
      mimedecode [debian.org] filename | grep text
    Evil Microsoft documents:
      strings filename | grep text
    PDF files:
      strings filename | grep text

    This would make a great shell script project. You could use file [ctyme.com] to detect the type and then filter and grep it appropriately. This sounds useful enough that I'll probably write this script this weekend. Thanks for the idea.
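    Here's a rough cut of such a script, as a sketch only (the name anygrep and the dispatch table are illustrative; it assumes GNU grep for --label, plus zcat, unzip, catdoc, and pdftotext):

      #!/bin/sh
      # anygrep: grep inside common document formats (hypothetical name)
      # usage: anygrep PATTERN FILE...
      pattern=$1; shift
      for f in "$@"; do
        # file -bi prints the MIME type; the trailing globs tolerate
        # versions that append "; charset=..." to the output
        case $(file -bi "$f") in
          application/pdf*)    pdftotext "$f" - | grep -H --label="$f" "$pattern" ;;
          application/msword*) catdoc "$f" | grep -H --label="$f" "$pattern" ;;
          application/zip*)    unzip -p "$f" | grep -H --label="$f" "$pattern" ;;
          application/*gzip*)  zcat "$f" | grep -H --label="$f" "$pattern" ;;
          *)                   grep -H "$pattern" "$f" ;;
        esac
      done

    With that in ~/bin, "anygrep foo *" gets pretty close to the plain "grep foo *" people are asking for above.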

  • by bentfork ( 92199 ) on Thursday September 02, 2004 @03:18PM (#10142104)
    I like the Catfish: namazu.org [google.ca] <-- Google cache ("namazu" is Japanese for catfish).

    It has multiple 'filters' that allow you to index .txt, .pdf, .doc, and various other file formats. It also has some grep-like command-line utilities that allow you to 'grep' for text.

    Oh, and it has a Perl front end for Apache and IIS (Win32 installer available from www.namazu.org/windows [namazu.org]).

    Caveat: as with any full-text search engine, the files should be reindexed on a regular basis, but it's worth it, as searches are very fast even with gigs of documents.
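    For reference, the index/search cycle looks roughly like this (from memory, so check the Namazu docs; the index path is just an example):

      # build or refresh the index over a document tree
      mknmz -O /var/lib/namazu/index ~/docs
      # then run grep-style queries against that index
      namazu 'foo' /var/lib/namazu/index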

    enjoy

    • Just reindex at 9 in the morning, when you've stopped computing for the night.
      • Just reindex at 9 in the morning, when you've stopped computing for the night.

        hmm so true, who goes to bed before 9 anyways... modifying crontab...

        # reindex at 9 in the morning
        0 9 * * * /pathto/namazu/reindex-foo
    • Source (such as it is) is linked below. Requires Namazu. Developed on Debian / Unstable. ~75 lines of decently commented bash.

      http://www.robertames.com/index.sh.txt [robertames.com]

      Call it BSD-licensed by the author, and sharing back is encouraged (of course). I downloaded all my mail from my Yahoo account (two weeks before they upped it to 100 MB) and stuffed it in a directory so I could search it better. I have to agree that Namazu rocks. :^)

      Major operations are "create" and "update" for working with the index, then "search" for querying it.

  • You mean a personal slave^H^H^H^H^H^H^H^H^H^H^H^H^H^Hhapless grad student?
  • Oh, you want it to work now? Nevermind.

    But, down the road, maybe Reiser4 will do the trick.

    • Yep, I agree. But an indexing/search engine still needs to be written, after they get done arguing about who can stand the silent semantic changes and who can't. And then there will be those who use other filesystems wanting the same functionality, or at least the same interface to an inferior functionality...
  • Penetrator (Score:4, Informative)

    by cbcbcb ( 567490 ) on Thursday September 02, 2004 @04:15PM (#10142667)
    I've just implemented a Word/PDF/text search system using Penetrator [freshmeat.net].

    It acts as a web search engine or a grep-like tool. You can either build an index or configure it to search on demand.

    You can add filters to deal with any file type you like. I use xpdf and wvWare to handle PDFs and word documents.

    A little bit of work to set up, but it's a nifty bit of software.
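    Filters of this sort are usually just commands that dump plain text on stdout; a PDF filter, for example, can be as small as this sketch (how you wire it into Penetrator's configuration is up to you):

      #!/bin/sh
      # print the PDF named by $1 as plain text on stdout
      exec pdftotext "$1" -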

    • With all the hidden goatse links around here, I'll let somebody else hit a link with the word Penetrator in it.
    • Re:Penetrator (Score:2, Informative)

      by GrumpySimon ( 707671 )
      This looks very cool (and isn't a goatse link!); I'm just setting it up now. Trying to get html2text to kill things like font tags.

      Also, your default pdftotext setting seems to barf on files with spaces in their names. I changed the line to '%s' and this seems to work.

      Cheers,
      Simon
      • w3m [sourceforge.net] -dump does a very good job of converting HTML to text.

        > Also - your default pdftotext setting...

        I didn't write Penetrator; it just looks identical to the program I was planning to write before I found it :)

    • Also worth looking at is SWISH-Enhanced [swish-e.org]. It can deal with Word .docs (with a bit of configuration), PDF, HTML and anything you can filter to text.

  • doodle (Score:2, Informative)

    by tenco ( 773732 )
    Get doodle [ovmj.org]
    or simply do a
    #apt-get install doodle
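    If I remember the interface right, you build the keyword database once and then query it, roughly like so (flags from memory; see doodle's man page before relying on them):

      # build the keyword database for a directory tree
      doodle -b ~/docs
      # then search it by keyword
      doodle grep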
  • by chjones ( 610558 ) <[chjones] [at] [aleph0.com]> on Thursday September 02, 2004 @06:02PM (#10143823) Homepage Journal
    Haven't actually tried it out (everything I write seems to be text or TeX), but I remembered reading this article a while back: "How to Index Anything", Linux J., July 1 2003 [linuxjournal.com].
  • Google [google.com] has a search appliance [google.com] that is capable of searching files on a local network. This may be more than what you're looking for, but if this is a serious enterprise application, this seems just right. It's like having your own personal google.
  • No one has mentioned glimpse [webglimpse.net], probably because they moved to a non-free license [webglimpse.net] a while back, but the 3.0 version is still gratis. And it allows you to configure filters to search files of whatever type you like.

    Here's a snapshot [freeshell.org] of the source for Glimpse 3.0, packaged up from my system as I don't have the original tarfile anymore.
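    Basic usage, if memory serves (the -H flag picks the index directory; double-check against the 3.0 docs):

      # build an index over a document tree
      glimpseindex -H ~/.glimpse ~/docs
      # grep-style search against that index
      glimpse -H ~/.glimpse foo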
