A Grep-like Utility That Works on More than Text?

Posted by Cliff on Thursday September 02, 2004 @02:45PM from the substring-searches-on-steroids dept.

Nutria writes "This article got me thinking: What's a poor Unix-using guy to do, when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc." Is there an extensible searching program for Unix that can handle a variety of different file-types? Search engines like ht://Dig can accomplish part of this task, however currently it doesn't index the whole file (just portions of the metadata). If you had to perform a substring search on a set of documents of different types, what tools would you use to accomplish this task?

This discussion has been archived. No new comments can be posted.

ever heard of 'strings'? (Score:1, Informative)

by szyzyg ( 7313 ) writes:

strings filename | grep pattern
- Re:ever heard of 'strings'? (Score:2)
  
  by TheWanderingHermit ( 513872 ) writes:
  
  But that doesn't pull out compressed strings -- only recognizable ascii strings.
- Re:ever heard of 'strings'? (Score:3, Informative)
  
  by JabberWokky ( 19442 ) writes:
  
  Realistically, this should operate similarly to the text filters in less, i.e., a set of filters that convert files to text for grepping.
  A more generic solution would be to have less stop using those filters and instead use strings, which would identify the file type and then subcall strings.text.plain or strings.text.html or otherwise the strings.mime.type. The fallback would be the normal strings behaviour. That way, anything that expected text (a la grep, less, etc) can have a flag to filter through string
- Oh, yeah, that works... (Score:2)
  
  by Inoshiro ( 71693 ) writes:
  
  Because strings will do things like: uncompress ASCII text from tarballs, recognize unicode font files, work with XML document formats to ignore markup and focus on content, search within images for strings, etc, etc.
  
  The person is asking about a true file contents indexing scheme, where you have a database of file name, meta type, and keywords culled from inside the document -- something that'd work with PDF, JPG w/ EXIF, pictures of text, OpenOffice files, XML files, HTML files, etc!
  
  Strings does none of
  - Re:Oh, yeah, that works... (Score:1)
    
    by pete-classic ( 75983 ) writes:
    
    A Grep-like Utility That Works on More than Text?
    
    The person is asking about a true file contents indexing scheme[. . .]
    
    They added file contents indexing to grep?! Sweet!
    
    -Peter
- Re:ever heard of 'strings'? (Score:2)
  
  by Black Acid ( 219707 ) writes:
  
  This also won't work with MS Word documents because they store text out of order.
Got to be better than XP's puppy dog, but (Score:3, Informative)

by Marxist Hacker 42 ( 638312 ) * writes: <seebert42@gmail.com> on Thursday September 02, 2004 @02:50PM (#10141784) Homepage Journal

the problem I see isn't searching compressed or tarballed files, where the text inside is still largely in plaintext. It's just a different file format, but it's still bytes, and I've seen supergrep programs in the past for both Linux and Windows that do this (try Tucows [tucows.com] or SourceForge [sourceforge.net] before posting on Slashdot in search of freeware and shareware). The problem I see is searching *encrypted* files, especially ones with different keys. Now that would be *hard*.

- Speaking of the 'puppy dog' (Score:2)
  
  by MarcQuadra ( 129430 ) * writes:
  
  I think the fscking puppy is the WORST thing about XP. Why on earth would I click 'search' in my 'file manager' if I wasn't searching for a FILE? I despise any 'improvement' to Windows that makes me use my mouse any more than I already do. If Microsoft knew what they were doing the search would START with files, with an option to do all the other crap on the sidelines.
  
  Apple has a really good search tool, it's simple but lets you string together conditions intuitively. KDE seems pretty decent too.
  - Re:Speaking of the 'puppy dog' (Score:2)
    
    by no soup for you ( 607826 ) writes:
    
    if you need to search, you can press F3 on almost any application or the desktop. Firefox seems to be an exception -- F3 finds text on that page.
    - Re:Speaking of the 'puppy dog' (Score:2)
      
      by Marxist Hacker 42 ( 638312 ) * writes:
      
      Which just brings up the damned dog again- I WANT F3 to do what it used to, find text in the CURRENT document on a Find First, Find Next meme.
  - Re:Speaking of the 'puppy dog' (Score:1)
    
    by airjrdn ( 681898 ) writes:
    
    I agree that the dog should be shot, but it is an easy setting to change.
file... (Score:2)

by runswithd6s ( 65165 ) writes:

file+[GLUE] and optionally, grep. GNU File is a nice utility to determine file types. From there, you could use some sort of glue language, such as Python or Perl, to perform the necessary actions when certain filetypes are found: decompress, index, etc. I'd be surprised if there wasn't a plugin style grep utility already out there.
- Re:GOOGLE (Score:1)
  
  by djsmiley ( 752149 ) writes:
  
  This isn't as stupid an idea as it sounds.
  
  If the user was stupid enough to share their entire drive, and allow google to index it, it could find most things in most file types.
  
  With more support coming all the time.
  - Re:GOOGLE Try POPsearch (Score:1)
    
    by accensi ( 767981 ) writes:
    
    Try POPsearch (http://www.popsearch.net/)
Beagle (Score:1)

by heikkih ( 100839 ) writes:

Beagle [gnome.org]
ht://Dig doesn't do this? (Score:3, Informative)

by dougmc ( 70836 ) writes: <dougmc+slashdot@frenzied.us> on Thursday September 02, 2004 @02:55PM (#10141838) Homepage

Search engines like ht://Dig can accomplish part of this task, however currently it doesn't index the whole file (just portions of the metadata).

Um, ht://Dig WILL do it. As will any of the other search engine packages out there, if you can feed them the text that they work on. It's generally not difficult (on a *nix platform) to convert all of the formats you mentioned into straight text or html and feed those to your favorite search engine platforms.
Useful programs include wvWare, rtf2htm, pdftotext, and yes, even strings.
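
A rough sketch of that convert-then-index step, assuming a POSIX shell with find, gzip, pdftotext and strings available. The function name mirror_text and the output layout are made up for illustration, and only a few of the formats from the question are covered. It writes a plain-text copy of every file into a parallel tree that an indexer could then be pointed at.

```shell
# mirror_text SRC_DIR OUT_DIR: write a text rendering of every regular
# file under SRC_DIR to OUT_DIR/<relative path>.txt
# (hypothetical helper; pass SRC_DIR without a trailing slash).
mirror_text() {
    src=$1 out=$2
    find "$src" -type f | while IFS= read -r f; do
        rel=${f#"$src"/}                      # path relative to SRC_DIR
        mkdir -p "$out/$(dirname "$rel")"
        case $f in
            *.pdf)              pdftotext "$f" "$out/$rel.txt" ;;
            *.gz)               gzip -dc "$f" > "$out/$rel.txt" ;;
            *.txt|*.htm|*.html) cp "$f" "$out/$rel.txt" ;;
            *)                  strings "$f" > "$out/$rel.txt" ;;  # crude fallback
        esac
    done
}
```

Something like `mirror_text ~/docs /var/tmp/docs-text` would then leave a tree of .txt files to index. File names containing newlines are not handled, and the extension test is only a guess at the real file type.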

Beagle (Score:3, Informative)

by Markusis ( 46739 ) * writes: on Thursday September 02, 2004 @02:57PM (#10141861) Homepage Journal

Beagle will probably meet at least some of these goals. Beagle also aims to index things like your gaim-logs, browsing history, and email as well as your files. This is closely related to the dashboard project. Both projects can be retrieved from Gnome CVS.

http://www.nat.org/beagle/ [nat.org]
http://www.nat.org/dashboard/ [nat.org]

- Re:Beagle (Score:5, Informative)
  
  by HRbnjR ( 12398 ) writes: <chris@hubick.com> on Thursday September 02, 2004 @03:34PM (#10142277) Homepage
  
  Seth Nickell has a blog entry [gnome.org] discussing solutions in this area, including Gnome Storage, WinFS, Dashboard, Medusa, Spotlight, and Beagle.
  
Are you sure you're a "unix guy"? (Score:4, Insightful)

by ivan256 ( 17499 ) * writes: on Thursday September 02, 2004 @03:00PM (#10141905)

What's a poor Unix-using guy to do, when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc.

Um, why not pipe the output of your favorite program that interacts with the file type you're interested in to grep? Isn't that the "poor unix guy" way?

catdoc Blah.doc | grep foo
zcat compressed.txt.gz | grep foo
apt-cache show package | grep foo
pdftotext Blah.pdf - | grep foo

etc...
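
A minimal sketch of wrapping the pipelines above so that one command covers several file types. The names to_text and anygrep are invented for illustration, choosing the converter by file extension is an assumption rather than anything the thread prescribes, and catdoc and pdftotext have to be installed separately.

```shell
# to_text FILE: write a plain-text rendering of FILE to stdout.
# The converter is picked from the extension (an assumption; a misnamed
# file will be handed to the wrong tool).
to_text() {
    case $1 in
        *.gz)  gzip -dc -- "$1" ;;     # same effect as zcat
        *.doc) catdoc "$1" ;;          # needs catdoc installed
        *.pdf) pdftotext "$1" - ;;     # "-" sends the text to stdout
        *)     cat -- "$1" ;;          # assume it is already text
    esac
}

# anygrep PATTERN FILE...: print the name of every file whose text matches.
anygrep() {
    pattern=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] || continue        # skip directories and the like
        if to_text "$f" 2>/dev/null | grep -q -e "$pattern"; then
            printf '%s\n' "$f"
        fi
    done
}
```

With these defined, `anygrep foo *` behaves roughly like `grep -l foo *` but looks inside the formats listed in the case statement. Anything not listed is still searched as raw bytes.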

- Half a solution .... (Score:4, Informative)
  
  by gstoddart ( 321705 ) writes: on Thursday September 02, 2004 @05:43PM (#10143625) Homepage
  
  Um, why not pipe the output of your favorite program that interacts with the file type you're interested in to grep? Isn't that the "poor unix guy" way?
  
  Because it involves the user explicitly using the right utility to decode each file type for every single file.
  
  So you don't get the nice simple case the user is looking for like:
  
  grep foo *
  
  or even something cool like
  
  find . -name "$pattern" -exec egrep -l -e "$1" {} \;
  
  Grep is often used to hunt through large amounts of docs because you know something has that text, but you can't remember what.
  
  What the guy is asking for is something that would take my second example, figure out the file type, and then do it automatically for you.
  
  Cheers
  
  - Re:Half a solution .... (Score:2)
    
    by AuMatar ( 183847 ) writes:
    
    So write a script that runs each of those programs on all grepped files, ignores those with an error return, and pipes the outputs of the rest to grep.
  - Re:Half a solution .... (Score:3, Informative)
    
    by cortana ( 588495 ) writes:
    
    This is what the less package is for. Specifically lesspipe(1).
    
    find -type f -exec lesspipe {} \; | grep whatever
  - Re:Half a solution .... (Score:2)
    
    by Matthew Weigel ( 888 ) writes:
    
    Funny... you'd think that with such a cogent explanation of the half of the problem that grep+find doesn't solve, file(1) would leap to mind as a reasonable start to the solution of it.
    
    It doesn't handle everything, you still need utilities for most file types to convert them to text, but it's a damned good start.
  - Re:Half a solution .... (Score:1)
    
    by El Buho ( 810814 ) writes:
    
    You could add support for many file formats to your script by using some of the many PostScript converters [mcgill.ca] that already exist. Pipe their output to an application that extracts plaintext from the PostScript file [rutgers.edu], then pipe to grep.
How, indeed! (Score:5, Funny)

by Otter ( 3800 ) writes: on Thursday September 02, 2004 @03:02PM (#10141919) Journal

The same way everything else works!

1) Be had a half-decent version years ago.

2) Apple will have a reasonably robust version out soon.

3) Microsoft will have a more frustrating knock-off of Apple's version a few years later.

4) Four competing, incompatible open-source projects will copy the Apple and Microsoft implementations. When one of those companies sends a cease and desist letter to an open-source project that has shamelessly ripped off its trademarked name, Linux zealots will complain about how "intellectual property laws stifle our innovation"!

- Re:How, indeed! (Score:2)
  
  by Carnildo ( 712617 ) writes:
  
  Funny? Seems more "insightful" to me!
  
  1) grep, ht://dig, find, etc
  2) Apple's new search utility, due out with 10.4
  3) WinFS
  4) ????
  - Re:How, indeed! (Score:2)
    
    by afidel ( 530433 ) writes:
    
    Yeah except MS has had this in FindFast from Office 97 on and the Microsoft Indexing Service since Windows 2000 came out. It doesn't work terribly well out of the box (idle detection sucks) but if you put a little effort into it Indexing Service really rocks, especially in an AD environment where you can perform a search across servers.
  - Re:How, indeed! (Score:2)
    
    by AndyElf ( 23331 ) writes:
    
    Parent's (1) had nothing to do with grep, ht://dig or find: it was more what will be coming in (2).
    
    As for (3), putting an SQL Server onto each desktop to store all the meta-data about all the files was a wrong idea to start with (not learning from Be's lessons, are we? Nothing to do with computers becoming more powerful than what they were in the age of original Be). A much better approach would be more similar to ECCO's organizer system, RDF, FOAF, what Chandler will (hopefully) one day become, etc. --
  - Re:How, indeed! (Score:2)
    
    by arose ( 644256 ) writes:
    
    4) Gnome Storage [gnome.org]
Convert the files (Score:1)

by repvik ( 96666 ) writes:

I'm sure there are utilities to convert most of the formats to text. MS Word to text? No problem, there's a utility to handle that. PDF? Of course.
Basically, what you might do is something like this:

1. Figure out what kind of file it is (using 'file')
2. Select the correct converter based on the file type
3. grep
4. Profit? -- not quite sure about this one.
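
Steps 1 to 3 above can be sketched as follows, assuming a version of file(1) that supports the -b and --mime-type options. The function names converter_for, pdftotext_stdout and typegrep are hypothetical, and the MIME-type table is illustrative rather than complete.

```shell
# converter_for MIMETYPE: print the command that turns such a file into
# text on stdout. Unknown types fall back to strings.
converter_for() {
    case $1 in
        text/*)                              echo "cat" ;;
        application/gzip|application/x-gzip) echo "gzip -dc" ;;
        application/pdf)                     echo "pdftotext_stdout" ;;
        application/msword)                  echo "catdoc" ;;
        *)                                   echo "strings" ;;
    esac
}

# pdftotext writes to a .txt file unless told to use "-" (stdout).
pdftotext_stdout() { pdftotext "$1" -; }

# typegrep PATTERN FILE
typegrep() {
    mime=$(file -b --mime-type "$2") || return 2   # 1. identify the type
    conv=$(converter_for "$mime")                  # 2. select a converter
    $conv "$2" | grep -e "$1"                      # 3. grep the text
}
```

Keeping the type-to-converter table in its own function means new formats can be added in one place without touching the grep step.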
Use a pipe and utilities (Score:5, Informative)

by Matt Perry ( 793115 ) writes: <perry.matt54@nOSpAm.yahoo.com> on Thursday September 02, 2004 @03:06PM (#10141972)

The unix shell and pipe are your friends:

grep text

grep text filename

compressed tarballs

tar zOxf filename | grep text

OO.o document

unzip -p filename | grep text

mime-encoded files

mimedecode [debian.org] filename | grep text

Evil Microsoft documents

strings filename | grep text

PDF files

strings filename | grep text
This would make a great shell script project. You could use file [ctyme.com] to detect the type and then filter and grep it appropriately. This sounds useful enough that I'll probably write this script this weekend. Thanks for the idea.
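
One wrinkle with the tarball pipeline above is that it reports matching lines but not which archive member they came from. A small sketch that lists the members first and greps each one separately, assuming a tar that understands -z and -O (GNU tar and bsdtar both do); the name targrep is made up for illustration.

```shell
# targrep PATTERN ARCHIVE: print "archive:member" for every regular file
# inside a gzip-compressed tarball whose contents match PATTERN.
targrep() {
    pattern=$1 archive=$2
    tar -tzf "$archive" | while IFS= read -r member; do
        case $member in */) continue ;; esac     # skip directory entries
        if tar -xzOf "$archive" "$member" 2>/dev/null | grep -q -e "$pattern"; then
            printf '%s:%s\n' "$archive" "$member"
        fi
    done
}
```

This decompresses the archive once per member, so it is slow on large tarballs; it trades speed for knowing where each match lives.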

- Re:Use a pipe and utilities (Score:1)
  
  by Cyrus ( 84324 ) writes:
  
  Why not just use zgrep [delorie.com]?
  - Re:Use a pipe and utilities (Score:2)
    
    by Matt Perry ( 793115 ) writes:
    
    Oh you could. I just wanted to illustrate using the pipe and other commands with grep to achieve what the poster wanted.
- Re:Use a pipe and utilities (Score:3, Informative)
  
  by robbkidd ( 154298 ) writes:
  PDF files
  strings filename | grep text
  I'm guessing you've never tried that search before. PDF stores the meat of a document in compressed data streams. strings would return a bunch of font names, headers and compressed garbage.
  There are a few other tools available, at various stages of stability:
  
  PDFtoHTML [sourceforge.net]
  PDFsearch [sourceforge.net]
  CPAN PDF modules [cpan.org]
  - Re:Use a pipe and utilities (Score:1)
    
    by Shadwell ( 709447 ) writes:
    
    xpdf [foolabs.com] comes with a utility called pdftotext that dumps the text portion to a plaintext file.
- Re:Use a pipe and utilities (Score:2)
  
  by alexandre ( 53 ) writes:
  
  there is also wordcat that displays only a txt version of the word document :)
  (there is also yet another program but i forgot the name)
  - Re:Use a pipe and utilities (Score:2)
    
    by Clover_Kicker ( 20761 ) writes:
    
    catdoc [free.net].
- Re:Use a pipe and utilities (Score:4, Informative)
  
  by bunnyman ( 121652 ) writes: on Thursday September 02, 2004 @05:33PM (#10143493)
  
  less(1) already does this! Check out the $LESSOPEN variable on your Linux system, it points to a shell script that detects what type of file you are viewing, and runs a filter on it to get plain text from it.
  
  - Re:Use a pipe and utilities (Score:3, Informative)
    
    by Spoing ( 152917 ) writes:
    less(1) already does this! Check out the $LESSOPEN variable on your Linux system, it points to a shell script that detects what type of file you are viewing, and runs a filter on it to get plain text from it.
    *blink*
    $ less firefox-i686-linux-gtk2+xft.tar.gz
    drwxrwxrwx cltbld/cltbld 0 2004-08-24 11:26:07 firefox/
    -rwxr-xr-x cltbld/cltbld 30869 1999-10-05 22:14:51 firefox/LICENSE
    -rwxr-xr-x cltbld/cltbld 177 2004-08-09 16:01:23 firefox/README.txt
    drwxrwxrwx cltbld/cltbld 0 2004-08-24 11:26:07 firefox/chr
- Re:Use a pipe and utilities (Score:1)
  
  by gnu-user ( 162334 ) writes:
  
  Evil Microsoft documents
  
  strings filename | grep text
  
  Use antiword instead of strings.
  
  Works pretty well (It allows me to use mutt relatively painlessly in a corporate environment).
- Use PostScript converters (Score:1)
  
  by El Buho ( 810814 ) writes:
  
  You could add support for many file formats to your script by using some of the many PostScript converters [mcgill.ca] that already exist. Pipe their output to an application that extracts plaintext from the PostScript file [rutgers.edu], then pipe to grep.
Full text search engine? (Score:4, Informative)

by bentfork ( 92199 ) writes: on Thursday September 02, 2004 @03:18PM (#10142104)

I like the Catfish: namazu.org [google.ca] <-- Google cache (that's "catfish" in Japanese).
It has multiple 'filters' that allow you to index .txt, .pdf, .doc and various other file formats. It also has some grep-like command line utilities that allow you to 'grep' for text.
Oh, and it has a Perl front end for Apache and IIS (win32 installer available from here: www.namazu.org/windows [namazu.org]).
caveat: As with any full text search engine, the files should be reindexed on a regular basis, but this is worth it as the searches are very fast even with gigs of documents.
enjoy

- Re:Full text search engine? (Score:2, Insightful)
  
  by shufler ( 262955 ) writes:
  
  Just reindex at 9 in the morning, when you've stopped computing for the night.
  - Re:Full text search engine? (Score:2)
    
    by bentfork ( 92199 ) writes:
    
    Just reindex at 9 in the morning, when you've stopped computing for the night.
    
    hmm so true, who goes to bed before 9 anyways... modifying crontab...
    
    # reindex at 9 in the morning
    0 9 * * * /pathto/namazu/reindex-foo
- I did this a while ago... (Score:1)
  
  by Ramses0 ( 63476 ) writes:
  
  Source (such as it is) is linked below. Requires Namazu. Developed on Debian / Unstable. ~75 lines of decently commented bash.
  
  http://www.robertames.com/index.sh.txt [robertames.com]
  Call it BSD-licensed by author, and sharing back is encouraged (of course). I downloaded all my mail from yahoo account (2 wks before they upped it to 100mb) and stuffed it in a directory so I could search it better. I have to agree that namazu rocks. :^)
  Major operations are "create, update" for working with the index, then "search,
Ooh, goody, a "research assistant!" (Score:3, Funny)

by orangesquid ( 79734 ) writes: <orangesquid@NOsPaM.yahoo.com> on Thursday September 02, 2004 @03:21PM (#10142128) Homepage Journal

You mean a personal slave^H^H^H^H^H^H^H^H^H^H^H^H^H^Hhapless grad student?

Reiser4 (Score:2)

by SpaceLifeForm ( 228190 ) writes:

Oh, you want it to work now? Nevermind.
But, down the road, maybe ReiserFS4 will do the trick.
- Re:Reiser4 (Score:2)
  
  by ecloud ( 3022 ) writes:
  
  Yep, I agree. But an indexing/search engine still needs to be written, after they get done arguing about who can stand the silent semantic changes and who can't. And then there will be those who use other filesystems wanting the same functionality, or at least the same interface to an inferior functionality...
Penetrator (Score:4, Informative)

by cbcbcb ( 567490 ) writes: on Thursday September 02, 2004 @04:15PM (#10142667)

I've just implemented a Word/PDF/text search system using Penetrator [freshmeat.net].
It acts as a web search engine or a grep-like tool. You can either build an index or configure it to search on demand.
You can add filters to deal with any file type you like. I use xpdf and wvWare to handle PDFs and word documents.
A little bit of work to set up, but it's a nifty bit of software.

- Re:Penetrator (Score:1)
  
  by chez69 ( 135760 ) writes:
  
  with all the hidden goatse links around here I'll let somebody else hit a link with the word Penetrator in it.
- Re:Penetrator (Score:2, Informative)
  
  by GrumpySimon ( 707671 ) writes:
  
  This looks very cool (and isn't a goatse link!), I'm just setting it up now. Trying to get html2text to kill things like font tags.
  
  Also - your default pdftotext setting seems to barf on files with spaces in their names. I changed the line to '%s' and this seems to work.
  
  Cheers,
  Simon
  - Re:Penetrator (Score:2)
    
    by cbcbcb ( 567490 ) writes:
    
    w3m [sourceforge.net] -dump does a very good job of converting html to text.
    > Also - your default pdftotext setting...
    I didn't write Penetrator, it just looks identical to the program I was planning to write before I found it :)
- Re:Penetrator (Score:2)
  
  by Twylite ( 234238 ) writes:
  
  Also worth looking at is SWISH-Enhanced [swish-e.org]. It can deal with Word .docs (with a bit of configuration), PDF, HTML and anything you can filter to text.
doodle (Score:2, Informative)

by tenco ( 773732 ) writes:

Get doodle [ovmj.org]
or simply do a
#apt-get install doodle
Haven't tried it, but SWISH sounds good (Score:4, Informative)

by chjones ( 610558 ) writes: <chjones@NoSpam.aleph0.com> on Thursday September 02, 2004 @06:02PM (#10143823) Homepage Journal

Haven't actually tried it out (everything I write seems to be text or TeX), but I remembered reading this article a while back: "How to Index Anything", Linux J., July 1 2003 [linuxjournal.com].

- Re:Haven't tried it, but SWISH sounds good (Score:5, Informative)
  
  by bobv-pillars-net ( 97943 ) writes: <bobvin@pillars.net> on Thursday September 02, 2004 @09:15PM (#10145187) Homepage Journal
  
  Actually, the last time I needed to index a large group of files (for disaster recovery purposes; the files in question were all from the lost+found directory) I used Swish++ [mac.com].
  And yes, it works VERY well, orders of magnitude faster than ht://Dig, features incremental reindexing, and can be configured to auto-convert various filetypes to text before indexing. I'd say it's exactly what the poster ordered.
  
Google search appliance (Score:2)

by muon1183 ( 587316 ) writes:

Google [google.com] has a search appliance [google.com] that is capable of searching files on a local network. This may be more than what you're looking for, but if this is a serious enterprise application, this seems just right. It's like having your own personal google.
Glimpse (Score:2)

by polymath69 ( 94161 ) writes:

No one has mentioned glimpse [webglimpse.net], probably because they moved to a non-free license [webglimpse.net] a while back, but the 3.0 version is still gratis. And it allows you to configure filters to search files of whatever type you like.
Here's a snapshot [freeshell.org] of the source for Glimpse 3.0, packaged up from my system as I don't have the original tarfile anymore.
