




A Grep-like Utility That Works on More than Text?
Nutria writes "This article got me thinking: What's a poor Unix-using guy to do, when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc." Is there an extensible searching program for Unix that can handle a variety of different file-types? Search engines like ht://Dig can accomplish part of this task; however, ht://Dig currently doesn't index the whole file (just portions of the metadata). If you had to perform a substring search on a set of documents of different types, what tools would you use to accomplish this task?
ever heard of 'strings'? (Score:1, Informative)
Re:ever heard of 'strings'? (Score:2)
Re:ever heard of 'strings'? (Score:3, Informative)
A more generic solution would be to have less stop using those filters and instead use strings, which would identify the file type and then subcall strings.text.plain or strings.text.html or otherwise the strings.mime.type. The fallback would be the normal strings behaviour. That way, anything that expected text (a la grep, less, etc) can have a flag to filter through string
Oh, yeah, that works... (Score:2)
The person is asking about a true file contents indexing scheme, where you have a database of file name, meta type, and keywords culled from inside the document -- something that'd work with PDF, JPG w/ EXIF, pictures of text, OpenOffice files, XML files, HTML files, etc!
Strings does none of
Re:Oh, yeah, that works... (Score:1)
They added file contents indexing to grep?! Sweet!
-Peter
Re:ever heard of 'strings'? (Score:2)
Got to be better than XP's puppy dog, but (Score:3, Informative)
Speaking of the 'puppy dog' (Score:2)
Apple has a really good search tool, it's simple but lets you string together conditions intuitively. KDE seems pretty decent too.
Re:Speaking of the 'puppy dog' (Score:2)
Re:Speaking of the 'puppy dog' (Score:2)
Re:Speaking of the 'puppy dog' (Score:1)
file... (Score:2)
Re:GOOGLE (Score:1)
If the user was stupid enough to share their entire drive and allow Google to index it, Google could find most things in most file types, with more support coming all the time.
Re:GOOGLE Try POPsearch (Score:1)
Beagle (Score:1)
ht://Dig doesn't do this? (Score:3, Informative)
Useful programs include wvWare, rtf2htm, pdftotext, and yes, even strings.
Beagle (Score:3, Informative)
http://www.nat.org/beagle/ [nat.org]
http://www.nat.org/dashboard/ [nat.org]
Re:Beagle (Score:5, Informative)
Are you sure you're a "unix guy"? (Score:4, Insightful)
Um, why not pipe the output of your favorite program that interacts with the file type you're interested in to grep? Isn't that the "poor unix guy" way?
catdoc Blah.doc | grep foo
zcat compressed.txt.gz | grep foo
apt-cache show package | grep foo
pdftotext Blah.pdf | grep foo
etc...
Half a solution .... (Score:4, Informative)
Because it involves the user explicitly using the right utility to decode each file type for every single file.
So you don't get the nice simple case the user is looking for like:
grep foo *
or even something cool like
find . -name "$pattern" -exec egrep -l -e "$1" {} \;
Grep is often used to hunt through large amounts of docs because you know something has that text, but you can't remember what.
What the guy is asking for is something that would take my second example, figure out the file type, and then do it automatically for you.
Cheers
Re:Half a solution .... (Score:2)
Re:Half a solution .... (Score:3, Informative)
find -type f -exec lesspipe {} \; | grep whatever
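One catch with the one-liner above is that grep only sees one merged stream, so the matching lines come back without the names of the files they came from. A rough sketch of a variant that keeps the names follows. The function name find_grep and the CONVERTER variable are made up for illustration; lesspipe is just the default because it is what the one-liner uses, and on some systems it prints nothing for files it leaves untouched, so plain text may need a different converter such as cat.

```shell
# find_grep PATTERN [DIR]: list files under DIR whose converted text matches.
# Hypothetical helper. CONVERTER is assumed to be any command that takes a
# file name and writes text to stdout.
find_grep() {
    fg_pattern=$1
    fg_dir=${2:-.}
    # Hand batches of file names to an inner shell so each file is
    # converted and grepped on its own, and its name can be printed.
    find "$fg_dir" -type f -exec sh -c '
        pattern=$1
        converter=$2
        shift 2
        for f in "$@"; do
            if "$converter" "$f" 2>/dev/null | grep -q -e "$pattern"; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "$fg_pattern" "${CONVERTER:-lesspipe}" {} +
}
```

Something like find_grep whatever ~/docs would then behave roughly like grep -l over the converted text, and setting CONVERTER=cat reduces it to a plain recursive grep.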
Re:Half a solution .... (Score:2)
Funny... you'd think that with such a cogent explanation of the half of the problem that grep+find doesn't solve, file(1) would leap to mind as a reasonable start to the solution of it.
It doesn't handle everything, you still need utilities for most file types to convert them to text, but it's a damned good start.
Re:Half a solution .... (Score:1)
How, indeed! (Score:5, Funny)
1) Be had a half-decent version years ago.
2) Apple will have a reasonably robust version out soon.
3) Microsoft will have a more frustrating knock-off of Apple's version a few years later.
4) Four competing, incompatible open-source projects will copy the Apple and Microsoft implementations. When one of those companies sends a cease and desist letter to an open-source project that has shamelessly ripped off its trademarked name, Linux zealots will complain about how "intellectual property laws stifle our innovation"!
Re:How, indeed! (Score:2)
1) grep, ht://dig, find, etc
2) Apple's new search utility, due out with 10.4
3) WinFS
4) ????
Re:How, indeed! (Score:2)
Re:How, indeed! (Score:2)
As for (3), putting an SQL Server onto each desktop to store all the meta-data about all the files was a wrong idea to start with (not learning from Be's lessons, are we? Nothing to do with computers becoming more powerful than they were in the age of the original Be). A much better approach would be more similar to ECCO's organizer system, RDF, FOAF, what Chandler will (hopefully) one day become, etc.
Re:How, indeed! (Score:2)
Convert the files (Score:1)
Basically, what you might do is something like this:
1. Figure out what kind of file it is (using 'file')
2. Select the correct converter based on file
3. grep
4. Profit? -- not quite sure about this one.
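Steps 1 to 3 above could be sketched as a small shell script. This is only a sketch under assumptions: the names to_text and typegrep are made up, the MIME-type-to-converter table is a guess at a sensible default (file may report other types for some Word documents), and pdftotext and catdoc, both mentioned elsewhere in this thread, have to be installed for those branches to do anything.

```shell
#!/bin/sh
# to_text FILE: print a plain-text rendering of FILE on stdout.
# (Hypothetical helper; the converter table below is an assumption.)
to_text() {
    tt_file=$1
    # Step 1: ask file(1) for the MIME type, e.g. "application/pdf".
    tt_mime=$(file -b --mime-type "$tt_file" 2>/dev/null)
    # Step 2: pick a converter for that type.
    case $tt_mime in
        text/*)
            cat "$tt_file" ;;
        application/gzip|application/x-gzip)
            gzip -dc "$tt_file" ;;
        application/pdf)
            pdftotext "$tt_file" - ;;
        application/msword)
            catdoc "$tt_file" ;;
        *)
            # Unknown type: fall back to strings, or to the raw bytes
            # if strings is not installed.
            if command -v strings >/dev/null 2>&1; then
                strings "$tt_file"
            else
                cat "$tt_file"
            fi ;;
    esac
}

# typegrep PATTERN FILE...: print the name of each file whose text matches.
# (Hypothetical helper; behaves roughly like grep -l.)
typegrep() {
    tg_pattern=$1
    shift
    for tg_file in "$@"; do
        # Step 3: grep the converted text.
        if to_text "$tg_file" 2>/dev/null | grep -q -e "$tg_pattern"; then
            printf '%s\n' "$tg_file"
        fi
    done
}
```

With that in place, typegrep foo * would list the files in the current directory that contain foo once converted. Adding a new file type is one more line in the case statement.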
Use a pipe and utilities (Score:5, Informative)
This would make a great shell script project. You could use file [ctyme.com] to detect the type and then filter and grep it appropriately. This sounds useful enough that I'll probably write this script this weekend. Thanks for the idea.
Re:Use a pipe and utilities (Score:1)
Re:Use a pipe and utilities (Score:2)
Re:Use a pipe and utilities (Score:3, Informative)
I'm guessing you've never tried that search before. PDF stores the meat of a document in compressed data streams. strings would return a bunch of font names, headers and compressed garbage.
There are a few other tools available, at various stages of stability:
Re:Use a pipe and utilities (Score:1)
Re:Use a pipe and utilities (Score:2)
(there is also yet another program but i forgot the name)
Re:Use a pipe and utilities (Score:2)
Re:Use a pipe and utilities (Score:4, Informative)
Re:Use a pipe and utilities (Score:3, Informative)
*blink*
Re:Use a pipe and utilities (Score:1)
Use antiword instead of strings.
Works pretty well (It allows me to use mutt relatively painlessly in a corporate environment).
Use PostScript converters (Score:1)
Full text search engine? (Score:4, Informative)
It has multiple 'filters' that allow you to index .txt, .pdf, .doc and various other file formats. It also has some grep-like command-line utilities that allow you to 'grep' for text.
Oh, and it has a Perl front end for Apache and IIS (win32 installer available here: www.namazu.org/windows [namazu.org]).
caveat: As with any full text search engine, the files should be reindexed on a regular basis, but this is worth it as the searches are very fast even with gigs of documents.
enjoy
Re:Full text search engine? (Score:2, Insightful)
Re:Full text search engine? (Score:2)
hmm so true, who goes to bed before 9 anyways... modifying crontab...
I did this a while ago... (Score:1)
http://www.robertames.com/index.sh.txt [robertames.com]
Call it BSD-licensed by author, and sharing back is encouraged (of course). I downloaded all my mail from my Yahoo account (2 weeks before they upped it to 100 MB) and stuffed it in a directory so I could search it better. I have to agree that namazu rocks. :^)
Major operations are "create, update" for working with the index, then "search,
Ooh, goody, a "research assistant!" (Score:3, Funny)
Reiser4 (Score:2)
But, down the road, maybe ReiserFS4 will do the trick.
Re:Reiser4 (Score:2)
Penetrator (Score:4, Informative)
It acts as a web search engine or a grep-like tool. You can either build an index or configure it to search on demand.
You can add filters to deal with any file type you like. I use xpdf and wvWare to handle PDFs and word documents.
A little bit of work to set up, but it's a nifty bit of software.
Re:Penetrator (Score:1)
Re:Penetrator (Score:2, Informative)
Also - your default pdftotext setting seems to barf on files with spaces in their names. I changed the line to '%s' and this seems to work.
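The quoting fix makes sense if the filter command is handed to a shell, which is an assumption here since how Penetrator actually runs its filters is not confirmed in this thread. In sh or bash, an unquoted file name containing a space is split into two arguments, as a quick demo with a made-up count_args helper shows:

```shell
# count_args is a hypothetical helper that prints how many arguments it got.
count_args() { echo "$#"; }

f='my report.pdf'
count_args $f      # unquoted: split at the space, prints 2
count_args "$f"    # quoted: passed as one argument, prints 1
```

A converter such as pdftotext that receives the two halves as separate arguments has no way of knowing they were meant to be one file name.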
Cheers,
Simon
Re:Penetrator (Score:2)
> Also - your default pdftotext setting...
I didn't write Penetrator, it just looks identical to the program I was planning to write before I found it :)
Re:Penetrator (Score:2)
Also worth looking at is SWISH-Enhanced [swish-e.org]. It can deal with Word .docs (with a bit of configuration), PDF, HTML and anything you can filter to text.
doodle (Score:2, Informative)
or simply do a
#apt-get install doodle
Haven't tried it, but SWISH sounds good (Score:4, Informative)
Re:Haven't tried it, but SWISH sounds good (Score:5, Informative)
And yes, it works VERY well, orders of magnitude faster than ht://Dig, features incremental reindexing, and can be configured to auto-convert various filetypes to text before indexing. I'd say it's exactly what the poster ordered.
Google search appliance (Score:2)
Glimpse (Score:2)
Here's a snapshot [freeshell.org] of the source for Glimpse 3.0, packaged up from my system as I don't have the original tarfile anymore.