Linux – Desktop Search

A while ago I published a post on Desktop Search on Linux (specifically Ubuntu).  I was far from happy with my conclusions and I felt I needed to re-evaluate all the options to see which would really perform the most accurate search against my information.

Primarily my information consists of Microsoft Office documents, Open Office documents, pictures (JPEG, as well as Canon RAW and Nikon RAW), web pages, archives, and email (stored as RFC822/RFC2822 compliant files with an eml extension).

My test metrics would be to take a handful of search terms which I new existed in various types of documents, and check the results (I actually used Microsoft Windows Search 4.0 to prepare a complete list of documents that matched the query — since I knew it worked as expected).

The search engines I tested were:

I was able to install, configure, and launch each of the applications.  Actually none of them were really that difficult to install and configure; but all of them required searching through documentation and third party sites — I’d say poor documentation is just something you have to get used to.

Beagle, Google, Tracker, Pinot, and Recoll all failed to find all the documents of interest… none of them properly indexed the email files — most of the failed to handle plain text files; that didn’t leave a very high bar to pick a winner.

Queries on Strigi actually provided every hit that the same query provided on Windows Search… though I have to say Windows Search was easier to setup and use.

I tried the Neopomuk (KDE) interface for Strigi — though it just didn’t seem to work as well as strigiclient did… and certainly strigiclient was pretty much at the top of the list for butt-ugly, user-hostile, un-intuitive applications I’d ever seen.

After all of the time I’ve spent on desktop search for Linux I’ve decided all of the search solutions are jokes.  None of them are well thought out, none of them are well executed, and most of them out right don’t work.

Like most Linux projects, more energy needs to be focused on working out a framework for search than everyone going off half-cocked and creating a new search paradigm.

The right model is…

A single multi-threaded indexer running in the background indexing files according to a system wide policy aggregated with user policies (settable by each user on directories they own) along with the access privileges.

A search API that takes the user/group and query to provide results for items that the user has (read) access to.

The indexer should be designed to use plug-in modules to handle particular file types (mapped both by file extension, and by file content).

The index should also be designed to use plug-in modules for walking a file system and receiving file system change events (that allows the framework to adapt as the Linux kernel changes — and would support remote indexing as well).

Additionally, the index/search should be designed with distributed queries in mind (often you want to search many servers, desktops, and web locations simultaneously).

Then it becomes a simple matter for developers to write new/better indexer plug-ins; and better search interfaces.

I’ve pointed out in a number of recent posts that you can effective use Linux as a server platform in your business; however, it seems that if search is a requirement you might want to consider ponying up the money for Microsoft Windows Server 2008 and enjoy seamless search (that works) between your Windows Vista / Windows 7 Desktops and Windows Server.

REFERENCES:

Ubuntu – Desktop Search

Originally posted 2010-07-16 02:00:19.