Entries Tagged as 'Desktop Search'

Linux – Desktop Search

A while ago I published a post on Desktop Search on Linux (specifically Ubuntu).  I was far from happy with my conclusions and I felt I needed to re-evaluate all the options to see which would really perform the most accurate search against my information.

Primarily my information consists of Microsoft Office documents, Open Office documents, pictures (JPEG, as well as Canon RAW and Nikon RAW), web pages, archives, and email (stored as RFC822/RFC2822 compliant files with an eml extension).

My test was to take a handful of search terms which I knew existed in various types of documents, and check the results. I actually used Microsoft Windows Search 4.0 to prepare a complete list of documents that matched each query, since I knew it worked as expected.
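For anyone wanting to repeat this kind of comparison, here is a minimal sketch of how a candidate engine's results could be scored against a baseline list. The function name and the file names in the usage note are made up for illustration; I did the original comparison by hand.

```python
from typing import Iterable, List, Tuple


def compare_to_baseline(
    baseline: Iterable[str], candidate: Iterable[str]
) -> Tuple[float, List[str], List[str]]:
    """Compare one engine's hits for a query against a trusted baseline list.

    Returns (recall, missed, extra):
      recall - fraction of baseline documents the candidate also found
      missed - baseline documents the candidate did not return
      extra  - candidate documents that are not in the baseline
    """
    expected = set(baseline)
    found = set(candidate)
    missed = sorted(expected - found)
    extra = sorted(found - expected)
    # An empty baseline means there was nothing to miss.
    recall = len(expected & found) / len(expected) if expected else 1.0
    return recall, missed, extra
```

Called with a baseline of four documents and a candidate list that finds only two of them, this reports a recall of 0.5 along with the two missed paths, which is the sort of shortfall described below.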

The search engines I tested were Beagle, Google Desktop Search, Pinot, Recoll, Strigi, and Tracker.

I was able to install, configure, and launch each of the applications.  Actually none of them were really that difficult to install and configure; but all of them required searching through documentation and third party sites — I’d say poor documentation is just something you have to get used to.

Beagle, Google, Tracker, Pinot, and Recoll all failed to find all the documents of interest… none of them properly indexed the email files, and most of them failed to handle plain text files; that didn’t leave a very high bar to pick a winner.

Queries on Strigi actually provided every hit that the same query provided on Windows Search… though I have to say Windows Search was easier to set up and use.

I tried the Nepomuk (KDE) interface for Strigi, though it just didn’t seem to work as well as strigiclient did… and strigiclient was certainly near the top of the list of butt-ugly, user-hostile, unintuitive applications I’ve ever seen.

After all of the time I’ve spent on desktop search for Linux I’ve decided all of the search solutions are jokes.  None of them are well thought out, none of them are well executed, and most of them outright don’t work.

As with most Linux projects, more energy needs to be focused on working out a common framework for search than on everyone going off half-cocked and creating a new search paradigm.

The right model is…

A single multi-threaded indexer running in the background indexing files according to a system wide policy aggregated with user policies (settable by each user on directories they own) along with the access privileges.

A search API that takes the user/group and query to provide results for items that the user has (read) access to.
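To make the idea concrete, here is a rough sketch of such a search call. It assumes the index keeps each file's owner, group, and permission bits next to its text; the `IndexedFile` record and both function names are hypothetical, and the check covers only classic Unix mode bits (no ACLs, no parent directory permissions).

```python
import stat
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class IndexedFile:
    """Hypothetical index record: path plus the ownership data captured at index time."""
    path: str
    owner_uid: int
    group_gid: int
    mode: int  # permission bits, as found in st_mode


def can_read(entry: IndexedFile, uid: int, gids: Set[int]) -> bool:
    """Classic owner/group/other read check; root can read everything."""
    if uid == 0:
        return True
    if entry.owner_uid == uid:
        return bool(entry.mode & stat.S_IRUSR)
    if entry.group_gid in gids:
        return bool(entry.mode & stat.S_IRGRP)
    return bool(entry.mode & stat.S_IROTH)


def search(
    index: Iterable[Tuple[IndexedFile, str]], query: str, uid: int, gids: Set[int]
) -> List[str]:
    """Return paths whose text contains every query word and that the caller may read."""
    terms = query.lower().split()
    hits = []
    for entry, text in index:
        words = set(text.lower().split())
        if all(term in words for term in terms) and can_read(entry, uid, gids):
            hits.append(entry.path)
    return hits
```

The point of putting the filter inside the API rather than in each client is that one shared index can serve every user without leaking file names or contents across accounts.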

The indexer should be designed to use plug-in modules to handle particular file types (mapped both by file extension, and by file content).
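A plug-in registry along these lines could be as small as the sketch below. The `ExtractorRegistry` class and its method names are my own invention for illustration; the only real library used is Python's standard `email` package, which handles the RFC822 style eml files that tripped up most of the engines I tested.

```python
import email
import email.policy
import os
from typing import Callable, Dict, List, Optional, Tuple

# An extractor turns raw file bytes into searchable text.
Extractor = Callable[[bytes], str]


class ExtractorRegistry:
    """Maps files to extractor plug-ins by leading content bytes or by extension."""

    def __init__(self) -> None:
        self._by_extension: Dict[str, Extractor] = {}
        self._by_prefix: List[Tuple[bytes, Extractor]] = []

    def register_extension(self, ext: str, extractor: Extractor) -> None:
        self._by_extension[ext.lower()] = extractor

    def register_prefix(self, prefix: bytes, extractor: Extractor) -> None:
        self._by_prefix.append((prefix, extractor))

    def find(self, filename: str, data: bytes) -> Optional[Extractor]:
        # Content wins over the extension, since extensions can be wrong or missing.
        for prefix, extractor in self._by_prefix:
            if data.startswith(prefix):
                return extractor
        ext = os.path.splitext(filename)[1].lower()
        return self._by_extension.get(ext)


def extract_plain_text(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


def extract_eml(data: bytes) -> str:
    """Return the subject line followed by the plain-text body of an email file."""
    msg = email.message_from_bytes(data, policy=email.policy.default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return f"{msg['subject'] or ''}\n{text}"
```

Supporting a new file type then means writing one function and registering it, without touching the indexer itself.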

The indexer should also be designed to use plug-in modules for walking a file system and receiving file system change events (that allows the framework to adapt as the Linux kernel changes, and would support remote indexing as well).
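As a sketch of what that plug-in boundary might look like, the indexer could depend only on an abstract change source. The version below is a naive polling implementation built on `os.walk`; a kernel notification back end or a remote file system could be dropped in behind the same interface. The class names and event names are assumptions of mine, not taken from any existing project.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class ChangeSource(ABC):
    """Plug-in interface: report what changed since the previous call."""

    @abstractmethod
    def changes(self) -> List[Tuple[str, str]]:
        """Return (event, path) pairs; event is 'added', 'modified' or 'removed'."""


class PollingSource(ChangeSource):
    """Fallback plug-in that rescans a directory tree and compares modification times."""

    def __init__(self, root: str) -> None:
        self.root = root
        self._seen: Dict[str, int] = {}

    def changes(self) -> List[Tuple[str, str]]:
        current: Dict[str, int] = {}
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    current[path] = os.stat(path).st_mtime_ns
                except OSError:
                    continue  # file vanished between listing and stat
        events = []
        for path, mtime in current.items():
            if path not in self._seen:
                events.append(("added", path))
            elif self._seen[path] != mtime:
                events.append(("modified", path))
        for path in self._seen:
            if path not in current:
                events.append(("removed", path))
        self._seen = current
        return sorted(events)
```

Polling is the slowest possible option, but it works everywhere, which makes it a reasonable default behind an interface that faster plug-ins can replace.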

Additionally, the index/search should be designed with distributed queries in mind (often you want to search many servers, desktops, and web locations simultaneously).

Then it becomes a simple matter for developers to write new/better indexer plug-ins; and better search interfaces.

I’ve pointed out in a number of recent posts that you can effectively use Linux as a server platform in your business; however, it seems that if search is a requirement you might want to consider ponying up the money for Microsoft Windows Server 2008 and enjoy seamless search (that works) between your Windows Vista / Windows 7 Desktops and Windows Server.

REFERENCES:

Ubuntu – Desktop Search

Originally posted 2010-07-16 02:00:19.

Ubuntu – Desktop Search

Microsoft has really shown the power of desktop search in Vista and Windows 7; their newest Desktop Search Engine works, and works well… so in my quest to migrate over to Linux I wanted to have the ability to have both a server style as well as a desktop style search.

So the quest began… and it was as short a quest as marching on the top of a butte.

I started by reviewing what I could find on the major contenders (just do an Internet search, and you’ll only find about half a dozen reasonable articles comparing the various desktop search solutions for Linux)… which were few enough it didn’t take very long (alphabetical):

My metrics for evaluating a desktop search solution would focus on the following points:

  • ease of installation, configuration, maintenance
  • search speed
  • search accuracy
  • ease of access to search (applet, web, participation in Windows search)
  • resource utilization (cpu and memory on indexing and searching)

I immediately passed on Google Desktop Search; I have no desire for Google to have more access to information about me; and I’ve tried it before in virtual machines and didn’t think very much of it.

Beagle

I first tried Beagle; it sounded like the most promising of all the search engines, and Novell was one of the developers behind it so I figured it would be a stable baseline.

It was easy to install and configure (the package manager did most of the work), and I could use either the search application or the web search. The web interface had to be enabled first using beagle-config:

beagle-config Networking WebInterface true

And then I could just go to port 4000 (either locally or remotely).

I immediately did a test search; nothing came back.  Wow, how disappointing — several hundred documents in my home folder should have matched.  I waited and tried again — still nothing.

While I liked what I saw, a search engine that couldn’t return reasonable results to a simple query (at all) was just not going to work for me… and since Beagle isn’t actively developed any longer, I’m not going to hold out for them to fix a “minor” issue like this.

Tracker

My next choice to experiment with was Tracker; you couldn’t ask for an easier desktop search to experiment with on Ubuntu — it seems to be the “default”.

One thing that’s important to mention — you’ll have to enable the indexer (per-user), it’s disabled by default.  Just use the configuration tool (you might need to install an additional package):

tracker-preferences

Same test, but instantly I got about a dozen documents returned, and additional documents started to appear every few seconds.  I could live with this; after all I figured it would take a little while to totally index my home directory (I had rsync’d a copy of all my documents, emails, pictures, etc from my Windows 2008 server to test with, so there was a great deal of information for the indexer to handle).

The big problem with Tracker was there was no web interface that I could find (yes, I’m sure I could write my own web interface; but then again, I could just write my own search engine).

Strigi

On to Strigi: straightforward to install, and easy to use… but it didn’t seem to give me the results I’d gotten quickly with Tracker (though better than Beagle), and it seemed to be limited to only ten results (WTF?).

I honestly didn’t even look for a web interface for Strigi; it was way too much of a disappointment (in fact, I think I’d rather have put more time into Beagle to figure out why I wasn’t getting search results than work with Strigi).

Recoll

My last test was with Recoll; it looked promising from all that I read, but everyone seemed to indicate it was difficult to install and that you needed to build it from source.

Well, there’s an Ubuntu package for Recoll — so it’s just as easy to install; it just was a waste of effort to install.

I launched the recoll application, and typed a query in — no results came back, but numerous errors were printed in my terminal window.  I checked the preferences, and made a couple minor changes — ran the search query again — got a segmentation fault, and called it a done deal.

It looked to me from the size of the database files that Recoll had indexed quite a bit of my folder; why it wouldn’t give me any search results (and seg faulted) was beyond me — but it certainly was something I’d seen before with Linux based desktop search.

Conclusions

My biggest conclusion was that Desktop Search on Linux just isn’t really something that’s ready for prime time.  It’s a joke — a horrible joke.

Of the search engines I tried, only Tracker worked reasonably well, and it has no web interface, nor does it participate in a Windows search query (SMB2 feature which directs the server to perform the search when querying against a remote file share).

I’ve been vocal in the past that Linux fails as a Desktop because of the lack of a cohesive experience; but it appears that Desktop Search (or search in general) is a failing of Linux as both a Desktop and a Server, and clearly a reason why choosing Windows Server 2008 is the only reasonable choice for businesses.

The only upside to this evaluation was that it took less time to do than to read about or write up!

Originally posted 2010-07-06 02:00:58.

strigiSearch

I’ve been working with desktop search solutions, and I’ve determined through a great deal of perspiration that strigi seems to be the only Linux based desktop search engine that reliably indexes the majority of files that I’m interested in searching (Microsoft Office, Open Office, and files containing RFC822 compliant email).

While the core engine for strigi might be the best I found, the client interface and tools leave a great deal to be desired.  In fact, to really figure out how to use strigi I needed to download and peruse the source.

In the source I found an elegant command line Perl script (search.pl) which demonstrated how to submit queries to strigi through a Linux socket.  The script was easy to port over to PHP5, and in doing so I added some enhancements in my core routines to make it easier to write a search client; that is, I organized the results as array elements so that I could easily manipulate them rather than needing to parse them apart.

I haven’t gone any further than just writing some basic PHP5 functions and a harness that drives them from the command line; but I’m posting the code for others.  Just run the script from the command line, and provide the query as arguments (if the query has a space, make sure you enclose it in quotes; you can put multiple queries on the command line as well).

Running the program and seeing the dump of the data is by far the easiest way to understand what I’ve done.

strigiSearch.7z