Entries Tagged as 'Search'

OS X – Desktop Search

I’m posting this mainly to illustrate that Microsoft isn’t the only one that gets the importance of desktop search — Apple’s Spotlight provides much the same level of functionality as Windows Search in an equally seamless implementation.

So the question (once again) is: why are all the Linux-based desktop search solutions so pathetic?

Originally posted 2010-07-20 02:00:15.

Ubuntu – Desktop Search

Microsoft has really shown the power of desktop search in Vista and Windows 7; their newest desktop search engine works, and works well… so in my quest to migrate over to Linux I wanted both a server-style and a desktop-style search.

So the quest began… and it was as short a quest as marching to the top of a butte.

I started by reviewing what I could find on the major contenders (just do an Internet search, and you’ll only find about half a dozen reasonable articles comparing the various desktop search solutions for Linux)… which were few enough that it didn’t take very long (alphabetical):

  • Beagle
  • Google Desktop Search
  • Recoll
  • Strigi
  • Tracker

My metrics to evaluate a desktop search solution would focus on the following points:

  • ease of installation, configuration, maintenance
  • search speed
  • search accuracy
  • ease of access to search (applet, web, participation in Windows search)
  • resource utilization (cpu and memory on indexing and searching)

I immediately passed on Google Desktop Search; I have no desire for Google to have more access to information about me, and I’ve tried it before in virtual machines and didn’t think very much of it.

Beagle

I first tried Beagle; it sounded like the most promising of all the search engines, and Novell was one of the developers behind it, so I figured it would be a stable baseline.

It was easy to install and configure (the package manager did most of the work), and I could use either the search application or the web search — though I had to enable the web interface using beagle-config:

beagle-config Networking WebInterface true

And then I could just go to port 4000 (either locally or remotely — e.g. http://localhost:4000/).

I immediately did a test search; nothing came back.  Wow, how disappointing — several hundred documents in my home folder should have matched.  I waited and tried again — still nothing.

While I liked what I saw, a search engine that couldn’t return reasonable results to a simple query (at all) was just not going to work for me… and since Beagle isn’t actively developed any longer, I’m not going to hold out for a fix for a “minor” issue like this.

Tracker

My next choice to experiment with was Tracker; you couldn’t ask for an easier desktop search to experiment with on Ubuntu — it seems to be the “default”.

One thing that’s important to mention — you’ll have to enable the indexer (per user); it’s disabled by default.  Just use the configuration tool (you might need to install an additional package):

tracker-preferences
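
If you’d rather test from a terminal, the tracker tools also include a command line client, tracker-search, which queries the same index (the search term below is just an example):

tracker-search vacation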

Same test, but this time I instantly got about a dozen documents back, with additional documents appearing every few seconds.  I could live with this; after all, I figured it would take a little while to totally index my home directory (I had rsync’d a copy of all my documents, emails, pictures, etc. from my Windows 2008 server to test with, so there was a great deal of information for the indexer to handle).

The big problem with Tracker was that there was no web interface that I could find (yes, I’m sure I could write my own web interface; but then again, I could just write my own search engine).

Strigi

On to Strigi — straightforward to install, and easy to use… but it didn’t seem to give me the results I’d gotten so quickly with Tracker (though it did better than Beagle), and it seemed to be limited to only ten results (WTF?).

I honestly didn’t even look for a web interface for Strigi — it was too much of a disappointment (in fact, I think I’d rather have put more time into figuring out why Beagle wasn’t returning search results than work with Strigi).

Recoll

My last test was with Recoll; while it looked promising from all that I read, everyone seemed to indicate it was difficult to install and that you needed to build it from source.

Well, there’s an Ubuntu package for Recoll — so it’s just as easy to install as the rest; it just turned out to be a waste of effort.

I launched the recoll application and typed in a query — no results came back, but numerous errors were printed in my terminal window.  I checked the preferences and made a couple of minor changes, ran the search query again, got a segmentation fault, and called it a done deal.

It looked to me from the size of the database files that Recoll had indexed quite a bit of my folder; why it wouldn’t give me any search results (and seg faulted) was beyond me — but it certainly was something I’d seen before with Linux-based desktop search.

Conclusions

My biggest conclusion was that Desktop Search on Linux just isn’t really something that’s ready for prime time.  It’s a joke — a horrible joke.

Of the search engines I tried, only Tracker worked reasonably well, and it has no web interface, nor does it participate in a Windows search query (an SMB2 feature which directs the server to perform the search when querying against a remote file share).

I’ve been vocal in the past that Linux fails as a Desktop because of the lack of a cohesive experience; but it appears that Desktop Search (or search in general) is a failing of Linux as both a Desktop and a Server — and clearly a reason why Windows Server 2008 is the only reasonable choice for businesses.

The only upside to this evaluation was that it took less time to do than to read about or write up!

Originally posted 2010-07-06 02:00:58.

Linux – Desktop Search

A while ago I published a post on Desktop Search on Linux (specifically Ubuntu).  I was far from happy with my conclusions, and I felt I needed to re-evaluate all the options to see which would really perform the most accurate search against my information.

Primarily my information consists of Microsoft Office documents, OpenOffice documents, pictures (JPEG, as well as Canon RAW and Nikon RAW), web pages, archives, and email (stored as RFC 822/RFC 2822-compliant files with an .eml extension).

My test metric was to take a handful of search terms which I knew existed in various types of documents and check the results (I actually used Microsoft Windows Search 4.0 to prepare a complete list of documents that matched each query — since I knew it worked as expected).

The search engines I tested were:

  • Beagle
  • Google Desktop Search
  • Pinot
  • Recoll
  • Strigi
  • Tracker

I was able to install, configure, and launch each of the applications.  None of them were really that difficult to install and configure, but all of them required searching through documentation and third-party sites — I’d say poor documentation is just something you have to get used to.

Beagle, Google, Tracker, Pinot, and Recoll all failed to find all the documents of interest… none of them properly indexed the email files, and most of them failed to handle plain text files; that didn’t leave a very high bar to pick a winner.

Queries on Strigi actually returned every hit that the same query produced on Windows Search… though I have to say Windows Search was easier to set up and use.

I tried the Nepomuk (KDE) interface for Strigi — though it just didn’t seem to work as well as strigiclient did… and certainly strigiclient was pretty much at the top of the list of butt-ugly, user-hostile, unintuitive applications I’d ever seen.

After all of the time I’ve spent on desktop search for Linux, I’ve decided all of the search solutions are jokes.  None of them are well thought out, none of them are well executed, and most of them outright don’t work.

Like most Linux projects, more energy needs to be focused on working out a framework for search, rather than everyone going off half-cocked and creating yet another search paradigm.

The right model is…

A single multi-threaded indexer running in the background, indexing files according to a system-wide policy aggregated with per-user policies (settable by each user on directories they own), along with the access privileges.

A search API that takes the user/group and a query, and provides results only for items that the user has (read) access to.

The indexer should be designed to use plug-in modules to handle particular file types (mapped both by file extension and by file content).

The indexer should also be designed to use plug-in modules for walking a file system and receiving file system change events (that allows the framework to adapt as the Linux kernel changes — and would support remote indexing as well).

Additionally, the index/search should be designed with distributed queries in mind (often you want to search many servers, desktops, and web locations simultaneously).

Then it becomes a simple matter for developers to write new/better indexer plug-ins and better search interfaces.
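
To make that concrete, here is a minimal sketch of what the plug-in contracts for such a framework could look like.  Everything here is hypothetical (no such API exists today), and Python is used purely for illustration:

# Hypothetical plug-in contracts for the framework described above.
# All names are illustrative; nothing here corresponds to a real API.
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List

class ContentFilter(ABC):
    """Extracts indexable text for one file type, selected both by
    file extension and by file content signature."""
    extensions: Iterable[str] = ()   # e.g. (".odt", ".eml")
    magic: bytes = b""               # leading bytes identifying the type

    @abstractmethod
    def extract_text(self, path: str) -> str: ...

class SourceWalker(ABC):
    """Enumerates a file system and reports change events, isolating
    the framework from kernel specifics and enabling remote indexing."""
    @abstractmethod
    def walk(self, root: str) -> Iterable[str]: ...

    @abstractmethod
    def watch(self, root: str, on_change: Callable[[str], None]) -> None: ...

class SearchService:
    """One background indexer per system, plus the query entry point;
    results are filtered to items the caller can actually read."""
    def register_filter(self, f: ContentFilter) -> None: ...
    def register_walker(self, w: SourceWalker) -> None: ...

    def query(self, user: str, groups: List[str], terms: str) -> List[str]:
        """Return matching paths the user/groups have read access to (stub)."""
        return []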

I’ve pointed out in a number of recent posts that you can effectively use Linux as a server platform in your business; however, it seems that if search is a requirement you might want to consider ponying up the money for Microsoft Windows Server 2008 and enjoying seamless search (that works) between your Windows Vista / Windows 7 desktops and Windows Server.

REFERENCES:

Ubuntu – Desktop Search

Originally posted 2010-07-16 02:00:19.

Desktop Search

Let me start by saying that Windows Desktop Search is a great addition to Windows; and while it might have taken four major releases to get it right, for the most part it works and it works well.

With Windows Server 2008, Windows Vista, and Windows 7, Desktop Search is installed and enabled by default; and it works in a federated mode (meaning that you can search from a client against a server via the network).

Desktop Search, however, seems to have some issues with junction points (specifically, in the cases I’ve seen, directory reparse points — i.e. directory links).

The search index service seems to do the right thing and not create duplicate entries when both the parent of the link and the target are to be indexed (though I don’t know how you would control whether or not the indexer follows links in the case where the target wouldn’t normally be indexed).

The search client, though, does not seem to properly provide results when junction points are involved.

Let me illustrate by example.

Say we have directory tree D1 and directory tree D2 and both of those are set to be indexed.  If we do a search on D1 it produces the expected results.  If we do a search on D2 it produces the expected results.

Now say we create a junction point (link) to D2 from inside D1 called L1.  If we do a search on L1 we do not get the same results as if we’d searched in D2.
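
For reference, such a junction could be created on Windows with the built-in mklink tool (the drive and paths here are purely illustrative):

mklink /J C:\D1\L1 C:\D2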

My expectation would be that the search was “smart” enough to do the search against D2 (taking the link into consideration) and then present the results with the path altered to reflect the link L1.

I consider this a deficiency; in fact, it appears to me to be a major failing, since the user of information shouldn’t be responsible for understanding all the underlying technology involved in organizing the information — he should just be able to obtain the results he expects.

It’s likely the client and the search server need some changes in order to accommodate this; and I would say that the indexer also needs a setting that would force it to follow links (though it shouldn’t store the same document information twice).

If this were a third party search solution running on Windows my expectation would be that file system constructs might not be handled properly; but last time I checked the same company wrote the search solution, the operating system, and the file system — again, perhaps more effort should be put into making things work right, rather than making things [needlessly] different.

Originally posted 2010-01-22 01:00:57.

Windows – Desktop Search

Most people realize how valuable Internet search engines are; but not everyone has figured out how valuable desktop (and server) search engines can be.

Even in corporate environments where data storage is highly organized it’s easy to forget where something is, or not know that someone else has already worked on a particular document — but if you could quickly and efficiently search all the public data on all the machines in your organization (or home) you could find those pieces of information you either misplaced or never knew about.

With Windows Search it just happens.  If you have access to a document, and you search — you can find it.  Open up a file explorer window and point it at a location you think it might be in, type in the search box — and matching documents quickly appear (and those that don’t match disappear).  Do the same thing against a remote share — and it happens magically (the remote box does all the work).  It’s even possible to search multiple servers simultaneously — and it doesn’t require a rocket scientist to set up.

Windows Search is already on Windows 7 and Windows Server 2008 as well as Windows Vista (you’ll want to apply updates) — and it’s easily installable on Windows XP and Windows Server 2003.  In fact, the defaults will probably do fine — just install and go (of course it will take a little while to index all your information).

A developer can fairly easily enhance search to include more document types using IFilters (there are plenty of examples, and it’s a model that Microsoft has employed in many parts of Windows)…  The search interface can be used via API, embedded in a web page, or just used directly from the search applet (which appears auto-magically in Windows 7 and Windows Vista).
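
As an example of the API side, the index can be queried through its OLE DB provider using a SQL-like syntax; here’s a minimal sketch in Python (assuming the pywin32 package is installed; the search term ‘budget’ is just an example):

# Minimal sketch: query the local Windows Search index via its
# OLE DB provider (Search.CollatorDSO) using ADO through pywin32.
import win32com.client

conn = win32com.client.Dispatch("ADODB.Connection")
conn.Open("Provider=Search.CollatorDSO;"
          "Extended Properties='Application=Windows';")

# 'budget' is an example term; SystemIndex is the built-in catalog.
rs, _ = conn.Execute(
    "SELECT System.ItemPathDisplay FROM SystemIndex "
    "WHERE FREETEXT('budget')")

while not rs.EOF:
    print(rs.Fields.Item("System.ItemPathDisplay").Value)
    rs.MoveNext()

conn.Close()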

Very few Microsoft products are worth praise — but Windows Search is; and from my personal experience no competitor on any platform compares.

To those looking to write a “new” desktop search: look at Windows Search and understand what it does and how it works before you start your design.

Windows Search

Originally posted 2010-07-17 02:00:24.

strigiSearch

I’ve been working with desktop search solutions, and I’ve determined through a great deal of perspiration that strigi seems to be the only Linux-based desktop search engine that reliably indexes the majority of files that I’m interested in searching (Microsoft Office, OpenOffice, and files containing RFC 822-compliant email).

While the core engine for strigi might be the best I found, the client interface and tools leave a great deal to be desired.  In fact, to really figure out how to use strigi I needed to download and peruse the source.

In the source I found an elegant command line Perl script (search.pl) which demonstrated how to submit queries to strigi through a local socket.  The script was easy to port over to PHP5, and in doing so I added some enhancements to my core routines to make it easier to write a search client — that is, I organized the results as array elements so that I could easily manipulate them rather than needing to parse them apart.

I haven’t gone any further than writing some basic PHP5 functions and a harness that drives them from the command line; but I’m posting the code for others.  Just run the script from the command line and provide the query as arguments (if the query has a space, make sure you enclose it in quotes — and you can put multiple queries on the command line as well).

Running the program and seeing the dump of the data is by far the easiest way to understand what I’ve done.
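
If you just want the gist without downloading anything, the flow is roughly the following, sketched here in Python rather than PHP.  Note that the socket path and the message framing below are placeholders from memory; search.pl in the strigi source remains the authoritative reference for the real protocol.

# Rough sketch of the idea behind search.pl and my PHP5 port: send a
# query to the strigi daemon over its local socket and return the
# reply as a list the caller can manipulate.  The socket path and the
# framing are assumptions; consult search.pl for the real protocol.
import os
import socket
import sys

def strigi_query(terms):
    sock_path = os.path.expanduser("~/.strigi/socket")  # assumed location
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    s.sendall(("query\n" + terms + "\n\n").encode())     # assumed framing
    data = b""
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        data += chunk
    s.close()
    # One hit per line; return structured results instead of raw text.
    return [line for line in data.decode("utf-8", "replace").splitlines() if line]

if __name__ == "__main__":
    for q in sys.argv[1:]:                               # quote multi-word queries
        print(q, "=>", strigi_query(q))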

strigiSearch.7z