Handle the Subject Availability

Registered by Jason Smith on 2009-10-10

In its current form Zeitgeist (0.3+ series) will frequently return results for non-existent files. This is modestly to greatly annoying to application authors, who are then forced to do this filtering manually (along with some additional guesswork to pad out the filtered results and such). As a rule, ZG should not return results for files which no longer exist or are not currently available.

It should be possible to specify in the query whether or not to allow for unavailable items.

Blueprint information

Zeitgeist Framework Team
Mikkel Kamstrup Erlandsen
Needs approval
Mikkel Kamstrup Erlandsen
Series goal:
Accepted for 0.8
Milestone target:
Started by
Mikkel Kamstrup Erlandsen on 2009-11-23
Completed by
Seif Lotfy on 2011-05-07


STATUS: kamstrup started hacking on a gio-based storage volume monitor (2010-01-16). It's not hooked up to the DB yet, and is far from functioning. We also still haven't wired the queries in the core engine to properly handle storage mediums. Monitoring the network via NetworkManager and ConnMan has not been started yet, but should be straightforward once the volume handling is in place.

In the resonance engine (0.3 series) all event subjects link to a storage medium, which is an entry in a table called 'storage'. The storage table has three columns: INTEGER id, VARCHAR system_label, INTEGER storage_state.

One row in the storage table might represent your USB pendrive and have system_label set to the UUID of that drive. When you unplug the drive Zeitgeist should set storage_state to StorageState.NotAvailable. We will do similar things with subjects requiring internet connectivity.

To handle deleted subjects we should have a special storage medium called "Deleted" or something like that, that always has storage_state set to StorageState.NotAvailable.
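
As a rough illustration of the storage table described above, here is a minimal sqlite3 sketch. Only the three storage columns and the names StorageState.NotAvailable and "Deleted" come from the text; the numeric state values, the simplified 'subject' table, the example labels and the helper names are assumptions made for the example, not the engine's real schema.

```python
import sqlite3
from enum import IntEnum


class StorageState(IntEnum):
    # Numeric values are assumed; the text only names NotAvailable.
    NotAvailable = 0
    Available = 1


def build_demo_db():
    """Create an in-memory DB with the three-column storage table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE storage (
            id INTEGER PRIMARY KEY,
            system_label VARCHAR UNIQUE,
            storage_state INTEGER
        );
        -- Hypothetical, simplified subject table pointing into storage.
        CREATE TABLE subject (
            uri VARCHAR PRIMARY KEY,
            storage_id INTEGER REFERENCES storage(id)
        );
    """)
    db.executemany("INSERT INTO storage VALUES (?, ?, ?)", [
        (1, "local", int(StorageState.Available)),
        (2, "usb-uuid-1234", int(StorageState.Available)),  # made-up label
        (3, "Deleted", int(StorageState.NotAvailable)),     # never available
    ])
    db.executemany("INSERT INTO subject VALUES (?, ?)", [
        ("file:///home/me/notes.txt", 1),
        ("file:///media/usb/song.mp3", 2),
    ])
    return db


def set_storage_state(db, system_label, state):
    """Flip one storage row, e.g. when a drive is unplugged."""
    db.execute("UPDATE storage SET storage_state = ? WHERE system_label = ?",
               (int(state), system_label))


def mark_deleted(db, uri):
    """Re-point a subject at the special 'Deleted' storage medium."""
    db.execute("""UPDATE subject SET storage_id =
                      (SELECT id FROM storage WHERE system_label = 'Deleted')
                  WHERE uri = ?""", (uri,))


def available_uris(db):
    """Return the URIs whose storage medium is currently available."""
    rows = db.execute("""SELECT subject.uri FROM subject
                         JOIN storage ON storage.id = subject.storage_id
                         WHERE storage.storage_state = ?
                         ORDER BY subject.uri""",
                      (int(StorageState.Available),))
    return [uri for (uri,) in rows]
```

Unplugging the drive is then a single-row update (set_storage_state(db, "usb-uuid-1234", StorageState.NotAvailable)), after which available_uris() no longer returns anything stored on it.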


*** Approval
Seif: +1
RainCT: +1 *
kamstrup: -1
thekorn: -1

*** Discussion
Seif: I would propose we add a "reachable" column or "reachable" table in the item DB which is set to true or false. We then need a monitor that checks every 10 minutes whether the subjects are reachable and re-marks them. Or write an inotify DP that subscribes to "rename", "move" and "delete" and modifies the columns or the URIs accordingly. We could also just delete the item from the item table.

We should not delete the items from the event table since it will mess up the context usage. Again if you eat an apple then it does not exist anymore but its memory remains and the time dedicated to it is still there :)


RainCT: Getting this right both performance-wise and speed-wise (as in how long we have wrong data) will be tricky though. Files on removable media are also problematic.


kamstrup: Polling for existence of all recently used files every 10 minutes is just not a working solution imho - sorry! It will needlessly burn battery and will not be very nice to work with anyway - if I am actively using my computer a lot of files might come and go all the time; we need instant reflection of this (or otherwise we should not do it at all - which I might be inclined towards).

On top of that, file system crawling is not an easy task at all - ask the Tracker developers about this :-) Hog the machine's IO resources and the responsiveness of the entire desktop grinds to a halt. Remember the "good" olde Firefox 3.0 days where FF froze the entire desktop for 30s+ every so often?

And - does it end here? Should we also not list online resources if we are not connected? Should we not list files on a non-mounted drive? Etc. etc.

It will save us a bunch of work if we firmly assert that "Zeitgeist is a log". You don't delete entries from a ship's log because one of the sailors falls overboard :-)

We *may* have a chance of getting this right when FANotify becomes mainstream, but I am not sure...


Seif: Like Mikkel said, we cannot change history, so we can't delete events from the log. However, we can mark items (not events) as unreachable. First we will need to extend the item table with a new column "state", or add a new table called "item state" which consists of two columns "item_id, state (int)". The state can be one of 0, 1, 2:

0: reachable which will be the default

1: blacklisted:
the state of blacklisted could be determined by a little file, for which I will create a UI that allows you to set items as blacklisted, thus never exposing their events by default

2: unreachable: I will write a little Dataprovider that uses inotify to detect rename/move and delete events. On delete we will change the state to 2; otherwise we will just update the URIs (I think this dataprovider and the table need a blueprint of their own, which I could take over if you guys don't mind)

I also don't mind pushing this to the hackfest. Basically make 0.3 a stable platform and during the hackfest add these neat features (unless they could break the API, which they shouldn't since this is done internally)


kamstrup: @Seif - If we keep a separate table with the state and make the convention that unlisted items are also reachable then we can save some disk space. It is however a change in the database schema. The general idea of what you outline sounds sensible...

As you said, it is a good topic for the hackfest, but I think that it might be a bit naive to assume that we can accomplish this in a satisfactory way using a simple inotify python script... I would love to be proved wrong though :-)


Seif: @kamstrup please check the related branch! For now this is just a shitty hack from a tutorial! Just change the path in the source to your home directory. :) I am trying


Jason: Not to throw another wrench in the plan here, but just because you shouldn't (in most cases) return bogus results does not mean you shouldn't *keep* them. It is possible for a result to go from existing to not existing to existing again (removable media). I think however that the discussion here has fallen away from the original spirit of the conversation. The problem is as follows.

Most application developers (ok, at least *this* application developer, who uses ZG in two places now) would rather not show "dead" results. Dead results being results that are unreachable at this very instant. No internet == no web results. File doesn't exist? Don't show it. Basically, if I can't click on it and get to it, I don't want to know about it. Now I *can* do this filtering myself, but it then becomes a balancing act. I have to ask ZG for more results than I wish to display. Say I want to display 10, so I ask for 20 and 5 get filtered. All is well. However, quite a bit of the time, 15 or even all 20 get filtered. This means that despite there being valid results out there, I am not displaying them.

To fix this, as an application developer I am doing one of two things. In cases where I only care a little (it's not super important) I just ask for some obscene number, like n * 100 or even n^2 results, and filter those until I have enough. This works most of the time but, as you can imagine, is a bit odd (and not a sure thing). In cases where I really, really care, I send out a request with no upper bound. This can be somewhat slow on "busy" systems, especially if you have somewhat complex filtering on top of it. It should be obvious that neither of these solutions is good, and it does present a clear design problem.

Maybe we can focus our discussion on how exactly to solve this problem without making it very difficult for ZG to manage. Without this kind of filtering (at least as an option) ZG becomes a system where I never use the max_items filter and always just request the whole match set, because I don't know the quality of the results I will get. I guess in short it's a quality control issue.

One last note: all my qualitative performance observations were made on a netbook with a normal HDD. I know I did not provide numbers; I will do benchmarks in the future to show the problem more clearly.


RainCT: I just want to mention that another outstanding topic which can help with some of the same use cases is the "paging / saved queries / whatever we call them" (eg. asking for 5 events matching some filters, then being able to ask for the 5 next); let's not forget to discuss this at Bolzano! (Just to keep it in mind, this doesn't mean we shouldn't implement an "available" filter, both have use cases).


Jason: One more quick note. It seems to me, that a combination of what Seif and everyone have proposed here for solving this issue with files could be augmented with a flag in the item that says "Connectivity Required". So web addresses and so on could be marked as requiring connectivity to view. Then as a consumer of ZG, I can in my standard filters just pass along that I don't want things with connectivity when there is no connectivity. From my end the problem is resolved at that point.


kamstrup: @Jason - I can certainly see your point and I am convinced now that we must find a solution. It is however pretty far from being a trivial problem. How about removable drives for instance? If I unplug my external USB HD with 10.000 mp3s on it I hope we wouldn't need to update 10k rows in the DB...

One idea that would not require massive row updates would be to introduce a "datastore" table with three columns: datastore.storage_id, datastore.storage_name, datastore.available. One row for each storage medium: online, each local harddrive partition, usb keys, etc. The datastore.storage_name could be extracted from the serial number of the mounted device, or otherwise be constructed ad hoc for more fuzzy mediums (eg. online material). Then have the rows in our normal item table point into this structure. When we go offline we only have to twiddle one row in the datastore table to mark all online files unavailable.

I know that the Tracker guys have solved this problem. We should check what they have done... Anyways - just thinking out loud :-)


kamstrup: @Seif: Are you mad? ;-) That script nearly choked my system. It starts adding recursive watches to each and every file under my home dir. Running 'find * | wc -l' I see that my home dir contains 494221 files which is pretty well beyond the default number of inotify watches :-)
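
To make the watch-count problem above concrete, here is a small Linux-only sketch that compares the number of entries under a directory with the per-user inotify watch limit. The procfs path is the standard kernel knob; the function names are made up for this example.

```python
import os


def count_entries(root):
    """Count files and directories below root (symlinks are not followed)."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def read_watch_limit(path="/proc/sys/fs/inotify/max_user_watches"):
    """Return the per-user inotify watch limit, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def exceeds_watch_limit(root, limit=None):
    """True if one watch per entry under root would exceed the limit.

    Returns None when no limit is given and none can be read from procfs.
    """
    if limit is None:
        limit = read_watch_limit()
    if limit is None:
        return None
    return count_entries(root) > limit
```

Calling exceeds_watch_limit(os.path.expanduser("~")) gives a quick answer before a script tries to put a watch on every single entry.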


I think we can reduce this problem to one sentence:
  "Some clients are only interested in existing objects"
I understand this requirement, but I don't think that the engine should care about actively checking the existence of objects, because:
  * it is too expensive (performance wise)
  * and it will be a pain to develop a sane solution
In my opinion we can tackle this issue by implementing this iterator idea. This way clients can do the necessary checks and request as many objects as needed. Also the performance should be better than in the other solution.
In any case, the engine should never delete any object automatically (without client action) from the database.
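
The iterator idea could look roughly like this on the client side. This is only a sketch: fetch_page(offset, limit) stands in for some paged engine query and is hypothetical, as no such call is defined in this discussion.

```python
from itertools import islice


def iter_available(fetch_page, is_available, page_size=20):
    """Lazily yield available results, fetching more pages only on demand.

    fetch_page(offset, limit) is a hypothetical paged query returning a list;
    is_available(item) is whatever check the client cares about.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return  # the engine has no more results
        for item in page:
            if is_available(item):
                yield item
        offset += len(page)


def first_available(fetch_page, is_available, wanted, page_size=20):
    """Collect up to `wanted` available results."""
    return list(islice(iter_available(fetch_page, is_available, page_size),
                       wanted))
```

A client that wants 10 existing files would call first_available(fetch_page, os.path.exists, 10) and only as many pages as needed are requested, instead of guessing at n * 100.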


Fwiw, I agree with kamstrup in that handling removable media should be as easy as getting the device uid from gio and adding metadata to the entry so that, when the device goes missing, you only need to register that the device is no longer available to automatically skip its results.

Wrt returning file uris that don't actually exist, I'd normally say that this is an application issue, in that it needs to decide what to show to users, however I see the issue Jason is having with presenting a set number of 'real' results to a user. In that case, would it work to have some flags as an extra argument to the search that would make zeitgeist do some extra filtering on the results (excuse the C):


or something. This would cause zeitgeist to g_file_test the uris before adding them to the result set (or check the web connection, or check that the drive is plugged in, etc). Of course, I'm not all that familiar with the zeitgeist API, so there may be an easier way to indicate this, but I hope the idea is clear.
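
For illustration, the flags idea might be sketched as follows, in Python rather than C. Every name here (ResultFlags, filter_results, the online parameter) is hypothetical and not part of any Zeitgeist API; os.path.exists plays the role that g_file_test would play in C.

```python
import os
from enum import IntFlag
from urllib.parse import unquote, urlparse


class ResultFlags(IntFlag):
    """Hypothetical query flags; not part of any Zeitgeist API."""
    NONE = 0
    ONLY_EXISTING_FILES = 1   # drop file:// URIs whose path does not exist
    ONLY_REACHABLE_WEB = 2    # drop http(s):// URIs while offline


def filter_results(uris, flags, max_items, online=True):
    """Return up to max_items URIs that pass the checks selected by flags."""
    results = []
    for uri in uris:
        if len(results) >= max_items:
            break
        parsed = urlparse(uri)
        if parsed.scheme == "file":
            path = unquote(parsed.path)
            if flags & ResultFlags.ONLY_EXISTING_FILES and not os.path.exists(path):
                continue
        elif parsed.scheme in ("http", "https"):
            if flags & ResultFlags.ONLY_REACHABLE_WEB and not online:
                continue
        results.append(uri)
    return results
```

Because the checks run before an item is counted against max_items, a request for 10 results yields 10 usable results whenever that many exist.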

The plus points would be you would only test the existence of a file/object when you need to, instead of having to inotify the entire home directory. The downsides I see are an added step before creating the result set, and also maybe some more typing for the developers.

P.S. I think the filtering of unreachable/reachable results shouldn't be automatic but should be developer controllable, because you may want to indicate to the user that the file they want exists, but they've forgotten to plug in their USB drive.


kamstrup: @njpatel : Ok, so we have two kinds of "reachability checking". "Indirect" - which simply checks some parameter in the datastore table I suggested (eg. connectivity), and "direct" - which would require explicit stat()ing of the resources.

But consider the case where I request existing files and the result set contains files from all of : my bluetooth phone, usb key drive, smb shares, mtp-compatible music player, google docs, and of course also my local hard drive. It will require a non-trivial amount of logic to figure out which devices to check and which not to.

Then there is the case where we want to provide the full hit count. Even if the raw result set from sqlite only contains local files, stat()ing, eg. 1027 files, is not a really nice thing to do.

One solution is to keep a big LRUCache tracking the existence of local file URIs and keep it up to date with inotify or fanotify. This could help a bit with the latency, but I am not particularly keen on doing all this IO, but we may not have a choice...
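
A minimal sketch of such an existence cache, assuming an invalidate() hook that a file monitor (inotify, fanotify or similar) would call; the class and method names are invented for this example and the monitor wiring is left out.

```python
import os
from collections import OrderedDict


class ExistenceCache:
    """Small LRU cache of 'does this path exist?' answers."""

    def __init__(self, capacity=1024, probe=os.path.exists):
        self._capacity = capacity
        self._probe = probe            # the expensive check, stat() by default
        self._entries = OrderedDict()  # path -> bool, oldest entry first

    def exists(self, path):
        """Return the cached answer, probing the file system on a miss."""
        if path in self._entries:
            self._entries.move_to_end(path)  # mark as most recently used
            return self._entries[path]
        result = self._probe(path)
        self._entries[path] = result
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return result

    def invalidate(self, path):
        """Forget a path; meant to be called from a file monitor callback."""
        self._entries.pop(path, None)
```

Without the invalidate() calls the cache happily serves stale answers, which is exactly why it would have to be kept up to date by a monitor.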

Catch 22: My gut instinct tells me that we shouldn't stat() files on a usb key drive because of inferior IO. But that would mean that Zeitgeist wouldn't do reachability tracking correctly if I am running with a live session from a usb stick :-S

Seif: Since it's blocked and we have no realistic plan for when we will deploy, I set the milestone and target to None

