Implement sensible versioning of congregation & group data

Registered by Tim Black on 2010-05-05

Databases uniquely identify records by assigning a unique ID (normally an autoincremented integer) in an 'id' column in every table. This works fine when you have one database. But, we're dealing with federated data--data in many databases, each of which has its own set of IDs, and normally the RCL database will have no access to non-RCL databases' ID columns. The best way to deal with this situation at present is to use UUIDs - universally unique identifiers, which are basically long randomly generated integers that are thereby highly probable to be unique. But non-RCL databases (feed sources) normally won't provide us UUIDs.

When we update one feed's cache with new data from the feed, in the case where the feed does provide its own set of IDs (though not UUIDs) for each congregation/group (e.g., as a URL parameter like ?id=123), we can record that ID not as the RCL ID, but as the feed-specific ID for that congregation/group, and thereby avoid creating duplicates in that feed's data. This will not prevent duplicates in other cases, though. It also might fail if a feed source unexpectedly rearranges its IDs.

So, in addition to what is described in other blueprints already, we need to decide how to match

a) a congregation/group coming from a feed that doesn't already associate the congregation/group with a RCL UUID with

b) the right RCL congregation/group UUID.

Perhaps do it via similarity heuristics (like difflib http://docs.python.org/library/difflib.html, Levenshtein http://code.google.com/p/pylevenshtein/, or Damerau-Levenshtein http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison), but this is bound to create duplicate data unless we automate the matching heuristic to progressively decrease its sensitivity until a match is made (but then, we'll still need to determine if a NEW congregation was added in the source feed--how do we do that? Some sources won't flag congregations as new. We might be able to approximate a flag by counting the number of congregations.) So do we need a way to clean up duplicates manually?

We may also need to prioritize different fields - a congregation's name is more likely to remain the same than its street address. Likewise, we can consider prioritizing the beginning & end of some fields differently. E.g., changes after the first two words in the street address are less likely to be significant than changes to the first two words. Changes after the first word of the congregation name are less likely to be significant than changes to the first word.

Blueprint information

Status:
Not started
Approver:
None
Priority:
Essential
Drafter:
None
Direction:
Approved
Assignee:
None
Definition:
Drafting
Series goal:
Accepted for couchapp-backbone
Implementation:
Unknown
Milestone target:
milestone icon 0.3.5

Related branches

Sprints

Whiteboard

One good way to do versioning in CouchDB is described at http://blog.couchbase.com/simple-document-versioning-couchdb.

----------

It seems this versioning plan will require the following tables and fields:

feed
 - id
 - name
 - uuid

congregation
 - id
 - name
 - etc...
 - uuid
 - version_date
 - feed_id # foreign key to feed.uuid, identifying from which feed this version came

group
 - id
 - name
 - etc...
 - uuid
 - version_date
 - feed_id # foreign key to feed.uuid, identifying from which feed this version came

When a user creates a new record in the database, the model (or maybe the controller that writes to the model/database) should have a function that applies a Levenshtein similarity comparison to the submitted data, comparing it with existing records, then displays to the user a list of records that might be the same UUID as the congregation or group which the user has in mind, so the user can state that this data is actually for a new congregation/group, or that it is an update to an existing congregation/group.

When a feed is read and the feed's ID structure or data doesn't match existing data, we'll have to create some kind of automated Levenshtein comparison, like what is described above.

The section at http://pypi.python.org/pypi/vdm/0.7 titled "General Conceptual Documentation" provides some very good advice (including the other pages to which that section links) on how to store versioned data in a database--there are two ways--1) version every object separately, or 2) version all objects and their relations together as one version, and while 1) is easier to implement, it has some listed shortcomings. We need to both be able to store one version of the data which relates a congregation and its group and people all on the same date (so we need something like pattern 2), AND we need to have a way to relate one group's data at an earlier date with one congregation's data at a later date (pattern 1 enables this). I think the way to do this is to still use the id column for relations (which is SQLAlchemy's default), but when looking up a congregation's related group, use that group's UUID to find other versions of that group, in order to find the most recent (or most authoritative) version of that group in order to know what to display to the user. So that means we will follow pattern 1) primarily, but implement pattern 2) in a secondary fashion.

(?)

Work Items

Dependency tree

* Blueprints in grey have been implemented.

This blueprint contains Public information 
Everyone can see this information.

Subscribers

No subscribers.