glot database goals and design

Registered by Michael Wayne Goodman on 2009-04-01

The glot database architecture needs to be designed to be quick for multiple types of lookups and flexible enough to handle a wide variety of languages and dictionary formats. The aim is not to create a new dictionary format, but to be able to import and export dictionaries in existing formats into/from the database.

Blueprint information

Status:
Not started
Approver:
None
Priority:
Undefined
Drafter:
None
Direction:
Needs approval
Assignee:
None
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Whiteboard

Goals:

1. Be structured and indexed in such a way to optimally handle rapid, sequential queries (both for incremental search (sequential queries are related) and iterative lookups (eg. lookup all words in a document... sequential queries are not necessarily related in any way, but most likely are the same language)

2. Able to quickly return results for regular expression searches.

3. Flexible enough to handle an arbitrary number of languages and dictionaries. Even dictionaries in the same language may have different fields (part-of-speech, valence, comments, orthographies). This means that we may get a dictionary with only one field (like a word in a single language) or one with 100 fields. All of these need to be handled by the same DB.

4. If a user adds a new entry to some dictionary (or multiple dictionaries), the database should keep track of which entries are new and which are from the source file.

5. From (4), it follows that the database should know what entries came from what dictionaries, even if multiple dictionaries provided similar entries. This implies some data duplication.

6. Lookups can be precise (explicitly define the language, field, dictionary, word, etc) or vague (only a word, or a language, etc)

7. We want to distinguish between different words with the same spelling. This means an extra table of "word senses". For instance, off the top of my head I can think of 7 meanings of "dog" ("a canine", short for "hotdog", "a scoundrel", "a womanizer", "feet", "to follow", an adjective ("dog tired")), and there might be more.

Based on these criteria, I propose the following schema:

( Table: |columns| )

    languages:
      | id | name |
    fields:
      | id | name | languages(id) |
    dictionaries:
      | id | name | description | import_date |
    dictionary_fields:
      | dictionaries(id) | fields(id) |
    words:
      | id | orthography | fields(id) | dictionaries(id) |
    senses:
      | id | words(id) |
    translations:
      | id | senses(id) | senses(id) |

Alternative implementations:

1. Replace the words table with the following:
    orthographies:
      | id | orthography | fields(id) |
    words:
      | id | orthographies(id) | dictionaries(id) |
  But this implementation only makes sense if we anticipate many orthographies for a single word.

2. Have separate tables or databases for each dictionary. This would allow us to break up the load required for a query when the number of dictionaries and entries becomes huge. We might explore this option later, when a need becomes apparent.

3. We may want a table of "constants" to minimize string comparisons. This way, we look up a string once, get an id, then do further lookups with that id.

4. We could reduce some data data redundancy by moving some columns (such as dictionary_id on words) to a separate table so it only appears once for each word_id. We need to be careful with this if we want to combine similar entries from multiple dictionaries at all.

(?)

Work Items

This blueprint contains Public information 
Everyone can see this information.

Subscribers

No subscribers.