glot database goals and design
The glot database architecture needs to be designed to be quick for multiple types of lookups and flexible enough to handle a wide variety of languages and dictionary formats. The aim is not to create a new dictionary format, but to be able to import and export dictionaries in existing formats into/from the database.
Blueprint information
- Status:
- Not started
- Approver:
- None
- Priority:
- Undefined
- Drafter:
- None
- Direction:
- Needs approval
- Assignee:
- None
- Definition:
- New
- Series goal:
- None
- Implementation:
- Unknown
- Milestone target:
- None
- Started by
- Completed by
Related branches
Related bugs
Bug #352792: Unit tests need to address current DB schema | In Progress |
Bug #352820: DB Schema Diagram | Confirmed |
Bug #353546: ORM Evaluation and Implementation | In Progress |
Sprints
Whiteboard
Goals:
1. Be structured and indexed in such a way to optimally handle rapid, sequential queries (both for incremental search (sequential queries are related) and iterative lookups (eg. lookup all words in a document... sequential queries are not necessarily related in any way, but most likely are the same language)
2. Able to quickly return results for regular expression searches.
3. Flexible enough to handle an arbitrary number of languages and dictionaries. Even dictionaries in the same language may have different fields (part-of-speech, valence, comments, orthographies). This means that we may get a dictionary with only one field (like a word in a single language) or one with 100 fields. All of these need to be handled by the same DB.
4. If a user adds a new entry to some dictionary (or multiple dictionaries), the database should keep track of which entries are new and which are from the source file.
5. From (4), it follows that the database should know what entries came from what dictionaries, even if multiple dictionaries provided similar entries. This implies some data duplication.
6. Lookups can be precise (explicitly define the language, field, dictionary, word, etc) or vague (only a word, or a language, etc)
7. We want to distinguish between different words with the same spelling. This means an extra table of "word senses". For instance, off the top of my head I can think of 7 meanings of "dog" ("a canine", short for "hotdog", "a scoundrel", "a womanizer", "feet", "to follow", an adjective ("dog tired")), and there might be more.
Based on these criteria, I propose the following schema:
( Table: |columns| )
languages:
| id | name |
fields:
| id | name | languages(id) |
dictionaries:
| id | name | description | import_date |
dictionary_
| dictionaries(id) | fields(id) |
words:
| id | orthography | fields(id) | dictionaries(id) |
senses:
| id | words(id) |
translations:
| id | senses(id) | senses(id) |
Alternative implementations:
1. Replace the words table with the following:
orthographies:
| id | orthography | fields(id) |
words:
| id | orthographies(id) | dictionaries(id) |
But this implementation only makes sense if we anticipate many orthographies for a single word.
2. Have separate tables or databases for each dictionary. This would allow us to break up the load required for a query when the number of dictionaries and entries becomes huge. We might explore this option later, when a need becomes apparent.
3. We may want a table of "constants" to minimize string comparisons. This way, we look up a string once, get an id, then do further lookups with that id.
4. We could reduce some data data redundancy by moving some columns (such as dictionary_id on words) to a separate table so it only appears once for each word_id. We need to be careful with this if we want to combine similar entries from multiple dictionaries at all.