Import and convert dictionaries from other programs

Registered by Joshua Harlan Lifton on 2010-11-15

Many Plover users have steno experience with other programs and therefore have mature dictionaries in those programs' formats. A tool should exist to easily convert other programs' dictionaries to the Plover dictionary format.

Blueprint information

Status: Not started
Approver: None
Priority: Undefined
Drafter: None
Direction: Needs approval
Assignee: None
Definition: New
Series goal: None
Implementation: Unknown
Milestone target: None

Whiteboard

The following (Vim-flavored) regular expressions convert a dictionary exported in RTF/CRE format into Python dictionary format. Ideally they should be turned into a simple script that new users can run on their dictionaries without any prior knowledge of regular expressions. They have only been fully tested with RTF/CRE dictionaries exported by Eclipse; files exported from other CAT software will probably need additional formatting, and more testing is required.

Note that Plover currently supports two types of steno dictionary: Eclipse format, where hyphens are made explicit only when necessary, and DigitalCAT format, where all hyphens are explicit. The default format is Eclipse, so if you are importing a DigitalCAT dictionary, change the format in Plover's .config file.
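
As a possible starting point for such a script, here is a minimal Python sketch of the entry-extraction part. It assumes every entry sits on its own line in the form `{\*\cxs STENO}translation`, which is the shape the expressions below expect; the names `ENTRY_RE` and `parse_entries` are made up for illustration and are not part of Plover.

```python
import re

# Assumption: one entry per line, shaped like {\*\cxs STENO}translation.
ENTRY_RE = re.compile(r"^\{\\\*\\cxs ([^}]+)\}(.*)$")


def parse_entries(rtf_text):
    """Return a list of (steno, raw_translation) pairs found in rtf_text."""
    entries = []
    for line in rtf_text.splitlines():
        # Drop lines with paragraphing commands, as the Vim list does.
        if "\\par\\" in line or "{$}" in line:
            continue
        match = ENTRY_RE.match(line)
        if match is None:
            # Anything that is not an entry is treated as formatting residue.
            continue
        steno, translation = match.groups()
        entries.append((steno, translation.strip()))
    return entries
```

Lines that do not look like entries (RTF headers, paragraphing commands) are simply skipped, which is as drastic as the manual approach.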

--------------------------------
# escape backslashes
%s/\\/\\\\/g

# escape "
%s/"/\\"/g

# convert double spaces to single spaces
%s/  / /g

# Remove lines with court reporter-specific paragraphing commands. This is drastic, but they cause
# no end of trouble; a later version may try to support them to some degree.
%s/^.*{$}.*$\n//
%s/^.*\\\\par\\\\.*$\n//

# Convert steno half of entry to Python format
%s/{\\\\.\\\\cxs \([^\}]\+\)}/"\1": /

# Get rid of any lines that don't start with a quote (i.e., more court reporting formatting residue).
%s/^[^"].*$\n//

# Convert infixes.
%s/: \\\\cxds \(.*\)\\\\cxds/: {^\1^}/

# Convert suffixes.
%s/: \\\\cxds \(.*\)/: {^\1}/

# Convert prefixes.
%s/: \(.*\)\\\\cxds/: {\1^}/

# Delete the "force uncap" commands (caption-specific commands that Plover doesn't need to implement now, if ever).
%s/{l1}//g
%s/{l0}//g

# Delete \\cxp, the punctuation marker, since Plover recognizes specific punctuation marks independently.
%s/\\\\cxp//g

# Convert glue strokes.
%s/\\\\cxfing /\&/g

# Convert "cap next" strokes.
%s/\\\\cxfc /-|/g

# Convert "stitch" strokes to suffix with hyphen.
%s/{\\\\cxstit /{^-/

# Search for other cx strokes and deal with them manually.
/cx

# Delete spaces at ends of line. (Don't type the ^M literally; press Ctrl-Q, then Ctrl-M, and ^M will be displayed.)
%s/ \n/^M/g

# Convert other half of entries.
%s/^"\([-A-Z0-9\/\*]\+\)": \(.*\)$/"\1": "\2",/

# Put in curly brackets at beginning and end of dictionary
# I'm sure there's a way to do this automatically, but I just did it manually.
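
The translation-side steps above could be scripted along these lines. This is only a sketch: it assumes the entries have already been split into (steno, raw translation) pairs, the function names are hypothetical, and it swaps in `json.dumps` for the manual quote and backslash escaping and for the hand-added curly brackets, so the patterns use single backslashes.

```python
import json
import re


def convert_translation(text):
    """Apply the affix and control-word steps to one raw translation."""
    text = text.replace("  ", " ")
    # Infix, suffix, prefix: same order as the Vim expressions.
    text = re.sub(r"^\\cxds (.*)\\cxds", r"{^\1^}", text, count=1)
    text = re.sub(r"^\\cxds (.*)", r"{^\1}", text, count=1)
    text = re.sub(r"^(.*)\\cxds", r"{\1^}", text, count=1)
    # "Force uncap" commands and the punctuation marker are dropped.
    text = text.replace("{l1}", "").replace("{l0}", "")
    text = text.replace("\\cxp", "")
    # Glue, "cap next", and "stitch" strokes.
    text = text.replace("\\cxfing ", "&")
    text = text.replace("\\cxfc ", "-|")
    text = text.replace("{\\cxstit ", "{^-")
    return text.strip()


def to_plover_json(entries):
    """Serialize (steno, raw_translation) pairs as one JSON object."""
    converted = {steno: convert_translation(raw) for steno, raw in entries}
    # json.dumps escapes quotes and backslashes and adds the outer braces.
    return json.dumps(converted, indent=0, ensure_ascii=False)
```

Any remaining `cx` control words pass through untouched, so searching the output for `cx` is still worthwhile before loading it into Plover.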

You can find a ~9 MB zip file containing several unconverted dictionaries in RTF format, along with a few converted dictionaries in JSON format, in both Eclipse (only necessary hyphens explicit) and DigitalCAT (all hyphens explicit) flavors of steno, here:

http://stenoknight.com/plover/ploverdicts.zip

The DigitalCAT dictionaries will require much more weeding, since they contain extra metadata that the regular expressions in the Launchpad blueprint don't account for: things like dictentrydate, which we can just cut out completely, and conflicts, which will require sacrificing the entry, since Plover doesn't support conflict differentiation (nor will it ever, if I have anything to say about it). Basically, anything starting with cx is steno-specific metadata.
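
A rough sketch of that weeding step might look like the following. Both patterns are assumptions that should be checked against a real DigitalCAT export: that entry dates appear as a brace group such as `{\*\cxsvatdictentrydate\yr2010\mo11\dy15}`, and that conflicts are flagged with the `\cxconf` control word. The name `weed` is hypothetical.

```python
import re

# Assumption: dates look like {\*\cxsvatdictentrydate\yr2010\mo11\dy15}.
DATE_GROUP_RE = re.compile(r"\{\\\*\\cxsvatdictentrydate[^{}]*\}")
# Assumption: conflict entries contain the \cxconf control word.
CONFLICT_MARKER = "\\cxconf"


def weed(entries):
    """Strip date metadata and drop conflict entries from (steno, text) pairs."""
    kept = []
    for steno, translation in entries:
        if CONFLICT_MARKER in translation:
            # Plover has no conflict differentiation, so the entry is sacrificed.
            continue
        translation = DATE_GROUP_RE.sub("", translation).strip()
        kept.append((steno, translation))
    return kept
```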
