Standardise the SGP algorithm

Registered by Matt Giuca

A proposal to standardise the SGP/SuperGenPass algorithm, and correct two noted issues.

Blueprint information

Status:
Not started
Approver:
None
Priority:
Undefined
Drafter:
None
Direction:
Needs approval
Assignee:
None
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Whiteboard

I'd like to standardise the algorithm so that the choice of tool becomes irrelevant: any number of independent tools could implement it and produce the same passwords.

I have two specific complaints about the current algorithm:

   1. The Unicode "bug". For the purposes of standardisation, rather than calling it a bug, it's an "unusual UnicodeString -> ByteString encoding scheme" (CodePoint % 256). The obvious solution is to replace this with the better-known and data-preserving UTF-8 encoding scheme.
   2. The arbitrary choice of where to cut off the domain, based on a large database of known shared second-level domains. It's unclear how to fix this.
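
To make the first complaint concrete, here is a minimal Python sketch of the two encodings. The function names are made up for illustration, and it assumes "CodePoint" means a full Unicode code point (a JavaScript implementation working on UTF-16 code units could differ for characters outside the Basic Multilingual Plane).

```python
def encode_mod256(text: str) -> bytes:
    """Existing scheme as described above: each code point modulo 256."""
    return bytes(ord(ch) % 256 for ch in text)


def encode_utf8(text: str) -> bytes:
    """Proposed replacement: standard UTF-8."""
    return text.encode("utf-8")


# U+0141 is code point 321, and 321 % 256 == 65, the code point of "A",
# so the modulo scheme cannot tell the two strings apart.
assert encode_mod256("\u0141") == encode_mod256("A") == b"A"

# UTF-8 keeps them distinct.
assert encode_utf8("\u0141") != encode_utf8("A")

# ASCII-only input produces the same bytes under both schemes.
assert encode_mod256("example.com") == encode_utf8("example.com")
```

Because ASCII-only input encodes identically either way, only passwords or domains containing non-ASCII characters would be affected by a change of scheme.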

I want to address these issues through standardisation, together with a tool that follows the standard and serves as its well-abstracted, well-commented reference implementation.

The algorithm has already been specified here:
http://supergenpass.com/about/#TechnicalDetails

I would start from that pseudocode as part of the specification. Note that the pseudocode avoids both of the above issues by:

   1. Conveniently assuming the MD5 algorithm works on Unicode strings (or conversely, assuming the input strings are byte strings).
   2. Accepting a Domain as input, rather than a URL.

#1 is an omission from the pseudocode, so it needs to be remedied (by explicitly stating the CodePoint % 256 encoding scheme).
#2 is a good idea, which I had arrived at separately before I looked at the pseudocode.

For #2, my idea of a "fix" is as follows. The "official" specification is defined in terms of a "Domain", not a "URL". The "Domain", for the purposes of this algorithm, is any Unicode string (not necessarily of the form [a-z]+(\.[a-z]+)*). It's just some arbitrary string which the password is defined over, so it's more general than websites. I could make my domain "tax office" if I wanted to, as long as I can remember it ;) (Practically speaking, I'd probably pick "ato.gov.au" just to make sure I can remember it.)

So the algorithm is simply defined as taking these two arbitrary strings as input. Translating a URL into a Domain is a feature of the UI, which looks at the address bar and somehow strips out a domain. The specification would say (I'm making this up as I go):

    User Interfaces which employ this algorithm MAY automatically choose the domain from a given URL, but any interface doing so SHOULD display the domain to the user, to make it clear which "Domain" string is being used as input to the SGP algorithm.

    For example, an implementation MAY employ the following algorithm:

       1. Take the URL from the user's address bar. (eg. "http://www.foo.example.com.au/bar/baz")
       2. Strip off all characters outside the host name (eg. "www.foo.example.com.au")
       3. Take only the final two domain segments (eg. "com.au"), unless the final two domain segments are found in a database of known shared second-level domains (which "com.au" is), in which case take the final three domain segments (eg. "example.com.au"). Use this string as the "Domain" input to the SGP algorithm.

    Such an implementation SHOULD display the chosen domain (eg. "example.com.au") to the user, so the user knows that that domain has been chosen (and not "com.au" or "www.foo.example.com.au"). This is necessary for two reasons:

       1. So the user knows the Domain input, and can therefore use another SGP implementation to generate the same password.
       2. So the user can be sure a Domain wasn't chosen that is too general. If the algorithm erroneously chose "com.au", it would be a security risk to proceed, as the user would be given the same password for all sites in that domain.

    Any database of known shared second-level domains is implementation-specific and not part of this specification.
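
The example URL-to-Domain steps above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name is hypothetical, and SHARED_SECOND_LEVEL_DOMAINS is a tiny made-up stand-in for whatever implementation-specific database a real tool would ship.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny stand-in for an implementation-specific
# database of shared second-level domains.
SHARED_SECOND_LEVEL_DOMAINS = frozenset({"com.au", "gov.au", "co.uk"})


def domain_from_url(url: str, shared=SHARED_SECOND_LEVEL_DOMAINS) -> str:
    """Suggest a Domain string for a URL, following steps 1-3 above."""
    # Step 2: strip everything outside the host name.
    host = urlsplit(url).hostname or ""
    segments = host.split(".")
    if len(segments) <= 2:
        return host
    # Step 3: keep two segments, or three if the last two are a shared SLD.
    last_two = ".".join(segments[-2:])
    keep = 3 if last_two in shared else 2
    return ".".join(segments[-keep:])


url = "http://www.foo.example.com.au/bar/baz"
assert domain_from_url(url) == "example.com.au"

# With an incomplete database the result is too general, which is why the
# chosen Domain should be shown to the user before a password is generated.
assert domain_from_url(url, shared=frozenset()) == "com.au"
```

The function only suggests a Domain; a UI built on it would still display the result and let the user override it.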

So my solution to the arbitrary SLD problem is to take it out of the spec (leaving the algorithm as a function of Domain) and to make the UI explicit about which domain it has chosen.

My solution to the first problem (the Unicode encoding problem) is version numbering, unfortunately. I'd propose a first spec, version 1.0, with a section on version numbering which says that subsequent versions may be incompatible. A version 1.1 is then published which specifies UTF-8 encoding instead of CodePoint % 256. The reference implementation has a switch which lets you select the version of the specification to use.
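
One way the version switch might look in a reference implementation is sketched below. The function names and the use of "1.0" / "1.1" strings as the selector are assumptions for illustration; only the MD5 input encoding is shown, not the rest of the SGP algorithm.

```python
import hashlib


def encode_input(text: str, spec_version: str) -> bytes:
    """Encode a Unicode string to bytes according to the chosen spec version."""
    if spec_version == "1.0":
        # Version 1.0 documents the existing behaviour: CodePoint % 256.
        return bytes(ord(ch) % 256 for ch in text)
    if spec_version == "1.1":
        # Version 1.1 switches to UTF-8.
        return text.encode("utf-8")
    raise ValueError(f"unsupported specification version: {spec_version!r}")


def md5_digest(text: str, spec_version: str) -> bytes:
    """MD5 over the version-specific byte encoding of the input string."""
    return hashlib.md5(encode_input(text, spec_version)).digest()


# ASCII-only input hashes identically under both versions.
assert md5_digest("hunter2:example.com", "1.0") == md5_digest("hunter2:example.com", "1.1")

# Non-ASCII input is encoded differently, so the versions are incompatible.
assert encode_input("p\u00e4ss", "1.0") == b"p\xe4ss"
assert encode_input("p\u00e4ss", "1.1") == b"p\xc3\xa4ss"
```

Making the version an explicit argument, with no default, keeps the caller aware of which (possibly incompatible) encoding a generated password depends on.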

