Adorn String Literals for Better Python 2/3 Support

Registered by Aaron Whitehouse

One of the most time-consuming parts of supporting both Python 2 and 3, and one of the hardest to automate, is handling string literals. A plain literal (e.g. a = "Hello") is treated as bytes (e.g. encoded ASCII) in Python 2 but as Unicode in Python 3. As we are trying to support both Python 2 and Python 3 for at least a transition period, we may get inconsistent behaviour wherever we have an unadorned string.

The versions of Python 2 and 3 we are targeting let us "adorn" string literals with a prefix letter that indicates the type of string (u for Unicode or b for bytes).

An important preliminary step to Python 2/3 support is therefore for us to add these adornments to each and every string literal in the code base.

Blueprint information

Status: Complete
Approver: Kenneth Loafman
Priority: Undefined
Drafter: Aaron Whitehouse
Direction: Approved
Assignee: Aaron Whitehouse
Definition: Approved
Series goal: Proposed for 0.8
Implementation: Implemented
Milestone target: None
Started by: Aaron Whitehouse
Completed by: Aaron Whitehouse

Whiteboard

Duplicity aims to support Python 3:
https://blueprints.launchpad.net/duplicity/+spec/python3
but will do so by supporting both Python 2 and Python 3 first.

As the codebase is currently Python 2 only, marking strings as Bytes (b" ") essentially preserves current behaviour, but it is highly desirable to convert as many of these as possible to Unicode strings (u" "), as these will be much easier to work with as we transition to Python 3 and it will improve non-ASCII support. This will likely require changes to other parts of the code that interact with the string.
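As a minimal illustration of the difference between the two adornments (a sketch only, not code taken from duplicity; the variable names are made up for the example), the following behaves the same way on Python 2.7 and Python 3.3+:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical example values: the same word as text and as UTF-8 bytes.
name_text = u"caf\u00e9"      # Unicode text on both Python 2 and Python 3
name_bytes = b"caf\xc3\xa9"   # raw bytes on both Python 2 and Python 3

print(len(name_text))    # counts characters: 4
print(len(name_bytes))   # counts bytes: 5

# Moving between the two is always an explicit encode/decode step.
assert name_bytes.decode("utf-8") == name_text
assert name_text.encode("utf-8") == name_bytes
```

An unadorned "café" literal would be the 5-byte value on Python 2 and the 4-character value on Python 3, which is the kind of inconsistency the adornments remove.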

A related change that is helpful to make at the same time is to convert any "str()" calls to either "unicode()" or "bytes()", depending on whether text or raw bytes is intended.
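A rough sketch of what such a conversion might look like is below. The alias for "unicode" on Python 3 is an assumption made so that the snippet runs on both versions; it is not necessarily how duplicity handles this.

```python
from __future__ import print_function
import sys

# Assumption: a local alias so the name "unicode" also exists on Python 3.
if sys.version_info[0] >= 3:
    unicode = str

count = 3  # hypothetical value that used to be passed to str()

as_text = unicode(count)             # u"3" on both Python 2 and Python 3
as_bytes = as_text.encode("ascii")   # b"3" on both Python 2 and Python 3

# Gotcha: bytes(count) gives b"3" on Python 2 but three NUL bytes on
# Python 3, so encoding the text explicitly is the safer spelling.
print(repr(as_text), repr(as_bytes))
```

The point of the change is that the reader (and the interpreter) no longer has to guess which type a bare str() call was meant to produce.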

The broadly recommended approach for text is to convert at the boundaries (e.g. decode when reading from files and encode when writing to them) and use Unicode throughout internally. Many built-ins and libraries natively support Unicode, so in many cases very little of the code needs to change.
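The boundary approach can be sketched as follows. The helper names read_text and write_text, the file name and the choice of UTF-8 are all assumptions for illustration, not existing duplicity functions:

```python
import io
import os
import shutil
import tempfile


def write_text(path, text, encoding="utf-8"):
    # Boundary out: encode once, then write raw bytes.
    with io.open(path, "wb") as handle:
        handle.write(text.encode(encoding))


def read_text(path, encoding="utf-8"):
    # Boundary in: read raw bytes, then decode once.
    with io.open(path, "rb") as handle:
        return handle.read().decode(encoding)


tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, "example.txt")
    write_text(path, u"na\u00efve backup")
    text = read_text(path)     # Unicode from here on
    shouted = text.upper()     # internal processing works on text, not bytes
finally:
    shutil.rmtree(tmp_dir)
```

Everything between the two helpers only ever sees Unicode, so non-ASCII content is handled the same way on Python 2 and Python 3.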

Many helper variables/functions have already been created in duplicity so that you can use Unicode wherever possible. For paths, for example, you can use Path.uc_name instead of Path.name.

See: http://python-future.org/what_else.html#bytes for some other potential "gotchas" in working with strings, bytes and unicode across a mixed Python 2/3 codebase.

=== Implementation Plan ===
To ensure that we can find these and do not accidentally introduce more unadorned strings, we have added a test to testing/test_code.py that automatically checks all .py files for unadorned strings and fails if any are found.

This follows the approach used for many of our other code style checks: the test currently carries a very long list of excluded files (which are not checked), and we remove each exclusion as we adorn the strings in that file.
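For readers curious how such a check can work, here is one possible sketch using the standard library tokenize module. It is an illustration under stated assumptions, not the actual code in testing/test_code.py or testing/find_unadorned_strings.py: the function name is made up, and it treats any string token that starts directly with a quote character as unadorned (so docstrings are reported too, while any prefix such as u, b or r counts as an adornment).

```python
from __future__ import print_function
import io
import tokenize


def find_unadorned_strings(source):
    """Return (line, column, literal) for each string literal with no prefix."""
    found = []
    readline = io.StringIO(source).readline
    for tok_type, tok_string, start, _end, _line in tokenize.generate_tokens(readline):
        # A literal whose first character is a quote has no u/b/r prefix.
        if tok_type == tokenize.STRING and tok_string[0] in u"\"'":
            found.append((start[0], start[1], tok_string))
    return found


# Hypothetical input: two unadorned literals and two adorned ones.
sample = u'a = "Hello"\nb = u"Hello"\nc = b"Hello"\nd = \'x\'\n'
for line, column, literal in find_unadorned_strings(sample):
    print("line %d, col %d: %s" % (line, column, literal))
```

A test built on this idea could run the function over every .py file that is not in the exclusion list and fail if the returned list is non-empty.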

The work plan is therefore:
1. Remove one of the excluded filenames from the ignored_files list in test_unadorned_string_literals (in testing/test_code.py).
2. Run the test suite (e.g. tox -e code) and check that it reports an error about unadorned string literals.
3. Find the unadorned string literals. We have created testing/find_unadorned_strings.py to help with this. Run it directly with a Python file as its argument and it will print a nicely-formatted list of all the unadorned strings in that file that need to be corrected:
python3 find_unadorned_strings.py python_file.py
4. Adorn each string literal with a u wherever possible (making corresponding changes to support Unicode where necessary), or a b where the string should stay as bytes.
5. Run the test again to confirm there are no more unadorned string literals in that file.
6. Run the full test suite to ensure there are no regressions.
7. Merge.

Work Items

Create test for unadorned string literals (with ignores for all files that still contain unadorned string literals): DONE
Create helper script to help locate the unadorned string literals in a file: DONE
Adorn the strings in each file, removing the file from the ignore/exclusions list when completed: DONE
