Comment on attachment 54126
A new Python 3 port of dbus-python

Review of attachment 54126:
-----------------------------------------------------------------

::: PY3PORT.rst
@@ +38,5 @@
> + - All object reprs are unicodes.  This change was made because it greatly
> +   simplifies the implementation and cross-compatibility with Python 3.
> + - Some values which were ints are now longs.  Primarily, this affects the
> +   type of the `variant_level` attributes.
> + - There is a new `dbus._BytesBase` class, but it is unused in Python 2.

I don't think this should count as user-visible; anyone relying on _-prefixed stuff in the dbus module was already wrong.

@@ +46,5 @@
> +
> +What do you need to do to port that to Python 3?  Here are the user visible
> +changes you should be aware of.  Python 3.2 is the minimal required version::
> +
> + - `ByteArray`s must be initialed with bytes objects, not unicodes.  Use `b''`

initialized

@@ +50,5 @@
> + - `ByteArray`s must be initialed with bytes objects, not unicodes.  Use `b''`
> +   literals in the constructor.  This also works in Python 2, where bytes
> +   objects are aliases for 8-bit strings.
> + - `ByteArray`s now derive from the new `dbus._BytesBase` class instead of
> +   `dbus._StrBase`.

If you feel the need to say this, please also say: Like everything with an underscore prefix, this is considered to be an implementation detail and may change in future versions.

I think the user-visible change you're actually describing is this?

`ByteArray` is now a subclass of `bytes`, where it was previously a subclass of `str`.

@@ +51,5 @@
> +   literals in the constructor.  This also works in Python 2, where bytes
> +   objects are aliases for 8-bit strings.
> + - `ByteArray`s now derive from the new `dbus._BytesBase` class instead of
> +   `dbus._StrBase`.
> + - `dbus.Byte` can be constructed from a 1-character byte or str object.

"... or from an integer", if it still can?

@@ +53,5 @@
> + - `ByteArray`s now derive from the new `dbus._BytesBase` class instead of
> +   `dbus._StrBase`.
> + - `dbus.Byte` can be constructed from a 1-character byte or str object.
> + - `dbus.UTF8String` is gone.  Use `dbus.String`.
> + - `dbus._IntBase` is gone.  Use `dbus._LongBase`.

... but don't anyway, because these are private.

@@ +58,5 @@
> + - All longs are now ints, since Python 3 has only a single int type.  This
> +   also means that the class hierarchy for the dbus numeric types has changed
> +   (all derive from int in Python 3).
> + - Some exception strings have changed.
> + - `dbus.is_py3` is a new flag that is True when running under Python 3.

This is a change to Python 2 as well, surely? But I'd have called this dbus._is_py3 and not documented it.

@@ +61,5 @@
> + - Some exception strings have changed.
> + - `dbus.is_py3` is a new flag that is True when running under Python 3.
> + - `dbus.keywords()` is a new function that helps with writing compatible code
> +   under Python 2 and Python 3.  It filters out `utf8_string` arguments under
> +   Python 3.

I don't think this deserves to be public API. dbus._adjust_keywords()?

@@ +84,5 @@
> +`dbus.UTF8String` (str).  Also notice that there is no direct dbus equivalent
> +of Python's bytes type (although dbus does have byte arrays), so I am mapping
> +dbus strings to unicodes in all cases, and getting rid of `dbus.UTF8String` in
> +Python 3.  I've also added a `dbus._BytesBase` type which is unused in Python
> +2, but which forms the base class for `dbus.ByteArray` in Python 3.

Implementation detail, not intended to be something that library users ever rely on

@@ +91,5 @@
> +`dbus.Signature`), bus names, interfaces, and methods are all strings.  A
> +previous aborted effort was made to use bytes for these, which at first blush
> +makes some sense.  Practicality beats purity though, and this approach tended
> +to impose too many changes on user code, and caused lots of difficult to track
> +down problems, so they are all str (i.e. unicode) in Python 3.

I think using strings for these is the pure choice as well as the practical choice - they're conceptually strings with a syntax.

@@ +125,5 @@
> +In the above page mapping basic types, you'll notice that the Python int type
> +is mapped to 32-bit signed integers ('i') and the Python long type is mapped
> +to 64-bit signed integers ('x').  Python 3 doesn't have this distinction, so
> +that mapping is removed.  Use `dbus.UInt32` if you must have 32-bit signed
> +integers.

You mean dbus.Int32.

I think mapping 'int' to 'x' will break real-world code; many D-Bus services are not tolerant of integer type mismatches. I'd recommend continuing to map int to 'i', even though int is now larger than it used to be. In my experience, if people want 'x' they're more likely to ask for it.

@@ +134,5 @@
> +Long literals in Python code are an interesting thing to have to port.  Don't
> +use them if you want your code to work in both Python versions.
> +
> +`dbus._IntBase` is removed in Python 3, you only have `dbus._LongBase`, which
> +inherits from a Python 3 int (i.e. a PyLong).

Implementation detail.

@@ +175,5 @@
> +
> +There were a few places where I noticed what might be considered bugs,
> +unchecked exception conditions, or possible reference count leaks.  In these
> +cases, I've just fixed what I can and hopefully haven't made the situation
> +worse.

You're very very welcome to report bugs for this sort of thing, preferably with patches!

@@ +222,5 @@
> + - Update all C extension docstrings for accuracy.
> + - Should `dbus.ByteArray` - which is a subclass of `dbus._BytesBase` which is
> +   itself a subclass of bytes, i.e. str in Python 2 - accept UTF8 encoded
> +   unicode strings?  Currently, no, they must be bytes in Python 3, str in
> +   Python 2.

I think this is right. If your string is UTF-8 (without embedded NULs), it should usually be a str/unicode in Python and a 's' on D-Bus; and if what you want is genuinely ByteArray(s.encode('utf-8')), explicit is better than implicit.

D-Bus requires UTF-8 in strings (as in 's') but has nothing to say about the encoding of byte arrays.

::: _dbus_bindings/bytes.c
@@ +92,5 @@
> +	if (!obj_as_bytes) {
> +	    return NULL;
> +	}
> +	obj = PyLong_FromLong(
> +	    (unsigned char)(PyBytes_AS_STRING(obj_as_bytes)[0]));

As far as I can tell, this doesn't check for length 1, so

    Byte(u'\x12xxxxxxxxxxxxx')

is misinterpreted as

    Byte(u'\x12')

which is 0x12. It should be an error, IMO.

@@ +172,5 @@
> +        PyErr_SetString(PyExc_RuntimeError, "Integer outside range 0-255");
> +        return NULL;
> +    }
> +    str[0] = (unsigned char)i;
> +    return PyUnicode_FromStringAndSize((char *)str, 1);

Is it OK for tp_str to return a unicode, even in Python 2? (I thought that was what tp_unicode was for?)

::: _dbus_bindings/conn-methods.c
@@ +562,4 @@
>      ok = dbus_connection_get_unix_fd (self->conn, &fd);
>      Py_END_ALLOW_THREADS
>      if (!ok) Py_RETURN_NONE;
> +    return PyLong_FromLong(fd);

Is it considered OK for (Unix) file descriptors to be long in Python 2? (For instance, does os.fdopen work?)

::: _dbus_bindings/conn.c
@@ +232,4 @@
>      self->has_mainloop = (mainloop != Py_None);
>      self->conn = NULL;
>      self->filters = PyList_New(0);
> +    self->weaklist = NULL;

Is this related to the porting, or an unrelated bug fix?

@@ +324,5 @@
> +	!PyBytes_Check(address_or_conn)
> +#endif
> +	)
> +    {
> +	PyErr_SetString(PyExc_TypeError, "connection or str expected");

I tend to prefer "is it one of these possibilities?" checks to be in the form:

    if (first thing) {
        ...
    }
    else if (second thing) {
        ...
    }
    else {
        /* none of the above, usually an error */
    }

Putting the "none of the above" case in the middle like you have here:

    if (first thing) {
        ...
    }
    else if (!(second thing)) {
        /* none of the above, usually an error */
    }
    else {
        /* second thing */
    }

guarantees that you'll have to rewrite the conditional if you add a third thing.

::: _dbus_bindings/containers.c
@@ +407,5 @@
> +#ifdef PY3K
> +	    !PyUnicode_Check(signature)
> +#else
> +	    !PyBytes_Check(signature)
> +#endif

I keep seeing this construct. I feel as though for as long as we remain compatible with Python 2, the dbus-bindings header ought to #define a STR_CHECK() or something, meaning "is it the thing that this version of Python calls str?"

@@ +701,5 @@
>  {
>      PyObject *key, *value;
>  
> +#ifdef PY3K
> +    if (PyUnicode_CompareWithASCIIString(name, "signature")) {

Similarly, I'm starting to think we could do with a locally-defined Str_CompareWithASCIIString.

::: _dbus_bindings/generic.c
@@ +38,4 @@
>  {
>      if (op == Py_EQ || op == Py_NE) {
>          if (self == other) {
> +            return PyLong_FromLong(op == Py_EQ);

Just to confirm, is it OK for comparators to return long in Python 2?

::: _dbus_bindings/int.c
@@ +25,5 @@
>  
>  #include "types-internal.h"
>  
> +#ifdef PY3K
> +#define INTBASE &DBusPyLongBase_Type

Should have parentheses around the RHS, to avoid non-obvious precedence...

@@ +71,5 @@
>      }
>      tuple = Py_BuildValue("(i)", PyObject_IsTrue(value) ? 1 : 0);
>      if (!tuple) return NULL;
> +#ifdef PY3K
> +    self = (DBusPyLongBase_Type.tp_new)(cls, tuple, kwargs);

... then you could use INTBASE->tp_new without the conditional :-)

@@ +264,4 @@
>  dbus_uint16_t
>  dbus_py_uint16_range_check(PyObject *obj)
>  {
> +    long i = PyLong_AsLong(obj);

I think dbus_py_uint16_range_check (and friends) are sometimes called on int objects in Python 2. With that in mind, is this usage OK?

::: _dbus_bindings/message-append.c
@@ +166,5 @@
>      }
>  }
>  
> +#ifdef PY3K
> +#define FROMSTRING PyUnicode_FromString

Call this STR_FROMSTRING and put it in the header?

@@ +209,5 @@
> +    if (PyLong_Check(obj)) {
> +        if (DBusPyInt64_Check(obj))
> +            return PyUnicode_FromString(DBUS_TYPE_INT64_AS_STRING);
> +        else if (DBusPyUInt32_Check(obj))
> +            return PyUnicode_FromString(DBUS_TYPE_UINT32_AS_STRING);

This ordering is really weird. Any particular reason why they're not in some sort of systematic ordering, like "sort by size and then by signedness"?

@@ +223,5 @@
> +            return PyUnicode_FromString(DBUS_TYPE_UINT16_AS_STRING);
> +        else if (DBusPyBoolean_Check(obj))
> +            return PyUnicode_FromString(DBUS_TYPE_BOOLEAN_AS_STRING);
> +        else
> +            return PyUnicode_FromString(DBUS_TYPE_INT64_AS_STRING);

As pointed out already, I think this will break existing code, and int32 is a better default. You could make it emit deprecation warnings if you like.

@@ +503,2 @@
>      DBG("Message_guess_signature: returning Signature at %p \"%s\"", ret,
> +	ret ? PyUnicode_AS_DATA(ret) : "(NULL)");

Is %s really right to printf the array of wide characters (codepoints or UTF-16 2-byte words) in a Unicode?

(I wouldn't necessarily object to just getting rid of the DBG(), though.)

@@ +566,3 @@
>          DBG("Performing actual append: string (from unicode) %s", s);
>          if (!dbus_message_iter_append_basic(appender, sig_type, &s)) {
> +	    Py_CLEAR(utf8);

This is an unrelated leak fix, right?

@@ +593,5 @@
>          }
> +        y = *(unsigned char *)PyBytes_AS_STRING(obj);
> +    }
> +    else if (PyUnicode_Check(obj)) {
> +	PyObject *obj_as_bytes = PyUnicode_AsUTF8String(obj);

This logic (to make a Byte from a single-byte unicode string) seems really strange, and a bit inefficient.

If you're using UTF-8 (but why is UTF-8 special here?), then it'd be way more efficient to check that the length of the Unicode is 1 and the first (and only) character is < 128. (Otherwise, it'd encode to more than one UTF-8 byte.)

But I think more sensible semantics would be to check that the length of the Unicode is 1, and use the numeric value of the first *codepoint* - so it's an error if it isn't in the latin-1 range, between U+0000 and U+00FF?

(Optimization: even if Python is encoding its strings in UTF-16 like it does on Windows, it's enough to get the first-and-only Py_UNICODE - either a UCS-4 32-bit codepoint, or a 16-bit unit of UTF-16 which is either itself or half of a surrogate pair - and check that it's in the range 0 to 255. If it is, the right answer is that byte; if not, error.)

::: _dbus_bindings/message-get-args.c
@@ +252,4 @@
>                                      args, kwargs);
>              }
>              else {
> +#endif

I'd prefer to still have a { in the Python 3 case so that the } doesn't need to be conditional.

@@ +256,5 @@
> +		/* Watch out for this call failing. */
> +		unicode = PyUnicode_DecodeUTF8(u.s, strlen(u.s), NULL);
> +		if (!unicode)
> +		    break;
> +                args = Py_BuildValue("(N)", unicode);

Unrelated bugfix?

Is unicode leaked here, in your version?

libdbus guarantees to return valid UTF-8, and its definition of valid UTF-8 is stricter than Python's (it excludes noncharacters like U+FEFF), so this should never fail in any case.

::: _dbus_bindings/message.c
@@ +115,5 @@
>      return 0;
>  }
>  
> +static PyObject *
> +MethodCallMessage_tp_repr(PyObject *self)

Not really related to this port...

@@ +133,5 @@
> +    if (!destination)
> +        destination = "n/a";
> +
> +    return PyUnicode_FromFormat(
> +        "<%s path: %s, int: %s, member: %s dest: %s>",

In D-Bus we usually say "iface" if "interface" would be too long.

::: _dbus_bindings/server.c
@@ +101,4 @@
>      {
>          const char *list[length + 1];
>  
> +	if (!(references = PyTuple_New(length)))

Is it OK to free a tuple some of whose entries are NULL?

@@ +402,4 @@
>  
>      self = DBusPyServer_NewConsumingDBusServer(cls, server, conn_class,
>              mainloop, auth_mechanisms);
> +    ((Server *)self)->weaklist = NULL;

Unrelated fix, or what?

::: configure.ac
@@ +164,4 @@
>     nested-externs \
>     pointer-arith \
>     format-security \
> +   no-unused-parameter \

This is in the wrong place; put "unused-parameter" in the list of undesirable warnings, next to "missing-field-initializers", without the "no-" prefix.

(The difference is that for each thing in the list of undesirable warnings, if "-Wno-THING -Wno-error=THING" doesn't work, the macro will refuse to enable -Werror.)

::: dbus/__init__.py
@@ +68,5 @@
>             # submodules
>             'service', 'mainloop', 'lowlevel'
> +           ]
> +
> +is_py3 = getattr(sys.version_info, 'major', sys.version_info[0]) == 3

Should be private, IMO... or if not, should be documented, and questions should be asked about why Python itself doesn't provide this.

(Shouldn't it be >=, too?)

Having this in 'dbus' (as opposed to dbus._compat or something) works against the possibility of ever not having this module be full of circular dependencies... although we can probably never break the circular dependencies without breaking API anyway.

@@ +73,5 @@
> +
> +if not is_py3:
> +    __all__.append('UTF8String')
> +
> +def keywords(**kws):

Should be private, IMO.

I think it should also raise an exception if you use utf8_strings (at least if it's set to a true value) in Python 3, rather than silently swallowing it - then part of porting to Python 3 will be "stop using utf8_strings, which no longer does anything useful".

::: dbus/connection.py
@@ +28,2 @@
>  try:
> +    import _thread as thread

Is this really the recommended way to do things in Python-3-land? Aren't underscore prefixes normally meant to mean "no user-serviceable parts inside"?

@@ +186,5 @@
>              return False
>          if self._int_args_match is not None:
>              # extracting args with utf8_strings and byte_arrays is less work
> +            args = message.get_args_list(
> +                **keywords(utf8_strings=True, byte_arrays=True))

It seems as though now it'd be better to just accept the overhead of UTF-8 -> Unicode conversion...

@@ +192,4 @@
>                  if (index >= len(args)
> +                    or (not isinstance(args[index], UTF8String)
> +                        if not is_py3
> +                        else False)

... which would let you use String as the thing that you check with isinstance, which can then be unconditional.

Removing the UTF8String check is a semantic change: previously we insisted that arguments matched with args_match were (in D-Bus terms) strings, whereas now you'll accept anything string-like. The underlying D-Bus match rules only accept strings (and not, for instance, object paths), so we should be consistent with that.

@@ +210,5 @@
>              # doing so again
>              if args is None or not self._utf8_strings or not self._byte_arrays:
> +                args = message.get_args_list(
> +                    **keywords(utf8_strings=self._utf8_strings,
> +                               byte_arrays=self._byte_arrays))

Be careful to change this too, when changing the above.

::: dbus/dbus_bindings.py
@@ +23,1 @@
>  from dbus.types import *

dbus_bindings is so deprecated that it should never have existed; it only exists to support apps that were wrongly using the C-level part of the binding (what's now _dbus_bindings). Can we take the opportunity to just not install it, or have it raise an exception on import, for Python 3?