Add option to unzip to detect the encoding of filenames

Registered by Simos Xenitellis  on 2007-06-26

The ZIP file format (http://en.wikipedia.org/wiki/ZIP_(file_format)) does not include information on the encoding of the filenames of the compressed files. Zip files created on non-Unicode systems or on Windows XP use legacy encodings. For example, Greek filenames are encoded with CP737.
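To illustrate the ambiguity, here is a minimal Python sketch. The filename is an invented example and nothing here is taken from unzip itself. It shows that the bytes of a CP737-encoded Greek name are not valid UTF-8, and that decoding them with a different legacy code page (CP437) succeeds but produces the wrong text.

```python
# Hypothetical Greek filename, stored the way a CP737-based tool would store it.
original = "αρχείο.txt"
raw = original.encode("cp737")

# The stored bytes carry no encoding label. Interpreting them as UTF-8 fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# CP437 maps every byte to some character, so decoding "works" but yields garbage.
as_cp437 = raw.decode("cp437")

# Only the encoding that was actually used recovers the name.
as_cp737 = raw.decode("cp737")

print(valid_utf8)
print(as_cp737 == original)
print(as_cp437 == original)
```

Because several legacy code pages accept the same bytes, the extracting tool has to be told, or has to guess, which one was used.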

When extracting such files on current Linux distributions, the unzip tool fails to create any files because the filenames are invalid. GUI utilities such as file-roller (the default compressed file manager of GNOME and subsequently Ubuntu) fail to extract those files because they use the "unzip" utility to handle Zip files.

A. References
a. Summary of discussion on Zip encoding issues at Linux-utf8 mailing list: http://mail.nl.linux.org/linux-utf8/2005-06/msg00010.html
b. RFE: handle non-ascii filenames in archive properly: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=225576
   Filenames in non-UTF-8 encodings are not handled correctly: http://bugzilla.gnome.org/show_bug.cgi?id=306403
c. Autoconvert filename encoding to UTF-8: https://sourceforge.net/tracker/index.php?func=detail&aid=1214471&group_id=14481&atid=114481
d. Another Windows zipfile encoding problem, with patch: http://mail.python.org/pipermail/python-list/2003-May/202914.html
e. Unzip 5.52 with custom patch: http://www.linuxfromscratch.org/blfs/view/stable/general/unzip.html
f. Bug report with patch from AltLinux (RUS): https://bugzilla.altlinux.org/long_list.cgi?buglist=4871
g. Launchpad report, with custom fix: https://bugs.launchpad.net/debian/+source/unzip/+bug/10979

Blueprint information

Status:
Not started
Approver:
None
Priority:
Undefined
Drafter:
Simos Xenitellis 
Direction:
Needs approval
Assignee:
None
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

I just found out (Jun 2007, see reference [g] above) that Ubuntu has recently started using a patch for unzip, originating from AltLinux, that adds options to select the source encoding.

Specifically,
+ -O CHARSET specify a character encoding for DOS, Windows and OS/2 archives\n\
+ -I CHARSET specify a character encoding for UNIX and other archives\n\n";
and, for use in system configuration files,
export UNZIP="-O CP866"
export ZIPINFO="-O CP866"
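The idea behind -O, letting the user name the source encoding, can be approximated in Python as a rough sketch. This is not the patch's implementation. It assumes Python's zipfile module, which decodes names as CP437 when the UTF-8 flag (general purpose bit 11) is not set; the helper names recover_name and list_names and the constant UTF8_FLAG are hypothetical.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def recover_name(info: zipfile.ZipInfo, charset: str) -> str:
    """Return the entry name, re-decoded with the user-supplied charset."""
    if info.flag_bits & UTF8_FLAG:
        # The archive declares UTF-8, so zipfile already decoded it correctly.
        return info.filename
    # zipfile decoded the raw bytes as CP437; undo that to get the bytes back,
    # then decode them with the encoding the user asked for.
    return info.filename.encode("cp437").decode(charset)


def list_names(path: str, charset: str) -> list:
    """List entry names of the archive at path, assuming legacy names use charset."""
    with zipfile.ZipFile(path) as archive:
        return [recover_name(info, charset) for info in archive.infolist()]
```

For example, list_names("archive.zip", "cp866") would play the role of -O CP866 for listing; "archive.zip" is a placeholder path. Like the patch, this still relies on the user knowing the encoding and does not detect it.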

It looks like this blueprint will be closed soon; the one unresolved issue is to pass the change upstream to the Info-ZIP project.


Work Items
