I was going to just make a simple pull request, but I wanted to ensure my thinking is

Cocoa and Core Foundation Unicode issues about alfred-workflow HOT 5 CLOSED

deanishe commented on May 20, 2024

Cocoa and Core Foundation Unicode issues

from alfred-workflow.

Comments (5)

deanishe commented on May 20, 2024

The output of mdls isn't encoded text. It's a file format (the old NeXT plist format, by the looks of it), more akin to JSON.

Adding something like this to Workflow.decode() is inappropriate. Its job is to decode encoded text and normalise text, not parse file formats.

Sure, it would do some useful "magic" if the text just happens to be mdls output, but what if it's just a normal string that happens to have \U in it? It'd basically be breaking decode() to handle one weird edge case.
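To make the "normal string that happens to have \U in it" concern concrete, here is a minimal Python 3 sketch. The `naive_decode` helper is hypothetical: it assumes the proposal boils down to rewriting every `\U` as `\u` and running the result through the `unicode_escape` codec. It is not code from Alfred-Workflow.

```python
def naive_decode(text):
    # Hypothetical unconditional variant: treat every \U as the start of
    # a \uXXXX escape, whether or not the text came from mdls.
    return text.replace('\\U', '\\u').encode('ascii').decode('unicode_escape')


# mdls-style input: the escape becomes a combining macron, as intended.
mdls_like = naive_decode(r'To\U0304ny')

# Ordinary text that merely contains a backslash followed by "U".
path = r'C:\Users\tony'
try:
    naive_decode(path)
    outcome = 'decoded'
except UnicodeDecodeError:
    # "\users" is not a valid \uXXXX escape, so decoding fails outright.
    outcome = 'error'
```

Under these assumptions the helper handles the mdls-style string but fails on an ordinary path, which is the kind of breakage described above.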


fractaledmind commented on May 20, 2024

The distinction between encoded text and file format is well taken, as is the point about breaking simple strings that contain \U. My point was merely that decode() does not decode Unicode text gotten from subprocess in this edge case. As you state in the docs, subprocess is a module that will return strings that require some massaging to get into Pythonic form. I have no idea which OS X CLIs return old NeXT plist formats, but when someone writing a workflow with Alfred-Workflow runs into one, they won't have any idea what the hell is going on. And the solution is to .decode() the text, so it doesn't seem totally inappropriate to add this functionality to Workflow.decode().

Obviously, it's your package, so do as you see fit, and this will be my last-ditch effort. I've altered the function so that it handles strings with "\U" randomly in them. It will only use .decode('unicode-escape') if it finds a "\U" followed by 3 or more digits:

import re
import unicodedata


def decode(text, encoding='utf-8', normalization='NFC'):
    """Return ``text`` as normalised unicode.

    :param text: string
    :type text: encoded or Unicode string. If ``text`` is already a
        Unicode string, it will only be normalised.
    :param encoding: The text encoding to use to decode ``text`` to
        Unicode.
    :type encoding: ``unicode`` or ``None``
    :param normalization: The normalisation form to apply to ``text``.
    :type normalization: ``unicode`` or ``None``
    :returns: decoded and normalised ``unicode``

    """
    # convert string to Unicode
    if isinstance(text, basestring):
        if not isinstance(text, unicode):
            text = unicode(text, encoding)
    # decode Cocoa/CoreFoundation Unicode to Python Unicode
    if re.search(r'\\U\d{3,}', text):
        text = text.replace('\\U', '\\u').decode('unicode-escape')
    return unicodedata.normalize(normalization, text)

This seems to me sufficiently safe, such that there is no downside to adding the functionality. But, as I say, I will drop it at this.


deanishe commented on May 20, 2024

My point was merely that decode() does not decode Unicode text gotten from subprocess in this edge case

Yes, it does decode the text. 'To\U0304ny\U0308 Sta\U030ark' contains representations of Unicode codepoints within an encoded string, just as writing mystring = u'To\u0304ny' in your Python source code is a Unicode representation within an ASCII- or UTF-8-encoded text file.

What you're suggesting is hardcoding a second, mdls codec and corresponding decoding step in decode(). If you want to decode mdls output, then you should be specifying something other than utf-8 as the encoding, not hardcoding your mdls codec into decode().

This seems to me sufficiently safe, such that there is no downside to adding the functionality.

It's not safe, it's broken.

Let's look at your Stack Overflow question as an example. Say you're using some hypothetical Stack Overflow command line tool via subprocess to grab your post and its comments, and then you run the output through decode() (as you should because it might contain non-ASCII characters).

Your question:

I'm having issues reading Unicode text from the shell into Python. I have a test document with the following metadata attribute:
kMDItemAuthors = (
    "To\U0304ny\U0308 Sta\U030ark"
)

Current `decode()` returns:

I'm having issues reading Unicode text from the shell into Python. I have a test document with the following metadata attribute:
kMDItemAuthors = (
    "To\U0304ny\U0308 Sta\U030ark"
)

Your modified `decode()` returns:

I'm having issues reading Unicode text from the shell into Python. I have a test document with the following metadata attribute:
kMDItemAuthors = (
    "Tōnÿ Stårk"
)

Do you see the problem?
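The comparison above can be reproduced with a short Python 3 sketch. `modified_decode` is an approximation of the guarded proposal, not the original Python 2 function: it assumes ASCII-only input (the `encode('ascii')` step would fail otherwise) and skips the byte-decoding step.

```python
import re
import unicodedata


def modified_decode(text, normalization='NFC'):
    # Python 3 approximation of the guarded proposal: only rewrite the
    # text if it contains \U followed by three or more digits.
    if re.search(r'\\U\d{3,}', text):
        text = (text.replace('\\U', '\\u')
                    .encode('ascii')           # assumes ASCII-only input
                    .decode('unicode_escape'))
    return unicodedata.normalize(normalization, text)


# A post that *quotes* mdls output. The escapes are the content here.
post = 'kMDItemAuthors = (\n    "To\\U0304ny\\U0308 Sta\\U030ark"\n)'
result = modified_decode(post)

print(post)
print(result)
```

The guard fires because the quoted text looks like mdls output, so the literal escape sequences the author was asking about are replaced by the characters they stand for.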


fractaledmind commented on May 20, 2024

Point taken.


deanishe commented on May 20, 2024

Seeing as mdls apparently only outputs ASCII, you might want to look into adding a codec for mdls's Unicode escape format, which you could then pass to decode().

That might be total overkill, though.
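For reference, a minimal sketch of what such a codec might look like in Python 3 (the code earlier in the thread is Python 2). The codec name `mdls_escape` and the helper names are hypothetical, and the sketch assumes each escape is `\U` followed by exactly four hex digits, as in the sample output quoted above. Only decoding is implemented.

```python
import codecs
import re
import unicodedata

# Assumption: mdls writes each non-ASCII code point as \U plus exactly
# four hex digits, as in the sample output quoted in this thread.
_MDLS_ESCAPE = re.compile(r'\\U([0-9a-fA-F]{4})')


def mdls_decode(data, errors='strict'):
    """Decode ASCII bytes containing mdls-style escapes to text."""
    raw = bytes(data)  # the codec machinery may pass a memoryview
    text = raw.decode('ascii', errors)
    text = _MDLS_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    return text, len(raw)


def mdls_encode(text, errors='strict'):
    raise NotImplementedError('this sketch only decodes')


def _search(name):
    # Codec search function: answer only for our hypothetical codec name.
    if name == 'mdls_escape':
        return codecs.CodecInfo(mdls_encode, mdls_decode, name='mdls_escape')
    return None


codecs.register(_search)

raw = b'"To\\U0304ny\\U0308 Sta\\U030ark"'
decoded = unicodedata.normalize('NFC', raw.decode('mdls_escape'))
```

Because the caller opts in by naming the codec, text that is not mdls output is never touched, which keeps the parsing step out of the general-purpose decode().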
