Comments (5)

deanishe avatar deanishe commented on May 20, 2024

The output of mdls isn't encoded text. It's a file format (the old NeXT plist format, by the looks of it), more akin to JSON.

Adding something like this to Workflow.decode() is inappropriate. Its job is to decode encoded text and normalise text, not parse file formats.

Sure, it would do some useful "magic" if the text just happens to be mdls output, but what if it's just a normal string that happens to have \U in it? It'd basically be breaking decode() to handle one weird edge case.

from alfred-workflow.

fractaledmind avatar fractaledmind commented on May 20, 2024

The distinction between encoded text and file format is well taken, as is the point about breaking simple strings that contain \U. My point was merely that decode() does not decode the Unicode text gotten from subprocess in this edge case. As you state in the docs, subprocess is a module that returns strings that require some massaging to get into Pythonic form. I have no idea which OS X CLIs return the old NeXT plist format, but when someone writing a workflow with Alfred-Workflow runs into one, they won't have any idea what the hell is going on. And the solution is to .decode() the text, so it doesn't seem totally inappropriate to add this functionality to Workflow.decode().

Obviously, it's your package, so do as you see fit, and this will be my last-ditch effort. I've altered the function so that it handles strings with "\U" randomly in them. It will only use .decode('unicode-escape') if it finds a "\U" followed by 3 or more digits:

import re
import unicodedata


def decode(text, encoding='utf-8', normalization='NFC'):
    """Return ``text`` as normalised unicode.

    :param text: string
    :type text: encoded or Unicode string. If ``text`` is already a
        Unicode string, it will only be normalised.
    :param encoding: The text encoding to use to decode ``text`` to
        Unicode.
    :type encoding: ``unicode`` or ``None``
    :param normalization: The normalisation form to apply to ``text``.
    :type normalization: ``unicode`` or ``None``
    :returns: decoded and normalised ``unicode``

    """
    # convert byte string to Unicode (Python 2: ``basestring``/``unicode``)
    if isinstance(text, basestring):
        if not isinstance(text, unicode):
            text = unicode(text, encoding)
    # decode Cocoa/CoreFoundation Unicode escapes to Python Unicode,
    # but only when a "\U" is followed by at least three digits
    if re.search(r'\\U\d{3,}', text):
        text = text.replace('\\U', '\\u').decode('unicode-escape')
    return unicodedata.normalize(normalization, text)

This seems to me sufficiently safe, such that there is no downside to adding the functionality. But, as I say, I will drop it at this.

deanishe avatar deanishe commented on May 20, 2024

My point was merely that decode() does not decode the Unicode text gotten from subprocess in this edge case

Yes, it does decode the text. 'To\U0304ny\U0308 Sta\U030ark' contains representations of Unicode codepoints within an encoded string, just as writing mystring = u'To\u0304ny' in your Python source code is a Unicode representation within an ASCII- or UTF-8-encoded text file.
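
To make that concrete, here is a minimal sketch in Python 3 syntax (the thread itself is Python 2). The byte string is a hand-written stand-in for what subprocess might return, not captured mdls output. After decoding and normalising, the backslash-U sequences are still there as plain ASCII characters, because they were never part of the text encoding in the first place.

```python
import unicodedata

# Hypothetical sample: bytes as a subprocess call might hand them back.
raw = b'"To\\U0304ny\\U0308 Sta\\U030ark"'

# This is all a text decoder is meant to do: bytes -> Unicode, then normalise.
text = unicodedata.normalize('NFC', raw.decode('utf-8'))

# The escapes survive untouched; they are six ordinary characters each.
print('\\U0304' in text)
print(text == raw.decode('ascii'))
```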

What you're suggesting is hardcoding a second, mdls codec and corresponding decoding step in decode(). If you want to decode mdls output, then you should be specifying something other than utf-8 as the encoding, not hardcoding your mdls codec into decode().

This seems to me sufficiently safe, such that there is no downside to adding the functionality.

It's not safe, it's broken.

Let's look at your Stack Overflow question as an example. Say you're using some hypothetical Stack Overflow command line tool via subprocess to grab your post and its comments, and then you run the output through decode() (as you should because it might contain non-ASCII characters).

Your question:

I'm having issues reading Unicode text from the shell into Python. I have a test document with the following metadata attribute:

kMDItemAuthors = (
    "To\U0304ny\U0308 Sta\U030ark"
)

Current decode() returns:

I'm having issues reading Unicode text from the shell into Python. I have a test document with the following metadata attribute:

kMDItemAuthors = (
    "To\U0304ny\U0308 Sta\U030ark"
)

Your modified decode() returns:

I'm having issues reading Unicode text from the shell into Python. I have a test document with the following metadata attribute:

kMDItemAuthors = (
    "Tōnÿ Stårk"
)

Do you see the problem?
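
For anyone who wants to reproduce this, a rough Python 3 port of the proposed function is sketched below. The name modified_decode and the sample bytes are made up for illustration, and the port assumes the input is ASCII-only, since the .encode('ascii') step stands in for the implicit conversion Python 2 performs before .decode('unicode-escape').

```python
import re
import unicodedata


def modified_decode(text, encoding='utf-8', normalization='NFC'):
    """Illustrative Python 3 port of the proposed decode()."""
    if isinstance(text, bytes):
        text = text.decode(encoding)
    # Same heuristic as the proposal: "\U" followed by 3+ digits
    if re.search(r'\\U\d{3,}', text):
        text = text.replace('\\U', '\\u')
        text = text.encode('ascii').decode('unicode-escape')
    return unicodedata.normalize(normalization, text)


# Hypothetical tool output: a post that quotes mdls output literally.
post = b'kMDItemAuthors = (\n    "To\\U0304ny\\U0308 Sta\\U030ark"\n)'
result = modified_decode(post)

# The literal sample the author typed has been rewritten.
print('\\U0304' in result)
print(result)
```

The function cannot tell mdls output apart from text that merely talks about mdls output, so the quoted example is silently altered.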

fractaledmind avatar fractaledmind commented on May 20, 2024

Point taken.

deanishe avatar deanishe commented on May 20, 2024

Seeing as mdls apparently only outputs ASCII, you might want to look into adding a codec for mdls's Unicode escape format, which you could then pass to decode().

That might be total overkill, though.
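
A decode-only sketch of what such a codec could look like, in Python 3 syntax, using the standard codecs.register() hook. The codec name 'mdls' and the helper names are invented here. It also assumes every escape is a backslash, a capital U and exactly four hex digits, as in the sample above; escaped backslashes and characters outside the Basic Multilingual Plane are not handled.

```python
import codecs
import re
import unicodedata

# Assumption: escapes look like \U0304 (exactly four hex digits).
_ESCAPE = re.compile(r'\\U([0-9a-fA-F]{4})')


def _mdls_decode(data, errors='strict'):
    """Decode ASCII bytes, then expand \\Uxxxx escapes."""
    text = bytes(data).decode('ascii', errors)
    text = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    return text, len(data)


def _mdls_encode(text, errors='strict'):
    # Encoding is out of scope for this sketch.
    raise NotImplementedError('the mdls codec sketch is decode-only')


def _search(name):
    """Codec search function: only answers for the made-up name 'mdls'."""
    if name == 'mdls':
        return codecs.CodecInfo(name='mdls',
                                encode=_mdls_encode,
                                decode=_mdls_decode)
    return None


codecs.register(_search)

raw = b'"To\\U0304ny\\U0308 Sta\\U030ark"'  # hypothetical sample
decoded = unicodedata.normalize('NFC', raw.decode('mdls'))
print(decoded)
```

Because the expansion only happens when the caller explicitly asks for the 'mdls' encoding, ordinary strings that happen to contain \U are left alone. Once registered, the name could presumably be passed as the encoding argument to decode(), though that has not been tried here.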
