UTF-16 string literals (crystal, OPEN)

straight-shoota commented on June 25, 2024
UTF-16 string literals

from crystal.

Comments (14)

straight-shoota commented on June 25, 2024

Do we want to proceed with StringLiteral#to_utf16 then?
I think it's more elegant than String#utf16_literal (#14670 (comment)). And apparently more performant (#14676 (comment)).

BlobCodes commented on June 25, 2024

> Maybe the conversion algorithm from String#to_utf16 could be implemented as a macro method? It's a bit complex, but not too much. I don't think we can explicitly do math operations on 16-bit integers in the macro language, though.

Simply porting #14671 works fine; explicit math operations on 16-bit integers are not needed.

class String
  macro utf16_literal(data)
    {%
      arr = [] of NumberLiteral
      data.chars.each do |c|
        c = c.ord
        if c < 0x1_0000
          arr << c
        else
          c -= 0x1_0000
          arr << 0xd800 + ((c >> 10) & 0x3ff)
          arr << 0xdc00 + (c & 0x3ff)
        end
      end
      arr << 0
    %}
    Slice(UInt16).literal({{arr.splat}})[0, {{arr.size - 1}}]
  end
end

s = String.utf16_literal("TEST 😐🐙 ±∀ の")
# => Slice[84, 69, 83, 84, 32, 55357, 56848, 55357, 56345, 32, 177, 8704, 32, 12398]

String.from_utf16(s)
# => "TEST 😐🐙 ±∀ の"

Encoding 10000 characters takes around 300ms.
That's certainly not fast, but probably good enough.

EDIT: Added a final 0 byte
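
For comparison, the same surrogate-pair arithmetic can be sketched as ordinary runtime code. This is only a minimal sketch, not the standard library's implementation; `utf16_units` is a hypothetical helper name, and its result is checked against `String#to_utf16`.

```crystal
# Hypothetical helper (not part of the standard library): encodes a string
# into UTF-16 code units at runtime, mirroring the macro's arithmetic.
def utf16_units(str : String) : Array(UInt16)
  units = [] of UInt16
  str.each_char do |char|
    cp = char.ord
    if cp < 0x1_0000
      # Code points in the Basic Multilingual Plane fit in one code unit.
      units << cp.to_u16
    else
      # Supplementary code points are split into a surrogate pair.
      cp -= 0x1_0000
      units << (0xd800 + ((cp >> 10) & 0x3ff)).to_u16 # high surrogate
      units << (0xdc00 + (cp & 0x3ff)).to_u16         # low surrogate
    end
  end
  units
end

text = "TEST 😐🐙 ±∀ の"
# Compare against the standard library's runtime encoder.
puts utf16_units(text) == text.to_utf16.to_a
```

Unlike the macro version, this does not append a trailing 0, since it returns only the encoded code units.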

ysbaddaden commented on June 25, 2024

The macro is nice, but if we want to eventually have the compiler optimize it, maybe we could just expose String#to_utf16 to macros directly? For example, {{ "CRYSTAL_TRACE".to_utf16 }} would be lovely & fast.

BlobCodes commented on June 25, 2024

The version from my comment uses the literals from #13716, so it is static data in this case, although that is still an experimental API.

straight-shoota commented on June 25, 2024

> So the conversions could be avoided entirely

Would be nice. But I believe we're quite a bit away from that. The Windows ecosystem is huge and it has 30 years of wide chars in it.

straight-shoota commented on June 25, 2024

Hm, that's an interesting idea. Exposing StringLiteral#to_utf16 would certainly have the benefit that you have the resulting literal easily available in macro land.
I like that it's exactly identical to the runtime version, but in a macro expansion which makes it clear that this happens at compile time.

FTR: Eventual compiler optimization would also be possible with String.utf16_literal. We could turn this macro into a primitive later.

Let's focus on UTF-16 string literals here and continue the discussion about UTF-8 support on Win32 in a different issue. I'm pretty sure we won't lose all use cases for UTF-16 string literals overnight, so this will still be useful.

bcardiff commented on June 25, 2024

I like StringLiteral#to_utf16. If doing that means we end up with a SliceLiteral, even one without first-class syntax yet, it would be a double win, because embedding resources could then leverage a similar StringLiteral#to_slice at compile time.

straight-shoota commented on June 25, 2024

Looks like a winner, then 🚀

> That's certainly not fast, but probably good enough

Yeah, this is mainly for relatively short strings, so performance should not be an issue.
We can always push it up into the compiler if the need arises.

Btw. CharLiteral#ord was only added in 1.11 (#13910), so this wouldn't have been possible before.

straight-shoota commented on June 25, 2024

In order to make it actually static data, we'd also need a slice literal (#2886).

stakach commented on June 25, 2024

Worth noting that Windows supports UTF-8 now and encourages use of those APIs:

https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page#-a-vs--w-apis

So the conversions could be avoided entirely

ysbaddaden commented on June 25, 2024

@straight-shoota this is reusing the "old" ANSI API to use the UTF-8 codepage, so it might just work 🤷

It took me a while to find this: the above link explains how to set the Active Code Page (ACP) to UTF-8, which requires a manifest and calling an EXE to "add the manifest" to an executable. The ANSI variants of the Windows API called by that executable will then use UTF-8.

That being said, it requires Windows 10 v1903 (2019) and GDI applications won't support it unless the user activates a beta setting.

ysbaddaden commented on June 25, 2024

The difficulty in implementing StringLiteral#to_utf16 is that there is no SliceLiteral, and we should generate a Slice(UInt16).literal(..., 0); I have no idea how to achieve that.
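
The trailing 0 matters because Win32 wide-string functions expect a null-terminated `UInt16*`. As a rough illustration of the same layout at runtime, `String#to_utf16` is documented to return a slice whose size excludes the terminator, while the code unit just past its end is zero. A minimal sketch (the variable name `wide` is only for illustration):

```crystal
wide = "hi".to_utf16

# The slice covers only the encoded code units...
puts wide.size # => 2

# ...while the unit right after the slice is the zero terminator,
# so the pointer can be handed to C functions expecting a wide C string.
puts wide.to_unsafe[wide.size] # => 0
```

A compile-time literal would presumably want the same shape: n + 1 code units of static data, exposed as a slice of size n.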

BlobCodes commented on June 25, 2024

> The difficulty in implementing StringLiteral#to_utf16 is that there is no SliceLiteral, and we should generate a Slice(UInt16).literal(..., 0); I have no idea how to achieve that.

It could return ArrayLiteral(NumberLiteral) (or Call(@receiver=Generic(@name=Slice, @type_vars=[UInt16]), @name="literal", @args=[0, 1, 2, 3, 4, 5, ...]))


Btw I just tested the performance of my macro code a bit more.
Simply replacing the line {{ arr.splat }} with {% arr.splat %} 0 (so the resulting splat is not parsed) improves the runtime of encoding 10000 characters from ~300ms to ~20ms.

The macro language actually isn't that slow - the parser is.

Implementing StringLiteral#to_utf16 wouldn't improve performance in a perceivable manner since it would only remove <10% of the runtime.

Maybe there should be a way to create AST nodes directly inside the macro language, so we don't have to parse everything again.

stakach commented on June 25, 2024

> GDI applications won't support it unless the user activates a beta setting.

You can activate the code page in code; this is how applications like the MS Edge browser run.
MS Edge is a React Native app, so it runs using JS and UTF-8 (although Microsoft is removing React).
