Unicode problem (I guess) (kramdown issue, closed)

gettalong commented on June 17, 2024
Unicode problem (I guess)

from kramdown.

Comments (22)

lydell commented on June 17, 2024

What settings? How?

gettalong commented on June 17, 2024

Could you please run the following:

ruby -e 'puts Encoding.default_external'
ruby -e 'puts Encoding.default_internal'

The solution is probably that the input needs to be converted to UTF-8 before being parsed. Not sure, though, if the output needs to be converted back...
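
To make the idea concrete, here is a minimal sketch of what "converting the input to UTF-8" could look like in plain Ruby. The helper name `to_utf8` is made up for illustration and is not part of kramdown; the sketch assumes the caller knows the real encoding of the bytes.

```ruby
# Hypothetical helper (not kramdown API): label raw bytes with the encoding
# they were written in, then transcode them to UTF-8.
def to_utf8(bytes, source_encoding)
  # dup first so the caller's string is not relabelled in place
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

cp850_input = "\x86".b                 # the single byte that means "å" in CP850
utf8 = to_utf8(cp850_input, "CP850")

puts utf8.encoding                     # UTF-8
puts utf8.bytes.inspect                # [195, 165], the UTF-8 bytes of "å"
```

`force_encoding` only changes the label, while `encode` rewrites the bytes; both steps are needed when the bytes arrive with the wrong label or none at all.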

lydell commented on June 17, 2024
$ ruby -e 'puts Encoding.default_external'
CP850
$ ruby -e 'puts Encoding.default_internal'

"The solution is probably that the input needs to be converted to UTF-8 before being parsed."

I have also tried to convert files saved in UTF-8 with input as described. The same thing happens.

gettalong commented on June 17, 2024

Yeah, because your environment's external encoding is CP850, which means that all files read and all text input get the encoding label CP850...
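
A small sketch of what "get the encoding label" means in practice. It writes the UTF-8 bytes for "å" to a temporary file and reads them back as if the external encoding were CP850. The helper names are invented for this example, and passing `encoding:` explicitly stands in for the environment default.

```ruby
require "tempfile"

# Hypothetical helper: read a file the way Ruby would under a given external
# encoding. The bytes are not converted, they are only labelled.
def read_with_label(path, external)
  File.read(path, encoding: external)
end

# Hypothetical demo: UTF-8 bytes on disk, read back with a CP850 label.
def demo_label
  Tempfile.create("label-demo") do |file|
    file.binmode
    file.write("\u00E5".b)             # UTF-8 bytes for "å": C3 A5
    file.flush
    read_with_label(file.path, "CP850")
  end
end

text = demo_label
puts text.encoding                     # the CP850 encoding object
puts text.bytes.inspect                # [195, 165], unchanged from disk
puts text.length                       # 2, each byte now counts as one character
```

The file content is still UTF-8 on disk, but Ruby treats it as two CP850 characters, which is why the parser sees garbage.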

What I meant by "The solution..." is that kramdown needs to do this internally, i.e. convert the input to UTF-8. I will see how the Ruby stdlib CSV library does this and will probably follow in its footsteps.

lydell commented on June 17, 2024

Can I work around this? I know nothing about ruby, I just have it installed so I can use kramdown and sass, but it sounds like a bad thing to have an "environment external encoding" set to anything other than UTF-8.

gettalong commented on June 17, 2024

You can try calling kramdown in the following way until I can fix this (note that the input has to be valid UTF-8):

ruby --external-encoding UTF-8 -S kramdown
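
If kramdown is called from a Ruby script rather than from the command line, a roughly similar effect can be had by setting `Encoding.default_external` inside the process. The following is only a sketch: `with_external_encoding` is a made-up helper, and it only affects strings and IO objects created after the switch.

```ruby
# Hypothetical helper: run a block with a different default external
# encoding and restore the previous value afterwards.
def with_external_encoding(encoding)
  previous = Encoding.default_external
  Encoding.default_external = encoding
  yield
ensure
  Encoding.default_external = previous if previous
end

with_external_encoding(Encoding::UTF_8) do
  puts Encoding.default_external       # UTF-8 inside the block
end
```

As with the command-line switch, this assumes that the input really is valid UTF-8.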

lydell commented on June 17, 2024

Thanks, that works.

lydell commented on June 17, 2024

Possibly related:

C:\Users\Simon\Desktop>ruby aaå.rb
Hello, World!

C:\Users\Simon\Desktop>kramdown aaå.rb
C:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `read': No such file or directory - aaå.rb (Errno::ENOENT)
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `<top (required)>'
        from C:/Ruby193/bin/kramdown:23:in `load'
        from C:/Ruby193/bin/kramdown:23:in `<main>'

gettalong commented on June 17, 2024

You may also want to globally change the default encoding to UTF-8 if that is what you use. See for example http://stackoverflow.com/questions/469163/how-to-set-the-default-encoding-in-windows-xp and http://stackoverflow.com/questions/11806512/ruby-1-9-wrong-file-encoding-on-windows.

And yes, this will be related. However, this is a general problem: if you use UTF-8, you should set that as the encoding for your computer because otherwise the CP850 encoding will always cause trouble.

 avatar commented on June 17, 2024

Setting LANG fixes it for me

LANG=en_US.CP850 kramdown aaå.rb
/usr/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `read': No such file or directory - aaå.rb (Errno::ENOENT)
        from /usr/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `<top (required)>'
        from /usr/bin/kramdown:23:in `load'
        from /usr/bin/kramdown:23:in `<main>'
LANG=en_US.UTF-8 kramdown aaå.rb
<p>puts `Hello, World!'</p>

alexpeattie commented on June 17, 2024

I ran into this problem today because Tilt doesn't handle encoding properly (rtomayko/tilt#75). This is how I patched the problem, in case it's useful to anyone.

Ruby 2

module Kramdown::Parser
  module EncodingFix
    def adapt_source(source)
      super.force_encoding('UTF-8')
    end
  end

  class Base; prepend EncodingFix; end
end

Ruby <= 1.9

module Kramdown::Parser
  class Base
    alias old_adapt_source adapt_source

    def adapt_source(source)
      old_adapt_source(source).force_encoding('UTF-8')
    end
  end
end

gettalong commented on June 17, 2024

So, this does seem to be a bit more complicated... or funny, depending on how you look at it.

I took the example from @lydell and put it in a file on Windows 7 with Notepad and selected ANSI encoding. So, what does one expect here? I expected that the file would contain CP850 encoded characters because this is what seems to be the encoding on Windows 7 command line. But when you look up what ANSI encoding means, you see that it is actually called CP-1252. So the file gets saved in CP-1252 format.

On the command line, ruby reads it in as CP-850 (because this is the external encoding) and then outputs the result as CP-850, which leads to å becoming Õ... which is just not right.
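
The mix-up can be reproduced without Notepad. This sketch encodes a character in one code page and then reads the same bytes under another; `misread` is a hypothetical helper written only for this illustration.

```ruby
# Hypothetical helper: write `text` in one encoding, read the same bytes
# back under a different label, and return the result as UTF-8 for display.
def misread(text, written_as, read_as)
  text.encode(written_as).force_encoding(read_as).encode(Encoding::UTF_8)
end

# "å" saved as Windows-1252 ("ANSI" in Notepad) is the single byte 0xE5.
# Under CP850 that same byte is a different letter.
puts misread("\u00E5", "Windows-1252", "CP850")
```

The byte never changes; only the interpretation does, which is why the file looks fine in Notepad and wrong on the console.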

So... you are basically screwed on Windows if you expect a sane default environment because the command line encoding differs from the GUI encoding (or however one wants to phrase that).

However, since there is also still a bug in kramdown, I will fix it by converting the source string to UTF-8 in Kramdown::Parser::Base.adapt_source and converting the result back to the original encoding of the string in Kramdown::Converter::Base.convert. The back-conversion is not really needed in the most common use cases because on terminal output or when writing to files Ruby automatically transcodes strings to the external encoding. However, when the string is further transformed in Ruby, the caller probably expects a string in the same encoding as the one they passed in.
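
The round trip described here can be sketched in a few lines. This is not kramdown's actual code, only an illustration of the approach; `process_as_utf8` is an invented name and the block stands in for parsing and converting.

```ruby
# Hypothetical wrapper (not kramdown's implementation): remember the
# caller's encoding, work in UTF-8, and hand back the original encoding.
def process_as_utf8(source)
  original = source.encoding
  utf8 = source.encode(Encoding::UTF_8)   # transcode, do not just relabel
  result = yield(utf8)                    # stand-in for parse + convert
  result.encode(original)
end

input = "\x86".b.force_encoding("CP850")  # "å" as a CP850 string
output = process_as_utf8(input) { |text| "<p>#{text}</p>" }

puts output.encoding                      # same encoding as the input
puts output.bytes.inspect                 # "<p>", the byte 0x86, "</p>"
```

Note that `encode` raises if the result contains characters the original encoding cannot represent, so a real implementation would have to decide how to handle that case.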

And the result of all this? If you save a file on Windows with a CP-850 encoding, kramdown will now work correctly. Just remember that saving a file in Notepad with the ANSI encoding does not mean CP-850 but CP-1252 (or WINDOWS-1252 as it is known to ruby)!

Coming to you with the next release of kramdown which will be the (spoiler alert) 1.0.0 😄


The problem with the input file can't be solved by kramdown since this is a general problem (Question: Is the encoding of the file system paths different from the external encoding on the Windows 7 cmd command line? Answer: Yes, it seems so. Solution: For your sanity and mine, please just set the default encoding to UTF-8 everywhere and use UTF-8 everywhere).

lydell commented on June 17, 2024

I just tried 1.0.0. I would like to confirm that the original test case now works! Thanks!

Converting a kramdown file with UTF-8 (without BOM) encoding now works out of the box, without changing any settings or typing extra things on the command line. Great!

However, the "possibly related" issue still persists, with the same error. @svnpenn's LANG fix does not work for me:

$ LANG=en-US.UTF-8 kramdown aaå.rb
c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1351:in `===': invalid byte sequence in UTF-8 (ArgumentError)
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1351:in `block in parse_in_order'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1347:in `catch'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1347:in `parse_in_order'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1341:in `order!'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1432:in `permute!'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1453:in `parse!'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-1.0.0/bin/kramdown:16:in `<top (required)>'
        from c:/Ruby193/bin/kramdown:23:in `load'
        from c:/Ruby193/bin/kramdown:23:in `<main>'

Is it related, or should a new issue be opened? Or, should I change my settings? If the latter—exactly what should I change? (The answer to that should be added to kramdown's install instructions for Windows.) Why can ruby find files with unicode characters in them, but not kramdown?

Anyways, thanks for the quick solution for the main problem! (For the time being, I could just avoid unicode characters in my file names, or rename them temporarily.)

gettalong commented on June 17, 2024

There are still known problems with Ruby, Windows and Unicode path names (see, for example, http://bugs.ruby-lang.org/issues/1685). However, they may not apply to this situation.

You should be able to work around this by using ruby --external-encoding UTF-8. However, you need to be aware that this assumes that the content files for kramdown are also in UTF-8 and not CP850!

I have searched a bit and found the chcp command, with which you can change the code page used by cmd.exe. Code page 65001 can be used for UTF-8. You should also change the console font from the raster font to something else (Lucida Console works fine for me).

After changing to code page 1252 (chcp 1252), I was able to execute the following command:

C:\temp>chcp 850
Aktive Codepage: 850.

C:\temp>kramdown ä.txt
C:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-1.0.0/bin/kramdown:59:in `read': No such file or directory - ä.txt (Errno::ENOENT)
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-1.0.0/bin/kramdown:59:in `<top (required)>'
        from C:/Ruby193/bin/kramdown:23:in `load'
        from C:/Ruby193/bin/kramdown:23:in `<main>'

C:\temp>chcp 1252
Aktive Codepage: 1252.

C:\temp>kramdown ä.txt
<p>aaä</p>

C:\temp>

Please note that the contents of the ä.txt file were in Windows 1252 encoding and not UTF-8. If they had been in UTF-8, the output would have been <p>aaÃ¤</p> (because Ruby would interpret the bytes as being Windows 1252 encoded before giving them to kramdown).

And the moral of the story? I don't think that I can provide you with a general solution. You need to make sure that the proper code page is set on the command line and that all your files and their contents are encoded in this code page. I'm sorry I can't help more but I don't really use Windows that often.

 avatar commented on June 17, 2024

@lydell, you typed

LANG=en-US.UTF-8 kramdown aaå.rb

whereas I typed

LANG=en_US.UTF-8 kramdown aaå.rb

Note the underscore.

gettalong commented on June 17, 2024

Does using LANG=... on Windows really work?

 avatar commented on June 17, 2024

@gettalong with Cygwin

gettalong commented on June 17, 2024

Ah, yes, of course, there it should work. But I don't think it works with the Windows Command Shell.

lydell commented on June 17, 2024

Ah, I'm so used to seeing en-US that I just couldn't see that underscore … However, I still got the same result :(

LANG=... does not work with Windows Command Shell, that's right. The discussion might have been a bit confused, since I've sometimes used Windows Command Shell and sometimes (mostly) Git bash (I don't know if LANG=... works there).

In Windows Command Shell, running chcp 1252 before running kramdown aaå.rb works! aaå.rb is saved with UTF-8 encoding though. And using chcp 65001 did not work! I'm now very confused …

Unfortunately, the chcp command does not work in Git bash, which is what I use the most. Luckily, I found a solution: cat aaå.rb | kramdown.

It seems like Sass has the same problem:

$ sass aaå.rb
Errno::ENOENT: No such file or directory - aaÕ.rb
  Use --trace for backtrace.

I also tried feeding the aaå.rb files to half a dozen compilers using node.js. All of them found it and used it correctly. So it really seems to be a ruby thing.

To sum up, thanks for your efforts! The important thing is that the main issue is resolved. I will continue to experiment with this, because I really want Windows users to be able to enjoy kramdown :)

gettalong commented on June 17, 2024

If you change the code page using chcp, you basically change what Ruby sees as the encoding of its environment. I referred to this as the default external encoding (use ruby -e "puts Encoding.default_external" to output it).

So if you change the code page to 1252, kramdown/ruby now can read the file name correctly because the encodings match (the default external with the encoding of the file name). However, since the content of the file is in a different encoding than the file name, the output from kramdown is garbled...

(side note: I don't really know how file system paths work on Windows and whether they are encoded in UTF-8 or Windows 1252, I'm just interpreting the data).

What exactly is "Git bash"? If it is based on Cygwin, the LANG trick from @svnpenn should work!

gettalong commented on June 17, 2024

Also just found http://stackoverflow.com/questions/2050973/what-encoding-are-filenames-in-ntfs-stored-as

After reading this it seems that Ruby is using the ANSI version of the fopen system call because it works if the external encoding is Windows 1252 but not if it is UTF-8. So this could probably be fixed by always converting the file name to the proper ANSI encoding when passing it to a Ruby file method.
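
As a rough illustration of that idea (not a tested fix for Ruby 1.9 on Windows), the file name could be transcoded to the ANSI code page before it is handed to a file method. The helper name `ansi_path` and the default of Windows-1252 are assumptions; the real ANSI code page depends on the system locale.

```ruby
# Hypothetical helper: transcode a file name to an assumed ANSI code page.
# Falls back to the unchanged name if a character has no ANSI equivalent.
def ansi_path(name, ansi_encoding = "Windows-1252")
  name.encode(ansi_encoding)
rescue Encoding::UndefinedConversionError
  name
end

path = ansi_path("aa\u00E5.rb")
puts path.encoding                      # Windows-1252
puts path.bytes.inspect                 # "å" is now the single byte 0xE5 (229)
```

Names containing characters outside the ANSI code page cannot be represented this way at all, which is the limit of this approach.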

 avatar commented on June 17, 2024

@gettalong "Git Bash" is essentially MinGW, and yes it is based on Cygwin.
