
clean: true problem (roo issue, closed)

roo-rb commented on July 18, 2024
clean: true problem

from roo.

Comments (5)

mwmalinowski commented on July 18, 2024

The :clean option is only designed to apply to entire sheets.

From lib/roo/base.rb:

"you can also pass in a :clean => true option to strip the sheet of odd unicode characters and white spaces around columns"

This raises the question: which Unicode characters are 'odd'?

If :clean means that all non-ASCII characters will be removed, then perhaps 'non-ASCII' would be a more informative comment to use.

If :clean means that control characters and other invisibles will be removed, then Roo::Base#sanitize_value needs more discriminating criteria for the Array#select block.

I'm guessing that the characters anyone would want removed are the Unicode values greater than 126 and less than 160 (that is, 127 through 159).

http://en.wikipedia.org/wiki/List_of_Unicode_characters

@scaryguy If you just want to remove preceding and trailing whitespace then I think you want to iterate your headers and use String#strip! on each of them.
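A minimal sketch of that header cleanup, assuming the headers are already available as an array of strings (the `headers` values here are made up for illustration). It uses the non-mutating String#strip through `map`, because String#strip! returns nil when nothing was removed and so its return value is not safe to collect:

```ruby
# Hypothetical header row, as it might come back from a sheet.
headers = ["  Name ", "Email\t", "Phone"]

# Build a new array of stripped copies; the originals are left untouched.
clean_headers = headers.map(&:strip)
```

If in-place mutation is preferred, `headers.each(&:strip!)` should also work, as long as the strings are not frozen.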


stevendaniels commented on July 18, 2024

Can anyone comment on why the :clean option removes all non-ASCII characters?


stevendaniels commented on July 18, 2024

@simonoff @Empact Should we reconsider what the :clean option does?


mwmalinowski commented on July 18, 2024

@stevendaniels
If the :clean option is intended to mirror the Excel function CLEAN(), then its main purpose is to remove control codes[1]. That is not what the clean_sheet method does: it removes Unicode characters with code points >= 127 and then any leading and trailing ASCII whitespace[2] via Ruby's String#strip.

This means that non-printable non-whitespace ASCII characters (e.g. SOH) will not be removed... which... seems like strange behavior to me.

Also, as @scaryguy pointed out, the :clean option paints with a brush that's probably too broad. I don't think one would normally expect the word "clean" in this context to mean removing ™, ü, é, or a boatload of other legitimate, printable Unicode characters.
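To make the two complaints above concrete, here is a small sketch that reproduces the behavior as described in this thread. The method name `ascii_only_clean` is made up for illustration, and the body is reconstructed from the description (drop code points >= 127, then String#strip), not copied from the roo source:

```ruby
# Hypothetical stand-in for the behavior described in this thread;
# an assumption, not the actual roo implementation.
def ascii_only_clean(value)
  # Keep only code points below 127, then trim ASCII whitespace.
  value.unpack('U*').select { |cp| cp < 127 }.pack('U*').strip
end

soh = "\u0001" # SOH, a non-printable ASCII control character
sample = " #{soh}caf\u00e9 \u2122 "
result = ascii_only_clean(sample)
```

With this input, `result` still begins with the SOH control character, while é and ™ are gone, which is the opposite of what most people would expect from "clean".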

I think you're on the right track suggesting that the :clean option needs refining and/or redefinition.


[1] https://support.office.com/en-us/article/CLEAN-function-334d8a17-c9e1-4d8e-96c2-c89530201f4d

Removes all nonprintable characters from text. Use CLEAN on text imported from other applications that contains characters that may not print with your operating system. For example, you can use CLEAN to remove some low-level computer code that is frequently at the beginning and end of data files and cannot be printed.

IMPORTANT The CLEAN function was designed to remove the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.
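For comparison, a rough Ruby sketch of what the Office description above amounts to. Both method names are hypothetical. The first drops only values 0 through 31, as CLEAN is documented to do. The second is an assumed stricter variant that also drops the extra values listed in the note:

```ruby
# Hypothetical helper: drops code points 0 through 31 only,
# following the documented scope of Excel's CLEAN.
def excel_like_clean(text)
  text.each_char.reject { |ch| ch.ord < 32 }.join
end

# Hypothetical stricter variant: also drops the additional nonprinting
# values named in the Office note (127, 129, 141, 143, 144, 157).
def strict_clean(text, extra: [127, 129, 141, 143, 144, 157])
  text.each_char.reject { |ch| ch.ord < 32 || extra.include?(ch.ord) }.join
end
```

Neither version touches printable characters such as ü, and neither trims whitespace other than tab, newline, and the other control characters below 32.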

[2] I'm emphasizing that String#strip only removes ASCII whitespace characters. Non-breaking spaces, for instance, are not removed by String#strip:

# 0xa0 = 160 = non-breaking space
# 0x20 = 32 = regular ASCII space
irb(main):001:0> "\u{a0}hello\u{20}".strip
=> " hello"


stevendaniels commented on July 18, 2024

In my opinion, the :clean option should match the Office description of CLEAN.

I also think it should remove Unicode spaces in addition to ASCII spaces. When would keeping Unicode spaces be a desired result?
Something like this would work. I'll turn it into a pull request.

def sanitize_value(v)
  # \A and \z anchor to the whole string; ^ and $ would also match at every line break
  v.gsub(/[[:cntrl:]]|\A[\p{Space}]+|[\p{Space}]+\z/, '')
end
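A quick check of that idea, as a sketch rather than the eventual pull request. The name `clean_value` is hypothetical. This version anchors with \A and \z so that only the ends of the whole string are trimmed; in Ruby, ^ and $ match at every line boundary:

```ruby
# Hypothetical name; a sketch of the proposal, not the merged roo code.
def clean_value(value)
  # Remove control characters anywhere, and Unicode whitespace
  # (including non-breaking spaces) at the start and end of the string.
  value.gsub(/[[:cntrl:]]|\A\p{Space}+|\p{Space}+\z/, '')
end

nbsp = "\u00a0"
cleaned = clean_value("#{nbsp} caf\u00e9\u0001 \u2122 #{nbsp}")
```

Here the non-breaking spaces at both ends and the embedded SOH are removed, while é, ™, and the interior space are kept. One caveat: gsub makes a single pass, so whitespace that sits directly before a trailing control character is left behind. Running the trim as a second step after removing control characters would cover that case.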
