Comments (5)
The :clean option is only designed to apply to entire sheets.
From lib/roo/base.rb:
you can also pass in a :clean => true option to strip the sheet of
odd unicode characters and white spaces around columns
This raises the question: which Unicode characters are 'odd'?
If :clean means that all non-ASCII characters will be removed, then perhaps 'non-ASCII' would be a more informative term to use in that comment.
If :clean means that control characters and other invisibles will be removed, then Roo::Base#sanitize_value needs more discriminating criteria for the Array#select block.
I'm guessing that the characters that anyone would want removed are the Unicode values greater than 126 and less than 160.
http://en.wikipedia.org/wiki/List_of_Unicode_characters
@scaryguy If you just want to remove preceding and trailing whitespace then I think you want to iterate your headers and use String#strip! on each of them.
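A minimal sketch of that header-stripping idea, using a plain array of strings as a stand-in for a header row (the sample values are made up and nothing here is roo API). One thing to watch: String#strip! returns nil when there was nothing to strip, so it is fine with each but not with map.

```ruby
# Hypothetical header row; dup gives mutable copies of the literals.
headers = ["  Name ", "Email", "\tPhone\n"].map(&:dup)

# Non-destructive: build a new array of stripped copies.
cleaned = headers.map(&:strip)

# Destructive: strip each string in place. strip! returns nil when a
# string is already clean ("Email" here), so its return value is ignored.
headers.each(&:strip!)

p cleaned
p headers
```

Both arrays end up with the same stripped values; the difference is only whether the original strings are modified.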
from roo.
Can anyone comment on why the :clean option removes all non-ASCII characters?
from roo.
@simonoff @Empact Should we reconsider what the :clean option does?
from roo.
@stevendaniels
If the :clean option is intended to mirror the Excel function CLEAN() then its main purpose is to remove control codes[1]. That is not what the clean_sheet method does: it removes Unicode characters with code points >= 127 and then any leading and trailing ASCII whitespace[2] via Ruby's String#strip.
This means that non-printable non-whitespace ASCII characters (e.g. SOH) will not be removed... which... seems like strange behavior to me.
Also, as @scaryguy pointed out, the :clean option paints with a brush that's probably too broad. I don't think one would normally expect the word "clean" in this context to mean removing ™, ü, é, or a boatload of other legitimate, printable Unicode characters.
I think you're on the right track suggesting that the :clean option needs refining and/or redefinition.
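To make the behavior described above concrete, here is a small sketch written from that description (keep code points below 127, then call String#strip). The method name ascii_only_clean is made up for illustration and this is not presented as roo's actual source.

```ruby
# Sketch of the described behavior, not roo's code: drop every code point
# of 127 or above, then strip leading/trailing ASCII whitespace.
def ascii_only_clean(value)
  value.unpack('U*').select { |cp| cp < 127 }.pack('U*').strip
end

# The accented letter and the trademark sign are both dropped.
p ascii_only_clean("  caf\u00E9 \u2122 ")

# SOH (code point 1) is below 127 and is not whitespace, so it survives.
p ascii_only_clean("\u0001ok")
```

The first call loses legitimate printable characters while the second keeps a control code, which is the mismatch with the word "clean" discussed above.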
[1] https://support.office.com/en-us/article/CLEAN-function-334d8a17-c9e1-4d8e-96c2-c89530201f4d
Removes all nonprintable characters from text. Use CLEAN on text imported from other applications that contains characters that may not print with your operating system. For example, you can use CLEAN to remove some low-level computer code that is frequently at the beginning and end of data files and cannot be printed.
IMPORTANT The CLEAN function was designed to remove the first 32 nonprinting characters in the 7-bit ASCII code (values 0 through 31) from text. In the Unicode character set, there are additional nonprinting characters (values 127, 129, 141, 143, 144, and 157). By itself, the CLEAN function does not remove these additional nonprinting characters.
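For comparison, a rough Ruby sketch of what the quoted description says CLEAN() does: remove only code points 0 through 31 and leave everything else alone. The helper name excel_like_clean is hypothetical, and this follows the quoted text rather than any tested Excel behavior.

```ruby
# Hypothetical helper following the quoted CLEAN() description:
# remove code points 0..31 only; 127 and above are left untouched.
def excel_like_clean(text)
  text.gsub(/[\x00-\x1F]/, '')
end

# SOH, tab, and newline are removed; the trademark sign is kept.
p excel_like_clean("\u0001Total\t\u2122\n")

# DEL (127) is outside the 0..31 range, so it is kept.
p excel_like_clean("a\u007Fb")
```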
[2] I'm emphasizing that String#strip only removes ASCII whitespace characters. Non-breaking spaces, for instance, are not removed by String#strip:
# 0xa0 = 160 = non-breaking space
# 0x20 = 32 = regular ASCII space
irb(main):001:0> "\u{a0}hello\u{20}".strip
=> " hello"
from roo.
In my opinion, the :clean option should match the Office description of CLEAN().
I also think it should remove Unicode spaces in addition to ASCII spaces. When would keeping Unicode spaces be a desired result?
Something like this would work. I'll turn it into a pull request.
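def sanitize_value(v)
  # Remove control characters anywhere, plus leading and trailing Unicode
  # whitespace. \A and \z anchor to the whole string; ^ and $ would also
  # match at every line break inside the value.
  v.gsub(/[[:cntrl:]]|\A[\p{Space}]+|[\p{Space}]+\z/, '')
end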
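One detail worth checking in a pattern like this is the choice of anchors. In Ruby regular expressions, ^ and $ match at the start and end of every line, while \A and \z match only at the start and end of the whole string. A quick sketch comparing the two on a made-up multi-line cell value (the constant names are just for illustration):

```ruby
# Same pattern with line anchors and with whole-string anchors.
LINE_ANCHORED   = /[[:cntrl:]]|^[\p{Space}]+|[\p{Space}]+$/
STRING_ANCHORED = /[[:cntrl:]]|\A[\p{Space}]+|[\p{Space}]+\z/

# Non-breaking spaces at both ends, a line break in the middle.
value = "\u00A0first \n second\u00A0"

# Line anchors also eat the spaces on either side of the inner newline.
p value.gsub(LINE_ANCHORED, '')    # => "firstsecond"

# String anchors trim only the outer whitespace; the newline itself is
# still removed because it is a control character.
p value.gsub(STRING_ANCHORED, '')  # => "first  second"
```

Either way the non-breaking spaces at the ends are removed, which String#strip would not do.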
from roo.