String and char utility functions, regular expressions.
spago install strings
Module documentation is published on Pursuit.
License: BSD 3-Clause "New" or "Revised" License
The following three functions are identical
String.fromChar :: Char -> String
String.singleton :: Char -> String
Char.toString :: Char -> String
Do we want to keep all three of them?
Would there be any interest in a Builder API?
I'd like to have some functions from Haskell's Data.Char available, like isAscii and isSymbol.
Would a PR adding these be accepted?
Is there ever an occasion when JavaScript's String.prototype.split() returns an empty array? If not, couldn't we have split :: Pattern -> String -> NonEmptyArray String? Or another version, called split', if we want to avoid breaking changes.
Should split be upgraded to return a non-empty array, or should we implement split' :: Pattern -> String -> NonEmptyArray String? The underlying JavaScript function String.prototype.split() can return an empty array when the input to be split is an empty string (for example, when the separator is empty as well). So an implementation might look like:
split' :: Pattern -> String -> NonEmptyArray String
split' p s = fromMaybe (singleton s) (fromArray $ split p s)
I guess that the original split has the property that no elements of the resulting array are empty strings, so maybe a version as split' is preferable.
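To make the empty-array case concrete, here is a minimal JavaScript sketch of what String.prototype.split does with an empty input, plus a fallback along the lines of the split' idea above. The helper name splitNonEmpty is hypothetical and not part of this library.

```javascript
// What String.prototype.split does with an empty input string.
const splitByComma = "".split(",");  // one element: the empty string
const splitByEmpty = "".split("");   // no elements at all

// Hypothetical helper (not part of the library): fall back to a
// singleton array whenever split produces an empty result.
function splitNonEmpty(separator, s) {
  const parts = s.split(separator);
  return parts.length > 0 ? parts : [s];
}
```

So the empty result seems to need both an empty input and an empty separator; with a non-empty separator the result already has one element.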
@michaelficarra is adding functions for working with strings of Unicode code points. Normalization is an important part of supporting Unicode strings, as illustrated by a Spotify blogpost, Creative Usernames.
Looks like https://github.com/menelaos/purescript-stringutils/blob/v0.0.6/src/Data/String/Utils.purs#L221 has this function, but we could also add it here, since some functions in that repo presume ES6 support, which isn't acceptable to all PS users.
IDK if we can add this function to this lib, though, as it looks like only the MS Edge browser has str.normalize (MS docs: normalize). But most other browsers should have it (MDN: normalize).
Unsafe charAt and charCodeAt should either throw an exception or return a legal value (say Char "\0" and -1) if the index is out of bounds, so they can't cause exceptions later in the code.
If one really wants to use them without the bounds check (for efficiency?), one can define one's own foreign function.
An exception seems the better option. If this makes sense, I'll PR.
match currently has the following type:
match :: Regex -> String -> Maybe [String]
This is problematic, though, when one considers optional capturing groups:
> 'goodbye'.match(/(good)?bye/)
['goodbye', 'good']
> 'bye'.match(/(good)?bye/)
['bye', undefined]
Perhaps the function's type should be Regex -> String -> Maybe [Maybe String]. This would make the function cumbersome in the case of a pattern with no capturing groups: matching /hello/ would either give Nothing or Just([Just("hello")]).
This is an example of a dynamically typed language allowing a function to do too many things. Now we have the unenviable job of making sense of it all. :)
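As a rough illustration of the Maybe [Maybe String] shape, the JavaScript sketch below maps unmatched optional groups to null. The name matchGroups is hypothetical, null stands in for Nothing, and the regex is assumed to be non-global (with the g flag, String.prototype.match returns whole matches rather than groups).

```javascript
// Hypothetical wrapper: null stands in for Nothing, both for a failed
// match and for an optional group that did not participate.
function matchGroups(re, s) {
  const m = s.match(re);
  if (m === null) return null;
  // Copy into a plain array, replacing undefined groups with null.
  return Array.from(m, (g) => (g === undefined ? null : g));
}
```

With this, matching /(good)?bye/ against "bye" gives an array whose second element is null instead of undefined, and a pattern that does not match gives null for the whole result.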
Should we have an Enum instance for Chars? It shouldn't be too hard to make if we base it on charCode.
In my project I've had to define a splitAtTuple, but even after #69 was fixed I still use it because I don't like the Maybe in the return type: just return an empty String on one side!
I would prefer it if basically any string index was considered "valid" insofar as it would return a record instead of a Maybe, with values ranging between { before: "", after: s } and { before: s, after: "" }.
Based on JavaScript String.prototype.substring behavior, it looks like this would amount to just removing the conditional (since substrings with negative indices and indices beyond the end of the string just return the whole string or empty as appropriate).
Is this something you are willing to change again, or would it be best to introduce something like a splitAt' in this library with my suggested version?
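A minimal sketch of what the suggested total version could look like on the JavaScript side, relying only on the clamping behavior of String.prototype.substring. The function and field names here are illustrative, not the library's actual FFI.

```javascript
// Hypothetical total splitAt: substring clamps out-of-range indices,
// so no bounds check (and no Maybe) is needed.
function splitAt(i, s) {
  return { before: s.substring(0, i), after: s.substring(i) };
}
```

A negative index puts the whole string in after, and an index past the end puts the whole string in before.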
These functions are already present in purescript-stringutils, but many people, especially ones coming from JavaScript, expect them to be present in the main strings library.
Should I make a PR to fix this?
splitAt :: Int -> String -> Maybe (Array String). Looking at the FFI, it returns just([s.substring(0, i), s.substring(i)]) : nothing; so the result is always either a Just with a 2-length array or a Nothing. Wouldn't a Tuple be more appropriate?
The expression indexOf' "" 2 "ab" evaluates to Just 0, however it should be Nothing because the string "ab" doesn't have a character at index 2.
Here's the documentation for uncons:
-- | Returns a record with the first code point and the remaining code points
-- | of the string. Returns Nothing if the string is empty. Operates in
-- | constant space and time.
-- |
-- | ```purescript
-- | >>> uncons "𝐀𝐀 c 𝐀"
-- | Just { head: CodePoint 0x1D400, tail: "𝐀 c 𝐀" }
-- | >>> uncons ""
-- | Nothing
-- | ```
-- |
uncons :: String -> Maybe { head :: CodePoint, tail :: String }
I would have expected this to be O(n), because you need to copy almost the entirety of the argument string into the tail field of the result. We should probably verify this before changing it.
Similar to purescript-contrib/purescript-parsing#92
The dotAll flag doesn't seem to be exposed in the Regex module. Could we add it?
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/dotAll
Current Data.String.Unsafe.char :: String -> Char implementation,
exports.char = function (s) {
if (s.length === 1) return s.charAt(0);
throw new Error("Data.String.Unsafe.char: Expected string of length 1.");
};
https://github.com/purescript/purescript-strings/blob/v4.0.1/src/Data/String/Unsafe.js#L10-L13
fails to handle single-character strings whose character lies outside the Basic Multilingual Plane.
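For context, a short JavaScript sketch of why a length check of exactly 1 rejects such inputs. The character U+1D400 is picked here purely for illustration; it is written as its surrogate pair so the snippet does not depend on source encoding.

```javascript
const bmpChar = "a";
const astralChar = "\uD835\uDC00"; // U+1D400, one code point, two code units

// .length counts UTF-16 code units, not code points.
const bmpLength = bmpChar.length;
const astralLength = astralChar.length;

// Iterating a string (as Array.from does) walks code points instead.
const astralCodePoints = Array.from(astralChar).length;
```

Here bmpLength is 1, astralLength is 2, and astralCodePoints is 1, so the s.length === 1 guard throws for the second string even though it contains a single code point.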
Currently, the function
regex :: String -> RegexFlags -> Regex
fails with a runtime error if the Regex has a syntax error:
Uncaught SyntaxError: Invalid regular expression: /+/: Nothing to repeat
or if one of the flags is not supported ('y' is not supported in Chrome, for example):
Uncaught SyntaxError: Invalid flags supplied to RegExp constructor 'y'
I'm not sure if the design goal of PureScript libraries is to avoid runtime errors at all costs, but if this is the case, I would suggest modifying the type signature to return Maybe Regex or Either Error Regex. Of course, most of the time a Regex will be hard-coded and this might not be such a big issue.
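One possible shape for the FFI side of an Either-returning constructor, sketched in JavaScript. The name safeRegex and the { ok, value, error } result shape are assumptions made for this example, not the library's API.

```javascript
// Hypothetical safe constructor: catch the SyntaxError thrown by the
// RegExp constructor and return it as a value instead.
function safeRegex(source, flags) {
  try {
    return { ok: true, value: new RegExp(source, flags) };
  } catch (e) {
    return { ok: false, error: e };
  }
}
```

Both an invalid pattern such as "+" and an unsupported flag end up in the error branch, which a PureScript wrapper could map to Left.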
(found this bug while playing with FlareCheck)
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match
States that if there is a match, the first element of the array is the entire match, in which case there is at least 1 element. Should we make this return Maybe (NonEmptyArray (Maybe String))?
Hard-coded regexes are so incredibly common, and there is no sensible thing to do other than crash when you got them wrong (except compiler error/warning?), so I think this short-hand is very useful?
The js code for Data.String.Unsafe.char looks like this:
exports.char = function (s) {
if (s.length !== 1) return s.charAt(0);
throw new Error("Data.String.Unsafe.char: Expected string of length 1.");
};
I'm pretty sure that on line 21 that !== is meant to be an ===; atm the function's bugging out for good inputs & vice versa. Should be:
if (s.length === 1) return s.charAt(0);
How do people feel about changing indexOf etc to return Maybe Int?
For backward-compatibility, String.split' and String.Regex.split' should be defined returning NonEmptyArrays.
The Pursuit docs for singleton show a codePointFromInt function which was removed from the API a while back, and in fact the docs are properly updated in the source code: https://github.com/purescript/purescript-strings/blob/master/src/Data/String/CodePoints.purs#L84
Currently, Data.String.CodeUnits.slice returns Nothing on slices from the final index to the final index, e.g.
slice 4 4 "test" == Nothing
I find this behavior to be undesirable. It breaks the property that, for any i from 0 through length s, we have s == slice 0 i s <> slice i (length s). Instead, this only holds up to length s - 1.
Motivating example: I was dealing with a string containing two parts separated by a delimiter, like prefix:::suffix. I had a function to separate the two parts, getParts :: String -> Maybe (String /\ String), which was implemented via indexOf and split. With the current split implementation, this function has that getParts ":::suffix" == Just ("" /\ "suffix") but getParts "prefix:::" == Nothing, which is surprising.
Happy to make an MR if it's decided that this is reasonable. I suppose it's technically a backwards-incompatible change, though.
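For comparison, JavaScript's own String.prototype.slice already satisfies the decomposition property at every index, including the final one. The sketch below checks it; the helper name decompositionHolds is made up for this example, and it says nothing about how the PureScript slice is implemented.

```javascript
// In JavaScript, slicing from the final index to the final index
// yields the empty string rather than failing.
const endSlice = "test".slice(4, 4);

// Check that str === str.slice(0, i) + str.slice(i, str.length)
// for every i from 0 through str.length inclusive.
function decompositionHolds(str) {
  for (let i = 0; i <= str.length; i++) {
    if (str.slice(0, i) + str.slice(i, str.length) !== str) return false;
  }
  return true;
}
```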
Split takes the separator as the first argument, but the documentation says it's the second argument.
fromCharCode should return Nothing if the code is outside the Basic Multilingual Plane Char range, right?
purescript-strings/src/Data/Char.purs
Line 16 in 157e372
>>> show $ fromCharCode 65900
(Just 'Ŭ')
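The result above is consistent with how JavaScript's String.fromCharCode truncates its argument to 16 bits. A small sketch, with a hypothetical checked variant (fromCharCodeChecked is not a library function, and null stands in for Nothing):

```javascript
// String.fromCharCode keeps only the low 16 bits of its argument:
// 65900 - 65536 = 364 = 0x16C.
const wrapped = String.fromCharCode(65900);

// Hypothetical checked variant: reject anything outside 0..0xFFFF.
function fromCharCodeChecked(n) {
  return Number.isInteger(n) && n >= 0 && n <= 0xffff
    ? String.fromCharCode(n)
    : null;
}
```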
The Bounded instance for Char says that "Characters fall within the Unicode range," but the Char documentation says "guaranteed to contain one code unit."
Since the String type comes from the Prim module, if we assume that Monoid should stay in 'userland', the only option to avoid having an orphan instance for Monoid String would be to move it into purescript-monoid.
I'm happy to put PRs together for this if you agree.
Specifically, we currently have:
take (-2) "abcdef" == ""
drop (-2) "abcdef" == "ef"
In my opinion, the take behavior is fine (same result as for take 0) while the drop behavior seems weird.
I would suggest replacing the current implementation
exports.drop = function (n) {
return function (s) {
return s.substr(n);
};
};
with
exports.drop = function (n) {
return function (s) {
return s.substring(n);
};
};
In this case, we would get:
drop (-2) "abcdef" == drop 0 "abcdef" == "abcdef"
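The difference comes down to how the two JavaScript methods treat a negative start index, as this short sketch shows (the variable names are only for illustration):

```javascript
// substr: a negative start counts back from the end of the string.
const viaSubstr = "abcdef".substr(-2);

// substring: a negative start is clamped to 0.
const viaSubstring = "abcdef".substring(-2);

// For non-negative indices within the string the two agree.
const agreeOnPositive = "abcdef".substr(2) === "abcdef".substring(2);
```

Here viaSubstr is "ef" and viaSubstring is "abcdef", matching the two drop results quoted above.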
Should the flags be exposed as a record or as some canonicalised string? Either way, should there be a function for converting between the two formats?
edit: JS RegExp flags can be retrieved as a string like this:
function flags(re) {
var s = '' + re;
return s.slice(s.lastIndexOf('/') + 1);
}
or as a record like this:
function flags(re) {
return {
multiline: re.multiline,
ignoreCase: re.ignoreCase,
global: re.global,
sticky: !!re.sticky,
unicode: !!re.unicode
};
}
With the next release of breaking changes to this library, the "count" function should be given a better name. My expectation of "count" function in context of a string is to tell me how many characters are in that string, but this "count" function actually counts the prefix characters which satisfy a predicate.
One suggested name is "countPrefix".
Issue originally raised in #79 (comment)
It shouldn't be exported. But there are no Char literals, so to construct a Char you need to use one of
fromCharCode 99
fromMaybe (fromCharCode 0) with (charAt 0 "c") or (head $ toCharArray "c") or similar
Data.String.Unsafe.charAt 0 "c"
Lacking character literals, I propose to hide the Char constructor and instead provide Data.String.Unsafe.char :: String -> Char. I can PR if this makes sense.
The count function is a mistake: it should not have been exported and its name is misleading.
Better to deprecate it and afterwards unexport it.
Useful for HTTP headers.
In the JavaScript API, the fromIndex parameter in str.lastIndexOf(pattern [, fromIndex]), perhaps surprisingly, represents the index at which to stop searching (if we imagine the search proceeds from left-to-right). This means that if you provide a fromIndex which is greater than or equal to the string's length, in JS, it's equivalent to searching the whole string, i.e. not specifying that parameter at all. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/lastIndexOf
However, in this library, the bounds check in lastIndexOf' means that e.g. lastIndexOf' (Pattern "a") 3 "aa" comes out as Nothing. I'd suggest getting rid of this check entirely, so that this matches better with the JS behaviour. This is a breaking change unfortunately.
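The JS behaviour described above can be checked directly. The values below are plain String.prototype.lastIndexOf calls; the variable names are only for illustration.

```javascript
// fromIndex past the end behaves like searching the whole string.
const pastEnd = "aa".lastIndexOf("a", 3);
const noIndex = "aa".lastIndexOf("a");

// fromIndex is the highest position at which a match may start.
const limited = "abcabc".lastIndexOf("c", 3);
const atZero = "aa".lastIndexOf("a", 0);
```

Both pastEnd and noIndex are 1, limited is 2 (the "c" at index 5 starts after the limit), and atZero is 0.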
Incidentally there are one or two other breaking changes I'd like to do (#81, #78, and also renaming Data.String.CodePoints to Data.String and renaming Data.String to Data.String.CodeUnits) so maybe we could do all of these the next time we break everything (i.e. when we release 0.12?)
https://www.stackage.org/haddock/lts-8.21/text-1.2.2.1/Data-Text.html#g:9
justifyRight :: Int -> Char -> String -> String
justifyLeft :: Int -> Char -> String -> String
center :: Int -> Char -> String -> String
Let me know if this is a desired addition.
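As a rough idea of the intended behaviour, here is a JavaScript sketch of the three functions. It assumes padChar is a single code unit, measures width in code units, and puts the extra padding character on the left when centering an odd remainder, which is a choice made for this sketch.

```javascript
// Pad on the right so the text sits at the left edge.
function justifyLeft(width, padChar, s) {
  return s + padChar.repeat(Math.max(0, width - s.length));
}

// Pad on the left so the text sits at the right edge.
function justifyRight(width, padChar, s) {
  return padChar.repeat(Math.max(0, width - s.length)) + s;
}

// Pad on both sides; any odd leftover character goes on the left.
function center(width, padChar, s) {
  const total = Math.max(0, width - s.length);
  const right = Math.floor(total / 2);
  const left = total - right;
  return padChar.repeat(left) + s + padChar.repeat(right);
}
```

For example, justifyLeft(7, "x", "foo") gives "fooxxxx" and center(8, "x", "HS") gives "xxxHSxxx"; strings already at least as wide as the target are returned unchanged.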
Please check if variable s is not null before calling s.length
Currently there seems to be no way to iterate over all matches including capturing groups: https://pursuit.purescript.org/packages/purescript-strings/4.0.1/docs/Data.String.Regex
However, JavaScript has a function which lets you do just this: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/matchAll
Can we expose this in the regex module?
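For reference, a small JavaScript sketch of what matchAll provides. The pattern, the input, and the result shape are made up for this example; matchAll requires the g flag and an ES2020-capable runtime.

```javascript
const pattern = /(\d+)-(\w+)/g;

// Collect every match together with its capturing groups and position.
const matches = Array.from("1-a 22-bb".matchAll(pattern), (m) => ({
  whole: m[0],
  groups: m.slice(1),
  index: m.index,
}));
```

Each entry carries the full match, the captured groups, and the start index, which is the information a PureScript binding would need to surface.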
Try it here http://sharkdp.github.io/purescript-strings/
Some useful features appeared since last one.
working on this now
@michaelficarra originally suggested this and I agree; I think Data.String.CodePoints should really be the default. Unless you're certain you won't be working with anything outside the Basic Multilingual Plane, and you've identified string manipulations as a performance bottleneck, you should really be using the functions in Data.String.CodePoints.
For the functions whose type signatures are the same across both modules, like length :: String -> Int, this has the potential to be quite problematic, so I think we need to be quite careful about it. I'd suggest the following:
Add a new module Data.String.CodeUnits, with the exact same exports as the current Data.String.
Add a notice to Data.String, detailing that the functions within currently operate on code units, not code points; that this will change in the next breaking release; and that you should very probably be using Data.String.CodePoints instead (unless you are sure you want to operate on code units, in which case you can use Data.String.CodeUnits).
In the next breaking release, change Data.String so that it re-exports everything from Data.String.CodePoints.
Deprecate the Data.String.CodePoints module, for removal in a subsequent breaking release?
There is a trim, but no trimStart/trimEnd (trimLeft/trimRight).
Should we really have to implement it each time with countPrefix or a regex?
What do you think about renaming the singleton function in Data.String.CodePoints to fromCodePoint?
I believe this would increase discoverability for people coming from JavaScript and also be closer to fromCodePointArray that is defined in the same module.
In the JavaScript slice function the stop parameter is optional; omitting it makes slice run until the end of the string.
This doesn't seem to be possible in the PureScript version, since both parameters are mandatory.
There doesn't seem to be a slice version taking only one index argument.
Errors with replace where I mix up the arguments occur surprisingly often.
What do you think, is it worth adding newtype wrappers for the replacer/replacee parts?
And in general something like Hay and Needle, SplitBy and other stuff?
If a regex uses the 'g' flag, it saves state between calls to test, which violates referential transparency.
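The statefulness is easy to reproduce in plain JavaScript: a global regex stores its position in lastIndex, so repeating the same call gives different answers. The variable names below are only for illustration.

```javascript
const re = /a/g;

const first = re.test("a");         // matches at index 0, lastIndex becomes 1
const afterFirst = re.lastIndex;
const second = re.test("a");        // search starts at index 1 and fails
const afterSecond = re.lastIndex;   // a failed match resets lastIndex to 0
const third = re.test("a");         // matches again
```

The same expression re.test("a") evaluates to true, then false, then true.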
Using version 4.0.1, this fails:
import Data.String
instance myCodePointShow :: Show CodePoint where
show (CodePoint i) = "CodePoint: " <> show i
with "Unknown data constructor CodePoint".
I got the impression that the CodePoint type constructor is not exported, so I tried using codePointToInt but there's no such function anywhere in the module. I've seen there was such a function in 3.5.0.
How to proceed? What am I missing?
It seems to me that these lines in Data.String.CodePoints.uncons
purescript-strings/src/Data/String/CodePoints.purs
Lines 197 to 198 in 157e372
are first slicing the first code unit into a Char string with the JavaScript charAt method and then converting the Char string to a CodePoint by the boundedEnumChar instance fromEnum method, which calls the JavaScript charCodeAt method.
We could skip the intermediate string slice of the charAt method and call charCodeAt directly.