purescript / purescript-strings


String utility functions, Char type, regular expressions.

License: BSD 3-Clause "New" or "Revised" License

JavaScript 6.78% PureScript 93.22%

purescript-strings's Introduction

purescript-strings

String and char utility functions, regular expressions.

Installation

spago install strings

Documentation

Module documentation is published on Pursuit.

purescript-strings's People

Contributors

joneshf michaelficarra jdegoes fresheyeball davidchambers nightra jacereda caryoscelus brainrake relrod flipstone anttih tfausak kika sharkdp cdepillabout telser lua-purescript pavlosgi paulyoung risto-stevcev leighman rightfold pure11 purerl negator rajeshkumar9t6 monoidmusician bartadv ahstro notgiorgi kedashoe csicar matthewleon safareli themattchan eskoniiranen carstenkoenig alexadewit bouzuya restaumatic abaco krisajenkins chexxor dilipzotha rintcius kl0tl strax growthagent jordanmartinez fujisawa jmatsushita pure-c triallax wclr k4m4 sigma-andex kab0a1 gasi seanpm2001 postsolar purescm pete-murphy unisay

purescript-strings's Issues

Remove some of the following: fromChar, singleton, toString

The following three functions are identical

String.fromChar :: Char -> String
String.singleton :: Char -> String
Char.toString :: Char -> String

Do we want to keep all three of them?

Builder

Would there be any interest in a Builder API?

add 'isAscii' and 'isSymbol' from Haskell's Data.Char

I'd like to have some functions from Haskell's Data.Char available, like isAscii and isSymbol.

Would a PR adding these be accepted?

`split` should return a NonEmptyArray?

Is there ever an occasion when JavaScript's String.prototype.split() returns an empty array? If not, couldn't we have split :: Pattern -> String -> NonEmptyArray String? Or another version, called split', if we want to avoid breaking changes.

Should split be upgraded to return a non-empty array, or should we implement split' :: Pattern -> String -> NonEmptyArray String? The underlying JavaScript function String.prototype.split() returns an empty array only when both the input and the separator are empty strings. So an implementation might look like:

split' :: Pattern -> String -> NonEmptyArray String
split' p s = fromMaybe (singleton s) (fromArray $ split p s)

I guess that the original split has the property that no elements of the resulting array are empty strings, so maybe a version as split' is preferable.
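
To make the edge case concrete, here is a minimal JavaScript sketch of what the FFI side of such a split' could look like. The name splitNonEmpty is hypothetical and not part of the library. For string separators, split seems to yield an empty array only when both the input and the separator are empty, so the fallback is exercised in just that one case.

```javascript
// Hypothetical helper (not in purescript-strings): always returns an array
// with at least one element.
function splitNonEmpty(sep, s) {
  var parts = s.split(sep);
  // "".split("") is []; fall back to a one-element array in that case.
  return parts.length === 0 ? [s] : parts;
}

console.log(splitNonEmpty(",", "a,b")); // [ 'a', 'b' ]
console.log(splitNonEmpty(",", ""));    // [ '' ]
console.log(splitNonEmpty("", ""));     // [ '' ]
```

Note that empty strings can still appear as elements, for example when the input contains two adjacent separators.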

Add "normalize" function for Unicode

@michaelficarra is adding functions for working with strings of Unicode code points. Normalization is an important part of supporting Unicode strings, as illustrated by a Spotify blogpost, Creative Usernames.

Looks like https://github.com/menelaos/purescript-stringutils/blob/v0.0.6/src/Data/String/Utils.purs#L221 has this function, but we can also add it here, as some functions in that repo presume ES6 support, which isn't acceptable to all PS users.

IDK if we can add this function to this lib, though, as it looks like only MS Edge browser has str.normalize - MS docs: normalize. But most other browsers should have it - MDN: normalize
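
For anyone unfamiliar with what normalization buys us, here is a small JavaScript sketch. The wrapper name normalizeNFC is made up for illustration, and returning the input unchanged on runtimes without String.prototype.normalize is an assumption of this sketch, not a proposal for the library's behaviour.

```javascript
// Hypothetical wrapper: normalizes to NFC when the runtime supports it,
// otherwise returns the input unchanged (an assumption of this sketch).
function normalizeNFC(s) {
  return typeof s.normalize === "function" ? s.normalize("NFC") : s;
}

var decomposed = "e\u0301"; // "e" followed by a combining acute accent
var composed = "\u00e9";    // the precomposed character

console.log(decomposed === composed);               // false
console.log(normalizeNFC(decomposed) === composed); // true where normalize exists
```

The two strings render identically but compare unequal until they are normalized, which is the kind of problem the username example describes.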

Data.String.Unsafe is very unsafe

Unsafe charAt and charCodeAt should either throw an exception or return a legal value (say Char "\0" and -1) if the index is out of bounds, so they won't be able to cause exceptions later in the code.
If one really wants to use it without the bounds check (for efficiency?), one can define one's own foreign function.
An exception seems the better option. If this makes sense, I'll PR.
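
A rough sketch of the throwing variant, written as a plain uncurried JavaScript function. The name charAtOrThrow and the error message are placeholders, not what the library uses.

```javascript
// Sketch: bounds-checked charAt that throws instead of silently returning "".
function charAtOrThrow(i, s) {
  if (i >= 0 && i < s.length) return s.charAt(i);
  throw new Error(
    "charAt: index " + i + " out of bounds for string of length " + s.length
  );
}

console.log(charAtOrThrow(1, "abc")); // b
```

Failing at the call site keeps an illegal Char value (an empty string at runtime) from leaking into later code.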

consider optional capturing groups in `match`

match currently has the following type:

match :: Regex -> String -> Maybe [String]

This is problematic, though, when one considers optional capturing groups:

> 'goodbye'.match(/(good)?bye/)
['goodbye', 'good']
> 'bye'.match(/(good)?bye/)
['bye', undefined]

Perhaps the function's type should be Regex -> String -> Maybe [Maybe String]. This would make the function cumbersome in the case of a pattern with no capturing groups: matching /hello/ would either give Nothing or Just([Just("hello")]).

This is an example of a dynamically typed language allowing a function to do too many things. Now we have the unenviable job of making sense of it all. :)
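
One possible shape for the FFI side, sketched in JavaScript under the assumption that the regex is not global. The helper name matchGroups is hypothetical, and null stands in for Nothing at both levels.

```javascript
// Hypothetical helper: null means "no match"; otherwise an array in which
// groups that did not participate in the match are null.
function matchGroups(re, s) {
  var m = s.match(re);
  if (m === null) return null;
  return Array.prototype.map.call(m, function (g) {
    return g === undefined ? null : g;
  });
}

console.log(matchGroups(/(good)?bye/, "goodbye")); // [ 'goodbye', 'good' ]
console.log(matchGroups(/(good)?bye/, "bye"));     // [ 'bye', null ]
console.log(matchGroups(/(good)?bye/, "hello"));   // null
```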

Char Enum

Should we have an Enum instance for Chars? It shouldn't be too hard to make if we base it on charCode.

splitAt without Maybe?

In my project I've had to define a splitAtTuple but even after #69 was fixed I still use it because I don't like the Maybe in the return type: just return an empty String on one side!

I would prefer it if basically any string index was considered "valid" insofar as it would return a record instead of a Maybe, with values ranging between { before: "", after: s } and { before: s, after: "" }.

Based on JavaScript String.prototype.substring behavior, it looks like this would amount to just removing the conditional (since substrings with negative indices and indices beyond the end of the string just return the whole string or empty as appropriate).

Is this something you are willing to change again, or would it be best to introduce like a splitAt' in this library with my suggested version?
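
Roughly what I have in mind, as an uncurried JavaScript sketch. The name splitAtTotal is made up, and it works on code units because that is what substring does.

```javascript
// Sketch: a total splitAt that relies on substring clamping its arguments.
function splitAtTotal(i, s) {
  return { before: s.substring(0, i), after: s.substring(i) };
}

console.log(splitAtTotal(2, "abc"));  // { before: 'ab', after: 'c' }
console.log(splitAtTotal(-1, "abc")); // { before: '', after: 'abc' }
console.log(splitAtTotal(10, "abc")); // { before: 'abc', after: '' }
```

For every index, before + after gives back the original string.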

Add startsWith and endsWith

These functions are already present in purescript-stringutils but many people, especially ones coming from JavaScript, expect them to be present in the main strings library.

Should I make a PR to fix this?

Why does splitAt return an array instead of a Tuple?

splitAt :: Int -> String -> Maybe (Array String). Looking at the FFI, it returns just([s.substring(0, i), s.substring(i)]) : nothing;, always either a Just with a 2-length array or a Nothing. Wouldn't a Tuple be more appropriate?

`indexOf'` gives incorrect result for empty prefix when starting index is out of bounds

The expression indexOf' "" 2 "ab" evaluates to Just 0, however it should be Nothing because the string "ab" doesn't have a character at index 2.

Data.String.CodePoints.uncons probably isn't constant-time

Here's the documentation for uncons:

-- | Returns a record with the first code point and the remaining code points
-- | of the string. Returns Nothing if the string is empty. Operates in
-- | constant space and time.
-- |
-- | ```purescript
-- | >>> uncons "𝐀𝐀 c 𝐀"
-- | Just { head: CodePoint 0x1D400, tail: "𝐀 c 𝐀" }
-- | >>> uncons ""
-- | Nothing
-- | ```
-- |
uncons :: String -> Maybe { head :: CodePoint, tail :: String }

I would have expected this to be O(n), because you need to copy almost the entirety of the argument string into the tail field of the result. We should probably verify this before changing it.

stripPrefix performs full linear search in order to check if index is 0

Add support for dotAll flag?

The dotAll flag doesn't seem to be exposed in the Regex module. Could we add it?

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/dotAll
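
For reference, a tiny JavaScript sketch of what the flag changes. The helper name matchesAcrossNewline is just for illustration, and it assumes a runtime that supports the "s" flag (ES2018 or later).

```javascript
// Builds /a.b/ with the given flags and tests it against a string in which
// the middle character is a newline.
function matchesAcrossNewline(flags) {
  return new RegExp("a.b", flags).test("a\nb");
}

console.log(matchesAcrossNewline(""));  // false: "." does not match "\n"
console.log(matchesAcrossNewline("s")); // true: dotAll lets "." match "\n"
```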

Ensure all remaining functions are safe/have unsafe versions as needed

Data.String.Unsafe.char is not Unicode-aware

Current Data.String.Unsafe.char :: String -> Char implementation,

exports.char = function (s) {
  if (s.length === 1) return s.charAt(0);
  throw new Error("Data.String.Unsafe.char: Expected string of length 1.");
};

https://github.com/purescript/purescript-strings/blob/v4.0.1/src/Data/String/Unsafe.js#L10-L13

fails to handle strings like "𝐀".

Handle Regex syntax errors?

Currently, the function

regex :: String -> RegexFlags -> Regex

fails with a runtime error if the Regex has a syntax error:

Uncaught SyntaxError: Invalid regular expression: /+/: Nothing to repeat

or if one of the flags is not supported ('y' is not supported in Chrome, for example):

Uncaught SyntaxError: Invalid flags supplied to RegExp constructor 'y'

I'm not sure if the design goal of PureScript libraries is to avoid runtime errors at all costs, but if this is the case, I would suggest modifying the type signature to return Maybe Regex or Either Error Regex. Of course, most of the time a Regex will be hard-coded and this might not be such a big issue.

(found this bug while playing with FlareCheck 😄)
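
A sketch of how the foreign side could catch the error so that the PureScript side can return Either. The function name tryRegex and the result record shape are placeholders for this example.

```javascript
// Sketch: wrap the RegExp constructor so syntax errors become values.
function tryRegex(source, flags) {
  try {
    return { ok: true, value: new RegExp(source, flags) };
  } catch (e) {
    return { ok: false, error: e.message };
  }
}

console.log(tryRegex("a+", "g").ok); // true
console.log(tryRegex("+", "").ok);   // false: nothing to repeat
console.log(tryRegex("a", "q").ok);  // false: invalid flag
```

The ok field would map to Right or Left on the PureScript side.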

Regex match should give NonEmptyArray

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match

The MDN page states that if there is a match, the first element of the array is the entire match, in which case there is at least one element. Should we make this return Maybe NonEmptyArray (Maybe String)?

Shorthand: unsafeRegex :: String -> RegexFlags -> Regex

Hard-coded regexes are so incredibly common, and there is no sensible thing to do other than crash when you got them wrong (except compiler error/warning? 😸), so I think this short-hand is very useful?

Bug in Data.String.Unsafe.char

The js code for Data.String.Unsafe.char looks like this:

exports.char = function (s) {
  if (s.length !== 1) return s.charAt(0);
  throw new Error("Data.String.Unsafe.char: Expected string of length 1.");
};

I'm pretty sure that the !== on line 21 is meant to be ===; at the moment the function throws for good inputs and returns a value for bad ones. It should be:

if (s.length === 1) return s.charAt(0);

Using -1 to mean "not found"

How do people feel about changing indexOf etc to return Maybe Int?

Split functions should return NonEmptyArray

For backward-compatibility, String.split' and String.Regex.split' should be defined returning NonEmptyArrays.

Docs are not up to date on Pursuit

The pursuit docs for singleton show a codePointFromInt function which has been removed from the API a while back, and in fact the docs are properly updated in the source code: https://github.com/purescript/purescript-strings/blob/master/src/Data/String/CodePoints.purs#L84

Suggestion: allow `slice (length s) (length s)` on Strings

Currently, Data.String.CodeUnits.slice returns Nothing on slices from the final index to the final index, e.g.

slice 4 4 "test" == Nothing

I find this behavior to be undesirable. It:

- Means that "out of bounds" has different meanings for the two bounds, which I personally find to be unintuitive.
- Nullifies the invariant that for each i from 0 through length s, we have s == slice 0 i s <> slice i (length s). Instead, this only holds up to length s - 1.

Motivating example: I was dealing with a string containing two parts separated by a delimiter, like prefix:::suffix. I had a function to separate the two parts, getParts :: String -> Maybe (String /\ String), which was implemented via indexOf and split. With the current split implementation, getParts ":::suffix" == Just ("" /\ "suffix") but getParts "prefix:::" == Nothing, which is surprising.

Happy to make an MR if it's decided that this is reasonable. I suppose it's technically a backward-incompatible change, though.

Documentation of split is misleading

Split takes the separator as the first argument, but the documentation says it's the second argument.

fromCharCode BMP

fromCharCode should return Nothing if the code is out of the Basic Multilingual Plane Char range, right?

purescript-strings/src/Data/Char.purs

Line 16 in 157e372

fromCharCode = toEnum

>>> show $ fromCharCode 65900

(Just 'Ŭ')

The Bounded instance for Char says that “Characters fall within the Unicode range,” but the Char documentation says “guaranteed to contain one code unit.”
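
The wraparound comes from JavaScript itself, since String.fromCharCode keeps only the low 16 bits of its argument. Here is a sketch of a range-checked version. The name fromCharCodeChecked is hypothetical, and null stands in for Nothing.

```javascript
// Sketch: reject anything that does not fit in a single UTF-16 code unit.
function fromCharCodeChecked(n) {
  if (Number.isInteger(n) && n >= 0 && n <= 0xFFFF) {
    return String.fromCharCode(n);
  }
  return null;
}

console.log(String.fromCharCode(65900) === "\u016C"); // true: 65900 wraps to 364
console.log(fromCharCodeChecked(65900));              // null
console.log(fromCharCodeChecked(99));                 // c
```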

Consider moving orphan instance monoidString into purescript-monoid?

Since the String type comes from the Prim module, if we assume that Monoid should stay in 'userland', the only option to avoid having an orphan instance for Monoid String would be to move it into purescript-monoid.

I'm happy to put PRs together for this if you agree.

Negative arguments for take and drop are handled inconsistently

Specifically, we currently have:

take (-2) "abcdef" == ""
drop (-2) "abcdef" == "ef"

In my opinion, the take behavior is fine (same result as for take 0) while the drop behavior seems weird.

I would suggest replacing the current implementation

exports.drop = function (n) {
  return function (s) {
    return s.substr(n);
  };
};

with

exports.drop = function (n) {
  return function (s) {
    return s.substring(n);
  };
};

In this case, we would get:

drop (-2) "abcdef" == drop 0 "abcdef" == "abcdef"

add functions for retrieving a Regex's source and flags

Should the flags be exposed as a record or as some canonicalised string? Either way, should there be a function for converting between the two formats?

edit: JS RegExp flags can be retrieved as a string like this:

function flags(re) {
  var s = '' + re;
  return s.slice(s.lastIndexOf('/') + 1);
}

or as a record like this:

function flags(re) {
  return {
    multiline: re.multiline,
    ignoreCase: re.ignoreCase,
    global: re.global,
    sticky: !!re.sticky,
    unicode: !!re.unicode
  };
}

Rename "count" to "countPrefix"

With the next release of breaking changes to this library, the "count" function should be given a better name. My expectation of "count" function in context of a string is to tell me how many characters are in that string, but this "count" function actually counts the prefix characters which satisfy a predicate.

One suggested name is "countPrefix".

Issue originally raised in #79 (comment)
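
To spell out the behaviour the new name is meant to describe, here is an uncurried JavaScript sketch. It is an illustration only and not the library's implementation.

```javascript
// Sketch: number of leading characters (code units) that satisfy pred.
function countPrefix(pred, s) {
  var i = 0;
  while (i < s.length && pred(s.charAt(i))) i++;
  return i;
}

var isSpace = function (c) { return c === " "; };
console.log(countPrefix(isSpace, "  ab ")); // 2, the trailing space is not counted
console.log(countPrefix(isSpace, "ab"));    // 0
```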

NonEmptyString should convert to/from NonEmptyArray

Char constructor is unsafe

It shouldn't be exported. But there are no Char literals, so to construct a Char you need to use one of

- fromCharCode 99
- fromMaybe (fromCharCode 0) with (charAt 0 "c") or (head $ toCharArray "c") or similar
- Data.String.Unsafe.charAt 0 "c"

all of which are somewhat awkward. The first two are safe but obfuscated, the third is unsafe, but at least marked as such.

Lacking character literals, I propose to hide the Char constructor and instead provide Data.String.Unsafe.char :: String -> Char. I can PR if this makes sense.

Deprecate count

The count function is a mistake; it should not have been exported and its name is misleading.
Better deprecate it and afterwards unexport it.

Add a case insensitive newtype wrapper

Useful for HTTP headers.

lastIndexOf' with index greater than string length returns Nothing

In the JavaScript API, the fromIndex parameter in str.lastIndexOf(pattern [, fromIndex]), perhaps surprisingly, represents the index at which to stop searching (if we imagine the search proceeds from left-to-right). This means that if you provide a fromIndex which is greater than or equal to the string's length, in JS, it's equivalent to searching the whole string, i.e. not specifying that parameter at all. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/lastIndexOf

However, in this library, the bounds check in lastIndexOf' means that e.g. lastIndexOf' (Pattern "a") 3 "aa" comes out as Nothing. I'd suggest getting rid of this check entirely, so that this matches better with the JS behaviour. This is a breaking change unfortunately.
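
A sketch of the version without the bounds check, as an uncurried JavaScript function. The name lastIndexOfFrom is made up for this example, and null stands in for Nothing.

```javascript
// Sketch: defer entirely to JS, which clamps fromIndex to the string length.
function lastIndexOfFrom(pattern, fromIndex, s) {
  var i = s.lastIndexOf(pattern, fromIndex);
  return i === -1 ? null : i;
}

console.log(lastIndexOfFrom("a", 3, "aa")); // 1
console.log(lastIndexOfFrom("a", 0, "aa")); // 0
console.log(lastIndexOfFrom("b", 3, "aa")); // null
```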

Incidentally there are one or two other breaking changes I'd like to do (#81, #78, and also renaming Data.String.CodePoints to Data.String and renaming Data.String to Data.String.CodeUnits) so maybe we could do all of these the next time we break everything (i.e. when we release 0.12?)

Add Justification functions

https://www.stackage.org/haddock/lts-8.21/text-1.2.2.1/Data-Text.html#g:9

justifyRight :: Int -> Char -> String -> String
justifyLeft :: Int -> Char -> String -> String
center :: Int -> Char -> String -> String

Let me know if it's a desired addition 🎈
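
A possible implementation sketch in JavaScript, assuming the pad character is a single code unit and that padStart/padEnd are available. Which side receives the extra character in center when the padding is odd is an arbitrary choice in this sketch.

```javascript
// Sketch: pad on the right so the text sits on the left.
function justifyLeft(width, c, s) {
  return s.padEnd(width, c);
}

// Sketch: pad on the left so the text sits on the right.
function justifyRight(width, c, s) {
  return s.padStart(width, c);
}

// Sketch: pad both sides; the left side gets the extra character if odd.
function center(width, c, s) {
  var total = Math.max(0, width - s.length);
  var right = Math.floor(total / 2);
  var left = total - right;
  return c.repeat(left) + s + c.repeat(right);
}

console.log(justifyLeft(5, "x", "ab"));  // abxxx
console.log(justifyRight(5, "x", "ab")); // xxxab
console.log(center(5, "x", "ab"));       // xxabx
```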

Data.String.null

Please check if variable s is not null before calling s.length

Add support for matchAll?

Currently there seems to be no way to iterate over all matches including capturing groups: https://pursuit.purescript.org/packages/purescript-strings/4.0.1/docs/Data.String.Regex

However, JavaScript has a function which lets you do just this: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/matchAll

Can we expose this in the regex module?
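
A sketch of what the foreign function could do. The helper name matchAllGroups is hypothetical, and null stands in for groups that did not participate. Note that matchAll throws a TypeError when the regex lacks the global flag, so a wrapper would have to deal with that.

```javascript
// Sketch: collect every match, each as an array of [whole match, ...groups].
function matchAllGroups(re, s) {
  return Array.from(s.matchAll(re), function (m) {
    return Array.from(m, function (g) {
      return g === undefined ? null : g;
    });
  });
}

console.log(matchAllGroups(/([a-z])(\d)?/g, "a1b"));
// [ [ 'a1', 'a', '1' ], [ 'b', 'b', null ] ]
```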

lastIndexOf' does something odd with indices

Try it here http://sharkdp.github.io/purescript-strings/

Please make a new release (for bower)

Some useful features appeared since last one.

make Regex instance of Show

working on this now

Make Data.String.CodePoints the default

@michaelficarra originally suggested this and I agree; I think Data.String.CodePoints should really be the default. Unless you're certain you won't be working with anything outside the Basic Multilingual Plane, and you've identified string manipulations as a performance bottleneck, you should really be using the functions in Data.String.CodePoints.

For the functions whose type signatures are the same across both modules, like length :: String -> Int, this has the potential to be quite problematic, so I think we need to be quite careful about it. I'd suggest the following:

In the next breaking release:
- we create a module Data.String.CodeUnits, with the exact same exports as the current Data.String,
- we add a notice at the very top of Data.String, detailing that the functions within currently operate on code units, not code points; that this will change in the next breaking release; and that you should very probably be using Data.String.CodePoints instead (unless you are sure you want to operate on code units, in which case you can use Data.String.CodeUnits)
In the breaking release after that one:
- change Data.String so that it re-exports everything from Data.String.CodePoints
- remove the notices
- consider deprecating the Data.String.CodePoints module, for removal in a subsequent breaking release?

Add trimStart/trimEnd?

There is a trim, but no trimStart/trimEnd (trimLeft/trimRight).

Should implementing it each time with countPrefix or a regex really be the only option?
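
For comparison, this is the regex workaround written as JavaScript. The function names are placeholders. Newer runtimes also ship String.prototype.trimStart and trimEnd, which an FFI binding could call directly.

```javascript
// Sketch: strip leading whitespace only.
function trimStartSketch(s) {
  return s.replace(/^\s+/, "");
}

// Sketch: strip trailing whitespace only.
function trimEndSketch(s) {
  return s.replace(/\s+$/, "");
}

console.log(JSON.stringify(trimStartSketch("  a b  "))); // "a b  "
console.log(JSON.stringify(trimEndSketch("  a b  ")));   // "  a b"
```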

Rename `Data.String.CodePoints.singleton` to `fromCodePoint`

What do you think about renaming the singleton function in Data.String.CodePoints to fromCodePoint?
I believe this would increase discoverability for people coming from JavaScript and also be closer to fromCodePointArray that is defined in the same module.

Allow slicing from index to end of string

In the JavaScript slice function the stop parameter is optional, which makes the slice function slice until the end of the string.
This doesn't seem to be possible in the PureScript version, since both parameters are mandatory.
There doesn't seem to be a slice version taking only one index argument.

Newtypes for replace

The errors with replace where I mix up arguments occur surprisingly often.
What do you think, is it worth adding newtype wrappers for the replacer/replacee part?

And in general something like Hay and Needle, SplitBy and other stuff?

Global flag violates referential transparency

If a regex uses the 'g' flag, it saves state between calls to test which violates referential transparency.
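
A minimal JavaScript reproduction. The helper name testTwice is only for illustration.

```javascript
// Calls test twice on the same regex object and the same input.
function testTwice(re, s) {
  return [re.test(s), re.test(s)];
}

console.log(testTwice(/a/, "a"));  // [ true, true ]
console.log(testTwice(/a/g, "a")); // [ true, false ]
```

With the 'g' flag the regex object remembers lastIndex after the first successful test, so the second call starts searching after the match and fails.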

"Unknown data constructor CodePoint" (missing "codePointToInt"?)

Using version 4.0.1 this fails

import Data.String

instance myCodePointShow :: Show CodePoint where
  show (CodePoint i) = "CodePoint: " <> show i

with "Unknown data constructor CodePoint".

I got the impression that the CodePoint data constructor is not exported, so I tried using codePointToInt, but there's no such function anywhere in the module. I've seen there was such a function in 3.5.0.

How to proceed? What am I missing?

CodePoints.uncons performance optimization?

It seems to me that these lines in Data.String.CodePoints.uncons

purescript-strings/src/Data/String/CodePoints.purs

Lines 197 to 198 in 157e372

    
    cu0 = fromEnum (Unsafe.charAt 0 s)
    cu1 = fromEnum (Unsafe.charAt 1 s)

are first slicing the first code unit into a Char string with the JavaScript charAt method

purescript-strings/src/Data/String/Unsafe.js

Line 5 in 157e372

if (i >= 0 && i < s.length) return s.charAt(i);

and then converting the Char string to a CodePoint by the boundedEnumChar instance fromEnum method which calls the JavaScript charCodeAt method.

https://github.com/purescript/purescript-enums/blob/170d959644eb99e0025f4ab2e38f5f132fd85fa4/src/Data/Enum.js#L4

We could skip the intermediate string slice of the charAt method and call charCodeAt directly.
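
In JavaScript terms the idea is roughly the following. This is a sketch, not the library's code: the name headCodePoint is hypothetical and the input is assumed to be non-empty.

```javascript
// Sketch: read the first code point using charCodeAt only, with no
// intermediate one-character strings.
function headCodePoint(s) {
  var cu0 = s.charCodeAt(0);
  var cu1 = s.charCodeAt(1); // NaN when the string has a single code unit
  var isLead = cu0 >= 0xD800 && cu0 <= 0xDBFF;
  var isTrail = cu1 >= 0xDC00 && cu1 <= 0xDFFF;
  if (isLead && isTrail) {
    // Combine the surrogate pair into one code point.
    return (cu0 - 0xD800) * 0x400 + (cu1 - 0xDC00) + 0x10000;
  }
  return cu0;
}

console.log(headCodePoint("\uD835\uDC00 c").toString(16)); // 1d400
console.log(headCodePoint("abc"));                         // 97
```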
