purescript-strings's Introduction

purescript-strings

String and char utility functions, regular expressions.

Installation

spago install strings

Documentation

Module documentation is published on Pursuit.

purescript-strings's People

Contributors

bouzuya, brainrake, csicar, davidchambers, fujisawa, garyb, hdgarrood, jdegoes, joneshf, jordanmartinez, kl0tl, kritzcreek, liamgoodacre, matthewleon, menelaos, michaelficarra, monoidmusician, nightra, paf31, postsolar, quelklef, rightfold, risto-stevcev, sharkdp, tfausak, themattchan, thomashoneyman, toastal, triallax, zyla

purescript-strings's Issues

Builder

Would there be any interest in a Builder API?

`split` should return a NonEmptyArray?

Is there ever an occasion when JavaScript's String.prototype.split() returns an empty array? If not, couldn't we have split :: Pattern -> String -> NonEmptyArray String? Or another version, called split', if we want to avoid breaking changes.

Should split be upgraded to return a non-empty array, or should we implement split' :: Pattern -> String -> NonEmptyArray String? The underlying JavaScript function String.prototype.split() returns an empty array when both the input string and the separator are empty strings (with a non-empty separator, an empty input yields an array containing one empty string). So an implementation might look like:

split' :: Pattern -> String -> NonEmptyArray String
split' p s = fromMaybe (singleton s) (fromArray $ split p s)

I guess that the original split has the property that no elements of the resulting array are empty strings, so maybe a version as split' is preferable.
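
For reference, here is a quick JavaScript check of when String.prototype.split() actually yields an empty array. The helper name splitNonEmpty is hypothetical and only mirrors the split' idea above; it is a sketch, not the library's FFI.

```javascript
// Hypothetical helper mirroring split': fall back to [s] when split gives [].
function splitNonEmpty(sep, s) {
  var parts = s.split(sep);
  return parts.length === 0 ? [s] : parts;
}

console.log("".split("").length);            // 0: the only empty-array case
console.log("".split(",").length);           // 1: the array holds one empty string
console.log(splitNonEmpty("", "").length);   // 1: falls back to [""]
```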

Add "normalize" function for Unicode

@michaelficarra is adding functions for working with strings of Unicode code points. Normalization is an important part of supporting Unicode strings, as illustrated by a Spotify blogpost, Creative Usernames.

Looks like https://github.com/menelaos/purescript-stringutils/blob/v0.0.6/src/Data/String/Utils.purs#L221 has this function, but we can also add it here, as some functions in that repo presume ES6 support, which isn't acceptable to all PS users.

IDK if we can add this function to this lib, though, as the MS docs for normalize make it look like only the MS Edge browser has str.normalize. According to the MDN page for normalize, however, most other browsers should have it.
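
A minimal sketch of how an FFI helper could feature-detect normalize. The name normalizeNFC and the fall-back-to-unchanged policy are assumptions for illustration, not something the library does.

```javascript
// Sketch: normalize to NFC when the runtime supports it, otherwise return the
// input unchanged. The fallback policy is an assumption, not library behaviour.
function normalizeNFC(s) {
  return typeof s.normalize === "function" ? s.normalize("NFC") : s;
}

var decomposed = "A\u030A"; // "A" followed by COMBINING RING ABOVE
var composed = "\u00C5";    // the same character as a single code point

console.log(decomposed === composed); // false: different code unit sequences
// Where normalize is available, normalizeNFC(decomposed) equals composed.
```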

Data.String.Unsafe is very unsafe

Unsafe charAt and charCodeAt should either throw an exception or return a legal value (say Char "\0" and -1) if the index is out of bounds, so they won't be able to cause exceptions later in the code.
If one really wants to use them without the bounds check (for efficiency?), one can define one's own foreign function.
An exception seems the better option. If this makes sense, I'll PR.
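
A rough sketch of the exception-throwing option in JavaScript. The function name charAtChecked and the error message are illustrative assumptions.

```javascript
// Sketch of a bounds-checked charAt: throw at the call site instead of
// handing back a value that is not a legal Char.
function charAtChecked(i) {
  return function (s) {
    if (i >= 0 && i < s.length) return s.charAt(i);
    throw new Error("charAtChecked: index " + i + " is out of bounds.");
  };
}

console.log(charAtChecked(0)("c")); // "c"
```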

consider optional capturing groups in `match`

match currently has the following type:

match :: Regex -> String -> Maybe [String]

This is problematic, though, when one considers optional capturing groups:

> 'goodbye'.match(/(good)?bye/)
['goodbye', 'good']
> 'bye'.match(/(good)?bye/)
['bye', undefined]

Perhaps the function's type should be Regex -> String -> Maybe [Maybe String]. This would make the function cumbersome in the case of a pattern with no capturing groups: matching /hello/ would either give Nothing or Just([Just("hello")]).

This is an example of a dynamically typed language allowing a function to do too many things. Now we have the unenviable job of making sense of it all. :)
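
To make the Maybe [Maybe String] shape concrete, here is a hedged JavaScript sketch. The { just: ... } / null encoding and the name matchGroups are hypothetical stand-ins for Just and Nothing, not the library's FFI.

```javascript
// Sketch: represent each capture group as { just: value } or null ("Nothing").
function matchGroups(re, s) {
  var m = s.match(re);
  if (m === null) return null; // no match at all
  return Array.prototype.map.call(m, function (g) {
    return g === undefined ? null : { just: g };
  });
}

console.log(JSON.stringify(matchGroups(/(good)?bye/, "goodbye")));
// [{"just":"goodbye"},{"just":"good"}]
console.log(JSON.stringify(matchGroups(/(good)?bye/, "bye")));
// [{"just":"bye"},null]
```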

Char Enum

Should we have an Enum instance for Chars? It shouldn't be too hard to make if we base it on charCode.

splitAt without Maybe?

In my project I've had to define a splitAtTuple but even after #69 was fixed I still use it because I don't like the Maybe in the return type: just return an empty String on one side!

I would prefer it if basically any string index was considered "valid" insofar as it would return a record instead of a Maybe, with values ranging between { before: "", after: s } and { before: s, after: "" }.

Based on JavaScript String.prototype.substring behavior, it looks like this would amount to just removing the conditional (since substrings with negative indices and indices beyond the end of the string just return the whole string or empty as appropriate).

Is this something you are willing to change again, or would it be best to introduce like a splitAt' in this library with my suggested version?
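
A sketch of what the conditional-free version could look like on the JavaScript side, relying only on String.prototype.substring clamping its arguments. The name splitAtTotal is hypothetical.

```javascript
// Sketch: a total splitAt that never fails; out-of-range indices are clamped
// by substring itself.
function splitAtTotal(i) {
  return function (s) {
    return { before: s.substring(0, i), after: s.substring(i) };
  };
}

console.log(splitAtTotal(1)("abc"));  // before "a",   after "bc"
console.log(splitAtTotal(-1)("abc")); // before "",    after "abc"
console.log(splitAtTotal(5)("abc"));  // before "abc", after ""
```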

Add startsWith and endsWith

These functions are already present in purescript-stringutils but many people, especially ones coming from JavaScript, expect them to be present in the main strings library.

Should I make a PR to fix this?

Why does splitAt return an array instead of a Tuple?

splitAt :: Int -> String -> Maybe (Array String). Looking at the FFI, it returns just([s.substring(0, i), s.substring(i)]) : nothing;, always either a Just with a 2-length array or a Nothing. Wouldn't a Tuple be more appropriate?

Data.String.CodePoints.uncons probably isn't constant-time

Here's the documentation for uncons:

-- | Returns a record with the first code point and the remaining code points
-- | of the string. Returns Nothing if the string is empty. Operates in
-- | constant space and time.
-- |
-- | ```purescript
-- | >>> uncons "𝐀𝐀 c 𝐀"
-- | Just { head: CodePoint 0x1D400, tail: "𝐀 c 𝐀" }
-- | >>> uncons ""
-- | Nothing
-- | ```
-- |
uncons :: String -> Maybe { head :: CodePoint, tail :: String }

I would have expected this to be O(n), because you need to copy almost the entirety of the argument string into the tail field of the result. We should probably verify this before changing it.

Handle Regex syntax errors?

Currently, the function

regex :: String -> RegexFlags -> Regex

fails with a runtime error if the Regex has a syntax error:

Uncaught SyntaxError: Invalid regular expression: /+/: Nothing to repeat

or if one of the flags is not supported ('y' is not supported in Chrome, for example):

Uncaught SyntaxError: Invalid flags supplied to RegExp constructor 'y'

I'm not sure if the design goal of PureScript libraries is to avoid runtime errors at all costs, but if this is the case, I would suggest modifying the type signature to return Maybe Regex or Either Error Regex. Of course, most of the time a Regex will be hard-coded and this might not be such a big issue.
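
A minimal sketch of the Either-style variant on the JavaScript side. The { left } / { right } record shape and the name regexEither are illustrative assumptions; a real FFI would receive the Left and Right constructors as arguments.

```javascript
// Sketch: capture RegExp constructor failures instead of letting them escape.
function regexEither(source, flags) {
  try {
    return { right: new RegExp(source, flags) };
  } catch (e) {
    return { left: e.message };
  }
}

console.log("left" in regexEither("+", ""));    // true: syntax error captured
console.log("right" in regexEither("a+", "g")); // true: valid regex
```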

(found this bug while playing with FlareCheck 😄)

Bug in Data.String.Unsafe.char

The js code for Data.String.Unsafe.char looks like this:

exports.char = function (s) {
  if (s.length !== 1) return s.charAt(0);
  throw new Error("Data.String.Unsafe.char: Expected string of length 1.");
};

I'm pretty sure the !== on line 21 is meant to be an ===; atm the function throws for good inputs and returns for bad ones. Should be:

if (s.length === 1) return s.charAt(0);

Suggestion: allow `slice (length s) (length s)` on Strings

Currently, Data.String.CodeUnits.slice returns Nothing on slices from the final index to the final index, e.g.

slice 4 4 "test" == Nothing

I find this behavior to be undesirable. It:

  • Means that "out of bounds" has different meanings for the two bounds, which I personally find to be unintuitive.
  • Nullifies the invariant that for each i from 0 through length s, we have s == slice 0 i s <> slice i (length s). Instead, this only holds up to length s - 1.

Motivating example: I was dealing with a string containing two parts separated by a delimiter, like prefix:::suffix. I had a function to separate the two parts, getParts :: String -> Maybe (String /\ String), which was implemented via indexOf and split. With the current split implementation, this function gives getParts ":::suffix" == Just ("" /\ "suffix") but getParts "prefix:::" == Nothing, which is surprising.

Happy to make an MR if it's decided that this is reasonable. I suppose it's technically a reverse-incompatible change, though.
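
To illustrate the invariant, here is a hedged JavaScript sketch of a slice whose upper bound includes length s. The name sliceInclusive is hypothetical, null stands in for Nothing, and negative indices are deliberately not handled; this is not the library's current implementation.

```javascript
// Sketch: only indices outside 0..s.length (or a reversed range) fail.
function sliceInclusive(b, e, s) {
  var len = s.length;
  if (b < 0 || e < 0 || b > len || e > len || b > e) return null;
  return s.slice(b, e);
}

// The invariant s == slice 0 i s <> slice i (length s) for every i in 0..length s.
var s = "test";
for (var i = 0; i <= s.length; i++) {
  console.log(sliceInclusive(0, i, s) + sliceInclusive(i, s.length, s) === s); // true
}
```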

Negative arguments for take and drop are handled inconsistently

Specifically, we currently have:

take (-2) "abcdef" == ""
drop (-2) "abcdef" == "ef"

In my opinion, the take behavior is fine (same result as for take 0) while the drop behavior seems weird.

I would suggest replacing the current implementation

exports.drop = function (n) {
  return function (s) {
    return s.substr(n);
  };
};

with

exports.drop = function (n) {
  return function (s) {
    return s.substring(n);
  };
};

In this case, we would get:

drop (-2) "abcdef" == drop 0 "abcdef" == "abcdef"
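
The difference comes down to how substr and substring treat a negative start, which can be checked directly in JavaScript. The wrapper names dropSubstr and dropSubstring are only for this illustration.

```javascript
// substr counts a negative start from the end of the string;
// substring clamps a negative start to 0.
function dropSubstr(n) {
  return function (s) { return s.substr(n); };
}
function dropSubstring(n) {
  return function (s) { return s.substring(n); };
}

console.log(dropSubstr(-2)("abcdef"));    // "ef"
console.log(dropSubstring(-2)("abcdef")); // "abcdef"
console.log(dropSubstring(2)("abcdef"));  // "cdef"
```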

add functions for retrieving a Regex's source and flags

Should the flags be exposed as a record or as some canonicalised string? Either way, should there be a function for converting between the two formats?

edit: JS RegExp flags can be retrieved as a string like this:

function flags(re) {
  var s = '' + re;
  return s.slice(s.lastIndexOf('/') + 1);
}

or as a record like this:

function flags(re) {
  return {
    multiline: re.multiline,
    ignoreCase: re.ignoreCase,
    global: re.global,
    sticky: !!re.sticky,
    unicode: !!re.unicode
  };
}

Rename "count" to "countPrefix"

With the next release of breaking changes to this library, the "count" function should be given a better name. My expectation of a "count" function in the context of a string is that it tells me how many characters are in that string, but this "count" function actually counts the prefix characters which satisfy a predicate.

One suggested name is "countPrefix".
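
For readers unfamiliar with the behaviour being renamed, here is a small JavaScript sketch of what a countPrefix does. It works on UTF-16 code units and is an illustration only, not the library's source.

```javascript
// Sketch: count how many leading characters satisfy the predicate.
function countPrefix(pred) {
  return function (s) {
    var i = 0;
    while (i < s.length && pred(s.charAt(i))) i++;
    return i;
  };
}

var isA = function (c) { return c === "a"; };
console.log(countPrefix(isA)("aabca")); // 2, not the total number of "a"s
console.log(countPrefix(isA)("bca"));   // 0
```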

Issue originally raised in #79 (comment)

Char constructor is unsafe

It shouldn't be exported. But there are no Char literals, so to construct a Char you need to use one of

  • fromCharCode 99
  • fromMaybe (fromCharCode 0) with (charAt 0 "c") or (head $ toCharArray "c") or similar
  • Data.String.Unsafe.charAt 0 "c"

all of which are somewhat awkward. The first two are safe but obfuscated, the third is unsafe, but at least marked as such.

Lacking character literals, I propose to hide the Char constructor and instead provide Data.String.Unsafe.char :: String -> Char. I can PR if this makes sense.

Deprecate count

The count function is a mistake: it should not have been exported and its name is misleading.
Better to deprecate it and unexport it afterwards.

lastIndexOf' with index greater than string length returns Nothing

In the JavaScript API, the fromIndex parameter in str.lastIndexOf(pattern [, fromIndex]), perhaps surprisingly, represents the index at which to stop searching (if we imagine the search proceeds from left-to-right). This means that if you provide a fromIndex which is greater than or equal to the string's length, in JS, it's equivalent to searching the whole string, i.e. not specifying that parameter at all. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/lastIndexOf

However, in this library, the bounds check in lastIndexOf' means that e.g. lastIndexOf' (Pattern "a") 3 "aa" comes out as Nothing. I'd suggest getting rid of this check entirely, so that this matches better with the JS behaviour. This is a breaking change unfortunately.
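
The JavaScript behaviour described above can be checked directly. The wrapper name lastIndexOfFrom is hypothetical and null stands in for Nothing; it sketches the version without the bounds check.

```javascript
// Sketch: defer entirely to String.prototype.lastIndexOf, which clamps a
// fromIndex beyond the end of the string.
function lastIndexOfFrom(pattern, fromIndex, s) {
  var i = s.lastIndexOf(pattern, fromIndex);
  return i === -1 ? null : i;
}

console.log(lastIndexOfFrom("a", 3, "aa")); // 1
console.log(lastIndexOfFrom("a", 0, "aa")); // 0
console.log(lastIndexOfFrom("b", 3, "aa")); // null
```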

Incidentally there are one or two other breaking changes I'd like to do (#81, #78, and also renaming Data.String.CodePoints to Data.String and renaming Data.String to Data.String.CodeUnits) so maybe we could do all of these the next time we break everything (i.e. when we release 0.12?)

Data.String.null

Please check if variable s is not null before calling s.length

Make Data.String.CodePoints the default

@michaelficarra originally suggested this and I agree; I think Data.String.CodePoints should really be the default. Unless you're certain you won't be working with anything outside the Basic Multilingual Plane, and you've identified string manipulations as a performance bottleneck, you should really be using the functions in Data.String.CodePoints.

For the functions whose type signatures are the same across both modules, like length :: String -> Int, this has the potential to be quite problematic, so I think we need to be quite careful about it. I'd suggest the following:

  • In the next breaking release:
    • we create a module Data.String.CodeUnits, with the exact same exports as the current Data.String,
    • we add a notice at the very top of Data.String, detailing that the functions within currently operate on code units, not code points; that this will change in the next breaking release; and that you should very probably be using Data.String.CodePoints instead (unless you are sure you want to operate on code units, in which case you can use Data.String.CodeUnits)
  • In the breaking release after that one:
    • change Data.String so that it re-exports everything from Data.String.CodePoints
    • remove the notices
    • consider deprecating the Data.String.CodePoints module, for removal in a subsequent breaking release?
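
Why this matters for a function like length can be seen from plain JavaScript. The two helper names below are hypothetical and only illustrate the code unit versus code point distinction.

```javascript
// Code units: what the JavaScript length property counts.
function lengthCodeUnits(s) {
  return s.length;
}
// Code points: string iteration (used by Array.from) walks whole code points.
function lengthCodePoints(s) {
  return Array.from(s).length;
}

var boldA = "\uD835\uDC00"; // U+1D400, outside the Basic Multilingual Plane
console.log(lengthCodeUnits(boldA));  // 2
console.log(lengthCodePoints(boldA)); // 1
```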

Add trimStart/trimEnd?

There is a trim, but no trimStart/trimEnd (trimLeft/trimRight).

Should reimplementing it each time with countPrefix or a regex really be the only option?
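
A minimal regex-based sketch of what the FFI side could look like. These standalone functions are hypothetical; newer JavaScript runtimes also ship String.prototype.trimStart and trimEnd, which an implementation might prefer where available.

```javascript
// Sketch: strip whitespace from one end only.
function trimStart(s) {
  return s.replace(/^\s+/, "");
}
function trimEnd(s) {
  return s.replace(/\s+$/, "");
}

console.log(JSON.stringify(trimStart("  ab  "))); // "ab  "
console.log(JSON.stringify(trimEnd("  ab  ")));   // "  ab"
```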

Allow slicing from index to end of string

In the JavaScript slice function the stop parameter is optional, which makes it slice until the end of the string.
This doesn't seem to be possible in the PureScript version, since both parameters are mandatory.
There doesn't seem to be a slice version taking only one index argument.

Newtypes for replace

Errors where I mix up the arguments to replace occur surprisingly often.
What do you think, is it worth adding newtype wrappers for the replacer/replacee parts?

And in general something like Hay and Needle, SplitBy and other stuff?

"Unknown data constructor CodePoint" (missing "codePointToInt"?)

Using version 4.0.1 this fails

import Data.String

instance myCodePointShow :: Show CodePoint where
  show (CodePoint i) = "CodePoint: " <> show i

with "Unknown data constructor CodePoint".

I got the impression that the CodePoint data constructor is not exported, so I tried using codePointToInt but there's no such function anywhere in the module. I've seen there was such a function in 3.5.0.

How to proceed? What am I missing?

CodePoints.uncons performance optimization?

It seems to me that these lines in Data.String.CodePoints.uncons

cu0 = fromEnum (Unsafe.charAt 0 s)
cu1 = fromEnum (Unsafe.charAt 1 s)

are first slicing the first code unit into a Char string with the JavaScript charAt method

if (i >= 0 && i < s.length) return s.charAt(i);

and then converting that Char to an Int via the fromEnum method of the boundedEnumChar instance, which calls the JavaScript charCodeAt method.

https://github.com/purescript/purescript-enums/blob/170d959644eb99e0025f4ab2e38f5f132fd85fa4/src/Data/Enum.js#L4

We could skip the intermediate string slice of the charAt method and call charCodeAt directly.
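
A hedged JavaScript sketch of the idea: read both code units with charCodeAt and combine a surrogate pair arithmetically. The name unconsCodePoint is hypothetical, null stands in for Nothing, and this is not the library's actual implementation.

```javascript
// Sketch of uncons over UTF-16 code units, using charCodeAt only.
function unconsCodePoint(s) {
  if (s.length === 0) return null;
  var cu0 = s.charCodeAt(0);
  var cu1 = s.charCodeAt(1); // NaN when s has a single code unit
  var isLead = cu0 >= 0xD800 && cu0 <= 0xDBFF;
  var isTrail = cu1 >= 0xDC00 && cu1 <= 0xDFFF;
  if (isLead && isTrail) {
    // Combine the surrogate pair into one code point.
    return {
      head: (cu0 - 0xD800) * 0x400 + (cu1 - 0xDC00) + 0x10000,
      tail: s.slice(2)
    };
  }
  return { head: cu0, tail: s.slice(1) };
}

var r = unconsCodePoint("\uD835\uDC00 c"); // starts with U+1D400
console.log(r.head === 0x1D400, JSON.stringify(r.tail)); // true " c"
```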
