perl-uri-encode's People

Contributors

aeruder, kyzn, markstos, mithun

perl-uri-encode's Issues

The module uses $' which incurs a regex penalty

In sub encode:

# Encode a literal '%'
if ($double_encode) { $data =~ s{(\%)}{$self->_get_encoded_char($1)}gex; }
else { $data =~ s{(\%)}{$self->_encode_literal_percent($1, $')}gex; }

That use of $' makes every regex in the entire program take a performance hit, at least on Perl versions before 5.20.
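One possible way to avoid $' here, sketched on the assumption that the helper only needs to look at the text following the matched %: on Perl 5.10 and later, the /p modifier makes ${^POSTMATCH} available for that one match without using the global match variables. The encode_literal_percent and encode_percents subs below are hypothetical stand-ins for illustration, not the module's own _encode_literal_percent.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's helper (an assumption):
# keep '%' when the text after it starts with two hex digits,
# otherwise encode it as '%25'.
sub encode_literal_percent {
    my ($char, $post) = @_;
    return $post =~ /\A[0-9A-Fa-f]{2}/ ? $char : '%25';
}

sub encode_percents {
    my ($data) = @_;
    # /p exposes ${^POSTMATCH} for this substitution only, so $' is not needed
    $data =~ s{(\%)}{ encode_literal_percent($1, ${^POSTMATCH}) }gep;
    return $data;
}

print encode_percents('100% sure %41bc'), "\n";    # 100%25 sure %41bc
```

Whether this is a drop-in replacement depends on what the real helper does with its second argument, which is not shown in the issue.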

Improve uri_decode() runtime by skipping strings with no percent encodings

Currently, uri_decode() always applies its $encoded_chars regex pattern:

sub decode {
    ...omitted...
    $data =~ s{$encoded_chars}{ $self->_get_decoded_char($1) }gex;
    return $data;
}

If some input strings to uri_decode() do not have any percent encodings, it can be faster to apply the pattern to only those strings containing a % character:

sub decode {
    ...omitted...
    return $data if index($data, '%') == -1;  # <-------- ADDED
    $data =~ s{$encoded_chars}{ $self->_get_decoded_char($1) }gex;
    return $data;
}
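A self-contained sketch of the same guard, for trying the idea outside the module. The pattern in $encoded_chars and the sub name decode_with_guard are assumptions made for illustration; the module's own pattern and _get_decoded_char are not reproduced here, and this version decodes single bytes only.

```perl
use strict;
use warnings;

# Assumed stand-in pattern: '%' followed by two hex digits
my $encoded_chars = qr/%([0-9A-Fa-f]{2})/;

sub decode_with_guard {
    my ($data) = @_;
    return $data unless defined $data;
    # Cheap substring scan: skip the substitution when no '%' is present
    return $data if index($data, '%') == -1;
    # Replace each %XX triplet with the byte it encodes
    $data =~ s{$encoded_chars}{ chr(hex($1)) }ge;
    return $data;
}

print decode_with_guard('plain/path'), "\n";    # plain/path
print decode_with_guard('a%20b%2Fc'),  "\n";    # a b/c
```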

I wrote a script (test.zip) to measure the overhead of this check given a frequency of strings that contain a percent encoding (versus not having one):

test.pl 100000 0
test.pl 100000 1
test.pl 100000 5
test.pl 100000 50
test.pl 100000 95
test.pl 100000 99
test.pl 100000 100

The results are as follows:

When 0% of 100000 strings have a percent encoding,
  Elapsed time as-is: 43.47 seconds
  Elapsed time with initial check for '%': 0.03 seconds

When 1% of 100000 strings have a percent encoding,
  Elapsed time as-is: 41.06 seconds
  Elapsed time with initial check for '%': 0.49 seconds

When 5% of 100000 strings have a percent encoding,
  Elapsed time as-is: 43.58 seconds
  Elapsed time with initial check for '%': 2.34 seconds

When 50% of 100000 strings have a percent encoding,
  Elapsed time as-is: 43.77 seconds
  Elapsed time with initial check for '%': 23.38 seconds

When 95% of 100000 strings have a percent encoding,
  Elapsed time as-is: 42.01 seconds
  Elapsed time with initial check for '%': 44.32 seconds

When 99% of 100000 strings have a percent encoding,
  Elapsed time as-is: 45.05 seconds
  Elapsed time with initial check for '%': 45.01 seconds

When 100% of 100000 strings have a percent encoding,
  Elapsed time as-is: 44.01 seconds
  Elapsed time with initial check for '%': 46.22 seconds

When a small percentage of the strings have encodings, there is a significant runtime improvement. When all strings have encodings, there is a slight runtime degradation.
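For anyone wanting to repeat this kind of measurement without test.zip, a rough sketch of a comparable timing harness follows. It is not the script used above: the sub names, the input strings, and the simple stand-in pattern are all assumptions, so absolute timings will differ from the figures reported here. It takes the same two arguments, a string count and the percentage of strings that contain an encoding.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $pattern = qr/%([0-9A-Fa-f]{2})/;    # assumed stand-in pattern

# Always run the substitution
sub decode_as_is {
    my ($data) = @_;
    $data =~ s{$pattern}{ chr(hex($1)) }ge;
    return $data;
}

# Return early when the string contains no '%'
sub decode_guarded {
    my ($data) = @_;
    return $data if index($data, '%') == -1;
    $data =~ s{$pattern}{ chr(hex($1)) }ge;
    return $data;
}

# Build $count strings; $percent out of every 100 contain an encoding
sub make_inputs {
    my ($count, $percent) = @_;
    return map { $_ % 100 < $percent ? "path/with%20space/$_" : "path/plain/$_" }
        0 .. $count - 1;
}

# Wall-clock time for applying $code to every input
sub time_it {
    my ($code, @inputs) = @_;
    my $start = time;
    $code->($_) for @inputs;
    return time - $start;
}

my $count   = $ARGV[0] // 100_000;
my $percent = $ARGV[1] // 5;
my @inputs  = make_inputs($count, $percent);

printf "When %d%% of %d strings have a percent encoding,\n", $percent, $count;
printf "  Elapsed time as-is: %.2f seconds\n", time_it(\&decode_as_is, @inputs);
printf "  Elapsed time with initial check for '%%': %.2f seconds\n",
    time_it(\&decode_guarded, @inputs);
```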

I think this would be a worthwhile enhancement to implement inside the uri_decode() subroutine itself.

In our case, we use ripgrep to extract @href values from 50k-100k text files. ripgrep itself returns the matches in a few seconds, but applying uri_decode() to the incoming values via map adds 1-2 minutes of runtime. By manually adding the initial check for %, the uri_decode() runtime becomes negligible again. (Only a small fraction of the @href values have encodings.)

What puzzles me is why Perl isn't already doing this. The $encoded_chars pattern begins with a literal % character, so shouldn't the regex engine already skip strings that don't contain that character? Or is the $encoded_chars pattern perhaps recompiled on every use, causing the overhead?

Characters with lowercase hexadecimal value are double encoded

Hi,

the following code:

URI::Encode::uri_encode('%2F', {double_encode => 0});

yields the expected outcome %2F but this code:

URI::Encode::uri_encode('%2f', {double_encode => 0});

yields the unexpected outcome %252f, when it should similarly have yielded %2f. If two strings differ only in the letter case of the hexadecimal digits in a percent encoding, they should be treated as equivalent (RFC 3986, Section 2.1).
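A minimal sketch of the behaviour being asked for, assuming that double_encode => 0 is meant to leave existing percent-encoded triplets untouched. The sub encode_stray_percents is hypothetical and covers only the handling of %, not the rest of uri_encode().

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: encode only those '%'
# characters that do not already start a percent-encoded triplet.
sub encode_stray_percents {
    my ($data) = @_;
    # [0-9A-Fa-f] accepts both cases, so '%2f' is treated like '%2F'
    $data =~ s/%(?![0-9A-Fa-f]{2})/%25/g;
    return $data;
}

print encode_stray_percents('%2F'), "\n";    # %2F
print encode_stray_percents('%2f'), "\n";    # %2f
print encode_stray_percents('50%'), "\n";    # 50%25
```

The negative lookahead also avoids $' entirely, since the check on the following characters happens inside the pattern.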
