ada-url / ada

WHATWG-compliant and fast URL parser written in modern C++

Home Page: https://ada-url.com

License: Apache License 2.0

CMake 4.48% C 2.96% C++ 89.31% Shell 0.46% Python 2.66% Rust 0.10% Dockerfile 0.02%
cpp parser url performance whatwg-url neon sse2 simd

ada's Introduction

Ada

Ada is a fast and spec-compliant URL parser written in C++. The URL parsing specification can be found on the WHATWG website.

The Ada library passes the full range of tests from the specification, across a wide range of platforms (e.g., Windows, Linux, macOS). It fully supports the relevant Unicode Technical Standard.

A common use of a URL parser is to take a URL string and normalize it. The WHATWG URL specification has been adopted by most browsers. Other tools, such as curl and many standard libraries, follow RFC 3986. The following table illustrates possible differences in practice (encoding of the host, encoding of the path):

string source              string value
input string               https://www.7‑Eleven.com/Home/Privacy/Montréal
ada's normalized string    https://www.xn--7eleven-506c.com/Home/Privacy/Montr%C3%A9al
curl 7.87                  (returns the original unchanged)
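
The path column of the table can be reproduced with a few lines of code. The following is a minimal sketch, not ada's implementation: the function name `percent_encode_non_ascii` is hypothetical, and the rule is simplified (the WHATWG path percent-encode set also covers a handful of ASCII punctuation characters that this sketch leaves untouched). The host conversion (Punycode) is a separate and more involved step that is not shown.

```cpp
#include <string>
#include <string_view>

// Percent-encode control characters, the space character, and every byte
// outside printable ASCII. The input is treated as raw UTF-8 bytes.
std::string percent_encode_non_ascii(std::string_view input) {
  static constexpr char hex[] = "0123456789ABCDEF";
  std::string out;
  out.reserve(input.size());
  for (unsigned char byte : input) {
    if (byte <= 0x20 || byte >= 0x7F) {
      out += '%';
      out += hex[byte >> 4];    // high nibble
      out += hex[byte & 0x0F];  // low nibble
    } else {
      out += static_cast<char>(byte);
    }
  }
  return out;
}
```

The letter é is stored as the two UTF-8 bytes 0xC3 0xA9, which is why it appears as %C3%A9 in the normalized path.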

Requirements

The project is self-contained and has no dependencies. It requires a recent C++ compiler supporting C++17. We test with GCC 9 or better, LLVM 10 or better, and Microsoft Visual Studio 2022.

Ada is fast.

On a benchmark where we need to validate and normalize thousands of URLs found on popular websites, we find that ada can be several times faster than popular competitors (system: Apple MacBook 2022 with LLVM 14).

      ada ▏  188 ns/URL ███▏
servo url ▏  664 ns/URL ███████████▎
     CURL ▏ 1471 ns/URL █████████████████████████

Ada has improved the performance of the popular JavaScript environment Node.js:

Since Node.js 18, a new URL parser dependency was added to Node.js — Ada. This addition bumped the Node.js performance when parsing URLs to a new level. Some results could reach up to an improvement of 400%. (State of Node.js Performance 2023)

The Ada library is used by important systems besides Node.js such as Redpanda and Cloudflare Workers.

Quick Start

Linux or macOS users can follow these instructions if they have a recent C++ compiler and a standard utility (wget) installed.

  1. Pull the library in a directory

    wget https://github.com/ada-url/ada/releases/download/v2.6.10/ada.cpp
    wget https://github.com/ada-url/ada/releases/download/v2.6.10/ada.h
    
  2. Create a new file named demo.cpp with this content:

     #include "ada.cpp"
     #include "ada.h"
     #include <cstdlib>
     #include <iostream>
    
     int main(int, char *[]) {
       auto url = ada::parse<ada::url>("https://www.google.com");
       if (!url) {
         std::cout << "failure" << std::endl;
         return EXIT_FAILURE;
       }
       url->set_protocol("http");
       std::cout << url->get_protocol() << std::endl;
       std::cout << url->get_host() << std::endl;
       return EXIT_SUCCESS;
     }
  3. Compile

    c++ -std=c++17 -o demo demo.cpp
    
  4. ./demo

    http:
    www.google.com
    

Bindings of Ada

We provide clients for different programming languages through our C API.

  • Rust: Rust bindings for Ada
  • Go: Go bindings for Ada
  • Python: Python bindings for Ada
  • R: R wrapper for Ada

Usage

Ada supports two types of URL instances, ada::url and ada::url_aggregator. The usage is the same in either case: we have a parsing function template ada::parse which can return either a result of type ada::result<ada::url> or of type ada::result<ada::url_aggregator> depending on your needs. The ada::url_aggregator class is smaller and it is backed by a precomputed serialized URL string. The ada::url class is made of several separate strings for the various components (path, host, and so forth).
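
To make the difference between the two representations concrete, here is a toy sketch. It is not ada's actual layout: the types `toy_aggregator` and `toy_url` and their members are hypothetical and cover only three components. The first keeps one serialized string plus offsets into it, the second keeps one string per component.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Toy "aggregator": one serialized string plus offsets into it.
// Getters return views into the buffer, so reading a component copies nothing.
struct toy_aggregator {
  std::string href;
  std::size_t protocol_end;  // one past the ':' that ends the scheme
  std::size_t host_start;
  std::size_t host_end;      // the pathname runs from here to the end

  std::string_view get_protocol() const {
    return std::string_view(href).substr(0, protocol_end);
  }
  std::string_view get_host() const {
    return std::string_view(href).substr(host_start, host_end - host_start);
  }
  std::string_view get_pathname() const {
    return std::string_view(href).substr(host_end);
  }
};

// Toy "url": each component owns its own string; href is rebuilt on demand.
struct toy_url {
  std::string protocol;
  std::string host;
  std::string pathname;

  std::string get_href() const { return protocol + "//" + host + pathname; }
};
```

With a layout like the first one, reading the full serialized URL is free but changing a component means editing the shared buffer and shifting offsets. With the second, setters are simple but the serialized form has to be assembled.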

Parsing & Validation

  • Parse and validate a URL from an ASCII or a valid UTF-8 string.
ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("https://www.google.com");
if (url) { /* URL is valid */ }

After calling 'parse', you must check that the result is valid before accessing it when you are not sure that it will succeed. The following code is unsafe:

ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("some bad url");
url->get_href();

Instead, check the result before using it:

ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("some bad url");
if(url) {
  // next line is now safe:
  url->get_href();
} else {
  // report a parsing failure
}

For simplicity, in the examples below, we skip the check because we know that parsing succeeds. All strings are assumed to be valid UTF-8 strings.

Examples

  • Get/Update credentials
ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("https://www.google.com");
url->set_username("username");
url->set_password("password");
// url->get_href() will return "https://username:password@www.google.com/"
  • Get/Update Protocol
ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("https://www.google.com");
url->set_protocol("wss");
// url->get_protocol() will return "wss:"
// url->get_href() will return "wss://www.google.com/"
  • Get/Update host
ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("https://www.google.com");
url->set_host("github.com");
// url->get_host() will return "github.com"
// you can use `url->set_hostname` depending on your usage.
  • Get/Update port
ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("https://www.google.com");
url->set_port("8080");
// url->get_port() will return "8080"
  • Get/Update pathname
ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("https://www.google.com");
url->set_pathname("/my-super-long-path");
// url->get_pathname() will return "/my-super-long-path"
  • Get/Update search/query
ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("https://www.google.com");
url->set_search("target=self");
// url->get_search() will return "?target=self"
  • Get/Update hash/fragment
ada::result<ada::url_aggregator> url = ada::parse<ada::url_aggregator>("https://www.google.com");
url->set_hash("is-this-the-real-life");
// url->get_hash() will return "#is-this-the-real-life"

For more information about command-line options, please refer to the CLI documentation.

  • URL search params
ada::url_search_params search_params("a=b&c=d&e=f");
search_params.append("g", "h");

search_params.get("g");  // will return "h"

auto keys = search_params.get_keys();
while (keys.has_next()) {
  auto key = keys.next();  // "a", "c", "e", "g"
}

C wrapper

See the file include/ada_c.h for our C interface. We expect ASCII or UTF-8 strings.

#include "ada_c.h"
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

static void ada_print(ada_string string) {
  printf("%.*s\n", (int)string.length, string.data);
}

int main(int c, char *arg[] ) {
  const char* input =
      "https://username:password@host:8080/"
      "pathname?query=true#hash-exists";
  ada_url url = ada_parse(input, strlen(input));
  if(!ada_is_valid(url)) { puts("failure"); return EXIT_FAILURE; }
  ada_print(ada_get_href(url)); // prints https://username:password@host:8080/pathname?query=true#hash-exists
  ada_print(ada_get_protocol(url)); // prints https:
  ada_print(ada_get_username(url)); // prints username
  ada_set_href(url, "https://www.yagiz.co", strlen("https://www.yagiz.co"));  
  if(!ada_is_valid(url)) { puts("failure"); return EXIT_FAILURE; }
  ada_set_hash(url, "new-hash", strlen("new-hash"));
  ada_set_hostname(url, "new-host", strlen("new-host"));
  ada_set_host(url, "changed-host:9090", strlen("changed-host:9090"));
  ada_set_pathname(url, "new-pathname", strlen("new-pathname"));
  ada_set_search(url, "new-search", strlen("new-search"));
  ada_set_protocol(url, "wss", 3);  
  ada_print(ada_get_href(url)); // will print wss://changed-host:9090/new-pathname?new-search#new-hash

  // Manipulating search params
  ada_string search = ada_get_search(url);
  ada_url_search_params search_params =
      ada_parse_search_params(search.data, search.length);
  ada_search_params_append(search_params, "a", 1, "b", 1);
  ada_owned_string result = ada_search_params_to_string(search_params);
  ada_set_search(url, result.data, result.length);
  ada_free_owned_string(result);
  ada_free_search_params(search_params);

  ada_free(url);
  return EXIT_SUCCESS;
}

When linking against the ada library from C++, keep in mind that ada requires access to the standard C++ library. E.g., you may link with the C++ compiler.

E.g., if you grab our single-header C++ files (ada.cpp and ada.h), as well as the C header (ada_c.h), you can often compile a C program (demo.c) as follows under Linux/macOS systems:

c++ -c ada.cpp -std=c++17
cc -c demo.c
c++ demo.o ada.o -o cdemo
./cdemo

CMake dependency

See the file tests/installation/CMakeLists.txt for an example of how you might use ada from your own CMake project, after having installed ada on your system.

Installation

Homebrew

Ada is available through Homebrew. You can install Ada using brew install ada-url.

Contributing

Building

Ada uses CMake as its build system. We recommend running the following commands to build it locally.

  • Build: cmake -B build && cmake --build build
  • Test: ctest --output-on-failure --test-dir build

Windows users need additional flags to specify the build configuration, e.g. --config Release.

The project can also be built with Docker, using the default Dockerfile in the repository, with the following command.

docker build -t ada-builder . && docker run --rm -it -v ${PWD}:/repo ada-builder

Amalgamation

You may amalgamate all source files into only two files (ada.h and ada.cpp) by executing the Python 3 script singleheader/amalgamate.py. By default, the files are created in the singleheader directory.

License

This code is made available under the Apache License 2.0 as well as the MIT license.

Our tests include third-party code and data. The benchmarking code includes third-party code: it is provided for research purposes only and not part of the library.

Further reading

ada's People

Contributors

actions-user, anonrig, ayshvab, carlosedur, d3lm, dependabot[bot], dfrostbytes, github-actions[bot], iamtimmy, jasnell, lemire, miguelteixeiraa, myd7349, nick-nuon, okanpinar, pascaldekloe, pratikpc, q66, rluvaton, rockwotj, ronag, seantolstoyevski, star-hengxing, the-moisrex, tniessen, ttsugriy, ulisesgascon, vanemoraess, wx257osn2, zzzode

ada's Issues

Implement logging

We should allow compile-time selection of a logging flag so that a deep understanding of the execution is easy (without having to use a debugger).

See for example how we did it with simdjson:
simdjson/simdjson#1938

-Wunsequenced error on basic_fuzzer

/Users/yagiz/Developer/url-parser/tests/basic_fuzzer.cpp:37:59: warning: multiple unsequenced modifications to 'counter' [-Wunsequenced]
                  copy.insert(copy.begin()+(211311*counter++)%copy.size(), char(counter++*777));
                                                          ^                            ~~
1 warning generated.
[55/55] Linking CXX executable benchmarks/bench

Build finished
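
The warning arises because `counter++` appears twice among the arguments of a single call, so the two increments are not ordered relative to each other. One possible fix is to split the computation into separate statements. This is a sketch under assumptions: the helper name `insert_noise` is hypothetical, `copy` is assumed to be a non-empty `std::string`, `counter` is assumed to be an unsigned integer, and the chosen order (position first, then value) is one of the two orders the original line could have meant.

```cpp
#include <cstddef>
#include <string>

// Insert one pseudo-random character into `copy`, advancing `counter` twice
// in a well-defined order. `copy` must be non-empty.
void insert_noise(std::string& copy, std::size_t& counter) {
  const std::size_t position = (211311 * counter++) % copy.size();
  const char value = static_cast<char>(counter++ * 777);
  copy.insert(copy.begin() + static_cast<std::ptrdiff_t>(position), value);
}
```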

Capacity does not indicate the safe buffer size

The convention in the C++ runtime is that capacity() is merely an indication of the memory allocation that might have been done. However, it is not considered safe to write to this memory as far as I know.

Potentially unsafe usage:

result.data(), int32_t(result.capacity()),
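
A safer pattern is to resize the string first and pass `size()` as the buffer length, then trim to the number of bytes actually written. The sketch below illustrates this with a hypothetical stand-in (`write_upper`) for a C-style API that fills a caller-provided buffer; neither function is part of ada.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical stand-in for a C-style API that fills a caller-provided buffer
// and returns the number of bytes written.
int32_t write_upper(std::string_view input, char* buffer, int32_t buffer_size) {
  const int32_t n = std::min<int32_t>(int32_t(input.size()), buffer_size);
  for (int32_t i = 0; i < n; i++) {
    const char c = input[std::size_t(i)];
    buffer[i] = (c >= 'a' && c <= 'z') ? char(c - 'a' + 'A') : c;
  }
  return n;
}

std::string to_upper(std::string_view input) {
  std::string result;
  result.resize(input.size());  // size(), not capacity(), bounds the writable area
  const int32_t written =
      write_upper(input, result.data(), int32_t(result.size()));
  result.resize(std::size_t(written));  // trim to what was actually produced
  return result;
}
```

Calling `reserve()` instead of `resize()` would only change the capacity: the bytes beyond `size()` would still not belong to the string.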

Provide a way to construct ada::url from a struct

For example, Node uses the following struct. We need to construct an ada::url from a struct. If we use ada::url value; and then call value.set_port(), it might not work, since the setters have limitations (for example, you can't set the port if the host does not have a value). This also needs to properly set the private variables inside our ada::url struct.

struct url {
  int32_t flags = URL_FLAGS_NONE;
  int port = -1;
  std::string scheme;
  std::string username;
  std::string password;
  std::string host;
  std::string query;
  std::string fragment;
  std::vector<std::string> path;
  std::string href;
};

with the following flags:

#define FLAGS(XX)                                                             \
  XX(URL_FLAGS_NONE, 0)                                                       \
  XX(URL_FLAGS_FAILED, 0x01)                                                  \
  XX(URL_FLAGS_CANNOT_BE_BASE, 0x02)                                          \
  XX(URL_FLAGS_INVALID_PARSE_STATE, 0x04)                                     \
  XX(URL_FLAGS_TERMINATED, 0x08)                                              \
  XX(URL_FLAGS_SPECIAL, 0x10)                                                 \
  XX(URL_FLAGS_HAS_USERNAME, 0x20)                                            \
  XX(URL_FLAGS_HAS_PASSWORD, 0x40)                                            \
  XX(URL_FLAGS_HAS_HOST, 0x80)                                                \
  XX(URL_FLAGS_HAS_PATH, 0x100)                                               \
  XX(URL_FLAGS_HAS_QUERY, 0x200)                                              \
  XX(URL_FLAGS_HAS_FRAGMENT, 0x400)                                           \
  XX(URL_FLAGS_IS_DEFAULT_SCHEME_PORT, 0x800)                                 \
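
For readers unfamiliar with this "X-macro" pattern, here is a minimal sketch of how such a list is typically expanded. It uses a shortened, renamed list (`DEMO_FLAGS`) so that it is self-contained; the `flag_name` helper is hypothetical and only shows that the same list can drive more than one expansion.

```cpp
#include <cstdint>
#include <string_view>

#define DEMO_FLAGS(XX)        \
  XX(URL_FLAGS_NONE, 0)       \
  XX(URL_FLAGS_FAILED, 0x01)  \
  XX(URL_FLAGS_SPECIAL, 0x10) \
  XX(URL_FLAGS_HAS_HOST, 0x80)

// First expansion: every XX(name, value) entry becomes an enumerator.
enum url_flags : std::int32_t {
#define XX(name, value) name = value,
  DEMO_FLAGS(XX)
#undef XX
};

// Second expansion: the same list becomes a value-to-name lookup, so the
// enumerators and their printable names cannot drift apart.
constexpr std::string_view flag_name(std::int32_t flag) {
  switch (flag) {
#define XX(name, value) \
  case value:           \
    return #name;
    DEMO_FLAGS(XX)
#undef XX
    default:
      return "unknown";
  }
}
```

Because each flag is a distinct bit, several of them can be combined in the `flags` field with bitwise OR.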

Why isn't parse_url a constructor?

The parse_url function returns a URL. In many URL parsing libraries, there is no parse_url and you just do...

ada::url url("http://google.com");

or the equivalent.

Now, having a parsing function could make sense if you have a parser. So you do...

ada::parser p;
ada::url url = p.parse("http://google.com");

This can make sense because the parser instance can hold some resources (such as space for temporary buffers) and it can also hold configuration switches.

Simplify parse_url

The current implementation of parse_url does not have an ideal implementation as far as maintainability is concerned.

  • We started out with std::string_view::iterator instances. The finite state machine would sometimes decrement the iterators before the string start, only to soon increment them.
  • We fixed this by short-circuiting the design with direct jumps.
  • In a later PR, we removed the std::string_view::iterator and now work with integers (input_position). The integer is maintained in a range between 0 and the input length (inclusively).

@ronag proposed what I feel is a better design at: #169

Instead of systematically incrementing the integer position with each pass, he just expects the states to increment as needed. This is simpler because we don't need to decrement and then reincrement.

It should be possible to implement parse_url in a forward-only design (where you never go back, you only go forward).

Alternatively, it could be implemented with a string_view that gets shorter and shorter.
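
The "string_view that gets shorter and shorter" idea can be sketched on a single component. This is an illustration under assumptions, not the proposed implementation: the function `consume_scheme` and its helpers are hypothetical, and only the scheme is handled. Each step consumes a prefix of the view and never looks back.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

constexpr bool is_ascii_alpha(char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
constexpr bool is_ascii_digit(char c) { return c >= '0' && c <= '9'; }

// Consume a scheme from the front of `input`. On success, return the scheme
// and shorten `input` so that it starts right after the ':'. On failure,
// leave `input` unchanged.
std::optional<std::string_view> consume_scheme(std::string_view& input) {
  if (input.empty() || !is_ascii_alpha(input[0])) return std::nullopt;
  std::size_t end = 1;
  while (end < input.size() &&
         (is_ascii_alpha(input[end]) || is_ascii_digit(input[end]) ||
          input[end] == '+' || input[end] == '-' || input[end] == '.')) {
    end++;
  }
  if (end == input.size() || input[end] != ':') return std::nullopt;
  const std::string_view scheme = input.substr(0, end);
  input.remove_prefix(end + 1);  // drop the scheme and the ':' for good
  return scheme;
}
```

A later step (authority, path, and so on) would receive the shortened view and proceed in the same way, so no position ever needs to be decremented.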

State of the project

Supported states:

  • AUTHORITY - #14
  • SCHEME_START - #2
  • SCHEME - #2
  • HOST - #12
  • NO_SCHEME - #14
  • FRAGMENT - #12
  • RELATIVE - #14
  • RELATIVE_SLASH - #8
  • FILE - #14
  • FILE_HOST - #14
  • FILE_SLASH - #14
  • PATH_OR_AUTHORITY - #8
  • SPECIAL_AUTHORITY_IGNORE_SLASHES - #8
  • SPECIAL_AUTHORITY_SLASHES - #8
  • SPECIAL_RELATIVE_OR_AUTHORITY - #8
  • QUERY - #12
  • PATH - #14
  • PATH_START - #14
  • OPAQUE_PATH - #13
  • PORT - #14

split the headers

We mix implementation (definition) and declaration in our headers. We should split the headers into something.h (just declaration) and implementation-inl.h (inline definitions).

Invalid unicode inputs?

Currently, we assume valid UTF-8 inputs and we only test with valid UTF-8 inputs. We do not actually check that we have valid UTF-8.

We need to determine whether the library is expected to handle invalid unicode inputs productively.
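
If the library were to reject invalid input up front, a check along the following lines could be run before parsing. This is a straightforward scalar sketch with a hypothetical name (`is_valid_utf8`), not an existing ada function, and it makes no attempt at speed. It rejects truncated sequences, stray continuation bytes, overlong encodings, surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

bool is_valid_utf8(std::string_view s) {
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    if (lead < 0x80) {  // ASCII
      i++;
      continue;
    }
    std::size_t length;
    std::uint32_t code_point;
    std::uint32_t minimum;  // smallest code point allowed for this length
    if ((lead & 0xE0) == 0xC0) {
      length = 2; code_point = lead & 0x1F; minimum = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      length = 3; code_point = lead & 0x0F; minimum = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      length = 4; code_point = lead & 0x07; minimum = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (s.size() - i < length) return false;  // truncated sequence
    for (std::size_t k = 1; k < length; k++) {
      const unsigned char next = static_cast<unsigned char>(s[i + k]);
      if ((next & 0xC0) != 0x80) return false;  // not a continuation byte
      code_point = (code_point << 6) | (next & 0x3F);
    }
    if (code_point < minimum) return false;                        // overlong
    if (code_point > 0x10FFFF) return false;                       // out of range
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return false;  // surrogate
    i += length;
  }
  return true;
}
```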

Provide `domainToASCII` and `domainToUnicode` functions

Node has domainToASCII and domainToUnicode coupled with the URL parser. We should have an API layer that does the same thing:

void DomainToUnicode(const FunctionCallbackInfo<Value>& args) {
  Environment* env = Environment::GetCurrent(args);
  CHECK_GE(args.Length(), 1);
  CHECK(args[0]->IsString());
  Utf8Value value(env->isolate(), args[0]);

  URLHost host;
  // Assuming the host is used for a special scheme.
  host.ParseHost(*value, value.length(), true, true);
  if (host.ParsingFailed()) {
    args.GetReturnValue().Set(FIXED_ONE_BYTE_STRING(env->isolate(), ""));
    return;
  }
  std::string out = host.ToStringMove();
  args.GetReturnValue().Set(
      String::NewFromUtf8(env->isolate(), out.c_str()).ToLocalChecked());
}

path parsing fails in some cases

We fail to parse this scenario:

  {
    "input": "http:foo.com",
    "base": "http://example.org/foo/bar",
    "href": "http://example.org/foo/foo.com",
    "origin": "http://example.org",
    "protocol": "http:",
    "username": "",
    "password": "",
    "host": "example.org",
    "hostname": "example.org",
    "port": "",
    "pathname": "/foo/foo.com",
    "search": "",
    "hash": ""
  },

Also in this scenario:

  {
    "input": "a:\t foo.com",
    "base": "http://example.org/foo/bar",
    "href": "a: foo.com",
    "origin": "null",
    "protocol": "a:",
    "username": "",
    "password": "",
    "host": "",
    "hostname": "",
    "port": "",
    "pathname": " foo.com",
    "search": "",
    "hash": ""
  }

There are other failing cases.

We probably should not change the case of the domain

When you receive a domain name or label, you should preserve its case. The rationale for this choice is that we may someday need to add full binary domain names for new services; existing services would not be changed.

RFC 1034 : https://www.rfc-editor.org/rfc/rfc1034

I do not find anything at https://url.spec.whatwg.org/#url-parsing saying that we should lowercase the domain. The specification refers to case-insensitive matching, which can be accomplished by lowercasing the strings being compared, but that is not the same thing as storing the lowercase version of the domain.

We should, similarly, check whether other strings that we manipulate should be stored with their case changed.
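
The distinction can be made concrete with a small sketch: the comparison below ignores ASCII case without modifying either string, so the stored domain keeps its original spelling. The helper names `ascii_lower` and `ascii_iequals` are hypothetical.

```cpp
#include <algorithm>
#include <string_view>

constexpr char ascii_lower(char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Compare two strings while ignoring ASCII case. Neither input is modified.
bool ascii_iequals(std::string_view a, std::string_view b) {
  return a.size() == b.size() &&
         std::equal(a.begin(), a.end(), b.begin(), [](char x, char y) {
           return ascii_lower(x) == ascii_lower(y);
         });
}
```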

Test installation

We should run tests in CI where you do cmake --install and then check that a project can make use of the result.

Here is how it is done in other projects...

          cmake --build .   &&
          ctest -j --output-on-failure &&
          cmake --install . &&
          cd ../tests/installation_tests/find &&
          mkdir build && cd build && cmake -DCMAKE_INSTALL_PREFIX:PATH=../../../build/destination .. &&  cmake --build .

(see simdutf)

Remove std::optional<ada::state> state_override from parse_url

If we fix #74 and #75 then the parse_url function will no longer need a std::optional<ada::state> state_override parameter. This will drastically simplify the parse_url function, increasing maintainability significantly. The resulting code will be more modular and easier to optimize.

UTF-16BE and UTF-16LE advertised for set_pathname but currently unsupported

I think that this should be the tests for it...

{
    "comment": [
        "#Additional tests designed by the ada team."
    ],
    "pathname": [
        {
            "href": "https://lemire.me",
            "new_value": "école",
            "encoding": "UTF-8",
            "expected": {
                "href": "https://example.net/%C3%A9cole",
                "pathname": "/%C3%A9cole"
            }
        },
        {
            "href": "https://lemire.me",
            "new_value": "école",
            "encoding": "UTF-16LE",
            "expected": {
                "href": "https://example.net/%E9%00c%00o%00l%00e%00",
                "pathname": "/%E9%00c%00o%00l%00e%00"
            }
        },
        {
            "href": "https://example.net#nav",
            "new_value": "école",
            "encoding": "UTF-16BE",
            "expected": {
                "href": "https://lemire.me/%00%E9%00c%00o%00l%00e",
                "pathname": "/%00%E9%00c%00o%00l%00e"
            }
        }
    ]
}
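
The expected pathname strings in the UTF-16 cases follow from the byte representation of "école" once non-printable bytes are percent-encoded. The sketch below reproduces them under simplifying assumptions: both function names are hypothetical, and the encoding rule (control characters, space, and bytes outside printable ASCII) is narrower than the full WHATWG path percent-encode set.

```cpp
#include <string>

// Serialize UTF-16 code units as bytes, little-endian or big-endian.
std::string utf16_bytes(const std::u16string& text, bool big_endian) {
  std::string bytes;
  for (char16_t unit : text) {
    const char high = static_cast<char>(static_cast<unsigned char>(unit >> 8));
    const char low = static_cast<char>(static_cast<unsigned char>(unit & 0xFF));
    bytes += big_endian ? high : low;
    bytes += big_endian ? low : high;
  }
  return bytes;
}

// Percent-encode control characters, space, and non-ASCII bytes.
std::string percent_encode_bytes(const std::string& bytes) {
  static constexpr char hex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char byte : bytes) {
    if (byte <= 0x20 || byte >= 0x7F) {
      out += '%';
      out += hex[byte >> 4];
      out += hex[byte & 0x0F];
    } else {
      out += static_cast<char>(byte);
    }
  }
  return out;
}
```

In UTF-16LE, é (U+00E9) is the byte pair E9 00 and each ASCII letter is followed by a zero byte, which gives %E9%00c%00o%00l%00e%00. In UTF-16BE the zero byte comes first, which gives %00%E9%00c%00o%00l%00e.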

Improve build process by separating benchmark & test

Right now, CMake includes both the benchmarks and tests folders whenever tests are enabled, which increases the build time for both ourselves and CI. It would be good to have separate flags for tests and benchmarks.

Simplify set_host and set_hostname

Both of them have the same code, except for a single line:

If state override is given and state override is [hostname state](https://url.spec.whatwg.org/#hostname-state), then return.

How should IDNA encode https://faß.ExAmPlE/?

Take the URI https://faß.ExAmPlE/ (encoded as https://fa\xc3\x9f.ExAmPlE/).

The Brave browser and Microsoft Edge will map this to https://fass.example

Firefox and Safari map it to https://xn--fa-hia.example

The command line curl tool can't seem to process it.

The command line wget tool maps it to fa\303\237.example.

If you try the following in curl...

#include <curl/curl.h>
#include <stdio.h>

int main() {
  CURLU *url = curl_url();
  CURLUcode rc = curl_url_set(url, CURLUPART_URL, "https://fa\xc3\x9f.ExAmPlE/",
                              CURLU_URLENCODE);
  if (!rc) {
    char *host;
    rc = curl_url_get(url, CURLUPART_HOST, &host, 0);
    if (!rc) {
      printf("the host is %s\n", host);
      curl_free(host);
    }

    rc = curl_url_get(url, CURLUPART_HOST, &host, CURLU_URLENCODE);
    if (!rc) {
      printf("the host is %s\n", host);
      curl_free(host);
    }
  }
  curl_url_cleanup(url);
}

(Compile as c++ test.cpp -lcurl), I get...

the host is faß.ExAmPlE
the host is faß.ExAmPlE

Maybe I am misusing curl?

There is an additional flag in curl related to punycode, but it is only available in curl 7.88.0, which is seemingly unreleased as I write these lines: https://curl.se/changes.html

Add WPT runner

We need a WPT runner to understand where we are and what kind of errors we have. This commit has a script to update the results from the Git repository of WPT. It would be satisfactory if we can parse the JSON and run the tests according to each object.

Implement getters & setters

It might look as follows...

std::string(input_url.get_scheme())
                   +":"
                   + (input_url.host.has_value() ? 
                   "//"
                   + input_url.username
                   + (input_url.password.empty() ? "" : ":" + input_url.password)
                   + (input_url.includes_credentials() ? "@" : "")
                   + input_url.host.value()
                   + (input_url.port.has_value() ? ":" + std::to_string(input_url.port.value()) : "")
                   : "")
                   + input_url.path 
                   + (input_url.query.has_value() ? "?" +input_url.query.value() : "")
                   + (input_url.fragment.has_value() ? "#" + input_url.fragment.value() : "");
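
The same serialization can be written as a self-contained getter. This is a sketch only: the struct `demo_url` is a hypothetical stand-in whose fields mirror the names used in the expression above, not ada's actual class.

```cpp
#include <cstdint>
#include <optional>
#include <string>

struct demo_url {
  std::string scheme;
  std::string username;
  std::string password;
  std::optional<std::string> host;
  std::optional<std::uint16_t> port;
  std::string path;
  std::optional<std::string> query;
  std::optional<std::string> fragment;

  bool includes_credentials() const {
    return !username.empty() || !password.empty();
  }

  // Serialize the components in order: scheme, authority, path, query, fragment.
  std::string get_href() const {
    std::string out = scheme + ":";
    if (host.has_value()) {
      out += "//";
      if (includes_credentials()) {
        out += username;
        if (!password.empty()) out += ":" + password;
        out += "@";
      }
      out += *host;
      if (port.has_value()) out += ":" + std::to_string(*port);
    }
    out += path;
    if (query.has_value()) out += "?" + *query;
    if (fragment.has_value()) out += "#" + *fragment;
    return out;
  }
};
```

Appending to one string step by step avoids the chain of temporary strings that the single large expression creates, and it keeps each optional component on its own line.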
