
Scurl


About Scurl

Scurl is a library meant to replace some functions in urllib, such as urlparse, urlsplit and urljoin. It is built on the Chromium URL parsing source, known as GURL.

In addition, this library is built to support the Scrapy project (hence the name Scurl). It therefore also provides canonicalize_url, a bottleneck function in Scrapy spiders, which uses the canonicalize functions from GURL to canonicalize the path, fragment and query of URLs.
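For example, canonicalization sorts query arguments and normalizes the path, so equivalent URLs compare equal. A minimal sketch, assuming canonicalize_url is importable from the top-level scurl package and matches w3lib's default behaviour:

    >>> from scurl import canonicalize_url  # assumed import path
    >>> canonicalize_url('http://www.example.com/do?b=2&a=1')
    'http://www.example.com/do?a=1&b=2'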

Since the library is built on Chromium source, performance improves substantially: urlparse, urlsplit and urljoin run 2-3 times faster than their urllib counterparts.

At the moment, we run the test suites from urllib and w3lib. Nearly all of the urllib tests pass (we are still working on passing the rest :)).

Credits

We want to give special thanks to urlparse4, since this project is built on top of it.

GSoC 2018

This project was built under the funding of the Google Summer of Code 2018 program. More detail about the program can be found here.

The final report, which contains more detail on how this project was made, can be found here.

Supported functions

Since Scurl is meant to replace functions in urllib, the following are supported: urlparse, urljoin, urlsplit and canonicalize_url.
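A minimal usage sketch, assuming the functions are exposed at the top level of the scurl package as drop-in replacements (the import path is an assumption based on the description above):

    # Sketch of drop-in usage; results should match urllib.parse.
    from scurl import urlparse, urljoin

    parts = urlparse('https://example.com/path?q=1#frag')
    print(parts.scheme, parts.netloc, parts.path)      # https example.com /path
    print(urljoin('https://example.com/a/b', '../c'))  # https://example.com/c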

Installation

Scurl has not been published to PyPI yet. Currently, the only way to install Scurl is to clone this repository:

git clone https://github.com/scrapy/scurl
cd scurl
pip install -r requirements.txt
make clean
make build_ext
make install

Available Make commands

The Make commands provide shorter ways to type common commands while developing :)

make clean

This will clean the build directory and the files generated by the build_ext command.

make test

This will run all the tests found in the /tests folder.

make build_ext

This will run the command python setup.py build_ext --inplace, which builds Cython code for this project.

make sdist

This will run the python setup.py sdist command on this project.

make install

This will run the python setup.py install command on this project.

make develop

This will run the python setup.py develop command on this project.

make perf

Run the performance tests on urlparse, urlsplit and urljoin.

make cano

Run the performance tests on canonicalize_url.

Profiling

The Scurl repository has a built-in profiling tool, which you can turn on by adding this line to the top of the *.pyx files in scurl/scurl:

# cython: profile=True

Then you can run python benchmarks/cython_profile.py --func [function-name] to get the cProfile result. Currently, Scurl supports profiling urlparse, urlsplit and canonicalize.

This is not the most convenient way to profile Scurl with cProfile, but we will come up with a way to improve this soon!
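In the meantime, you can drive cProfile yourself once the profile directive above is enabled and the extension is rebuilt. A minimal sketch (the URL and iteration count are illustrative, and the import path is an assumption):

    import cProfile
    from scurl import urlsplit  # assumed import path

    # Profile 100k calls and sort the report by cumulative time.
    cProfile.runctx(
        "for _ in range(100000): urlsplit('https://example.com/path?q=1')",
        globals(), locals(), sort='cumtime')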

Benchmarking result report

urlparse, urlsplit and urljoin

This shows the performance difference between urlparse, urlsplit and urljoin from urllib.parse and those of Scurl, measured by running these functions on the URLs from the chromiumUrls.txt file, which can also be found in this project.

The chromiumUrls.txt file contains ~83k URLs. The table below shows the time it takes to run the performance_test.py test:

              urlparse   urlsplit   urljoin
urllib.parse  0.52 sec   0.39 sec   1.33 sec
Scurl         0.19 sec   0.10 sec   0.17 sec
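A rough sketch of how a comparison like this can be reproduced with timeit (the file path and the Scurl import are assumptions):

    import timeit

    setup = (
        "from urllib.parse import urlparse as std_urlparse\n"
        "from scurl import urlparse as scurl_urlparse\n"  # assumed import path
        "urls = [l.strip() for l in open('chromiumUrls.txt') if l.strip()]\n"
    )
    for name in ('std_urlparse', 'scurl_urlparse'):
        t = timeit.timeit(f'for u in urls: {name}(u)', setup=setup, number=1)
        print(f'{name}: {t:.2f} sec')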

Canonicalize urls

The speed of canonicalize_url from scrapy/w3lib compared to that of canonicalize_url from Scurl, measured by running canonicalize_url on the URLs from the chromiumUrls.txt file, which can also be found in this project.

This measures the throughput of both functions. The test can be found in the canonicalize_test.py file:

              canonicalize_url
scrapy/w3lib  22,757 items/sec
Scurl         46,199 items/sec
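A sketch of how an items/sec figure like this can be measured (the import path and file location are assumptions):

    import time
    from scurl import canonicalize_url  # assumed; w3lib.url provides the reference version

    with open('chromiumUrls.txt') as f:
        urls = [line.strip() for line in f if line.strip()]

    start = time.perf_counter()
    for url in urls:
        canonicalize_url(url)
    elapsed = time.perf_counter() - start
    print(f'{len(urls) / elapsed:,.0f} items/sec')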

Feedback

Any feedback is highly appreciated :) Please feel free to report any errors or suggestions in the repository's issue tab!


Issues

Additional components to implement

Here are some components that could be implemented to further enhance the performance of the canonicalize_url function:

  • parse_qsl_to_bytes, which includes unquote_to_bytes from stdlib
  • urlencode
  • _unquotepath
  • quote
  • urlunsplit
  • urlparse
  • urlsplit
  • urljoin

Test failed on Mac OS py35 and py34

tox has been found to fail on macOS in the py34 and py35 environments.
Tox targets Mac OS 10.6 for some reason (which likely selects the pre-C++11 standard library, matching the errors below), and this is the traceback for the compilation error:

    In file included from scurl/cgurl.cpp:647:
    In file included from scurl/../third_party/chromium/url/third_party/mozilla/url_parse.h:8:
    ./third_party/chromium/base/strings/string16.h:207:8: error: explicit specialization of non-template struct 'hash'
    struct hash<base::string16> {
           ^   ~~~~~~~~~~~~~~~~
    In file included from scurl/cgurl.cpp:654:
    scurl/../third_party/chromium/url/gurl.h:467:8: error: no template named 'unique_ptr' in namespace 'std'
      std::unique_ptr<GURL> inner_url_;
      ~~~~~^
    2 errors generated.
    error: command '/usr/bin/clang' failed with exit status 1

Fix the test for Scurl urljoin

This test here checks nothing. We need to check that the result of Scurl's urljoin equals the result of urljoin from urllib.parse.
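A minimal sketch of what the fixed test could assert (the import path and case list are illustrative, not the project's actual fixtures):

    from urllib.parse import urljoin as std_urljoin
    from scurl import urljoin as scurl_urljoin  # assumed import path

    CASES = [
        ('https://example.com/a/b', 'c'),
        ('https://example.com/a/b', '../c'),
        ('https://example.com', '//other.org/x'),
    ]

    def test_urljoin_matches_stdlib():
        for base, ref in CASES:
            assert scurl_urljoin(base, ref) == std_urljoin(base, ref)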

Enable windows support for scurl

Right now Scurl fails to run on Windows. We will need to come up with a way to support Windows :) Details on this are coming soon.

Installation instructions are wrong

I was following the install instructions from the README (macOS 10.14.5).

There was one warning about

s3fs 0.2.1 has requirement six>=1.12.0, but you'll have six 1.11.0 which is incompatible.

... which I ignored. And then this failed:

[...]
$ make build_ext
python setup.py build_ext --inplace
Compiling scurl/cgurl.pyx because it changed.
Compiling scurl/canonicalize.pyx because it changed.
[1/2] Cythonizing scurl/canonicalize.pyx
[2/2] Cythonizing scurl/cgurl.pyx
running build_ext
building 'scurl.cgurl' extension
creating build
creating build/temp.macosx-10.14-x86_64-3.7
creating build/temp.macosx-10.14-x86_64-3.7/scurl
creating build/temp.macosx-10.14-x86_64-3.7/third_party
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base/strings
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base/third_party
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base/third_party/icu
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/url
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/url/third_party
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/url/third_party/mozilla
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I. -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c scurl/cgurl.cpp -o build/temp.macosx-10.14-x86_64-3.7/scurl/cgurl.o -std=gnu++14 -I./third_party/chromium/ -fPIC -Ofast -pthread -w -DU_COMMON_IMPLEMENTATION
scurl/cgurl.cpp:638:10: fatal error: 
      '../third_party/chromium/url/third_party/mozilla/url_parse.h' file not
      found
#include "../third_party/chromium/url/third_party/mozilla/url_parse.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
error: command 'clang' failed with exit status 1
make: *** [build_ext] Error 1

The offending folder exists but is empty.

Integrate scurl into w3lib and scrapy

We will need to work on integrating this library into Scrapy and w3lib, making it optional for users to install. For now, we can show a message while users install Scrapy if the library is not present, as in the sketch below!
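A common pattern for this kind of optional dependency (a sketch, not the actual Scrapy integration):

    # Prefer Scurl when available; fall back to the standard library.
    try:
        from scurl import urljoin, urlsplit
    except ImportError:
        from urllib.parse import urljoin, urlsplit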

GURL idna encoding

Right now GURL cannot handle IDNA URLs such as the one below; all of them are marked as invalid, even though Google Chrome parses these URLs correctly!

>>> URL('банки.рф'.encode('idna')).is_valid()
False

URL parse error

The error can be found here, where Scurl fails to handle the type of the URL being parsed.

Segfault or encoding error when parsing a URL

See #58 (comment) and #58 (comment).

Repeated here:

Traceback (most recent call last):
  File "./bin/triage_links", line 34, in get_url_parts
    link = urljoin(record.url, record.href)
  File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
  File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/triage_links", line 102, in <module>
    main()
  File "./bin/triage_links", line 13, in main
    CSVPipeline(callback=process).execute()
  File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
    self.save_csv()
  File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
    df = df.compute()
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11

To reproduce, run a broad crawl on this dataset and extract all links:

https://www.kaggle.com/cheedcheed/top1m

Then use urljoin() and urlsplit() on each one, as in the sketch below.
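A sketch of the reproduction loop (the record handling is illustrative):

    # Join and split every link extracted by a broad crawl.
    from scurl import urljoin, urlsplit

    def triage_links(records):
        # records: iterable of (page_url, href) pairs from the crawl
        for page_url, href in records:
            link = urljoin(page_url, href)  # raised UnicodeDecodeError or segfaulted
            urlsplit(link)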

Raise an exception for invalid URLs

In the test file, there are some invalid IPv6 URLs for which Scurl does not raise an exception, as can be seen in this comment. We will need to work on implementing this to pass the test :)
Those invalid URLs can be found here; they are:

'http://::12.34.56.78]/',
'http://[::1/foo/',
'ftp://[::1/foo/bad]/bad',
'http://[::1/foo/bad]/bad',
'http://[::ffff:12.34.56.78'

This can probably be fixed by adding exception handling in this class, following this and this from the Python repository. A sketch of the expected behaviour is below.
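For reference, urllib.parse raises ValueError for each of these inputs, since every one has a '[' without a matching ']' or vice versa, and Scurl should do the same. A minimal test sketch (pytest and the import path are assumptions):

    import pytest
    from scurl import urlsplit  # assumed import path

    INVALID = [
        'http://::12.34.56.78]/',
        'http://[::1/foo/',
        'ftp://[::1/foo/bad]/bad',
        'http://[::1/foo/bad]/bad',
        'http://[::ffff:12.34.56.78',
    ]

    @pytest.mark.parametrize('url', INVALID)
    def test_invalid_ipv6_url_raises(url):
        with pytest.raises(ValueError):
            urlsplit(url)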

icu is commented out in Chromium source

Right now, this line is commented out. We should figure out a way to enable ICU for this project :)

The old Chromium source has ICU commented out, probably because the third-party ICU library is difficult to configure. However, because the IDNA block of code in the Chromium source is commented out, an IDNA hostname such as ουτοπία.δπθ.gr is not converted to its ASCII form.

At the moment, we handle this by encoding the hostname with Python's encode() function, as can be seen here. However, it would be really nice to improve Scurl's performance by figuring out how to use ICU in the Chromium source!
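For reference, a sketch of the Python-side fallback (the hostname is an example; the actual code in the repository may differ):

    # Convert an IDNA hostname to its ASCII (punycode) form in Python.
    hostname = 'ουτοπία.δπθ.gr'
    ascii_host = hostname.encode('idna').decode('ascii')
    print(ascii_host)  # xn--kxae4bafwg.xn--pxaix.gr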

undefined symbol: _ZTVN10__cxxabiv117__class_type_infoE

In [2]: from scurl import urlparse                                                                                                                    
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-a0b8582bdd7c> in <module>
----> 1 from scurl import urlparse

~/.virtualenvs/myenv/lib/python3.6/site-packages/scurl/__init__.py in <module>
     13 _original_urlparse = urlparse
     14 
---> 15 from scurl.cgurl import urlsplit, urljoin, urlparse
     16 """
     17 TODO: find some way to not import parse_url

ImportError: /home/user/.virtualenvs/myenv/lib/python3.6/site-packages/scurl/cgurl.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZTVN10__cxxabiv117__class_type_infoE

System: Ubuntu 16.04
Python 3.6
