howie6879 / ruia

Async Python 3.6+ web scraping micro-framework based on asyncio

Home Page: https://www.howie6879.cn/ruia/

License: Apache License 2.0

Python 94.43% HTML 5.54% Shell 0.04%
asyncio aiohttp asyncio-spider crawler crawling-framework spider uvloop ruia python-ruia middlewares

ruia's Introduction

Ruia logo

Ruia

🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio.

⚡ Write less, run faster.


Overview

Ruia is an async web scraping micro-framework written with asyncio and aiohttp. It aims to make crawling URLs as convenient as possible.

Write less, run faster:

Features

  • Easy: Declarative programming
  • Fast: Powered by asyncio
  • Extensible: Middlewares and plugins
  • Powerful: JavaScript support

Installation

# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# New features
pip install git+https://github.com/howie6879/ruia

Tutorials

  1. Overview
  2. Installation
  3. Define Data Items
  4. Spider Control
  5. Request & Response
  6. Customize Middleware
  7. Write a Plugin

TODO

  • Caching for debugging, to reduce repeated live requests: ruia-cache
  • Provide an easy way to debug scripts: ruia-shell
  • Distributed crawling/scraping

Contribution

Ruia is still under development; feel free to open issues and pull requests:

  • Report or fix bugs
  • Require or publish plugins
  • Write or fix documentation
  • Add test cases

Notice: we use black to format the code.

Thanks

ruia's People

Contributors

123seven, abmyii, daijiangtian, duolaaoa, fengdongfa1995, fossabot, howie6879, laggardkernel, leezj9671, maxzheng, panhaoyu, ruiruizhou, ruter


ruia's Issues

Type checking for start_urls

    def __init__(self, middleware=None, loop=None, is_async_start=False):
        if not self.start_urls or not isinstance(self.start_urls, list):
            raise ValueError("Spider must have a param named start_urls, eg: start_urls = ['https://www.github.com']")

This type check should accept any collections.abc.Iterable, not just list.

Here's an example:

I want to crawl a website like Hacker News.
Pages are ordered by number,
so I wrote start_urls like this:

start_urls = (f'https://some.site.com/{index}' for index in range(1, 100000))

Then it took a long time to start crawling.
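
For illustration, a minimal sketch of the looser check proposed above (check_start_urls is a made-up helper name; in ruia this logic lives in Spider.__init__):

from collections.abc import Iterable

def check_start_urls(start_urls):
    """Sketch of the looser check: accept any non-string iterable."""
    if isinstance(start_urls, str) or not isinstance(start_urls, Iterable):
        raise ValueError(
            "Spider must have a param named start_urls, eg: start_urls = ['https://www.github.com']"
        )

# A generator now passes the check instead of being rejected:
check_start_urls(f'https://some.site.com/{index}' for index in range(1, 100000))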

Can I ask what is the cause of ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'

Thanks for your prompt response. Highly appreciate your effort and attitude. :)

I have used the same spider to scrape another financial news webpage. The code is mostly the same; only the CSS selectors vary with the page.

For this website, 'http://news.10jqka.com.cn/realtimenews.html', do you know the cause of <Callback[parse_item]: 'NoneType' object has no attribute 'html'?

Below are the codes:

domain='tenjqka'
domain_chinese = '同花順'
domain_page = 'http://news.10jqka.com.cn/realtimenews.html'


class frame_Item(Item):
    target_item = TextField(css_select='ul.newsText.all')  
    publish_times = TextField(css_select='li > div.newsTimer', many=True) 
    headline_texts = TextField(css_select='li > div.newsDetail > a > strong', many=True)    
    news_urls = AttrField(attr='href', css_select='li > div.newsDetail > a', many=True) 

class text_Item(Item):
    target_item = TextField(css_select="body > div.main-content.clearfix > div.main-fl.fl > div.main-text.atc-content")  
    news_texts = TextField(css_select="p", many=True)  

class realtime_Spider(Spider):
    start_urls = [domain_page]
    concurrency = 5

    async def parse(self, response):
        async for item in frame_Item.get_items(html=response.html):
            publish_times= [scrapy_date_str+' '+publish_time for publish_time in item.publish_times]
            for publish_time, title, url in zip(publish_times, item.headline_texts, item.news_urls):
                timediff = datetime.now() - datetime.strptime(publish_time, '%Y-%m-%d %H:%M')
                if timediff.seconds <= 60 * 60 and re.search('news',url):
                    yield Request(url, callback=self.parse_item,
                                  metadata={'publish_time': publish_time, 'title': title},
                                  headers={'User-Agent': ua.random},
                                  pyppeteer_launch_options={"headless": True},
                                  pyppeteer_page_options={'waitUntil': 'networkidle2'},
                                  close_pyppeteer_browser=True)

    async def parse_item(self, response):
        news_info = []
        async for item in text_Item.get_items(html=response.html):
            publish_time = response.metadata['publish_time']
            title = response.metadata['title']
            news_info.append((publish_time, title, ' '.join(item.news_texts)))

        pd.DataFrame(news_info).to_csv(os.path.join(working_dir, '{}_{}.txt').format(domain, scrapy_date), mode='a', index=None, encoding='utf-8')



def run():
    realtime_Spider.start(middleware=middleware)          # 12. 同花順

if __name__ == '__main__':
    run()
    db.close()

[2019:03:05 19:15:27] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:05 19:15:27] INFO Spider Spider started!
[2019:03:05 19:15:27] INFO Spider Worker started: 3055864995080
[2019:03:05 19:15:27] INFO Spider Worker started: 3055864995216
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56232/devtools/browser/179c9946-9fae-4249-863e-616d9212b160
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56261/devtools/browser/2f63f643-241a-4bb5-ba21-ac93c74cc9f8
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56263/devtools/browser/4eb898d6-2907-409a-873c-33af40346726
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56267/devtools/browser/559dfab3-0ef8-48d9-80d6-0ab03afddadd
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56269/devtools/browser/b686bf95-59ea-4156-ad28-fd255b393fd7
[2019:03:05 19:15:42] INFO Request <Retry url: http://news.10jqka.com.cn/20190305/c610071745.shtml>, Retry times: 1
[2019:03:05 19:15:42] INFO Request <Retry url: http://news.10jqka.com.cn/20190305/c610071136.shtml>, Retry times: 1
[2019:03:05 19:15:42] INFO Request <Retry url: http://news.10jqka.com.cn/20190305/c610072845.shtml>, Retry times: 1
[2019:03:05 19:15:42] INFO Request <Retry url: http://news.10jqka.com.cn/20190305/c610070478.shtml>, Retry times: 1
[I:pyppeteer.connection] connection closed
[I:pyppeteer.connection] connection closed
[I:pyppeteer.connection] connection closed
[I:pyppeteer.connection] connection closed
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 19:15:52] ERROR Request <Error: http://news.10jqka.com.cn/20190305/c610071745.shtml Protocol error Target.createTarget: Target closed.>
[2019:03:05 19:15:52] ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 19:15:52] ERROR Request <Error: http://news.10jqka.com.cn/20190305/c610072845.shtml Protocol error Target.createTarget: Target closed.>
[2019:03:05 19:15:52] ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 19:15:52] ERROR Request <Error: http://news.10jqka.com.cn/20190305/c610071136.shtml Protocol error Target.createTarget: Target closed.>
[2019:03:05 19:15:52] ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 19:15:52] ERROR Request <Error: http://news.10jqka.com.cn/20190305/c610070478.shtml Protocol error Target.createTarget: Target closed.>
[2019:03:05 19:15:52] ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'
[2019:03:05 19:15:52] INFO Spider Stopping spider: Ruia
[2019:03:05 19:15:52] INFO Spider Total requests: 1
[2019:03:05 19:15:52] INFO Spider Time usage: 0:00:25.378160
[2019:03:05 19:15:52] INFO Spider Spider finished!

spider.request is not awaitable

I just started to use Ruia and I must say that I love it.
But I find myself a little confused because some code in the documentation is not working.

The link of the documentation page is https://howie6879.github.io/ruia/en/tutorials/spider.html.

I am trying to replicate:

class MySpider(Spider):
    async def parse(self, response):
        for i in range(10):
            response = await self.request(f'https://some.site/{i}')
            yield self.parse_next(response)

    async def parse_next(self, response):
        print(response.html)

and it gives an error when calling self.request, saying it's not awaitable.

I took a look at the source (installed on my computer) and indeed spider.request now returns a Response object.
My question is: how do I now get the same result as the line response = await self.request(f'https://some.site/{i}') and be able to manipulate the response?

process_item is only called when the callback_result is an Item

This is a question out of curiosity. Why limit process_item to Item results?
I think it should be called on every result.

The lines are 199-202 of spider.py:

                elif isinstance(callback_result, Item):
                    # Process target item
                    await self.process_item(callback_result)

target_item expected error

I wrote a spider; the code is at https://codeshare.io/5NQXr1.
I just want to retrieve the links in one Item element. I have link = AttrField(css_select='a', attr='href'), but when it's called in an async for, I get a target_item expected error. In the examples in this repo, items used with get_items() all have a clean_link or clean_title async method, but I don't think that's the issue. Also, response.html does contain links.

Characters not supported

I ran into a problem where some characters could not be captured in an item.

image

The url is http://qq.ip138.com/train/shanxi/.

asyncio `RuntimeError`

ERROR asyncio Exception in callback BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)
handle: <Handle BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)>
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 516, in _sock_write_done
    self.remove_writer(fd)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 346, in remove_writer
    self._ensure_fd_no_transport(fd)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 251, in _ensure_fd_no_transport
    raise RuntimeError(
RuntimeError: File descriptor 150 is used by transport <_SelectorSocketTransport fd=150 read=idle write=<polling, bufsize=0>>

Getting this quite a bit still. I don't think it's ruia directly, but aiohttp. Any ideas?

One thing that may be causing it is that in clean functions I call other functions synchronously, i.e.:

    async def clean_<...>(self, value):
        return <function>(value)

Could that be causing it? I tried doing return await ... but the error still persisted.

In spider.py, class properties are not safe

class Spider:
    name = 'ruia'  # Used for log
    request_config = None

    # Default values passing to each request object. Not implemented yet.
    headers: dict = {}
    metadata: dict = {}
    kwargs: dict = {}

    res_type: str = 'text'

    # Some fields for statistics
    failed_counts: int = 0
    success_counts: int = 0

    # Concurrency control
    concurrency: int = 3

    # Spider entry
    start_urls: list = []

    # A queue to save coroutines
    worker_tasks: list = []

The class properties may not be safe, especially when there are two spiders in one program.
They should be defined in __init__.
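
A minimal, self-contained demonstration of the problem (plain Python, not ruia code): mutable class attributes are shared by every subclass unless they are re-created in __init__.

class Base:
    # Mirrors the mutable defaults on Spider above.
    metadata: dict = {}
    worker_tasks: list = []

class SpiderA(Base):
    pass

class SpiderB(Base):
    pass

SpiderA.metadata['token'] = 'a'
print(SpiderB.metadata)  # {'token': 'a'} -- both spiders see the same dict object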

HtmlField?

Hello, I want to get the raw HTML code, so I wrote another field, named HtmlField.

import ruia.field
from lxml import etree


class HtmlField(ruia.field._LxmlElementField):
    def _parse_element(self, element):
        return etree.tostring(element)

Is it useful to add this to ruia? Otherwise I'll write a tutorial on adding custom fields.

I don't know if there is a better implementation with the current version.
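
For context, a hypothetical usage sketch of the field above (ArticleItem and the selector are invented; it assumes the HtmlField class defined above is available):

from ruia import Item, TextField

class ArticleItem(Item):
    target_item = TextField(css_select='div.article')
    # HtmlField returns the serialized HTML of the matched element.
    raw_html = HtmlField(css_select='div.article')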

No field for capturing raw LXML elements

I need to capture the raw LXML element(s) and process them before converting to text. Right now I have created a very simple new field (not sure if it's the best way to implement it):

from ruia.field import _LxmlElementField


class ElementField(_LxmlElementField):

    def _parse_element(self, element):
        return element

I think this should be part of ruia core, not an extension.

Log crucial information regardless of log-level

I've reduced the log level of a Spider in my script because I find it too verbose; however, this also filters out crucial info, particularly the completion summary (number of requests, time, etc.):

ruia/ruia/spider.py, lines 280 to 287 in 651fac5:

        self.logger.info(
            f"Total requests: {self.failed_counts + self.success_counts}"
        )
        if self.failed_counts:
            self.logger.info(f"Failed requests: {self.failed_counts}")
        self.logger.info(f"Time usage: {end_time - start_time}")
        self.logger.info("Spider finished!")

This is code I currently use to reduce verbosity:

import logging

# Disable logging (for speed)
logging.root.setLevel(logging.ERROR)

I'm thinking of changing the code so that it shows regardless of log level, but will there ever be a case where you wouldn't want to see it?
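
A possible middle ground, assuming ruia's loggers are named "Request" and "Spider" as the prefixes in the log output elsewhere on this page suggest (please verify against the installed version):

import logging

# Silence the per-request chatter but keep the spider summary lines.
logging.getLogger("Request").setLevel(logging.ERROR)
logging.getLogger("Spider").setLevel(logging.INFO)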

With ten concurrent requests, there's an exception

Mainly because the server rejected our request.

import asyncio
from ruia import Item, TextField, AttrField


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')


async def parse_one_page(page):
    url = f'https://news.ycombinator.com/news?p={page}'
    return await HackerNewsItem.get_items(url=url)


async def main():
    coroutine_list = [parse_one_page(page) for page in range(1, 10)]
    result = await asyncio.gather(*coroutine_list)
    news_list = list()
    for one_page_list in result:
        news_list.extend(one_page_list)
    for news in news_list:
        print(news.title, news.url)


if __name__ == '__main__':
    asyncio.run(main())

result:

C:\Users\wolf\work\ruia\venv\Scripts\python.exe "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\pydevd.py" --multiproc --qt-support=auto --client 127.0.0.1 --port 57124 --file C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py
pydev debugger: process 9764 is connecting

Connected to pydev debugger (build 183.4886.43)
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=1>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=2>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=3>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=4>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=5>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=6>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=7>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=8>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=9>
[2019:01:20 22:08:48]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=8 503 >
[2019:01:20 22:08:48]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=8>, Retry times: 1
[2019:01:20 22:08:48]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=8>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=8 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=8>, Retry times: 2
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=8>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=8 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=8>, Retry times: 3
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=8>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=4 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=4>, Retry times: 1
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=4>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=1 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=1>, Retry times: 1
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=1>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=4 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=4>, Retry times: 2
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=4>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=1 503 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=1>, Retry times: 2
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=1>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=4 503 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=4>, Retry times: 3
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=4>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=4 503 >
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=9 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=9>, Retry times: 1
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=9>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=8 0 >
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=5 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=5>, Retry times: 1
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=5>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=3 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=3>, Retry times: 1
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=3>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=2 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=2>, Retry times: 1
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=2>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=1 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=1>, Retry times: 3
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=1>
[2019:01:20 22:08:52]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=1 503 >
[2019:01:20 22:08:53]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=9 503 >
[2019:01:20 22:08:53]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=9>, Retry times: 2
[2019:01:20 22:08:53]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=9>
[2019:01:20 22:08:57]-asyncio-ERROR base_events: unhandled exception during asyncio.run() shutdown
task: <Task finished coro=<parse_one_page() done, defined at C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py:11> exception=ValueError('can only parse strings')>
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\asyncio\runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python37\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 18, in main
    result = await asyncio.gather(*coroutine_list)
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings
[2019:01:20 22:08:57]-asyncio-ERROR base_events: unhandled exception during asyncio.run() shutdown
task: <Task finished coro=<parse_one_page() done, defined at C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py:11> exception=ValueError('can only parse strings')>
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\asyncio\runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python37\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 18, in main
    result = await asyncio.gather(*coroutine_list)
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\pydevd.py", line 1741, in <module>
    main()
  File "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 27, in <module>
    asyncio.run(main())
  File "C:\Program Files\Python37\lib\asyncio\runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python37\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 18, in main
    result = await asyncio.gather(*coroutine_list)
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings

Start multiple spiders at the same time, for truly asynchronous crawling

With the current design, the pattern seems to be one spider per website. But I want to crawl multiple websites at the same time. Is there a way to start multiple spiders and have them crawl concurrently?

Also, for this use case, designing everything as class methods and class variables does not feel very flexible.

Hook for response

For example:

class SpiderDemo(Spider):

    async def process_succeed_response(self, request, response):
        """Process succeed response"""
        pass

    async def process_failed_response(self, request, response):
        """Process failed response"""
        pass

Support processing invalid callback result types

For example:

import asyncio

from ruia import Spider


class SpiderDemo(Spider):
    start_urls = ['https://www.httpbin.org/get?p=0']
    result = {
        'process_callback_result': False
    }

    async def parse(self, response):
        yield {}


async def process_dict_callback_result(spider_ins, callback_result):
    print(callback_result)
    spider_ins.result['process_callback_result'] = True


class CustomCallbackResultType:

    @classmethod
    def init_spider(cls, spider):
        spider.callback_result_map = spider.callback_result_map or {}
        setattr(spider, 'process_dict_callback_result', process_dict_callback_result)
        spider.callback_result_map.update({'dict': 'process_dict_callback_result'})


CustomCallbackResultType.init_spider(SpiderDemo)

loop = asyncio.new_event_loop()
SpiderDemo.start(loop=loop)
assert SpiderDemo.result['process_callback_result'] == True

How can I add a proxies-rotator into the middleware?

Hi, Thanks for your awesome contribution for this wonderful open library.

Just one question: besides a random UA, I also want to add my own proxy rotator to the ruia spider.

I have searched the .py files for requests and ua but have yet to work out where to plug in my proxy rotator.

Could you tell me where I can plug it into the module?
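
Not an official answer, just a sketch of where a proxy rotator could hook in: a request middleware. The two-argument signature matches the middleware issue further down this page; the aiohttp_kwargs attribute is an assumption about how extra options reach aiohttp in the installed ruia version, so verify it before relying on it.

import random

from ruia import Middleware

middleware = Middleware()

PROXIES = ['http://127.0.0.1:8001', 'http://127.0.0.1:8002']  # hypothetical proxy pool

@middleware.request
async def add_proxy(spider_ins, request):
    # aiohttp supports a per-request `proxy` keyword; passing it via
    # request.aiohttp_kwargs is an assumption, not documented ruia API.
    request.aiohttp_kwargs['proxy'] = random.choice(PROXIES)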

Documentation about IgnoreThisItem

Hi my friend. I'm writing a spider today, and I couldn't find this function in the documentation. However, I found it in the source code.

class CityItem(ruia.Item):
    target_item = ruia.TextField(css_select='body > table:nth-of-type(4) td')
    url = ruia.AttrField(attr='href', css_select='a')
    name = ruia.TextField(css_select='a')

    async def clean_name(self, value):
        raise ruia.exceptions.IgnoreThisItem

I have a small suggestion: we should add IgnoreThisItem to the global namespace of ruia, instead of ruia.exceptions.IgnoreThisItem.

I'll write the docs in a few days.

I think writing it here is OK.

`DELAY` attribute specifically for retries

I assumed the DELAY attr would set the delay for retries, but instead it applies to all requests. I would appreciate a DELAY attr specifically for retries (RETRY_DELAY). I'd be happy to implement it if given the go-ahead.

Thank you for this great library!
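
To make the proposal concrete, a sketch of how it might look. DELAY is the existing key this issue discusses, RETRIES appears to be the existing retry-count setting (the logs on this page show up to three retry attempts), and RETRY_DELAY is the proposed, not-yet-implemented key:

from ruia import Spider

class MySpider(Spider):
    start_urls = ['https://example.com/']
    request_config = {
        'RETRIES': 3,
        'DELAY': 0,         # today: applies to every request
        'RETRY_DELAY': 10,  # proposed: wait only before a retry
    }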

How can I scrape the text contents from a list of URLs extracted from the parse function in a spider?

from ruia import AttrField, TextField, Item
from ruia_pyppeteer import PyppeteerSpider as Spider
from ruia_pyppeteer import PyppeteerRequest as Request
from ruia_ua import middleware

domain='caixin'
domain_page = 'http://www.caixin.com/search/scroll/index.jsp'

class frame_Item(Item):
    target_item = TextField(css_select='body > div.indexBody > div.news > div.news_content')  
    publish_times = TextField(css_select='dl > dd > span', many=True)
    headline_texts = TextField(css_select='dl> dd > a', many=True)  
    news_urls = AttrField(attr='href',css_select='dl> dd > a',many=True) 


class texts_Item(Item):
    target_item = TextField(css_select='r"/body > div.comMain > div.conlf > #the_content"') 
    news_texts = TextField(css_select='"div.content > div.textbox" ', many=True)


class caixin_realtime_Spider(Spider):
    start_urls = [domain_page]
    concurrency = 5

    async def parse(self, response):
        async for item in frame_Item.get_items(html=response.html):
            yield item

    async def process_item(self, item:frame_Item):
        for url in item.news_urls:
            yield Request(url, callback=self.parse_item) 

    async def parse_item(self, response):
        async for item in caixin_news_Item.get_items(html=response.html):
             print(item.news_texts)


def run():
    print('Scraping ', domain,  domain_page)
    caixin_realtime_Spider.start(middleware=middleware)       


if __name__ == '__main__':
    run()
    db.close()

The thing is... I only get the following result:

Scraping caixin 財新網 http://www.caixin.com/search/scroll/index.jsp
[2019:03:05 17:14:57] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:05 17:14:57] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:05 17:14:57] INFO Spider Spider started!
[2019:03:05 17:14:57] INFO Spider Worker started: 1947326139520
[2019:03:05 17:14:57] INFO Spider Worker started: 1947326139656
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:54430/devtools/browser/b72b7efd-7b6b-4353-a7d4-acec7a7f02b4
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 17:15:00] ERROR Spider object async_generator can't be used in 'await' expression
[2019:03:05 17:15:00] INFO Spider Stopping spider: Ruia
[2019:03:05 17:15:00] INFO Spider Total requests: 1
[2019:03:05 17:15:00] INFO Spider Time usage: 0:00:02.876312
[2019:03:05 17:15:00] INFO Spider Spider finished!
[2019:03:05 17:15:00] ERROR asyncio Task was destroyed but it is pending!
task: <Task pending coro=<async_generator_athrow()>>
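
The "object async_generator can't be used in 'await' expression" error appears because process_item contains a yield, which turns it into an async generator that ruia then tries to await. One possible restructuring (a sketch, not an official fix) is to yield the follow-up requests from parse() itself, as the 10jqka spider earlier on this page does:

class caixin_realtime_Spider(Spider):
    start_urls = [domain_page]
    concurrency = 5

    async def parse(self, response):
        async for item in frame_Item.get_items(html=response.html):
            for url in item.news_urls:
                # Yield Request objects directly instead of from process_item.
                yield Request(url, callback=self.parse_item)

    async def parse_item(self, response):
        # texts_Item is the Item class defined in the snippet above.
        async for item in texts_Item.get_items(html=response.html):
            print(item.news_texts)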

Question: is there any option for continuous scraping?

Scrapy (an old but time-proven Twisted-based framework) has Frontera, a library for managing scalable clusters of scrapers, where URLs can be added to a queue dynamically by a management node and the actual scraping is performed by other nodes.
So my question is: is there a mechanism for doing similar things with ruia? (I was about to write my own implementation of something similar.)

Crashes on Windows

Traceback (most recent call last):
  File "weibospider.py", line 26, in <module>
    HackerNewsSpider.start()
  File "C:\Users\hwywhywl\StudioProjects\weibo_splider\lib\site-packages\aspider\spider.py", line 92, in start
    spider_ins.loop.add_signal_handler(_signal, lambda: asyncio.ensure_future(spider_ins.stop(_signal)))
  File "C:\Users\hwywhywl\Anaconda3\lib\asyncio\events.py", line 499, in add_signal_handler
    raise NotImplementedError
NotImplementedError

loop.add_signal_handler is currently not supported on Windows.

Suggestion and question

When creating a custom Item class, could AttrField return an absolute address when extracting href? That would save the time of writing a clean function. :)
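
A sketch of the clean-function workaround this request wants to avoid (the item class, selectors and base URL are placeholders):

from urllib.parse import urljoin

from ruia import AttrField, Item, TextField

class LinkItem(Item):
    target_item = TextField(css_select='li')
    url = AttrField(attr='href', css_select='a')

    async def clean_url(self, value):
        # A clean_* method only receives the extracted value, so the base URL
        # has to be hard-coded here -- which is exactly why built-in
        # absolute-URL support would be handy.
        return urljoin('http://example.com/', value)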

target_item is expected error

I have the following piece of code, taken (and modified) from Spider Control:

import re

import aiofiles
from ruia import TextField, Item, Spider


class Player(Item):
    # target_item = TextField(xpath_select='//div[@class="info"]/div[@class="meta bp3-text-overflow-ellipsis"]')
    player_id = TextField(xpath_select='//div[@class="info"]/h1')
    info = TextField(xpath_select='//div[@class="info"]/div[@class="meta bp3-text-overflow-ellipsis"]')

    async def clean_player_id(self, value):
        return int(re.search(r'[0-9]{2,7}', value).group())

    async def clean_info(self, value):
        return re.search(r'([A-Z](.\ [A-Z])?[a-z]+ *)+', value).group()


class SoFifa(Spider):
    start_urls = [
        "https://sofifa.com/player/239053/federico-valverde/200012/",
        "https://sofifa.com/player/197928/jonathan-bond/200012/",
        "https://sofifa.com/player/248243/eduardo-camavinga/200012/"
    ]

    async def parse(self, response):
        async for item in Player.get_items(html=response.html):
            yield item

    async def process_item(self, item: Player):
        print(f"{item.info} (ID: {item.player_id})")

        """Ruia build-in method"""
        async with aiofiles.open('./players.txt', 'a') as f:
            await f.write(f"{item.info} (ID: {item.player_id})" + '\n')


if __name__ == '__main__':
    SoFifa.start()

And when running, I get the following error:

[2019:11:27 20:22:42] INFO Spider Spider started!
[2019:11:27 20:22:42] INFO Spider Worker started: 140065780358320
[2019:11:27 20:22:42] INFO Spider Worker started: 140065780358496
[2019:11:27 20:22:42] INFO Request <GET: https://sofifa.com/player/239053/federico-valverde/200012/>
[2019:11:27 20:22:42] INFO Request <GET: https://sofifa.com/player/197928/jonathan-bond/200012/>
[2019:11:27 20:22:42] INFO Request <GET: https://sofifa.com/player/248243/eduardo-camavinga/200012/>
[2019:11:27 20:22:43] ERROR Spider target_item is expected
[2019:11:27 20:22:43] ERROR Spider target_item is expected
[2019:11:27 20:22:43] ERROR Spider target_item is expected
[2019:11:27 20:22:43] INFO Spider Stopping spider: Ruia
[2019:11:27 20:22:43] INFO Spider Total requests: 3
[2019:11:27 20:22:43] INFO Spider Time usage: 0:00:00.726866
[2019:11:27 20:22:43] INFO Spider Spider finished!

But when I uncomment

target_item = TextField(xpath_select='//div[@class="info"]/div[@class="meta bp3-text-overflow-ellipsis"]')

the error disappears. Why is target_item required, or what am I misunderstanding?

Thanks
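
For what it's worth: target_item tells get_items() which repeating block each item is built from, so it is required there. If each page is a single record, a hedged alternative (assuming the installed ruia version provides Item.get_item, which does not need target_item) would be a parse like this:

    async def parse(self, response):
        # Hypothetical alternative: build one Player per page, no target_item needed.
        item = await Player.get_item(html=response.html)
        yield item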

Show URL in Error for easier debugging

I think errors would be more useful if they also showed the URL of the parsed page. Example:

ERROR Spider <Item: extract ... error, please check selector or set parameter named default>, https://...

I hacked a solution together by passing around the url parameter, but I can't think of a clean solution ATM. Any ideas? I can also push my changes if you would like to see them (very hacky).

The middleware in the tutorial takes only 1 parameter, but it should take 2; the error message should also be fixed

Version: ruia 0.5.7

Problem 1: the middleware code in the tutorial is wrong.
This is the custom middleware from the tutorial:

@middleware.request
async def print_on_request(request):
    ua = 'ruia user-agent'
    request.headers.update({'User-Agent': ua})

This raises an error, because the middleware is invoked as await middleware(self, request), i.e. with 2 arguments.
The actual error is:

    await middleware(self, request)

TypeError: print_on_request() takes 1 positional argument but 2 were given

Problem 2: the error message is wrong.
The displayed error message is:

[2019:05:04 22:51:03] ERROR Spider  <Middleware print_on_request: must be a coroutine function

The code is at line 210 of spider.py:

                except TypeError:
                    self.logger.error(
                        f"<Middleware {middleware.__name__}: must be a coroutine function"
                    )
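
For reference, the corrected middleware matching the two-argument call await middleware(self, request) quoted above:

from ruia import Middleware

middleware = Middleware()

@middleware.request
async def print_on_request(spider_ins, request):
    ua = 'ruia user-agent'
    request.headers.update({'User-Agent': ua})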

Share the event loop with another application

I am looking for a Python micro crawler framework to use inside another Python application.
I would like to share the event loop with my application and run many spiders,
like:

async def main():
    [some code]
    await spider1.start()
    await spider2.start()
loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Is there a simple way to do that?
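
A sketch of one possibility, assuming the installed ruia version exposes Spider.async_start() (the is_async_start flag in Spider.__init__ quoted earlier on this page hints at it); Spider1 and Spider2 stand for your own spider classes:

import asyncio

async def main():
    # ... your application code ...
    await Spider1.async_start()
    await Spider2.async_start()

loop = asyncio.get_event_loop()
loop.run_until_complete(main())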

A Ruia plugin that uses the motor to store data

ruia-motor will automatically store data to MongoDB:

from ruia import AttrField, TextField, Item, Spider
from ruia_motor import RuiaMotor


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/news?p=1', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=response.html):
            yield RuiaMotor(collection='hn_demo', data=item.results)


async def init_plugins_after_start(spider_ins):
    spider_ins.mongodb_config = {
        'host': '127.0.0.1',
        'port': 27017,
        'db': 'ruia_motor'
    }
    RuiaMotor.init_spider(spider_ins=spider_ins)


if __name__ == '__main__':
    HackerNewsSpider.start(after_start=init_plugins_after_start)

`text()` in xpath selector causes an error

I assume this is because of using TextField rather than another field. Which is more ideal? The error is caused by line 120:

strings = [node.strip() for node in element.itertext()]

'lxml.etree._ElementUnicodeResult' object has no attribute 'itertext'

I was thinking that itertext shouldn't be used when the element has that type, to avoid the error. That way text() will work fine.
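
A sketch of the suggested guard (element_to_strings is a made-up helper name; in ruia the logic lives around line 120 of field.py):

def element_to_strings(element):
    # lxml's _ElementUnicodeResult is a str subclass, so a text() xpath result
    # is caught here and returned as-is instead of calling itertext() on it.
    if isinstance(element, str):
        return [element.strip()]
    return [node.strip() for node in element.itertext()]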

About AttrField.extract_value

def extract_value(self, html, is_source=False):
    """
    Use css_select or re_select to extract a field value
    :return:
    """
    if self.css_select:
        value = html.cssselect(self.css_select)
        value = value[0].get(self.attr, value) if len(value) == 1 else value
    elif self.xpath_select:
        value = html.xpath(self.xpath_select)
    else:
        raise ValueError('%s field: css_select or xpath_select is expected' % self.__class__.__name__)
    if is_source:
        return value
    if self.default is not None:
        value = value if value else self.default
    return value

There's such a line:

value = value[0].get(self.attr, value) if len(value) == 1 else value

My idea is to extract the attribute from each matched element, and then check the count:

if self.css_select:
    value = html.cssselect(self.css_select)
    value = [item.get(self.attr, item) for item in value]
    if len(value) == 1:
        value = value[0]

Besides, I don't think it's necessary to check the count at all.
It may cause confusion when some pages have several items while other pages have only one.
For example, consider a field named 'tag'.
I think always returning a list is better.

Support for multiple requests

For example:

from ruia import Spider, Middleware


class TestSpider(Spider):
    start_urls = ['http://www.httpbin.org/get']

    async def parse(self, response):
        pages = [{'url': f'http://www.httpbin.org/get?p={i}'} for i in range(1, 9)]
        async for resp in self.multiple_request(pages):
            yield self.parse_next(resp, any_param='hello')

    async def parse_next(self, response, any_param):
        yield self.request(
            url=response.url,
            callback=self.parse_item
        )

    async def parse_item(self, response):
        item_data = response.html

Rate Limiting?

Hi,

Thanks for this wonderful project.

What is the recommended way to add rate limiting to the spider?

I normally add a randomized delay when scraping, like:

requests.get(url)
time.sleep(abs(random.gauss(1, 0.5)) * 2)

To get a random delay with a mean of 2.

Another common one is rate limiting to 10 requests per minute or 1 request per 5 seconds.

What would be the recommended way to rate limit in ruia?
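
There does not seem to be a dedicated rate limiter, but here is a rough sketch using knobs that appear elsewhere on this page (the per-spider concurrency attribute and the request_config DELAY key, which the DELAY issue above says applies to every request):

from ruia import Spider

class PoliteSpider(Spider):
    start_urls = ['https://example.com/']
    concurrency = 1                 # at most one request in flight at a time
    request_config = {'DELAY': 5}   # wait roughly 5 seconds around each request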

`TextField` strips strings which may not be desirable

strings = [node.strip() for node in element.itertext()]

My use case is extracting paragraphs which have newlines between them, and these are stripped out by TextField. Should a new field be introduced (I have already made one for my scraper), or should the stripping be optional? Perhaps both is best.
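
A sketch of the kind of field described (the issue notes one has already been written); _parse_element and _LxmlElementField are the same hooks used in the HtmlField and ElementField issues above:

from ruia.field import _LxmlElementField

class RawTextField(_LxmlElementField):
    def _parse_element(self, element):
        # Keep the original whitespace and newlines instead of stripping
        # each text node.
        return ''.join(element.itertext())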

ruia does not seem to use a queue to keep track of all tasks; if the crawler stops unexpectedly, the next run has to start over from scratch

I used Scrapy a lot before, but Scrapy is built on Twisted. I happened to discover ruia, an async framework based on aiohttp and asyncio. After reading the docs, though, I found there is no scheduler like Scrapy's. If the spider stops suddenly, is there no way to resume crawling from where it left off? If the volume is large, starting over means a lot of repeated work. Has the developer considered adding this feature? With a scheduler it would also be fairly easy to deduplicate tasks.
