howie6879 / ruia

Async Python 3.6+ web scraping micro-framework based on asyncio

Home Page: https://www.howie6879.cn/ruia/

License: Apache License 2.0

Python 94.43% HTML 5.54% Shell 0.04%
asyncio aiohttp asyncio-spider crawler crawling-framework spider uvloop ruia python-ruia middlewares

ruia's Introduction

Ruia logo

Ruia

🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio.

⚡ Write less, run faster.


Overview

Ruia is an async web scraping micro-framework written with asyncio and aiohttp. It aims to make crawling URLs as convenient as possible.

Write less, run faster:

Features

  • Easy: Declarative programming
  • Fast: Powered by asyncio
  • Extensible: Middlewares and plugins
  • Powerful: JavaScript support

Installation

# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# New features
pip install git+https://github.com/howie6879/ruia

Tutorials

  1. Overview
  2. Installation
  3. Define Data Items
  4. Spider Control
  5. Request & Response
  6. Customize Middleware
  7. Write a Plugin

TODO

  • Caching for debugging, to reduce repeated live requests: ruia-cache
  • Provide an easy way to debug scripts: ruia-shell
  • Distributed crawling/scraping

Contribution

Ruia is still under development; feel free to open issues and pull requests:

  • Report or fix bugs
  • Require or publish plugins
  • Write or fix documentation
  • Add test cases

Notice: we use black to format the code.

Thanks

ruia's People

Contributors

123seven, abmyii, daijiangtian, duolaaoa, fengdongfa1995, fossabot, howie6879, laggardkernel, leezj9671, maxzheng, panhaoyu, ruiruizhou, ruter


ruia's Issues

Type checking for start_urls

    def __init__(self, middleware=None, loop=None, is_async_start=False):
        if not self.start_urls or not isinstance(self.start_urls, list):
            raise ValueError("Spider must have a param named start_urls, eg: start_urls = ['https://www.github.com']")

This type check should accept any collections.abc.Iterable, not just list.

Here's an example:

I want to crawl a website like Hacker News.
Pages are ordered by number,
so I wrote start_urls like this:

start_urls = (f'https://some.site.com/{index}' for index in range(1, 100000))

Then it took a long time to start crawling.
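
For illustration, a minimal sketch of the looser check proposed above (check_start_urls is a made-up helper name; in ruia this logic lives in Spider.__init__):

from collections.abc import Iterable

def check_start_urls(start_urls):
    """Sketch of the looser check: accept any non-string iterable."""
    if isinstance(start_urls, str) or not isinstance(start_urls, Iterable):
        raise ValueError(
            "Spider must have a param named start_urls, eg: start_urls = ['https://www.github.com']"
        )

# A generator now passes the check instead of being rejected:
check_start_urls(f'https://some.site.com/{index}' for index in range(1, 100000))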

Can I ask what is the cause of ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'

Thanks for your prompt response. Highly appreciate your effort and attitude. :)

I have used the same spider to scrape another financial news webpage. The code is mostly the same; only the CSS selectors vary with the page.

For this website, 'http://news.10jqka.com.cn/realtimenews.html', do you know the cause of <Callback[parse_item]: 'NoneType' object has no attribute 'html'?

Below are the codes:

domain='tenjqka'
domain_chinese = '同花順'
domain_page = 'http://news.10jqka.com.cn/realtimenews.html'


class frame_Item(Item):
    target_item = TextField(css_select='ul.newsText.all')  
    publish_times = TextField(css_select='li > div.newsTimer', many=True) 
    headline_texts = TextField(css_select='li > div.newsDetail > a > strong', many=True)    
    news_urls = AttrField(attr='href', css_select='li > div.newsDetail > a', many=True) 

class text_Item(Item):
    target_item = TextField(css_select="body > div.main-content.clearfix > div.main-fl.fl > div.main-text.atc-content")  
    news_texts = TextField(css_select="p", many=True)  

class realtime_Spider(Spider):
    start_urls = [domain_page]
    concurrency = 5

    async def parse(self, response):
        async for item in frame_Item.get_items(html=response.html):
            publish_times= [scrapy_date_str+' '+publish_time for publish_time in item.publish_times]
            for publish_time, title, url in zip(publish_times, item.headline_texts, item.news_urls):
                timediff = datetime.now() - datetime.strptime(publish_time, '%Y-%m-%d %H:%M')
                if timediff.seconds <= 60 * 60 and re.search('news',url):
                    yield Request(url, callback=self.parse_item,
                                  metadata={'publish_time': publish_time, 'title': title},
                                  headers={'User-Agent': ua.random},
                                  pyppeteer_launch_options={"headless": True},
                                  pyppeteer_page_options={'waitUntil': 'networkidle2'},
                                  close_pyppeteer_browser=True)

    async def parse_item(self, response):
        news_info = []
        async for item in text_Item.get_items(html=response.html):
            publish_time = response.metadata['publish_time']
            title = response.metadata['title']
            news_info.append((publish_time, title, ' '.join(item.news_texts)))

        pd.DataFrame(news_info).to_csv(os.path.join(working_dir, '{}_{}.txt').format(domain, scrapy_date), mode='a', index=None, encoding='utf-8')



def run():
    realtime_Spider.start(middleware=middleware)          # 12. 同花順

if __name__ == '__main__':
    run()
    db.close()

[2019:03:05 19:15:27] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:05 19:15:27] INFO Spider Spider started!
[2019:03:05 19:15:27] INFO Spider Worker started: 3055864995080
[2019:03:05 19:15:27] INFO Spider Worker started: 3055864995216
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56232/devtools/browser/179c9946-9fae-4249-863e-616d9212b160
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56261/devtools/browser/2f63f643-241a-4bb5-ba21-ac93c74cc9f8
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56263/devtools/browser/4eb898d6-2907-409a-873c-33af40346726
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56267/devtools/browser/559dfab3-0ef8-48d9-80d6-0ab03afddadd
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:56269/devtools/browser/b686bf95-59ea-4156-ad28-fd255b393fd7
[2019:03:05 19:15:42] INFO Request <Retry url: http://news.10jqka.com.cn/20190305/c610071745.shtml>, Retry times: 1
[2019:03:05 19:15:42] INFO Request <Retry url: http://news.10jqka.com.cn/20190305/c610071136.shtml>, Retry times: 1
[2019:03:05 19:15:42] INFO Request <Retry url: http://news.10jqka.com.cn/20190305/c610072845.shtml>, Retry times: 1
[2019:03:05 19:15:42] INFO Request <Retry url: http://news.10jqka.com.cn/20190305/c610070478.shtml>, Retry times: 1
[I:pyppeteer.connection] connection closed
[I:pyppeteer.connection] connection closed
[I:pyppeteer.connection] connection closed
[I:pyppeteer.connection] connection closed
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 19:15:52] ERROR Request <Error: http://news.10jqka.com.cn/20190305/c610071745.shtml Protocol error Target.createTarget: Target closed.>
[2019:03:05 19:15:52] ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 19:15:52] ERROR Request <Error: http://news.10jqka.com.cn/20190305/c610072845.shtml Protocol error Target.createTarget: Target closed.>
[2019:03:05 19:15:52] ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 19:15:52] ERROR Request <Error: http://news.10jqka.com.cn/20190305/c610071136.shtml Protocol error Target.createTarget: Target closed.>
[2019:03:05 19:15:52] ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'
[I:pyppeteer.launcher] terminate chrome process...
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 19:15:52] ERROR Request <Error: http://news.10jqka.com.cn/20190305/c610070478.shtml Protocol error Target.createTarget: Target closed.>
[2019:03:05 19:15:52] ERROR Spider <Callback[parse_item]: 'NoneType' object has no attribute 'html'
[2019:03:05 19:15:52] INFO Spider Stopping spider: Ruia
[2019:03:05 19:15:52] INFO Spider Total requests: 1
[2019:03:05 19:15:52] INFO Spider Time usage: 0:00:25.378160
[2019:03:05 19:15:52] INFO Spider Spider finished!

spider.request is not awaitable

I just started to use Ruia and I must say that I love it.
But I find myself a little confused because some code in the documentation is not working.

The link of the documentation page is https://howie6879.github.io/ruia/en/tutorials/spider.html.

I am trying to replicate:

class MySpider(Spider):
    async def parse(self, response):
        for i in range(10):
            response = await self.request(f'https://some.site/{i}')
            yield self.parse_next(response)

    async def parse_next(self, response):
        print(response.html)

and it gives an error when calling self.request, saying it's not awaitable.

I took a look at the source (installed on my computer) and indeed spider.request now returns a Response object.
My question is: how do I now get the same result as the line response = await self.request(f'https://some.site/{i}') and be able to manipulate the response?

process_item is only called when the callback_result is an Item

This is a question out of curiosity. Why limit process_item to Item results?
I think it should be called on every result.

The lines are 199-202 of spider.py:

                elif isinstance(callback_result, Item):
                    # Process target item
                    await self.process_item(callback_result)

target_item expected error

I wrote a spider; the code is at https://codeshare.io/5NQXr1.
I just want to retrieve the links in one Item element. I have link = AttrField(css_select='a', attr='href'), but when it's called in an async for, I get a target_item expected error. In the examples in this repo, items used with get_items() all have a clean_link or clean_title async method, but I don't think that's the issue. Also, response.html does contain links.

Characters not supported

I ran into a problem where some characters could not be captured in an item.

image

The url is http://qq.ip138.com/train/shanxi/.

asyncio `RuntimeError`

ERROR asyncio Exception in callback BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)
handle: <Handle BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)>
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 516, in _sock_write_done
    self.remove_writer(fd)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 346, in remove_writer
    self._ensure_fd_no_transport(fd)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 251, in _ensure_fd_no_transport
    raise RuntimeError(
RuntimeError: File descriptor 150 is used by transport <_SelectorSocketTransport fd=150 read=idle write=<polling, bufsize=0>>

Getting this quite a bit still. I don't think it's ruia directly, but aiohttp. Any ideas?

One thing that may be causing it is that in clean functions I call other functions synchronously, i.e.:

    async def clean_<...>(self, value):
        return <function>(value)

Could that be causing it? I tried doing return await ... but the error still persisted.

In spider.py, class properties are not safe

class Spider:
    name = 'ruia'  # Used for log
    request_config = None

    # Default values passing to each request object. Not implemented yet.
    headers: dict = {}
    metadata: dict = {}
    kwargs: dict = {}

    res_type: str = 'text'

    # Some fields for statistics
    failed_counts: int = 0
    success_counts: int = 0

    # Concurrency control
    concurrency: int = 3

    # Spider entry
    start_urls: list = []

    # A queue to save coroutines
    worker_tasks: list = []

The class properties may not be safe, especially when there are two spiders in one program.
They should be defined in __init__.
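
A minimal, self-contained demonstration of the problem (plain Python, not ruia code): mutable class attributes are shared by every subclass unless they are re-created in __init__.

class Base:
    # Mirrors the mutable defaults on Spider above.
    metadata: dict = {}
    worker_tasks: list = []

class SpiderA(Base):
    pass

class SpiderB(Base):
    pass

SpiderA.metadata['token'] = 'a'
print(SpiderB.metadata)  # {'token': 'a'} -- both spiders see the same dict object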

HtmlField?

Hello, I want to get the raw HTML code, so I wrote another field, named HtmlField.

import ruia.field
from lxml import etree


class HtmlField(ruia.field._LxmlElementField):
    def _parse_element(self, element):
        return etree.tostring(element)

Is it useful to add this to ruia? Otherwise I'll write a tutorial on adding custom fields.

I don't know if there is a better implementation with the current version.
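
For context, a hypothetical usage sketch of the field above (ArticleItem and the selector are invented; it assumes the HtmlField class defined above is available):

from ruia import Item, TextField

class ArticleItem(Item):
    target_item = TextField(css_select='div.article')
    # HtmlField returns the serialized HTML of the matched element.
    raw_html = HtmlField(css_select='div.article')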

No field for capturing raw LXML elements

I need to capture the raw LXML element(s) and process them before converting to text. Right now I have created a very simple new field (not sure if it's the best way to implement it):

from ruia.field import _LxmlElementField


class ElementField(_LxmlElementField):

    def _parse_element(self, element):
        return element

I think this should be part of ruia core, not an extension.

Log crucial information regardless of log-level

I've reduced the log level of a Spider in my script because I find it too verbose; however, this also filters out crucial info, particularly the completion summary (number of requests, time, etc.):

ruia/ruia/spider.py, lines 280 to 287 in 651fac5:

        self.logger.info(
            f"Total requests: {self.failed_counts + self.success_counts}"
        )
        if self.failed_counts:
            self.logger.info(f"Failed requests: {self.failed_counts}")
        self.logger.info(f"Time usage: {end_time - start_time}")
        self.logger.info("Spider finished!")

This is code I currently use to reduce verbosity:

import logging

# Disable logging (for speed)
logging.root.setLevel(logging.ERROR)

I'm thinking of changing the code so that it shows regardless of log level, but will there ever be a case where you wouldn't want to see it?
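
A possible middle ground, assuming ruia's loggers are named "Request" and "Spider" as the prefixes in the log output elsewhere on this page suggest (please verify against the installed version):

import logging

# Silence the per-request chatter but keep the spider summary lines.
logging.getLogger("Request").setLevel(logging.ERROR)
logging.getLogger("Spider").setLevel(logging.INFO)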

With ten concurrent requests, there's an exception

Mainly because the server rejected our request.

import asyncio
from ruia import Item, TextField, AttrField


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')


async def parse_one_page(page):
    url = f'https://news.ycombinator.com/news?p={page}'
    return await HackerNewsItem.get_items(url=url)


async def main():
    coroutine_list = [parse_one_page(page) for page in range(1, 10)]
    result = await asyncio.gather(*coroutine_list)
    news_list = list()
    for one_page_list in result:
        news_list.extend(one_page_list)
    for news in news_list:
        print(news.title, news.url)


if __name__ == '__main__':
    asyncio.run(main())

result:

C:\Users\wolf\work\ruia\venv\Scripts\python.exe "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\pydevd.py" --multiproc --qt-support=auto --client 127.0.0.1 --port 57124 --file C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py
pydev debugger: process 9764 is connecting

Connected to pydev debugger (build 183.4886.43)
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=1>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=2>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=3>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=4>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=5>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=6>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=7>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=8>
[2019:01:20 22:08:47]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=9>
[2019:01:20 22:08:48]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=8 503 >
[2019:01:20 22:08:48]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=8>, Retry times: 1
[2019:01:20 22:08:48]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=8>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=8 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=8>, Retry times: 2
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=8>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=8 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=8>, Retry times: 3
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=8>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=4 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=4>, Retry times: 1
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=4>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=1 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=1>, Retry times: 1
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=1>
[2019:01:20 22:08:49]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=4 503 >
[2019:01:20 22:08:49]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=4>, Retry times: 2
[2019:01:20 22:08:49]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=4>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=1 503 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=1>, Retry times: 2
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=1>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=4 503 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=4>, Retry times: 3
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=4>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=4 503 >
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=9 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=9>, Retry times: 1
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=9>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=8 0 >
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=5 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=5>, Retry times: 1
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=5>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=3 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=3>, Retry times: 1
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=3>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=2 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=2>, Retry times: 1
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=2>
[2019:01:20 22:08:50]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=1 0 >
[2019:01:20 22:08:50]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=1>, Retry times: 3
[2019:01:20 22:08:50]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=1>
[2019:01:20 22:08:52]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=1 503 >
[2019:01:20 22:08:53]-Request-ERROR request: <Error: https://news.ycombinator.com/news?p=9 503 >
[2019:01:20 22:08:53]-Request-INFO  request: <Retry url: https://news.ycombinator.com/news?p=9>, Retry times: 2
[2019:01:20 22:08:53]-Request-INFO  request: <GET: https://news.ycombinator.com/news?p=9>
[2019:01:20 22:08:57]-asyncio-ERROR base_events: unhandled exception during asyncio.run() shutdown
task: <Task finished coro=<parse_one_page() done, defined at C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py:11> exception=ValueError('can only parse strings')>
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\asyncio\runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python37\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 18, in main
    result = await asyncio.gather(*coroutine_list)
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings
[2019:01:20 22:08:57]-asyncio-ERROR base_events: unhandled exception during asyncio.run() shutdown
task: <Task finished coro=<parse_one_page() done, defined at C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py:11> exception=ValueError('can only parse strings')>
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\asyncio\runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python37\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 18, in main
    result = await asyncio.gather(*coroutine_list)
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\pydevd.py", line 1741, in <module>
    main()
  File "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2018.3.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 27, in <module>
    asyncio.run(main())
  File "C:\Program Files\Python37\lib\asyncio\runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python37\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 18, in main
    result = await asyncio.gather(*coroutine_list)
  File "C:/Users/wolf/work/ruia/examples/concise_hacker_news_spider/main.py", line 13, in parse_one_page
    return await HackerNewsItem.get_items(url=url)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 53, in get_items
    html_etree = await cls._get_html(html, url, **kwargs)
  File "C:\Users\wolf\work\ruia\ruia\item.py", line 41, in _get_html
    return etree.HTML(html)
  File "src\lxml\etree.pyx", line 3159, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
ValueError: can only parse strings

Start multiple spiders at the same time, for truly asynchronous crawling

With the current design, the pattern seems to be one spider per website. But I want to crawl multiple websites at the same time. Is there a way to start multiple spiders and have them crawl concurrently?

Also, for this use case, designing everything as class methods and class variables does not feel very flexible.

Hook for response

For example:

class SpiderDemo(Spider):

    async def process_succeed_response(self, request, response):
        """Process succeed response"""
        pass

    async def process_failed_response(self, request, response):
        """Process failed response"""
        pass

Support processing invalid callback result types

For example:

import asyncio

from ruia import Spider


class SpiderDemo(Spider):
    start_urls = ['https://www.httpbin.org/get?p=0']
    result = {
        'process_callback_result': False
    }

    async def parse(self, response):
        yield {}


async def process_dict_callback_result(spider_ins, callback_result):
    print(callback_result)
    spider_ins.result['process_callback_result'] = True


class CustomCallbackResultType:

    @classmethod
    def init_spider(cls, spider):
        spider.callback_result_map = spider.callback_result_map or {}
        setattr(spider, 'process_dict_callback_result', process_dict_callback_result)
        spider.callback_result_map.update({'dict': 'process_dict_callback_result'})


CustomCallbackResultType.init_spider(SpiderDemo)

loop = asyncio.new_event_loop()
SpiderDemo.start(loop=loop)
assert SpiderDemo.result['process_callback_result'] == True

How can I add a proxies-rotator into the middleware?

Hi, Thanks for your awesome contribution for this wonderful open library.

Just one question: besides a random UA, I also want to add my own proxy rotator to the ruia spider.

I have searched the .py files for requests and ua but have yet to work out where to plug in my proxy rotator.

Could you tell me where I can plug it into the module?
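
Not an official answer, just a sketch of where a proxy rotator could hook in: a request middleware. The two-argument signature matches the middleware issue further down this page; the aiohttp_kwargs attribute is an assumption about how extra options reach aiohttp in the installed ruia version, so verify it before relying on it.

import random

from ruia import Middleware

middleware = Middleware()

PROXIES = ['http://127.0.0.1:8001', 'http://127.0.0.1:8002']  # hypothetical proxy pool

@middleware.request
async def add_proxy(spider_ins, request):
    # aiohttp supports a per-request `proxy` keyword; passing it via
    # request.aiohttp_kwargs is an assumption, not documented ruia API.
    request.aiohttp_kwargs['proxy'] = random.choice(PROXIES)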

Documentation about IgnoreThisItem

Hi my friend. I'm writing a spider today, and I couldn't find this function in the documentation. However, I found it in the source code.

class CityItem(ruia.Item):
    target_item = ruia.TextField(css_select='body > table:nth-of-type(4) td')
    url = ruia.AttrField(attr='href', css_select='a')
    name = ruia.TextField(css_select='a')

    async def clean_name(self, value):
        raise ruia.exceptions.IgnoreThisItem

I have a small suggestion: we should add IgnoreThisItem to the global namespace of ruia, instead of ruia.exceptions.IgnoreThisItem.

I'll write the docs in a few days.

I think writing it here is OK.

`DELAY` attribute specifically for retries

I assumed the DELAY attr would set the delay for retries, but instead it applies to all requests. I would appreciate a DELAY attr specifically for retries (RETRY_DELAY). I'd be happy to implement it if given the go-ahead.

Thank you for this great library!
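
To make the proposal concrete, a sketch of how it might look. DELAY is the existing key this issue discusses, RETRIES appears to be the existing retry-count setting (the logs on this page show up to three retry attempts), and RETRY_DELAY is the proposed, not-yet-implemented key:

from ruia import Spider

class MySpider(Spider):
    start_urls = ['https://example.com/']
    request_config = {
        'RETRIES': 3,
        'DELAY': 0,         # today: applies to every request
        'RETRY_DELAY': 10,  # proposed: wait only before a retry
    }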

How can I scrape the text contents from a list of URLs extracted from the parse function in a spider?

from ruia import AttrField, TextField, Item
from ruia_pyppeteer import PyppeteerSpider as Spider
from ruia_pyppeteer import PyppeteerRequest as Request
from ruia_ua import middleware

domain='caixin'
domain_page = 'http://www.caixin.com/search/scroll/index.jsp'

class frame_Item(Item):
    target_item = TextField(css_select='body > div.indexBody > div.news > div.news_content')  
    publish_times = TextField(css_select='dl > dd > span', many=True)
    headline_texts = TextField(css_select='dl> dd > a', many=True)  
    news_urls = AttrField(attr='href',css_select='dl> dd > a',many=True) 


class texts_Item(Item):
    target_item = TextField(css_select='r"/body > div.comMain > div.conlf > #the_content"') 
    news_texts = TextField(css_select='"div.content > div.textbox" ', many=True)


class caixin_realtime_Spider(Spider):
    start_urls = [domain_page]
    concurrency = 5

    async def parse(self, response):
        async for item in frame_Item.get_items(html=response.html):
            yield item

    async def process_item(self, item:frame_Item):
        for url in item.news_urls:
            yield Request(url, callback=self.parse_item) 

    async def parse_item(self, response):
        async for item in caixin_news_Item.get_items(html=response.html):
             print(item.news_texts)


def run():
    print('Scraping ', domain,  domain_page)
    caixin_realtime_Spider.start(middleware=middleware)       


if __name__ == '__main__':
    run()
    db.close()

The thing is... I only get the following result:

Scraping caixin 財新網 http://www.caixin.com/search/scroll/index.jsp
[2019:03:05 17:14:57] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:05 17:14:57] WARNING Spider Ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:03:05 17:14:57] INFO Spider Spider started!
[2019:03:05 17:14:57] INFO Spider Worker started: 1947326139520
[2019:03:05 17:14:57] INFO Spider Worker started: 1947326139656
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:54430/devtools/browser/b72b7efd-7b6b-4353-a7d4-acec7a7f02b4
[I:pyppeteer.launcher] terminate chrome process...
[2019:03:05 17:15:00] ERROR Spider object async_generator can't be used in 'await' expression
[2019:03:05 17:15:00] INFO Spider Stopping spider: Ruia
[2019:03:05 17:15:00] INFO Spider Total requests: 1
[2019:03:05 17:15:00] INFO Spider Time usage: 0:00:02.876312
[2019:03:05 17:15:00] INFO Spider Spider finished!
[2019:03:05 17:15:00] ERROR asyncio Task was destroyed but it is pending!
task: <Task pending coro=<async_generator_athrow()>>
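
The "object async_generator can't be used in 'await' expression" error appears because process_item contains a yield, which turns it into an async generator that ruia then tries to await. One possible restructuring (a sketch, not an official fix) is to yield the follow-up requests from parse() itself, as the 10jqka spider earlier on this page does:

class caixin_realtime_Spider(Spider):
    start_urls = [domain_page]
    concurrency = 5

    async def parse(self, response):
        async for item in frame_Item.get_items(html=response.html):
            for url in item.news_urls:
                # Yield Request objects directly instead of from process_item.
                yield Request(url, callback=self.parse_item)

    async def parse_item(self, response):
        # texts_Item is the Item class defined in the snippet above.
        async for item in texts_Item.get_items(html=response.html):
            print(item.news_texts)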

Question: is there any option for continuous scraping?

Scrapy (an old but time-proven Twisted-based framework) has Frontera, a library for managing scalable clusters of scrapers, where URLs can be added to a queue dynamically by a management node and the actual scraping is performed by other nodes.
So my question is: is there a mechanism for doing similar things with ruia? (I was about to write my own implementation of something similar.)

Crashes on Windows

Traceback (most recent call last):
  File "weibospider.py", line 26, in <module>
    HackerNewsSpider.start()
  File "C:\Users\hwywhywl\StudioProjects\weibo_splider\lib\site-packages\aspider\spider.py", line 92, in start
    spider_ins.loop.add_signal_handler(_signal, lambda: asyncio.ensure_future(spider_ins.stop(_signal)))
  File "C:\Users\hwywhywl\Anaconda3\lib\asyncio\events.py", line 499, in add_signal_handler
    raise NotImplementedError
NotImplementedError

loop.add_signal_handler is currently not supported on Windows.

Suggestion and question

When creating a custom Item class, could AttrField return an absolute address when extracting href? That would save the time of writing a clean function. :)
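
A sketch of the clean-function workaround this request wants to avoid (the item class, selectors and base URL are placeholders):

from urllib.parse import urljoin

from ruia import AttrField, Item, TextField

class LinkItem(Item):
    target_item = TextField(css_select='li')
    url = AttrField(attr='href', css_select='a')

    async def clean_url(self, value):
        # A clean_* method only receives the extracted value, so the base URL
        # has to be hard-coded here -- which is exactly why built-in
        # absolute-URL support would be handy.
        return urljoin('http://example.com/', value)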

target_item is expected error

I have the following piece of code, taken (and modified) from Spider Control:

import re

import aiofiles
from ruia import TextField, Item, Spider


class Player(Item):
    # target_item = TextField(xpath_select='//div[@class="info"]/div[@class="meta bp3-text-overflow-ellipsis"]')
    player_id = TextField(xpath_select='//div[@class="info"]/h1')
    info = TextField(xpath_select='//div[@class="info"]/div[@class="meta bp3-text-overflow-ellipsis"]')

    async def clean_player_id(self, value):
        return int(re.search(r'[0-9]{2,7}', value).group())

    async def clean_info(self, value):
        return re.search(r'([A-Z](.\ [A-Z])?[a-z]+ *)+', value).group()


class SoFifa(Spider):
    start_urls = [
        "https://sofifa.com/player/239053/federico-valverde/200012/",
        "https://sofifa.com/player/197928/jonathan-bond/200012/",
        "https://sofifa.com/player/248243/eduardo-camavinga/200012/"
    ]

    async def parse(self, response):
        async for item in Player.get_items(html=response.html):
            yield item

    async def process_item(self, item: Player):
        print(f"{item.info} (ID: {item.player_id})")

        """Ruia build-in method"""
        async with aiofiles.open('./players.txt', 'a') as f:
            await f.write(f"{item.info} (ID: {item.player_id})" + '\n')


if __name__ == '__main__':
    SoFifa.start()

And when running, I get the following error:

[2019:11:27 20:22:42] INFO Spider Spider started!
[2019:11:27 20:22:42] INFO Spider Worker started: 140065780358320
[2019:11:27 20:22:42] INFO Spider Worker started: 140065780358496
[2019:11:27 20:22:42] INFO Request <GET: https://sofifa.com/player/239053/federico-valverde/200012/>
[2019:11:27 20:22:42] INFO Request <GET: https://sofifa.com/player/197928/jonathan-bond/200012/>
[2019:11:27 20:22:42] INFO Request <GET: https://sofifa.com/player/248243/eduardo-camavinga/200012/>
[2019:11:27 20:22:43] ERROR Spider target_item is expected
[2019:11:27 20:22:43] ERROR Spider target_item is expected
[2019:11:27 20:22:43] ERROR Spider target_item is expected
[2019:11:27 20:22:43] INFO Spider Stopping spider: Ruia
[2019:11:27 20:22:43] INFO Spider Total requests: 3
[2019:11:27 20:22:43] INFO Spider Time usage: 0:00:00.726866
[2019:11:27 20:22:43] INFO Spider Spider finished!

But when I uncomment

target_item = TextField(xpath_select='//div[@class="info"]/div[@class="meta bp3-text-overflow-ellipsis"]')

the error disappears. Why is target_item required, or what am I misunderstanding?

Thanks
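
For what it's worth: target_item tells get_items() which repeating block each item is built from, so it is required there. If each page is a single record, a hedged alternative (assuming the installed ruia version provides Item.get_item, which does not need target_item) would be a parse like this:

    async def parse(self, response):
        # Hypothetical alternative: build one Player per page, no target_item needed.
        item = await Player.get_item(html=response.html)
        yield item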

Show URL in Error for easier debugging

I think errors would be more useful if they also showed the URL of the parsed page. Example:

ERROR Spider <Item: extract ... error, please check selector or set parameter named default>, https://...

I hacked a solution together by passing around the url parameter, but I can't think of a clean solution ATM. Any ideas? I can also push my changes if you would like to see them (very hacky).

The middleware in the tutorial takes only 1 parameter, but it should take 2; the error message should also be fixed

Version: ruia 0.5.7

Problem 1: the middleware code in the tutorial is wrong.
This is the custom middleware from the tutorial:

@middleware.request
async def print_on_request(request):
    ua = 'ruia user-agent'
    request.headers.update({'User-Agent': ua})

This raises an error, because the middleware is invoked as await middleware(self, request), i.e. with 2 arguments.
The actual error is:

    await middleware(self, request)

TypeError: print_on_request() takes 1 positional argument but 2 were given

Problem 2: the error message is wrong.
The displayed error message is:

[2019:05:04 22:51:03] ERROR Spider  <Middleware print_on_request: must be a coroutine function

The code is at line 210 of spider.py:

                except TypeError:
                    self.logger.error(
                        f"<Middleware {middleware.__name__}: must be a coroutine function"
                    )
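
For reference, the corrected middleware matching the two-argument call await middleware(self, request) quoted above:

from ruia import Middleware

middleware = Middleware()

@middleware.request
async def print_on_request(spider_ins, request):
    ua = 'ruia user-agent'
    request.headers.update({'User-Agent': ua})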

Share the event loop with another application

I am looking for a Python micro crawler framework to use inside another Python application.
I would like to share the event loop with my application and run many spiders,
like:

async def main():
    [some code]
    await spider1.start()
    await spider2.start()
loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Is there a simple way to do that?
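
A sketch of one possibility, assuming the installed ruia version exposes Spider.async_start() (the is_async_start flag in Spider.__init__ quoted earlier on this page hints at it); Spider1 and Spider2 stand for your own spider classes:

import asyncio

async def main():
    # ... your application code ...
    await Spider1.async_start()
    await Spider2.async_start()

loop = asyncio.get_event_loop()
loop.run_until_complete(main())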

A Ruia plugin that uses the motor to store data

ruia-motor will automatically store data to MongoDB:

from ruia import AttrField, TextField, Item, Spider
from ruia_motor import RuiaMotor


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/news?p=1', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=response.html):
            yield RuiaMotor(collection='hn_demo', data=item.results)


async def init_plugins_after_start(spider_ins):
    spider_ins.mongodb_config = {
        'host': '127.0.0.1',
        'port': 27017,
        'db': 'ruia_motor'
    }
    RuiaMotor.init_spider(spider_ins=spider_ins)


if __name__ == '__main__':
    HackerNewsSpider.start(after_start=init_plugins_after_start)

`text()` in xpath selector causes an error

I assume this is because of using TextField rather than another field. Which is more ideal? The error is caused by line 120:

strings = [node.strip() for node in element.itertext()]

'lxml.etree._ElementUnicodeResult' object has no attribute 'itertext'

I was thinking that itertext shouldn't be used when the element has that type, to avoid the error. That way text() will work fine.
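
A sketch of the suggested guard (element_to_strings is a made-up helper name; in ruia the logic lives around line 120 of field.py):

def element_to_strings(element):
    # lxml's _ElementUnicodeResult is a str subclass, so a text() xpath result
    # is caught here and returned as-is instead of calling itertext() on it.
    if isinstance(element, str):
        return [element.strip()]
    return [node.strip() for node in element.itertext()]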

About AttrField.extract_value

def extract_value(self, html, is_source=False):
    """
    Use css_select or re_select to extract a field value
    :return:
    """
    if self.css_select:
        value = html.cssselect(self.css_select)
        value = value[0].get(self.attr, value) if len(value) == 1 else value
    elif self.xpath_select:
        value = html.xpath(self.xpath_select)
    else:
        raise ValueError('%s field: css_select or xpath_select is expected' % self.__class__.__name__)
    if is_source:
        return value
    if self.default is not None:
        value = value if value else self.default
    return value

There's such a line:

value = value[0].get(self.attr, value) if len(value) == 1 else value

My idea is to extract the attribute from each matched element, and then check the count:

if self.css_select:
    value = html.cssselect(self.css_select)
    value = [item.get(self.attr, item) for item in value]
    if len(value) == 1:
        value = value[0]

Besides, I don't think it's necessary to check the count at all.
It may cause confusion when some pages have several items while other pages have only one.
For example, consider a field named 'tag'.
I think always returning a list is better.

Support for multiple requests

For example:

from ruia import Spider, Middleware


class TestSpider(Spider):
    start_urls = ['http://www.httpbin.org/get']

    async def parse(self, response):
        pages = [{'url': f'http://www.httpbin.org/get?p={i}'} for i in range(1, 9)]
        async for resp in self.multiple_request(pages):
            yield self.parse_next(resp, any_param='hello')

    async def parse_next(self, response, any_param):
        yield self.request(
            url=response.url,
            callback=self.parse_item
        )

    async def parse_item(self, response):
        item_data = response.html

Rate Limiting?

Hi,

Thanks for this wonderful project.

What is the recommended way to add rate limiting to the spider?

I normally add a randomized delay when scraping, like:

requests.get(url)
time.sleep(abs(random.gauss(1, 0.5)) * 2)

To get a random delay with a mean of 2.

Another common one is rate limiting to 10 requests per minute or 1 request per 5 seconds.

What would be the recommended way to rate limit in ruia?
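
There does not seem to be a dedicated rate limiter, but here is a rough sketch using knobs that appear elsewhere on this page (the per-spider concurrency attribute and the request_config DELAY key, which the DELAY issue above says applies to every request):

from ruia import Spider

class PoliteSpider(Spider):
    start_urls = ['https://example.com/']
    concurrency = 1                 # at most one request in flight at a time
    request_config = {'DELAY': 5}   # wait roughly 5 seconds around each request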

`TextField` strips strings which may not be desirable

strings = [node.strip() for node in element.itertext()]

My use case is extracting paragraphs which have newlines between them, and these are stripped out by TextField. Should a new field be introduced (I have already made one for my scraper), or should the stripping be optional? Perhaps both is best.
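
A sketch of the kind of field described (the issue notes one has already been written); _parse_element and _LxmlElementField are the same hooks used in the HtmlField and ElementField issues above:

from ruia.field import _LxmlElementField

class RawTextField(_LxmlElementField):
    def _parse_element(self, element):
        # Keep the original whitespace and newlines instead of stripping
        # each text node.
        return ''.join(element.itertext())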

ruia does not seem to use a queue to keep track of all tasks; if the crawler stops unexpectedly, the next run has to start over from scratch

I used Scrapy a lot before, but Scrapy is built on Twisted. I happened to discover ruia, an async framework based on aiohttp and asyncio. After reading the docs, though, I found there is no scheduler like Scrapy's. If the spider stops suddenly, is there no way to resume crawling from where it left off? If the volume is large, starting over means a lot of repeated work. Has the developer considered adding this feature? With a scheduler it would also be fairly easy to deduplicate tasks.
