gaojiuli / toapi

3.5K 77.0 235.0 1.74 MB

Every web site provides APIs.

Home Page: https://gaojiuli.github.io/toapi/

License: MIT License

Python 100.00%

html json api python web spider crawler flask toapi

toapi's Issues

Problem when using toapi run

Multiple site routing problems.

The example:

from toapi import XPath, Item, Api

api = Api()


class Movie(Item):
    __base_url__ = 'http://www.dy2018.com'

    url = XPath('//b//a[@class="ulink"]/@href')
    title = XPath('//b//a[@class="ulink"]/text()')

    class Meta:
        source = XPath('//table[@class="tbspan"]')
        route = '/'


class Post(Item):
    __base_url__ = 'https://news.ycombinator.com/'

    url = XPath('//a[@class="storylink"]/@href')
    title = XPath('//a[@class="storylink"]/text()')

    class Meta:
        source = XPath('//tr[@class="athing"]')
        route = '/'

api.register(Movie)
api.register(Post)

api.serve()

How to set expiration on Local Storage

Hello,
I am using v1.0.0.
I wanted to check whether there is a way to set an expiration time on the local storage. Since the source website changes frequently, I want to keep the local storage for only 24 hours and then force it to refetch the content.
Is there a way to achieve this, similar to a TTL in the cache settings?

Thanks,

Elements not always present on page

I use:

class ProductPage(Item):
      coupon = Attr('.coupon', 'title')

However some product pages do not contain the coupon html
so they fail with

  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/htmlparsing.py", line 79, in parse
    return element.css(self.selector)[0].attrs[self.attr]
IndexError: list index out of range

What's the best practice to deal with that situation?
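
One possible approach, sketched under assumptions: wrap the selector so that a missing element yields a default instead of raising. `OrDefault` and `FirstTitle` are hypothetical names, not part of toapi or htmlparsing, and whether `Item` picks up a wrapped selector is untested. The sketch only relies on the selector exposing a `parse(element)` method, as the traceback suggests.

```python
class OrDefault:
    """Wrap a selector-like object so a missing element yields a default.

    Hypothetical helper: `selector` is any object with a parse(element) method.
    """

    def __init__(self, selector, default=None):
        self.selector = selector
        self.default = default

    def parse(self, element):
        try:
            return self.selector.parse(element)
        except (IndexError, KeyError):
            # IndexError: nothing matched; KeyError: the attribute is absent.
            return self.default


class FirstTitle:
    """Stand-in selector that mimics the failing line in the traceback."""

    def parse(self, matches):
        return matches[0]["title"]


coupon = OrDefault(FirstTitle(), default="")
print(coupon.parse([{"title": "10% off"}]))  # a page with a coupon
print(repr(coupon.parse([])))                # a page without one
```

Catching only `IndexError` and `KeyError` keeps unrelated parsing bugs visible instead of silently turning them into the default.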

Docs not updated

The toapi-pic example in the docs:
toapi new toapi-pic pulls the toapi/toapi-template project; it has to be changed to toapi new toapi/toapi-pic to pull the toapi/toapi-pic project.

install toapi occur errors.

When I install toapi with the command "pip install toapi", the following error occurs.

Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/commands/install.py", line 324, in run
    requirement_set.prepare_files(finder)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 380, in prepare_files
    ignore_dependencies=self.ignore_dependencies))
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 620, in _prepare_file
    session=self.session, hashes=hashes)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 821, in unpack_url
    hashes=hashes
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 659, in unpack_http_url
    hashes)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 882, in _download_http_url
    _download_url(resp, link, content_file, hashes)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 603, in _download_url
    hashes.check_against_chunks(downloaded_chunks)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/hashes.py", line 46, in check_against_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 571, in written_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/ui.py", line 139, in iter
    for x in it:
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 560, in resp_read
    decode_content=False):
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/response.py", line 357, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/response.py", line 324, in read
    flush_decoder = True
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/response.py", line 246, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='pypi.python.org', port=443): Read timed out.

This is on macOS; the Python version is 2.7 and the pip version is 9.0.1.
Can anyone help me out?

Docs clarification on cache update logic

Hi.

On the diagram in the README (nice & clear BTW), I can read:

HTML storage update trigger cache update

Could you please explain in the docs what that means?
When exactly, and how, does the cache get updated?

Installation error on Python 2.7

Traceback (most recent call last):
  File "app.py", line 2, in <module>
    from htmlparsing import Attr, Text
  File "/usr/local/lib/python2.7/dist-packages/htmlparsing-0.1.5-py2.7.egg/htmlparsing.py", line 21
    def __init__(self, text: str):
                           ^
SyntaxError: invalid syntax

Storage timeout.

Storage should support timeout.

SyntaxError: invalid syntax

Hi,

When I run the command toapi -v, I get the error below.

Kindly check and suggest what is needed.

Thanks...

Traceback (most recent call last):
  File "/usr/local/bin/toapi", line 9, in <module>
    load_entry_point('toapi==2.1.2', 'console_scripts', 'toapi')()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 542, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2229, in load
    return self.resolve()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2235, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python2.7/dist-packages/toapi/__init__.py", line 1, in <module>
    from toapi.api import Api
  File "/usr/local/lib/python2.7/dist-packages/toapi/api.py", line 20
    def __init__(self, site: str = '', browser: str = None) -> None:
                           ^
SyntaxError: invalid syntax

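About fetching data via POST and writing items

What if the page I want to parse can only be obtained through a POST request?

Does toapi provide a way to do this?
I see that the settings have a parameter ajax=true.
So where should I define the data for the ajax request I send?
I went through the docs and the issues and couldn't find it.

About writing items

The built-in XPath method seems to return a processed value rather than an etree element.
So if, for example, I want all the text under an h1 (including child tags), I can't use the string(.) method.
I have to write an extra clean_xx method.

Also, could bs4 support be added?

Finally, I hope this project keeps getting better!
It's really great!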

Default api issues.

/cache Not finished.
/storage Not finished.
/items Result is not as expected.
/status All right.

Cache TTL clarification

Good evening,

I have a question regarding the cache and its time to live.

Let's say I want to turn some site into an API and want the results of the very first request to be cached for one hour. How would I specify that in the settings? Is such a setup even possible?

I tried setting ttl: 60 * 60, assuming that that would do the trick. But to me it seems it doesn't...

Could you please clarify?

Thanks in advance.
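
For reference, the behaviour asked about (cache the first result for one hour, then refetch) can be sketched independently of toapi. This is a minimal in-memory illustration, not toapi's cache implementation; `TTLCache` and its `clock` parameter are hypothetical names introduced here.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire `ttl` seconds after being set."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data = {}

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Expired: drop the entry so the caller refetches.
            del self._data[key]
            return default
        return value


cache = TTLCache(ttl=60 * 60)
cache.set("/movies/", [{"title": "My Life"}])
print(cache.get("/movies/"))
```

Expiry is checked lazily on `get`, so no background thread is needed; the trade-off is that expired entries stay in memory until they are next requested.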

HTML permanent storage.

The aim is that our API server does not affect the source site.

Save results into files.
File's name is hash(path)
File's content is HTML
Update strategy? What should be updated, and what should be stored permanently?
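
A minimal sketch of the file layout described above, assuming one file per source path. `FileStorage` is a hypothetical name. The sketch uses SHA-256 rather than Python's built-in `hash()`, because `hash()` of a string changes between interpreter runs and so cannot name a permanent file.

```python
import hashlib
import tempfile
from pathlib import Path


class FileStorage:
    """Store each page's HTML in a file named after a digest of its path."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _file_for(self, path):
        digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
        return self.root / digest

    def save(self, path, html):
        self._file_for(path).write_text(html, encoding="utf-8")

    def load(self, path):
        target = self._file_for(path)
        if not target.exists():
            return None  # never fetched
        return target.read_text(encoding="utf-8")


storage = FileStorage(tempfile.mkdtemp())
storage.save("/html/gndy/dyzz/", "<html>movies</html>")
print(storage.load("/html/gndy/dyzz/"))
```

The update strategy is left open here, as in the issue; one option would be to compare each file's modification time against a maximum age before reusing it.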

Different route load the same item.

class MovieList(Item):
    __base_url__ = 'http://www.dy2018.com'

    url = XPath('//b//a[@class="ulink"]/@href')
    title = XPath('//b//a[@class="ulink"]/text()')

    class Meta:
        source = XPath('//table[@class="tbspan"]')
        route = {'/movies/?page=1': '/html/gndy/dyzz/',
                 '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html'}

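About the routing of resource paths

At the moment, the link for fetching a resource generally looks like https://yoursite.com/https://targetsite.com/resource/path/

This causes two problems:

It is ugly: the resource request path is too long and looks bad.
It directly exposes the source site.

A proposal:

Could Meta get an alias that replaces or identifies the source site's base_url and is inserted into the route as the first-level resource path, e.g. https://yoursite.com/<alias>/resource/path/? That would both distinguish multiple sites and solve the two problems above.

The example in the official repository (see the source) implements custom routing with flask's routes. With multiple sites and multiple request paths, you write one set of routes in items and then another set there, which feels rather mechanical.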

Thanks.

toapi is easy to star, hard to develop

Modify routing argument

class Meta:
    source = None
    route = {'/search/:id': '/search/:id'}

Right now the ID from the host URL is passed directly into the source URL. Is there a way to modify the ID before passing it on?

For example, I need to map the query 127.0.0.1:5000/search/1 to bing.com/search/100

So I would have to multiply :id by 100 before passing it as an argument. Not sure if that makes sense.
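
To make the request concrete, here is a sketch of a per-parameter transform hook applied before the source path is built. It assumes the router has already extracted the parameters into a dict. `fill`, `build_source_path`, and the `transforms` argument are hypothetical and not part of toapi.

```python
import re


def fill(template, params):
    """Replace each ':name' placeholder in the template with its value."""
    return re.sub(r":(\w+)", lambda m: str(params[m.group(1)]), template)


def build_source_path(template, params, transforms=None):
    """Apply optional per-parameter transforms, then fill the template."""
    converted = dict(params)
    for name, func in (transforms or {}).items():
        if name in converted:
            converted[name] = func(converted[name])
    return fill(template, converted)


# 127.0.0.1:5000/search/1 -> bing.com/search/100
print(build_source_path("/search/:id", {"id": "1"},
                        transforms={"id": lambda value: int(value) * 100}))
# /search/100
```

Keeping the transform separate from the route table means the `{'/search/:id': '/search/:id'}` mapping itself would not need to change.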

Ip proxies.

Access to RawHTML from selectors

Hello,
I need to get access to the raw HTML in one of Item instances.
Currently the XPath or CSS selectors always convert the node to a string. But in my use case, once I select a certain part of my webpage, I need to do some post-processing in my clean_ method.
But I can only get a string passed into it. Is there a way to get the raw HTML passed into my clean_ method for a given key?

Thank you,

Web interface.

A Web console interface.

Monitor API.
Statistical

Production Deployment Instructions

Hello,
I am relatively new to Python web development and am mainly working on a mobile app. I found toapi to be a perfect companion for my backend requirements.
I am now almost ready to launch my app, but am struggling to find a good production hosting environment for the toapi server code.
I am mainly looking at Heroku, AWS, or Google App Engine for hosting the server.

I was wondering if you could provide some instructions for deploying to a production-quality server. I did go over this deploy link, but I am still not able to relate its content to the actual toapi codebase.

Any advice on how I can move forward would be appreciated.

Thank you again,

Log Support.

The request sent to source site.
The request received.
The items parsed.
HTML stored
Cache got.
Errors.

Example:

[Sent] TIME , URL, LENGTH ,STATUS
[Received] TIME, URL, LENGTH ,STATUS
[Parsed] TIME, URL,ITEM, TOTAL, STATUS
[Cache] TIME, URL,STATUS
[warning] TIME, DETAIL
etc...

This should be a log.py file.
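
A rough sketch of what such a log.py could look like, built on the standard `logging` module. The helper names `build_logger` and `log_event`, the logger name, and the exact field order are assumptions for illustration, not toapi's actual logging code.

```python
import logging

# The bracketed event name comes from the `extra` dict passed to each call.
LOG_FORMAT = "[%(event)s] %(asctime)s %(message)s"


def build_logger(name="toapi-sketch"):
    """Create (once) a logger that prints '[Event] TIME details' lines."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def log_event(logger, event, url, status, level=logging.INFO, **fields):
    """Log one event and return the message text that was logged."""
    extras = " ".join("%s=%s" % (key, value)
                      for key, value in sorted(fields.items()))
    message = " ".join(part for part in (url, extras, status) if part)
    logger.log(level, message, extra={"event": event})
    return message


logger = build_logger()
log_event(logger, "Sent", "http://example.com/", "200", length=1024)
log_event(logger, "Parsed", "http://example.com/", "OK", item="Movie", total=25)
```

Passing the event name through `extra` keeps one formatter for every event type, so adding a new event such as `[Cache]` needs no new handler.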

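Filtering out useless items

When parsing items,
some of them are junk,
such as low-quality ads.

Is there a way to remove them?

My approach is to override the methods in the Item class.
But the values are nested layer by layer, so handling it this way is not very clean.
My skills are limited, so please bear with me.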

from collections import OrderedDict

from toapi import Item, XPath


class MyItem(Item):
    @classmethod
    def parse(cls, html):
        """Parse html to json"""
        if cls.Meta.source is None:
            return cls._parse_item(html)
        else:
            sections = cls.Meta.source.parse(html, is_source=True)
            results = []
            for section in sections:
                res = cls._parse_item(section)
                if res:
                    results.append(res)
            return results

    @classmethod
    def _parse_item(cls, html):
        item = OrderedDict()
        for name, selector in cls.__selectors__.items():
            try:
                item[name] = selector.parse(html)
            except Exception:
                item[name] = ''
            clean_method = getattr(cls, 'clean_%s' % name, None)
            if clean_method is not None:
                res = clean_method(cls, item[name])
                if res is None:
                    return None
                else:
                    item[name] = res
        return item


class HotBook(MyItem):
    title = XPath('//a[@class="xst"]/text()[1]')
    
    def clean_title(self, title):
        if '《' in title:
            return title[title.find('\u300a') + 1:title.find('\u300b')][:10]
        else:
            return None
 
    class Meta:
        source = XPath('//tbody[@class="thread_tbody"]')
        route = {'/hotbook?page=:page': '/forum-171-:page.html'}

Item clean data support.

class MyItem(Item):
    class Meta:
        route = r'\.+'

    def clean_data(self, data):
        """Do something here."""
        cleaned_data = data  # placeholder: transform data as needed
        return cleaned_data

Any way to send HTTP POST requests?

In working with toapi I came across a scenario where the web page had an HTML table that was paginated.

Clicking on "next page" would issue an ajax post request to fetch the next set of records in the data set.

Is there any way to accomplish this with toapi?

ImportError: cannot import name 'XPath'

Where did XPath go??? It looks like you removed XPath???

Custom phantomjs headers.

I haven't used Python much

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from toapi import XPath, Item, Api
  File "/Library/Python/2.7/site-packages/toapi/__init__.py", line 2, in <module>
    from toapi.item import *
  File "/Library/Python/2.7/site-packages/toapi/item.py", line 17
    class Item(metaclass=ItemType):
                        ^
SyntaxError: invalid syntax

Doesn't work with HTTPS websites

It works normally when I use xiafufang.com, but toapi-pic does not work. What should I configure for HTTPS websites? Thanks.

modify routing argument 2

Hi
The site I am scraping has urls like:

http://remote.com/i-love-cats/1
http://remote.com/dogs-are-really-great/2
http://remote.com/pupies_kanGourOUs/3

I want to match them with local urls like those:

http://localhost:5000/1
http://localhost:5000/2
http://localhost:5000/3

Is there some magical way to do it?

Or do I need to do something like #107 and also
add custom code with an external two-column DB table to match

1 => http://remote.com/i-love-cats/1
...

As an alternative, I could maybe add a route like
@api.route('page/{complete_remote_url}', '{complete_remote_url}')
and then call:

wget http://localhost:5000/page/http://remote.com/i-love-cats/1

but I want to hide the scraped site url so the caller does not see the url
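
An in-memory version of the two-column table mentioned above might look like the following sketch. `REMOTE_PATHS` and `remote_url` are hypothetical names, and how the table gets filled (for instance from an index page on the remote site) is left open.

```python
# Local id -> remote path. The entries mirror the URLs listed in the question.
REMOTE_PATHS = {
    "1": "/i-love-cats/1",
    "2": "/dogs-are-really-great/2",
    "3": "/pupies_kanGourOUs/3",
}


def remote_url(local_path, base_url="http://remote.com"):
    """Map a local path such as '/1' to the full remote URL, or None."""
    page_id = local_path.strip("/")
    remote_path = REMOTE_PATHS.get(page_id)
    if remote_path is None:
        return None
    return base_url + remote_path


print(remote_url("/1"))
print(remote_url("/99"))
```

Because the caller only ever sees the numeric id, the remote site's URL never appears in the local route.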

Parse XPath result in string but not list.

Now result:

{'url': ['https://*'], 
'title': ['Making It ']}

Expected:

{'url': 'https://*', 
'title': 'Making It'}
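
The conversion being asked for can be sketched as a small post-processing step. `unwrap` is a hypothetical helper, not part of toapi; it leaves empty and multi-element lists untouched so that no data is silently dropped.

```python
def unwrap(result):
    """Return the sole element of a one-item list, stripped if it is a string."""
    if isinstance(result, list) and len(result) == 1:
        value = result[0]
        return value.strip() if isinstance(value, str) else value
    return result


item = {"url": ["https://*"], "title": ["Making It "]}
flat = {key: unwrap(value) for key, value in item.items()}
print(flat)
# {'url': 'https://*', 'title': 'Making It'}
```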

Error: No such command "new".

[root@test python3]# python --version
Python 3.6.2

[root@test python3]# toapi new toapi/toapi-pic
Usage: toapi [OPTIONS] COMMAND [ARGS]...

Error: No such command "new".

help ?

Error when running toapi run

The Python version is 3.5 and the toapi version is 0.2.2.

toapi new api
cd api
toapi run

An error occurs when executing toapi run:

➜  api toapi run
Traceback (most recent call last):
  File "/usr/local/bin/toapi", line 9, in <module>
    load_entry_point('toapi==0.2.2', 'console_scripts', 'toapi')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/toapi/cli.py", line 81, in run
    app = importlib.import_module('app', base_path)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/home/zz/code/python/toapi_project/api/app.py", line 7, in <module>
    api.register(Page)
  File "/usr/local/lib/python3.5/dist-packages/toapi/api.py", line 31, in register
    item.__pattern__ = re.compile(item.__base_url__ + item.Meta.route)
TypeError: Can't convert 'dict' object to str implicitly

item.__base_url__ is a str, but item.Meta.route is a dict.

Settings check.

Whenever we run the app, we should make sure the settings are correct, just like the Django runserver command.

Flask logging error

python 3.7
toapi 2.1.1

Traceback (most recent call last):
  File "main.py", line 5, in <module>
    api = Api()
  File "/usr/local/lib/python3.7/site-packages/toapi/api.py", line 24, in __init__
    self.__init_server()
  File "/usr/local/lib/python3.7/site-packages/toapi/api.py", line 27, in __init_server
    self.app.logger.setLevel(logging.ERROR)
AttributeError: module 'flask.logging' has no attribute 'ERROR'
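
A likely cause, stated as an assumption since api.py's imports are not shown here: the name `logging` in api.py refers to `flask.logging` rather than the standard-library module, and `flask.logging` in this Flask version does not define `ERROR`. Flask's `app.logger` is a standard `logging.Logger`, so the level constant can come from the standard library, as in this stand-alone sketch (the logger name is made up):

```python
import logging  # the standard-library module, which defines ERROR

# Stand-in for app.logger, which is an ordinary logging.Logger in Flask.
server_logger = logging.getLogger("toapi-sketch-server")
server_logger.setLevel(logging.ERROR)

print(server_logger.level == logging.ERROR)
```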

Received log execution time.

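Suggestion on how items get element content

toapi currently handles page elements in two ways:

If the selector pinpoints a single element node, it returns all the text under that node; this is implemented by iterating over node.itertext() and concatenating all the text content.
If the selector matches multiple element nodes, it returns a list made up of those element nodes (Element).

Suppose we have the following DOM structure: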

<div class="block">
  <p class="p1">
    hello, P1<br/>
    this is new line
  </p>
  <p class="p2">
    <span>this is a line</span>
    <a href="#">this is a link</a>
  </p>
</div>

And the following item:

class Demo(Item):
    block = Css('div.block')
    p = Css('div.block > p')
    p1 = Css('div.block > p.p1')

The result will be:

block: hello, P1 this is new line this is a line this is a link
p: [<Element p 0x*********>, <Element p 0x*********>]
p1: hello, P1 this is new line

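Now there is a case where I want the HTML content of the matched element node rather than all of its text. For example, .p1 contains a line break <br/>, and I want to get the HTML content under that element, that is, everything including the <br/> tag and the text, and then replace <br/> with '\n' myself in a clean_* method. As the result above shows, however, p1 comes back with only its text content. If I want the original content of p1, I have to fetch a group of elements through p, iterate to find p1, convert it to a string, and then process it.

The above is only a simple use case; real applications will of course involve more than replacing line breaks. So I suggest adding an optional attribute raw to the selectors, defaulting to False, i.e. Selector(rule, raw=False). Together with the clean_* methods, this would give developers more freedom in processing the extracted content.

That's all.

Addendum:

There is also the Regex selector; with it you can get the original content you want. The other two selectors, Css and XPath, behave as described above.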
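
The proposed raw flag can be illustrated with the standard library's ElementTree, used here as a stand-in for the parser toapi relies on. `extract` and `clean_p1` are hypothetical names, and this is a sketch of the idea rather than a patch to the selectors.

```python
import re
import xml.etree.ElementTree as ET

HTML = """<div class="block">
  <p class="p1">
    hello, P1<br/>
    this is new line
  </p>
</div>"""


def extract(element, raw=False):
    """Return the element's markup when raw=True, else its collapsed text."""
    if raw:
        return ET.tostring(element, encoding="unicode")
    return " ".join("".join(element.itertext()).split())


def clean_p1(markup):
    """Example clean step: turn <br/> tags into newlines."""
    return re.sub(r"<br\s*/?>", "\n", markup)


p1 = ET.fromstring(HTML).find("p[@class='p1']")
print(extract(p1))                       # text only, as toapi returns today
print(clean_p1(extract(p1, raw=True)))   # markup with <br/> replaced
```

With raw defaulting to False, existing items would keep their current output, and only selectors that opt in would receive markup in their clean_* methods.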

Setting and updating of storage and cache issues.

If sending the request fails, don't set storage.
If parsing fails, don't set cache.
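
The two rules can be sketched as a small pipeline. `serve`, `fetch`, and `parse` are hypothetical names, and plain dicts stand in for the storage and cache backends.

```python
def serve(url, fetch, parse, storage, cache):
    """Fetch and parse a page, recording each stage only if it succeeded."""
    try:
        html = fetch(url)
    except Exception:
        return None  # sending failed: storage stays untouched
    storage[url] = html
    try:
        items = parse(html)
    except Exception:
        return None  # parsing failed: cache stays untouched
    cache[url] = items
    return items


storage, cache = {}, {}
print(serve("/movies/", lambda url: "<html/>", lambda html: [html], storage, cache))
```

In this sketch the HTML is still stored when parsing fails, which lets a corrected parser be retried without fetching the source site again.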

Add force-refresh on api

Hello,
I plan to use this awesome tool to fetch some content from a dynamic website. The cache expiration works for standard use cases, but I was wondering if there is a way for the API to perform a complete fetch from the website instead of using the cache or the storage. A use case would be a mobile device where the user does a refresh to see the latest content from the website.

Any ideas ?

Thank you for this...
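
One way this could work, sketched with hypothetical names (`get_items`, `force`) and a plain dict as the cache: a flag that skips the cached copy and overwrites it with a fresh result. How the flag would be exposed on the API, for example as a query parameter, is left open.

```python
def get_items(url, fetch, parse, cache, force=False):
    """Return cached items unless force is set, in which case refetch."""
    if not force and url in cache:
        return cache[url]
    items = parse(fetch(url))
    cache[url] = items  # the fresh result replaces any stale entry
    return items


calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>%d</html>" % len(calls)


cache = {}
print(get_items("/news/", fake_fetch, str.upper, cache))
print(get_items("/news/", fake_fetch, str.upper, cache))              # served from cache
print(get_items("/news/", fake_fetch, str.upper, cache, force=True))  # refetched
```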

The Regex selector cannot parse content when source is set

If source in Meta is not None, the html a selector receives is an etree Element object. The Regex selector's implementation does not check isinstance(html, etree._Element), so using a Regex selector in an item filters out nothing, and re.findall(self.rule, html) fails with the error expected string or bytes-like object.

Fix:

toapi/selector.py

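import re

from lxml import etree


class Regex(Selector):
    """Regex expression"""

    def parse(self, html):
        if isinstance(html, etree._Element):
            # Serialize to str, not bytes, so a str pattern can match.
            html = etree.tostring(html, encoding='unicode')
        return re.findall(self.rule, html)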

Multiple processes share cache.

Minor bug when passing url port data to flask

Bug output

$ toapi run
2017/12/17 19:07:28 [Serving ] OK http://127.0.0.1:5000 
Traceback (most recent call last):
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/bin/toapi", line 11, in <module>
    sys.exit(cli())
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/toapi/cli.py", line 61, in run
    app.api.serve(ip=ip, port=port)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/toapi/api.py", line 42, in serve
    self.app.run(ip, port, debug=False, **options)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/flask/app.py", line 841, in run
    run_simple(host, port, self, **options)
  File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/werkzeug/serving.py", line 733, in run_simple
    raise TypeError('port must be an integer')
TypeError: port must be an integer
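
The traceback suggests the port reaches Flask as something other than an int, most likely a string from the command line or settings; that is an assumption, since cli.py is not shown. A minimal normalisation step could look like this sketch, where `parse_port` is a hypothetical helper and the default of 5000 is taken from the log line above.

```python
def parse_port(value, default=5000):
    """Convert a port given as str or int into a validated int."""
    if value is None or value == "":
        return default
    port = int(value)  # raises ValueError for non-numeric input
    if not 0 <= port <= 65535:
        raise ValueError("port out of range: %d" % port)
    return port


print(parse_port("5000"))
print(parse_port(None))
```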

Command line interface.

toapi new [project_name]
toapi run
toapi info
toapi status
toapi clear storage
toapi clear cache

Expose WSGI interface for gunicorn or uwsgi.

API service cache.

Suppose we have some permanently stored HTML.

Now we should improve the performance of the API service.

Use Redis, or is there something better?
Cache the JSON that is parsed from the HTML.

        route = {'/movies/?page=1': '/html/gndy/dyzz/',
                 '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html',
                 '/movies/': '/html/gndy/dyzz/'}

The problem is the ordering.

Use tuple or OrderedDict
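
A sketch of the tuple-based option, where the first matching rule wins. `ROUTES`, `to_regex`, and `resolve` are hypothetical names, and the aliases are matched literally (the `/?` is treated as a slash followed by a query string, not as a regex), which is an assumption about the intended semantics.

```python
import re

# Order matters: the specific page=1 rule must come before the generic one.
ROUTES = [
    ("/movies/?page=1", "/html/gndy/dyzz/"),
    ("/movies/?page=:page", "/html/gndy/dyzz/index_:page.html"),
    ("/movies/", "/html/gndy/dyzz/"),
]


def to_regex(alias):
    """Compile an alias, turning each ':name' into a named group."""
    parts = re.split(r":(\w+)", alias)  # literal, name, literal, name, ...
    regex = ""
    for index, part in enumerate(parts):
        regex += "(?P<%s>[^/&]+)" % part if index % 2 else re.escape(part)
    return re.compile("^" + regex + "$")


def resolve(request_path, routes=ROUTES):
    """Return the source path for the first alias that matches, or None."""
    for alias, target in routes:
        match = to_regex(alias).match(request_path)
        if match:
            return re.sub(r":(\w+)", lambda m: match.group(m.group(1)), target)
    return None


print(resolve("/movies/?page=1"))
print(resolve("/movies/?page=2"))
```

If the first two rules were swapped, `/movies/?page=1` would resolve to `index_1.html` instead, which is the ordering problem described above.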

Encrypt URL.

We don't want the API users to know any information about the source site.

Example:

http://127.0.0.1/adauoai1duaou/

Here `/adauoai1duaou/` is an encrypted URL that stands for `/users/`.

We should encrypt all the relative URLs in all items, for example:

{
    'title':'My Life',
    'url':'/movie/2017/'
}

Convert it to

{
    'title':'My Life',
    'url':'/uo23uodaoi123udo/'
}

I need an encrypt.py file.
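
A rough sketch of one way such a file could work. Note that it swaps in a different technique from encryption: each path is replaced by a keyed HMAC token, and a lookup table maps tokens back to paths. `UrlMasker` and its methods are hypothetical names, and the in-memory table would need to be persisted in a real server.

```python
import hashlib
import hmac


class UrlMasker:
    """Replace relative URLs with opaque tokens and map them back."""

    def __init__(self, secret):
        self.secret = secret.encode("utf-8")
        self._paths = {}  # token -> original path

    def mask(self, path):
        digest = hmac.new(self.secret, path.encode("utf-8"),
                          hashlib.sha256).hexdigest()
        token = "/%s/" % digest[:16]
        self._paths[token] = path
        return token

    def unmask(self, token):
        return self._paths.get(token)

    def mask_item(self, item, keys=("url",)):
        """Mask the relative URLs stored under the given keys of an item."""
        masked = {}
        for key, value in item.items():
            is_relative = isinstance(value, str) and value.startswith("/")
            masked[key] = self.mask(value) if key in keys and is_relative else value
        return masked


masker = UrlMasker("change-me")
masked = masker.mask_item({"title": "My Life", "url": "/movie/2017/"})
print(masked)
print(masker.unmask(masked["url"]))
```

Because the token is derived from a secret key, API users cannot recompute it from a guessed path without the key, and the same path always yields the same token.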
