gaojiuli / toapi
Every web site provides APIs.
Home Page: https://gaojiuli.github.io/toapi/
License: MIT License
The example:
from toapi import XPath, Item, Api

api = Api()

class Movie(Item):
    __base_url__ = 'http://www.dy2018.com'
    url = XPath('//b//a[@class="ulink"]/@href')
    title = XPath('//b//a[@class="ulink"]/text()')

    class Meta:
        source = XPath('//table[@class="tbspan"]')
        route = '/'

class Post(Item):
    __base_url__ = 'https://news.ycombinator.com/'
    url = XPath('//a[@class="storylink"]/@href')
    title = XPath('//a[@class="storylink"]/text()')

    class Meta:
        source = XPath('//tr[@class="athing"]')
        route = '/'

api.register(Movie)
api.register(Post)
api.serve()
Hello,
I am using v1.0.0.
I wanted to check whether there is a way to set an expiration time on the local storage. Since the source website changes frequently, I want to keep the local storage for only 24 hours and then force a refetch of the content.
Is there a way to achieve this, similar to a ttl on the cache settings?
Thanks,
I use:
class ProductPage(Item):
    coupon = Attr('.coupon', 'title')
However, some product pages do not contain the coupon HTML, so they fail with:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/htmlparsing.py", line 79, in parse
    return element.css(self.selector)[0].attrs[self.attr]
IndexError: list index out of range
What's the best practice to deal with that situation?
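One common way to avoid this IndexError is to check the match list before indexing and fall back to a default. The sketch below does not use htmlparsing itself: FakeElement and first_attr are hypothetical names that only mirror the css(...)[0].attrs[...] shape visible in the traceback, so treat it as an illustration of the idea rather than the library's API.

```python
class FakeElement:
    """Hypothetical stand-in for a parsed element; not part of htmlparsing."""

    def __init__(self, attrs=None, children=None):
        self.attrs = attrs or {}
        self._children = children or {}

    def css(self, selector):
        # Return the list of matches (empty when nothing matches).
        return self._children.get(selector, [])


def first_attr(element, selector, attr, default=None):
    """Return attr of the first match, or default when nothing matches."""
    matches = element.css(selector)
    if not matches:
        return default
    return matches[0].attrs.get(attr, default)


page_with_coupon = FakeElement(
    children={'.coupon': [FakeElement(attrs={'title': '10% off'})]}
)
page_without_coupon = FakeElement()
```

With a helper like this, a page without the coupon markup yields the default value instead of raising.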
In the toapi-pic example in the docs, the command toapi new toapi-pic pulls the toapi/toapi-template project. It has to be changed to toapi new toapi/toapi-pic in order to pull the toapi/toapi-pic project.
When I install toapi with the command "pip install toapi", the following error occurs.
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/basecommand.py", line 215, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/commands/install.py", line 324, in run
requirement_set.prepare_files(finder)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 380, in prepare_files
ignore_dependencies=self.ignore_dependencies))
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 620, in _prepare_file
session=self.session, hashes=hashes)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 821, in unpack_url
hashes=hashes
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 659, in unpack_http_url
hashes)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 882, in _download_http_url
_download_url(resp, link, content_file, hashes)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 603, in _download_url
hashes.check_against_chunks(downloaded_chunks)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/hashes.py", line 46, in check_against_chunks
for chunk in chunks:
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 571, in written_chunks
for chunk in chunks:
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/ui.py", line 139, in iter
for x in it:
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/download.py", line 560, in resp_read
decode_content=False):
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/response.py", line 357, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/response.py", line 324, in read
flush_decoder = True
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/response.py", line 246, in _error_catcher
raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='pypi.python.org', port=443): Read timed out.
This is on macOS with Python 2.7 and pip 9.0.1.
Can anyone help me out?
Hi.
On the diagram in the README (nice & clear BTW), I can read:
HTML storage update trigger cache update
Could you please explain in the docs what that means?
When exactly, and how, does the cache get updated?
Traceback (most recent call last):
  File "app.py", line 2, in <module>
    from htmlparsing import Attr, Text
  File "/usr/local/lib/python2.7/dist-packages/htmlparsing-0.1.5-py2.7.egg/htmlparsing.py", line 21
    def __init__(self, text: str):
                           ^
SyntaxError: invalid syntax
Storage should support timeout.
Hi,
When I run the command toapi -v, I get the error below.
Please check and advise.
Thanks.
Traceback (most recent call last):
File "/usr/local/bin/toapi", line 9, in <module>
load_entry_point('toapi==2.1.2', 'console_scripts', 'toapi')()
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 542, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
return ep.load()
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2229, in load
return self.resolve()
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2235, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/usr/local/lib/python2.7/dist-packages/toapi/__init__.py", line 1, in <module>
from toapi.api import Api
File "/usr/local/lib/python2.7/dist-packages/toapi/api.py", line 20
def __init__(self, site: str = '', browser: str = None) -> None:
^
SyntaxError: invalid syntax
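Does toapi provide a way to do this?
I see that the settings definition has a parameter ajax=true. Where should I define the data to send with the ajax request?
I went through the docs and the issues and could not find it.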
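When writing items, the built-in XPath selector seems to return a processed value rather than an etree element.
So if, for example, I want all the text under an h1 (including its child tags), I cannot use string(.) and have to write an extra clean_xx method.
Also, could bs4 support be added?
Finally, I hope this project keeps getting better.
It is really great!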
/cache Not finished.
/storage Not finished.
/items Result is not as expected.
/status All right.
Good evening,
I have a question regarding the cache and its time to live.
Let's say I want to turn some site into an API and want the results of the very first request to be cached for one hour. How would I specify that in the settings? Is such a setup even possible?
I tried setting ttl: 60 * 60, assuming that would do the trick, but it does not seem to work.
Could you please clarify?
Thanks in advance.
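For reference, the behavior asked for here (keep the first result for one hour, then refetch) can be sketched as a small in-memory cache with per-entry expiry. This is only an illustration of the idea; TTLCache is a hypothetical class, and nothing here describes how toapi's own ttl setting is implemented.

```python
import time


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data = {}

    def set(self, key, value):
        # Store the value together with the moment it stops being valid.
        self._data[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Expired: drop the entry so the caller refetches.
            del self._data[key]
            return default
        return value
```

A caller would create it with TTLCache(ttl=60 * 60) and treat a returned default as a signal to fetch and parse the page again.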
The aim is that our API server does not affect the source site.
Save results into files.
The file's name is hash(path).
The file's content is the HTML.
Update strategy: what should be updated, and what should be stored permanently?
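The file layout described above could look roughly like the following sketch. FileStorage is a hypothetical name, and sha1 is an assumption standing in for hash(path): Python's built-in hash() for strings changes between processes, so a stable digest is used for the file name instead.

```python
import hashlib
import os


class FileStorage:
    """Store fetched HTML on disk, one file per request path."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _file_for(self, path):
        # A sha1 digest of the path is stable across runs.
        name = hashlib.sha1(path.encode('utf-8')).hexdigest()
        return os.path.join(self.directory, name)

    def save(self, path, html):
        with open(self._file_for(path), 'w', encoding='utf-8') as f:
            f.write(html)

    def get(self, path, default=None):
        try:
            with open(self._file_for(path), encoding='utf-8') as f:
                return f.read()
        except FileNotFoundError:
            return default
```

The update strategy question stays open in this sketch: every saved file is kept until something deletes or overwrites it.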
class MovieList(Item):
    __base_url__ = 'http://www.dy2018.com'
    url = XPath('//b//a[@class="ulink"]/@href')
    title = XPath('//b//a[@class="ulink"]/text()')

    class Meta:
        source = XPath('//table[@class="tbspan"]')
        route = {'/movies/?page=1': '/html/gndy/dyzz/',
                 '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html'}
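Currently the URL for fetching a resource generally looks like https://yoursite.com/https://targetsite.com/resource/path/
This leads to two problems:
A proposal:
Could an alias be added to Meta as a substitute or identifier for the source site's base_url, and inserted into the route as the first path segment, e.g. https://yoursite.com/<alias>/resource/path/? This would both distinguish multiple sites and solve the two problems mentioned above.
The example in the official repository (see the source) uses flask routing to implement custom routes. With multiple sites and multiple request paths, you write one set of routes in items and then another set there, which feels rather mechanical.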
Thanks.
class Meta:
    source = None
    route = {'/search/:id': '/search/:id'}
Right now the id from the host URL is passed directly into the source URL. Is there a way to modify the id before passing it on?
For example, I need to map the query 127.0.0.1:5000/search/1 to bing.com/search/100
So I would have to multiply :id by 100 before passing it as an argument. Not sure if that makes sense.
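To make the request concrete, here is a minimal sketch of a route conversion that applies a per-parameter transform before building the source path. The function convert and its transforms argument are hypothetical and not part of toapi; the sketch also assumes the route contains no regex metacharacters other than the :name placeholders.

```python
import re


def convert(route_pattern, source_template, path, transforms=None):
    """Map a local path to a source path, transforming captured params first."""
    transforms = transforms or {}
    # Turn '/search/:id' into the regex '/search/(?P<id>[^/]+)'.
    regex = re.sub(r':(\w+)', r'(?P<\1>[^/]+)', route_pattern)
    match = re.fullmatch(regex, path)
    if match is None:
        return None
    params = {
        name: transforms.get(name, lambda v: v)(value)
        for name, value in match.groupdict().items()
    }
    # Substitute each ':name' placeholder in the source template.
    return re.sub(r':(\w+)', lambda m: str(params[m.group(1)]), source_template)


def times_100(value):
    return int(value) * 100
```

Calling convert('/search/:id', '/search/:id', '/search/1', {'id': times_100}) would then produce the source path for id 100, while paths that do not match the route return None.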
Hello,
I need to get access to the raw HTML in one of my Item instances.
Currently the XPath or CSS selectors always convert the node to a string. In my use case, once I select a certain part of the webpage, I need to do some post-processing in my clean_ method.
But only a string is passed into it. Is there a way to get the raw HTML passed into my clean_ method for a given key?
Thank you,
A Web console interface.
Hello,
I am relatively new to Python web development, and I am mainly working on a mobile app. I found toapi to be a perfect companion for my backend requirements.
I am now almost ready to launch my app, but I am struggling to find a good production hosting environment for the toapi server code.
I am mainly looking at heroku, aws, or google app engine for hosting the server.
I was wondering if you could provide some instructions for deploying to a production-quality server. I did go over this deploy link, but I am still not able to relate its content to the actual toapi codebase.
Please advise on how I can move forward with this.
Thank you again,
Example:
[Sent] TIME, URL, LENGTH, STATUS
[Received] TIME, URL, LENGTH, STATUS
[Parsed] TIME, URL, ITEM, TOTAL, STATUS
[Cache] TIME, URL, STATUS
[Warning] TIME, DETAIL
etc.
This should be a log.py file.
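A possible shape for such a log.py, built on the standard logging module, is sketched below. The logger name and the helpers format_event, log_event, and configure are hypothetical; the timestamp is supplied by the formatter and placed before the stage tag, which is an assumption rather than the layout the project settled on.

```python
import logging

logger = logging.getLogger('toapi_sketch')  # hypothetical logger name


def format_event(stage, *fields):
    """Build a line such as '[Sent] http://example.com/, 1024, 200'."""
    return '[{}] {}'.format(stage, ', '.join(str(f) for f in fields))


def log_event(stage, *fields, level=logging.INFO):
    logger.log(level, format_event(stage, *fields))


def configure(stream=None):
    # asctime supplies the TIME column, so callers pass only the other fields.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
```

After configure(), a call such as log_event('Cache', url, 'HIT') would emit one timestamped line per event, and warnings can pass level=logging.WARNING.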
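When parsing an item, some of the entries are junk, such as low-quality ads.
Is there a way to remove these entries?
My approach is to override the methods of the Item class, but the values are nested layer by layer, so handling it this way is not very clean.
My skills are limited, so please bear with me.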
from collections import OrderedDict

from toapi import Item, XPath

class MyItem(Item):
    @classmethod
    def parse(cls, html):
        """Parse html to json"""
        if cls.Meta.source is None:
            return cls._parse_item(html)
        else:
            sections = cls.Meta.source.parse(html, is_source=True)
            results = []
            for section in sections:
                res = cls._parse_item(section)
                # Skip sections that a clean_* method rejected
                if res:
                    results.append(res)
            return results

    @classmethod
    def _parse_item(cls, html):
        item = OrderedDict()
        for name, selector in cls.__selectors__.items():
            try:
                item[name] = selector.parse(html)
            except Exception:
                item[name] = ''
            clean_method = getattr(cls, 'clean_%s' % name, None)
            if clean_method is not None:
                res = clean_method(cls, item[name])
                # A clean_* method returns None to drop the whole item
                if res is None:
                    return None
                else:
                    item[name] = res
        return item

class HotBook(MyItem):
    title = XPath('//a[@class="xst"]/text()[1]')

    def clean_title(self, title):
        # Keep only titles wrapped in 《 》, truncated to 10 characters
        if '《' in title:
            return title[title.find('\u300a') + 1:title.find('\u300b')][:10]
        else:
            return None

    class Meta:
        source = XPath('//tbody[@class="thread_tbody"]')
        route = {'/hotbook?page=:page': '/forum-171-:page.html'}
class MyItem(Item):
    class Meta:
        route = r'\.+'

    def clean_data(self, data):
        """Do something here"""
        cleaned_data = data  # placeholder: apply the cleaning logic here
        return cleaned_data
In working with toapi I came across a scenario where the web page had an HTML table that was paginated.
Clicking on "next page" would issue an ajax post request to fetch the next set of records in the data set.
Is there any way to accomplish this with toapi?
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from toapi import XPath, Item, Api
  File "/Library/Python/2.7/site-packages/toapi/__init__.py", line 2, in <module>
    from toapi.item import *
  File "/Library/Python/2.7/site-packages/toapi/item.py", line 17
    class Item(metaclass=ItemType):
                        ^
SyntaxError: invalid syntax
It works normally with xiafufang.com, but toapi-pic does not work. What should I configure for an HTTPS website? Thanks.
Hi
The site I am scraping has urls like:
http://remote.com/i-love-cats/1
http://remote.com/dogs-are-really-great/2
http://remote.com/pupies_kanGourOUs/3
I want to match them with local urls like those:
http://localhost:5000/1
http://localhost:5000/2
http://localhost:5000/3
Is there some magical way to do it?
Or do I need to do as in #107 and also add custom code with an external two-column db table to match
1 => http://remote.com/i-love-cats/1
...
Sure, as an alternative I could add a route like
@api.route('page/{complete_remote_url}', '{complete_remote_url}')
and call it like:
wget http://localhost:5000/page/http://remote.com/i-love-cats/1
but I want to hide the scraped site's URL so that the caller does not see it.
Current result:
{'url': ['https://*'],
 'title': ['Making It ']}
Expected:
{'url': 'https://*',
 'title': 'Making It'}
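The conversion from the current shape to the desired one can be sketched as a small post-processing step. The helper unwrap_single is a hypothetical name; it assumes that a one-element list should collapse to its only element and that string values should be stripped, which is what the two dictionaries above suggest.

```python
def unwrap_single(item):
    """Replace one-element lists with their only element; strip str values."""
    result = {}
    for key, value in item.items():
        if isinstance(value, list) and len(value) == 1:
            value = value[0]
        if isinstance(value, str):
            value = value.strip()
        result[key] = value
    return result
```

Lists with zero or several elements are left untouched, so fields that legitimately hold multiple values keep their list form.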
[root@test python3]# python --version
Python 3.6.2
[root@test python3]# toapi new toapi/toapi-pic
Usage: toapi [OPTIONS] COMMAND [ARGS]...
Error: No such command "new".
help ?
Python version is 3.5, toapi version is 0.2.2
toapi new api
cd api
toapi run
Running toapi run raises an error:
➜ api toapi run
Traceback (most recent call last):
File "/usr/local/bin/toapi", line 9, in <module>
load_entry_point('toapi==0.2.2', 'console_scripts', 'toapi')()
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/toapi/cli.py", line 81, in run
app = importlib.import_module('app', base_path)
File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 986, in _gcd_import
File "<frozen importlib._bootstrap>", line 969, in _find_and_load
File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 665, in exec_module
File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
File "/home/zz/code/python/toapi_project/api/app.py", line 7, in <module>
api.register(Page)
File "/usr/local/lib/python3.5/dist-packages/toapi/api.py", line 31, in register
item.__pattern__ = re.compile(item.__base_url__ + item.Meta.route)
TypeError: Can't convert 'dict' object to str implicitly
item.__base_url__ is a str, while item.Meta.route is a dict.
Whenever we run the app, we should check that the settings are correct, just like the django runserver command does.
python 3.7
toapi 2.1.1
Traceback (most recent call last):
File "main.py", line 5, in <module>
api = Api()
File "/usr/local/lib/python3.7/site-packages/toapi/api.py", line 24, in __init__
self.__init_server()
File "/usr/local/lib/python3.7/site-packages/toapi/api.py", line 27, in __init_server
self.app.logger.setLevel(logging.ERROR)
AttributeError: module 'flask.logging' has no attribute 'ERROR'
toapi currently handles retrieved page elements in two ways:
Suppose we have the following DOM structure:
<div class="block">
<p class="p1">
hello, P1<br/>
this is new line
</p>
<p class="p2">
<span>this is a line</span>
<a href="#">this is a link</a>
</p>
</div>
Given the following item:
class Demo(Item):
    block = Css('div.block')
    p = Css('div.block > p')
    p1 = Css('div.block > p.p1')
the result will be:
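Now consider a case where I want the HTML content of the matched element node rather than all of its text. For example, .p1 contains a line break <br/>, and I want to get the HTML content under that element, that is, everything including the <br/> tag and the text, and then replace <br/> with '\n' myself in a clean_* method. As the result above shows, however, p1 comes back with its text content already extracted. If I want the original content of p1, I have to fetch a group of elements through p, iterate to find p1, convert it to a string, and only then process it.
The above is just a simple use case; real applications will of course involve more than replacing line breaks. So I suggest giving selectors an optional attribute raw, defaulting to False, i.e. Selector(rule, raw=False). Combined with the clean_* methods, this would give developers more freedom in processing the retrieved content.
That is all.
Addendum:
There is also the Regex selector. With it you can get the original content you want; the other two selectors, Css and XPath, behave as described above.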
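The proposed raw option can be illustrated with the standard library alone. In the sketch below, Selector is a hypothetical class written for this example (it is not toapi's selector), the rule is an ElementTree path rather than a full CSS or XPath expression, and the markup is a shortened version of the DOM above.

```python
import xml.etree.ElementTree as ET


class Selector:
    """Hypothetical selector: returns text by default, markup when raw=True."""

    def __init__(self, rule, raw=False):
        self.rule = rule  # an ElementTree path, not a full CSS or XPath rule
        self.raw = raw

    def parse(self, root):
        node = root.find(self.rule)
        if node is None:
            return None
        if self.raw:
            # Serialize the node, tags included, as a str.
            return ET.tostring(node, encoding='unicode')
        return ''.join(node.itertext())


def clean_p1(raw_html):
    # Example of post-processing a clean_* method could do with raw markup.
    return raw_html.replace('<br />', '\n')


DOC = ET.fromstring(
    '<div class="block"><p class="p1">hello, P1<br/>this is new line</p></div>'
)
```

With raw=False the line break is lost in the joined text, while raw=True keeps the tag so that a clean_* style function can decide what to do with it.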
If sending the request fails, don't write to storage.
If parsing fails, don't write to the cache.
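These two rules can be sketched as a small fetch pipeline. The function get_item and its arguments are hypothetical; storage and cache are plain dicts here, and fetch and parse are whatever callables the server uses, so this only shows the ordering of the writes.

```python
def get_item(url, fetch, parse, storage, cache):
    """Return parsed data for url, writing storage and cache only on success."""
    if url in cache:
        return cache[url]
    html = storage.get(url)
    if html is None:
        try:
            html = fetch(url)
        except Exception:
            return None  # send error: storage is left untouched
        storage[url] = html
    try:
        result = parse(html)
    except Exception:
        return None  # parse error: cache is left untouched
    cache[url] = result
    return result


def failing(_):
    """Helper for trying out the error paths."""
    raise ValueError('simulated failure')
```

A failed fetch leaves both stores empty, and a failed parse leaves the HTML in storage but nothing in the cache, so a later request can retry parsing without hitting the source site again.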
Hello,
I plan to use this awesome tool to fetch some content from a dynamic website. The cache expiration works for standard use cases.
However, I was wondering if there is a way in the API to perform a complete fetch from the website instead of using the cache or the storage. A use case would be on a mobile device, when the user does a refresh to see the latest content from the website.
Any ideas?
Thank you for this...
If source in Meta is not None, the html passed to a selector is an etree Element object. The Regex selector's implementation does not check the html with isinstance(html, etree._Element), so a Regex selector in an item cannot extract anything: calling re.findall(self.rule, html) raises the error expected string or bytes-like object.
Fix:
toapi/selector.py
class Regex(Selector):
    """Regex expression"""
    def parse(self, html):
        if isinstance(html, etree._Element):
            # Serialize to str (not bytes) so a str pattern can match it
            html = etree.tostring(html, encoding='unicode')
        return re.findall(self.rule, html)
Bug output
$ toapi run
2017/12/17 19:07:28 [Serving ] OK http://127.0.0.1:5000
Traceback (most recent call last):
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/bin/toapi", line 11, in <module>
sys.exit(cli())
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/toapi/cli.py", line 61, in run
app.api.serve(ip=ip, port=port)
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/toapi/api.py", line 42, in serve
self.app.run(ip, port, debug=False, **options)
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/flask/app.py", line 841, in run
run_simple(host, port, self, **options)
File "/home/user/.local/share/virtualenvs/toapi_test-UdiKVlKi/lib/python3.5/site-packages/werkzeug/serving.py", line 733, in run_simple
raise TypeError('port must be an integer')
TypeError: port must be an integer
toapi new [project_name]
toapi run
toapi info
toapi status
toapi clear storage
toapi clear cache
Suppose we have some permanently stored HTML.
Now we should improve the performance of the API service.
Congratulations on creating an awesome lib with great, beautiful code!
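Chinese characters in a route usually show up in searches. Looking at the implementation in toapi/api.py, I found that _alias_re_string does not match Chinese characters. After changing the pattern to '(?P<{}>[A-Za-z0-9_?&/=\u4e00-\u9fa5]+)', matching Chinese worked in my tests.
Whitespace is not matched either, so a multi-keyword search only matches the first keyword. Changing the pattern to '(?P<{}>[A-Za-z0-9_?&/=\s\u4e00-\u9fa5]+)' matches multiple keywords.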
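The difference between the two patterns can be checked with a few lines of Python. The helper capture and the /search/ prefix are assumptions made for this illustration; only the two pattern strings come from the report above.

```python
import re

# The two patterns quoted above; '{}' is filled with the parameter name.
WITH_CHINESE = r'(?P<{}>[A-Za-z0-9_?&/=\u4e00-\u9fa5]+)'
WITH_CHINESE_AND_SPACE = r'(?P<{}>[A-Za-z0-9_?&/=\s\u4e00-\u9fa5]+)'


def capture(pattern, name, path):
    """Return what the named group captures after a '/search/' prefix."""
    match = re.match('/search/' + pattern.format(name), path)
    return match.group(name) if match else None
```

With a two-keyword query, the first pattern stops at the space, while the second captures both keywords.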
At present, we define route as follows:
route = {'/movies/?page=1': '/html/gndy/dyzz/',
         '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html',
         '/movies/': '/html/gndy/dyzz/'}
The problem is the ordering: a plain dict does not guarantee which rule is tried first.
Use a tuple or an OrderedDict.
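A first-match-wins lookup over an ordered sequence of routes might look like the sketch below. ROUTES reuses the three rules from above as a tuple of pairs; to_regex and resolve are hypothetical helpers, and the character class used for parameters is an assumption.

```python
import re

ROUTES = (
    ('/movies/?page=1', '/html/gndy/dyzz/'),
    ('/movies/?page=:page', '/html/gndy/dyzz/index_:page.html'),
    ('/movies/', '/html/gndy/dyzz/'),
)


def to_regex(route):
    """Escape the literal parts of a route and turn ':name' into a named group."""
    # re.split with one capture group alternates literal text and param names.
    parts = re.split(r':(\w+)', route)
    pieces = [
        '(?P<{}>[^/&]+)'.format(part) if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    ]
    return ''.join(pieces)


def resolve(path, routes=ROUTES):
    """Return the source path of the first route that matches, else None."""
    for route, source in routes:
        match = re.fullmatch(to_regex(route), path)
        if match:
            for name, value in match.groupdict().items():
                source = source.replace(':' + name, value)
            return source
    return None
```

Because the specific page=1 rule is listed before the generic :page rule, page 1 resolves to the index path without a page suffix; swapping the two entries would change that result.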
We don't want API users to learn anything about the source site.
Example:
http://127.0.0.1/adauoai1duaou/
The `/adauoai1duaou/` is an encrypted URL which means `/users/`
We should encrypt all the relative URLs in all items. Such as
{
'title':'My Life',
'url':'/movie/2017/'
}
Convert it to
{
'title':'My Life',
'url':'/uo23uodaoi123udo/'
}
I need an encrypt.py file.
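One way to get this effect with only the standard library is to hand out opaque tokens and keep a server-side table that maps them back. Note that this swaps in a keyed hash plus lookup table in place of real encryption; UrlMasker, its methods, and the 16-character token length are all hypothetical choices for this sketch.

```python
import hashlib
import hmac


class UrlMasker:
    """Map relative URLs to opaque tokens and back (a lookup table, not a cipher)."""

    def __init__(self, secret):
        self._secret = secret.encode('utf-8')
        self._paths = {}  # token -> original path

    def mask(self, path):
        digest = hmac.new(self._secret, path.encode('utf-8'), hashlib.sha256).hexdigest()
        token = '/{}/'.format(digest[:16])
        self._paths[token] = path
        return token

    def unmask(self, token):
        return self._paths.get(token)

    def mask_item(self, item, keys=('url',)):
        # Only relative URLs (starting with '/') under the given keys are masked.
        return {
            k: self.mask(v) if k in keys and isinstance(v, str) and v.startswith('/') else v
            for k, v in item.items()
        }
```

The same path always produces the same token for a given secret, so links stay stable between responses, and an incoming request for a token can be translated back with unmask before the source site is queried.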