henryhaohao / wenshu_spider

:rainbow: Wenshu_Spider: Scrapy spider for case data from China Judgements Online (latest version as of 2019-01-09)

Home Page: http://wenshu.court.gov.cn/

License: MIT License

Python 100.00%
wenshu scrapy proxy-server abuyun judgement decrypt

wenshu_spider's Issues

Modified based on the existing code: asynchronous storage tested and passing

import time

import pymongo
from pymongo.errors import DuplicateKeyError
from scrapy.utils.project import get_project_settings
from twisted.internet import defer, reactor

settings = get_project_settings()


class MyspiderPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        docname = settings['MONGODB_DOCNAME']
        self.client = pymongo.MongoClient(host=host, port=port)
        db = self.client[dbname]
        self.post = db[docname]

    def close_spider(self, spider):
        self.client.close()

    # The part below is the key change: run the blocking insert in a
    # separate thread and hand the result back through a Deferred.
    @defer.inlineCallbacks
    def process_item(self, item, spider):
        out = defer.Deferred()
        reactor.callInThread(self._insert, item, out, spider)
        yield out
        defer.returnValue(item)

    def _insert(self, item, out, spider):
        time.sleep(10)  # kept from the test run to simulate a slow write
        try:
            self.post.insert_one(dict(item))
        except DuplicateKeyError:
            # Same unique index value means a duplicate record: log and skip.
            spider.logger.debug('duplicate key error collection')
        reactor.callFromThread(out.callback, item)
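
For reference, Twisted ships a shortcut for this pattern: twisted.internet.threads.deferToThread runs a blocking call in the reactor's thread pool and returns a Deferred that fires with the result. A minimal sketch of the same process_item written with it (duplicate-key handling omitted; self.post is the collection set up above):

    from twisted.internet import defer
    from twisted.internet.threads import deferToThread

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        # deferToThread pushes the blocking insert onto Twisted's thread
        # pool; yielding the Deferred keeps the pipeline non-blocking.
        yield deferToThread(self.post.insert_one, dict(item))
        defer.returnValue(item)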

Hello, how can I iterate over all the documents on the wenshu site?

Since the site returns so few results per query, my plan was to combine location with date. But it seems impossible to fetch by a date range on its own: a condition like param:"案件类型:民事案件,中级法院:北京市第二中级人民法院,裁判日期:2018-11-13 TO 2018-11-20" (case type: civil; court: Beijing No. 2 Intermediate People's Court; judgment date: 2018-11-13 TO 2018-11-20) returns nothing, while dropping the date and using param:"案件类型:民事案件,中级法院:北京市第二中级人民法院" works.
I'm puzzled; there seems to be no other way to narrow the search.
Thanks.
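
If the date filter can be made to accept some syntax (the exact format is the open question here), the standard way to enumerate everything is to partition the search space until each slice stays under the site's result cap. A hypothetical sketch of day-sized slices, reusing the Param fields quoted above:

    # Hypothetical: emit one Param per day so no single query hits the cap.
    # Field values are copied from the issue; whether the site accepts this
    # exact date syntax is precisely what is in doubt.
    from datetime import date, timedelta

    BASE = '案件类型:民事案件,中级法院:北京市第二中级人民法院'

    def daily_params(start, end):
        day = start
        while day <= end:
            d = day.isoformat()
            yield '{0},裁判日期:{1} TO {1}'.format(BASE, d)
            day += timedelta(days=1)

    for param in daily_params(date(2018, 11, 13), date(2018, 11, 20)):
        print(param)  # feed each slice to the spider's search request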

The crawl always stalls after a certain number of pages; has anyone else run into this?

2019-01-02 15:08:21 [scrapy.extensions.logstats] INFO: Crawled 1293 pages (at 0 pages/min), scraped 585 items (at 0 items/min)
2019-01-02 15:09:21 [scrapy.extensions.logstats] INFO: Crawled 1293 pages (at 0 pages/min), scraped 585 items (at 0 items/min)
[the identical line repeats every minute through 15:25:21; the crawl never advances past 1293 pages / 585 items]

Hey, expert

Why does it stop right after this step and never continue?
[screenshot _20181213163815 not shown]

Error when running under a Conda environment

(base) E:\2018CoutDocu\1HenryhaohaoWenshu_Spider\Wenshu_Spider-master\Wenshu_Project>scrapy crawl Wenshu
Traceback (most recent call last):
  File "C:\Anaconda3\Scripts\scrapy-script.py", line 10, in <module>
    sys.exit(execute())
  File "C:\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 149, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "C:\Anaconda3\lib\site-packages\scrapy\crawler.py", line 252, in __init__
    log_scrapy_info(self.settings)
  File "C:\Anaconda3\lib\site-packages\scrapy\utils\log.py", line 149, in log_scrapy_info
    for name, version in scrapy_components_versions()
  File "C:\Anaconda3\lib\site-packages\scrapy\utils\versions.py", line 35, in scrapy_components_versions
    ("pyOpenSSL", _get_openssl_version()),
  File "C:\Anaconda3\lib\site-packages\scrapy\utils\versions.py", line 43, in _get_openssl_version
    import OpenSSL
  File "C:\Anaconda3\lib\site-packages\OpenSSL\__init__.py", line 8, in <module>
    from OpenSSL import rand, crypto, SSL
  File "C:\Anaconda3\lib\site-packages\OpenSSL\rand.py", line 12, in <module>
    from OpenSSL._util import (
  File "C:\Anaconda3\lib\site-packages\OpenSSL\_util.py", line 6, in <module>
    from cryptography.hazmat.bindings.openssl.binding import Binding
  File "C:\Anaconda3\lib\site-packages\cryptography\hazmat\bindings\openssl\binding.py", line 12, in <module>
    from cryptography.hazmat.bindings._openssl import ffi, lib
ImportError: DLL load failed: The operating system cannot run %1.

The encryption seems to have changed; the docid can no longer be fetched

I'm getting several kinds of errors that I don't understand. Could you take a look?
1.
Traceback (most recent call last):
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/wenshu_monitor/Wenshu/spiders/wenshu.py", line 118, in get_docid
    result = eval(json.loads(html))
  File "<string>", line 1
    ["RunEval":"w61aSW7Cg0AQfAvClg/Cg8KIw7IBw6TCk8KfwpDDowhZFnZiDjHDkcKYwpwsw789QCwCw4wCwoTCrRlKQmUxa3V3dQ8gby/DkcOpfAtFw7TClcOsw54SEV0/XsOfRcO8wrnCvxzDhT4+wp3CmcOneDwAwpDChhc4AcKwDsKFw5hgw4hKw5IVVQnCsQXDgMOvw7AsAMORwoPDhxBcw7gTwo7CgcOBwoDDlUcAwrrCgyvDoUXDmA9AaBA4AMKiAj/DgTjCmATCgBBswrYGADkBwqApAAA6wrPDnC4AVXB9w7/CsMObwoTDscO1wpbCiMOvMMKJw4XDhj/DsEPCkF7CjHnDt0c2wqzCuMOuD8KXw7/DjsOkKQTCgcOHwpzCvMODw6XDlTXCs8KOw6bCnsO8dsO7w7fCl2vDjsKyw6Z0BHfCsh4ew7DDtMKnwrZhdcK4w7PCnMKgw5/CrMOGWznDlsOIwqnCvMKawrrCvcOYKmctwrtIwqLCoMK0w4HCsHZBwrLCnUdLwrfCjcK0W8KOw6gywqzDs8OYw79Nw6gxwqvDr8OUQcOmD8K3SFUecsKmw6YXwqslw71wOw/DqcKNUsKuwpjDtMOdTcOeM0VEYVfCpcKiwqXCrUV5NcOWHcKSNk3CtXrCjy3CmkrDryhUJ8OdR2PCpsOqwqwcwpnDhMO0ZmscUG7DlcKdwpdTwolAOsKIw5U8Ryk4w7XCnU3DjwTDj8OHwq7CpcOJZSbCuTXDrz0nwrHCgQnDkD5dF3DDtsOTYGEsCFQMb2nCvcKqwq7CqsOrbMO9ccKrw5paXsKQw7d4RSPDpDzDsMORw6teLyLCg8K8wqzDukBBSxbCusOUwpVywrvDvTfCqMK+dlp0wpzCkMKMNjVmw6ZEwpYrwocpwo8nPhtQw6TCu0Z6w4zCj0ZrdMOpw6wcwpfDoMKfZsKiFsK9K8KOwrswwrnCisOnMsOXw78B",]
              ^
SyntaxError: invalid syntax

2.
Traceback (most recent call last):
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/wenshu_monitor/Wenshu/spiders/wenshu.py", line 123, in get_docid
    docid = self.js_2.call('getdocid', runeval, casewenshuid)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_abstract_runtime_context.py", line 37, in call
    return self._call(name, *args)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_external_runtime.py", line 92, in _call
    return self._eval("{identifier}.apply(this, {args})".format(identifier=identifier, args=args))
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_external_runtime.py", line 78, in _eval
    return self.exec_(code)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_abstract_runtime_context.py", line 18, in exec_
    return self._exec_(source)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_external_runtime.py", line 88, in _exec_
    return self._extract_result(output)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/execjs/_external_runtime.py", line 167, in _extract_result
    raise ProgramError(value)
execjs._exceptions.ProgramError: Error: Malformed UTF-8 data

3.
Traceback (most recent call last):
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/venv/wenshu-venv/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/FiveMeter/Desktop/kaoputou-project/wenshu_monitor/Wenshu/spiders/wenshu.py", line 150, in get_detail
    content_1 = json.loads(re.search(r'JSON.stringify\((.*?)\);', html).group(1))
AttributeError: 'NoneType' object has no attribute 'group'

Also, under my filter conditions the reported numbers, such as the result count, are wrong; they look like random values.
Could you explain? Thanks.
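
For what it's worth, errors 1 and 3 share one cause: the code assumes the response always has the expected shape. eval() fails because ["RunEval":"...",] is neither valid JSON nor a valid Python literal, and re.search() returns None when the page lacks the JSON.stringify(...) fragment. A hypothetical guard (the helper name is illustrative, not from the repo):

    import json
    import re

    def extract_runeval(html):
        # Outer layer is a JSON-encoded string; unwrap it first.
        body = json.loads(html)
        # Pull the blob out with a regex instead of eval(), so malformed
        # payloads yield None (log and retry) rather than a crash.
        match = re.search(r'"RunEval":"(.*?)"', body)
        return match.group(1) if match else None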

get_docid.js call fails

When fetching the docid I keep getting: execjs._exceptions.ProgramError: TypeError: 'key' is null or not an object

Thanks

Thanks a lot! Tested it and it works, haha. Awesome. I want to open-source my own work someday too.

There is another docid encryption variant, and some records cannot be decrypted

"[{\"RunEval\":\"w63Cm8OdbsKCQBDChcKfBcOjw4USwprCvgDDscOKR8Oow6XChsKYBm3DtcKiw5Igwr0ywr57w4FSKsKwAsKWRXbDpUvDiDHCsD9zw6bDjMOsLGzDonzCu1tvDmHCvMO7TBYvScK8w5vCvz/Cv8OFw5HDh3LDuxovwqPDtUZ4wo4nA8OAaHhCBGAaGcOyCMKOTGTCuVLClcKILcKAw64oCxA9FCPCuEgJwpBOw7gKEBXCsg0gwqfCkBc5AMOiwoPDuABqwqMJegJIDsKQMMOowoRzCMKDw4PDhMKQCAAkAsOew6A1QMKeAcKABnB9f8K1wpjChcORw77CkMOEX2ESw4UzfyVXQXoJw6EdT1nCt8OiOsKeXMO5M1LCphAEwp5ww44Nwq4sw4/CmTXCtMK3wpxvw6d/w786wpIie8Kcw7bCkE7DliIDwpnDvlQMwpbCuzvDucKAw6vDhirCvMKVfRs5XcOOwqZ+XsKYOsKzwq5LVMKjwqDDtMKhYcOuwoJkN0Uvbltpd8OscUvCt8Kbw7vDvm9Awo9RfcKHahnDnycXwobDoMKow4/CqsOmw6l0w45UXcKLwqrCqsOSw5vDtSHDpFQ4USrCj8KDwpnCu8ODw6zCg8Kmw6rCgHEYwpfDhMKVwp1sw7BAw53DlcKOw4JYw77CjnBHw7vCv3LDi8OSwrIbR8KLwpFCYMKQw60iw7llw6sLwpvChcKiGMK7C8KPTsOFIFfDmsO5wphGw5ZcUsKGM8KDw57Co8OTwrPChmNOVFQ+w647HV3DmMKqwrrDojDCo8Omwr9UfQ1df1nDq3UcY8KXwr5Qw5bDhcKHY07DrcKhacKsZMO5woPDrsOHw499w4nCscKHGsOEw4fCtl3CrGkhw5crR8OTOn7CmF3Dh1bDnsK2Sl1kwpbDlFMPR1JpWnVudA84wqbCmA5vG8KLwrErXMO/Gw==\",\"Count\":\"1\"},{\"裁判要旨段原文\":\"本院认为,原告方国生在被告**人寿保险股份有限公司唐山市路北支公司投保了《保险合同》,系双方真实意思表示,合法有效,双方均应按约定及法律规定履行相应的权利义务。第三人王燕军提交的书证作为授权委托书应当准确的载明代理的具体事项、权限、时间等内容,该书证未写明时间\",\"不公开理由\":\"\",\"案件类型\":\"2\",\"裁判日期\":\"2017-03-28\",\"案件名称\":\"方国生与**人寿保险股份有限公司唐山市路北支公司保险纠纷一审民事判决书\",\"文书ID\":\"FcOOwrsRBDEIBcOBwpTDhB/DjCcEw7nCh3R7w544U8OVwpIcacKMw60Dwr7CmyzDuWXDicOsS8KBXms2w6cVwqtpPcKlAlR2wpPCmsOfHcKIOXTDojrCoyVfwrPCuMKTw5fCnDxCS1IRc8K1woPCucO2wpHDsBDDtywrwpgzw6tew7fCrcOYw4fCpsOew7owGUFtEgllw4rDr8Kyw4ARwr49eGfDlQM2wrQiwonCjzTDhSHDiMORwoB7w4Eqw45fwrPDuyDDqXXDvMKYw78A\",\"审判程序\":\"一审\",\"案号\":\"(2016)冀0203民初1093号\",\"法院名称\":\"唐山市路北区人民法院\"}]"

The example above, for instance, cannot be parsed.
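
For context, the escaped string above does decode as ordinary JSON once unwrapped: element 0 carries RunEval and Count, and each following element is a case record whose 文书ID field is still encrypted (the variant that reportedly fails to decrypt). A minimal sketch of taking it apart, shown here on a truncated stand-in for the payload:

    import json

    # Truncated stand-in for the full escaped string quoted above.
    raw = '"[{\\"RunEval\\":\\"...\\",\\"Count\\":\\"1\\"},{\\"文书ID\\":\\"...\\",\\"案号\\":\\"(2016)冀0203民初1093号\\"}]"'

    inner = json.loads(raw)    # unwrap: the value is a JSON-encoded string...
    items = json.loads(inner)  # ...containing a JSON array of dicts

    runeval = items[0]['RunEval']  # key material for the JS decryption step
    for case in items[1:]:
        # 文书ID is still encrypted; this is where decryption reportedly fails.
        print(case.get('案号'), case.get('文书ID'))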

Could you share a copy of the crawled data? I can't write Python, but I'd like a dataset!

As the title says: I don't know Python, and running the downloaded source never successfully crawled anything, but I'd still like a copy of the data!

2018-12-17 10:11:06 [scrapy.core.engine] INFO: Spider opened
2018-12-17 10:11:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-17 10:11:08 [scrapy.core.scraper] ERROR: Spider error processing <POST http://wenshu.court.gov.cn/List/ListContent> (referer: http://wenshu.court.gov.cn/ValiCode/GetCode)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/koy/git/wenshu/Wenshu_Spider/Wenshu_Project/Wenshu/spiders/wenshu.py", line 66, in get_content
    result = eval(json.loads(html))
  File "<string>", line 1, in <module>
NameError: name 'remind' is not defined
[the identical ERROR and traceback are logged a second time]
2018-12-17 10:11:08 [scrapy.core.engine] INFO: Closing spider (finished)
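
The NameError shows what eval() actually received: not a list literal but the bare word remind, evidently an anti-crawler reply, which eval() then treats as an undefined Python name. A hypothetical guard for get_content that checks the payload before evaluating it:

    import json

    def parse_list_content(html, spider):
        # Hypothetical guard: detect anti-crawler replies before eval().
        data = json.loads(html)  # outer layer is a JSON-encoded string
        if not data.lstrip().startswith('['):
            # Blocked replies come back as bare words such as 'remind';
            # log and retry (e.g. with a fresh proxy) instead of crashing.
            spider.logger.warning('anti-crawler response: %r', data[:50])
            return None
        return eval(data)  # upstream code eval()s the list literal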
