will4906 / patentcrawler
Scrapy patent crawler (no longer maintained)
License: Apache License 2.0
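After configuring a proxy, I frequently get CAPTCHAs. Once one is solved and the proxy is switched, another CAPTCHA appears. I am not sure whether I am doing something wrong.
Also, could you add a way to adjust the crawl speed? In some cases speed is not the priority and actually getting the data matters more. For example, right now I am fairly sure that both my IP and my account have been blocked (I was not using a proxy).
Just a suggestion, thanks~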
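As a rough sketch of what crawl-speed control could look like, the following fragment uses Scrapy's built-in throttling settings. The setting names are standard Scrapy settings; the values are illustrative assumptions, not tuned for this site, and the project may expose its own configuration on top of them.

```python
# settings.py fragment (values are assumptions chosen for illustration)

# Wait between consecutive requests to the same site, in seconds.
DOWNLOAD_DELAY = 5.0
# Vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
# Send only one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to the observed server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

A slower crawl lowers the request rate but does not by itself guarantee that CAPTCHAs or blocks stop appearing.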
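A question: when I use the application date and the publication date as query conditions, the response to pagination requests is wrong. Do you know what is going on?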
2018-05-30 15:33:15 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: crawler)
2018-05-30 15:33:15 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2n 7 Dec 2017), cryptography 2.1.4, Platform Windows-10-10.0.16299-SP0
2018-05-30 15:33:15 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'crawler', 'COOKIES_DEBUG': True, 'DOWNLOAD_DELAY': 1.0, 'DOWNLOAD_TIMEOUT': 10, 'LOG_FILE': 'C:\Users\myh\Desktop\PatentCrawler-master\output\20180530_153315\PatentCrawler.log', 'NEWSPIDER_MODULE': 'crawler.spiders', 'RETRY_TIMES': 3, 'SPIDER_MODULES': ['crawler.spiders']}
2018-05-30 15:33:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-05-30 15:33:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'crawler.middlewares.PatentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-30 15:33:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-30 15:33:16 [scrapy.middleware] INFO: Enabled item pipelines:
['crawler.pipelines.CrawlerPipeline']
2018-05-30 15:33:16 [scrapy.core.engine] INFO: Spider opened
2018-05-30 15:33:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-30 15:33:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 1 times): unlogin
2018-05-30 15:33:19 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Cookie: JSESSIONID=x1Sv9YxmnHdXesCJk04Y3SMqTX3yBIpnhcwf0uKlEOg9TlE-gYYY!309799008!187544033; IS_LOGIN=true; WEE_SID=x1Sv9YxmnHdXesCJk04Y3SMqTX3yBIpnhcwf0uKlEOg9TlE-gYYY!309799008!187544033!1527665495142
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 2 times): unlogin
2018-05-30 15:33:20 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Cookie: JSESSIONID=enOv9ZPDdp7oeLhqlYjU_gHhiJA63dF52InwKDPUfwSJwT4OC0x4!309799008!187544033; IS_LOGIN=true; WEE_SID=enOv9ZPDdp7oeLhqlYjU_gHhiJA63dF52InwKDPUfwSJwT4OC0x4!309799008!187544033!1527665497027
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 3 times): unlogin
2018-05-30 15:33:22 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Cookie: JSESSIONID=fdyv9Zmxa7oMcWvdvBHwiuh8nvKhmeaYnZ03iat0rUfX2SfDs-5E!309799008!187544033; IS_LOGIN=true; WEE_SID=fdyv9Zmxa7oMcWvdvBHwiuh8nvKhmeaYnZ03iat0rUfX2SfDs-5E!309799008!187544033!1527665498545
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:24 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:24 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:24 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:24 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 4 times): unlogin
2018-05-30 15:33:24 [scrapy.core.scraper] ERROR: Error downloading <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Traceback (most recent call last):
File "D:\Program Files (x86)\anaconda\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "D:\Program Files (x86)\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "D:\Program Files (x86)\anaconda\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <404 http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Program Files (x86)\anaconda\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "D:\Program Files (x86)\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 56, in process_response
(six.get_method_self(method).__class__.__name__, type(response))
AssertionError: Middleware PatentMiddleware.process_response must return Response or Request, got <class 'NoneType'>
2018-05-30 15:33:24 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-30 15:33:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4368,
'downloader/request_count': 4,
'downloader/request_method_count/POST': 4,
'downloader/response_bytes': 6301,
'downloader/response_count': 4,
'downloader/response_status_count/404': 4,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 30, 7, 33, 24, 666286),
'log_count/DEBUG': 56,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'retry/count': 3,
'retry/max_reached': 1,
'retry/reason_count/unlogin': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2018, 5, 30, 7, 33, 16, 985230)}
2018-05-30 15:33:24 [scrapy.core.engine] INFO: Spider closed (finished)
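The AssertionError in the log above says that `PatentMiddleware.process_response` returned `None`, while Scrapy requires a downloader middleware's `process_response` to return a Response or a Request. The sketch below shows one way to guarantee a return value on every path. It is not the project's code: the class name, the `relogin_attempts` meta key, the retry cap, and the "404 means logged out" check are all assumptions made for illustration.

```python
MAX_RELOGIN_ATTEMPTS = 3  # assumed cap, not taken from the project


class AlwaysReturnsMiddleware:
    """Hypothetical downloader middleware that never returns None."""

    def process_response(self, request, response, spider):
        # Normal case: hand the response on unchanged.
        if not self.looks_logged_out(response):
            return response

        attempts = request.meta.get("relogin_attempts", 0)
        if attempts >= MAX_RELOGIN_ATTEMPTS:
            # Give up, but still return a Response so Scrapy's check passes.
            return response

        # Returning a Request asks Scrapy to schedule it again.
        meta = dict(request.meta, relogin_attempts=attempts + 1)
        return request.replace(meta=meta, dont_filter=True)

    @staticmethod
    def looks_logged_out(response):
        # Assumption: the log shows 404 responses alongside "unlogin".
        return response.status == 404
```

Every branch ends in an explicit `return`, so the failure mode in the traceback cannot occur. Whether a 404 really signals a lost login on this site is not confirmed by the log.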
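Hello, and first of all thank you for sharing this code~
Some background: I am on a Mac laptop, and while setting up the environment I could not get the pypiwin32 library to install no matter what I tried. Searching online, I found that pypiwin32 is a library for accessing the Windows system API. So is that why the installation fails?
I have two questions. First, what does this library do in the code? Second, does it need to be installed on macOS, and if so, is there a comparable library?
Looking forward to your reply, best wishes~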
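The source does not say what the project uses pypiwin32 for. As a general pattern, Windows-only functionality can be guarded so that the rest of a program still runs on macOS or Linux. The helper name below is hypothetical; `win32api` is one of the modules that pypiwin32 (pywin32) provides.

```python
import sys


def load_win32_api():
    """Return the win32api module on Windows, or None on other platforms."""
    if not sys.platform.startswith("win"):
        return None
    try:
        import win32api  # provided by pypiwin32 / pywin32, Windows only
    except ImportError:
        return None
    return win32api


win32 = load_win32_api()
if win32 is None:
    print("Windows API not available; skipping Windows-specific features")
```

In a requirements file, the same idea can be expressed with an environment marker such as `pypiwin32; sys_platform == "win32"`, which makes pip skip the package on other systems.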
Please list the required libraries, along with their versions, in a requirements.txt file so that other developers can install them with pip install -r requirements.txt.
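Hello, and first of all thank you for this project, it is great~
I would like to report a problem, and I do not know whether it is specific to me. With the correct account and password filled in, whatever I search for, including the "人工智能" (artificial intelligence) query from the sample program, the output is always:
共0 页 (0 pages in total)
finished
Is this something on my side? Is there anything I should provide to help you investigate?
Thanks~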
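All of this data is provided for free and in full. Crawling significantly degrades the experience of regular users and wastes public resources.
If you need the data, you can download it directly from http://patdata.sipo.gov.cn/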
Changing line 145 of patent.py to the following fixes it:
sipocrawler['abstract'] = BeautifulSoup(detail.get('abstractInfoDTO').get('abIndexList')[0].get('value')).text.replace('\n','')
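The one-liner above raises an exception if any level of the nested structure is missing or the list is empty. A more defensive lookup might look like the sketch below. The key names `abstractInfoDTO`, `abIndexList`, and `value` come from the line above; the helper name is hypothetical and the function only returns the raw string, leaving the BeautifulSoup step as it is.

```python
def first_abstract_html(detail):
    """Return the first abstract entry's 'value', or '' if anything is missing."""
    info = (detail or {}).get("abstractInfoDTO") or {}
    entries = info.get("abIndexList") or []
    if not entries:
        return ""
    return entries[0].get("value") or ""


# Example with a made-up payload shaped like the lookup above.
sample = {"abstractInfoDTO": {"abIndexList": [{"value": "<p>text</p>"}]}}
print(first_abstract_html(sample))
print(repr(first_abstract_html({})))
```

The returned string can then be passed to BeautifulSoup exactly as in the suggested fix.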