songgeb / bdindexspider
A Baidu Index scraping tool based on WebDriver. The source is open in order to share one approach to scraping the Baidu Index.
I never asked it to scrape "nba" at all?
The contents of a.txt are as follows:
华为#2017/11/5#2018/11/4
Oppo#2017/11/5#2018/11/4
Vivo#2017/11/5#2018/11/4
苹果#2017/11/5#2018/11/4
小米#2017/11/5#2018/11/4
魅族#2017/11/5#2018/11/4
荣耀#2017/11/5#2018/11/4
联想#2017/11/5#2018/11/4
一加#2017/11/5#2018/11/4
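Each line of a.txt above appears to follow the pattern keyword#start#end, with dates written without zero padding. The following is a minimal sketch of parsing one such line, not the tool's own reader (the log mentions com.bdindex.model.TXTReader, whose code is not shown here). The class name KeywordQuery and its fields are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical holder for one input line; not a class from the tool itself.
final class KeywordQuery {
    // Accepts dates without zero padding, such as 2017/11/5.
    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("uuuu/M/d");

    final String keyword;
    final LocalDate start;
    final LocalDate end;

    KeywordQuery(String keyword, LocalDate start, LocalDate end) {
        this.keyword = keyword;
        this.start = start;
        this.end = end;
    }

    // Splits "keyword#start#end" and rejects lines with the wrong field count
    // or with an end date earlier than the start date.
    static KeywordQuery parse(String line) {
        String[] parts = line.trim().split("#");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected keyword#start#end but got: " + line);
        }
        LocalDate start = LocalDate.parse(parts[1].trim(), DATE);
        LocalDate end = LocalDate.parse(parts[2].trim(), DATE);
        if (end.isBefore(start)) {
            throw new IllegalArgumentException("End date precedes start date: " + line);
        }
        return new KeywordQuery(parts[0].trim(), start, end);
    }
}
```

Logging the parsed keyword before each query would make it easier to confirm which keyword the tool is actually scraping when the result looks unrelated to the input file.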
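For now the restriction only lasts for a period of time; it is not permanent.
I plan to change this in the next couple of days.
Neither the automatic download nor the tool can recognize the image, so the number that the image represents is never output.
This is a legacy issue: the tool was originally designed to scrape only data from before the current month. It will be improved later.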
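The restriction described here ("only data from before the current month") can be illustrated with a small date check. This is only a sketch of that rule under the assumption that it compares calendar months; the tool's actual validation is not shown in this thread, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.YearMonth;

// Hypothetical helper illustrating the "before the current month" rule.
final class DateRangeCheck {
    // True when the date falls in a month strictly earlier than currentMonth.
    static boolean isBeforeMonth(LocalDate date, YearMonth currentMonth) {
        return YearMonth.from(date).isBefore(currentMonth);
    }
}
```

Under this rule, a query ending on 2018/11/4 would be rejected if it were run during November 2018, while one ending on 2018/10/31 would be accepted.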
This problem only appears when the tesseract installation has failed and OCR recognition therefore fails.
Full bug log:
2018-11-16 21:33:05 [ AWT-EventQueue-0:0 ] - [ INFO ] 载入策略类: com.bdindex.model.TXTReader
2018-11-16 21:33:05 [ AWT-EventQueue-0:0 ] - [ INFO ] 载入策略类: com.bdindex.model.CSVReader
2018-11-16 21:33:05 [ AWT-EventQueue-0:10 ] - [ INFO ] com.bdindex.model.TXTReader
2018-11-16 21:33:06 [ SwingWorker-pool-2-thread-1:1150 ] - [ INFO ] DriverFilePath: drivers/chromedriver.exe
2018-11-16 21:33:06 [ SwingWorker-pool-2-thread-1:1150 ] - [ INFO ] Driver folder exist
2018-11-16 21:34:04 [ SwingWorker-pool-2-thread-1:58809 ] - [ ERROR ] 初始化失败
org.openqa.selenium.TimeoutException: Expected condition failed: waiting for visibility of element located by By.id: TANGRAM__PSP_4__userName (tried for 3 second(s) with 500 milliseconds interval)
at org.openqa.selenium.support.ui.WebDriverWait.timeoutException(WebDriverWait.java:113)
at org.openqa.selenium.support.ui.FluentWait.until(FluentWait.java:283)
at com.selenium.Wait.waitForElementVisible(Wait.java:82)
at com.selenium.BDIndexAction.login(BDIndexAction.java:126)
at com.bdindex.core.BDIndexCoreWorker.init(BDIndexCoreWorker.java:73)
at com.bdindex.core.BDIndexCoreWorker.start(BDIndexCoreWorker.java:149)
at com.bdindex.core.BDIndexCoreWorker.doInBackground(BDIndexCoreWorker.java:254)
at com.bdindex.core.BDIndexCoreWorker.doInBackground(BDIndexCoreWorker.java:31)
at javax.swing.SwingWorker$1.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at javax.swing.SwingWorker.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.openqa.selenium.NoSuchElementException: Cannot locate an element using By.id: TANGRAM__PSP_4__userName
For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: 'joekjzhou-PC1', ip: '10.17.93.34', os.name: 'Windows 10', os.arch: 'x86', os.version: '10.0', java.version: '1.8.0_171'
Driver info: driver.version: unknown
at org.openqa.selenium.support.ui.ExpectedConditions.lambda$findElement$0(ExpectedConditions.java:896)
at java.util.Optional.orElseThrow(Unknown Source)
at org.openqa.selenium.support.ui.ExpectedConditions.findElement(ExpectedConditions.java:895)
at org.openqa.selenium.support.ui.ExpectedConditions.access$000(ExpectedConditions.java:44)
at org.openqa.selenium.support.ui.ExpectedConditions$7.apply(ExpectedConditions.java:206)
at org.openqa.selenium.support.ui.ExpectedConditions$7.apply(ExpectedConditions.java:202)
at org.openqa.selenium.support.ui.FluentWait.until(FluentWait.java:260)
... 12 more
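Hello,
I have recently been trying to study the connections between cities, and I would like to express them through how often each city is searched for from the other.
So I have a feature request: searching for a keyword by region (for example, selecting the region "天津" (Tianjin) and searching for the keyword "北京" (Beijing)).
Could the maintainers please add this feature? Cheering you on!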
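The main cause is that the font in the scraped images differs considerably from the font predefined in the OCR algorithm. The Hamming distance computed during recognition is therefore large, exceeds the threshold, and recognition fails.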
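To make the failure mode concrete, here is a minimal sketch of threshold-based template matching with the Hamming distance. It assumes each glyph is reduced to a bit fingerprint packed into a long; the tool's real fingerprint format, template set, and threshold are not given in this thread, so every name and value below is hypothetical.

```java
// Hypothetical matcher; illustrates the technique, not the tool's OCR code.
final class HammingMatcher {
    // Number of bit positions in which the two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Returns the index of the closest template, or -1 when even the
    // closest one differs by more than the threshold.
    static int bestMatch(long glyph, long[] templates, int threshold) {
        int bestIndex = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i < templates.length; i++) {
            int distance = hammingDistance(glyph, templates[i]);
            if (distance < bestDistance) {
                bestDistance = distance;
                bestIndex = i;
            }
        }
        return bestDistance <= threshold ? bestIndex : -1;
    }
}
```

When the page font changes, every template ends up far from the scraped glyph, so a matcher of this kind returns -1 for all digits. Raising the threshold trades such failures for a higher risk of misreading digits, so refreshing the templates is usually the safer fix.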
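The tool leaves too short an interval between two attempts to enter the verification code. The code may not have arrived yet, and the page refreshes again before it can be entered, so the scraping page can never be reached.
The real values should be on the order of four digits, but the number in the captured image may have only one or two digits, and is sometimes 0. The cause is currently unknown. Only a very small share of the data is affected, so the overall result is not impacted.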
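Since the cause is unknown, one workaround is to flag such values after scraping so they can be re-fetched or checked by hand. The sketch below marks entries that fall far below the median of the series. The 0.1 ratio used in the example and all names are assumptions chosen for illustration, not part of the tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing check; not part of the tool.
final class OutlierCheck {
    // Median of the series, computed on a sorted copy; expects a non-empty array.
    static double median(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Indices of values smaller than minRatio times the median.
    static List<Integer> suspiciousIndices(int[] values, double minRatio) {
        List<Integer> flagged = new ArrayList<>();
        if (values.length == 0) {
            return flagged;
        }
        double cutoff = median(values) * minRatio;
        for (int i = 0; i < values.length; i++) {
            if (values[i] < cutoff) {
                flagged.add(i);
            }
        }
        return flagged;
    }
}
```

For a series such as 3200, 3100, 12, 3300, 0 with a ratio of 0.1, the median is 3100 and the cutoff is 310, so the entries at indices 2 and 4 are flagged.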