wycm / zhihu-crawler Goto Github PK
zhihu-crawler is a Java-based, high-performance distributed crawler that supports a free HTTP proxy pool and horizontal scaling.
License: Other
A question about shutting down the thread pool: is the termination condition based on the downloadPageCount property in the config file? If so, it seems the pool never actually reaches the shutdown condition. I'd appreciate an explanation. And many thanks for the code.
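For reference, a shutdown-by-page-count monitor can be sketched as follows (a minimal demo under the assumption that the pool should stop once a configured number of pages has been downloaded; none of this is the project's actual code):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolShutdownDemo {
    public static void main(String[] args) throws Exception {
        int downloadPageCount = 20;              // target from config (assumed name)
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < downloadPageCount; i++) {
            pool.execute(done::incrementAndGet); // stand-in for a DownloadTask
        }
        // Monitor: poll the completed-task counter, then shut down at the target.
        while (done.get() < downloadPageCount) {
            Thread.sleep(10);
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("downloaded=" + done.get() + " shutdown=" + pool.isShutdown());
    }
}
```

If the real crawler keeps discovering new URLs, the counter never stops growing, which would explain a pool that "never reaches the shutdown condition".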
May I ask why, when I run it, the number of available proxies shows 0, and it keeps returning 400 and 500 response codes? Is there something I need to configure? Hoping the author can answer, many thanks!
Hello, what is this captcha that appears after running?
I tried your code and got banned after crawling fewer than 500 records. How do you handle account bans?
I just noticed that the IP181 site http://www.ip181.com/ no longer responds when opened.
Why is the proxies output garbled? I've tried all kinds of encodings.
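One common cause of garbled response text, offered here as a guess rather than something confirmed in this repo, is a gzip-compressed HTTP body being decoded as plain text. Decompressing before decoding recovers the page:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

public class GzipDecodeDemo {
    public static void main(String[] args) throws IOException {
        // Simulate a gzipped HTTP response body.
        String page = "<html>proxy list: 1.2.3.4:8080</html>";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(page.getBytes(StandardCharsets.UTF_8));
        }
        byte[] body = buf.toByteArray();

        // Reading the raw bytes as text yields mojibake; decompress first.
        String decoded;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] tmp = new byte[4096];
            int n;
            while ((n = in.read(tmp)) != -1) out.write(tmp, 0, n);
            decoded = out.toString(StandardCharsets.UTF_8.name());
        }
        System.out.println(decoded.equals(page)); // true
    }
}
```

Checking the `Content-Encoding` response header (or letting the HTTP client handle decompression) avoids this class of problem entirely.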
Thanks for the code, I've just started learning Java.
I hit this problem while running it:
Exception in thread "pool-1-thread-1" java.lang.NullPointerException
at com.crawl.parser.zhihu.ZhiHuUserIndexDetailPageParser.parseUserdetail(ZhiHuUserIndexDetailPageParser.java:44)
at com.crawl.parser.zhihu.ZhiHuUserIndexDetailPageParser.parse(ZhiHuUserIndexDetailPageParser.java:30)
at com.crawl.zhihu.task.ParseTask.parse(ParseTask.java:56)
at com.crawl.zhihu.task.ParseTask.run(ParseTask.java:39)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
How can I fix this?
Sorry, I think I replied in the wrong place earlier, I'm still figuring this site out! Why does my crawl stop after only about 20 items every time?
First of all, thanks for the code and the docs. I've successfully crawled data from Zhihu, but there are a few things I don't understand and would like to ask about.
private void handleUrl(String url){
    if(!Config.dbEnable){
        zhiHuHttpClient.getDownloadThreadExecutor().execute(new DownloadTask(url));
        return;
    }
    String md5Url = Md5Util.Convert2Md5(url);
    boolean isRepeat = ZhiHuDAO.insertHref(md5Url);
    if(!isRepeat || (!zhiHuHttpClient.getDownloadThreadExecutor().isShutdown()
            && zhiHuHttpClient.getDownloadThreadExecutor().getQueue().size() < 30)){
        /**
         * prevent mutual waiting, which would cause deadlock
         */
        zhiHuHttpClient.getDownloadThreadExecutor().execute(new DownloadTask(url));
    }
}
Regarding the deadlock prevention in this method: why does this check prevent deadlock? Or rather, why could a deadlock occur here in the first place?
I don't really understand it (I'm not familiar with many of the classes in the java.util.concurrent package).
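For context on the deadlock risk: when two pools feed each other (parse tasks submit download tasks, and download results feed new parse tasks), a producer that blocks on a full downstream queue while the downstream's workers block submitting back creates a circular wait. The guard in the snippet drops submissions instead of blocking. A minimal, self-contained sketch of that guard (pool sizes and task bodies are made up for the demo):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueGuardDemo {
    public static void main(String[] args) throws Exception {
        // Small bounded queue, like the download pool's (sizes assumed).
        ThreadPoolExecutor downloadPool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(30));
        AtomicInteger accepted = new AtomicInteger();
        AtomicInteger skipped = new AtomicInteger();

        for (int i = 0; i < 200; i++) {
            // The guard: never block when the downstream queue is backed up,
            // just drop the task. A producer that blocked here, while the
            // consumers were in turn blocked submitting back to the producer's
            // own pool, would be the classic two-pool deadlock.
            if (downloadPool.getQueue().size() < 30) {
                downloadPool.execute(() -> {
                    accepted.incrementAndGet();
                    try { Thread.sleep(5); } catch (InterruptedException ignored) {}
                });
            } else {
                skipped.incrementAndGet();
            }
        }
        downloadPool.shutdown();
        downloadPool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("no deadlock; accepted=" + accepted.get()
                + " total=" + (accepted.get() + skipped.get()));
    }
}
```

Dropped URLs are acceptable here because the crawler can rediscover them later; correctness is traded for liveness.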
It first reports a serialization failure, then lots of 400 and 404 response codes.
Which JDK version and which IntelliJ IDEA version are you using? My project import keeps failing.
Hello author, we are a professional IP proxy provider, Jisu HTTP. We give 10,000 IPs on registration and verification (your users could take advantage of a free trial :). We'd like to discuss a possible commercial promotion partnership. If you're interested, you can contact me on WeChat: 13982004324. Thanks (and if not, sorry for the interruption).
Quite a few classes report that methods cannot be found; here is one of them:
com.github.wycm.zhihu.service.receiver.ZhihuUserTaskReceiver
@Override
protected Runnable createNewTask(CrawlerMessage crawlerMessage) {
    ZhihuUserTask task = new ZhihuUserTask(crawlerMessage, zhihuComponent);
    task.setUrl(crawlerMessage.getUrl());
    task.setCurrentRetryTimes(crawlerMessage.getCurrentRetryTimes());
    task.setProxyFlag(true);
    return task;
}
In the method above, the task object has no setUrl method at all, and setCurrentRetryTimes and setProxyFlag are missing too. I don't know what's wrong.
I actually ran your code. Is my understanding of the "proxy" mentioned in the project docs correct: it means using the proxy servers provided by www.xicidaili.com to access Zhihu, so as to avoid Zhihu's anti-scraping measures?
PS: I tried commenting out ProxyHttpClient.getInstance().startCrawl(); in main, and the result was a flood of 429 Too Many Requests responses.
@Override
public void startCrawl() {
authorization = initAuthorization();
String startToken = Config.startUserToken;
String startUrl = String.format(Constants.USER_FOLLOWEES_URL, startToken, 0);
HttpGet request = new HttpGet(startUrl);
request.setHeader("authorization", "oauth " + ZhiHuHttpClient.getAuthorization());
detailListPageThreadPool.execute(new DetailListPageTask(request, Config.isProxy));
manageHttpClient();
}
How does this code keep crawling continuously? How is the crawl loop sustained?
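One plausible reading (a sketch, not the project's verbatim logic): startCrawl() only seeds the first DetailListPageTask; each task, when it runs, parses the followee list and submits new tasks for every user and next page it finds, so the pool keeps feeding itself from a single seed. A toy demo of that self-feeding pattern (the fan-out of 2 and depth of 4 are arbitrary; the CountDownLatch exists only so the demo exits cleanly):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SelfFeedingCrawlDemo {
    static final ExecutorService POOL = Executors.newFixedThreadPool(4);
    static final AtomicInteger PAGES = new AtomicInteger();
    // 1 + 2 + 4 + 8 + 16 = 31 pages for depth 4 with fan-out 2.
    static final CountDownLatch DONE = new CountDownLatch(31);

    // Stand-in for DetailListPageTask: "parse" one page, then submit one new
    // task per link found on it. A single seed task keeps the pool fed.
    static void crawl(int depth) {
        PAGES.incrementAndGet();
        DONE.countDown();
        if (depth > 0) {
            for (int i = 0; i < 2; i++) {      // pretend each page links to two more
                POOL.execute(() -> crawl(depth - 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        POOL.execute(() -> crawl(4));          // one seed, like startCrawl()
        DONE.await();
        POOL.shutdown();
        System.out.println("pages crawled from one seed: " + PAGES.get());
    }
}
```

There is no explicit while-loop: the "loop" is the recursion of tasks submitting further tasks, and it only stops when no new links are discovered.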
As in the title:
Exception in thread "main" java.lang.RuntimeException: not get authorization
at com.crawl.zhihu.ZhiHuHttpClient.initAuthorization(ZhiHuHttpClient.java:168)
at com.crawl.zhihu.ZhiHuHttpClient.getAuthorization(ZhiHuHttpClient.java:173)
at com.crawl.zhihu.ZhiHuHttpClient.startCrawl(ZhiHuHttpClient.java:114)
at com.crawl.Main.main(Main.java:15)
All the response codes I get are 400 and 403. What can I do?
java.lang.ExceptionInInitializerError
Caused by: java.lang.NullPointerException
at java.util.Properties$LineReader.readLine(Properties.java:434)
at java.util.Properties.load0(Properties.java:353)
at java.util.Properties.load(Properties.java:341)
at com.crawl.core.util.SimpleLogger.setLogProperty(SimpleLogger.java:18)
at com.crawl.core.util.SimpleLogger.getSimpleLogger(SimpleLogger.java:38)
at com.crawl.Main.<clinit>(Main.java:13)
Exception in thread "main"
java.sql.SQLException: Illegal connection port value 'mysql:'
I changed the configuration to:
db.enable = true
db.host = jdbc:mysql://localhost:3306/zhihu
db.username = root
db.password = 123456
db.name = zhihu
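Judging from the error message alone (not verified against the repo's source), the crawler may build the full JDBC URL itself, in which case `db.host` should be the bare host name rather than a `jdbc:mysql://` URL:

```properties
db.enable = true
# If the code constructs "jdbc:mysql://" + db.host + ... itself (a guess from
# the "Illegal connection port value 'mysql:'" error), use the bare host:
db.host = localhost
db.username = root
db.password = 123456
db.name = zhihu
```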
Host name 'XXXXXXXXX' does not match the certificate subject provided by the peer
You can refer to
http://stackoverflow.com/questions/34655031/javax-net-ssl-sslpeerunverifiedexception-host-name-does-not-match-the-certifica
and modify it accordingly; after that it works.
Thanks for the code.
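For reference, the fix from that Stack Overflow thread usually amounts to appending SSL parameters to the JDBC connection URL (the exact parameter names depend on your MySQL Connector/J version; treat this as a sketch):

```properties
# Disable SSL entirely:
jdbc:mysql://localhost:3306/zhihu?useSSL=false
# Or keep SSL but skip server-certificate verification:
jdbc:mysql://localhost:3306/zhihu?useSSL=true&verifyServerCertificate=false
```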
What plugin do you use for the data-visualization analysis?
Did one of the APIs used in the crawl loop change?
Hi, I see this snippet in your DetailListPageTask:
if (zhiHuHttpClient.getDetailListPageThreadPool().getQueue().size() > 1000){
    continue;
}
I think once thread concurrency goes up, this code is still quite resource-intensive.
You could instead install a DiscardPolicy saturation handler on the thread pool: once the maximum threads and the blocking queue are both full, newly added tasks are simply thrown away. That has the same effect as your code, but letting the pool itself do the discarding should be more efficient.
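The suggestion above, letting the pool itself discard overflow via ThreadPoolExecutor.DiscardPolicy, can be demonstrated in isolation (the pool and queue sizes here are arbitrary demo values):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class DiscardPolicyDemo {
    public static void main(String[] args) throws Exception {
        AtomicInteger ran = new AtomicInteger();
        // 1 thread, queue of 5; once both are full, new tasks are silently dropped.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(5),
                new ThreadPoolExecutor.DiscardPolicy());

        CountDownLatch gate = new CountDownLatch(1);
        pool.execute(() -> {                   // block the single worker
            try { gate.await(); } catch (InterruptedException ignored) {}
            ran.incrementAndGet();
        });
        for (int i = 0; i < 100; i++) {        // 5 fill the queue, 95 are discarded
            pool.execute(ran::incrementAndGet);
        }
        gate.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("executed=" + ran.get()); // 1 blocker + 5 queued = 6
    }
}
```

No RejectedExecutionException is thrown and no manual queue-size polling is needed; the rejection handler runs in the submitting thread only when the pool is actually saturated.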
Hi:
JMS Exception-Could not connect to broker url
I get this error right after starting. Does ActiveMQ need some configuration?
http://localhost:8161/admin/queues.jsp
I can't open this URL either.