zhihu-crawler's People

Contributors

linao1996, wycm

zhihu-crawler's Issues

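About shutting down the thread pool

Is the thread pool shut down based on the downloadPageCount property in the config file, used as the termination condition? If so, the pool never seems to reach the point where it actually shuts down. I hope you can clarify. Finally, many thanks for the code.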
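For readers with the same question, here is a minimal sketch of how a page-count threshold can drive a pool shutdown. It is not the project's code: the class name PoolShutdownSketch, the method crawlUntil and the parameter targetPageCount are hypothetical, and the "download" is simulated by a counter increment. One point it illustrates is that the pool only stops if some task (or a monitor thread) actually checks the counter and calls shutdown.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolShutdownSketch {
    /** Runs simulated downloads until targetPageCount is reached, then stops the pool. */
    static int crawlUntil(int targetPageCount) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger downloaded = new AtomicInteger();
        try {
            // Deliberately submit far more work than the target.
            for (int i = 0; i < targetPageCount * 10; i++) {
                pool.execute(() -> {
                    // A real task would download and parse a page here.
                    if (downloaded.incrementAndGet() >= targetPageCount) {
                        pool.shutdownNow(); // reject new tasks and drop queued ones
                    }
                });
            }
        } catch (RejectedExecutionException e) {
            // Expected once the pool has been shut down by a task.
        }
        pool.shutdown(); // harmless if the pool is already shut down
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return downloaded.get();
    }

    public static void main(String[] args) {
        System.out.println("pages counted: " + crawlUntil(20));
    }
}
```

The final count can slightly exceed the target, because tasks already running on other threads still finish after shutdownNow() is called.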

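Login succeeds, but then an error appears that I don't understand

Thanks for the code; I have only just started learning Java.
When I run it, I get this error: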
Exception in thread "pool-1-thread-1" java.lang.NullPointerException
at com.crawl.parser.zhihu.ZhiHuUserIndexDetailPageParser.parseUserdetail(ZhiHuUserIndexDetailPageParser.java:44)
at com.crawl.parser.zhihu.ZhiHuUserIndexDetailPageParser.parse(ZhiHuUserIndexDetailPageParser.java:30)
at com.crawl.zhihu.task.ParseTask.parse(ParseTask.java:56)
at com.crawl.zhihu.task.ParseTask.run(ParseTask.java:39)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
How can I fix this?

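Hi there

I think I replied in the wrong place just now; I'm still getting used to this site. Why does the crawler stop every time after fetching only twenty-odd entries?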

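com.crawl.zhihu.task.ParseTask.handleUrl method: deadlock prevention

First of all, thanks for the code and docs; I have been able to crawl data from Zhihu without problems. There are a few things I don't quite understand, though, and I'd like to ask about them.

    private void handleUrl(String url) {
        if (!Config.dbEnable) {
            zhiHuHttpClient.getDownloadThreadExecutor().execute(new DownloadTask(url));
            return;
        }
        String md5Url = Md5Util.Convert2Md5(url);
        boolean isRepeat = ZhiHuDAO.insertHref(md5Url);
        if (!isRepeat || (!zhiHuHttpClient.getDownloadThreadExecutor().isShutdown()
                && zhiHuHttpClient.getDownloadThreadExecutor().getQueue().size() < 30)) {
            // Prevent tasks from waiting on each other, which would cause a deadlock.
            zhiHuHttpClient.getDownloadThreadExecutor().execute(new DownloadTask(url));
        }
    }

The comment in this method says it prevents deadlock. Why does this prevent deadlock? Or rather, why would a deadlock be possible in the first place?
I don't really understand it (nor many of the classes in the java.util.concurrent package).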

Many classes report that methods cannot be found

Quite a few classes report that the methods they call cannot be found. One example is
com.github.wycm.zhihu.service.receiver.ZhihuUserTaskReceiver:

    @Override
    protected Runnable createNewTask(CrawlerMessage crawlerMessage) {
        ZhihuUserTask task = new ZhihuUserTask(crawlerMessage, zhihuComponent);
        task.setUrl(crawlerMessage.getUrl());
        task.setCurrentRetryTimes(crawlerMessage.getCurrentRetryTimes());
        task.setProxyFlag(true);
        return task;
    }

In the method above, the task object has no setUrl method at all, and setCurrentRetryTimes and setProxyFlag are missing as well. I don't know what went wrong.

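A question about proxies

I have actually run your code. Is my understanding correct that the proxy mentioned in the project docs means using the proxy servers listed on www.xicidaili.com to access Zhihu, in order to get around Zhihu's anti-crawling measures?
P.S. I tried commenting out ProxyHttpClient.getInstance().startCrawl(); in main, and the result was a flood of 429 Too Many Requests responses.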

Hi, how does the call "ZhiHuHttpClient.getInstance().startCrawl();" in the Main class keep looping?

    @Override
    public void startCrawl() {
        authorization = initAuthorization();

        String startToken = Config.startUserToken;
        String startUrl = String.format(Constants.USER_FOLLOWEES_URL, startToken, 0);
        HttpGet request = new HttpGet(startUrl);
        request.setHeader("authorization", "oauth " + ZhiHuHttpClient.getAuthorization());
        detailListPageThreadPool.execute(new DetailListPageTask(request, Config.isProxy));
        manageHttpClient();
    }

How does this code make sure the crawler keeps fetching data? What keeps the crawl loop going?
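A common pattern in crawlers of this shape is that startCrawl() only submits the first task, and every task then submits follow-up tasks for the links it discovers, so the pool keeps feeding itself. The sketch below illustrates that pattern under stated assumptions; it is not taken from the project. The class SelfFeedingCrawlSketch, the FOLLOWEES map (an in-memory stand-in for HTTP responses) and the pending counter are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SelfFeedingCrawlSketch {
    // Hypothetical follow graph standing in for the followee lists fetched over HTTP.
    static final Map<String, List<String>> FOLLOWEES = Map.of(
            "seed", List.of("a", "b"),
            "a", List.of("b", "c"),
            "b", List.of("seed"),
            "c", List.of());

    /** Submits one seed task and returns every user reached from it. */
    static Set<String> crawl(String seed) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Set<String> visited = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger();
        submit(pool, visited, pending, seed); // the only task submitted from outside
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return visited;
    }

    static void submit(ExecutorService pool, Set<String> visited,
                       AtomicInteger pending, String user) {
        if (!visited.add(user)) {
            return; // already scheduled, skip duplicates
        }
        pending.incrementAndGet();
        pool.execute(() -> {
            // Each task schedules new tasks for the users it discovers.
            for (String next : FOLLOWEES.getOrDefault(user, List.of())) {
                submit(pool, visited, pending, next);
            }
            // Children were counted before this decrement, so zero means no work is left.
            if (pending.decrementAndGet() == 0) {
                pool.shutdown();
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(crawl("seed"));
    }
}
```

On a real site the graph is effectively endless, so the loop continues until something else stops it, such as a page-count limit or an exhausted queue.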

Failed to initialize authorization

Exception in thread "main" java.lang.RuntimeException: not get authorization
at com.crawl.zhihu.ZhiHuHttpClient.initAuthorization(ZhiHuHttpClient.java:168)
at com.crawl.zhihu.ZhiHuHttpClient.getAuthorization(ZhiHuHttpClient.java:173)
at com.crawl.zhihu.ZhiHuHttpClient.startCrawl(ZhiHuHttpClient.java:114)
at com.crawl.Main.main(Main.java:15)

java.lang.NullPointerException

java.lang.ExceptionInInitializerError
Caused by: java.lang.NullPointerException
at java.util.Properties$LineReader.readLine(Properties.java:434)
at java.util.Properties.load0(Properties.java:353)
at java.util.Properties.load(Properties.java:341)
at com.crawl.core.util.SimpleLogger.setLogProperty(SimpleLogger.java:18)
at com.crawl.core.util.SimpleLogger.getSimpleLogger(SimpleLogger.java:38)
at com.crawl.Main.<clinit>(Main.java:13)
Exception in thread "main"
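The trace shows Properties.load failing inside SimpleLogger.setLogProperty. A frequent cause of this exact failure is that the InputStream passed to load is null, which happens when getResourceAsStream cannot find the file on the classpath. Whether that is what happened here is an assumption, since the project's SimpleLogger source is not shown. The sketch below, with the hypothetical class ResourceLoadSketch and method loadFromClasspath, shows a guard that turns the bare NullPointerException into a readable error.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

class ResourceLoadSketch {
    /** Loads a properties file from the classpath, failing with a clear message if it is missing. */
    static Properties loadFromClasspath(String resourceName) {
        InputStream in = ResourceLoadSketch.class.getClassLoader()
                .getResourceAsStream(resourceName);
        if (in == null) {
            // getResourceAsStream returns null rather than throwing when nothing is found.
            throw new IllegalStateException("Resource not found on classpath: " + resourceName);
        }
        try (InputStream stream = in) {
            Properties props = new Properties();
            props.load(stream);
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try {
            loadFromClasspath("no-such-file.properties"); // hypothetical file name
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If this is the cause, checking that the properties file sits in a directory that ends up on the runtime classpath (for example the resources folder of the build) is the usual fix.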

Database connection error

java.sql.SQLException: Illegal connection port value 'mysql:'

I changed the configuration to:

db.enable = true
# Database settings
db.host = jdbc:mysql://localhost:3306/zhihu
db.username = root
db.password = 123456
# Database name
db.name = zhihu
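The message suggests the driver found 'mysql:' where it expected a port number. One plausible explanation, assuming the crawler itself prepends "jdbc:mysql://" to db.host and appends db.name, is that putting a full JDBC URL into db.host produces a doubled prefix. The sketch below only demonstrates that string assembly; the class JdbcUrlSketch and the method buildUrl are hypothetical, and the project's real connection code may differ.

```java
class JdbcUrlSketch {
    /** Assumed assembly rule: prefix + host + "/" + database name. */
    static String buildUrl(String host, String dbName) {
        return "jdbc:mysql://" + host + "/" + dbName;
    }

    /** True if the value looks like a bare host or host:port rather than a full URL. */
    static boolean isBareHost(String host) {
        return !host.contains("://") && !host.contains("/");
    }

    public static void main(String[] args) {
        // A bare host and port gives a well-formed URL.
        System.out.println(buildUrl("localhost:3306", "zhihu"));
        // A full JDBC URL in db.host doubles the prefix, so the text after the
        // first host name ("mysql:") lands where the port should be.
        System.out.println(buildUrl("jdbc:mysql://localhost:3306/zhihu", "zhihu"));
    }
}
```

Under that assumption, setting db.host to just the host (optionally with the port, such as localhost:3306) and leaving the database name in db.name would avoid the error.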

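Hi, could a saturation policy be added?

Hi, I noticed this snippet in DetailListPageTask:

    if (zhiHuHttpClient.getDetailListPageThreadPool().getQueue().size() > 1000) {
        continue;
    }

I think this code still wastes quite a lot of resources once thread concurrency goes up.
You could configure the thread pool with a discard saturation policy, so that once both the maximum thread count and the blocking queue are full, any further tasks are simply dropped. The effect is the same as your code, but letting the thread pool do the discarding should be more efficient.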
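As an illustration of the suggestion, the sketch below configures a java.util.concurrent.ThreadPoolExecutor with ThreadPoolExecutor.DiscardPolicy. The pool sizes, the queue capacity and the class name DiscardPolicySketch are hypothetical values chosen to make the effect visible; they are not the project's settings. Note that the policy only takes effect with a bounded queue, because an unbounded queue never fills up and so never rejects a task.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DiscardPolicySketch {
    /** Submits the given number of tasks and returns how many actually ran. */
    static int runWithDiscard(int submitted) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1,                           // one worker thread
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),    // bounded queue: at most 2 waiting tasks
                new ThreadPoolExecutor.DiscardPolicy()); // silently drop rejected tasks
        CountDownLatch gate = new CountDownLatch(1);
        AtomicInteger executed = new AtomicInteger();
        for (int i = 0; i < submitted; i++) {
            pool.execute(() -> {
                try {
                    gate.await(); // hold the worker so the queue fills up
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                executed.incrementAndGet();
            });
        }
        gate.countDown(); // release the worker
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return executed.get();
    }

    public static void main(String[] args) {
        // One running task plus two queued tasks survive; the rest are discarded.
        System.out.println("executed: " + runWithDiscard(10));
    }
}
```

Unlike the manual queue-size check, the rejection happens inside execute(), so the submitting code needs no polling; the trade-off is that dropped tasks disappear without any log entry unless a custom RejectedExecutionHandler is supplied.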
