web-bee's People

Contributors

pkwenda, wangtonghe

web-bee's Issues

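Improve the httpClient wrapper

The wrapper needs support for setting request headers, and users should be able to configure the request method (POST, GET, PUT, DELETE, and so on).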
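
A minimal sketch of what such a wrapper could look like. It assumes the JDK's `java.net.http` API (Java 11+) rather than whatever HTTP client web-bee actually wraps, and `RequestSpec` with all of its methods is a hypothetical name used only for illustration. It only builds the request; sending it is left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of web-bee: collects method, headers and body,
// then builds a java.net.http.HttpRequest from them.
final class RequestSpec {
    private final String url;
    private String method = "GET";          // default request method
    private String body;                     // null means "no request body"
    private final Map<String, String> headers = new LinkedHashMap<>();

    private RequestSpec(String url) {
        this.url = url;
    }

    static RequestSpec of(String url) {
        return new RequestSpec(url);
    }

    // Accepts "get", "POST", "put", "delete", ... in any letter case.
    RequestSpec method(String method) {
        this.method = method.toUpperCase(Locale.ROOT);
        return this;
    }

    RequestSpec header(String name, String value) {
        headers.put(name, value);
        return this;
    }

    RequestSpec body(String body) {
        this.body = body;
        return this;
    }

    HttpRequest build() {
        HttpRequest.BodyPublisher publisher = (body == null)
                ? HttpRequest.BodyPublishers.noBody()
                : HttpRequest.BodyPublishers.ofString(body);
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))   // assumed default timeout
                .method(method, publisher);
        headers.forEach(builder::header);
        return builder.build();
    }
}
```

A call such as `RequestSpec.of(url).method("put").header("User-Agent", "web-bee").body("{}").build()` keeps the method and headers in one fluent chain, so no separate code path is needed per HTTP verb.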

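Pagination: next page

Currently only a nextUrl returned by the server in its response is supported.
For sites that do not return one, the next-page URL needs to be derived automatically.
getNextUrl("key") is also handled poorly and needs to be reworked.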
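
One possible way to derive the next URL when the server does not supply it is sketched below. It assumes the site paginates through a numeric query parameter such as `page=2`, which the issue does not state; `NextUrlGuesser` and its method are hypothetical names, not web-bee APIs.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses the next page URL by incrementing a numeric
// query parameter. Returns empty when the parameter is absent.
final class NextUrlGuesser {

    static Optional<String> next(String url, String pageParam) {
        // Match "?page=12" or "&page=12"; group 2 is the page number itself.
        Pattern pattern = Pattern.compile("([?&]" + Pattern.quote(pageParam) + "=)(\\d+)");
        Matcher matcher = pattern.matcher(url);
        if (!matcher.find()) {
            return Optional.empty();
        }
        long nextPage = Long.parseLong(matcher.group(2)) + 1;
        String nextUrl = url.substring(0, matcher.start(2))
                + nextPage
                + url.substring(matcher.end(2));
        return Optional.of(nextUrl);
    }
}
```

Returning an empty `Optional` lets the caller fall back to another strategy, for example a "next" link selected from the page, instead of guessing a URL that may not exist.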

httpClient request wrapper

The httpClient wrapper for GET, POST, and PUT requests is done, but it is not elegant enough; the code needs to be simplified.

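Exception handling

Exceptions thrown during crawling (e.g. 404, timeout) need to be handled: retry the crawl up to 3 times, and if it still fails, mark the URL and record it in a Java cache or in persistent middleware.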
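
The retry-then-mark behaviour described above could be sketched roughly as follows. This assumes "3 times" means three attempts in total and keeps failed URLs in an in-memory set; `RetryingFetcher` is a hypothetical name, and a persistent store could replace the set.

```java
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: runs a request up to MAX_ATTEMPTS times and records
// the URL as failed if every attempt throws.
final class RetryingFetcher {
    static final int MAX_ATTEMPTS = 3; // assumption: three attempts in total

    private final Set<String> failedUrls = ConcurrentHashMap.newKeySet();

    <T> Optional<T> fetch(String url, Callable<T> request) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return Optional.ofNullable(request.call());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                break;
            } catch (Exception e) {
                // e.g. a 404 or a timeout surfaced as an exception; try again.
            }
        }
        failedUrls.add(url); // mark for later inspection or re-crawl
        return Optional.empty();
    }

    Set<String> failedUrls() {
        return failedUrls;
    }
}
```

The caller passes the actual request as a `Callable`, so the retry logic stays independent of the HTTP client in use.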

Wrap elements

Users should not have to process elements themselves; ideally web-bee handles this internally.

Crawl videos

Crawl videos, and write a demo that crawls videos from imooc (慕课网).

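Analyze one million Zhihu users

The initial plan is to use ECharts charts on Zhihu user data to:

  • Chart answer count, upvote count, education, and study-abroad experience against follower count
  • Rank the top 100 Zhihu users by follower count
  • Rank the top 100 articles by upvotes and the top 100 by comments
  • Find the Zhihu articles that are the "most xxx" for a given xxx
  • Publish the results on the official website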

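Improve the webBee selector

Further extend and optimize the webBee selector so that it roughly matches what jQuery offers, lowering the learning curve.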

Exception handling

Add solid exception handling so that errors are clear to the user at a glance and the program does not crash.

Java 8

The code should make more use of Java 8 language features.

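Implement caching

Implement a low-level Java cache without relying on middleware, and cache the scraped data returned by APIs or selectors. This is also a learning exercise.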
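
A middleware-free cache could start from something as small as the sketch below. It assumes a least-recently-used eviction policy and a fixed capacity, neither of which the issue specifies; `LruCache` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory cache: evicts the least recently used entry once
// the configured capacity is exceeded. Backed by an access-ordered LinkedHashMap.
final class LruCache<K, V> {
    private final Map<K, V> map;

    LruCache(final int capacity) {
        // accessOrder = true makes get() move an entry to the "most recent" end.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // synchronized because even get() mutates an access-ordered LinkedHashMap.
    synchronized V get(K key) {
        return map.get(key);
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    synchronized int size() {
        return map.size();
    }
}
```

Keying the cache by URL would let a repeated request for the same page return the stored result instead of hitting the site again.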

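Continuous integration test problem

Need to fix the bug where the Travis CI build reports that a construct is not supported on Java 1.6.
Scenario: throwing multiple exceptions is not supported on 1.6.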
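
The issue does not say which construct failed. One plausible reading, and only an assumption here, is the multi-catch syntax, which was introduced in Java 7 and does not compile at source level 1.6. The sketch below shows that form next to an equivalent that is valid at 1.6; `PortParser` is a hypothetical example class.

```java
// Hypothetical example: the same error handling written with Java 7+
// multi-catch and with separate catch blocks that also compile at source 1.6.
final class PortParser {

    // Java 7+ only: one catch clause handles several exception types.
    static int parseMultiCatch(String[] args, int index) {
        try {
            return Integer.parseInt(args[index]);
        } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
            return -1;
        }
    }

    // Java 6 compatible: one catch block per exception type.
    static int parseJava6(String[] args, int index) {
        try {
            return Integer.parseInt(args[index]);
        } catch (NumberFormatException e) {
            return -1;
        } catch (ArrayIndexOutOfBoundsException e) {
            return -1;
        }
    }
}
```

If this reading is right, the options are to rewrite such blocks in the second form or to raise the Java version the CI build targets.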

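Timer

For sites with strong anti-crawling measures (typical example: Zhihu),
special care is needed to imitate user behavior. A timer should be added so that crawling runs intelligently in three time windows per day.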
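
The three-windows idea could be sketched as a simple gate that the crawler checks before each request. The window times and the randomized delay are assumptions made for illustration, and `CrawlSchedule` is a hypothetical name.

```java
import java.time.LocalTime;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical schedule: crawling is allowed only inside three daily windows,
// with a random pause between requests to look less like a bot.
final class CrawlSchedule {
    // Assumed windows, each a [start, end) pair; adjust per target site.
    private static final LocalTime[][] WINDOWS = {
        {LocalTime.of(8, 0), LocalTime.of(10, 0)},
        {LocalTime.of(13, 0), LocalTime.of(15, 0)},
        {LocalTime.of(20, 0), LocalTime.of(22, 0)},
    };

    // True when 'now' falls inside any window (start inclusive, end exclusive).
    static boolean isCrawlTime(LocalTime now) {
        for (LocalTime[] window : WINDOWS) {
            if (!now.isBefore(window[0]) && now.isBefore(window[1])) {
                return true;
            }
        }
        return false;
    }

    // Random delay in [minMillis, maxMillis]; requires minMillis <= maxMillis.
    static long nextDelayMillis(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }
}
```

A crawl loop would call `isCrawlTime(LocalTime.now())` before each request and sleep for `nextDelayMillis(...)` between requests.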

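Caching

When crawling a million Zhihu records, caching is needed to avoid crawling the same user twice and to support the strategy for generating user relationships.
Caching a million records in Java memory is impractical, so a webbee-redis plugin needs to be developed.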
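
One way to leave room for such a plugin is to hide deduplication behind a small interface, as in the sketch below. `SeenStore` and `InMemorySeenStore` are hypothetical names. A Redis-backed implementation is not shown; it could map `markSeen` onto the Redis `SADD` command, which reports how many members were newly added.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction over "have we crawled this user already?".
interface SeenStore {
    // Returns true if the id was not recorded before and is recorded now.
    boolean markSeen(String userId);
}

// In-memory implementation, suitable for tests and small crawls only.
final class InMemorySeenStore implements SeenStore {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    @Override
    public boolean markSeen(String userId) {
        return seen.add(userId); // Set.add is false when already present
    }
}
```

The crawler would call `markSeen(id)` before queueing a user and skip the user when it returns false, regardless of which store is plugged in.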

Improve the page processor

    //todo page.getJson/html/string().$('textarea.content').as('content').bulid().$('#img').as('img')
    //todo expected result: {content:[],img:[]} in JSON format, {} for a single item and [] for multiple items
    //todo page.nextUrl('span>ss>s')
    //todo fetch API endpoints directly
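
The chained `$(...).as(...)` style in the first todo could be sketched as below. This is not web-bee's implementation: `Extraction` is a hypothetical class, the selector engine is passed in as a plain function so the example stays self-contained, and every value is returned as a list rather than switching between `{}` and `[]`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical fluent extractor: $(selector) picks nodes, as(name) stores the
// matched values under that name, build() returns the collected result.
final class Extraction {
    // Assumed selector engine: maps a CSS selector to the matched text values.
    private final Function<String, List<String>> selector;
    private final Map<String, List<String>> result = new LinkedHashMap<>();
    private String pending; // selector waiting for its as(name) call

    Extraction(Function<String, List<String>> selector) {
        this.selector = selector;
    }

    Extraction $(String css) {
        this.pending = css;
        return this;
    }

    Extraction as(String name) {
        if (pending == null) {
            throw new IllegalStateException("call $(selector) before as(name)");
        }
        result.put(name, new ArrayList<>(selector.apply(pending)));
        pending = null;
        return this;
    }

    Map<String, List<String>> build() {
        return result;
    }
}
```

With a real selector engine plugged in, `new Extraction(engine).$("textarea.content").as("content").$("#img").as("img").build()` would produce a map shaped like the `{content:[],img:[]}` result the todo asks for.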

Factory pattern

Make sensible use of the factory pattern to reduce code duplication.
