zhuyingda / webster

505.0 33.0 56.0 120 KB

A reliable high-level web crawling and scraping framework for Node.js.

License: GNU General Public License v3.0

JavaScript 97.95% HTML 2.05%
scraping-framework crawler crawling headless-chrome chromium spider automation-ui automation-test nodejs nodejs-framework

webster's People

Contributors

monkeywithacupcake, zhuyingda

webster's Issues

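Give Consumer.browserRequest a single responsibility

The job of browserRequest is to issue the network request, so I suggest moving the following out of it:
1. the deviceType check
2. the parseHtml parsing step
browserRequest would then be responsible only for sending the request.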
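
The split proposed above could look roughly like the following. This is a minimal sketch and not webster's actual code: the function names resolveDeviceOptions and crawl, the user-agent strings, and the injected fetchPage function are all assumptions made for illustration. Only browserRequest, deviceType and parseHtml are names taken from the issue.

```javascript
// Hypothetical sketch: these helpers do not come from webster's source.
// Each function has one job, so browserRequest only performs the request.

// Illustrative user-agent strings (assumed values, not webster's).
const USER_AGENTS = {
  pc: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
  mobile: 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_0 like Mac OS X)',
};

// 1. Device handling: map a deviceType to request options.
function resolveDeviceOptions(deviceType) {
  const userAgent = USER_AGENTS[deviceType];
  if (!userAgent) {
    throw new Error(`Unsupported deviceType: ${deviceType}`);
  }
  return { userAgent };
}

// 2. Request: only fetches the page. `fetchPage` is injected by the caller
//    (for example a wrapper around a headless browser), so this function
//    knows nothing about devices or parsing.
async function browserRequest(url, options, fetchPage) {
  return fetchPage(url, options);
}

// 3. Parsing: turns raw HTML into a result. As a stand-in for real parsing,
//    this only extracts the <title> text.
function parseHtml(html) {
  const match = /<title>([^<]*)<\/title>/i.exec(html);
  return { title: match ? match[1].trim() : null };
}

// The caller composes the three steps.
async function crawl(url, deviceType, fetchPage) {
  const options = resolveDeviceOptions(deviceType);
  const html = await browserRequest(url, options, fetchPage);
  return parseHtml(html);
}
```

With this shape, the device check and the parser can each be tested without a browser, and browserRequest can be swapped for a different transport without touching either of them.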

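Features I would like the crawler to support

  • Settings
    • Countering anti-crawler measures
      • User agent
      • Disabling cookies
      • Interval between page fetches
      • IP pool: switch IPs dynamically when an IP is detected as blocked.
    • Crawl efficiency
      • Duplicate URLs are not crawled again.
      • Distributed crawling.
  • Before fetching a page
    • Callback support, typically used for logging in.
  • After receiving the page response
    • Detect the page encoding to prevent garbled text.
    • Callback support.
  • Extracting page data
    • Support parsing HTML with CSS selectors and XPath.
    • Setting cookies.
    • Ability to simulate user actions.
  • Ability to add pages to crawl dynamically.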
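
A few of the requested features (skipping duplicate URLs, an interval between fetches, before/after callbacks, and adding pages dynamically) can be sketched in a small in-memory queue. This is only an illustration under stated assumptions: CrawlQueue, intervalMs, beforeFetch, afterFetch and fetchPage are hypothetical names, not part of webster, and a distributed crawler would need a shared store rather than a local Set.

```javascript
// Hypothetical sketch; CrawlQueue is not webster's API.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

class CrawlQueue {
  constructor({ intervalMs = 1000 } = {}) {
    this.intervalMs = intervalMs; // pause between two page fetches
    this.seen = new Set();        // every URL ever queued, for deduplication
    this.pending = [];            // URLs waiting to be fetched
  }

  // Queue a URL unless it was queued before. Returns true if it was added.
  add(url) {
    const key = new URL(url).href; // normalizes scheme and host casing
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push(key);
    return true;
  }

  // Fetch queued URLs one by one. `fetchPage` is supplied by the caller.
  // `beforeFetch` could perform a login; `afterFetch` receives the queue
  // so it can add newly discovered pages while the crawl is running.
  async run(fetchPage, { beforeFetch, afterFetch } = {}) {
    while (this.pending.length > 0) {
      const url = this.pending.shift();
      if (beforeFetch) await beforeFetch(url);
      const body = await fetchPage(url);
      if (afterFetch) await afterFetch(url, body, this);
      if (this.pending.length > 0) await sleep(this.intervalMs);
    }
  }
}
```

Because afterFetch is handed the queue itself, a page handler can call queue.add() for links it finds, and the deduplication check keeps the crawl from revisiting them.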

Your Redis credentials have been leaked!!

let myConsumer = new MyConsumer({
    channel: 'baidu',
    sleepTime: 5000,
    deviceType: 'pc',
    dbConf: {
        redis: {
            host: 'redis-15455.c80.us-east-1-2.ec2.cloud.redislabs.com',
            port: 15455,
            password: '<redacted>' // the real password was committed here in plain text
        }
    }
});
myConsumer.startConsume();
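
One common way to keep credentials like these out of the repository is to read them from environment variables at startup. The helper below is a sketch under assumptions: loadRedisConf and the variable names REDIS_HOST, REDIS_PORT and REDIS_PASSWORD are hypothetical and not defined by webster; only the shape of the returned object mirrors the dbConf.redis block above.

```javascript
// Hypothetical helper: the environment variable names are assumptions.
function loadRedisConf(env = process.env) {
  const { REDIS_HOST, REDIS_PORT, REDIS_PASSWORD } = env;
  if (!REDIS_HOST || !REDIS_PASSWORD) {
    throw new Error('REDIS_HOST and REDIS_PASSWORD must be set');
  }
  // Fall back to the default Redis port when REDIS_PORT is not set.
  const port = Number.parseInt(REDIS_PORT || '6379', 10);
  if (Number.isNaN(port)) {
    throw new Error(`Invalid REDIS_PORT: ${REDIS_PORT}`);
  }
  return { host: REDIS_HOST, port, password: REDIS_PASSWORD };
}

// Usage, mirroring the snippet above:
//   new MyConsumer({ channel: 'baidu', dbConf: { redis: loadRedisConf() } });
```

A password that has already been published should also be rotated, since removing it from the code does not remove it from the history.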
