Coder Social home page Coder Social logo

crawlerclientcas's Introduction

readme

注意

  1. 微信爬虫运行需要系统装有firefox浏览器

配置

config.properties:

- 数据源    
- 配置读取方式 0:from xml file, 1: from database  
- 运行方式: run/test  

需要登录的站点

将账号配置在crawler_account表中
site_id与site_template表中对应模板的id一致

垂直监控配置

将要监控的url配置在monitor_site中,注意site_name

爬虫启动

maven 打包运行方式

  1. 本地maven库安装第三方依赖jar
//project 目录下运行命令
mvn install:install-file -Dfile=lib/ojdbc6.jar -DgroupId=com.oracle -DartifactId=ojdbc6 -Dversion=11.2.0 -Dpackaging=jar
mvn install:install-file -Dfile=lib/chardet-1.0.jar -DgroupId=org.mozilla.intl -DartifactId=chardet -Dversion=1.0 -Dpackaging=jar
mvn install:install-file -Dfile=lib/crwlerlog.jar -DgroupId=crwlerlog -DartifactId=crwlerlog -Dversion=1.0 -Dpackaging=jar
mvn install:install-file -Dfile=lib/hadoop-core-1.0.4.jar -DgroupId=org.apache -DartifactId=hadoop -Dversion=1.0.4 -Dpackaging=jar
mvn install:install-file -Dfile=lib/hbase-0.94.16-security.jar -DgroupId=org.apache.hadoop -DartifactId=hbase -Dversion=0.94.16 -Dpackaging=jar
mvn install:install-file -Dfile=lib/mail.jar -DgroupId=com.sun.mail -DartifactId=mail -Dversion=1.0 -Dpackaging=jar
  1. 打包 mvn clean package -DskipTests
  2. 复制 config 目录到target/
cp -r config target/
  1. 运行 java -jar CrawlerClientCAS-1.0.jar type=15 name=test project=66666
  • 参数
type=n

1: 新闻搜索; 2: 新闻垂直; 3: 论坛搜索; 4: 论坛垂直; 5: 博客搜索; 6: 博客垂直; 7: 微博搜索; 8: 微博垂直; 9:视频搜索; 13: 电商搜索; 14: 电商垂直; 15: 微信搜索; 16: 微信垂直; 21: 上市公司报告搜索; 34: 公司信息垂直 ; 37: 政务搜索; 39: 刊物搜索; 41: 外媒搜索 ;45:客户端搜索

crawlercount=n   

需要分布式部署多少个爬虫

clientindex=n   

当前是几号爬虫,同时该字段也对应与微博爬虫的账号选择, 选择方式是,clientindex = int[(crawler_account.valid+1)/2]

组内爬虫程序

目录:

config/ 配置文件目录

_ ./site 站点采集属性配置

_ ./app-sysconfig.xml 爬虫结构属性配置

_ ./config.properties 爬虫运行属性配置

site/ 站点采集模板(新配置站点已经采用数据库存储)

src/

_ ./common

_ _ ./bean

_ _ ./communicate

_ _ ./down

_ _ ./extractor

_ _ ./filter

_ _ ./http

_ _ ./rmi

_ _ ./service

_ _ ./siteinfo

_ _ ./system

_ _ ./up2hdfs

_ _ ./util

_ _ CrawlerStart.java

注: 增加了代码的分布式运行方式

crawlerclientcas's People

Contributors

xiaodaguan avatar ziyue246 avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.