covid19-news-crawl's Introduction

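Epidemic information collection project

This crawler scrapes the historical COVID-19 press conferences published on government websites of different regions, for use in data analysis. The tech stack is scrapy, selenium, and mongodb.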

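Download the latest chromedriver that matches your environment, then move it to a directory on your PATH. /usr/bin is write-protected on recent macOS versions, so /usr/local/bin is used here.

sudo mv ~/Downloads/chromedriver /usr/local/bin

vi ~/.bash_profile

export PATH=$PATH:/usr/local/bin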
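
To confirm that the PATH change took effect before running the crawler, a small check like the one below may help. It is only a sketch that uses the Python standard library; the function name find_driver is made up for illustration and is not part of this project.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the absolute path of `name` if it is on PATH, else None."""
    # shutil.which searches the directories listed in the PATH variable.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path:
        print(f"chromedriver found at {path}")
    else:
        print("chromedriver is not on PATH; check ~/.bash_profile")
```

If the script reports that the driver is missing, run source ~/.bash_profile in the same terminal and try again.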

Install mongodb

Go to /usr/local

cd /usr/local

Download

sudo curl -O https://fastdl.mongodb.org/osx/mongodb-osx-ssl-x86_64-4.0.9.tgz

Extract

sudo tar -zxvf mongodb-osx-ssl-x86_64-4.0.9.tgz

Rename the directory to mongodb

sudo mv mongodb-osx-x86_64-4.0.9/ mongodb

Once installed, add the following line to ~/.bash_profile

export PATH=/usr/local/mongodb/bin:$PATH

Data directory:

sudo mkdir -p /usr/local/var/mongodb

Log directory:

sudo mkdir -p /usr/local/var/log/mongodb

Make sure your account owns both directories (replace <username> with your account name)

sudo chown <username> /usr/local/var/mongodb
sudo chown <username> /usr/local/var/log/mongodb

Start the mongodb service in the background. Before starting it, remember to reload the configuration with source ~/.bash_profile

mongod --dbpath /usr/local/var/mongodb --logpath /usr/local/var/log/mongodb/mongo.log --fork
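
Because --fork sends mongod to the background, it can be useful to check that the server is accepting connections. The sketch below uses only the standard library and assumes mongod listens on its default port 27017, since the command above does not pass --port. The function name mongod_is_listening is made up for illustration.

```python
import socket


def mongod_is_listening(host="127.0.0.1", port=27017, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # A successful TCP connect only shows that the port is open,
        # not that the process behind it is really mongod.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if mongod_is_listening():
        print("mongod appears to be running")
    else:
        print("no server on port 27017; see /usr/local/var/log/mongodb/mongo.log")
```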

Install the Python packages

pip install selenium
pip install scrapy
pip install xlwt
pip install pymongo

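flashtext is a text-matching library whose matching time stays roughly constant no matter how many keywords are searched for. This project uses it to clean the data stored in mongo. If you are interested in how it works, see the author's paper.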

pip install flashtext

Create a logs directory in the project root to store the log files

mkdir logs
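
For reference, a script could write into that directory along the following lines. This is a minimal sketch that uses the standard logging module. The logger name, the file name crawl.log, and the function get_file_logger are assumptions made for illustration, and the project's actual logging setup may differ.

```python
import logging
from pathlib import Path


def get_file_logger(log_dir="logs", filename="crawl.log"):
    """Return a logger that appends to log_dir/filename."""
    # Create the directory if it is missing, so the sketch also
    # works when `mkdir logs` has not been run yet.
    Path(log_dir).mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("covid19_news_crawl")
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.FileHandler(Path(log_dir) / filename, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Calling get_file_logger().info("spider started") from the project root would then append one line to logs/crawl.log.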
