newscrawler's Introduction

Crawler summary:

Toutiao (今日头条)

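The initial URLs are easy to obtain, but collecting data from each URL ran into some problems. Toutiao's detail pages only become complete after their JavaScript has finished loading, so WebMagic could not fetch a usable page; specifically, it could not get the article content. The detail data was therefore collected with a simulated browser plus multiple threads, which limits the crawl speed. In addition, Toutiao itself offers relatively little data, and some of its pages have an irregular layout, so the final amount of data collected is small.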
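The combination described above, rendering each detail page in a simulated browser and spreading the work over several threads, might be sketched as follows. This is a minimal sketch under stated assumptions, not the project's code: `render_page` is a hypothetical callable standing in for whatever browser automation is used (the notes do not name the tool), and `crawl_details` and `max_workers` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def crawl_details(
    urls: Iterable[str],
    render_page: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Render each URL on a worker thread and return {url: rendered_html}.

    render_page is a hypothetical stand-in for a browser-driven fetch that
    waits for JavaScript to finish. Pages that raise or come back empty are
    skipped, which is one way irregular pages reduce the final yield.
    """
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be keyed by URL.
        futures = {pool.submit(render_page, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                html = future.result()
            except Exception:
                continue  # failed or irregular page: skip it
            if html:  # keep only pages that produced content
                results[url] = html
    return results
```

Because a real browser is slow and memory-hungry, `max_workers` would normally stay small; that is the speed limit the paragraph above refers to.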

Baidu News (百度新闻)

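Care is needed when building the initial URL: a Chinese keyword must be encoded first (in practice, URL percent-encoding) to get normal results. Although Baidu reports a large number of news items, it actually serves only a small portion of them. When extracting the content of a Baidu News article, the irregular page layouts make it hard to get accurate data, so the text of the p tags is taken as the body, which basically matches the actual content. Some noise fields remain, however; removing them is troublesome and is not yet solved.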
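The two steps in this paragraph, encoding the keyword and taking the p tag text as the body, can be illustrated with the Python standard library. This is a rough sketch rather than the project's implementation: `SEARCH_URL` and the `word` parameter name are placeholders (the notes do not give the real search URL), and `build_search_url`, `ParagraphExtractor`, and `extract_body` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlencode

# Placeholder endpoint; the real search URL is not given in the notes.
SEARCH_URL = "https://news.example.com/search"


def build_search_url(keyword: str, param: str = "word") -> str:
    """Percent-encode the keyword as UTF-8 so non-ASCII text is URL-safe."""
    return SEARCH_URL + "?" + urlencode({param: keyword})


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False            # True while inside a <p>
        self._buffer: List[str] = []    # text pieces of the current <p>
        self.paragraphs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()                # an unclosed <p> ends where the next begins
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def flush(self) -> None:
        """Store the buffered paragraph if it has any non-blank text."""
        text = "".join(self._buffer).strip()
        self._buffer.clear()
        if text:
            self.paragraphs.append(text)


def extract_body(html: str) -> str:
    """Join the text of all <p> elements, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()                      # pick up a trailing unclosed <p>
    return "\n".join(parser.paragraphs)
```

Taking every p tag is deliberately crude: it also picks up captions, footers, and similar noise, which is the unsolved problem mentioned above.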

After crawling, the data was given a simple cleaning pass: duplicate records and records with an empty title were removed.
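A cleaning pass of this kind could look like the sketch below. The record layout is an assumption: each record is taken to be a dict with `title` and `url` keys, and duplicates are detected by URL (falling back to the title), since the notes do not say which field defines a duplicate.

```python
from typing import Dict, Iterable, List, Optional


def clean_records(
    records: Iterable[Dict[str, Optional[str]]],
) -> List[Dict[str, Optional[str]]]:
    """Drop records with an empty title, then drop duplicates (first one wins).

    Assumption: a duplicate is a record whose "url" (or, if absent, title)
    has already been seen.
    """
    seen = set()
    cleaned = []
    for record in records:
        title = (record.get("title") or "").strip()
        if not title:
            continue  # empty or missing title
        key = record.get("url") or title
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

Keeping the first occurrence preserves crawl order, which makes the result easy to compare against the raw data.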

newscrawler's People

Contributors

wzes
