taobao's Introduction

Targeted scraping of Taobao product information

Author 😎Henryhaohao😎
Email ♥️[email protected]♥️

🐬Notice

2018-10-12:

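Taobao has updated its anti-scraping mechanism: clicking search now triggers an anti-scraping check and redirects straight to the login page. The workaround is to simulate a login before scraping, but the simulated login has to pass a slider CAPTCHA, and driving the slider directly with Selenium does not work at all.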

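According to this Zhihu thread: https://www.zhihu.com/question/35538123, Taobao's UA verification works roughly as follows. Once the page has finished loading, it starts collecting the user's actions on the page, including mouse clicks, movement trajectories, button presses and releases, and slider drags, along with timing information. It then generates strings from this data using some algorithm and concatenates them. Finally, in the login or username-verification request, the result is sent to the server as a parameter named ua. The number of collections is limited, and collection stops once the limit is reached. The hard part is that this process is very complex: the relevant JS is obfuscated, and every value of interest (more than 2,000 of them) is stored in a handful of arrays and accessed as array[index], so the logic is impossible to follow and debugging is extremely painful. (:◎)≡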
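To make the description above concrete, here is a toy sketch of the general idea: record a capped number of UI events with timestamps, then flatten them into one string. It is not Taobao's actual algorithm, which is obfuscated and unknown here; the class name `BehaviorCollector`, the cap of 5 events, and the `|`-separated encoding are all made up for illustration.

```python
import time
from typing import List, Optional, Tuple


class BehaviorCollector:
    """Toy recorder for UI events; NOT Taobao's real algorithm."""

    def __init__(self, max_events: int = 5) -> None:
        # Hypothetical cap; the real limit is not known.
        self.max_events = max_events
        self.events: List[Tuple[str, int, int, int]] = []

    def record(self, kind: str, x: int, y: int, t_ms: Optional[int] = None) -> bool:
        """Store one event; return False once the cap has been reached."""
        if len(self.events) >= self.max_events:
            return False
        if t_ms is None:
            t_ms = int(time.time() * 1000)
        self.events.append((kind, x, y, t_ms))
        return True

    def encode(self) -> str:
        """Join all events into one string, standing in for the opaque `ua` value."""
        return "|".join(f"{k},{x},{y},{t}" for k, x, y, t in self.events)
```

The real script presumably encodes far more signal in a far less readable form, which is why replaying plausible values from Selenium is so difficult.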

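In short: this Taobao project is dead for now, and I'll keep looking into it when I can~~ If you have a better solution, feel free to get in touch~ :smile: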

🐬Introduction

This project is a targeted crawler for product information on Taobao.

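  • Overview: scrapes the specified product information by searching Taobao for a keyword
  • Scraping method: Python's Selenium browser-automation library together with the PhantomJS headless browser
  • Crawler file: run spider.py in the Spiders directory
  • Config file: before running, edit config.py in the Spiders directory; KEYWORD is the search keyword for the product name, and the file also holds the MongoDB settings
  • Note: to scrape additional fields, add them to item yourself; the current fields are product name, city, detail link, cover image, price, sales volume, and shop name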
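As a rough illustration of the configuration and item described above, the sketch below models the seven listed fields and a keyword-driven search URL. Only `KEYWORD` is named in this README; the `MONGO_*` names, the `Product` class, its English field names, and the search URL pattern are assumptions and may differ from the actual config.py and spider.py.

```python
from dataclasses import dataclass, asdict
from urllib.parse import quote

# KEYWORD is mentioned in the README; the MongoDB names below are assumed.
KEYWORD = "耳机"
MONGO_URL = "localhost"   # hypothetical setting name
MONGO_DB = "taobao"       # hypothetical setting name
MONGO_TABLE = "products"  # hypothetical setting name


@dataclass
class Product:
    """One scraped listing with the seven fields the README lists."""
    title: str
    city: str
    detail_url: str
    cover: str
    price: str
    sales: str
    shop: str


def search_url(keyword: str) -> str:
    """Build a search URL for the keyword (the URL pattern is an assumption)."""
    return "https://s.taobao.com/search?q=" + quote(keyword)


def to_document(product: Product) -> dict:
    """Convert an item into a plain dict, the shape a MongoDB insert expects."""
    return asdict(product)
```

Adding a new field would then mean adding one attribute to the item and filling it in wherever the listing is parsed.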

🐬Runtime Environment

Version: Python3

🐬Install Dependencies

pip3 install -r requirements.txt

🐬Screenshots

  • Running

    [screenshot: crawler run]
  • Data structure

    [screenshot: stored data structure]

🐬Summary

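Finally, if you like this project or find it helpful, please give it a Star; it would be a real encouragement on my learning journey!
Haha, thanks everyone! Much love~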
💘💘


taobao's Issues

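Why not scrape via Taobao's H5 site?

Why not scrape through the H5 pages? m.taobao.com does show a CAPTCHA, but it is an image CAPTCHA rather than a slider.
You could also route requests through an HTTP proxy.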
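A minimal sketch of the proxy suggestion, using only the standard library: it builds an opener that routes traffic through an HTTP proxy and sends a mobile-style User-Agent. The proxy address, the User-Agent string, and the helper name `build_proxied_opener` are placeholders, and whether the H5 site can actually be scraped this way is untested here.

```python
import urllib.request

MOBILE_URL = "https://m.taobao.com"  # H5 site named in the issue
PROXY = "http://127.0.0.1:8888"      # hypothetical proxy address


def build_proxied_opener(proxy: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    # A mobile User-Agent is an assumption; the issue does not specify one.
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Linux; Android 10) Mobile")]
    return opener


opener = build_proxied_opener(PROXY)
# opener.open(MOBILE_URL, timeout=10) would perform the request;
# it is left out here because it needs a live proxy and network access.
```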
