aurora303 / scraping

This project forked from fredfeng0326/scraping

Crawlers for JD.com, Taobao, Suning, and Amazon that scrape product information and analyze the data

Python 89.21% Shell 0.01% C 1.80% HTML 2.08% JavaScript 5.20% CSS 0.63% TeX 0.01% C++ 0.59% Objective-C 0.07% XSLT 0.38% Fortran 0.02% Smarty 0.01%

scraping's Introduction

Scraping

E-commerce crawler and analysis (test stage)

1. Overview

Scrapes data from JD.com, Taobao, Suning, and Amazon, stores it in a database, and analyzes it.

2. The scraped dict

 the_basic_info = {
                    'search_keyword': self.keyword,  # keyword used for the search
                    'last_crawling_timestamp': datetime.now(),  # time of this crawl
                    'platform': 'JD',  # platform that was scraped
                    'product_name': product_name,  # product name
                    'seller_name': seller_name,  # seller name
                    'sku_id': _data_pid,  # product ID
                    'default_price': float(final_price),  # final price
                    'final_price': 0,
                    'item_url': _http,  # product page URL
                    'comments_ave_score': float(score_avg),  # product rating
                    'comments_count': comment_count,  # number of comments
                    'images': img,  # product image URLs
                    'current_stock': location_list,  # where the product is stocked
                    'search_rank': rank,  # rank under the current search ordering
                    'search_order': order,  # current ordering (by sales, price, popularity, etc.)
                    'seller_url': seller_url,  # seller page URL
                    'comments_list': comment_list  # comment details; up to 100 comments can be scraped
                }

An example:

product_name 戴尔灵越游匣15PR-6748B 15.6英寸游戏笔记本电脑(i7-7700HQ 8G 128GSSD+1T GTX1050 4G独显 IPS)黑
last_crawling_timestamp 2017-12-28 20:20:09.684290
seller_name 戴尔京东自营旗舰店
sku_id 4824733
default_price 6599.0
item_url http://item.jd.com/4824733.html
comments_count 72000
comments_ave_score 5.0
images ['http://img13.360buyimg.com/n7/jfs/t12472/179/736139380/319777/f266f597/5a128bf6N079a87ba.jpg']
search_rank 1
seller_url http://mall.jd.com/index-1000000140.html
comments_list [{'content_score': 5, 'content_time': '2017-12-05 18:54:31', 'content_title': None, 'content': '用了将近一个月了,说说体验如何。11月9号凌晨买的,当天下午就到了。包装精简,京东袋子里就是戴尔的盒子。电脑颜值高,A面类肤质,后面散热口非常帅。电脑不轻薄,因为做工的好的原因有点厚重,不过这样才有点游戏本的意思。宿舍里还有台暗影精灵2pro和R720,相比2pro键盘敲打起来挺有弹性,但是背光没有其他两台亮。个人感觉键盘触感最好的还是R720,而且按键大一些。说说R720和2PRO跟游匣无法比拟的,那就是低音炮,音质非常好,三个室友都夸赞羡慕游匣的音质。所以我的电脑也成了我们宿舍的音响。。。屏幕呢是ips45色域的。对于以前一直用的是TN屏的我感觉这电脑屏幕相当好了。再说说性能,其实性能是最不用说的,配置都摆在那里,鲁大师跑分将近一万八,1050ti能够应付大多数大型单机游戏了,吃鸡中画质可以流畅运行。运行大型游戏时风扇会全力运作,声音稍微有点响(散热好和噪音小不可兼得),我更注重散热所以风扇声大点无所谓,听着还挺带劲的。固态(不是nvme协议)和机械硬盘都比较差,开机十秒左右。总结下吧。优点:1.颜值高2.散热好3.做工精良4.配置低音炮缺点:1.低端ips屏2.略厚重3.硬盘差'}]
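
The overview says the scraped data is stored in a database, but the README does not show how. The following is a minimal sketch of one way to persist a record shaped like the dict above, assuming SQLite via Python's standard `sqlite3` module. The table name `products`, the column types, the `(platform, sku_id)` primary key, and the helper `save_record` are all hypothetical and are not taken from the project.

```python
import json
import sqlite3
from datetime import datetime

# Hypothetical schema: column names mirror the dict keys above; the table
# name, the column types, and the primary key are assumptions.
# images and comments_list are stored as JSON-encoded text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    platform TEXT NOT NULL,
    sku_id TEXT NOT NULL,
    search_keyword TEXT,
    last_crawling_timestamp TEXT,
    product_name TEXT,
    seller_name TEXT,
    default_price REAL,
    comments_ave_score REAL,
    comments_count INTEGER,
    item_url TEXT,
    images TEXT,
    comments_list TEXT,
    PRIMARY KEY (platform, sku_id)
)
"""

COLUMNS = [
    "platform", "sku_id", "search_keyword", "last_crawling_timestamp",
    "product_name", "seller_name", "default_price", "comments_ave_score",
    "comments_count", "item_url", "images", "comments_list",
]


def save_record(conn, record):
    """Insert one scraped record, replacing any earlier row for the same SKU."""
    row = {name: record.get(name) for name in COLUMNS}
    row["sku_id"] = str(row["sku_id"])
    # SQLite has no native datetime or list types, so serialize them as text.
    if isinstance(row["last_crawling_timestamp"], datetime):
        row["last_crawling_timestamp"] = row["last_crawling_timestamp"].isoformat()
    row["images"] = json.dumps(row["images"], ensure_ascii=False)
    row["comments_list"] = json.dumps(row["comments_list"], ensure_ascii=False)
    sql = "INSERT OR REPLACE INTO products ({}) VALUES ({})".format(
        ", ".join(COLUMNS), ", ".join(":" + name for name in COLUMNS)
    )
    conn.execute(sql, row)
    conn.commit()


# A shortened version of the example record shown above.
example = {
    "search_keyword": "dell",
    "last_crawling_timestamp": datetime(2017, 12, 28, 20, 20, 9, 684290),
    "platform": "JD",
    "product_name": "戴尔灵越游匣15PR-6748B",
    "seller_name": "戴尔京东自营旗舰店",
    "sku_id": 4824733,
    "default_price": 6599.0,
    "item_url": "http://item.jd.com/4824733.html",
    "comments_ave_score": 5.0,
    "comments_count": 72000,
    "images": ["http://img13.360buyimg.com/n7/jfs/t12472/179/736139380/319777/f266f597/5a128bf6N079a87ba.jpg"],
    "comments_list": [{"content_score": 5, "content_time": "2017-12-05 18:54:31"}],
}

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(SCHEMA)
save_record(conn, example)
print(conn.execute("SELECT sku_id, default_price FROM products").fetchall())
```

Keying the table on `(platform, sku_id)` means a re-crawl of the same product overwrites the earlier row. If price history matters for the analysis, a schema that appends one row per crawl would be the better choice.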

3. Testing

if __name__ == "__main__":
    j = JDMonitoringEngine()
    # Build the search-result URLs for the keyword, page limit, and sort order(s).
    j.set_searching_url(_keyword="dell", _page_limit=1, _order=["sales"])
    url_list = j.url_list
    for _index, url_dict in enumerate(url_list):
        logger.info("Sending {0}/{1} url dict to basic info extraction".format(
            (_index + 1), len(url_list)))
        # Extract the basic info for every URL in this batch.
        results = [j.get_basic_info(x) for x in url_dict]

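Change the _keyword, _page_limit, and _order arguments in the main block of jd_monitoring_engine to the case you want to test. The three parameters are the search keyword, the number of result pages to crawl, and the sort order(s) of the search results.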
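
Note that the test loop reassigns `results` on every iteration, so only the last batch is still available once the loop finishes. A minimal sketch of accumulating every batch into one list could look like the following. The helper name `collect_basic_info` is hypothetical, and it assumes, as the loop does, that each element of `url_list` is itself iterable.

```python
def collect_basic_info(url_list, extract):
    """Call extract on every URL of every batch and return one flat list."""
    collected = []
    for batch in url_list:
        for url in batch:
            collected.append(extract(url))
    return collected


# Stand-in extractor for illustration. With the real engine the call would
# presumably be collect_basic_info(j.url_list, j.get_basic_info).
def fake_extract(url):
    return {"item_url": url}


batches = [
    ["http://item.jd.com/1.html", "http://item.jd.com/2.html"],
    ["http://item.jd.com/3.html"],
]
records = collect_basic_info(batches, fake_extract)
print(len(records))
```

Passing the extractor in as a callable keeps the helper independent of the engine class, so it can be exercised without network access.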

scraping's People

Contributors

fredfeng0326
