lsn001118 / python-internetwormdatavisualization

This project forked from ra1nv/python-internetwormdatavisualization

Implemented with a Python crawler, the Flask framework, Echarts, WordCloud, and related technologies.

Python 71.28% CSS 0.10% JavaScript 7.81% HTML 0.91% Shell 0.01% OpenEdge ABL 18.85% C 0.63% Jupyter Notebook 0.04% C++ 0.34% Fortran 0.02%

python-internetwormdatavisualization's Introduction

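Web Crawler and Data Visualization Project Practice

© Software copyright belongs to the author. All data in this project comes from the internet and is for learning purposes only!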

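Abstract

This project uses a Python web crawler, the Flask framework, Echarts components, and sqlite3 to scrape Python-related job postings for the Guangzhou area from the 51job recruitment site.
The goal is to learn about web crawling and data visualization analysis, and to get a first look at the job market for Python-related positions. The software used is PyCharm Edu.
Course reference: Python Crawler and Data Visualization
Concepts involved: web crawler, data visualization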

Preparation

Importing Modules

The crawling stage of this project mainly uses four libraries: BeautifulSoup (from the bs4 package), re, urllib, and sqlite3.

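Page Analysis

After searching 51job for Python and selecting Guangzhou as the region, the search results page has a URL of the form https://search.51job.com/list/030200,000000,0000,00,9,99,Python,2,{x}.html
where {x} is the current page number (starting from 1).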
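
The URL pattern above can be turned into a small helper. This is a minimal sketch: `BASE_URL`, `build_page_url`, and `build_page_urls` are illustrative names that do not appear in the project, and the number of pages to request is whatever the search reports at crawl time.

```python
from typing import List

# Search URL quoted in the text, up to (but not including) the page number.
BASE_URL = "https://search.51job.com/list/030200,000000,0000,00,9,99,Python,2,"


def build_page_url(page: int, base: str = BASE_URL) -> str:
    """Return the results URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"{base}{page}.html"


def build_page_urls(last_page: int) -> List[str]:
    """Return the URLs for pages 1..last_page inclusive."""
    return [build_page_url(page) for page in range(1, last_page + 1)]


if __name__ == "__main__":
    for url in build_page_urls(3):
        print(url)
```

Keeping the page number as the only varying part makes it easy to adjust the page range when the number of search results changes.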

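Fetching Data

The page is fetched with the urllib library. The function adds a request header (User-Agent) so that the request looks like it comes from a browser.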

def askUrl(url):
    head = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
    }
    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        print(url) # print the URL so that pages which fail to parse can be located
        html = response.read().decode("gbk")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html
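
askUrl decodes every response as GBK and catches only URLError, so a body that is not valid GBK would raise an uncaught UnicodeDecodeError. One possible way to soften this is sketched below. `decode_page` is a hypothetical helper, not part of the project, and the list of encodings to try is an assumption.

```python
def decode_page(raw: bytes, encodings=("gbk", "utf-8")) -> str:
    """Decode page bytes, trying each encoding in order (hypothetical helper).

    If every encoding fails, fall back to the first one and substitute
    U+FFFD for undecodable bytes instead of raising.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode(encodings[0], errors="replace")


if __name__ == "__main__":
    print(decode_page("广州".encode("gbk")))
    print(decode_page("你".encode("utf-8")))
```

The order of the encodings matters: many UTF-8 byte sequences also happen to be valid GBK, so the fallback only triggers when the GBK decode actually fails.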

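Parsing Content

Getting the detail-page URLs

On the search results page, every detail-page hyperlink we need sits inside a <div> tag whose attribute is class="dw_table". We locate that tag by its CSS class, convert it to a string, and apply a regular expression to extract the hyperlink of each job posting into a list.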

def getUrl(baseurl):
    urllist = [] # detail-page link for each job posting
    for i in range(1,75): # 74 pages in total
        url = baseurl + str(i) + ".html"
        html = askUrl(url)
        soup = BeautifulSoup(html,"html.parser")
        for item in soup.find_all("div", class_="dw_table"):
            item = str(item)
            link = re.findall(findUrl, item)
            urllist.extend(link)
    return urllist
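
The regex step inside getUrl can be tried in isolation. The sketch below applies the project's findUrl pattern to a hand-written fragment; the HTML and the URLs in `sample` are made up for illustration and are not taken from 51job.

```python
import re

# Same pattern as findUrl in the regular expressions section.
findUrl = re.compile(r'<a href="(.*?)" onmousedown="" target="_blank"')

# Made-up fragment shaped like the markup the pattern expects.
sample = """<div class="dw_table">
<a href="https://jobs.51job.com/guangzhou/1.html" onmousedown="" target="_blank" title="Python Developer">Python Developer</a>
<a href="https://jobs.51job.com/guangzhou/2.html" onmousedown="" target="_blank" title="Data Engineer">Data Engineer</a>
</div>"""

# findall returns only the captured group, i.e. the href values.
links = findUrl.findall(sample)

# Optional: drop duplicate links while keeping their original order.
unique_links = list(dict.fromkeys(links))

print(unique_links)
```

Because the pattern depends on the exact attribute order in the markup, it stops matching if the site changes its HTML, so an empty result is worth checking for.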

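Getting the details from each detail page

On each detail page, the job title, company name, requirements, job category, keywords, work address, salary, and company information are collected into a list, and each such list is then appended to an outer list.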

def getData(urlist):
    datalist = []
    for html in urlist:
        html = askUrl(html)
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all("div", class_="tCompany_center clearfix"):
            datas = []
            item = str(item)

            data = re.findall(findName, item)[0]
            datas.append(data)

            data = re.findall(findCompany, item)[0]
            datas.append(data)

            data = re.findall(findRequirement, item)
            data = data[0].replace('\xa0', '').replace('|', '')
            datas.append(data)

            data = re.findall(findClass, item)
            # Join at most the first two categories; fall back to a blank string
            # so that every field stored in datas is a str.
            datas.append(' '.join(data[:2]) if data else ' ')

            data = re.findall(findKeyword, item)
            if len(data) > 1:
                for i in range(1, len(data)):
                    data[0] = data[0] + ' ' + data[i]
            if(data):
                datas.append(data[0])
            else:
                datas.append(' ')

            data = re.findall(findAddress, item)
            if (data):
                datas.append(data[0])
            else:
                datas.append(' ')

            data = re.findall(findSalary, item)
            if (data):
                datas.append(data[0])
            else:
                datas.append(' ')

            data = re.findall(findInfo, item)
            if data:
                # Check for a match before indexing, then strip <br/> tags and non-breaking spaces.
                datas.append(data[0].replace('<br/>', '').replace('\xa0', '').replace('"', "'"))
            else:
                datas.append(' ')

            datalist.append(datas)
    return datalist
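
The "take the first match, otherwise a blank" step that getData repeats for several fields could be factored into one function. A minimal sketch follows. `first_match` is a hypothetical helper, and the HTML in `sample` is made up; only the two patterns are copied from the project.

```python
import re

# Patterns copied from the regular expressions section.
findAddress = re.compile(r'上班地址:</span>(.*?)</p>')
findSalary = re.compile(r'<strong>(.*?)</strong>')


def first_match(pattern, text, default=" "):
    """Return the first captured group, or `default` when nothing matches."""
    found = pattern.findall(text)
    return found[0] if found else default


# Made-up fragment for illustration only.
sample = '<strong>1-1.5万/月</strong><p><span>上班地址:</span>天河区</p>'

salary = first_match(findSalary, sample)
address = first_match(findAddress, sample)
missing = first_match(findAddress, "<p>no address here</p>")

print([salary, address, missing])
```

Returning a default instead of indexing `[0]` directly avoids an IndexError when a page lacks one of the fields.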

Regular Expressions

findUrl = re.compile(r'<a href="(.*?)" onmousedown="" target="_blank"') # detail-page link
findName = re.compile(r'<h1 title="(.*?)">') # job title
findCompany = re.compile(r'>(.*?)<em class="icon_b i_link"></em>') # company name
findRequirement = re.compile(r'<p class="msg ltype" title="(.*?)"') # requirements
findClass = re.compile(r'/">(.*?)</a>') # job category
findKeyword = re.compile(r'=">(.*?)</a><') # keywords
findAddress = re.compile(r'上班地址:</span>(.*?)</p>') # work address
findSalary = re.compile(r'<strong>(.*?)</strong>') # salary
findInfo = re.compile(r'<div class="tmsg inbox">(.*?)</div>') # company information
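
One detail of these patterns is worth checking: by default `.` does not match a newline, so `(.*?)` cannot span several lines. The sketch below compares the findInfo pattern compiled with and without the `re.S` flag. The fragment in `sample` is made up, and whether the real pages contain line breaks inside this <div> is an assumption.

```python
import re

# Same pattern as findInfo, compiled without and with re.S (DOTALL).
find_info_default = re.compile(r'<div class="tmsg inbox">(.*?)</div>')
find_info_dotall = re.compile(r'<div class="tmsg inbox">(.*?)</div>', re.S)

# Made-up fragment with a line break inside the <div>.
sample = '<div class="tmsg inbox">Line one<br/>\nLine two</div>'

without_flag = find_info_default.findall(sample)  # "." stops at the newline
with_flag = find_info_dotall.findall(sample)      # "." also matches the newline

print(without_flag)
print(with_flag)
```

If a field turns out to be empty for pages that clearly contain it, a multi-line value is one plausible cause to rule out.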

Saving Data

The data is saved in a sqlite3 database.

Creating the table

def init_db(dbpath):
    sql = '''
        CREATE TABLE IF NOT EXISTS job(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        jname TEXT,
        cname TEXT,
        req TEXT,
        jclass TEXT,
        keywords TEXT,
        address TEXT,
        salary TEXT,
        cinfo TEXT);
    '''
    conn = sqlite3.connect(dbpath)
    cursor = conn.cursor()
    cursor.execute(sql)
    conn.commit()
    conn.close()

Inserting data

def saveData(datalist,dbpath):
    conn = sqlite3.connect(dbpath)
    cur = conn.cursor()
    # Use ? placeholders so that sqlite3 quotes the values itself;
    # a field containing a double quote no longer breaks the statement.
    sql = '''
            INSERT INTO job(jname,cname,req,jclass,keywords,address,salary,cinfo)
            VALUES(?,?,?,?,?,?,?,?)
        '''
    for data in datalist:
        cur.execute(sql, data)
    conn.commit()
    cur.close()
    conn.close()
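
Row insertion can be tried end to end against an in-memory database. The sketch below uses sqlite3 `?` placeholders, which leave quoting to the driver so that values containing quotes are stored intact. `save_rows` is a hypothetical helper and the sample row is made up; the table layout follows init_db.

```python
import sqlite3

CREATE_SQL = """
    CREATE TABLE IF NOT EXISTS job(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    jname TEXT, cname TEXT, req TEXT, jclass TEXT,
    keywords TEXT, address TEXT, salary TEXT, cinfo TEXT);
"""

INSERT_SQL = """
    INSERT INTO job(jname,cname,req,jclass,keywords,address,salary,cinfo)
    VALUES(?,?,?,?,?,?,?,?)
"""


def save_rows(conn, rows):
    """Insert all rows in one transaction (hypothetical helper)."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(INSERT_SQL, rows)


conn = sqlite3.connect(":memory:")  # throwaway database for the demo
conn.execute(CREATE_SQL)

# Made-up row; note the double quotes and the apostrophe in the values.
rows = [[
    "Python Developer", 'Example "Tech" Co.', "Guangzhou 3 years",
    "Software Engineer", "python flask", "Tianhe District",
    "1-1.5万/月", "It's a made-up company.",
]]
save_rows(conn, rows)

stored = conn.execute("SELECT jname, cname, cinfo FROM job").fetchall()
print(stored)
conn.close()
```

Passing a file path instead of ":memory:" would write to a database file on disk, as the project does.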

Project Preview

[Five preview screenshots]

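Summary

Through this project I gained a deeper understanding of the Python language and the libraries used, learned the basics of web crawling, and learned how to further process the scraped data and turn it into charts, which should help with later work in machine learning.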
