xvno commented on August 16, 2024
Reading notes: Python: BeautifulSoup 4.2.0

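Installation and dependencies

Installing BeautifulSoup itself

pip install beautifulsoup4

Parsers

Available parsers

  • html.parser: built into Python, moderate speed, lenient with bad markup (as of Python 3.2.2)
  • lxml: fast and lenient; installed separately with pip
  • lxml-xml, xml: fast; supports XML parsing
  • html5lib: highly compatible (parses pages the way a browser does), produces valid HTML5, slow

pip install lxml html5lib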
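
As a rough sketch of how the parser choice shows up in code: the second argument to BeautifulSoup() names the parser. The helper make_soup and its preference order are hypothetical (they are not part of bs4); the sketch only assumes that bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; not part of bs4.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately sloppy markup: the <p> tag is never closed.
soup = make_soup("<p>Unclosed paragraph <b>bold</b>")
print(soup.p.b.string)  # bold
```

Different parsers may repair broken markup differently (lxml, for instance, wraps the fragment in html and body tags), so naming the parser explicitly keeps results consistent between machines.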

BeautifulSoup built-in objects

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

Tag: HTML Element

Attributes available on a tag:

  • .name
  • .attrs
  • [attrName]
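from bs4 import BeautifulSoup

# Name the parser explicitly; without it bs4 picks one itself and warns.
soup = BeautifulSoup('<h1 class="boldest">Eyes on Me</h1><h2><!--<h3> Faye Won</h3>--></h2>', 'html.parser')
tag = soup.h1
type(tag)    ## <class 'bs4.element.Tag'>

# Tag name and attributes (Python 3 reprs; Python 2 shows a u'' prefix)
tag.name     ## 'h1'
tag.attrs    ## {'class': ['boldest']}
tag['class'] ## ['boldest']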

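NavigableString: the NavigableString class wraps the text inside a tag

Characteristics

  1. A NavigableString behaves like a Unicode string; str() converts it directly to a plain Unicode string (unicode() on Python 2)
  2. A NavigableString cannot be edited in place, but it can be replaced as a whole with .replace_with()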
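tag.string        ## 'Eyes on Me'
type(tag.string)  ## <class 'bs4.element.NavigableString'>

## 1. Convert to a plain string (Python 2: unicode(tag.string))
ustring = str(tag.string)
type(ustring)     ## <class 'str'>

## 2. Replace the whole string; replace_with() returns the string that was replaced
tag.string.replace_with('alphabet')  ## 'Eyes on Me'
tag               ## <h1 class="boldest">alphabet</h1>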

BeautifulSoup

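A BeautifulSoup object represents the entire content of a document. It can be treated like a Tag object and supports most of the traversal and search methods.
It is not a real HTML or XML tag, though, so its .attrs is {} and its .name is the special value '[document]'.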

soup.attrs  ## {}
soup.name   ## '[document]'

Comment

A special type of NavigableString that holds the text of an HTML comment

h2 = soup.h2
cmt = h2.string
cmt              ## '<h3> Faye Won</h3>'
type(cmt)    ## <class 'bs4.element.Comment'>
str(cmt)       ## '<h3> Faye Won</h3>'
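
Because Comment is a subclass of NavigableString, .string returns a comment as readily as ordinary text. A minimal sketch of telling the two apart with isinstance follows; the helper name visible_text is hypothetical and only serves the illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup(
    '<h1 class="boldest">Eyes on Me</h1><h2><!--<h3> Faye Won</h3>--></h2>',
    "html.parser",
)


def visible_text(node):
    """Return node.string as a plain str, or None if it is missing or a comment.

    Hypothetical helper for illustration; not part of bs4.
    """
    s = node.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


h1_text = visible_text(soup.h1)  # ordinary text is returned
h2_text = visible_text(soup.h2)  # the comment is filtered out
print(h1_text, h2_text)
print(issubclass(Comment, NavigableString))
```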

Traversing the DOM tree

Child nodes

  • tag name
  • .contents & .children
  • .descendants
  • .string
  • .strings & .stripped_strings

Parent nodes

  • .parent
  • .parents

Sibling nodes

  • .next_sibling & .next_siblings
  • .previous_sibling & .previous_siblings

Going forward & back

  • .next_element & .next_elements
  • .previous_element & .previous_elements

Sample code

soup.head             ## <head><title>The Dormouse's story</title></head>
soup.title               ## <title>The Dormouse's story</title>
soup.body.b          ## <b>The Dormouse's story</b>
soup.a                   ## <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
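
The other traversal attributes listed above can be sketched the same way. The markup below is a condensed, whitespace-free variant of the three-sisters document, made up for this example so that the sibling results are easy to predict.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
p = soup.p
first_link = soup.a

# Children: .contents is a list, .children an iterator over the same nodes.
child_count = len(p.contents)  # three <a> tags plus two text nodes
child_tags = [c["id"] for c in p.children if isinstance(c, Tag)]

# Parents: .parent is the direct parent, .parents walks up to the document.
parent_name = first_link.parent.name
ancestor_names = [a.name for a in first_link.parents]

# Siblings: text nodes count as siblings too.
next_node = first_link.next_sibling
later_links = [s["id"] for s in first_link.next_siblings if isinstance(s, Tag)]

# Forward in parse order: the next element is the string inside the tag.
next_parsed = first_link.next_element

# Strings: .stripped_strings drops surrounding whitespace.
words = list(p.stripped_strings)

print(child_count, child_tags, parent_name, ancestor_names)
print(repr(next_node), later_links, repr(next_parsed), words)
```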

Searching

CSS Selectors

Filters

  • String
  • RegExp
  • list
  • True
  • filter function
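
Each filter type above can be passed as the first argument of find_all(). A small sketch, using made-up markup and a hypothetical helper names() that only shortens the output:

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head><body>"
    '<p class="story"><a id="link1" href="/elsie">Elsie</a><b>bold</b></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Hypothetical helper: reduce a result list to tag names."""
    return [t.name for t in tags]


# String: exact tag name.
by_string = names(soup.find_all("b"))

# Regular expression: tag names matching the pattern.
by_regex = names(soup.find_all(re.compile("^b")))

# List: any of the given names.
by_list = names(soup.find_all(["a", "b"]))

# True: every tag, but no text nodes.
by_true = names(soup.find_all(True))


# Function: called with each tag, keeps those for which it returns True.
def has_id_but_no_class(tag):
    return tag.has_attr("id") and not tag.has_attr("class")


by_function = names(soup.find_all(has_id_but_no_class))

print(by_string, by_regex, by_list, by_true, by_function)
```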

find_all()

  • name argument
  • keyword arguments
  • searching by CSS class
  • string argument
  • limit argument
  • recursive argument
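
A sketch of these find_all() arguments on a condensed version of the three-sisters markup. One assumption to note: the string argument is spelled string here, which requires Beautiful Soup 4.4.0 or later; older releases such as 4.2.0 call it text.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">Once upon a time'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Hypothetical helper: reduce a result list to id attributes."""
    return [t["id"] for t in tags]


by_name = ids(soup.find_all("a"))                        # name argument
by_keyword = ids(soup.find_all(id="link2"))              # keyword argument
by_href = ids(soup.find_all(href=re.compile("elsie")))   # keyword + regex
by_class = ids(soup.find_all("a", class_="sister"))      # class_, as class is reserved
by_text = soup.find_all(string="Elsie")                  # matches strings, not tags
limited = ids(soup.find_all("a", limit=2))               # stop after two results
direct = soup.html.find_all("a", recursive=False)        # direct children only

print(by_name, by_keyword, by_href, by_class, by_text, limited, direct)
```

The last call returns an empty list because the links are grandchildren of html, not direct children.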

find()

Other

  • find_parents()
  • find_parent()
  • find_next_siblings() & find_next_sibling()
  • find_previous_siblings() & find_previous_sibling()
  • find_all_next() & find_next()
  • find_all_previous() & find_previous()
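
The remaining search methods take the same filters as find_all() but walk a different part of the tree. A minimal sketch on made-up markup, with a select() call added at the end to illustrate the CSS selector route; on current releases select() relies on the soupsieve package, which is installed together with beautifulsoup4.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (or None) instead of a list.
link1 = soup.find(id="link1")
link2 = soup.find(id="link2")
link3 = soup.find(id="link3")

# Upwards.
parent_class = link2.find_parent("p")["class"]
parent_names = [t.name for t in link2.find_parents(["p", "body"])]

# Sideways.
next_id = link2.find_next_sibling("a")["id"]
prev_id = link2.find_previous_sibling("a")["id"]
following = [a["id"] for a in link1.find_next_siblings("a")]

# Forwards and backwards in parse order.
after_first = [a["id"] for a in link1.find_all_next("a")]
before_last = [a["id"] for a in link3.find_all_previous("a")]
nearest_before = link3.find_previous("a")["id"]

# CSS selectors.
selected = [a["id"] for a in soup.select("p.story > a.sister")]

print(parent_class, parent_names, next_id, prev_id, following)
print(after_first, before_last, nearest_before, selected)
```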

Modifying

  • Changing a tag: name & attributes
  • Changing .string
  • append()
  • NavigableString() & .new_tag()
  • insert()
  • insert_before() & insert_after()
  • clear()
  • extract()
  • decompose()
  • replace_with()
  • wrap()
  • unwrap()
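
A short walk through most of the methods above, applied in sequence to one small made-up fragment. The step_* variables are snapshots added for this illustration so the effect of each group of calls can be inspected.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>Old</b></p>', "html.parser")
tag = soup.b

# Change the tag's name and attributes.
tag.name = "strong"
tag["id"] = "headline"
step_rename = str(soup)

# Replace the text, then add more text at the end and at the front.
tag.string = "New"
tag.append(" text")
tag.insert(0, "Brand ")
step_text = str(tag)

# Create a new tag and place it right after the existing one.
link = soup.new_tag("a", href="http://example.com/")
link.string = "link"
tag.insert_after(link)
step_insert = str(soup)

# wrap() encloses an element in a new tag; unwrap() strips a tag but keeps its contents.
wrapper = link.wrap(soup.new_tag("span"))
link.unwrap()
step_wrap = str(soup)

# extract() removes an element from the tree and returns it.
removed = wrapper.extract()
step_extract = str(soup)

# clear() empties a tag; decompose() removes and destroys it.
tag.clear()
step_clear = str(soup)
tag.decompose()
step_final = str(soup)

print(step_rename, step_text, step_insert, step_wrap, sep="\n")
print(step_extract, removed, step_clear, step_final, sep="\n")
```

extract() hands back the removed element so it can be reused elsewhere, whereas decompose() is meant for elements that are no longer needed.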

Output

  • Encoding: soup.prettify('utf-8')
  • Pretty-printed output
  • Compact (non-pretty) output
  • Output formatters
  • get_text()
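
The output options above, sketched on a made-up fragment that contains an accented character and an escaped ampersand so that the effect of encoding and formatters is visible:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>big</b> caf\u00e9 &amp; co</p>", "html.parser")

# Compact output: str() gives the markup on one line, entities kept minimal.
compact = str(soup)

# Pretty-printed output: one tag or string per line, indented.
pretty = soup.prettify()

# Encoding: passing an encoding returns bytes instead of str.
pretty_bytes = soup.prettify("utf-8")
compact_bytes = soup.encode("utf-8")

# Output formatter: "html" converts characters to named entities where possible.
as_entities = soup.decode(formatter="html")

# get_text(): text only; optionally a separator and whitespace stripping.
text = soup.get_text()
pieces = soup.get_text("|", strip=True)

print(compact)
print(pretty)
print(as_entities)
print(text, pieces, sep="\n")
```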

Example code

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
