Installation and dependencies
Install BeautifulSoup itself
pip install beautifulsoup4
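Parsers
Available parsers
- html.parser: built into Python, decent speed, lenient (as of Python 3.2.2)
- lxml: fast and lenient; must be installed separately with pip
- lxml-xml, xml: fast, supports XML parsing
- html5lib: extremely lenient, produces valid HTML5, slow
pip install lxml html5lib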
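Since lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below assumes that behavior is wanted; `make_soup` and its `preferred` order are hypothetical names chosen for illustration, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper; bs4 raises FeatureNotFound when the
    requested parser is not available, so we simply try the next one.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<h1>Eyes on Me</h1>")
print(parser_used, soup.h1.string)
```

html.parser ships with Python, so the loop should always find at least one parser.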
from blog.
BeautifulSoup built-in objects
- Tag
- NavigableString
- BeautifulSoup
- Comment
Tag: an HTML element
Tag properties you can read:
- .name
- .attrs
- [attrName]
from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
soup = BeautifulSoup('<h1 class="boldest">Eyes on Me</h1><h2><!--<h3> Faye Won</h3>--></h2>', 'html.parser')
tag = soup.h1
type(tag) ## <class 'bs4.element.Tag'>
# tag name
tag.name ## 'h1'
tag.attrs ## {'class': ['boldest']}
tag['class'] ## ['boldest']
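NavigableString: the NavigableString class wraps the text inside a tag
Characteristics
- A NavigableString behaves like a Unicode string; str() converts it to a plain Python string
- A NavigableString cannot be edited in place, but .replace_with() replaces it as a whole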
tag.string ## 'Eyes on Me'
type(tag.string) ## <class 'bs4.element.NavigableString'>
## 1
ustring = str(tag.string) ## Python 2: unicode(tag.string)
type(ustring) ## <class 'str'>
## 2
tag.string.replace_with('alphabet') ## returns the replaced string: 'Eyes on Me'
tag ## <h1 class="boldest">alphabet</h1>
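BeautifulSoup
The BeautifulSoup object represents the entire document. It can be treated like a Tag object and supports most of the traversal and search methods.
It is not a real HTML or XML tag, however, so its .attrs is {} and its .name is the special value '[document]'.
soup.attrs ## {}
soup.name ## '[document]'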
Comment
A special string type (a subclass of NavigableString)
h2 = soup.h2
cmt = h2.string
cmt ## '<h3> Faye Won</h3>'
type(cmt) ## <class 'bs4.element.Comment'>
str(cmt) ## '<h3> Faye Won</h3>'
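Because a Comment is also a NavigableString, .string and string searches return comments alongside ordinary text. A minimal sketch of telling them apart with isinstance, reusing the markup from above (the variable names `comments` and `texts` are only for illustration):

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = '<h1 class="boldest">Eyes on Me</h1><h2><!--<h3> Faye Won</h3>--></h2>'
soup = BeautifulSoup(markup, "html.parser")

# Collect only the comment nodes
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Collect every string node, then drop the comments to keep visible text
texts = [s for s in soup.find_all(string=True) if not isinstance(s, Comment)]

print(comments)  # the single comment inside <h2>
print(texts)     # the text of <h1>
print(isinstance(comments[0], NavigableString))  # Comment subclasses NavigableString
```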
Traversing the DOM tree
Child nodes
- tag name (e.g. soup.head)
- .contents & .children
- .descendants
- .string
- .strings & .stripped_strings
Parent nodes
- .parent
- .parents
Sibling nodes
- .next_sibling & .next_siblings
- .previous_sibling & .previous_siblings
Going forward & back
- .next_element & .next_elements
- .previous_element & .previous_elements
Sample code
soup.head ## <head><title>The Dormouse's story</title></head>
soup.title ## <title>The Dormouse's story</title>
soup.body.b ## <b>The Dormouse's story</b>
soup.a ## <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
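The other navigation attributes listed above can be sketched the same way. The snippet below uses a trimmed, single-line variant of the "three sisters" markup (an assumption made so that whitespace-only strings do not clutter the results); it is an illustration, not the note's original example.

```python
from bs4 import BeautifulSoup, Tag

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
p = soup.p
first_a = soup.a

# Children: direct child nodes only; keep the Tag objects, skip the strings
child_ids = [c["id"] for c in p.children if isinstance(c, Tag)]

# Parents: walk upward until the BeautifulSoup object itself
ancestors = [a.name for a in first_a.parents]

# Siblings: the next node at the same level is the text ", "
next_node = first_a.next_sibling

# Elements: the next thing parsed after <a> is the string inside it
next_parsed = first_a.next_element

# Strings: all text below <p>, with surrounding whitespace stripped
words = list(p.stripped_strings)

print(child_ids, ancestors, repr(next_node), repr(next_parsed), words)
```

Note the difference between .next_sibling (same level in the tree) and .next_element (next item in parse order, which may be a child).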
Searching
CSS Selectors
Filters
- string
- regular expression
- list
- True
- filter function
find_all()
- name argument
- keyword arguments
- searching by CSS class
- string argument
- limit argument
- recursive argument
find()
Other
- find_parents()
- find_parent()
- find_next_siblings() & find_next_sibling()
- find_previous_siblings() & find_previous_sibling()
- find_all_next() & find_next()
- find_all_previous() & find_previous()
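The filters and find_all() arguments outlined above might be exercised as follows. This is a sketch against a shortened copy of the "three sisters" document; the variable names are illustrative, and select() assumes a bs4 version that bundles CSS selector support.

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Filters: string, regular expression, list, function
by_string = soup.find_all("a")
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
by_list = [t.name for t in soup.find_all(["title", "b"])]
by_func = soup.find_all(lambda t: t.has_attr("id") and t.has_attr("href"))

# find_all() arguments: keyword, CSS class, string, limit, recursive
by_id = soup.find_all(id="link2")
by_class = soup.find_all("a", class_="sister", limit=2)
by_text = soup.find_all(string="Elsie")
direct_titles = soup.html.find_all("title", recursive=False)  # <title> is not a direct child of <html>

# find() returns the first match (or None); find_parent() searches upward
first_link = soup.find("a")
parent_p = first_link.find_parent("p")

# CSS selectors
selected = soup.select("p.story > a#link3")

print(len(by_string), by_regex, by_list, len(by_func))
print(by_id[0]["href"], len(by_class), by_text, direct_titles)
print(first_link["id"], parent_p["class"], selected[0].string)
```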
Modifying the tree
- Changing a tag: name & attributes
- Changing .string
- append()
- NavigableString() & .new_tag()
- insert()
- insert_before() & insert_after()
- clear()
- extract()
- decompose()
- replace_with()
- wrap()
- unwrap()
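A short walk through several of the methods listed above, applied to one small fragment. The markup and the sequence of steps are assumptions chosen to keep the result easy to follow.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")
tag = soup.b

# Change the tag's name, add an attribute, and replace its text
tag.name = "strong"
tag["id"] = "headline"
tag.string = "Alice"

# new_tag() + append(): add a new <a> at the end of <p>
link = soup.new_tag("a", href="http://example.com/elsie")
link.string = "Elsie"
soup.p.append(link)

# insert_before(): put a text node in front of the new link
link.insert_before(NavigableString(" and "))

# wrap() / unwrap(): surround <strong> with <em>, then remove the wrapper again
tag.wrap(soup.new_tag("em"))
soup.em.unwrap()

# extract(): detach the link from the tree and keep it for later use
removed = soup.a.extract()
after_extract = str(soup)

# clear(): drop everything inside <p>
soup.p.clear()

print(after_extract)
print(removed)
print(soup)
```

extract() returns the detached node, whereas decompose() destroys it; prefer extract() when the node is needed afterwards.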
Output
- Encoding: soup.prettify('utf-8')
- Pretty-printing
- Non-pretty printing
- Output formatters
- get_text()
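The output options above can be compared side by side. A minimal sketch, assuming a one-paragraph fragment containing an escaped ampersand so that the effect of the formatter is visible:

```python
from bs4 import BeautifulSoup

markup = '<p class="title"><b>The Dormouse\'s story</b> &amp; more</p>'
soup = BeautifulSoup(markup, "html.parser")

# Pretty-printing: one node per line, indented; returns str
pretty = soup.prettify()

# Encoding: passing an encoding makes prettify() return bytes
pretty_bytes = soup.prettify("utf-8")

# Non-pretty printing: str() keeps the markup compact
compact = str(soup)

# Output formatters: the default escapes "&"; formatter=None leaves text as-is
unescaped = soup.decode(formatter=None)

# get_text(): only the human-readable text, with entities resolved
text = soup.get_text()

print(pretty)
print(compact)
print(unescaped)
print(text)
```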
Example code
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')