from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)
In the examples that follow, this parser is used consistently to demonstrate Beautiful Soup's usage.
4. Basic Usage
Let's first look at the basic usage of Beautiful Soup through an example:
html = """ <html><head><title>The Dormouse's story</title></head> <body> <pclass="title"name="dromouse"><b>The Dormouse's story</b></p> <pclass="story">Once upon a time there were three little sisters; and their names were <ahref="http://example.com/elsie"class="sister"id="link1"><!-- Elsie --></a>, <ahref="http://example.com/lacie"class="sister"id="link2">Lacie</a> and <ahref="http://example.com/tillie"class="sister"id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <pclass="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.prettify()) print(soup.title.string)
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
Here we first declare the variable html, which holds an HTML string. Note that it is not a complete HTML document: neither the body nor the html tag is closed. We then pass it as the first argument to BeautifulSoup, and give the parser type (lxml here) as the second argument. This completes the initialization of the BeautifulSoup object, which is assigned to the variable soup.
After that, we can call soup's methods and attributes to parse this HTML.
First, we call the prettify() method, which outputs the string being parsed in a standard, indented format. Note that the output contains the body and html nodes, which means Beautiful Soup can automatically fix a non-standard HTML string. This correction is not performed by prettify(); it already happens when the BeautifulSoup object is initialized.
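To see that the correction happens at initialization rather than inside prettify(), here is a minimal sketch; the fragment and the printed result are illustrative:

from bs4 import BeautifulSoup

# A deliberately incomplete fragment: no <html>/<body>, and the <p> tag is unclosed.
soup = BeautifulSoup("<p class='title'>The Dormouse's story", 'lxml')

# Even without calling prettify(), the tree has already been completed:
print(soup)
# -> <html><body><p class="title">The Dormouse's story</p></body></html>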
Then we call soup.title.string, which prints the text content of the title node in the HTML. That is, soup.title selects the title node, and calling its string attribute returns the text inside it. Extracting text by simply chaining a few attributes like this is very convenient.
html = """ <html><head><title>The Dormouse's story</title></head> <body> <pclass="title"name="dromouse"><b>The Dormouse's story</b></p> <pclass="story">Once upon a time there were three little sisters; and their names were <ahref="http://example.com/elsie"class="sister"id="link1"><!-- Elsie --></a>, <ahref="http://example.com/lacie"class="sister"id="link2">Lacie</a> and <ahref="http://example.com/tillie"class="sister"id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <pclass="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.title) print(type(soup.title)) print(soup.title.string) print(soup.head) print(soup.p)
The output is as follows:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Here we use the same HTML as before. First we print the result of selecting the title node, which is exactly the title node together with its text. Next we print its type: bs4.element.Tag, an important data structure in Beautiful Soup. Every result produced by this kind of selection is a Tag. A Tag has a number of attributes, such as string, which returns the node's text content, so the following line of output is exactly that text.
We then try selecting the head node; the result is again the node plus everything inside it. Finally we select the p node. This case is special: the result is only the first p node, and the later p nodes are not selected. In other words, when multiple nodes match, this kind of selection returns only the first match and ignores the rest.
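If you need every matching node rather than just the first, the find_all() method (used later in this chapter) returns them all. A minimal sketch reusing the html string defined above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# soup.p only returns the first <p>; find_all() returns every match as a list.
for p in soup.find_all('p'):
    print(p['class'])
# -> ['title'], then ['story'], then ['story']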
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <pclass="story"> Once upon a time there were three little sisters; and their names were <ahref="http://example.com/elsie"class="sister"id="link1"> <span>Elsie</span> </a> <ahref="http://example.com/lacie"class="sister"id="link2">Lacie</a> and <ahref="http://example.com/tillie"class="sister"id="link3">Tillie</a> and they lived at the bottom of a well. </p> <pclass="story">...</p> """
The output is as follows:
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
As you can see, the result is a list. The p node contains both text and child nodes, and all of them are returned together in this list.
Note that every element of the list is a direct child of the p node. For example, the first a node contains a span node, which is effectively a grandchild, yet the result does not pull the span node out separately. In other words, the contents attribute returns a list of direct children only.
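If the deeper descendants are needed as well, Beautiful Soup's descendants attribute walks the whole subtree. A minimal sketch on the same html, shown only to contrast it with contents:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# contents lists direct children only; descendants recurses into the subtree,
# so the <span> inside the first <a> also shows up here.
for child in soup.p.descendants:
    if getattr(child, 'name', None) == 'span':
        print(child)
# -> <span>Elsie</span>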
Similarly to contents, we can call the children attribute to get the corresponding result:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
The output is as follows:
<list_iterator object at 0x1064f7dd8>
0 
Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
and they lived at the bottom of a well.
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <pclass="story"> Once upon a time there were three little sisters; and their names were <ahref="http://example.com/elsie"class="sister"id="link1"> <span>Elsie</span> </a> </p> <pclass="story">...</p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.a.parent)
The output is as follows:
<pclass="story"> Once upon a time there were three little sisters; and their names were <aclass="sister"href="http://example.com/elsie"id="link1"> <span>Elsie</span> </a> </p>
Here we select the parent of the first a node. Its parent is clearly the p node, and the output is that p node together with its content.
Note that what is printed here is only the direct parent of the a node; it does not keep looking outward for the parent's ancestors. If you want all the ancestor nodes, you can call the parents attribute, as sketched below:
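The original example code for parents is not shown here, so the following is a minimal sketch reusing the html string defined just above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# parents is a generator yielding every ancestor, from the direct parent
# up to the whole document object.
print(type(soup.a.parents))          # <class 'generator'>
for i, parent in enumerate(soup.a.parents):
    print(i, parent.name)
# -> 0 p, 1 body, 2 html, 3 [document]

The result is a generator; iterating it yields the ancestors in order: the p node, then body, html, and finally the whole document. The next example turns to sibling nodes, using next_sibling, previous_sibling, next_siblings, and previous_siblings.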
html = """ <html> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> Hello <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the bottom of a well. </p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print('Next Sibling', soup.a.next_sibling) print('Prev Sibling', soup.a.previous_sibling) print('Next Siblings', list(enumerate(soup.a.next_siblings))) print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))
The output is as follows:
Next Sibling Hello
Prev Sibling Once upon a time there were three little sisters; and their names were
Next Siblings [(0, '\n Hello\n '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
Prev Siblings [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
html = """ <html> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> </p> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print('Next Sibling:') print(type(soup.a.next_sibling)) print(soup.a.next_sibling) print(soup.a.next_sibling.string) print('Parent:') print(type(soup.a.parents)) print(list(soup.a.parents)[0]) print(list(soup.a.parents)[0].attrs['class'])
The output is as follows:
Next Sibling:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parent:
<class 'generator'>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
The output is as follows:
[<li class="element">Foo</li>, <liclass="element">Bar</li>, <liclass="element">Jay</li>] Foo Bar Jay [<liclass="element">Foo</li>, <liclass="element">Bar</li>] Foo Bar
Here, simply passing id='list-1' queries the node whose id is list-1. For class, since class is a reserved keyword in Python, you need to append an underscore and write class_='element'; the result returned is still a list of Tag objects, as the sketch below illustrates.
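The HTML for this example is not shown above, so the following minimal sketch assumes a hypothetical list whose ul has id="list-1" and whose li items carry class="element":

from bs4 import BeautifulSoup

# Hypothetical markup, assumed only for illustration.
html = '''
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')

# Query by id; class needs the trailing underscore because "class" is a Python keyword.
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))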
(3) text
The text parameter can be used to match the text of nodes. It accepts either a string or a regular-expression object, for example:
import re
html='''
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
The output is as follows:
['Hello, this is a link', 'Hello, this is a link, too']
There are two a nodes here, each containing text. We pass the text parameter to find_all() as a regular-expression object, and the result is a list of all node texts that match the pattern.
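In newer Beautiful Soup releases (4.4.0 and later) the same argument is also exposed under the name string, with text kept as an alias; a small sketch under that assumption:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# string= behaves like text= here: both should return the matching node texts.
print(soup.find_all(string=re.compile('link')))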
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)
The output is as follows:
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
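In this example both calls return the same value because each li contains only text; the two diverge once a node has several children. A minimal sketch with hypothetical markup to show the difference:

from bs4 import BeautifulSoup

# Hypothetical node with mixed content, used only to contrast the two calls.
soup = BeautifulSoup('<p>Hello <b>World</b></p>', 'lxml')

print(soup.p.get_text())  # -> Hello World: concatenates all descendant text
print(soup.p.string)      # -> None: string is ambiguous when there are multiple children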