Web scraping: need to clean web scraped data using Python

I am trying to write code to scrape data from http://goldpricez.com/gold/history/lkr/years-3. The code I wrote is below. It runs and returns the data I was after.

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)

print(df)

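However, the result also contains some unwanted data; I only want the data in the table. Please help me solve this.

I have attached an image of the output, with the unwanted data circled in red.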

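Comments

Thanks, man. It works fine. One small question: what does df[3] do?
Using urllib.request just performs the request twice, because .read_html already does it :) so that step is not needed.
Explanation of my downvote: I rarely downvote, and I don't like downvoting without explaining what could be improved, so here it is. You added extra, unused code: from urllib.request import urlopen, Request; url = "http://goldpricez.com/gold/history/lkr/years-3"; req = Request(url=url); html = urlopen(req).read(). None of that is used. If all of it is removed, df[3] still works. ;) Hope you understand :)
@ThejithaAnjana df[3] prints the fourth DataFrame in the list of DataFrames.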

The way you are using .read_html, it returns a list of all the tables on the page. Your table is at index 3.

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)[3]

print(df)

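.read_html requests the URL and parses the response for you. By default it tries lxml first and falls back to BeautifulSoup with html5lib if that fails. You can change the parser, choose which tables to match, and pass a header as you would in .read_csv. See the .read_html documentation for more details.

For speed, you can request lxml explicitly, e.g. pd.read_html(url, flavor='lxml')[3]. The bs4/html5lib flavor is more lenient with broken markup but slower. Note that html.parser is a BeautifulSoup parser, not a .read_html flavor.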
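If you are not sure which index your table is at, it can help to list what .read_html found before picking one. Below is a minimal sketch using a small made-up HTML string in place of the real page (the table contents and the index 1 are assumptions for illustration only; on the real page the table was at index 3). It also shows the match parameter, which keeps only tables whose text matches a string or regex.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a page with several tables; the real page differs.
HTML = """
<table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table>
  <tr><th>Date</th><th>Price</th></tr>
  <tr><td>2019-01-01</td><td>100.5</td></tr>
  <tr><td>2019-01-02</td><td>101.0</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(HTML))
for i, table in enumerate(tables):
    # Print index, shape and column names to spot the table you want.
    print(i, table.shape, list(table.columns))

# Pick by position once you know the index ...
prices = tables[1]

# ... or keep only the tables whose text matches a string or regex.
matched = pd.read_html(StringIO(HTML), match="Price")
print(matched[0])
```

Selecting with match tends to be less fragile than a hard-coded index if the site adds or removes a table later, but either works.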

Use BeautifulSoup for this; the code below works perfectly.

import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")
# Collect every table cell on the page, then skip the first 11 cells
data = s.find_all("td")
data = data[11:]
# The remaining cells are printed two at a time, one pair per line
for i in range(0, len(data), 2):
    print(data[i].text.strip(), " ", data[i + 1].text.strip())

Another advantage of using BeautifulSoup is that it is faster than your code.
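One caveat with the loop above: data[i + 1] raises an IndexError if the number of cells left after slicing is odd. Here is a rough sketch of a more defensive way to pair the cells. It works on plain strings (such as the .text of each td); the helper name pair_cells and the sample values are made up for illustration.

```python
def pair_cells(cells):
    """Group a flat list of cell strings into 2-tuples, in order."""
    cleaned = [c.strip() for c in cells]
    # zip stops at the shorter slice, so a trailing unpaired cell
    # is dropped instead of raising IndexError.
    return list(zip(cleaned[0::2], cleaned[1::2]))


# Made-up sample values, including stray whitespace and one orphan cell.
sample = [" 2019-01-01 ", "100.5", "2019-01-02", " 101.0", "orphan"]
rows = pair_cells(sample)
for first, second in rows:
    print(first, " ", second)
```

With the scraped data you would call it as pair_cells([td.text for td in data]).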
