Question

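I'm using Pandas in Python and want to download a csv file from this website, but the download link contains some random characters, so I'd like to know how to automate it.

It is financial trading data that is updated every day. The csv file I want to download is the one in the red square in the top row. Every day a new row is added at the top, and I want to download that csv automatically.

My plan is to build the url string dynamically from today's date and import the csv into pandas automatically. An example url: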

https://www.jpx.co.jp/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/20200731_volume_by_participant_whole_day.csv

This is my Python script.

from datetime import datetime as dt
day = dt.today()
date = str(day.year) + '{:02d}'.format(day.month) + '{:02d}'.format(day.day)
url = 'https://www.jpx.co.jp/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/%s_volume_by_participant_whole_day_J-NET.csv' %date
# Followed by pandas...

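The problem is that one part of the url (nlsgeu000004vgi7-att) is always a random sequence of characters, so I can't build it dynamically. For example, for 7/30 this part is nlsgeu000004vd5b-att. At least, I don't know the rule that generates this string.

Is there any way to point to this kind of partially random url correctly? I've thought of a few workarounds but don't know how to actually implement them. It would be great if you could help! Any approach is fine as long as I can download the csv automatically.

Use regular expressions
Use a scraping tool such as BeautifulSoup to get the url of whichever csv is in the top row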
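
The first workaround could look roughly like the sketch below. It assumes the page's html has already been downloaded into a string and that the link appears as an href whose file name is YYYYMMDD_volume_by_participant_whole_day.csv. The helper name first_csv_url and the sample html are made up for illustration; the real page markup may differ.

```python
import re
from urllib.parse import urljoin

# Matches an href whose file name is <YYYYMMDD>_volume_by_participant_whole_day.csv.
# The directory part (e.g. nlsgeu000004vd5b-att) is deliberately left unconstrained.
CSV_HREF = re.compile(
    r'href="([^"]*/\d{8}_volume_by_participant_whole_day\.csv)"'
)


def first_csv_url(html, page_url):
    """Return the absolute url of the first matching csv link, or None."""
    match = CSV_HREF.search(html)
    if match is None:
        return None
    # Resolve a relative href against the url of the page it was found on.
    return urljoin(page_url, match.group(1))


# Hypothetical snippet standing in for the real archive page.
sample_html = """
<td>
  <a href="/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/20200731_volume_by_participant_whole_day.pdf">PDF</a>
  <a href="/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/20200731_volume_by_participant_whole_day.csv">CSV</a>
</td>
"""
page_url = "https://www.jpx.co.jp/markets/derivatives/participant-volume/archives-01.html"

print(first_csv_url(sample_html, page_url))
```

Because the pattern only pins down the file name, it keeps working when the random directory name changes, but it will silently return None if the site renames the file.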

Answer 1

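I would scrape the website, as you suggest. It looks easy to do (as long as the elements are not generated dynamically with JavaScript), and it avoids the problems a regex could cause later if your assumption about the url pattern turns out to be wrong:

Fetch the page's html with a GET request (using requests)
Extract the url you need with BeautifulSoup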

Answer 2

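Yes, if you don't know how the url is generated, you need to scrape the page to find it. Here is a quick example that uses BeautifulSoup with a regular-expression filter to find the first link on the page whose URL contains volume_by_participant_whole_day.csv: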

import re
import requests
from bs4 import BeautifulSoup

base_url = "https://www.jpx.co.jp"
data = requests.get(f"{base_url}/markets/derivatives/participant-volume/archives-01.html")
parsed = BeautifulSoup(data.text, "html.parser")
# Escape the dot so it matches a literal "." rather than any character
link = parsed.find("a", href=re.compile(r"volume_by_participant_whole_day\.csv"))
path = link["href"]
print(f"{base_url}{path}")
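
Since the goal is to pick up the file for a particular day, it may be worth checking the date embedded in the scraped path before loading it. This is a minimal sketch, assuming the file name keeps starting with YYYYMMDD; csv_url_for_day is a hypothetical helper, not part of any library, and the path below is the example from the question.

```python
import posixpath
from datetime import date
from urllib.parse import urljoin


def csv_url_for_day(base_url, path, day):
    """Return the absolute csv url if its file name starts with `day`
    formatted as YYYYMMDD, otherwise None (for example when the page
    has not been updated yet)."""
    file_name = posixpath.basename(path)
    if not file_name.startswith(day.strftime("%Y%m%d")):
        return None
    return urljoin(base_url, path)


base_url = "https://www.jpx.co.jp"
path = (
    "/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/"
    "20200731_volume_by_participant_whole_day.csv"
)

print(csv_url_for_day(base_url, path, date(2020, 7, 31)))  # the full url
print(csv_url_for_day(base_url, path, date(2020, 8, 3)))   # not that day's file
```

The returned url can be passed to pandas.read_csv, which accepts urls directly. The file's encoding is not stated in this thread, so an explicit encoding argument may be needed.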

Answer 3

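I wrote some code that gets the link to that particular Excel file directly. I didn't use any regular expressions; my answer relies on the position of the element, so you can get the link just by running it.

Before running the code, make sure you have the requests and BeautifulSoup modules.

If you don't, here are the installation instructions: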

# for requests module
pip install requests

# for beautifulsoup module
pip install beautifulsoup4

BeautifulSoup script

# Imports
import requests
from bs4 import BeautifulSoup as bs

# Requesting and extracting html code
html_source = requests.get('https://www.jpx.co.jp/markets/derivatives/participant-volume/archives-01.html').text

# converting html to bs4 object
soup = bs(html_source, 'html.parser')

# finding all the table rows
trs = soup.find_all('tr')

# selecting 3rd row
x = [i for i in trs[2]]

# selecting 4th cell and then 2nd item(1st item is the pdf one)
y = [i for i in x[7]][2]

excel_file_link = y.get('href')

print(excel_file_link)
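
Iterating over a tr tag directly yields its child nodes, which with html.parser include the whitespace text between cells; that is presumably why the index above is 7 for the 4th cell. A variant that counts only real rows, cells, and links might look like the sketch below. The helper nth_link_in_cell and the sample table (including its column names) are invented for the example, and the real page layout may differ.

```python
from bs4 import BeautifulSoup


def nth_link_in_cell(html, row_index, cell_index, link_index):
    """Return the href of a link chosen by row, cell, and link position,
    counting only <tr>, <td>/<th>, and <a> tags (no whitespace nodes)."""
    soup = BeautifulSoup(html, "html.parser")
    row = soup.find_all("tr")[row_index]
    # recursive=False keeps only the cells that belong to this row itself.
    cell = row.find_all(["td", "th"], recursive=False)[cell_index]
    link = cell.find_all("a")[link_index]
    return link.get("href")


# Hypothetical table: two header rows, then one data row with PDF/CSV pairs.
sample_html = """
<table>
<tr><th>Date</th><th>Night</th><th>Day</th><th>Whole day</th></tr>
<tr><th></th><th>PDF/CSV</th><th>PDF/CSV</th><th>PDF/CSV</th></tr>
<tr>
<td>2020/07/31</td>
<td><a href="/n.pdf">PDF</a> <a href="/n.csv">CSV</a></td>
<td><a href="/d.pdf">PDF</a> <a href="/d.csv">CSV</a></td>
<td><a href="/w.pdf">PDF</a> <a href="/w.csv">CSV</a></td>
</tr>
</table>
"""

# 3rd row, 4th cell, 2nd link (the 1st link is the pdf)
print(nth_link_in_cell(sample_html, 2, 3, 1))
```

Like the original script, this still depends on the table keeping its current shape. The href it returns may be relative to the site root, in which case it needs to be joined with https://www.jpx.co.jp before downloading.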

Automatically download a csv file from a random url with Python
