如何使用Python爬取百度网盘的目录

1. 缘起

上周自己在整理网盘资料的时候,发现资料太多,整理起来很麻烦,问我能不能帮他写个程序把网盘里的文件目录全部导出来存为Excel文件,这样就能够知道网盘里的全部目录结构了。

怎么能够放弃在自己面前显摆的机会能?故便欣然答应了。在网上百度搜索了一番,在简书中发现了一篇现成的代码分享Python | 用蟒蛇代码导出百度网盘文件目录,便花了几分钟照着该简书分享写了一个小程序,打包为exe可执行文件,给男票发了过去。原本以为他会很满意,正等着他来夸我来呢,结果,他竟然当起了甲方,挑起了刺,说:程序是导出了所有的文件目录,但要是能够指定导出文件的级别就好了(比如说我只想导出二级目录,程序只导出二级目录……)。虽然自己气到爆炸,但是作为一枚有修养的程序媛,还是乖乖的去改需求了。

如何使用Python爬取百度网盘的目录

2. 经过

参考了Python | 用蟒蛇代码导出百度网盘文件目录的代码,我第一次发现原来百度云盘在用户端本地有文件缓存的数据库文件,而且使用的数据库是sqlite3,该程序的思路是:给出文件缓存的数据库文件,然后读取出数据库文件中的内容,再按照文件夹目录的形式写到csv文件中。

代码:

# coding=utf-8
import requests
import json

url = “https://pan.baidu.com/mbox/msg/shareinfo”

querystring = {“msg_id”: “xxx”, “from_uk”: “xxx”, “gid”: “xxx”, “type”: “2”}

headers = {
‘Cookie’: “BAIDUID=6B16885AA2D577DB751D42E49878E3FA:FG=1; PSTM=1564562405; PANWEB=1; BDUSS=khEeHVub35WVlNqYUhQWU05LTBxNnUxT1RxNlhzdn43SUNVemw0SFVFY2FoMnBkSVFBQUFBJCQAAAAAAAAAAAEAAADqlksLycvUtNauveG-pwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABr6Ql0a-kJdN; BIDUPSID=08AE076E56556C82FC02FF9F296D004B; MCITY=-%3A; SCRC=2ad541dafe7b6eadfc5b483c100e64d0; STOKEN=8c49760e8a96bb75f6144d7c16299f809ce1e7b696b9405c079a2fe7f451e4bc; BDCLND=WebXlbv2GmDdNd4HJBi5uGnAh2SC85J41mwdU71dySk%3D; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_PS_PSSID=1429_21081_30210_20698; delPer=0; PSINO=5; Hm_lvt_7a3960b6f067eb0085b7f96ff5e660b0=1575602779,1575618653,1576054596; Hm_lpvt_7a3960b6f067eb0085b7f96ff5e660b0=1576055103; PANPSC=3962494995828399816%3ACU2JWesajwC140uwiiBLpEYemHfSgA%2FTNHUgiDktYo%2Fox7p0KalvirmhPgkngqQ%2BDNhpRjIuDWkr2wqYbSNJzPBl8XefSTKb8yY2nnCH9ClE2OlLZIpvWpwI2lYmdZVOhH24qItkHZ51Gv5h8iKvz5C9qgHiau%2FldS8c7ndIzLExGZiuUUeCaoX0p6z9lD8V7g%2B5PM2vWus%3D,BAIDUID=6B16885AA2D577DB751D42E49878E3FA:FG=1; PSTM=1564562405; PANWEB=1; BDUSS=khEeHVub35WVlNqYUhQWU05LTBxNnUxT1RxNlhzdn43SUNVemw0SFVFY2FoMnBkSVFBQUFBJCQAAAAAAAAAAAEAAADqlksLycvUtNauveG-pwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABr6Ql0a-kJdN; BIDUPSID=08AE076E56556C82FC02FF9F296D004B; MCITY=-%3A; SCRC=2ad541dafe7b6eadfc5b483c100e64d0; STOKEN=8c49760e8a96bb75f6144d7c16299f809ce1e7b696b9405c079a2fe7f451e4bc; BDCLND=WebXlbv2GmDdNd4HJBi5uGnAh2SC85J41mwdU71dySk%3D; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_PS_PSSID=1429_21081_30210_20698; delPer=0; PSINO=5; Hm_lvt_7a3960b6f067eb0085b7f96ff5e660b0=1575602779,1575618653,1576054596; Hm_lpvt_7a3960b6f067eb0085b7f96ff5e660b0=1576055103; PANPSC=3962494995828399816%3ACU2JWesajwC140uwiiBLpEYemHfSgA%2FTNHUgiDktYo%2Fox7p0KalvirmhPgkngqQ%2BDNhpRjIuDWkr2wqYbSNJzPBl8XefSTKb8yY2nnCH9ClE2OlLZIpvWpwI2lYmdZVOhH24qItkHZ51Gv5h8iKvz5C9qgHiau%2FldS8c7ndIzLExGZiuUUeCaoX0p6z9lD8V7g%2B5PM2vWus%3D; BAIDUID=C492A93A968632C45911877CE92DC5F3:FG=1”,
‘User-Agent’: “PostmanRuntime/7.20.1”,
‘Accept’: “*/*”,
‘Cache-Control’: “no-cache”,
‘Postman-Token’: “4c05adac-5857-4347-a76a-f34aa9b52edb,e1e57913-6b61-41d6-b45a-1a9e0dee8ab7”,
‘Host’: “pan.baidu.com”,
‘Accept-Encoding’: “gzip, deflate”,
‘Connection’: “keep-alive”,
‘cache-control’: “no-cache”
}

def backtrace(fs_id, out_file):
if len(str(fs_id)) > 1:
querystring[‘fs_id’] = fs_id
try:
response = requests.request(“GET”, url, headers=headers, params=querystring)
# response.content.decode(‘unicode_escape’)
result = response.json()
print result
except:
print “error”
return

print result[‘errno’]
records = result[‘records’]
for i in range(0, len(records)):
record = records[i]

if record[‘isdir’] == ‘1’ or record[‘isdir’] == 1:
backtrace(record[‘fs_id’], out_file)
else:
line = record[‘server_filename’] + “—-” + record[‘path’] + ‘\n’
print line
out_file.write(line.encode(‘utf-8’))

out_file = open(‘foo.txt’, ‘w’)
backtrace(‘xxx’, out_file)
out_file.close()

0 条回复 A 作者 M 管理员
    所有的伟大,都源于一个勇敢的开始!
欢迎您,新朋友,感谢参与互动!欢迎您 {{author}},您在本站有{{commentsCount}}条评论