cv-cat / spider_xhs

883.0 883.0 170.0 2.41 MB

Xiaohongshu (小红书) crawler: scrapes Xiaohongshu notes, profile pages, and search results

Python 100.00%

spider_xhs's Introduction

Hi there 👋 this is CVcat

🌱 I’m currently learning at hhu

⛳ My blog url: CVcat Home

📫 Chat with me: CVZC15751076989(wx) | 992822653(qq)

🎯 Looking for an internship

spider_xhs's People

Contributors

Stargazers

Watchers

Forkers

kekewind codingpeppa abc1319 terrylijiayang doyuan sigmayang yhs0418 erwin11 bytebuff if-always lamchi-joo shuaibibobo pikaso84010 a568972484 cjp-chu guapier esword618 1142311598 jiangwu10057 libchaos leadtodream switch-08 lxbxinwei peterlam193 xuzhushen tmsdy henrylaobai archger lostjay jihgao liuandhisgithub zguangyi jeanmoumou xuelainiao wickscc coder-shaovlee hanpitill tashengjinsheng lhtest429 zsansan886 yemanzhongting pakandalive lbatsoft zhiweicoding wangfushu yangjinhui binky1017 ldzspace fenmome hildam cxapython liangdabiao ibcjacky antigenius-lb yangjiaxue2022510 yisan57 heriec profeel us579 jijinqianggithub wangzhiyuanawe zch513430014 nanayano itisyou somnusred pztpzsapp hele2069 ethan561005 haorenshiwo aytenaker hqishen clyde98980 aaalexmak gzamon 1543889217 ncndyw45066348 jargewu fluchw pzhihao gwen-z aoungy thinkerhu david19970306 orangecat-nan roezy bcl200n hellocym yik7707 939517557 lookusluo ffffffpy kewsg robotskly lllkkkai shihuricha qustyli amazingplace jisongyang wangshuniguang bugstone2023

spider_xhs's Issues

Bug in Spider_XHS/xhs_utils/xhs_util.py

File "/home/xxxx/tmp/Spider_XHS/xhs_utils/xhs_util.py", line 74, in handle_profile_info
info = re.findall(r'<script>window.INITIAL_STATE=(.*?)</script>', html_text)[0]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
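The IndexError above is raised because `re.findall` returned an empty list and `[0]` was applied anyway. A minimal sketch of a guarded lookup follows, assuming the pattern from the traceback is the one in use; the helper name `extract_initial_state` is hypothetical and not part of the project.

```python
import re

# Pattern copied from the traceback above; the real page markup may differ.
STATE_PATTERN = re.compile(r'<script>window.INITIAL_STATE=(.*?)</script>', re.S)


def extract_initial_state(html_text):
    """Return the embedded state string, or None when the script tag is absent."""
    match = STATE_PATTERN.search(html_text)
    if match is None:
        # The page did not contain the expected script tag (for example a
        # login or verification page), so report that instead of crashing.
        return None
    return match.group(1)
```

A caller can then check for `None` and print the first few hundred characters of `html_text` to see what the server actually returned.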

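The feed endpoint seems to have stopped working

The /api/sns/web/v1/search/notes endpoint still returns the note list normally, but the /api/sns/web/v1/feed endpoint suddenly started failing today; it had been fine until now. The error response:
status: 461 status code 461, text: {"code":0,"success":true,"msg":"成功","data":{}}

It keeps returning a 461 response, and switching proxies or accounts does not seem to help. Could Xiaohongshu have added some new protection to this endpoint? I am not sure whether it still works for anyone else.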
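The response quoted above is easy to misread, because the body says success while the HTTP status is 461 and `data` is empty. The sketch below separates those cases; the function name `feed_result` and its three labels are hypothetical, and what status 461 actually means on the server side is not known here.

```python
import json


def feed_result(status_code, text):
    """Classify a feed response as 'http_error', 'empty', or 'ok'."""
    if status_code != 200:
        # Covers the 461 reported above, whatever the body claims.
        return "http_error"
    try:
        payload = json.loads(text)
    except ValueError:
        return "http_error"
    if not isinstance(payload, dict) or not payload.get("data"):
        # The body parsed, but carries no note data.
        return "empty"
    return "ok"
```

Logging this label alongside the note id makes it clearer whether a failure comes from the HTTP layer or from an empty payload.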

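User query fails even with a correct configuration

It scraped content successfully about two days ago, but today the user query fails. The cookie and the user profile URL are fine. I tried tracing the error by adding print statements to the source, but could not find the cause.
Attached are the modified code snippet and a terminal screenshot; nothing else was changed.
I have already tried re-cloning the project and running the example, but the user query still fails. My guess is that Xiaohongshu changed its JSON content, which breaks the list handling?

The modification is in the loop body of the save_all_note_info function in home.py.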

one-note.py keeps raising a syntax error

Traceback (most recent call last):
File "d:\develop\pythonWorkspace\Spider_XHS\one-note.py", line 193, in <module>
main()
File "d:\develop\pythonWorkspace\Spider_XHS\one-note.py", line 188, in main
handle_note(note_id_)
File "d:\develop\pythonWorkspace\Spider_XHS\one-note.py", line 115, in handle_note
ret = js.call('get_xs', '/api/sns/web/v1/feed', note_id, cookies['a1'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\admin\AppData\Roaming\Python\Python311\site-packages\execjs_abstract_runtime_context.py", line 37, in call
return self._call(name, *args)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\admin\AppData\Roaming\Python\Python311\site-packages\execjs_external_runtime.py", line 92, in _call
return self.eval("{identifier}.apply(this, {args})".format(identifier=identifier, args=args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\admin\AppData\Roaming\Python\Python311\site-packages\execjs_external_runtime.py", line 78, in eval
return self.exec(code)
^^^^^^^^^^^^^^^^
File "C:\Users\admin\AppData\Roaming\Python\Python311\site-packages\execjs_abstract_runtime_context.py", line 18, in exec
return self.exec(source)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\admin\AppData\Roaming\Python\Python311\site-packages\execjs_external_runtime.py", line 88, in exec
return self._extract_result(output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\admin\AppData\Roaming\Python\Python311\site-packages\execjs_external_runtime.py", line 167, in _extract_result
raise ProgramError(value)
execjs._exceptions.ProgramError: SyntaxError: 语法错误

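Can search.py be configured to save user avatars?

The downloaded images should be in webp format, right?

The code had been running reliably, but today nothing can be viewed and saving fails

Please help.

Is there an option to skip downloading images and videos?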

Note: after scraping 6 accounts, the spider's account was banned

I put 6 target accounts in user_list; after scraping about 3,000 notes, the account was banned.
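One common way to lower the request rate in a case like the one above is to pause between items. The sketch below is only an illustration: `crawl_with_delay`, its parameters, and the default timings are hypothetical, and whether any particular delay avoids a ban is not known.

```python
import random
import time


def crawl_with_delay(items, handler, base=2.0, jitter=1.0,
                     sleep=time.sleep, rng=random.random):
    """Call handler(item) for each item, pausing between consecutive items.

    Each pause lasts base + jitter * rng() seconds. `sleep` and `rng` are
    injectable so the pacing can be checked without actually waiting.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(base + jitter * rng())
        results.append(handler(item))
    return results
```

For example, `crawl_with_delay(user_list, fetch_user)` would call a hypothetical `fetch_user` once per entry with a 2 to 3 second gap between calls.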

The search endpoint is limited to 11 pages; is there a way around this?

No longer works

Every download fails

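"缺少nodejs环境, 请先安装nodejs, 再运行npm i jsdom" (Node.js environment missing; install Node.js first, then run npm i jsdom), SyntaxError: Unexpected token '||='

Node.js and jsdom are in fact both installed. Has XHS updated the x-s algorithm again?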
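The `||=` operator in the error above is only parsed by Node.js 15 and later, so one possibility worth ruling out is that an older `node` is the one found first on PATH. The sketch below checks that; the helper names are hypothetical and the project itself may locate Node.js differently.

```python
import re
import shutil
import subprocess


def parse_node_version(text):
    """Turn a string such as 'v18.17.0' into a tuple of ints, e.g. (18, 17, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def supports_logical_assignment(version_text):
    """True when the given Node.js version can parse the `||=` operator."""
    return parse_node_version(version_text)[0] >= 15


def node_on_path_version():
    """Return the output of `node --version` for the node on PATH, or None."""
    node = shutil.which("node")
    if node is None:
        return None
    result = subprocess.run([node, "--version"], capture_output=True, text=True)
    return result.stdout.strip()
```

Running `node_on_path_version()` from the same environment that launches the spider shows which Node.js version that environment actually resolves.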

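Is there a feature for sending private messages?

gui.exe won't open

I keep getting "note.. not allowed to view"

Has the site been updated? Mine always says "note.. not allowed to view". I checked, and the request to https://edith.xiaohongshu.com/api/sns/web/v1/feed succeeds, but data is always empty. The note has not been deleted and can be viewed on the web page.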

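Image quality issue

The image I save manually from the web version is considerably larger in file size than the one the spider downloads.
The quality differs; can it be made to match what the web version saves manually?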

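About anti-scraping

Hello,
1. Can a single keyword search on Xiaohongshu return at most a little over two hundred notes?
2. Is there any way to get around a keyword search returning so few notes?
3. Would using multiple accounts or IP proxies help?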

Encountering "缺少nodejs环境" Error When Using Source Code

Description: When running *.py files using the source code, I consistently encounter the "缺少nodejs环境" error. I have provided the necessary cookie.txt, and the exe file works fine (though the terminal window keeps appearing continuously).
Packages installed before running:

pip install xhs-spider -i https://pypi.org/simple
npm i jsdom
Current node version: nodejs v18.17
Operating System: Windows 11

After a video is fetched, the download is reported as complete without verifying its size
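A size check of the kind this issue asks for can be sketched as follows, assuming the expected byte count is taken from the response's Content-Length header; the helper name `download_is_complete` is hypothetical.

```python
import os


def download_is_complete(path, expected_size):
    """True when the file exists and its size equals the expected byte count."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_size
```

The "download finished" message would then be printed only when this returns True, and a retry or a warning issued otherwise.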

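Could a comment download feature be added?

As the title says.

After downloading and running with a valid cookie, search finds nothing; printing the response shows that it reports success but carries no data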

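Saved images are corrupted

Xiaohongshu's original images are in webp format. Saved directly as jpg or png, they cannot be displayed on Windows; double-clicking shows a corrupted-bitmap error. Saving them as webp works fine.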
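Rather than fixing the extension in advance, the format can be read from the first bytes of the downloaded data. The sketch below recognises JPEG, PNG, and WebP by their standard file signatures; the function name `sniff_image_extension` is hypothetical.

```python
def sniff_image_extension(data):
    """Guess a file extension from the leading bytes of image data."""
    if data[:3] == b"\xff\xd8\xff":
        return ".jpg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return ".png"
    # WebP is a RIFF container: 'RIFF', a 4-byte size, then 'WEBP'.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    return None
```

Writing the file with the returned extension keeps the name consistent with the actual content, which is what the Windows viewer in this report appears to trip over.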

search raises an error

Traceback (most recent call last):
File "/Users/tinghuang/Spider_XHS/search.py", line 95, in <module>
search.main(info)
File "/Users/tinghuang/Spider_XHS/search.py", line 79, in main
self.handle_note_info(query, number, sort, need_cover=True)
File "/Users/tinghuang/Spider_XHS/search.py", line 55, in handle_note_info
ret = js.call('get_xs', api, data, self.cookies['a1'])
TypeError: 'NoneType' object is not subscriptable
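The TypeError above means `self.cookies` was None when `['a1']` was applied to it, so the cookie was never loaded or parsed. How the project loads cookies is not shown here; the sketch below, with hypothetical helper names, parses a raw cookie string and fails early with a readable message when `a1` is missing.

```python
def parse_cookie_string(raw):
    """Parse 'name=value; name2=value2' into a dict."""
    cookies = {}
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def require_a1(cookies):
    """Return the 'a1' cookie value, or raise a clear error if it is absent."""
    if not cookies or "a1" not in cookies:
        raise ValueError("cookie is missing or has no 'a1' field")
    return cookies["a1"]
```

Calling `require_a1(self.cookies)` before the signing step would replace the opaque TypeError with a message that points at the cookie configuration.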

Error when downloading videos; has anyone run into this?

This is the code in question:
elif type == 'video':
    print(f"{name} download started, {media_url}")
    start_time = time.time()
    res = requests.get(media_url, stream=True)
    size = 0
    chunk_size = 1024 * 1024
    content_size = int(res.headers["content-length"])
    with open(path + '/' + name + '.mp4', mode="wb") as f:
        for data in res.iter_content(chunk_size=chunk_size):
            f.write(data)
            size += len(data)  # bytes downloaded so far
            # fraction completed
            percentage = size / content_size
            # print the progress bar
            print('\rVideo: %.2fMB\t' % (content_size / 1024 / 1024),
                  'Progress: [%-50s%.2f%%] elapsed: %.1fs' % ('>' * int(50 * percentage), percentage * 100, time.time() - start_time),
                  end='')
    print(f"{name} download finished")

This error occurs when downloading certain videos. I searched online and did not find a solution.

urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(10485760 bytes read, 280808 more expected)', IncompleteRead(10485760 bytes read, 280808 more expected))
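The IncompleteRead above means the connection closed before all bytes announced by Content-Length had arrived. One possible approach is to retry from the number of bytes already written. The sketch below keeps the network part abstract: `open_stream(offset)` is a hypothetical callable that returns an iterable of byte chunks starting at `offset`. With `requests` it could be built on a `Range: bytes=<offset>-` header, assuming the server honours range requests, which is not confirmed here.

```python
import os


def download_with_resume(open_stream, dest, expected_size, max_attempts=5):
    """Write a stream to dest, resuming after broken connections.

    Returns True when the file ends up with exactly expected_size bytes.
    """
    open(dest, "wb").close()  # start from an empty file
    for _ in range(max_attempts):
        offset = os.path.getsize(dest)
        if offset >= expected_size:
            break
        try:
            with open(dest, "ab") as f:
                for chunk in open_stream(offset):
                    f.write(chunk)
        except OSError:
            # Connection errors (including those raised by requests, which
            # derive from IOError/OSError) lead to another attempt that
            # continues from the bytes already on disk.
            continue
    return os.path.getsize(dest) == expected_size
```

The return value doubles as the size check, so a "download finished" message can be tied to it rather than to the loop simply ending.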