issues_bug_574: long Weibo posts could not be matched and fetched, attempted fix #575

Merged
dataabc merged 2 commits into dataabc:master from myshero:issues_bug_574 on Apr 27, 2024

Conversation

myshero commented Apr 27, 2024

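Fixed: long Weibo posts that need to be "expanded" (展开) were not fetched correctly.

Improved: line breaks inside a long Weibo post are now preserved. The <br> tags in the long post's text are replaced with \n.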

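# Imports this method relies on; handle_html, handle_garbled and logger
# are the project's own helpers.
import random
import re
from time import sleep

from lxml import etree
from lxml.html import fromstring


def get_long_weibo(self):
    """Fetch the full text of a long original Weibo post."""
    try:
        for i in range(5):  # retry up to 5 times
            self.selector = handle_html(self.cookie, self.url)
            if self.selector is not None:
                info_div = self.selector.xpath("//div[@class='c' and @id='M_']")[0]
                info_span = info_div.xpath("//span[@class='ctt']")[0]
                # 1. Serialize all of the HTML inside info_span to a string
                html_string = etree.tostring(info_span, encoding='unicode', method='html')
                # 2. Replace <br> with \n
                html_string = html_string.replace('<br>', '\n')
                # 3. Strip every HTML tag but keep the text inside the tags
                new_content = fromstring(html_string).text_content()
                # 4. Collapse runs of consecutive \n into a single \n
                new_content = re.sub(r'\n+', '\n', new_content)
                weibo_content = handle_garbled(new_content)
                if weibo_content is not None:
                    return weibo_content
            # wait 6 to 10 seconds before the next attempt
            sleep(random.randint(6, 10))
    except Exception:
        logger.exception(u'网络出错')  # log message: "network error"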
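
The four steps in the comments above (serialize, turn <br> into \n, strip tags, collapse repeated newlines) can be tried in isolation. The following is only a minimal sketch of the same idea, not the code merged in this PR: it swaps lxml for the standard library's html.parser so that it runs without extra dependencies, and the names _TextWithBreaks and html_to_text are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextWithBreaks(HTMLParser):
    """Collect text content, emitting a newline for every <br> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Plain <br> form.
        if tag == "br":
            self.parts.append("\n")

    def handle_startendtag(self, tag, attrs):
        # Self-closing <br/> form.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text between tags is kept; the tags themselves are dropped.
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with <br> preserved as newlines."""
    parser = _TextWithBreaks()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    text = "".join(parser.parts)
    # Collapse runs of consecutive newlines into one, as in step 4.
    return re.sub(r"\n+", "\n", text)


if __name__ == "__main__":
    sample = '<span class="ctt">line one<br>line two<br/><br/><a href="#">link</a> tail</span>'
    print(html_to_text(sample))
```

Replacing the tags before stripping markup is what keeps the line structure: a plain text extraction would drop <br> entirely and glue the lines together.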

Example result (a value of "无" means "none"):

        {
            "id": "Obuk4oIaU",
            "user_id": "",
            "content": ":2024年04月26日,星期五\n今天证实了我们所说,情绪仍处于上升期,即使也要多看多了解。",
            "article_url": "",
            "original_pictures":"无",
            "retweet_pictures": null,
            "original": true,
            "video_url": "无",
            "publish_place": "无",
            "publish_time": "2024-04-26 12:11",
            "publish_tool": "微博网页版",
            "up_num": 3,
            "retweet_num": 0,
            "comment_num": 0
        }

dataabc merged commit d7de931 into dataabc:master on Apr 27, 2024
dataabc (Owner) commented Apr 27, 2024

Thanks for contributing code. This is a very nice improvement that makes long Weibo posts cleaner. Merged.

myshero deleted the issues_bug_574 branch April 29, 2024 10:56
songzy12 added a commit to songzy12/weiboSpider that referenced this pull request Sep 20, 2025
Reason: after dataabc#575, there is
actually no '原文转发' in the result of get_long_weibo, so
wb_content.rfind(u'原文转发') will always return -1. Now we can just use
the same function, get_long_weibo, for both original Weibo posts and retweets.
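
Why a constant -1 from rfind matters can be shown with a small sketch. It assumes, hypothetically, that the returned index is used as a slice bound to cut the text off at the marker; whether the project does exactly this is not stated here, and both helper names are made up for illustration.

```python
MARKER = "原文转发"  # the marker string named in the commit message


def cut_before_marker_naive(text, marker=MARKER):
    """Slice at the last marker position without checking for -1."""
    # If the marker is absent, rfind returns -1 and text[:-1] silently
    # drops the final character instead of leaving the text alone.
    return text[:text.rfind(marker)]


def cut_before_marker_safe(text, marker=MARKER):
    """Slice at the last marker position, leaving text intact if absent."""
    idx = text.rfind(marker)
    return text if idx == -1 else text[:idx]


if __name__ == "__main__":
    body = "hello world"
    print(body.rfind(MARKER))
    print(repr(cut_before_marker_naive(body)))
    print(repr(cut_before_marker_safe(body)))
```

When the marker can never occur, the marker-based branch does nothing useful, which is consistent with the commit's choice to use one function for both cases.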