issues_bug_574: long Weibo posts could not be matched and retrieved, attempted fix by myshero · Pull Request #575 · dataabc/weiboSpider

myshero · 2024-04-27T10:10:10Z

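Fixed: long Weibo posts that require clicking "展开" ("expand") were not retrieved correctly.

Improved: line breaks in a long Weibo post are now preserved. The `<br>` tags in the post text are replaced with \n.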

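import random
import re
from time import sleep

from lxml import etree
from lxml.html import fromstring

# handle_html, handle_garbled and logger come from the surrounding weiboSpider module.

def get_long_weibo(self):
        """Get the full text of a long original Weibo post."""
        try:
            # Retry up to 5 times, pausing between attempts
            for i in range(5):
                self.selector = handle_html(self.cookie, self.url)
                if self.selector is not None:
                    info_div = self.selector.xpath("//div[@class='c' and @id='M_']")[0]
                    info_span = info_div.xpath("//span[@class='ctt']")[0]
                    # 1. Serialize all HTML inside info_span to a string
                    html_string = etree.tostring(info_span, encoding='unicode', method='html')
                    # 2. Replace <br> with \n
                    html_string = html_string.replace('<br>', '\n')
                    # 3. Strip all HTML tags but keep the text inside them
                    new_content = fromstring(html_string).text_content()
                    # 4. Collapse runs of consecutive \n into a single \n
                    new_content = re.sub(r'\n+', '\n', new_content)
                    weibo_content = handle_garbled(new_content)
                    if weibo_content is not None:
                        return weibo_content
                sleep(random.randint(6, 10))
        except Exception:
            # The log message means "network error"
            logger.exception(u'网络出错')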
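
The four numbered steps in the snippet above can be tried in isolation. The following is a minimal sketch, not the project's code: it swaps lxml for the standard library's `html.parser` so it runs without extra dependencies, and the names `BrToNewlineExtractor`, `html_to_text` and the sample markup are made up for illustration.

```python
import re
from html.parser import HTMLParser


class BrToNewlineExtractor(HTMLParser):
    """Collect text content, emitting a newline for every <br> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both arrive here; turn each one into a newline.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text between tags is kept; the tags themselves are dropped.
        self.parts.append(data)


def html_to_text(html_string):
    """Replace <br> with newlines, strip tags, collapse repeated newlines."""
    parser = BrToNewlineExtractor()
    parser.feed(html_string)
    parser.close()
    text = "".join(parser.parts)
    return re.sub(r"\n+", "\n", text)


# Hypothetical markup, shaped like the span the PR targets.
sample = (
    '<span class="ctt">Line one<br><br>'
    'Line <a href="#">two</a><br/>Line three</span>'
)
print(repr(html_to_text(sample)))
```

The design choice is the same as in the PR: convert `<br>` before discarding markup, because a plain text extraction would otherwise lose the line breaks.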

Example result:

        {
            "id": "Obuk4oIaU",
            "user_id": "",
            "content": ":2024年04月26日，星期五\n今天证实了我们所说，情绪仍处于上升期，即使也要多看多了解。",
            "article_url": "",
            "original_pictures":"无",
            "retweet_pictures": null,
            "original": true,
            "video_url": "无",
            "publish_place": "无",
            "publish_time": "2024-04-26 12:11",
            "publish_tool": "微博网页版",
            "up_num": 3,
            "retweet_num": 0,
            "comment_num": 0
        }

dataabc · 2024-04-27T11:22:22Z

Thanks for contributing the code. This is a very nice improvement that makes long Weibo posts cleaner. Merged.

Reason: after dataabc#575, the result of get_long_weibo no longer contains '原文转发', so wb_content.rfind(u'原文转发') always returns -1. We can now use the same function, get_long_weibo, for both original weibo and retweets.
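
To illustrate why a constant -1 from `rfind` matters, here is a small sketch. It assumes, hypothetically, that a caller slices the text with the `rfind` result; `strip_after_marker` and `strip_after_marker_safe` are illustrative helpers, not functions from weiboSpider.

```python
MARKER = "原文转发"


def strip_after_marker(text):
    """Hypothetical caller: cut the text at the last marker, unguarded."""
    # When the marker is absent, rfind returns -1 and the slice
    # silently drops the final character instead of cutting a suffix.
    return text[:text.rfind(MARKER)]


def strip_after_marker_safe(text):
    """Same idea, but leave the text untouched when the marker is absent."""
    idx = text.rfind(MARKER)
    return text if idx == -1 else text[:idx]


with_marker = "some retweet text 原文转发[3]"
without_marker = "some long weibo text"

print(with_marker.rfind(MARKER), without_marker.rfind(MARKER))
print(repr(strip_after_marker(without_marker)))
print(repr(strip_after_marker_safe(without_marker)))
```

If the marker can never appear, the guard always takes the "untouched" branch, which is consistent with the suggestion above to reuse a single function.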

issues_bug_574: long Weibo posts could not be matched and retrieved, attempted fix

bc15852

myshero force-pushed the issues_bug_574 branch from 9d75b5f to bc15852 Compare April 27, 2024 10:29

myshero mentioned this pull request Apr 27, 2024

Long Weibo posts with an "expand" link for more: the full post content cannot be retrieved #574

Closed

issues_bug_574: long Weibo posts could not be matched and retrieved, attempted fix

241d109

dataabc merged commit d7de931 into dataabc:master Apr 27, 2024

myshero deleted the issues_bug_574 branch April 29, 2024 10:56

songzy12 mentioned this pull request Sep 20, 2025

Fix function get_long_retweet. #677

Merged

dataabc merged 2 commits into dataabc:master from myshero:issues_bug_574