Extracting the Title, Body, and Other Key Information from WeChat Official Account Articles with PHP - 站长圈子 - DZ插件网


哥斯拉

2024/08/14 19:44:06

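We use WeChat official accounts such as “同步阅读”, “香落尘外”, and “神州文艺” as examples. For accounts that publish through third-party editors, add further rules to the regular expressions. The code works well with articles published through WeChat's default official account publishing platform.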

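<?php
// Article to fetch: taken from ?url=, with a sample article as the fallback.
$url = !empty($_GET['url']) ? $_GET['url'] : 'https://mp.weixin.qq.com/s/n-X7v_JBFTSM6kBYyIG5kg';

// CURLOPT_HTTPHEADER expects a list of "Name: value" strings.
// No Host header is sent by hand; cURL derives it from the URL.
$headers = [
    'Connection: keep-alive',
    'Pragma: no-cache',
    'Referer: http://www.qq.com/',
    'Cache-Control: no-cache',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36',
    'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
];

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// An empty string makes cURL advertise and decode every encoding it supports.
curl_setopt($ch, CURLOPT_ENCODING, '');
// TLS certificate verification stays at cURL's secure defaults.
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
curl_close($ch);
if ($result === false) {
    die('Failed to fetch the article.');
}

// Return the first capture group of $pattern, or '' when nothing matches.
function first_match(string $pattern, string $subject): string
{
    return preg_match($pattern, $subject, $m) ? $m[1] : '';
}

$nickname  = first_match('/meta name="author" content="(.*?)"/', $result);      // account nickname
$title     = first_match('/property="og:title" content="(.*?)"/', $result);     // article title
$titlepic  = first_match('/property="og:image" content="(.*?)"/', $result);     // article cover image
$smalltext = first_match('/name="description" content="(.*?)"/', $result);      // article summary
$head_img  = first_match('/var round_head_img = "(.*?)";/si', $result);         // account avatar

if (!extension_loaded('dom')) {
    die('The DOM extension is not loaded; check your PHP configuration.');
}

// loadHTML() reports malformed markup as warnings, not exceptions,
// so libxml errors are collected internally instead of using try/catch.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
// The XML prolog hints UTF-8 so that Chinese text is not decoded as ISO-8859-1.
if (!$dom->loadHTML('<?xml encoding="UTF-8">' . $result)) {
    die('Failed to parse the HTML.');
}
libxml_clear_errors();

// Remove inline styles from every element.
foreach ($dom->getElementsByTagName('*') as $tag) {
    if ($tag->hasAttribute('style')) {
        $tag->removeAttribute('style');
    }
}

// Serialize the children of the article body container.
$newstext = '';
$divtext = $dom->getElementById('js_content');
if ($divtext === null) {
    die('Article body (#js_content) not found.');
}
foreach ($divtext->childNodes as $child) {
    $newstext .= $dom->saveHTML($child);
}

// ASSUMPTION: the source listing is garbled at this point. The allowed-tag list
// below is a guess, and a callback-based replacement that followed it cannot be
// recovered, so it is not reproduced here.
$newstext = strip_tags($newstext, '<p><span><br><strong>');

// Drop every attribute from <p>, <span>, and <br> tags.
$newHtml = preg_replace('/<(\/)?(p|span|br)[^>]*>/i', '<$1$2>', $newstext);
// Unwrap any <p> that still carries a style attribute, keeping its inner content.
$newHtml = preg_replace('/<p[^>]*style\s*=\s*"\s*[^"]*"\s*>(.*?)<\/p>/i', '$1', $newHtml);
// ASSUMPTION: the two search strings appear only as blanks in the source listing;
// the non-breaking space entity and its decoded character are assumed.
$newHtml = str_replace(['&nbsp;', "\u{00A0}"], '', $newHtml);

// Strip the header. ASSUMPTION: one alternative is blank in the source listing;
// an opening <strong> tag is assumed, hence <\/?strong>.
$wechattext = preg_replace('/<\/?strong>|■|.*免费订阅|.*文学新高地|.*点击上方/iu', '', $newHtml);
// Strip the footer.
$wechattext = preg_replace('~作者简介.*?>|延伸阅读.*|重要公告.*|责任编辑.*|落尘外平台团队.*|作者：.*|香落尘外.*|往期作品回顾.*~u', '', $wechattext);

// $public_r is not defined in this script; it is presumably provided by the host CMS.
$content = [
    'status'     => 200,
    'msg'        => 'Fetched successfully',
    'newstext'   => $wechattext,
    'nickname'   => $nickname,
    'title'      => $title,
    'url'        => $url,
    'titlepic'   => $titlepic,
    'smalltext'  => $smalltext,
    'head_img'   => $head_img,
    'time'       => date('Y-m-d H:i:s'),
    'api_source' => ($public_r['sitename'] ?? '') . ' official site: ' . ($public_r['add_pcurl'] ?? ''),
];

// JSON_UNESCAPED_SLASHES keeps URLs readable; calling stripslashes() on the
// encoded string would also strip the backslashes that escape quotes and break the JSON.
header('Content-Type: application/json; charset=utf-8');
echo json_encode($content, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);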
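
The regular expressions used for the metadata fields depend on the exact attribute order inside each meta tag. As a minimal sketch of an alternative, the same fields could be read with DOMXPath, which does not care about attribute order. The function name `extract_meta` and the sample HTML are hypothetical and serve only as illustration; real WeChat pages may differ.

```php
<?php
/**
 * Hypothetical helper: read metadata fields from an HTML page with DOMXPath.
 * Missing fields come back as empty strings.
 */
function extract_meta(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The XML prolog hints UTF-8 to the HTML parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $queries = [
        'nickname'  => '//meta[@name="author"]/@content',
        'title'     => '//meta[@property="og:title"]/@content',
        'titlepic'  => '//meta[@property="og:image"]/@content',
        'smalltext' => '//meta[@name="description"]/@content',
    ];

    $out = [];
    foreach ($queries as $key => $query) {
        $nodes = $xpath->query($query);
        $out[$key] = ($nodes !== false && $nodes->length > 0)
            ? $nodes->item(0)->nodeValue
            : '';
    }
    return $out;
}

// Made-up sample page, not a real WeChat article.
$sample = '<html><head>'
    . '<meta name="author" content="Demo Account">'
    . '<meta property="og:title" content="Hello &amp; welcome">'
    . '</head><body></body></html>';

$meta = extract_meta($sample);
echo $meta['title'], "\n";
```

Because the parser decodes HTML entities, values obtained this way need no separate entity decoding, unlike values captured by a regular expression.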

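That is how to fetch information about a WeChat official account article in code. The goal is to collect specific target articles or to migrate them to a third-party platform that exposes a usable API. The author only ran basic tests and did not study the regular expressions in depth, so articles published to WeChat through third-party tools may not be captured cleanly. Add your own matching rules at the commented places in the code. As the old saying goes, do the work yourself and you will have plenty to eat and wear. This tutorial is free and is meant only as an example for readers who have not done this before. If you know a better way to write it, feel free to contact us.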
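
Since the cleanup patterns differ from account to account, one way to keep them manageable is a rule table keyed by account nickname, with a default list as the fallback. This is only a sketch: the function `clean_article`, the layout of `$rules`, and the sample text are hypothetical, and the patterns are a small subset borrowed from the script.

```php
<?php
/**
 * Hypothetical helper: apply account-specific cleanup patterns to article text.
 *
 * @param array<string, string[]> $rules nickname => list of PCRE patterns;
 *                                       the key '*' holds the default list
 */
function clean_article(string $text, string $nickname, array $rules): string
{
    $patterns = $rules[$nickname] ?? $rules['*'] ?? [];
    foreach ($patterns as $pattern) {
        $cleaned = preg_replace($pattern, '', $text);
        // preg_replace() returns null on error; keep the previous text in that case.
        if ($cleaned !== null) {
            $text = $cleaned;
        }
    }
    return $text;
}

$rules = [
    '*'        => ['/.*点击上方/u', '/责任编辑.*/u'],
    '香落尘外' => ['/.*点击上方/u', '/往期作品回顾.*/u', '/香落尘外.*/u'],
];

// Made-up sample text; an unknown nickname falls back to the '*' rules.
$text = "欢迎点击上方\n<p>正文内容</p>\n责任编辑：某某";
echo clean_article($text, '未知公众号', $rules);
```

Supporting a new account then means adding one entry to `$rules` rather than editing one long alternation.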
