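Preface
This post is short. It just records a problem I ran into by chance.
Reproducing the problem
Here is what happened. I was using XPath to parse a certain website. The site serves its data as plain HTML rather than a JSON string, so I had to parse the DOM. Wherever a regular expression would do, I used one; where it would not, I used BeautifulSoup or pyquery. Even so, neither is as general-purpose as XPath, and since I had already written a first version with XPath, rewriting it in the limited time I had would have been a pain.
Enough talk, let's look at the problem.
First, part of the site's source looks like this: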
<article class="_55wo _5rgr _5gh8 _3drq async_like"
data-ft='{"mf_story_key":"10159935560038463","top_level_post_id":"10159935560038463","tl_objid":"10159935560038463","content_owner_id_new":"8245623462","throwback_story_xxid":"10159935560038463","page_id":"8245623462","story_location":4,"story_attachment_style":"video_inline","tds_flgs":3,"ott":"AX90AyHPzJSMfPjF","tn":"-R"}'
data-sigil="story-div story-popup-metadata story-popup-metadata feed-ufi-metadata"
data-store='{"linkdata":"mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF","share_id":"10159935560038463","feedback_target":"10159935560038463","feedback_source":0,"action_source":0,"actor_id":100065274592441}'
data-xt="2.mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF"
data-xt-vimp='{"pixel_in_percentage":0,"duration_in_ms":1,"subsequent_gap_in_ms":60000,"log_initial_nonviewable":false,"should_batch":true,"require_horizontally_onscreen":false}'
>
<div class="story_body_container">
<header class="_7om2 _1o88 _77kd _5qc1">
<div class="_5s61 _2pii _5i2i _52wc">
<div class="_5xu4">
<div class="_67lm _77kc" data-gt='{"tn":"~"}' data-sigil="feed_story_ring8245623462"><a
data-click='{"event":"click_post_avatar_image","target_id":"10159935560038463"}'
data-gt='{"tn":"~"}' href="/nba/?__tn__=%7E%7E-R"><i aria-label="NBA, profile picture"
class="img _1-yc profpic" role="img"
></i></a>
</div>
</div>
</div>
<div class="_4g34 _5i2i _52we">
<div class="_5xu4">
<div class="_7om2 _52wc">
<div class="_4g34"><h3 class="_52jd _52jb _52jh _5qc3 _4vc- _3rc4 _4vc-" data-gt='{"tn":"C"}'>
<span><strong><a href="/nba/?__tn__=C-R">NBA</a></strong><span aria-label="Verified Page"
class="_56_f _5dzy _5dz- _3twv"
role="img"></span></span>
</h3>
<div class="_52jc _5qc4 _78cz _24u0 _36xo" data-sigil="m-feed-voice-subtitle"><a
href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=-R"><abbr>6
hrs</abbr></a><span aria-hidden="true"> · </span><span><div class="_7jwi"><span
data-sigil="audience-icon"><i aria-label="Public"
class="feedAudienceIcon img sp_eXcmc5QyINt_2x sx_e966fc"
role="img"></i></span><div class="_7jwh"></div></div></span>
</div>
</div>
<div class="_5s61">
<div class="_2pir" ></div>
</div>
<div class="_5s61"></div>
<div class="_5s61 _2pis">
<div class="_yff" data-sigil="story-popup-causal-init"
data-store='{"feedobjectsIdentifiers":"S:_I8245623462:10159935560038463","feedContext":"{\"use_m_feed\":true,\"m_entstream_source\":\"timeline\",\"is_pages_timeline\":true,\"story_node_id\":\"u_0_5_iv\",\"show_attachments\":true,\"is_attached_story\":false}"}'
role="button"></a><i class="img sp_eXcmc5QyINt_2x sx_b9866d"
data-sigil="story-popup-context-init"><u>More
options</u></i></div>
</div>
</div>
</div>
</div>
</header>
<div class="_5rgt _5nk5 _5msi" data-ft='{"tn":"*s"}' data-gt='{"tn":"*s"}' style="">
<div><span><p>Watch the BEST DEEP 3'S from the <a href="/LAClippers/?__tn__=%2As-R">L.A. Clippers</a> during the <a
class="_5ayv _qdx" href="/hashtag/nbaplayoffs?__tn__=%2As-R"><span class="_5aw4 _qdz">#</span><span
class="_5ayu">NBAPlayoffs</span></a>! </p><p> <a class="_5ayv _qdx"
href="/hashtag/thatsgame?__tn__=%2As-R"><span
class="_5aw4 _qdz">#</span><span class="_5ayu">ThatsGame</span></a> <span class="_5mfr"><span
class="_6qdm"
style='height: 16px; width: 16px; font-size: 16px; background-image: url("https://static.xx.xxcdn.net/images/emoji.php/v9/tdf/2/16/1f4a5.png")'>💥</span></span></p></span>
</div>
<a aria-label="Open story" class="_5msj"
href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=%2As%2As-R"></a></div>
<div class="_5rgu _7dc9 _27x0" data-ft='{"tn":"H"}'>
<section class="_2rea _24e1 _412_ _bpa _vyy _5t8z">
<div class="_2zi_ _zgm _2zj0">
<div class="_53mw" data-sigil="inlineVideo"
data-store='{"videoID":"4456269257751059","playerFormat":"inline","playerOrigin":"page_timeline","external_log_id":null,"external_log_type":null,"rootID":4456269257751059,"playerSuborigin":"misc","useOzLive":false,"playbackIsLiveStreaming":false,"canUseOffline":null,"playOnClick":true,"videoDebuggerEnabled":false,"videoViewabilityLoggingEnabled":false,"videoViewabilityLoggingPollingRate":-1,"videoScrollUseLowThrottleRate":true,"playInFullScreen":false,"type":"video","src":"https:\/\/video-mad1-1.xx.xxcdn.net\/v\/t42.1790-2\/10000000_540531577146622_2129266242166849959_n.mp4?_nc_cat=111&ccb=1-3&_nc_sid=985c63&efg=eyJ2ZW5jb2RlX3RhZyI6InN2ZV9zZCJ9&_nc_ohc=CHxlLBnqdg8AX84rJTC&tn=3o-lXXvU9tVtdq6j&_nc_rml=0&_nc_ht=video-mad1-1.xx&oh=5ab243e6a2407a74ed09407f43ad04e9&oe=6107CF3F","width":320,"height":180,"trackingNodes":"FH-R","downloadResources":null,"subtitlesSrc":null,"spherical":false,"sphericalParams":null,"defaultQuality":null,"availableQualities":null,"playStartSec":null,"playEndSec":null,"playMuted":null,"disableVideoControls":false,"loop":false,"numOfLoops":null,"shouldPlayInline":true,"dashManifest":null,"isAdsPreview":false,"iframeEmbedReferrer":null,"adClientToken":null,"audioOnlyVideoSrc":null,"audioOnlyEnabled":false,"permalinkShareID":null,"feedPosition":null,"chainDepth":null,"videoURL":"https:\/\/www.xxxxxx.com\/nba\/videos\/4456269257751059\/","disableLogging":false}'>
<i class="img _lt3 _4s0y" data-sigil="playInlineVideo"
style=""></i>
<div class="_1o0y" data-sigil="m-video-play-button playInlineVideo"><span
style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Play Video</span>
</div>
</div>
</div>
</section>
<div></div>
<div></div>
</div>
</div>
<footer class="_22rc" data-ft='{"tn":"*W"}'>
<div class="_2ip_ _4b44" data-sigil="mufi-inline" >
<div class="_34qc _3hxn _3myz _4b45"><a data-sigil="feed-ufi-trigger"
href="/story.php?story_xxid=10159935560038463&id=8245623462&anchor_composer=false&__tn__=%2AW-R"
role="button">
<div class="_rnk _77ke _2eo- _1e6 _4b44" data-sigil="reactions-bling-bar" >
<div class="_1w1k" data-sigil="reactions-sentence-container"><span class="_qfz _77kf"><div
class="_1g05 _77lc" style="z-index:3"><i class="img sp_eXcmc5QyINt_2x sx_9540f7"
role="presentation"><u>Like</u></i></div><div
class="_1g05 _77lc" style="z-index:2"><i class="img sp_eXcmc5QyINt_2x sx_2d1286"
role="presentation"><u>Love</u></i></div><div
class="_1g05 _77lc" style="z-index:1"><i class="img sp_eXcmc5QyINt_2x sx_176208"
role="presentation"><u>Wow</u></i></div></span>
<div aria-label="567 left reactions including Like, Love and Wow" class="_1g06">567</div>
</div>
<div class="_1fnt"><span class="_1j-c" data-sigil="comments-token">10 Comments</span><span
class="_1j-c">36 Shares</span></div>
</div>
</a></div>
<div class="_52jh _7om2 _15kk _15ks _15km _4b47 _4b46" data-sigil="ufi-inline-actions">
<div class="_52jj _15kl _3hwk _4g34"><a aria-pressed="false" class="_15ko _77li touchable"
data-ft='{"tn":">"}'
data-sigil="touchable ufi-inline-like like-reaction-flyout"
data-store='{"reaction":0,"feedbackTarget":"10159935560038463","kaiOSReactions":false}'
href="/ufi/reaction/?ft_ent_identifier=10159935560038463&reaction_type=1&story_render_location=timeline&feedback_source=0&is_sponsored=0&ext=1628151954&hash=AeQmDqjrKECVo8k9bxk&__tn__=%3E%2AW-R"
>Like</a>
<div class="_1ekf" data-sigil="screenreader-reactions-trigger" role="link" tabindex="-1">Show more
reactions
</div>
</div>
<div class="_52jj _15kl _3hwk _4g34"><a class="_15kq _77li"
data-click='{"event":"click_comment_ufi","target_id":"10159935560038463"}'
data-ft='{"tn":"S"}'
data-sigil="feed-ufi-focus feed-ufi-trigger ufiCommentLink mufi-composer-focus"
href="/story.php?story_xxid=10159935560038463&id=8245623462&fs=0&focus_composer=0&__tn__=S%2AW-R">Comment</a>
</div>
<div class="_52jj _15kl _3hwk _4g34"><a class="_15kr _77li"
data-click='{"event":"click_share_ufi","target_id":"10159935560038463"}'
data-ft='{"tn":"J"}' data-sigil="share-popup"
data-store='{"is_acting_as_page":false,"reshare_post":false,"share_id":"10159935560038463","feedback_source":0,"feedback_referrer":null,"internal_preview_image_id":null,"shareable_uri":"\/story.php?story_xxid=10159935560038463&id=8245623462","user_id":100065274592441,"behavior":"custom"}'
href="/sharer.php?fs=0&sid=10159935560038463&__tn__=J%2AW-R">Share</a>
</div>
</div>
</div>
</footer>
</article>
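Yet my XPath expression simply could not parse it. When I tested it with a short script, the result was strange; after some experimenting I found that the emoji characters were the cause.
Once I deleted those emoji, the page parsed normally.
Quite a surprise.
Do you know I spent a whole hour tracking this down? I really did dig the problem out bit by bit, much like reverse-engineering JS code one chunk at a time.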
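Solving the problem
My first idea was to use BeautifulSoup to find the node that contains the emoji and delete it, and that did solve the problem.
But I found this is not very general: the emoji will not necessarily sit in the element with class _6qdm that I selected, and can just as well show up somewhere else.
So a regular expression is still needed: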
re.compile(u'[\U00010000-\U0010ffff]')
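It helps to see exactly what this pattern covers. The sketch below is only an illustration, and the sample strings are my own rather than taken from the page, apart from U+1F4A5, the code point named in the emoji image URL (1f4a5.png) in the source above. The pattern matches every code point outside the Basic Multilingual Plane, which includes most emoji but also non-emoji characters, and it does not match emoji-like symbols that live inside the BMP.

```python
import re

# Same pattern as in the article: every code point from U+10000 to U+10FFFF,
# i.e. everything outside the Basic Multilingual Plane (BMP).
NON_BMP = re.compile('[\U00010000-\U0010ffff]')

# Illustrative samples (assumptions, not data from the scraped page).
samples = {
    'collision': '\U0001F4A5',      # emoji outside the BMP
    'sun': '\u2600',                # emoji-like symbol inside the BMP
    'gothic_letter': '\U00010330',  # not an emoji, but outside the BMP
    'plain': 'NBA',
}

for name, text in samples.items():
    # True means the pattern would strip this character.
    print(name, bool(NON_BMP.search(text)))
```

So the pattern is really a "non-BMP character" filter. For this scraping job that is acceptable, but text in supplementary-plane scripts would be removed as well.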
Since the pattern matches, replacing with sub is all it takes:
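import re
from lxml import etree

# Read the saved page as UTF-8 text.
with open('profile.html', encoding='utf-8') as f:
    cont = f.read()

try:
    # Match every code point outside the BMP, where most emoji live.
    pattern = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Fallback for builds that reject the range above: match surrogate pairs.
    pattern = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
print(pattern.findall(cont))
cont = pattern.sub('', cont)
# Earlier BeautifulSoup approach, kept for reference:
# soup = BeautifulSoup(cont, 'html.parser')
# remove_obj = soup.select('span[class="_6qdm"]')
# if remove_obj:
#     [rem.extract() for rem in remove_obj]
# html_xpath = etree.HTML(str(soup))
html_xpath = etree.HTML(cont)
print(html_xpath.xpath('//text()'))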
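A note on the fallback pattern in the except branch: as far as I know it exists for builds that store strings as UTF-16 (old narrow Python 2 builds), where a character outside the BMP appears as a high surrogate followed by a low surrogate. On Python 3 the first pattern should compile fine, so the fallback is normally never reached. The sketch below shows how one code point maps to such a pair; the helper name to_surrogates is hypothetical and used only for illustration.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> range D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> range DC00..DFFF
    return high, low

high, low = to_surrogates(0x1F4A5)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder.
encoded = '\U0001F4A5'.encode('utf-16-be')
print(encoded.hex())
```

The two ranges in the fallback pattern, D800 to DBFF and DC00 to DFFF, are exactly the ranges the two halves fall into.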
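I ran it, and then verified it against a different HTML structure.
Sure enough, the emoji were matched there as well. OK, problem solved.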
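The check against a different HTML structure can be sketched roughly as follows. The fragment is made up for illustration (it is not the page I actually tested), and it places emoji in an attribute, a heading and a paragraph, none of which use the _6qdm class. Using subn instead of sub also reports how many characters were removed.

```python
import re

NON_BMP = re.compile('[\U00010000-\U0010ffff]')

# Hypothetical fragment: emoji in an attribute, an <h3> and a <p>.
html = (
    '<div title="fire \U0001F525">'
    '<h3>Game 7 \U0001F3C0</h3>'
    '<p>What a shot \U0001F4A5\U0001F4A5</p>'
    '</div>'
)

# subn returns the cleaned string and the number of replacements made.
cleaned, removed = NON_BMP.subn('', html)
print(removed)
print(cleaned)
```

Because the substitution works on the raw text before any parsing, it does not matter which tag or attribute the emoji sits in.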
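Unless otherwise noted, all articles on this site are original. If you repost, please credit the source: Python crawler: handling emoji that prevent XPath from parsing a page correctly - Python技术站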