[Question title]: In Python, how do I most efficiently chunk a UTF-8 string for REST delivery?
[Posted]: 2023-04-04 03:14:01
[Question description]:

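  1. I'll start by saying that I sort of understand what "UTF-8" encoding is: it is basically, but not exactly, Unicode, and ASCII is a smaller character set. I also understand that if I have: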

    se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word tr <excess removed ...> JV"
    print len(se_body)              # number of characters in the string, in my case 1500
    print sys.getsizeof(se_body)    # size of the string object in memory, in bytes: 3050 here
    
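  2. My code uses a RESTful API that I do not control. The API's job is to parse Bible references out of text passed as a parameter, and it has an interesting quirk: it accepts only 2000 characters at a time. If I send more than 2000 characters, my API call returns a 404. Again, I am using someone else's API, so please don't tell me to "fix the server side." I can't :)

  3. My solution is to take the string and chunk it into pieces of fewer than 2000 characters, have the API scan each chunk, and then reassemble and tag as needed. I'd like to be kind to the service and send as few chunks as possible, which means each chunk should be large.

  4. My problem arises when I pass a string containing Hebrew or Greek characters. (Yes, answers about the Bible often use Greek and Hebrew!) If I set the chunk size as low as 1000 characters, I can always get it through safely, but that seems really small. In most cases I should be able to use larger chunks.

  5. My question is: without resorting to too many heroics, what is the most efficient way to chunk UTF-8 into pieces of the right size?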
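The different notions of "length" in point 1 can be compared side by side. The sketch below is written for Python 3 (the question's own code is Python 2), and the helper name `describe_lengths` is hypothetical. Which of these lengths the API actually limits is not stated in the question; the URL-quoted length is included only on the assumption that the text travels in a GET query string, where each non-ASCII byte is percent-encoded into three characters.

```python
import sys
from urllib.parse import quote_plus


def describe_lengths(text):
    """Return several different 'lengths' of the same text."""
    return {
        # Number of Unicode code points, which is what len() reports for str.
        "characters": len(text),
        # Number of bytes once the text is encoded as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Length after form-style percent-encoding (spaces become '+'),
        # the quoting urllib.parse.urlencode applies by default.
        "url_quoted": len(quote_plus(text)),
        # Size of the Python object in memory; implementation dependent
        # and unrelated to what is sent over the wire.
        "in_memory": sys.getsizeof(text),
    }


if __name__ == "__main__":
    for name, value in describe_lengths("Genesis 2:2 שבת").items():
        print(name, value)
```

If the limit turns out to apply to the quoted URL, a two-byte Hebrew letter costs six characters of budget, which would be consistent with a 1000-character chunk passing where a larger one fails.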

Here is the code:

# -*- coding: utf-8 -*-
import requests
import json

biblia_apikey = '************'
refparser_url = "http://api.biblia.com/v1/bible/scan/?"
se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word translated as &quot;rest&quot; in English, is actually the conjugated word from which we get the English word `Sabbath`, which actually means to &quot;cease doing&quot;. &gt; וַיִּשְׁבֹּת or by its root: &gt; שָׁבַת Here&#39;s BlueletterBible&#39;s concordance entry: [Strong&#39;s H7673][1] It is actually the same root word that is conjugated to mean &quot;[to go on strike][2]&quot; in modern Hebrew. In Genesis it is used to refer to the fact that the creation process ceased, not that God &quot;rested&quot; in the sense of relieving exhaustion, as we would normally understand the term in English. The word &quot;rest&quot; in that sense is &gt; נוּחַ Which can be found in Genesis 8:9, for example (and is also where we get Noah&#39;s name). More here: [Strong&#39;s H5117][3] Jesus&#39; words are in reference to the fact that God is always at work, as the psalmist says in Psalm 54:4, He is the sustainer, something that implies a constant intervention (a &quot;work&quot; that does not cease). The institution of the Sabbath was not merely just so the Israelites would &quot;rest&quot; from their work but as with everything God institutes in the Bible, it had important theological significance (especially as can be gleaned from its prominence as one of the 10 commandments). The point of the Sabbath was to teach man that he should not think he is self-reliant (cf. instances such as Judges 7) and that instead they should rely upon God, but more specifically His mercy. 
The driving message throughout the Old Testament as well as the New (and this would be best extrapolated in c.se) is that man cannot, by his own efforts (&quot;works&quot;) reach God&#39;s standard: &gt; Ephesians 2:8 For by grace you have been saved through faith, and that not of yourselves; it is the gift of God, 9 not of works, lest anyone should boast. The Sabbath (and the penalty associated with breaking it) was a way for the Israelites to weekly remember this. See Hebrews 4 for a more in depth explanation of this concept. So there is no contradiction, since God never stopped &quot;working&quot;, being constantly active in sustaining His creation, and as Jesus also taught, the Sabbath was instituted for man, to rest, but also, to &quot;stop doing&quot; and remember that he is not self-reliant, whether for food, or for salvation. Hope that helps. [1]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H7673&amp;t=KJV [2]: http://www.morfix.co.il/%D7%A9%D7%91%D7%99%D7%AA%D7%94 [3]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?strongs=H5117&amp;t=KJV"

se_body = se_body.decode('utf-8')

nchunk_start=0
nchunk_size=1500
found_refs = []

while nchunk_start < len(se_body):
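    # The end of the slice must move with the start, otherwise every chunk after the first is empty
    body_chunk = se_body[nchunk_start:nchunk_start + nchunk_size]
    if len(body_chunk.strip()) < 4:
        break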

    refparser_params = {'text': body_chunk, 'key': biblia_apikey }
    headers = {'content-type': 'text/plain; charset=utf-8', 'Accept-Encoding': 'gzip,deflate,sdch'}
    refparse = requests.get(refparser_url, params = refparser_params, headers=headers)

    if (refparse.status_code == 200):
        foundrefs = json.loads(refparse.text)
        for foundref in foundrefs['results']:
            foundref['textIndex'] += nchunk_start
            found_refs.append( foundref ) 
    else:
        print "Status Code {0}: Failed to retrieve valid parsing info at {1}".format(refparse.status_code, refparse.url)
        print "  returned text is: =>{0}<=".format(refparse.text)

    nchunk_start += (nchunk_size-50)
    #Note: I'm purposely backing up, so that I don't accidentally split a reference across chunks


for ref in found_refs:
    print ref
    print se_body[ref['textIndex']:ref['textIndex']+ref['textLength']]

I know how to slice a string by character position, as the body_chunk assignment above does, but I am not sure how to slice the same string according to the length of its UTF-8 encoding.
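One possible way to slice by encoded length while keeping character offsets is sketched below, in Python 3. The names `utf8_len`, `chunk_by_utf8_bytes`, `max_bytes` and `overlap` are hypothetical, and the sketch assumes the limit is a UTF-8 byte count, which the question does not confirm. It never splits a character, prefers to end a chunk just after whitespace, and yields each chunk's starting character offset so that a returned `textIndex` can be shifted back into the original string.

```python
def utf8_len(ch):
    """Number of bytes a single character occupies in UTF-8."""
    cp = ord(ch)
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def chunk_by_utf8_bytes(text, max_bytes, overlap=0):
    """Yield (start, chunk) pairs whose UTF-8 encoding fits in max_bytes.

    start is the character offset of chunk within text. overlap is the
    number of characters repeated at the start of the next chunk.
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must be able to hold any one character")
    if overlap < 0:
        raise ValueError("overlap must not be negative")
    n = len(text)
    start = 0
    while start < n:
        # Extend the chunk one character at a time until the budget is spent.
        end = start
        used = 0
        while end < n and used + utf8_len(text[end]) <= max_bytes:
            used += utf8_len(text[end])
            end += 1
        # Unless this is the last chunk, back up to just after the last
        # whitespace character so that words are not split.
        if end < n:
            for i in range(end - 1, start, -1):
                if text[i].isspace():
                    end = i + 1
                    break
        yield start, text[start:end]
        if end >= n:
            return
        # Step back by the overlap, but always move forward by at least one.
        start = max(end - overlap, start + 1)
```

In the loop from the question, this would replace the manual `nchunk_start` arithmetic: iterate over `chunk_by_utf8_bytes(se_body, max_bytes, overlap=50)`, send each chunk, and add the yielded start offset to every returned `textIndex`.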

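Once that is done, I need to pull out the selected references (I am actually going to add SPAN tags). This is what the output looks like right now: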

{u'textLength': 11, u'textIndex': 5, u'passage': u'Genesis 2:2'}
Genesis 2:2
{u'textLength': 11, u'textIndex': 841, u'passage': u'Genesis 8:9'}
Genesis 8:9
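The tagging step could be sketched as follows, in Python 3. The function name `tag_references` and the `<span class="ref">` markup are illustrative placeholders; only the `textIndex` and `textLength` keys come from the output above. The sketch assumes the offsets are character positions in the original string and that references never partially overlap. Exact duplicates, which overlapping chunks may produce, are dropped.

```python
def tag_references(text, refs, open_tag='<span class="ref">', close_tag="</span>"):
    """Wrap each referenced span of text in open_tag ... close_tag."""
    # Collapse duplicate hits that point at the same span.
    spans = {(ref["textIndex"], ref["textLength"]) for ref in refs}
    # Insert from the end of the text backwards so that the offsets of
    # spans still to be processed are not shifted by earlier insertions.
    for index, length in sorted(spans, reverse=True):
        end = index + length
        text = text[:index] + open_tag + text[index:end] + close_tag + text[end:]
    return text


if __name__ == "__main__":
    sample = "See Genesis 2:2 and Genesis 8:9."
    hits = [
        {"textIndex": 4, "textLength": 11, "passage": "Genesis 2:2"},
        {"textIndex": 20, "textLength": 11, "passage": "Genesis 8:9"},
    ]
    print(tag_references(sample, hits))
```

Working backwards avoids having to recompute every later offset after each insertion.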

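[Comments]:

  • Presumably the chunking needs to take whitespace into account? E.g., don't send partial words to the API?
  • So, whitespace isn't important here, although I'd be happy if I could chunk in a way that always breaks on whitespace. The only tricky part is that when I'm done, I need to add tags at the returned positions in the original string.
  • Splitting on the last whitespace character is easier, but you can still easily cut the final character off as an invalid, incomplete byte sequence.
  • So, to cut to the chase: your Python string abstracts away too much, and the abstraction breaks at the interface boundary. You actually need a different interpretation of length.
  • @Deduplicator: This is not a Python problem. It is a problem with all variable-width byte-encoded data.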
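The incomplete-byte problem raised in the comments can be shown on raw bytes. In UTF-8, every continuation byte has the bit pattern `10xxxxxx`, so a cut that lands on one can be moved back to the start of that character. The sketch below is Python 3, where indexing `bytes` yields integers; the function name `truncate_utf8` is hypothetical, and the input is assumed to be valid UTF-8.

```python
def truncate_utf8(data, max_bytes):
    """Return the longest prefix of UTF-8 encoded data that is at most
    max_bytes long and does not end in the middle of a character."""
    if max_bytes <= 0:
        return b""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (0b10xxxxxx), the cut falls inside a character,
    # so move it back to that character's lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


if __name__ == "__main__":
    encoded = "aשב".encode("utf-8")  # 1 + 2 + 2 bytes
    try:
        encoded[:2].decode("utf-8")  # naive slice splits the first Hebrew letter
    except UnicodeDecodeError as exc:
        print("naive slice failed:", exc.reason)
    print(truncate_utf8(encoded, 2).decode("utf-8"))
```

Because this works on bytes, any position returned for the truncated data is a byte offset and has to be converted if character offsets are needed.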

Tags:
python
string
rest
unicode
utf-8