Best and cleanest way to encode emoji (Python) from a text file

Popularity: 977 Posted: 2022-10-16 Tags: text-files python encode unicode emoji

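Question

This refers to the question: Emoji crashed when uploading to Big Query

I am looking for the best and cleanest way to convert emoji from the \ud83d\ude04 form to the Unicode form \U0001f604. At the moment my only idea is to write a Python function that walks through a text file and replaces the emoji encodings.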

This is how a string can be converted:

Converting emojis to Unicode and vice versa in python 3

Presumably the text would need to be passed through line by line and converted?

A potential idea:

with open(ff_name, 'rb') as source_file:
  with open(target_file_name, 'w+b') as dest_file:
    contents = source_file.read()
    dest_file.write(contents.decode('utf-16').encode('utf-8'))

Recommended answer

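I assume that you somehow ended up with a raw ASCII string that contains escape sequences for UTF-16 code units forming surrogate pairs, and that you (for whatever reason) want to convert them to the \UXXXXXXXX format.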
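To make the relationship between the two forms concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic applied to the pair from the question. The helper name `combine_surrogates` is hypothetical and only illustrates the mapping; the answer itself relies on Python's codecs rather than on manual arithmetic.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the code point offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine_surrogates(0xD83D, 0xDE04)
escaped = f"\\U{code_point:08x}"  # the 8-digit escape form
print(escaped)
# \U0001f604
```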

So from here on I assume that your input (bytes!) looks like this:

# The backslashes are doubled so the escapes stay literal text in the bytes
weirdInput = "hello \\ud83d\\ude04".encode("latin_1")

Now you want to do the following:

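1. Interpret the bytes so that the \uXXXX sequences are turned into UTF-16 code units. The raw_unicode_escape codec does this, but unfortunately it needs a separate pass to fix the surrogate pairs (to be honest, I don't know why).
2. Fix the surrogate pairs, turning the data into valid UTF-16.
3. Decode it as valid UTF-16.
4. Encode it again as "raw_unicode_escape".
5. Decode it back as plain latin_1, which now consists only of plain ASCII and Unicode escape sequences in the \UXXXXXXXX format.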
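The steps above can be sketched one stage at a time to see what each pass produces. The variable names (`weird_input`, `step1` to `step5`) are only for illustration. The separate pass is likely needed because raw_unicode_escape decodes each \uXXXX escape independently, so the two surrogate halves stay two separate code points until the UTF-16 round trip joins them.

```python
weird_input = b"hello \\ud83d\\ude04"  # ASCII bytes with literal escapes

# 1. Each \uXXXX escape becomes its own code point (two lone surrogates).
step1 = weird_input.decode("raw_unicode_escape")
print(len(step1))   # 6 characters of "hello " plus 2 surrogates
# 8

# 2. 'surrogatepass' lets the encoder emit the lone surrogates as code units.
step2 = step1.encode("utf-16", "surrogatepass")

# 3. Decoding the UTF-16 bytes joins the pair into a single character.
step3 = step2.decode("utf-16")
print(len(step3))
# 7

# 4. Non-BMP characters are written back as \UXXXXXXXX escapes.
step4 = step3.encode("raw_unicode_escape")

# 5. The result is pure ASCII, so latin_1 decodes it unchanged.
step5 = step4.decode("latin_1")
print(step5)
# hello \U0001f604
```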

Something like this:

output = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
  .encode("raw_unicode_escape")
  .decode("latin_1")
)

Now, if you print(output), you get:

hello \U0001f604

Note that if you stop at the intermediate stage:

smiley = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
)

then you get a Unicode string that contains the actual emoji:

print(smiley)
# hello 😄
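As a quick sanity check on that intermediate string, the sketch below inspects the last character and its UTF-8 bytes. It assumes the same input as above; the names `last_char` and `utf8_bytes` are illustrative. This form may be the more useful one if the destination expects real UTF-8 text rather than escape sequences.

```python
weird_input = b"hello \\ud83d\\ude04"

smiley = (weird_input
    .decode("raw_unicode_escape")
    .encode("utf-16", "surrogatepass")
    .decode("utf-16")
)

last_char = smiley[-1]
utf8_bytes = last_char.encode("utf-8")

print(len(smiley))           # the emoji counts as one character
# 7
print(hex(ord(last_char)))   # a single code point, not a surrogate pair
# 0x1f604
print(utf8_bytes.hex())      # four UTF-8 bytes
# f09f9884
```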

Full code:

# The backslashes are doubled so the escapes stay literal text in the bytes
weirdInput = "hello \\ud83d\\ude04".encode("latin_1")

output = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
  .encode("raw_unicode_escape")
  .decode("latin_1")
)


smiley = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
)

print(output)
# hello \U0001f604

print(smiley)
# hello 😄
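Since the question is about a text file, here is a minimal sketch of applying the same chain line by line. The function names `fix_surrogate_escapes` and `convert_file` and the file names are hypothetical. It assumes the source file is ASCII or latin_1 text in which the escapes appear literally; an unpaired surrogate escape would make the UTF-16 decode step raise UnicodeDecodeError.

```python
import os
import tempfile

def fix_surrogate_escapes(raw: bytes) -> bytes:
    """Rewrite literal \\uXXXX surrogate-pair escapes as \\UXXXXXXXX escapes."""
    return (raw
        .decode("raw_unicode_escape")
        .encode("utf-16", "surrogatepass")
        .decode("utf-16")
        .encode("raw_unicode_escape")
    )

def convert_file(source_path: str, target_path: str) -> None:
    """Convert a file line by line so large inputs are not read at once."""
    with open(source_path, "rb") as source_file, open(target_path, "wb") as dest_file:
        for line in source_file:
            dest_file.write(fix_surrogate_escapes(line))

# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    source_path = os.path.join(tmp_dir, "source.txt")
    target_path = os.path.join(tmp_dir, "target.txt")
    with open(source_path, "wb") as f:
        f.write(b"hello \\ud83d\\ude04\nplain line\n")
    convert_file(source_path, target_path)
    with open(target_path, "rb") as f:
        result = f.read()

print(result)
# b'hello \\U0001f604\nplain line\n'
```

Working in binary mode keeps line endings and ordinary ASCII untouched; only the escape sequences change.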
