Reading Unicode emoji into R correctly

Popularity: 69 Posted: 2023-01-03 Tags: utf-8 unicode text r emoji

Problem description

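I have a set of comments from Facebook (pulled via a system like Sprinklr) that contain both text and emoji. I'm trying to run various analyses on them in R, but I'm having trouble ingesting the emoji correctly.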

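For example: I have a .csv (encoded in UTF-8) with a message row containing the following:

"IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"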

I then ingest it into R as follows:

library(tidyverse)
library(janitor)
raw.fb.comments <- read_csv("data.csv",
                            locale = locale(encoding="UTF-8"))
fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook")
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\n\n"

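Now, from what I understand from other sources, I need to convert this UTF-8 to ASCII, which I can then use to link the comments to other emoji resources (such as the wonderful emojidictionary). For the join to work, I need to get the emoji into R's byte notation, like this: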

<e2><9d><a4><ef><b8><8f>

However, adding the usual step (using iconv) doesn't get me there:

fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook") %>%
  mutate(message = iconv(message, from="UTF-8", to="ascii",sub="byte"))
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups<f0><9f><92><9a><f0><9f><92><9a><f0><9f><92><9a>\n\n"

Can anyone tell me what I'm missing, or do I need to find a different emoji mapping resource? Thanks!
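As a sanity check on the iconv output: each <f0><9f><92><9a> group is the UTF-8 byte sequence of a single emoji, which suggests the underlying bytes survived the import intact and only their printed form changed. A minimal base-R sketch (the variable names are illustrative and not part of the original post) decodes those four bytes back to a code point and then re-creates the byte notation:

```r
# The four bytes that iconv(..., sub = "byte") printed for each emoji
bytes <- as.raw(c(0xf0, 0x9f, 0x92, 0x9a))

# Reassemble the bytes into a string and declare it as UTF-8
heart <- rawToChar(bytes)
Encoding(heart) <- "UTF-8"

# Look up the Unicode code point of the decoded character
code_point <- utf8ToInt(heart)
label <- sprintf("U+%X", code_point)
print(label)
## [1] "U+1F49A"

# Going the other way reproduces the notation seen in the question
byte_form <- iconv(heart, from = "UTF-8", to = "ASCII", sub = "byte")
print(byte_form)
## [1] "<f0><9f><92><9a>"
```

U+1F49A is the green heart, so a lookup table keyed on either the character itself or this byte notation should be able to match it, provided the table uses the same form.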

Recommended answer

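The goal isn't entirely clear, but I suspect that giving up on representing the emoji correctly and reducing them to bytes isn't the best approach. For example, if you want to convert the emoji to their descriptions, you could do something like this: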

library(stringi)   # stri_* string functions
library(magrittr)  # %>% pipe

x <- "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"

## read emoji info and get rid of documentation lines
readLines("https://unicode.org/Public/emoji/5.0/emoji-test.txt",
          encoding="UTF-8") %>%
    stri_subset_regex(pattern = "^[^#]") %>%
    stri_subset_regex(pattern = ".+") -> emoji

## get the emoji characters and clean them up
emoji %>%
    stri_extract_all_regex(pattern = "# *.{1,2} *") %>%
    stri_replace_all_fixed(pattern = c("*", "#"),
                           replacement = "",
                           vectorize_all=FALSE) %>%
    stri_trim_both() -> emoji.chars

## get the emoji character descriptions
emoji %>%
    stri_extract_all_regex(pattern = "#.*$") %>%
    stri_replace_all_regex(pattern = "# *.{1,2} *",
                           replacement = "") %>%
    stri_trim_both() -> emoji.descriptions


## replace emoji characters with their descriptions.
stri_replace_all_regex(x,
                       pattern = emoji.chars,
                       replacement = emoji.descriptions,
                       vectorize_all=FALSE)

## [1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cupsgreen heartgreen heartgreen heart"
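
The same replace-by-lookup idea can be tried offline. The sketch below is a variant under stated assumptions, not the answer's code: it uses base R gsub() instead of stringi, a hypothetical two-entry table (emoji_chars, emoji_names) in place of the full emoji-test.txt list, and it wraps each description in colons so that adjacent emoji stay separable.

```r
# Hypothetical two-entry lookup; a real table would be built from emoji-test.txt.
# The \U escapes keep the strings marked as UTF-8 regardless of the file encoding.
emoji_chars <- c("\U0001F49A", "\U0001F602")
emoji_names <- c("green heart", "face with tears of joy")

# Replace each emoji in `text` with " :description:" using literal matching
describe_emoji <- function(text, chars, names) {
  for (i in seq_along(chars)) {
    text <- gsub(chars[i], paste0(" :", names[i], ":"), text, fixed = TRUE)
  }
  text
}

x <- "Peanut Butter Cups\U0001F49A\U0001F49A\U0001F49A"
result <- describe_emoji(x, emoji_chars, emoji_names)
print(result)
## [1] "Peanut Butter Cups :green heart: :green heart: :green heart:"
```

Matching with fixed = TRUE avoids treating characters such as * or # as regex metacharacters, and the colon delimiters make it easier to tokenise the descriptions later.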
