Reading Unicode emoji into R correctly

Popularity: 69 Published: 2023-01-03 Tags: utf-8 unicode text r emoji

Question

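I have a set of comments from Facebook (pulled through a system like Sprinklr) that contain both text and emoji. I am trying to run various analyses on them in R, but I am having trouble ingesting the emoji characters correctly.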

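For example: I have a .csv (encoded in UTF-8) with a message row that contains something like this:

"IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"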

I then ingest it into R as follows:

library(tidyverse)
library(janitor)
raw.fb.comments <- read_csv("data.csv",
                            locale = locale(encoding="UTF-8"))
fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook")
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a

"

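Now, from what I understand from other sources, I need to convert this UTF-8 to ASCII, which I can then use to link the data with other emoji resources (such as the wonderful emojidictionary). For the join to work, I need to get the emoji into an R encoding like this: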

<e2><9d><a4><ef><b8><8f>

However, adding the usual step (using iconv) does not get me there:

fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook") %>%
  mutate(message = iconv(message, from="UTF-8", to="ascii",sub="byte"))
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups<f0><9f><92><9a><f0><9f><92><9a><f0><9f><92><9a>

"

Can anyone tell me what I am missing, or do I need to find a different emoji mapping resource? Thanks!
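As a hedged aside, the byte notation in the output above can be checked in base R. The sketch below assumes the trailing characters in the message are the green heart (U+1F49A), which is what the bytes f0 9f 92 9a decode to. `decode_byte_key` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Encode: the green heart, written as an escape so the script stays ASCII.
green_heart <- "\U0001F49A"
green_bytes <- charToRaw(green_heart)   # the four UTF-8 bytes f0 9f 92 9a
green_key <- iconv(green_heart, from = "UTF-8", to = "ASCII", sub = "byte")

# Decode: turn a "<xx><xx>" key back into characters.
# decode_byte_key is a hypothetical helper for this sketch.
decode_byte_key <- function(key) {
  # Pull out every two-digit hex pair, convert to raw bytes, rebuild the string.
  hex <- regmatches(key, gregexpr("[0-9a-f]{2}", key))[[1]]
  out <- rawToChar(as.raw(strtoi(hex, base = 16L)))
  Encoding(out) <- "UTF-8"   # mark the bytes as UTF-8 so R treats them as text
  out
}

# The sample key from the question decodes to U+2764 followed by U+FE0F,
# which is a different emoji from the green heart in the message.
target <- decode_byte_key("<e2><9d><a4><ef><b8><8f>")
utf8ToInt(target)

# Round trip: the byte key produced by iconv decodes back to the green heart.
decode_byte_key(green_key) == green_heart
```

If this reading is right, the iconv step is already producing the `<xx>` form, and the sample key simply belongs to another emoji. Whether a given lookup table uses the same byte notation is something to check against that table.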

Recommended answer

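The goal is not entirely clear, but I suspect that giving up on representing the emoji correctly and reducing them to bytes is not the best way to go. For example, if you want to convert emoji to their descriptions, you could do something like this:

library(stringi)
library(magrittr)  # provides %>%

x <- "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"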

## read emoji info and get rid of documentation lines
readLines("https://unicode.org/Public/emoji/5.0/emoji-test.txt",
          encoding="UTF-8") %>%
    stri_subset_regex(pattern = "^[^#]") %>%
    stri_subset_regex(pattern = ".+") -> emoji

## get the emoji characters and clean them up
emoji %>%
    stri_extract_all_regex(pattern = "# *.{1,2} *") %>%
    stri_replace_all_fixed(pattern = c("*", "#"),
                           replacement = "",
                           vectorize_all=FALSE) %>%
    stri_trim_both() -> emoji.chars

## get the emoji character descriptions
emoji %>%
    stri_extract_all_regex(pattern = "#.*$") %>%
    stri_replace_all_regex(pattern = "# *.{1,2} *",
                           replacement = "") %>%
    stri_trim_both() -> emoji.descriptions


## replace emoji characters with their descriptions.
stri_replace_all_regex(x,
                       pattern = emoji.chars,
                       replacement = emoji.descriptions,
                       vectorize_all=FALSE)

## [1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cupsgreen heartgreen heartgreen heart"
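The same replacement idea can be tried offline with a small table. This is a minimal sketch, not the answer's exact method: `emoji_lookup` is a hypothetical two-row table standing in for the data parsed from emoji-test.txt, the sample string is shortened, and it uses `stri_replace_all_fixed` rather than regex matching so that characters such as `*` in keycap emoji are not read as regex syntax.

```r
library(stringi)

# Hypothetical lookup table; real data would come from emoji-test.txt.
emoji_lookup <- data.frame(
  char = c("\U0001F49A", "\u2764\uFE0F"),
  description = c("green heart", "red heart"),
  stringsAsFactors = FALSE
)

# Shortened sample message ending in two green hearts.
x <- "Peanut Butter Cups\U0001F49A\U0001F49A"

# With vectorize_all = FALSE, every pattern is applied in turn to the string.
# Each description is wrapped in " :...:" so adjacent emoji stay separable.
described <- stri_replace_all_fixed(
  x,
  pattern = emoji_lookup$char,
  replacement = paste0(" :", emoji_lookup$description, ":"),
  vectorize_all = FALSE
)
described
```

The delimiters around each description are a design choice for readability. Dropping them gives run-together text like the output shown above.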
