Can quantifiers be used in regex replacement in R?

Popularity: 75 · Posted: 2023-01-03 · Tags: regex pcre r character-replacement

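Question

My goal is to replace a string with a symbol repeated as many times as the string has characters. In the same way that one can convert letters to upper case with \U\1, if my pattern were "...(*)...", my replacement for what (*) captures would be something like x\q1 or {\q1}x, so that I would get as many x as there are characters captured by *.

Is this possible?

I am mainly thinking of sub and gsub, but you can answer with other libraries such as stringi or stringr. Feel free to use perl = TRUE or perl = FALSE and any other options, as convenient.

I assume the answer may be negative, since the options seem rather limited (?gsub):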

a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
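
As a quick illustration of what this documented replacement syntax does allow, here is a minimal sketch using a backreference and the perl-only case-conversion escapes. The input strings and variable names are made up for the example. Note that none of these escapes can repeat a character a given number of times.

```r
# Backreferences: swap two captured words
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
print(swapped)      # [1] "world hello"

# \\U upper-cases the rest of the replacement (perl = TRUE only)
upper <- gsub("(\\w+)", "\\U\\1", "hello world", perl = TRUE)
print(upper)        # [1] "HELLO WORLD"

# \\E ends the case conversion, so only the first letter is upper-cased
capitalised <- gsub("(\\w)(\\w*)", "\\U\\1\\E\\2", "hello world", perl = TRUE)
print(capitalised)  # [1] "Hello World"
```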

The main quantifiers are (?base::regex):

?
    The preceding item is optional and will be matched at most once.

*
    The preceding item will be matched zero or more times.

+
    The preceding item will be matched one or more times.

{n}
    The preceding item is matched exactly n times.

{n,}
    The preceding item is matched n or more times.

{n,m}
    The preceding item is matched at least n times, but not more than m times.
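
To make the quantifier list concrete, the following small sketch tests each quantifier against a few made-up strings with grepl. The vector x and the result names are purely illustrative.

```r
x <- c("ac", "abc", "abbc", "abbbc")

optional <- grepl("^ab?c$", x)      # b zero or one time:   TRUE  TRUE  FALSE FALSE
star     <- grepl("^ab*c$", x)      # b zero or more times: TRUE  TRUE  TRUE  TRUE
plus     <- grepl("^ab+c$", x)      # b one or more times:  FALSE TRUE  TRUE  TRUE
exactly  <- grepl("^ab{2}c$", x)    # b exactly twice:      FALSE FALSE TRUE  FALSE
at_least <- grepl("^ab{2,}c$", x)   # b two or more times:  FALSE FALSE TRUE  TRUE
between  <- grepl("^ab{1,2}c$", x)  # b once or twice:      FALSE TRUE  TRUE  FALSE
```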

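OK, but there seems to be an option (not in PCRE, and I am not sure whether it is in Perl or elsewhere), (*), that captures the number of characters the star quantifier was able to match (I found it at https://www.rexegg.com/regex-quantifier-capture.html), so that \q1 (same reference) could then be used to refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,}, but I am not sure whether that is really relevant to what I am interested in.

EDIT UPDATE:

Since commenters asked, I am updating my question with a specific example taken from another interesting question. I modified the example a bit. Say we have a <- "I hate extra spaces elephant", and we are interested in keeping a single space between words and the first 5 characters of each word (up to here, this is the original question), but then a dot for every other character (I am not sure whether this was expected in the original question, but it does not matter), so the resulting string would be "I hate extra space. eleph..." (one dot for the last s in spaces and 3 dots for the 3 letters ant at the end). So I started by keeping the first 5 characters with: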

gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"

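How should I replace the exact number of characters matched by \S* with dots or any other symbol?

Recommended answer

You cannot use quantifiers in the replacement pattern, nor any information about how many characters they matched.

What you need is a \G-based PCRE pattern to find consecutive matches after a specific position in the string: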

a <- "I hate extra spaces elephant"
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)

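See the R demo and the regex demo.

Details

(?:\G(?!^)|(?<!\S)\S{5}) - the end of the previous successful match, or five non-whitespace characters not preceded by a non-whitespace character
\K - a match reset operator that discards the text matched so far
\S - any non-whitespace character.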
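
Since the replacement string itself cannot count characters, another route is to compute the replacement in R code rather than in the regex. The following is only a sketch of that alternative, not the approach from the answer above. It assumes base R 3.3.0 or later for strrep(), and the variable names b and m are illustrative.

```r
a <- "I hate extra spaces elephant"
b <- a  # work on a copy so the original string stays untouched

# Locate every word longer than five characters
m <- gregexpr("\\S{6,}", b, perl = TRUE)

# Rebuild each match: keep the first 5 characters, then one dot per extra character
regmatches(b, m) <- lapply(regmatches(b, m), function(w) {
  paste0(substr(w, 1, 5), strrep(".", nchar(w) - 5))
})

print(b)  # [1] "I hate extra space. eleph..."
```

Here the number of dots comes from nchar() in ordinary R code, which is the information the replacement pattern alone has no access to.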
