How can I make n() not count NAs in a dplyr pipeline?

Popularity: 421 Published: 2022-10-16 Tags: r dplyr tidyverse na

Question

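Consider the MWE below, where Amt is the amount (from 1 to 40) recorded for each Food item and a second variable gives the Site of that food item. I want a summary with the median and the count n() for each food, but without counting the NAs.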

MWE

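mwe <- data.frame(
  # Note: rep() has no `size` argument, so it is ignored here. data.frame()
  # then recycles the 2 sites and 13 foods to 884 rows, which is why every
  # Food/Site group in the output below has exactly 34 rows.
  Site = sample(rep(c("Home", "Office"), size = 884)),
  Food = sample(rep(c("Banana","Apple","Egg","Berry","Tomato","Potato","Bean","Pea","Nuts","Onion","Carrot","Cabbage","Eggplant"), size=884)),
  Amt = sample(seq(1, 40, by = 0.25), size = 884, replace = TRUE)
)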
# Randomly introduce up to 100 NAs into Amt (indices are sampled with
# replacement, so duplicates can make the actual number smaller)
random <- sample(seq(1, 884, by = 1), size = 100, replace = TRUE)
mwe$Amt[random] <- NA

Data frame

    Site     Food   Amt
1 Office  Cabbage 16.50
2   Home    Apple 36.00
3 Office      Egg  7.25
4   Home    Onion 16.00
5 Office Eggplant 36.50
6   Home     Nuts    NA

Summary code

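library(dplyr)  # provides %>% and ungroup(), which are called without a prefix

dfsummary <- mwe %>%
  dplyr::group_by(Food, Site) %>%
  dplyr::summarise(
    Median = round(median(Amt, na.rm = TRUE), digits = 2),
    N = n()  # counts every row in the group, including rows where Amt is NA
  ) %>%
  ungroup()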

Output

# A tibble: 6 x 4
  Food   Site   Median     N
  <fct>  <fct>   <dbl> <int>
1 Apple  Home     17      34
2 Apple  Office   22.2    34
3 Banana Home     19.5    34
4 Banana Office   19.9    34
5 Bean   Home     20      34
6 Bean   Office   18      34

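Some food items have NA values in Amt, yet those rows are still included in the N count. I simply do not want to count the rows where Amt is NA.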
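To make the behaviour concrete, here is a minimal sketch on a hypothetical toy data frame (the name `toy` and its values are invented for illustration and are not part of the MWE). It suggests that n() simply returns the number of rows in each group, whatever Amt contains.

```r
library(dplyr)

# Hypothetical toy data: three Apple rows, one of them with a missing Amt
toy <- data.frame(
  Food = c("Apple", "Apple", "Apple", "Egg"),
  Amt  = c(1.5, NA, 3, 2)
)

toy_counts <- toy %>%
  group_by(Food) %>%
  summarise(
    N_rows    = n(),             # number of rows in the group, NA or not
    N_missing = sum(is.na(Amt))  # how many of those rows have a missing Amt
  )

toy_counts
```

Here the Apple group reports three rows even though only two of them carry a usable Amt.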

Recommended answer

We can filter out the NA rows at the top of the pipeline and then run summarise with the rest of the code unchanged

library(dplyr)
mwe %>%
  filter(!is.na(Amt)) %>%
  dplyr::group_by(Food, Site) %>%
  dplyr::summarise(Median = round(median(Amt, na.rm=TRUE), digits=2),
                   N = n()) %>%
  ungroup()

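Another option is to change n() to sum(!is.na(Amt)), which counts only the non-missing values in each group. Unlike the filter approach, this keeps a group whose Amt values are all NA and reports it with N = 0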

mwe %>%
  dplyr::group_by(Food, Site) %>%
  dplyr::summarise(Median = round(median(Amt, na.rm=TRUE), digits=2),
                   N = sum(!is.na(Amt))) %>%
  ungroup()
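The two options give the same counts whenever every group has at least one non-missing value, but they can differ for a group that is entirely NA. The sketch below compares them on a hypothetical toy data frame (`toy`, `by_filter` and `by_sum` are illustrative names, and the data is made up so that one group is all NA). It is meant as a quick check under those assumptions, not as a replacement for the code above.

```r
library(dplyr)

# Hypothetical toy data: every Pea row has a missing Amt
toy <- data.frame(
  Food = c("Apple", "Apple", "Apple", "Pea", "Pea"),
  Amt  = c(2, NA, 4, NA, NA)
)

# Option 1: drop the NA rows first, then count with n()
by_filter <- toy %>%
  filter(!is.na(Amt)) %>%
  group_by(Food) %>%
  summarise(Median = median(Amt), N = n())

# Option 2: keep all rows and count only the non-missing values
by_sum <- toy %>%
  group_by(Food) %>%
  summarise(Median = median(Amt, na.rm = TRUE), N = sum(!is.na(Amt)))

by_filter  # Pea is absent because no Pea rows survive the filter
by_sum     # Pea is kept, with N = 0 and an NA median
```

Which one to prefer depends on whether all-NA groups should appear in the summary: filtering removes them, while sum(!is.na(Amt)) keeps them visible with a zero count.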
