R: Making a custom function apply rowwise in dplyr mutate

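I have a custom boolean function that checks a string (my actual function does more than what is shown below; this is only an illustrative example).

If I use the first version with dplyr::mutate(), it only evaluates the first value and then sets every row to that answer.

I can wrap the function in purrr::map(), but that seems slow on larger datasets. It also doesn't seem to be how mutate is normally meant to work.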

library(tidyverse)

valid_string <- function(string) {
  # Check the length
  if (stringr::str_length(string) != 10) {
    return(FALSE)
  }
  return(TRUE)
}

# Create a tibble to test on
test_tib <- tibble::tibble(string = c("1504915593", "1504915594", "9999999999", "123"),
                           known_valid = c(TRUE, TRUE, TRUE, FALSE))

# Apply the function
test_tib <- dplyr::mutate(test_tib, check_valid = valid_string(string))
test_tib

valid_string2 <- function(string) {
  purrr::map_lgl(string, function(string) {
    # Check the length
    if (stringr::str_length(string) != 10) {
      return(FALSE)
    }
    return(TRUE)
  })
}

# Apply the function
test_tib <- dplyr::mutate(test_tib, check_valid2 = valid_string2(string))
test_tib

I suggest rewriting your function as a vectorized function, like this:

valid_string <- function(string) {
  # Check the length
  ifelse(stringr::str_length(string) != 10, FALSE, TRUE)
}
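If the real check has several conditions, the same idea may still apply: `if` expects a single TRUE/FALSE (since R 4.2.0 a longer condition is an error rather than a warning), whereas comparisons combined with `&` and `|` work on whole vectors. A minimal sketch, assuming a hypothetical second rule (digits only) that is not part of the original question, and using base R's nchar() and grepl() in place of stringr:

```r
# Hypothetical extended check: length 10 AND digits only.
# The digits-only rule is an assumption made for illustration.
valid_string_vec <- function(string) {
  # nchar() and grepl() both operate on the whole vector,
  # so the result has one logical value per input string.
  nchar(string) == 10 & grepl("^[0-9]+$", string)
}

valid_string_vec(c("1504915593", "15049x5593", "123"))
```

Because the function returns one value per element, it can be passed to dplyr::mutate() directly, without a map or Vectorize wrapper.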

Another option is the Vectorize function from base R, which works like this:

valid_string2 <- function(string) {
  # Check the length
  if (stringr::str_length(string) != 10) {
    return(FALSE)
  }
  return(TRUE)
}
valid_string2 <- Vectorize(valid_string2)

Both work fine, but I would recommend the ifelse solution.

# Create a tibble to test on
test_tib <- tibble::tibble(string = c("1504915593", "1504915594", "9999999999", "123"),
                           known_valid = c(TRUE, TRUE, TRUE, FALSE))

# Apply the function
test_tib <- dplyr::mutate(test_tib, check_valid = valid_string(string))
test_tib <- dplyr::mutate(test_tib, check_valid2 = valid_string2(string))
test_tib

  string     known_valid check_valid check_valid2
  <chr>      <lgl>       <lgl>       <lgl>
1 1504915593 TRUE        TRUE        TRUE
2 1504915594 TRUE        TRUE        TRUE
3 9999999999 TRUE        TRUE        TRUE
4 123        FALSE       FALSE       FALSE
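A note on the Vectorize option above: Vectorize() builds a wrapper that calls mapply() over the input, so the scalar function still runs once per element. A roughly equivalent base R sketch uses vapply(), which additionally checks that every call returns exactly one logical value. The names valid_one and valid_string3 are illustrative and do not come from the original answer:

```r
# Scalar check: expects a single string.
valid_one <- function(s) {
  if (nchar(s) != 10) {
    return(FALSE)
  }
  TRUE
}

# Apply the scalar check to each element; FUN.VALUE enforces
# that each call returns exactly one logical value.
valid_string3 <- function(string) {
  vapply(string, valid_one, FUN.VALUE = logical(1), USE.NAMES = FALSE)
}

valid_string3(c("1504915593", "123"))
```

This keeps a complex scalar function untouched, which may be convenient when it cannot easily be rewritten with ifelse.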

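Comments

Thanks, Vectorize seems to work. I'm just running some tests to see whether there is any speed difference between purrr::map, sapply, and Vectorize. I don't think I can use ifelse or dplyr::if_else, because my actual function is much more complex than the one provided.
OK, let us know. If everything works as expected, it would be nice to accept the answer :-)
The three methods seem comparable in speed, but I think Vectorize is the cleanest and minimizes dependencies... Through profiling and microbenchmarking I managed to speed things up by a factor of 20... I probably should have done that first!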
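The kind of comparison described in these comments could be sketched with base R's system.time(). This is only an illustration on synthetic data: the input size and the helper names per_element and whole_vector are assumptions, and timings will vary by machine.

```r
# Synthetic input; the size is chosen arbitrarily for illustration.
x <- rep(c("1504915593", "123"), times = 50000)

# Calls the check once per element, like map, sapply, or Vectorize.
per_element <- function(string) {
  vapply(string, function(s) nchar(s) == 10, logical(1), USE.NAMES = FALSE)
}

# Evaluates the check on the whole vector in one call.
whole_vector <- function(string) nchar(string) == 10

# Confirm both approaches agree before timing them.
stopifnot(identical(per_element(x), whole_vector(x)))

system.time(per_element(x))
system.time(whole_vector(x))
```

The per-element wrappers share the same one-call-per-element cost, which is consistent with the three methods timing similarly; a fully vectorized version avoids that overhead where the logic allows it.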

Is this what you are looking for?

test_tib <- dplyr::mutate(test_tib, checkval = ifelse(nchar(string) != 10, FALSE, TRUE))
