Rearranging and aggregating rows in R

Time: 2017-11-19 04:37:06

Tags: r dataframe tidyr

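Edit: I have trimmed the question down to a smaller example.

I am trying to aggregate a data frame into the form shown below, but I am stuck.

This is ISDN log output from a phone system, so calls that happen at the same time are interleaved throughout the log. The calls are incoming, not outgoing.

The data frame looks like this: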

"V1" "V2""V3""V4"   "V5"        "V6"        "V7"                   "V8"
"1" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189056:" "Oct  2 00:00:01.326 AEDST: ISDN Se0/0/0:15 Q931: RX <- SETUP pd = 8  callref = 0x174E "
"2" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189057:" "  Bearer Capability i = 0x8090A3 "
"3" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189058:" "      Standard = CCITT "
"4" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189059:" "      Transfer Capability = Speech  "
"5" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189060:" "      Transfer Mode = Circuit "
"6" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189061:" "      Transfer Rate = 64 kbit/s "
"7" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189062:" "  Channel ID i = 0xA1839B "
"8" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189063:" "      Preferred, Channel 27 "
"9" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189064:" "  Calling Party Number i = 0x2183, '00123456789' "
"10" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189065:" "     Plan:ISDN, Type:National "
"11" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189066:" " Called Party Number i = 0xC1, '0123456' "
"12" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189067:" "     Plan:ISDN, Type:Subscriber(local) "
"13" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189068:" " Sending Complete"
"14" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189069:" "Oct  2 00:00:01.334 AEDST: ISDN Se0/0/0:15 Q931: TX -> CALL_PROC pd = 8  callref = 0x974E "
"15" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189070:" " Channel ID i = 0xA9839B "
"16" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189071:" "     Exclusive, Channel 27"
"17" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189072:" "Oct  2 00:00:01.350 AEDST: ISDN Se0/0/0:15 Q931: TX -> ALERTING pd = 8  callref = 0x974E "
"18" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189073:" " Progress Ind i = 0x8088 - In-band info or appropriate now available "
"19" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189074:" "Oct  2 00:00:01.358 AEDST: ISDN Se0/0/0:15 Q931: TX -> CONNECT pd = 8  callref = 0x974E"
"20" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189075:" "Oct  2 00:00:01.382 AEDST: ISDN Se0/0/0:15 Q931: RX <- CONNECT_ACK pd = 8  callref = 0x174E"
"21" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488302:" "Oct  2 00:00:18.210 AEDST: ISDN Se0/0/0:15 Q931: TX -> DISCONNECT pd = 8  callref = 0x9AC7 "
"22" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488303:" " Cause i = 0x8090 - Normal call clearing"
"23" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488304:" "Oct  2 00:00:18.290 AEDST: ISDN Se0/0/0:15 Q931: RX <- RELEASE pd = 8  callref = 0x1AC7"
"24" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488305:" "Oct  2 00:00:18.314 AEDST: ISDN Se0/0/0:15 Q931: TX -> RELEASE_COMP pd = 8  callref = 0x9AC7"
"25" "Oct" "" "2" "00:00:21" "10.20.5.31" "82189076:" "Oct  2 00:00:21.053 AEDST: ISDN Se0/1/0:15 Q931: RX <- SETUP pd = 8  callref = 0x093A "

I would like the data set to look like this:

    "V1" "V2""V3""V4"   "V5"        "V6"        "V7"    "UniqueId"       "V8"
    "1" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189056:" "0x174E" "Oct  2 00:00:01.326 AEDST: ISDN Se0/0/0:15 Q931: RX <- SETUP pd = 8  callref = 0x174E "
    "2" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189057:" "0x174E" " Bearer Capability i = 0x8090A3 "
    "3" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189058:" "0x174E" "      Standard = CCITT "
   ....
    "21" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488302:" "0x9AC7" "Oct  2 00:00:18.210 AEDST: ISDN Se0/0/0:15 Q931: TX -> DISCONNECT pd = 8  callref = 0x9AC7 "

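To reiterate:

  • The call reference is the only way to identify a call in this data set. It appears as callref, e.g. 0x174E (this is the only way to find a unique call within the data set). It becomes the new column (UniqueId) in the requested data frame.

  • Every row below a callref row gets the same callref id pasted into the new column, until a row is reached that states the same callref or a different callref.

  • Bonus points for anyone who can collapse the rows into a single row each time a callref appears. Note that this can happen in several different states (a row containing a callref may also contain TX -> CALL_PROC, TX -> ALERTING, TX -> CONNECT, RX <- CONNECT_ACK and several others).

For example, I have merged the V7 column of rows 1, 2 and 3, since they belong to the same callref: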

    "V1" "V2""V3""V4"   "V5"        "V6"        "V7"    "UniqueId"       "V8"
    "1" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189056:" "0x174E" "Oct  2 00:00:01.326 AEDST: ISDN Se0/0/0:15 Q931: RX <- SETUP pd = 8  callref = 0x174E \n Bearer Capability i = 0x8090A3 \n Standard = CCITT"

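Thanks for any answers.

1 answer:

Answer 0: (score: 1)

This answer is a bit messy, but I did my best.

You can skip my read.fwf step, since you did the same thing with str_split. I just wanted to get the data into a workable format.

I first read the data in, splitting it into a few columns: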

    example1 <- read.fwf("ex.csv", widths = c(1, 6, 10, 10, 10, 1000), strip.white = TRUE)

Convert everything to character strings rather than factors, drop the first (header) row, then rename the columns.

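    library(dplyr)
    library(tidyr)

    example <- example1 %>%
      mutate_all(.funs = as.character) %>%  # factors -> character
      slice(-1) %>%                         # drop the header row
      # keep columns 2 to 6 under new names; column 1 is dropped
      select(Date = 2,
             Time = 3,
             IP = 4,
             id = 5,
             Description = 6)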

Then I index the rows where a callref appears and group the rows into the blocks of text that each of those rows starts.

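    # Row numbers of the lines that contain a callref; each one starts a new block
    x <- which(grepl("callref", example$Description))

    example <- example %>%
      mutate(callref = ifelse(grepl("callref", Description), 1, 0),
             # Repeat each start row for the length of its block
             # (assumes the first row is a callref line)
             group = rep(x, times = diff(c(x, nrow(.) + 1))))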
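The same grouping can probably be written more simply as a running count of the callref lines. The following is only a base R sketch under that assumption; `desc` is a made-up, abbreviated sample of the log lines above, and `is_header` and `group` are illustrative names rather than columns from the real data.

```r
# Hypothetical, abbreviated sample of the message text column
desc <- c(
  "Q931: RX <- SETUP pd = 8  callref = 0x174E",
  "  Bearer Capability i = 0x8090A3",
  "      Standard = CCITT",
  "Q931: TX -> CALL_PROC pd = 8  callref = 0x974E",
  " Channel ID i = 0xA9839B"
)

# TRUE on every line that starts a new message block
is_header <- grepl("callref", desc, fixed = TRUE)

# The running count of header lines labels each block: 1 1 1 2 2
group <- cumsum(is_header)
print(group)
```

Unlike the rep() version, this does not need the block lengths to be computed by hand, and rows that come before the first callref line simply end up in group 0.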

With the example df grouped, I collapse the Description text within each group. I think this is the main thing you are trying to do?

    example2 <- example %>%
      group_by(group) %>%
      summarise(text = paste(Description, collapse = "*"))
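If the collapsed text should use "\n" as the separator, as in the question's desired output, the same step might look like this in base R. This is a sketch on a made-up, abbreviated sample (`desc`), not the real log; `group` and `collapsed` are illustrative names.

```r
# Hypothetical, abbreviated sample of the message text column
desc <- c(
  "Q931: RX <- SETUP pd = 8  callref = 0x174E",
  "  Bearer Capability i = 0x8090A3",
  "      Standard = CCITT",
  "Q931: TX -> CALL_PROC pd = 8  callref = 0x974E",
  " Channel ID i = 0xA9839B"
)

# Label each block by counting the callref lines seen so far
group <- cumsum(grepl("callref", desc, fixed = TRUE))

# Trim each line, then paste the lines of every block into one string
collapsed <- vapply(split(trimws(desc), group), paste, character(1),
                    collapse = "\n")
cat(collapsed[["1"]], "\n")
```

The result is a named character vector with one element per block, so it can be joined back to the callref rows by block label.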

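After that I join it back onto the main example df and use separate to pull out some of the important pieces. This way we get RX_TX as well as the callref id. You can split out any other important pieces if needed; I would then suggest tidyr's spread function to turn that information into columns, so you can clean it further for analysis.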

    example3 <- example %>%
      filter(callref == 1) %>%                 # one row per message block
      left_join(example2, by = "group") %>%    # attach the collapsed text
      select(-Description) %>%
      rename(Description = text) %>%
      separate(Description, into = c("firstpart", "RX_TX"), sep = "Q931: ") %>%
      separate(RX_TX, into = c("RX_TX", "Info"), sep = "pd = 8") %>%
      # keep only the hex id that follows "callref = " (no leading spaces)
      mutate(Call_Ref = sub(".*callref = (0x[0-9A-Fa-f]+).*", "\\1", Info))
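For the uncollapsed layout in the question, where every row carries a UniqueId, one option is to extract the id on each callref line and carry it down to the rows that follow. The sketch below uses base R on a made-up, abbreviated sample; `desc`, `ids`, `block` and `unique_id` are illustrative names, and the regular expression assumes the id is always written as 0x followed by hex digits.

```r
# Hypothetical, abbreviated sample of the message text column
desc <- c(
  "Q931: RX <- SETUP pd = 8  callref = 0x174E",
  "  Bearer Capability i = 0x8090A3",
  "      Standard = CCITT",
  "Q931: TX -> CALL_PROC pd = 8  callref = 0x974E",
  " Channel ID i = 0xA9839B"
)

# Lines that state a callref, and the id each of them carries
is_header <- grepl("callref = ", desc, fixed = TRUE)
ids <- sub(".*callref = (0x[0-9A-Fa-f]+).*", "\\1", desc[is_header])

# Block number per row; 0 means "before the first callref line"
block <- cumsum(is_header)

# Carry each id down its block; rows before the first callref stay NA
unique_id <- rep(NA_character_, length(desc))
unique_id[block > 0] <- ids[block[block > 0]]
print(unique_id)
```

On the real data the result could be added as a new column, for example with mutate(UniqueId = unique_id), before any collapsing is done.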