Analyzing strings character by character to calculate the number of possible words in R

Date: 2013-06-21 10:03:18

Tags: r string

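I am calculating the number of possible words for a given list of syllable-combination strings. The list of syllable combinations looks like this: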

syllable_combinations <- c("C", "CC", "CCCV-CCV", "CCCV-CCV-CV", "CCCV-CV-CCV", "CCCV-CCV-CCV-CV", "CCCV-CC-CV", "CCCV-CCV-C", "CCCV-CV", "CV-C-CCCV")

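Based on this list, I want to calculate the number of possible words in English according to phonotactic rules. To do this, I need to go through the items in the list one by one and calculate the number of possible words for each syllable combination.

To generate the number of possible words for a given syllable combination, I need to walk through the combination and look at each character in relation to its environment. For example, for a syllable combination such as "CV-CV", I need to do the following: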

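  1. Determine that the word begins with a single consonant C (rather than 2 or 3 consonants);
  2. Determine that this first single consonant is followed by a vowel V;
  3. Determine that the word continues with another syllable (indicated by the hyphen);
  4. Determine that this second syllable also begins with a single consonant C;
  5. Determine that it is followed by another vowel V;
  6. Determine that the word ends there.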
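
The steps above can be sketched roughly as follows. This is only an illustration, not code from the original post: `describe_combination` is a hypothetical helper name, and it assumes that every syllable consists only of the symbols `C` and `V` and that syllables are separated by hyphens.

```r
# Hypothetical helper: break a syllable combination into syllables and,
# within each syllable, into runs of consecutive identical symbols.
describe_combination <- function(combination) {
  # Split on the hyphen to get the individual syllables
  syllables <- strsplit(combination, "-", fixed = TRUE)[[1]]
  lapply(syllables, function(syl) {
    # rle() collapses e.g. C, C, C, V into (C x 3, V x 1)
    runs <- rle(strsplit(syl, "")[[1]])
    data.frame(symbol = runs$values, length = runs$lengths,
               stringsAsFactors = FALSE)
  })
}

slots <- describe_combination("CV-CV")
length(slots)  # number of syllables in the combination
slots[[1]]     # runs of C and V in the first syllable
```

Each run (for example "a single C" or "three Cs") could then be looked up in a table of how many sounds or clusters are allowed in that position.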

This information needs to be linked to information about which sounds can occur in these positions:

    number_of_vowels <- 20
    number_of_initial_consonants_length_1 <- 22
    number_of_initial_consonants_length_2 <- 47
    number_of_final_consonants_length_1 <- 24
    

To calculate the number of possible English words with the "CVCV" syllable structure:

    number_of_CVCV_words <- number_of_initial_consonants_length_1*number_of_vowels*number_of_initial_consonants_length_1*number_of_vowels
    
    number_of_CVCV_words
    193600
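
The same multiplication can be written so that it follows the structure string itself. This is a minimal sketch, assuming (as in the calculation above) that every `C` in the structure is a single initial consonant; `slot_counts` and `structure` are names introduced here for illustration.

```r
number_of_vowels <- 20
number_of_initial_consonants_length_1 <- 22

# Number of options for each kind of slot (illustrative lookup vector)
slot_counts <- c(C = number_of_initial_consonants_length_1,
                 V = number_of_vowels)

# Split "CVCV" into single characters and multiply the options per slot
structure <- strsplit("CVCV", "")[[1]]
number_of_CVCV_words <- prod(slot_counts[structure])
number_of_CVCV_words
```

This gives the same 193600 as the hand-written product, but it would not be correct for clusters such as `CC`, where the count is not simply the single-consonant count squared.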
    

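Any suggestions on how to do this?

I have made some progress on this, but I have run into some problems.

First, split the syllable combinations into individual syllables: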

    split_syllables <- c()
    
    for(i in 1:length(syllable_combinations)){
    strsplit(as.character(syllable_combinations[i]), split = "-") -> split_syllable
    split_syllables <- append(split_syllables, split_syllable)
    }
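
As a side note, `strsplit()` is vectorised over its first argument, so the loop is probably not needed. A minimal sketch, using a shortened version of the list for brevity:

```r
# Shortened example list (subset of the combinations in the question)
syllable_combinations <- c("C", "CC", "CCCV-CCV", "CV-C-CCCV")

# One call returns a list with one character vector per combination
split_syllables <- strsplit(syllable_combinations, split = "-", fixed = TRUE)

lengths(split_syllables)  # number of syllables in each combination
split_syllables[[4]]      # syllables of the last combination
```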
    

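Next, a function that matches each syllable (there is a finite number of unique syllables, so this is feasible). For a given syllable structure, the counter1 variable gives the number of possible sound combinations in English: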

    detect_syllables <- function(syllable){
    if(syllable == "C") {
    counter1 <- 25
    } else if(syllable == "CC") {
    counter1 <- 528
    } else if(syllable == "CCCV") {
    counter1 <- 200 
    } else if(syllable == "CCV") {
    counter1 <- 940
    } else if(syllable == "CV") {
    counter1 <- 440
    } else if(syllable == "CVC") {
    counter1 <- 10560
    } else 
    print(syllable, "syllable not matched")
    }
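
One possible alternative to the if/else chain is a named vector used as a lookup table, with the function returning the count instead of assigning it to a local variable. This is only a sketch: the counts are copied from the function above, and `syllable_counts` is a name introduced here.

```r
# Counts per syllable type, taken from the if/else chain above
syllable_counts <- c(C = 25, CC = 528, CCCV = 200,
                     CCV = 940, CV = 440, CVC = 10560)

detect_syllables <- function(syllable) {
  # Indexing by name returns NA for syllable types that are not in the table
  count <- unname(syllable_counts[syllable])
  if (anyNA(count)) {
    warning("syllable not matched: ",
            paste(syllable[is.na(count)], collapse = ", "))
  }
  count  # returned to the caller
}

detect_syllables(c("CV", "CCV"))
```

Because the lookup is vectorised, the function accepts a whole vector of syllables at once.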
    

Then, functions that run detect_syllables for each syllable in the original syllable combination:

    one_syllable <- function(first_syllable){
    lapply(split_syllables[[i]][1], FUN = detect_syllables)
    counter1 -> first_syl
    first_syl -> number1
    print(number1)
    }
    
    two_syllables <- function(first_syllable, second_syllable){
    lapply(split_syllables[[i]][1], FUN = detect_syllables)
    counter1 -> first_syl
    lapply(split_syllables[[i]][2], FUN = detect_syllables)
    counter1 -> second_syl
    first_syl*second_syl -> number2
    print(number2) 
    }
    
    three_syllables <- function(first_syllable, second_syllable, third_syllable){
    lapply(split_syllables[[i]][1], FUN = detect_syllables)
    counter1 -> first_syl
    lapply(split_syllables[[i]][2], FUN = detect_syllables)
    counter1 -> second_syl
    lapply(split_syllables[[i]][3], FUN = detect_syllables)
    counter1 -> third_syl
    first_syl*second_syl*third_syl -> number3
    print(number3)
    }
    
    four_syllables <- function(first_syllable, second_syllable, third_syllable, fourth_syllable){
    lapply(split_syllables[[i]][1], FUN = detect_syllables)
    counter1 -> first_syl
    lapply(split_syllables[[i]][2], FUN = detect_syllables)
    counter1 -> second_syl
    lapply(split_syllables[[i]][3], FUN = detect_syllables)
    counter1 -> third_syl
    lapply(split_syllables[[i]][4], FUN = detect_syllables)
    counter1 -> fourth_syl
    first_syl*second_syl*third_syl*fourth_syl -> number4
    print(number4)
    }
    

A for loop to make sure the detect_syllables function is applied appropriately:

    for(i in 1:10){
    if(length(split_syllables[[i]]) == 1) { 
    lapply(split_syllables[[i]][1], FUN = one_syllable)
    } else if(length(split_syllables[[i]]) == 2) {
    lapply(split_syllables[[i]][1], split_syllables[[i]][2], FUN = two_syllables)
    } else if(length(split_syllables[[i]]) == 3) {
    lapply(split_syllables[[i]][1], split_syllables[[i]][2], split_syllables[[i]][3], FUN = three_syllables)
    } else if(length(split_syllables[[i]]) == 4) {
    lapply(split_syllables[[i]][1], split_syllables[[i]][2], split_syllables[[i]][3], split_syllables[[i]][4], FUN = four_syllables)
    } else 
    print("number of syllables is bigger than 4")
    }
    

However, when I try to run the for loop, I get the following error message:

    Error in four_syllables(split_syllables[[1]]) : object 'counter1' not found
    

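I realise this has to do with the environment in which 'counter1' is evaluated, as described in "Using get inside lapply, inside a function", but I can't work out how to fix it. If I try to point the functions at the right environment, lapply doesn't seem to like that either (Error in FUN("C"[[1L]], ...) : unused argument(s)).

Without lapply(), I can get a result, but in a very inelegant way. If anyone has another solution, I would be glad to hear it.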
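
For what it is worth, the "object 'counter1' not found" part can be reproduced in isolation: a variable assigned inside a function is local to that call and is not visible to the caller. The sketch below uses two made-up functions, `set_counter` and `get_counter`, purely to show the difference between assigning locally and returning a value.

```r
set_counter <- function() {
  counter1 <- 440   # local to this call; discarded when the function returns
  invisible(NULL)
}

get_counter <- function() {
  counter1 <- 440
  counter1          # the value of the last expression is returned
}

set_counter()       # does not create a counter1 in the calling environment

first_syl <- get_counter()  # capture the return value instead
first_syl
```

Under that reading, capturing the return value of detect_syllables (rather than expecting counter1 to exist afterwards) would avoid the environment issue altogether.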

    for(i in 1:10){
    if(length(split_syllables[[i]]) == 1) { 
    detect_syllables(split_syllables[[i]][1]) -> counter1
    counter1 -> first_syl
    first_syl -> number1
    print(number1)
    } else if(length(split_syllables[[i]]) == 2) {
    detect_syllables(split_syllables[[i]][1]) -> counter1
    counter1 -> first_syl
    detect_syllables(split_syllables[[i]][2]) -> counter1
    counter1 -> second_syl
    first_syl*second_syl -> number2
    print(number2)
    } else if(length(split_syllables[[i]]) == 3) {
    detect_syllables(split_syllables[[i]][1]) -> counter1
    counter1 -> first_syl
    detect_syllables(split_syllables[[i]][2]) -> counter1
    counter1 -> second_syl
    detect_syllables(split_syllables[[i]][3]) -> counter1
    counter1 -> third_syl
    first_syl*second_syl*third_syl -> number3
    print(number3)
    } else if(length(split_syllables[[i]]) == 4) {
    detect_syllables(split_syllables[[i]][1]) -> counter1
    counter1 -> first_syl
    detect_syllables(split_syllables[[i]][2]) -> counter1
    counter1 -> second_syl
    detect_syllables(split_syllables[[i]][3]) -> counter1
    counter1 -> third_syl
    detect_syllables(split_syllables[[i]][4]) -> counter1
    counter1 -> fourth_syl
    first_syl*second_syl*third_syl*fourth_syl -> number4
    print(number4)
    } else 
    print("number of syllables is bigger than 4")
    }
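
The loop above could probably be collapsed into a lookup followed by `prod()`, which also removes the limit of four syllables. This is a sketch under the same assumption the loop makes, namely that the counts of the individual syllables can simply be multiplied. The counts are those from detect_syllables; `syllable_counts` and `words_per_combination` are names introduced here.

```r
syllable_combinations <- c("C", "CC", "CCCV-CCV", "CCCV-CCV-CV", "CCCV-CV-CCV",
                           "CCCV-CCV-CCV-CV", "CCCV-CC-CV", "CCCV-CCV-C",
                           "CCCV-CV", "CV-C-CCCV")

# Counts per syllable type, copied from detect_syllables in the question
syllable_counts <- c(C = 25, CC = 528, CCCV = 200,
                     CCV = 940, CV = 440, CVC = 10560)

split_syllables <- strsplit(syllable_combinations, "-", fixed = TRUE)

# For each combination, look up every syllable and multiply the counts
words_per_combination <- vapply(
  split_syllables,
  function(syls) prod(syllable_counts[syls]),
  numeric(1)
)
names(words_per_combination) <- syllable_combinations
words_per_combination
```

A syllable type missing from the table would yield `NA` for that combination rather than an error.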
    

1 answer:

Answer 0 (score: 0)

I'm not sure I follow everything you are trying to do, but here is some code that may help you get started.

# save first two syllables
split_combs <- strsplit(syllable_combinations, "-")
syl1 <- sapply(split_combs, "[", 1)
syl2 <- sapply(split_combs, "[", 2)

# function to look at how a string starts
check.start <- function(string, start) {
    # does the string start with this?
    tfn <- substring(string, 1, nchar(start))==start
    tfn[is.na(tfn)] <- FALSE
    tfn
}

# show all syllable combinations with the first two syllables starting with CV
syllable_combinations[check.start(syl1, "CV") & check.start(syl2, "CV")]
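
A possible variation on the same idea uses base R's `startsWith()` (available since R 3.3.0). This is only a sketch: the list is shortened, `"CV-CV"` has been added purely for illustration so that something matches, and `starts_cv` is a hypothetical helper name.

```r
# Shortened list; "CV-CV" is added here only for illustration
syllable_combinations <- c("C", "CC", "CCCV-CCV", "CV-C-CCCV", "CV-CV")

split_combs <- strsplit(syllable_combinations, "-", fixed = TRUE)
syl1 <- sapply(split_combs, "[", 1)
syl2 <- sapply(split_combs, "[", 2)  # NA where there is no second syllable

# startsWith() gives NA for missing syllables; %in% TRUE turns NA into FALSE
starts_cv <- function(x) startsWith(x, "CV") %in% TRUE

syllable_combinations[starts_cv(syl1) & starts_cv(syl2)]
```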