Punctuation and NEAR query

Time: 2018-07-26 19:36:25

Tags: marklogic marklogic-9

When I turn on punctuation-insensitive in cts:word-query, a hyphenated word is still split into two words, even for a NEAR query.

let $xml :=

  <abstracts count="1">
            <abstract>
              <abstract_text count="1">
                <p>We assessed the impact of a pharmacotherapy follow-up programme on key safety points [adverse events (AE) 
                and drug administration] in outpatients treated with oral antineoplastic agents (OAA). We performed a comparative, 
                interventional, quasi-experimental study of outpatients treated with OAA in a Spanish hospital to compare pre-intervention 
                group patients (not monitored by pharmacists during 2011) with intervention group patients (prospectively monitored by 
                pharmacists during 2013). AE data were collected from medical records. Follow-up was 6 months, and 249 patients were 
                included (pre-intervention, 115; intervention, 134). After the first month, AE were detected in 86.5% of patients 
                in the pre-intervention group and 80.6% of patients in the intervention group, P = 0.096. During the remaining months, 
                79.0% patients had at least one AE in the pre-intervention group compared with 78.0% in the intervention group, P = 0.431. 
                AE were more prevalent with sorafenib and sunitinib. In total, 173 drug interactions were recorded (pre-intervention, 80; 
                intervention, 93; P = 0.045). Drug interactions were more frequent with erlotinib and gefitinib; food interactions were 
                more common with sorafenib and pazopanib. Our follow-up of cancer outpatients revealed a reduction in severe AE and major 
                drug interactions, thus helping health professionals to monitor the safety of OAA.</p>
              </abstract_text>
            </abstract>
          </abstracts>

(: Match a study-design term within 3 words of "stud*" or "trial*", both inside abstract_text. :)
let $q3 :=
  cts:near-query(
    (
      cts:element-query(xs:QName("abstract_text"),
        cts:word-query(
          ("Controlled", "randomized", "randomised", "clinical", "masked", "blind*", "multi center",
           "open label*", "compar*", "cross over", "placebo", "post market", "meta analysis",
           "volunteer*", "prospective"),
          ("case-insensitive", "punctuation-insensitive", "wildcarded"))
      ),
      cts:element-query(xs:QName("abstract_text"),
        cts:word-query(("stud*", "trial*"),
          ("case-insensitive", "punctuation-insensitive", "wildcarded"))
      )
    ),
    3  (: maximum distance, in words, between the two matches :)
  )

return 
  cts:highlight($xml,$q3, <b>{$cts:text}</b>)

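When I set NEAR to 3, it does not match, even though the distance between comparative and study looks like 3 and I have punctuation-insensitive set. But when I change it to 4, it works.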

But when I also switch to punctuation-sensitive, it still does not match at a distance of 3. Why is that?

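I would also like a NEAR distance of 3 to match with punctuation-insensitive. I assumed that once punctuation-insensitive is on, searching for placebo-controlled in a word-query would find every form of the term. But how does the NEAR distance behave when the text contains placebo-controlled rather than placebo controlled in a punctuation-insensitive query?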

1 answer:

Answer 0 (score: 1)

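This actually has less to do with how punctuation is handled in search and more to do with how MarkLogic tokenizes text and indexes the positions of individual words. By default, MarkLogic's tokenization breaks hyphenated phrases into separate words. If you don't like the default behavior, you can use a custom tokenizer to tell MarkLogic how words should be indexed. A very detailed guide on using a custom tokenizer to ignore hyphens during word tokenization is available here.

For your case, I'm not sure I would recommend going down the custom tokenizer route. It can have unintended consequences, and it will not perform as well as the default tokenization. Instead, it probably makes more sense to adapt your code to the way default tokenization works.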
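To check how the default tokenizer treats a given phrase, something like the following sketch could be run in Query Console. It uses the built-in cts:tokenize function and keeps only the cts:word tokens; the variable names and the zero-based numbering are illustrative choices, not part of the original answer, and the result depends on the language and tokenizer configuration of your database.

```xquery
xquery version "1.0-ml";

(: Sketch: list the word tokens of a phrase with zero-based positions. :)
(: $text is just the fragment from the question, used for illustration. :)
let $text := "comparative, interventional, quasi-experimental study"

(: cts:tokenize returns word, punctuation and whitespace tokens; keep the words only. :)
let $words := cts:tokenize($text)[. instance of cts:word]

for $word at $i in $words
return fn:concat($i - 1, ": ", $word)
```

If quasi and experimental come back as separate entries, the hyphenated term occupies two word positions, and any NEAR distance has to account for that.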

Let's look at: comparative, interventional, quasi-experimental study

It will be tokenized as:

Word            | Position
comparative     | 0
interventional  | 1
quasi           | 2
experimental    | 3
study           | 4

So the distance between comparative and study is 4. Note that quasi-experimental is tokenized as two words.
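As a rough way to confirm this, the sketch below runs the same kind of NEAR query at distances 3 and 4 against a small in-memory fragment. The $p element is a hypothetical stand-in for the abstract text, and the word lists are cut down to the two terms that matter here; treat it as an illustration under those assumptions rather than a drop-in replacement for the original query.

```xquery
xquery version "1.0-ml";

(: Hypothetical fragment containing only the sentence under discussion. :)
let $p :=
  <p>We performed a comparative, interventional, quasi-experimental study of outpatients.</p>

let $options := ("case-insensitive", "punctuation-insensitive", "wildcarded")

(: Try both distances and report whether the fragment matches. :)
for $distance in (3, 4)
let $q :=
  cts:near-query(
    (cts:word-query("compar*", $options),
     cts:word-query("stud*", $options)),
    $distance)
return fn:concat("distance ", $distance, ": ", cts:contains($p, $q))
```

Given the positions in the table, one would expect only the distance of 4 to match, which is consistent with what the question reports. In practice this means budgeting one extra word of distance for each hyphenated term that may sit between the two matches.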

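I'm not sure I understood the question in your last paragraph, but I hope this gives you enough information to better understand how default tokenization behaves.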