How to extract the text between two strings multiple times in a sentence using Notepad++

Time: 2015-04-12 16:19:21

Tags: notepad++

I have to extract the text between

</cons> and <cons

which occurs multiple times in each sentence of a text file, using Notepad++. My sample data looks like this:

<abstract>
<sentence>The <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> interacts with <cons lex="non-polymorphic_region" sem="G#protein_domain_or_region">non-polymorphic regions</cons> of <cons lex="major_histocompatibility_complex_class_II_molecule" sem="G#protein_family_or_group">major histocompatibility complex class II molecules</cons> on <cons lex="antigen-presenting_cell" sem="G#cell_type">antigen-presenting cells</cons> and contributes to <cons lex="T_cell_activation" sem="G#other_name">T cell activation</cons>.</sentence>
<sentence>We have investigated the effect of <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> on <cons lex="T_cell_activating_signal" sem="G#other_name">T cell activating signals</cons> in a <cons lex="lymphoma_model" sem="G#other_name">lymphoma model</cons> using <cons lex="monoclonal_antibody" sem="G#protein_family_or_group">monoclonal antibodies</cons> (<cons lex="mAb" sem="G#protein_domain_or_region">mAb</cons>) which recognize different <cons lex="CD4_epitope" sem="G#protein_family_or_group">CD4 epitopes</cons>.</sentence>
<sentence>We demonstrate that <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> delivers signals capable of activating the <cons lex="NF-AT_transcription_factor" sem="G#protein_molecule">NF-AT transcription factor</cons> which is required for <cons lex="interleukin-2_gene_expression" sem="G#other_name"><cons lex="interleukin-2" sem="G#protein_molecule">interleukin-2</cons> gene expression</cons>.</sentence>
<sentence>Whereas different <cons lex="anti-CD4_mAb" sem="G#protein_family_or_group">anti-CD4 mAb</cons> or <cons lex="HIV-1_gp120" sem="G#protein_molecule"><cons lex="HIV-1" sem="G#virus">HIV-1</cons> gp120</cons> could all trigger activation of the <cons lex="protein_tyrosine_kinase" sem="G#protein_family_or_group">protein tyrosine kinases</cons> <cons lex="p56lck" sem="G#protein_molecule">p56lck</cons> and <cons lex="p59fyn" sem="G#protein_molecule">p59fyn</cons> and phosphorylation of the <cons lex="Shc_adaptor_protein" sem="G#protein_molecule">Shc adaptor protein</cons>, which mediates signals to <cons lex="Ras" sem="G#protein_family_or_group">Ras</cons>, they differed significantly in their ability to activate <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons>.</sentence>
<sentence>Lack of full activation of <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons> could be correlated to a dramatically reduced capacity to induce <cons lex="calcium_flux" sem="G#other_name"><cons lex="calcium" sem="G#atom">calcium</cons> flux</cons> and could be complemented with a <cons lex="calcium_ionophore" sem="G#other_organic_compound">calcium ionophore</cons>.</sentence>
<sentence>The results identify functionally distinct <cons lex="epitope" sem="G#protein_family_or_group">epitopes</cons> on the <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> involved in activation of the <cons lex="Ras/protein_kinase_C_and_calcium_pathway" sem="G#other_name"><cons lex="Ras/protein_kinase_C" sem="G#protein_molecule"><cons lex="Ras/protein_kinase_C_pathway" sem="G#other_name"><cons lex="Ras" sem="G#protein_molecule">Ras</cons><cons lex="protein_kinase_C" sem="G#protein_molecule">/protein kinase C</cons></cons></cons> and <cons lex="calcium_pathway" sem="G#other_name">calcium pathways</cons></cons>.</sentence>
 </abstract>

My desired output:

interacts with 
of 
on 
and contributes to
on 
in 
using 
which recognize different 
triggering
delivers signals capable of activating the
which is required for 
or 
could all trigger activation of the 
and

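I tried the regular expression

 .*<\/cons>(.*?)<cons.*  replaced with $1

but it returns only the last occurrence of the text between

</cons> and <cons

in each sentence, whereas my sentences contain several of these tags. Can anyone help me?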
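The behaviour described above can be reproduced outside Notepad++. The following is a minimal sketch using Python's re module as a stand-in for Notepad++'s regex engine (an assumption; the two engines are not identical, but greedy matching works the same way here). The sample line is shortened and hypothetical, not the full abstract.

```python
import re

# Shortened, hypothetical sample line (not the full abstract from the question).
line = ('The <cons lex="a">CD4 coreceptor</cons> interacts with '
        '<cons lex="b">regions</cons> of <cons lex="c">molecules</cons>.')

# Pattern from the question: the leading greedy ".*" swallows the earlier
# pairs, so one match spans the whole line and keeps only the last gap.
last_only = re.sub(r'.*</cons>(.*?)<cons.*', r'\1', line)

# Dropping the surrounding ".*" and collecting every match keeps all gaps.
all_gaps = re.findall(r'</cons>(.*?)<cons', line)

print(repr(last_only))
print(all_gaps)
```

Because the whole line is consumed by a single match, a replace-all has nothing left to match on that line, which is why only one fragment per sentence survives.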

3 Answers:

Answer 0: (score: 0)

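  1. Go to Search -> Replace in Notepad++.
  2. Select Regular expression as the search mode.
  3. In Find what, set the regular expression to "<[^>]+>", put a space in the Replace with field, and click Replace All.
  4. This replaces every XML tag with a space (you can also put a line break in the Replace with field instead).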

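    It will leave you with the string:

    The CD4 coreceptor interacts with non-polymorphic regions of major histocompatibility complex class II molecules on antigen-presenting cells and contributes to T cell activation.  We have investigated the effect of CD4 triggering on T cell activating signals in a lymphoma model using monoclonal antibodies (mAb) which recognize different CD4 epitopes.  We demonstrate that CD4 triggering delivers signals capable of activating the NF-AT transcription factor which is required for interleukin-2 gene expression.  Whereas different anti-CD4 mAb or HIV-1 gp120 could all trigger activation of the protein tyrosine kinases p56lck and p59fyn and phosphorylation of the Shc adaptor protein, which mediates signals to Ras, they differed significantly in their ability to activate NF-AT.  Lack of full activation of NF-AT could be correlated to a dramatically reduced capacity to induce calcium flux and could be complemented with a calcium ionophore.  The results identify functionally distinct epitopes on the CD4 coreceptor involved in activation of the Ras/protein kinase C and calcium pathways.

    I hope it helps.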
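For comparison, the same tag-stripping substitution can be sketched in Python. This is only an illustration of what the "<[^>]+>" pattern does, using a shortened, hypothetical sample line; it assumes no attribute value contains a ">" character. Note that it yields the whole sentence with the tags removed, not just the text between the tags.

```python
import re

# Shortened, hypothetical sample line (not the full abstract from the question).
sample = ('The <cons lex="a">CD4 coreceptor</cons> interacts with '
          '<cons lex="b">regions</cons>.')

# Replace every tag with a space, as in the steps above.
with_spaces = re.sub(r'<[^>]+>', ' ', sample)

# Replace every tag with nothing, which avoids doubled spaces.
without_tags = re.sub(r'<[^>]+>', '', sample)

print(repr(with_spaces))
print(repr(without_tags))
```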

Answer 1: (score: 0)

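Parsing XML with regular expressions is difficult. It is better to use an XML parser. The following Python 3 SAX content handler tracks when a </cons> end tag has been parsed (self.state = 1), whether it is immediately followed by text content (self.state = 2), and whether that text is in turn immediately followed by a <cons> start element. If so, it prints the content: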

import xml.sax

data = b'''\
<abstract>
<sentence>The <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> interacts with <cons lex="non-polymorphic_region" sem="G#protein_domain_or_region">non-polymorphic regions</cons> of <cons lex="major_histocompatibility_complex_class_II_molecule" sem="G#protein_family_or_group">major histocompatibility complex class II molecules</cons> on <cons lex="antigen-presenting_cell" sem="G#cell_type">antigen-presenting cells</cons> and contributes to <cons lex="T_cell_activation" sem="G#other_name">T cell activation</cons>.</sentence>
<sentence>We have investigated the effect of <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> on <cons lex="T_cell_activating_signal" sem="G#other_name">T cell activating signals</cons> in a <cons lex="lymphoma_model" sem="G#other_name">lymphoma model</cons> using <cons lex="monoclonal_antibody" sem="G#protein_family_or_group">monoclonal antibodies</cons> (<cons lex="mAb" sem="G#protein_domain_or_region">mAb</cons>) which recognize different <cons lex="CD4_epitope" sem="G#protein_family_or_group">CD4 epitopes</cons>.</sentence>
<sentence>We demonstrate that <cons lex="CD4_triggering" sem="G#other_name"><cons lex="CD4" sem="G#protein_molecule">CD4</cons> triggering</cons> delivers signals capable of activating the <cons lex="NF-AT_transcription_factor" sem="G#protein_molecule">NF-AT transcription factor</cons> which is required for <cons lex="interleukin-2_gene_expression" sem="G#other_name"><cons lex="interleukin-2" sem="G#protein_molecule">interleukin-2</cons> gene expression</cons>.</sentence>
<sentence>Whereas different <cons lex="anti-CD4_mAb" sem="G#protein_family_or_group">anti-CD4 mAb</cons> or <cons lex="HIV-1_gp120" sem="G#protein_molecule"><cons lex="HIV-1" sem="G#virus">HIV-1</cons> gp120</cons> could all trigger activation of the <cons lex="protein_tyrosine_kinase" sem="G#protein_family_or_group">protein tyrosine kinases</cons> <cons lex="p56lck" sem="G#protein_molecule">p56lck</cons> and <cons lex="p59fyn" sem="G#protein_molecule">p59fyn</cons> and phosphorylation of the <cons lex="Shc_adaptor_protein" sem="G#protein_molecule">Shc adaptor protein</cons>, which mediates signals to <cons lex="Ras" sem="G#protein_family_or_group">Ras</cons>, they differed significantly in their ability to activate <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons>.</sentence>
<sentence>Lack of full activation of <cons lex="NF-AT" sem="G#protein_molecule">NF-AT</cons> could be correlated to a dramatically reduced capacity to induce <cons lex="calcium_flux" sem="G#other_name"><cons lex="calcium" sem="G#atom">calcium</cons> flux</cons> and could be complemented with a <cons lex="calcium_ionophore" sem="G#other_organic_compound">calcium ionophore</cons>.</sentence>
<sentence>The results identify functionally distinct <cons lex="epitope" sem="G#protein_family_or_group">epitopes</cons> on the <cons lex="CD4_coreceptor" sem="G#protein_molecule">CD4 coreceptor</cons> involved in activation of the <cons lex="Ras/protein_kinase_C_and_calcium_pathway" sem="G#other_name"><cons lex="Ras/protein_kinase_C" sem="G#protein_molecule"><cons lex="Ras/protein_kinase_C_pathway" sem="G#other_name"><cons lex="Ras" sem="G#protein_molecule">Ras</cons><cons lex="protein_kinase_C" sem="G#protein_molecule">/protein kinase C</cons></cons></cons> and <cons lex="calcium_pathway" sem="G#other_name">calcium pathways</cons></cons>.</sentence>
 </abstract>'''

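class Handler(xml.sax.ContentHandler):
    """Print text that sits between a </cons> end tag and the next <cons> start tag.

    state 0: idle
    state 1: a </cons> end tag was just parsed
    state 2: text content followed that end tag
    """

    def __init__(self):
        super().__init__()
        self.state = 0
        self.content = ''

    def characters(self, content):
        if self.state == 1:
            # First text chunk after </cons>: remember it.
            self.content = content
            self.state = 2
        elif self.state == 2:
            # A SAX parser may deliver one run of text in several chunks.
            self.content += content

    def startElement(self, name, attr):
        # The text is only wanted when a <cons> start tag follows it.
        if name == 'cons' and self.state == 2:
            print(self.content)
        self.state = 0

    def endElement(self, name):
        if name == 'cons':
            self.state = 1
        else:
            self.state = 0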

xml.sax.parseString(data,Handler())

This is the best I could do with regular expressions in Notepad++. It handles everything except the text after the last replacement.

Output:

 interacts with 
 of 
 on 
 and contributes to 
 on 
 in a 
 using 
 (
) which recognize different 
 delivers signals capable of activating the 
 which is required for 
 or 
 could all trigger activation of the 

 and 
 and phosphorylation of the 
, which mediates signals to 
, they differed significantly in their ability to activate 
 could be correlated to a dramatically reduced capacity to induce 
 and could be complemented with a 
 on the 
 involved in activation of the 
 and 
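The same rule as the SAX handler above (text that directly follows a </cons> end tag and directly precedes a <cons> start tag) can also be sketched with xml.etree.ElementTree, where that text is the "tail" of a cons element whose next sibling is also a cons element. This is an alternative sketch, not part of the original answer; the function name text_between_cons and the shortened sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with one nested <cons> element.
SAMPLE = ('<abstract><sentence>The <cons lex="a">CD4 coreceptor</cons> '
          'interacts with <cons lex="b">regions</cons> of '
          '<cons lex="c"><cons lex="d">CD4</cons> triggering</cons>.'
          '</sentence></abstract>')


def text_between_cons(xml_text):
    """Return each text run that lies between </cons> and the next <cons>."""
    found = []

    def walk(parent):
        children = list(parent)
        for i, child in enumerate(children):
            walk(child)  # descend first so results stay in document order
            next_is_cons = i + 1 < len(children) and children[i + 1].tag == 'cons'
            # child.tail is the text right after the child's end tag.
            if child.tag == 'cons' and child.tail and next_is_cons:
                found.append(child.tail)

    walk(ET.fromstring(xml_text))
    return found


for piece in text_between_cons(SAMPLE):
    print(repr(piece))
```

As with the SAX handler, text that is followed by an end tag rather than a start tag (here " triggering" and the final ".") is not reported.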

Answer 2: (score: 0)

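There is a simple way to extract the data in Notepad++, as mentioned above. In the Replace dialog, with the search mode set to Regular expression:

Find what: .*?</cons>([^<]*?)<cons
Replace with: \1\r\n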
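To see what this pattern keeps and what it leaves behind, here is a small sketch in Python's re module, used as a stand-in for Notepad++'s regex engine (an assumption; the engines differ in details). The sample line is shortened and hypothetical, and "\n" stands in for "\r\n".

```python
import re

# Shortened, hypothetical sample with one nested <cons> element.
line = ('We study <cons lex="a"><cons lex="b">CD4</cons> triggering</cons> on '
        '<cons lex="c">signals</cons> in a <cons lex="d">model</cons>.')

pattern = r'.*?</cons>([^<]*?)<cons'

# Same replacement idea as above: each match collapses to its captured gap.
replaced = re.sub(pattern, r'\1\n', line)

# Collecting the capture groups instead avoids the unmatched tail.
gaps = re.findall(r'</cons>([^<]*?)<cons', line)

print(repr(replaced))
print(gaps)
```

The part of the line after the last match is not touched by the replacement, so it stays in the result. Because the capture group uses [^<], text that is followed by an end tag (such as " triggering" in this sample) is not captured either.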