I'm developing a tool for Wikipedia, and I need to extract parameter names and values from a template's wikitext, like this:
|param1=value1 |param2=value2 |param3=value3
That would be easy, but there are two complications. First, there may be spaces and line breaks:
|param1=value1
| param2 = value 2
| param3 = value 3
Second, there may be pipes inside the parameter values! Like this:
|param1=value1
|param2 = [[value2|val2]]
|param3 = [[ value3 | val3 ]]
I'm afraid this level of regex mastery is beyond my current skills. Can anyone see a solution? Thanks!
Answer 0 (score: 3)
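You can use existing libraries such as mwclient (https://github.com/mwclient/mwclient) and mwparserfromhell (https://github.com/earwig/mwparserfromhell) to do this.
For example, the code below extracts the templates and their parameters from the page https://en.wikipedia.org/wiki/Test: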
import mwclient
import mwparserfromhell

# Connect to English Wikipedia and fetch the wikitext of the "Test" page
wiki = mwclient.Site('en.wikipedia.org', path='/w/')
page = wiki.pages['Test']
text = page.text()

# Parse the wikitext, then walk every template and its parameters
wikicode = mwparserfromhell.parse(text)
templates = wikicode.filter_templates()
for template in templates:
    print("Found template %s" % template.name)
    for param in template.params:
        print("\tFound param %s with value %s" % (param.name, param.value))
You will see something like:
Found template SampleTemplate
    Found param param1 with value value1
    Found param param2 with value value2
    Found param param3 with value value3
...
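If pulling in a parser library is not an option, the parameters can also be split by hand by tracking bracket depth, so that only pipes outside [[...]] and {{...}} act as separators. The following is a minimal standard-library sketch under that assumption, not a replacement for a real parser: split_template_params is a hypothetical helper name, and it skips unnamed positional parameters and knows nothing about <nowiki> sections or HTML comments.

```python
def split_template_params(wikitext):
    """Split '|name=value' wikitext into (name, value) pairs.

    Pipes nested inside [[...]] or {{...}} stay part of the value.
    """
    chunks, current, depth, i = [], [], 0, 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair in ("[[", "{{"):
            # Entering a link or nested template
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}") and depth > 0:
            # Leaving a link or nested template
            depth -= 1
            current.append(pair)
            i += 2
        elif wikitext[i] == "|" and depth == 0:
            # Top-level pipe: the current parameter ends here
            chunks.append("".join(current))
            current = []
            i += 1
        else:
            current.append(wikitext[i])
            i += 1
    chunks.append("".join(current))

    params = []
    for chunk in chunks:
        # Split on the first '=' only, so values may contain '='
        name, sep, value = chunk.partition("=")
        if sep:  # chunks without '=' (e.g. text before the first pipe) are skipped
            params.append((name.strip(), value.strip()))
    return params


sample = "|param1=value1\n| param2 = [[value2|val2]]\n| param3 = [[ value3 | val3 ]]"
for name, value in split_template_params(sample):
    print(name, "->", value)
# param1 -> value1
# param2 -> [[value2|val2]]
# param3 -> [[ value3 | val3 ]]
```

Counting depth instead of matching with a single regex keeps nested constructs such as {{t|x=1}} inside a value in one piece.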
Answer 1 (score: 0)
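// Sample wikitext from the question
const input = `|param1=value1
|param2 = [[value2|val2]]
|param3 = [[ value3 | val3 ]]`;

// Strip all whitespace, then match each name=value pair, where the value is
// either a [[...]] link or a single word
const output = input.replace(/\s+/g, '').match(/\w+=(\[\[.+?\]\]|\w+)/g).map(item => {
    const pairs = item.split('=');
    // Collect the words of the value (for a link: target and label)
    pairs[1] = pairs[1].match(/\w+/g);
    return {
        key: pairs[0],
        values: pairs[1]
    };
});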
/* output:
[
    {
        "key": "param1",
        "values": [
            "value1"
        ]
    },
    {
        "key": "param2",
        "values": [
            "value2",
            "val2"
        ]
    },
    {
        "key": "param3",
        "values": [
            "value3",
            "val3"
        ]
    }
]
*/
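A regex along the same lines can also keep each value intact instead of stripping whitespace first, which matters for values such as "value 2". Here is a rough Python sketch of that variation; PARAM_RE and extract_params are illustrative names, parameter names are assumed to consist of word characters only, and nested templates ({{...|...}}) inside values are not handled.

```python
import re

# One parameter: "|", a name, "=", then a value made of complete [[...]] links
# or any characters other than pipes and square brackets.
PARAM_RE = re.compile(r"\|\s*(\w+)\s*=\s*((?:\[\[.*?\]\]|[^|\[\]])*)")


def extract_params(wikitext):
    """Return {name: value} for each '|name = value' pair in wikitext."""
    return {name: value.strip() for name, value in PARAM_RE.findall(wikitext)}


sample = """|param1=value1
| param2 = value 2
|param3 = [[ value3 | val3 ]]"""
print(extract_params(sample))
# {'param1': 'value1', 'param2': 'value 2', 'param3': '[[ value3 | val3 ]]'}
```

Because a value is matched as a run of whole links or non-pipe characters, the pipe inside [[ value3 | val3 ]] never ends the match early.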
Answer 2 (score: 0)
This will only work if the values do not contain the = character.
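var str = `|param1=value1
|param2 = [[value2|val2]]
|param3 = [[ value3 | val3 ]]`;
// Delete line breaks and spaces, then split on "="
var parts = str.replace(/\s+/g, '').split('='), len = parts.length, result = [];
// The first chunk is "|param1"; every later chunk is "value|nextParam"
var name = parts[0].slice(1);
for (var i = 1; i < len; i++) {
    // The last pipe separates this value from the next parameter name
    var cut = i < len - 1 ? parts[i].lastIndexOf('|') : parts[i].length;
    result.push({param: name, value: parts[i].slice(0, cut)});
    name = parts[i].slice(cut + 1);
}
// result = [{param: "param1", value: "value1"}, {param: "param2", value: "[[value2|val2]]"}, ...]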
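Splitting on every = breaks as soon as a value itself contains one, for example a URL with a query string. One way around that is to split each line on the first = only. The Python sketch below assumes one parameter per line, which the single-line form in the question does not satisfy; parse_lines is a hypothetical name and the example.org URL is a placeholder.

```python
def parse_lines(wikitext):
    """Parse one '|name = value' pair per line, splitting on the first '=' only."""
    result = []
    for line in wikitext.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            result.append({
                # Drop the leading pipe and surrounding whitespace from the name
                "param": name.strip().lstrip("|").strip(),
                "value": value.strip(),
            })
    return result


sample = "|param1=value1\n| url = https://example.org/?a=1&b=2"
print(parse_lines(sample))
# [{'param': 'param1', 'value': 'value1'}, {'param': 'url', 'value': 'https://example.org/?a=1&b=2'}]
```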