Question

I am trying to extract table names from a SQL statement in R. For example, I have imported SQL queries into R, and one row contains:

SELECT A , B
FROM Table.1 p
JOIN Table.2 pv
ON p.ProdID.1 = ProdID.1
JOIN Table.3 v
ON pv.BusID.1 = v.BusID
WHERE SubID = 15
ORDER BY v.Name;

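In R, I have been trying to use strsplit on the SQL statement, which splits each word into its own column and creates a data frame. I then find the entry matching the word "FROM" and extract the next word, which is Table.1.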

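I am having trouble working out how to extract the other tables from the multiple joins, and I wonder whether there is a more efficient method or a package that I missed in my research. Any help would be greatly appreciated!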

Answer 1

Here is one way to do it with regular expressions:

lines <- strsplit("SELECT A, B
                   FROM Table.1 p
                   JOIN Table.2 pv
                   ON p.ProdID.1 = ProdID.1
                   JOIN Table.3 v
                   ON pv.BusID.1 = v.BusID
                   WHERE SubID = 15
                   ORDER BY v.Name;", split = "\\n")[[1]]

sub(".*(FROM|JOIN) ([^ ]+).*", "\\2", lines[grep("(FROM|JOIN)", lines)]) # "Table.1" "Table.2" "Table.3"

Breakdown:

# Use grep to find the indices of any line containing 'FROM' or 'JOIN'
keywords_regex <- "(FROM|JOIN)"
line_indices <- grep(keywords_regex, lines) # gives: 2 3 5
table_lines <- lines[line_indices] # get just the lines that have table names

# Build regular expression to capture the next word after either keyword
table_name_regex <- paste0(".*", keywords_regex, " ([^ ]+).*")

# The "\\2" means to replace each match with the contents of the second capture 
# group, where a capture group is defined by parentheses in the regex
sub(table_name_regex, "\\2", table_lines)
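
The approach above relies on each FROM or JOIN clause sitting on its own line. As a minimal sketch of a variant that works on the whole statement at once, assuming base R only, the following uses gregexpr and regmatches to collect every match. The helper name extract_tables is hypothetical, not part of the original answer. The pattern is deliberately simple: it ignores case and skips a parenthesized subquery directly after FROM or JOIN, but it does not handle SQL comments, string literals, or comma-separated table lists.

```r
# Hypothetical helper (an illustration, not from the original answer):
# pull table names out of a whole SQL string, regardless of line breaks.
extract_tables <- function(sql) {
  # FROM or JOIN as a whole word, then whitespace, then the next token
  # (a run of characters that are not whitespace, commas, semicolons or parentheses)
  pattern <- "\\b(FROM|JOIN)\\s+([^[:space:],;()]+)"
  matches <- regmatches(
    sql,
    gregexpr(pattern, sql, ignore.case = TRUE, perl = TRUE)
  )[[1]]
  # Drop the leading keyword and whitespace, keeping only the table name
  unique(sub("^\\S+\\s+", "", matches, perl = TRUE))
}

sql <- "SELECT A, B
FROM Table.1 p
JOIN Table.2 pv
ON p.ProdID.1 = pv.ProdID.1
join Table.3 v
ON pv.BusID.1 = v.BusID
WHERE SubID = 15
ORDER BY v.Name;"

extract_tables(sql) # "Table.1" "Table.2" "Table.3"
```

For anything beyond simple queries, a real SQL parser is likely to be more reliable than a regular expression.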
