Question

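I'm having trouble reading in a raw data file. Because of the delimiter, some of the input gets cut off. One of the titles has "\" in front of the actual title, so the Book_Title output is only "\". I would like to know whether there is a way to ignore these symbols.

Input: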

0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press"
085409878X;"\"Pie-powder\"; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"

Code:

data rating.books;
infile "&path\BX-Books.csv" dlm=';' missover dsd firstobs=2;
input   ISBN: $12.
            Book_Title: $quote150.
            Book_Author: $quote60.
            Year_Of_Publication: $quote8.
            Publisher: $quote60.;
run;

Output:

ISBN | Book-Title | Book-Author | Publisher | Publication-Year 
0195153448 | Classical Mythology | Mark P. O. Morford | Oxford University Press | 2002 
085409878X | \ | being dust from the law courts,"|  1973 | Missing value

Desired output:

     ISBN | Book-Title | Book-Author | Publisher | Publication-Year 
    0195153448 | Classical Mythology | Mark P. O. Morford | Oxford University Press | 2002 
    085409878X | Pie-powder being dust from the law courts |John Alderson Foote | EP Publishing | 1973

Answer 1

Your source data does not look like it follows any known pattern.

If you read it without the DSD option, the second line is seen as having 6 fields.

085409878X;"\"Pie-powder\"; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"

v1=085409878X
v2="\"Pie-powder\"
v3=being dust from the law courts
v4=John Alderson Foote"
v5="1973"
v6="EP Publishing"
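A quick way to check this split yourself is to read the line into generic character variables and print them to the log. This is only a sketch: the names v1-v6 match the listing above, while the $100 length and the reuse of the question's &path location are assumptions.

```sas
data _null_;
  /* No DSD here: quotes are kept and each ';' simply ends a field */
  infile "&path\BX-Books.csv" dlm=';' truncover firstobs=2;
  length v1-v6 $100;      /* assumed length, long enough for this sample */
  input v1-v6;
  put (v1-v6) (=/);       /* write each variable as name=value on its own log line */
run;
```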

If you try to "fix" the escaped quotes

_infile_=tranwrd(_infile_,'\"','""');

then you end up with only 4 fields.

085409878X;"""Pie-powder""; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"

v1=085409878X
v2="Pie-powder"; being dust from the law courts;John Alderson Foote
v3=1973
v4=EP Publishing
v5=
v6=

To get the desired output, you could try removing the \"; and "\" strings.

_infile_=tranwrd(_infile_,'\";',' ');
_infile_=tranwrd(_infile_,'"\"','');

That lets you read it the way you want.

085409878X; Pie-powder  being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"

v1=085409878X
v2=Pie-powder  being dust from the law courts
v3=John Alderson Foote"
v4=1973
v5=EP Publishing
v6=

I'm not sure whether this will generalize to other lines with extra quotes or extra semicolons.
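Putting the pieces together, the two TRANWRD calls can be applied to the input buffer before the fields are parsed. The following is a minimal sketch, assuming the file, libref, and variable names from the question; the final COMPRESS call is an added assumption to drop the leftover quote seen in v3 above, and the whole step has only been reasoned through against the two sample lines.

```sas
data rating.books;
  infile "&path\BX-Books.csv" dlm=';' missover dsd firstobs=2;
  /* Load the record into the _infile_ buffer and hold it for editing */
  input @;
  /* Drop the stray \"; and "\" sequences before the fields are parsed */
  _infile_ = tranwrd(_infile_, '\";', ' ');
  _infile_ = tranwrd(_infile_, '"\"', '');
  input ISBN                :$12.
        Book_Title          :$150.
        Book_Author         :$60.
        Year_Of_Publication :$8.
        Publisher           :$60.;
  /* Assumption: a leftover quote, as in v3 above, should simply be removed */
  Book_Author = compress(Book_Author, '"');
run;
```

The first INPUT statement with a trailing @ only fills the buffer; the second one does the actual parsing on the cleaned-up line.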

Answer 2

You have to change your code a bit so that the problem column is read into a plain $150. string:

data work.books;
infile "h:\desktop\test.csv" dlm=';' missover dsd firstobs=1;
input   ISBN: $12.
            Book_Title: $150.
            Book_Author: $quote60.
            Year_Of_Publication: $quote8.
            Publisher: $quote60.;
run;

Then you have to clean the special characters " and \ out of the column with this macro function:

%macro cleaningColumn(col);
    compress(strip(&col),'\"',' ')
%mend cleaningColumn;

You can include the macro function in a proc sql statement like this:

proc sql;
create table want as
    select 
        ISBN,
        %cleaningColumn(Book_Title) as Book_Title,
        Book_Author,
        Year_Of_Publication,
        Publisher
    from books;
quit;

The Book_Title column will then look like this:

Classical Mythology
Pie-powder

Regards
