Conditionally splitting comma-separated values in a PySpark list

Time: 2016-04-17 18:17:18

Tags: python list csv pyspark

I'm trying to run a job in PySpark. My data is in an RDD created with the PySpark SparkContext class (sc), like this:

directory_file = sc.textFile('directory.csv')

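*I don't think Python's csv module can be used on data that is in an RDD.

This produces one list for each row of the csv. I know it's ugly, but here is a sample of one list (equivalent to one row of the original csv):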

[u'14K685,El Puente Academy for Peace and Justice,Brooklyn,K778,718-387-1125,718-387-4229,9,12,,,"B24, B39, B44, B44-SBS, B46, B48, B57, B60, B62, Q54, Q59","G to Broadway ; J, M to Hewes St ; Z to Marcy Ave",250 Hooper Street,Brooklyn,NY,11211,www.elpuente.us,225,N/A,Consortium School,"We are a small, innovative learning community that promotes comprehensive academic excellence for all students while inspiring and nurturing leadership for peace and justice. Our state-of-the-art facility allows for a creative and intellectually challenging environment where every student thrives. Our project-based curriculum is designed to prepare students to be active citizens and independent thinkers who share a passion for transforming their communities and the world into a better place. Our trimester system allows students to complete most of their high school credits by the 11th grade, opening opportunities for exciting internships and college courses during the school day in their senior year.","Accelerated credit accumulation (up to 18 credits per year), iLearn, iZone 360, Year-long SAT (Scholastic Aptitude Test) preparatory course, Individualized college counseling, Early College Awareness & Preparatory Program (ECAPP). 
Visits to college campuses in NYC, Visits to colleges outside NYC in partnership with the El Puente Leadership Center, Internships, Community-based Projects, Portfolio Assessment, Integrated-Arts Projects, Before- and After-school Tutoring; Elective courses include: Drama, Dance (Men\'s and Women\'s Groups), Debate Team partnership with Tufts University, Guitar, Filmmaking, Architecture, Glee",Spanish,,,,"AM and PM Academic Support, B-Boy/B-Girl, Chorus, College and Vocational Counseling and Placement, College Prep, Community Development Project, Computers, Dance Level 1 and 2, Individual Drama; Education for Public Inquiry and International Citizenship (EPIIC), El Puente Leadership Center, Film, Fine Arts, Liberation, Media, Men\u2019s and Women\u2019s Groups, Movement Theater Level 1, Movement Theater Level 2, Music, Music Production, Pre-professional training in Dance, PSAT/SAT Prep, Spoken Word, Student Council, Teatro El Puente, Visual Art",,,,"Boys & Girls Basketball, Baseball, Softball, Volleyball",El Puente Williamsburg Leadership Center; The El Puente Bushwick Center; Leadership Center at Taylor-Wythe Houses; Beacon Leadership Center at MS50.,"Woodhull Medical Center, Governor Hospital","Hunter College (CUNY), Eugene Lang College The New School for Liberal Arts, Pratt College of Design, Tufts University, and Touro College.","El Puente Leadership Center, El Puente Bushwick Center, Beacon Leadership Center at MS50, Leadership Center at Taylor-Wythe Houses, Center for Puerto Rican Studies, Hip- Hop Theatre Festival, Urban Word, and Summer Search.",,,,,Our school requires assessment of an Academic Portfolio for graduation.,,9:00 AM,3:30 PM,This school will provide students with disabilities the supports and services indicated on their IEPs.,ESL,Not Functionally Accessible,1,Priority to Brooklyn students or residents,Then to New York City residents,,,,,,,,,"250 Hooper Street']

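I want to split each list item using the comma as the delimiter, except where the comma sits between double quotes (e.g. " ,,,").

parsed = directory_file.map(lambda x: x.split(',')) obviously does not handle commas between double quotes. Is there a way to do this? I have seen this question, which specifically mentions csv, but since in this case the csv is first loaded into a Spark RDD, I'm fairly sure the csv module doesn't apply here.

Thanks.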
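To make the problem concrete, here is a minimal sketch run on a plain Python string outside Spark. The sample line is made up for illustration (shortened from the row above):

```python
# A shortened, made-up row with one quoted field that contains commas
line = 'K778,"B24, B39, B44",250 Hooper Street'

# Naive split: the quoted field is cut into three pieces
parts = line.split(',')
print(parts)
# ['K778', '"B24', ' B39', ' B44"', '250 Hooper Street']
```

The row has three fields, but the naive split returns five items.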

3 answers:

Answer 0: (score: 2)

You can use a regular expression. It runs quite fast in PySpark:

import re
rdd = sc.textFile("factbook.csv")

# Get rid of those commas we do not need
pattern = re.compile(r'(.*".*)(,)(.*".*)', re.M | re.I)

def drop_comma(x):
    # Join group 1 and group 3 with a space; leave non-matching lines alone
    m = pattern.match(x)
    return m.group(1) + " " + m.group(3) if m is not None else x

cleanedRdd = rdd.map(drop_comma)

So for every line similar to this one:

col1,"col2,blabla",col3

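This code matches the line against the regex pattern. If the pattern is found, three groups are created:

  • Group 1: col1,"col2
  • Group 2: ,
  • Group 3: blabla",col3

Finally we concatenate group 1 and group 3 with a space between them, and the output will be: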

col1,"col2 blabla",col3
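The behaviour described above can be checked locally without Spark. The following is a minimal sketch; the function name replace_one_quoted_comma is made up for illustration, and the flags from the answer are left out because the pattern uses no anchors or letters that they would affect:

```python
import re

# Same pattern as in the answer, compiled once
PATTERN = re.compile(r'(.*".*)(,)(.*".*)')

def replace_one_quoted_comma(line):
    """Replace the comma captured as group 2 with a space, if the pattern matches."""
    m = PATTERN.match(line)
    if m is None:
        return line
    return m.group(1) + " " + m.group(3)

print(replace_one_quoted_comma('col1,"col2,blabla",col3'))  # col1,"col2 blabla",col3
print(replace_one_quoted_comma('a,"b,c,d",e'))              # a,"b,c d",e
print(replace_one_quoted_comma('no quotes, here'))          # no quotes, here
```

Note that a single match replaces only one comma per line, as the second call shows, so a quoted field with several commas keeps the others.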

Answer 1: (score: 0)

With your data, this should work:

new_csv = [""]
inside_quotes = False
pos = 0
for letter in csv:  # csv here is one row of the file, as a single string
    if letter == ",":
        if inside_quotes:
            new_csv[pos] += letter
        else:
            new_csv.append("")
            pos += 1
    elif letter == '"':
        inside_quotes = not inside_quotes  # Switch inside_quotes to True if False or vice versa.
    else:
        new_csv[pos] += letter

new_csv = [x for x in new_csv if x != ''] # Remove all '' 's.
print(new_csv)

Output

['14K685', 'El Puente Academy for Peace and Justice', 'Brooklyn', 'K778', '718-387-1125', '718-387-4229', '9', '12', 'B24, B39, B44, B44-SBS, B46, B48, B57, B60, B62, Q54, Q59', 'G to Broadway ; J, M to Hewes St ; Z to Marcy Ave', '250 Hooper Street', 'Brooklyn', 'NY', '11211', 'www.elpuente.us', '225', 'N/A', 'Consortium School', 'We are a small, innovative learning community that promotes comprehensive academic excellence for all students while inspiring and nurturing leadership for peace and justice. Our state-of-the-art facility allows for a creative and intellectually challenging environment where every student thrives. Our project-based curriculum is designed to prepare students to be active citizens and independent thinkers who share a passion for transforming their communities and the world into a better place. Our trimester system allows students to complete most of their high school credits by the 11th grade, opening opportunities for exciting internships and college courses during the school day in their senior year.', "Accelerated credit accumulation (up to 18 credits per year), iLearn, iZone 360, Year-long SAT (Scholastic Aptitude Test) preparatory course, Individualized college counseling, Early College Awareness & Preparatory Program (ECAPP). 
Visits to college campuses in NYC, Visits to colleges outside NYC in partnership with the El Puente Leadership Center, Internships, Community-based Projects, Portfolio Assessment, Integrated-Arts Projects, Before- and After-school Tutoring; Elective courses include: Drama, Dance (Men's and Women's Groups), Debate Team partnership with Tufts University, Guitar, Filmmaking, Architecture, Glee", 'Spanish', 'AM and PM Academic Support, B-Boy/B-Girl, Chorus, College and Vocational Counseling and Placement, College Prep, Community Development Project, Computers, Dance Level 1 and 2, Individual Drama; Education for Public Inquiry and International Citizenship (EPIIC), El Puente Leadership Center, Film, Fine Arts, Liberation, Media, Men’s and Women’s Groups, Movement Theater Level 1, Movement Theater Level 2, Music, Music Production, Pre-professional training in Dance, PSAT/SAT Prep, Spoken Word, Student Council, Teatro El Puente, Visual Art', 'Boys & Girls Basketball, Baseball, Softball, Volleyball', 'El Puente Williamsburg Leadership Center; The El Puente Bushwick Center; Leadership Center at Taylor-Wythe Houses; Beacon Leadership Center at MS50.', 'Woodhull Medical Center, Governor Hospital', 'Hunter College (CUNY), Eugene Lang College The New School for Liberal Arts, Pratt College of Design, Tufts University, and Touro College.', 'El Puente Leadership Center, El Puente Bushwick Center, Beacon Leadership Center at MS50, Leadership Center at Taylor-Wythe Houses, Center for Puerto Rican Studies, Hip- Hop Theatre Festival, Urban Word, and Summer Search.', 'Our school requires assessment of an Academic Portfolio for graduation.', '9:00 AM', '3:30 PM', 'This school will provide students with disabilities the supports and services indicated on their IEPs.', 'ESL', 'Not Functionally Accessible', '1', 'Priority to Brooklyn students or residents', 'Then to New York City residents', '250 Hooper Street']

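How it works

  1. Initialize the list new_csv containing a single empty string. This will store our final output.

  2. Initialize the bool inside_quotes, which tells our program whether we are parsing letters inside or outside quotes.

  3. Initialize the int pos, which tells us our position in the new_csv list.

  4. Iterate over every letter in the string.

  5. Check whether the letter is a ,

    • Check whether we are currently parsing a string inside quotes.

      • If True, we add the , to the current string inside new_csv.

      • If False, we don't add it; we append a new empty string and do pos += 1

  6. If not, check whether the letter is a "

    • If it is, we flip the bool inside_quotes, from True to False or from False to True, using the handy not keyword.

  7. If it is any other character, we simply add the character to the current string in the list.

  8. Do some cleanup and remove all empty strings '' from the list.

  9. Print it out :).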
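The steps above can be wrapped in a function so that the same logic could be passed to an RDD's map, for example directory_file.map(split_outside_quotes). This is a sketch only: the function name and the drop_empty parameter are made up for illustration and are not part of the answer.

```python
def split_outside_quotes(line, drop_empty=True):
    """Split a line on commas, ignoring commas between double quotes."""
    fields = [""]
    inside_quotes = False
    for ch in line:
        if ch == "," and not inside_quotes:
            # A delimiter: start a new field
            fields.append("")
        elif ch == '"':
            # Toggle the quote state; the quote itself is not kept
            inside_quotes = not inside_quotes
        else:
            # Any other character (or a comma inside quotes) joins the current field
            fields[-1] += ch
    if drop_empty:
        # Step 8 of the answer: remove all empty strings
        fields = [f for f in fields if f != ""]
    return fields

row = 'K778,"B24, B39",,NY'
print(split_outside_quotes(row))                    # ['K778', 'B24, B39', 'NY']
print(split_outside_quotes(row, drop_empty=False))  # ['K778', 'B24, B39', '', 'NY']
```

Passing drop_empty=False keeps the empty fields, which may be preferable when column positions matter, since removing them shifts later values to the left.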

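Answer 2: (score: 0)

This is a very common problem when reading tabular data. Thankfully, Python has a library that does this for you, so you don't have to do it by hand. You say the csv module won't work; why? If it isn't working, try the following code and comments!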

import csv

# please note: KEEP YOUR FILE AS STRINGS when you read in your data.
# Don't do anything to it to try to split it or something.
my_rdd = sc.textFile("/your/file/location/*")
split_with_quotes = my_rdd.map(lambda row: next(csv.reader(row.splitlines(), skipinitialspace=True)))
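The per-row parsing can be tried on a plain string before running it on the cluster. A minimal sketch, where parse_csv_line is a hypothetical helper name and the sample line is made up:

```python
import csv

def parse_csv_line(line):
    """Parse one CSV-formatted line into a list of fields."""
    # csv.reader accepts any iterable of lines, so a one-element list works
    return next(csv.reader([line], skipinitialspace=True))

sample = 'col1,"col2,blabla",,col3'
fields = parse_csv_line(sample)
print(fields)  # ['col1', 'col2,blabla', '', 'col3']
```

Unlike the hand-written loop, the csv reader keeps empty fields in place, so column positions are preserved.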

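Note that the csv parser in the csv package limits field length to 131,072 characters by default, so if you have very long strings you will need to do a bit more work.

To check whether this is the case, run the following: my_rdd.filter(lambda x: len(x) >= 131072).count(). If the count is not 0, you have strings that are too long.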
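If long fields do turn up, one option is to raise the limit with csv.field_size_limit. The sketch below runs locally on a made-up long line; the new limit of 1,000,000 is an arbitrary value chosen for illustration. On a cluster the setting is module-level state, so it would presumably need to be applied in the worker processes (for example inside the mapped function) rather than only on the driver.

```python
import csv

# Remember the current limit so it can be restored afterwards
old_limit = csv.field_size_limit()

# Raise the limit; 1,000,000 is an arbitrary value for this sketch
csv.field_size_limit(1000000)

# A made-up line whose quoted field is longer than 131,072 characters
long_line = 'id,"' + 'x' * 200000 + '"'
fields = next(csv.reader([long_line]))
print(len(fields), len(fields[1]))

# Restore the previous limit
csv.field_size_limit(old_limit)
```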