Best way to parse XML

Time: 2017-08-09 21:17:25

Tags: python pandas csv xml-parsing

I am working with the XML file available here.

I want to parse it and use LON, LAT, PGA, PGV, MMI, PSA03, PSA10, PSA30, STDPGA, URAT and SVEL as the column headers of a CSV file.

The grid_data element contains the values for all of these headers, separated by whitespace.

I am looking for CSV file output like the following:

LON LAT PGA PGV MMI PSA03 PSA10 PSA30 STDPGA URAT SVEL
-99.6833 38.2891 0.04 0.04 2.04 0.09 0.02 0 0.65 1 363.294
-99.6666 38.2891 0.04 0.04 2.06 0.09 0.02 0 0.65 1 342.531
-99.6500 38.2891 0.04 0.04 2.11 0.1 0.02 0 0.65 1 303.783
-99.6333 38.2891 0.04 0.04 2.08 0.09 0.02 0 0.65 1 334.629
-99.6166 38.2891 0.04 0.05 2.15 0.09 0.02 0 0.65 1 279.535
-99.6000 38.2891 0.04 0.04 2.08 0.09 0.02 0 0.65 1 326.391
-99.5833 38.2891 0.04 0.04 2.02 0.08 0.02 0 0.65 1 390.897
-99.5666 38.2891 0.04 0.04 2.08 0.09 0.02 0 0.65 1 346.033

Later, I will use pandas in Python to find the maximum PGV value and do GIS analysis.
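As a minimal sketch of that later step, assuming the grid has already been loaded into a DataFrame with the column names listed above, the maximum PGV and its location could be found as follows. The three rows are copied from the sample output above; the variable names are only illustrative.

```python
import pandas as pd

columns = ["LON", "LAT", "PGA", "PGV", "MMI", "PSA03",
           "PSA10", "PSA30", "STDPGA", "URAT", "SVEL"]

# Three rows taken from the desired output shown above
rows = [
    [-99.6833, 38.2891, 0.04, 0.04, 2.04, 0.09, 0.02, 0, 0.65, 1, 363.294],
    [-99.6166, 38.2891, 0.04, 0.05, 2.15, 0.09, 0.02, 0, 0.65, 1, 279.535],
    [-99.5833, 38.2891, 0.04, 0.04, 2.02, 0.08, 0.02, 0, 0.65, 1, 390.897],
]
df = pd.DataFrame(rows, columns=columns)

# Largest PGV value in the grid
max_pgv = df["PGV"].max()

# Full row (including LON/LAT) of the first occurrence of that maximum
max_row = df.loc[df["PGV"].idxmax()]

print(max_pgv)
print(max_row["LON"], max_row["LAT"])
```

Note that idxmax returns only the first matching row; if several grid points share the maximum, df[df["PGV"] == max_pgv] would return all of them.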

Here is my code so far:

import sys
import traceback
from xml.dom import minidom
import warnings
warnings.filterwarnings("ignore")

try:
    print("*" * 20 + " The Beginning " + "*" * 20)

    xml_file_location = r"C:\Users\*****\Downloads\Grids\us2000a3y4_grid.xml"
    xmldoc = minidom.parse(xml_file_location)
    # Each grid_field element names one column of grid_data
    itemlist = xmldoc.getElementsByTagName('grid_field')
    for item in itemlist:
        print(item.attributes['name'].value)

# Catch all exceptions and print to the screen
except:
    e = sys.exc_info()[0]
    print("Error: %s\n\n" % e)

# Closing script
finally:
    print("*" * 20 + " The End " + "*" * 20)

1 Answer:

Answer 0: (score: 1)

Consider using the built-in etree module to extract the text of the grid_data node, wrapping that text in StringIO(), and passing it directly to pandas.read_table.

import pandas as pd
import xml.etree.ElementTree as et
from io import StringIO
import requests as rq

# RETRIEVE URL OBJECT
r = rq.get('https://earthquake.usgs.gov/realtime/product/shakemap/us2000a3y4/us/1501736303313/download/grid.xml')

# BUILD TREE FROM URL CONTENT
doc = et.fromstring(r.content)

# PARSE <grid_data> TEXT, QUALIFYING THE TAG WITH THE DOCUMENT'S DEFAULT (UNPREFIXED) NAMESPACE
data = doc.find('.//{http://earthquake.usgs.gov/eqcenter/shakemap}grid_data').text

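# READ WHITESPACE-DELIMITED STRING INTO DATAFRAME
# grid_data HAS NO HEADER ROW, SO USE header=None; header=0 (AS ORIGINALLY POSTED)
# TREATS THE FIRST DATA ROW AS A HEADER AND DROPS IT. THE SAMPLE OUTPUT BELOW
# WAS PRODUCED WITH header=0, SO IT IS MISSING THAT FIRST ROW.
df = pd.read_table(StringIO(data), sep=r"\s+", header=None,
                   names=['LON','LAT','PGA', 'PGV', 'MMI','PSA03','PSA10','PSA30','STDPGA','URAT','SVEL'])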

print(df.head())
#         LON      LAT   PGA   PGV   MMI  PSA03  PSA10  PSA30  STDPGA  URAT     SVEL
# 0 -100.3997  38.1145  0.01  0.01  1.77   0.02   0.01    0.0    0.65   1.0  354.533
# 1 -100.3831  38.1145  0.01  0.02  1.82   0.02   0.01    0.0    0.65   1.0  310.786
# 2 -100.3664  38.1145  0.01  0.01  1.77   0.02   0.01    0.0    0.65   1.0  354.545
# 3 -100.3497  38.1145  0.01  0.01  1.76   0.02   0.01    0.0    0.65   1.0  362.307
# 4 -100.3331  38.1145  0.01  0.01  1.76   0.02   0.01    0.0    0.65   1.0  360.332

print(df.tail())
#             LON      LAT   PGA   PGV   MMI  PSA03  PSA10  PSA30  STDPGA  URAT     SVEL
# 105767 -94.4831  33.2425  0.01  0.01  1.78   0.02   0.01    0.0    0.65   1.0  337.237
# 105768 -94.4664  33.2425  0.01  0.02  1.89   0.03   0.01    0.0    0.65   1.0  249.221
# 105769 -94.4497  33.2425  0.01  0.02  1.83   0.02   0.01    0.0    0.65   1.0  297.622
# 105770 -94.4331  33.2425  0.01  0.01  1.63   0.02   0.01    0.0    0.65   1.0  500.368
# 105771 -94.4164  33.2425  0.01  0.01  1.77   0.02   0.01    0.0    0.65   1.0  340.302
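The column names above are typed in by hand, but the question's own code shows that each grid_field element carries a name attribute. As a rough sketch, assuming the grid_field elements appear in the same order as the columns of grid_data, the headers could be read from the XML and the result written out as CSV. The miniature document below is a hypothetical stand-in for the real file (only three fields, two rows, and an assumed root element name), so the example runs without a download.

```python
import xml.etree.ElementTree as et
from io import StringIO

import pandas as pd

NS = "http://earthquake.usgs.gov/eqcenter/shakemap"

# Hypothetical, heavily shortened stand-in for the real grid.xml
sample_xml = f"""<shakemap_grid xmlns="{NS}">
  <grid_field name="LON" />
  <grid_field name="LAT" />
  <grid_field name="PGV" />
  <grid_data>
-99.6833 38.2891 0.04
-99.6166 38.2891 0.05
  </grid_data>
</shakemap_grid>"""

root = et.fromstring(sample_xml)

# Column names in document order (assumed to match the grid_data column order)
names = [field.get("name") for field in root.iter(f"{{{NS}}}grid_field")]

# Strip surrounding whitespace so no blank or space-only lines reach the parser
text = root.find(f".//{{{NS}}}grid_data").text.strip()

# header=None because grid_data itself contains no header row
df = pd.read_csv(StringIO(text), sep=r"\s+", header=None, names=names)

# Return the CSV as a string; pass a file path instead to write to disk
csv_text = df.to_csv(index=False)
print(csv_text)
```

For the real file, sample_xml would be replaced by the downloaded content (for example r.content from the request above), and df.to_csv would be given an output path.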