Scraping data from a website using IMPORTXML

Time: 2016-11-21 20:13:46

Tags: xml xpath google-sheets xquery xpath-2.0

<div id="ext-gen392" class="x-panel-body">
    <div class="identify multiline">
        <div class="item">
            <span class="larger-text">1566 GREENE AVENUE, Brooklyn 11237</span>
        </div>
        <div class="item" style="display:none">
            <span class="label">Alternate address from NYC Dept of City Planning:</span>
            <br>1566 GREENE AVENUE
        </div>
        <div class="item">
            <span style="background-color:#FFE094;" class="legend-color"></span><span class="label" style="font-style: italic;">&nbsp;Residential: Multi-Family Walk-up</span>
        </div>
        <div class="item" style="clear:both;">
            <span class="label">Owner:</span> BAEZ, IGNACIO
        </div>
        <div class="item">
            <span class="label">Block:</span> 3303 <span class="label">Lot:</span> 22
        </div>
        <div class="item">
            <span class="label">Property Characteristics:</span>
            <ul style="list-style-type: none; padding-left: 0;">
                <li><span class="label">Lot Area:</span> 1,950 sq ft (19.5' x 100')</li>
                <li><span class="label"># of Buildings:</span> 1 <span class="label">Year
                    built:</span> 1920 (Year built is an estimate)</li>
                <li><span class="label">Building frontage:</span> 19.5' <span class="faded-text">(Building frontage along the street measured in feet.)</span></li>
                <li><span class="label"># of floors:</span> 3 <span class="label">Building
                    Area:</span> 3,303 sq ft</li>
                <li><span class="label">Total Units:</span> 3 <span class="label">
                    Residential Units:</span> 3</li>
                <li><span class="label">Primary zoning:</span> R6 <span class="label">Commercial Overlay:</span>
                    None</li>
                <li><span class="label">Floor Area Ratio:</span> 1.69 
                    <br>
                    <span class="label">Max. Allowable Residential FAR:</span> 2.43
                    <br>
                    <span class="label">Max. Allowable Commercial FAR:</span> 0
                    <br>
                    <span class="label">Max. Allowable Facility FAR:</span> 4.8
                    <!--REMOVED MAX FAR UNTIL WE FIGURE OUT HOW TO ADD DIFFT FAR VARS FROM PLUTO13-->
                    <!--<span class="label">Max. FAR:</span> 0 -->
                    <span class="faded-text">
                        <br>
                        The Maximum Allowable Floor Area Ratios are exclusive of bonuses for plazas, plaza-connected open areas, arcades or other amenities.
                        <br>
                        FAR may depend on street widths or other characteristics. Contact <a href="http://www1.nyc.gov/site/planning/zoning/about-zoning.page" target="_blank">City Planning Dept.</a> for latest information.</span></li>
            </ul>
        </div>
        <div class="item">
            <span class="label">MORE INFO:</span>
            <ul>
                <li><span class="label">Zoning Map#:</span> <a href="http://www1.nyc.gov/assets/planning/download/pdf/zoning/zoning-maps/map13b.pdf" target="_blank">
                    13b</a> (<a href="http://www1.nyc.gov/site/planning/zoning/zoning-maps.page" target="_blank">how to read</a> NYC zoning maps)</li>
                <li><span class="label">Historical Zoning Maps:</span> <a href="http://www1.nyc.gov/assets/planning/download/pdf/zoning/zoning-maps/historical-zoning-maps/maps13b.pdf" target="_blank">
                    13b</a></li>

                <li><a href="http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet?boro=3&amp;block=3303&amp;lot=22" target="_blank">NYC Dept. of Buildings</a></li>


                <li><a href="http://a836-acris.nyc.gov/bblsearch/bblsearch.asp?borough=3&amp;block=3303&amp;lot=22" target="_blank">Property transaction records</a> (<b>NB:</b> buildings w/condos may not show transaction results)</li>

                <li><a href="http://webapps.nyc.gov:8084/CICS/fin1/find001i?FFUNC=C&amp;FBORO=3&amp;FBLOCK=3303&amp;FLOT=22" target="_blank">NYC Dept. of Finance Assessment Roll</a></li>
                <li><a href="https://hpdonline.hpdnyc.org/HPDonline/provide_address.aspx" target="_blank">NYC HPD data</a></li><!--?p1=3&p2=street number =&p3=street name-->
                <li><a href="http://gis.nyc.gov/doitt/nycitymap/template?z=8&amp;p=1008264,195724&amp;a=ZOLA&amp;c=ZOLA&amp;s=l:Brooklyn,3303,22,PLUTO" target="_blank">NYC Planning's ZoLa application</a></li> <!--http://gis.nyc.gov/doitt/nycitymap/template?z=8&p=988783,211983&a=ZOLA&c=ZOLA&s=a:365,FIFTH+AVENUE,MANHATTAN-->
                <li><a href="http://maps.nyc.gov/taxmap/map.htm?searchType=BblSearch&amp;featureTypeName=EVERY_BBL&amp;featureName=3033030022" target="_blank">NYC Digital Tax Map</a></li>
<!--                <li><a href="http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet?boro=3&block=3303&lot=22" target="_blank">NYC Dept. of Buildings</a></li>
                <li><a href="http://a836-acris.nyc.gov/bblsearch/bblsearch.asp?borough=3&block=3303&lot=22" target="_blank">Property transaction records</a></li>
                <li><a href="http://webapps.nyc.gov:8084/CICS/fin1/find001i?FFUNC=C&FBORO=3&FBLOCK=3303&FLOT=22" target="_blank">NYC Dept. of Finance Assessment Roll</a></li>
                <li><a href="http://gis.nyc.gov/taxmap/map.htm?searchType=FeatureSearch&featureTypeName=TAX_LOT_POLYGON&featureName=3033030022" target="_blank">NYC Digital Tax Map</a></li>-->
                <li><a href="http://www.nyc.gov/html/dcp/html/subcats/zoning.shtml" target="_blank">
                    NYC zoning guide</a></li>
                <li><a href="http://www.oasisnyc.net/watershed/watershed.aspx" target="_blank">NYC
                    Watershed Resources</a></li>
            </ul>
        </div>
        <div class="item">
            <span class="label">OASIS shortcut to this property:</span>
            <br>
            <a href="http://www.oasisnyc.net/map.aspx?zoomto=lot:3033030022">http://www.oasisnyc.net/map.aspx?zoomto=lot:3033030022</a>
        </div>
        <div class="item">
            <span class="faded-text">Source: MapPLUTO Tax
                Block &amp; Tax Lot files from the New York City Department of City Planning,
                2016 (ver. 16v1).</span>
        </div>
<!--        <div class="item" style="width: 95%; margin: 10px 0 5px 4px;">
            <span style="display:block;padding: 1px; color: #000066; background-color: #dddddd; border-bottom: solid 1px #aabbdd;">
                NYC Department of City Planning Census Factfinder
            </span>
            Find all census tracts within
            <select id="selTaxLotRadius" style="font-size:1.1em" >
                <option>0.25</option>
                <option>0.5</option>
                <option>1</option>
            </select>
            mile(s)
            <input type="button" value="Go" style="font-size:1.1em;font-weight:bold;" onclick="var sel=document.getElementById('selTaxLotRadius');CUR.IdentifyLotTemplate.goToNycFF('1566 GREENE AVENUE','3', sel.options[sel.selectedIndex].value);" />
        </div>-->
<!--        <div class="item">
            <div style="width: 95%; margin: 10px 0 5px 4px;">
                <div style="padding: 1px; color: #000066; background-color: #dddddd; border-bottom: solid 1px #aabbdd;">
                    <a href="http://local.yahoo.com/" style="text-decoration: none;"
                        target="newWin"><span style="color: #ff0000; font-weight: bold;">YAHOO!</span> <span style="color: #000066;">
                            Local</span></a> search results for this
                    address:</div>
                <div style="padding-left: 4px;">

                    <div style="margin-top: 4px; color: #888888; font-style: italic;">
                        &nbsp;Know of something that's missing? <a href="http://listings.local.yahoo.com/csubmit/index.php"
                            target="newWin">Add it to YAHOO!</a></div>
                </div>
            </div>
        </div>-->
    </div>
</div>

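I am scraping a website to collect some data about properties. I am trying to get the owner name, and eventually all of the other text values that follow a <span class="label">. Here is the query expression I have so far: normalize-space(//span[(@class='label') and contains(., 'Owner:')]/following-sibling::text()). I evaluated the expression with FirePath and it returned the correct string; in Google Sheets, however, the returned value is empty. Any suggestions?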
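For reference, the text that follows each label span is a sibling text node of that span, which is what following-sibling::text() selects. The sketch below shows the same idea in Python, where xml.etree.ElementTree exposes that text as the span's tail. It assumes a small, well-formed fragment copied from the markup above (the FRAGMENT constant and the label_values helper are illustrative names), and it does not address why Google Sheets returns an empty value.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed excerpt of the markup shown above.
FRAGMENT = """
<div class="identify multiline">
    <div class="item"><span class="label">Owner:</span> BAEZ, IGNACIO</div>
    <div class="item"><span class="label">Block:</span> 3303 <span class="label">Lot:</span> 22</div>
</div>
"""


def label_values(fragment):
    """Map each label span's text to the text node that follows it."""
    root = ET.fromstring(fragment)
    pairs = {}
    for span in root.iter("span"):
        if span.get("class") != "label":
            continue
        # Collapse whitespace, like normalize-space(), and drop the trailing colon.
        label = " ".join((span.text or "").split()).rstrip(":")
        # span.tail is the text between this span and the next sibling element.
        value = " ".join((span.tail or "").split())
        pairs[label] = value
    return pairs


if __name__ == "__main__":
    print(label_values(FRAGMENT))
```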

1 answer:

Answer 0 (score: 1)

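You can do this by slightly changing the URL you query. For example, I found that the endpoint containing the raw data you need looks like this: http://www.oasisnyc.net/service.svc/lot/3033030022?layerstoselect=

With the original URL in A1, the following formula converts it into the correct endpoint: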

="http://www.oasisnyc.net/service.svc/lot/"&REGEXEXTRACT(A1,"lot:(\d+)")&"?layerstoselect="
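The same transformation can be sketched outside of Sheets. The Python below mirrors the formula: it pulls the digits after "lot:" from the original URL and places them in the endpoint pattern quoted above. The function name lot_endpoint and the SERVICE_BASE constant are illustrative, and the input URL is the OASIS shortcut link from the page markup; nothing here contacts the network.

```python
import re

# Endpoint prefix taken from the answer above.
SERVICE_BASE = "http://www.oasisnyc.net/service.svc/lot/"


def lot_endpoint(page_url):
    """Build the data endpoint from a URL containing 'lot:<digits>'."""
    match = re.search(r"lot:(\d+)", page_url)
    if match is None:
        raise ValueError("no lot id found in %r" % page_url)
    return SERVICE_BASE + match.group(1) + "?layerstoselect="


if __name__ == "__main__":
    print(lot_endpoint("http://www.oasisnyc.net/map.aspx?zoomto=lot:3033030022"))
```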

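If you pull in the data with =transpose(IMPORTDATA(B1)), you will see a column containing all of the fields. Depending on how you want to arrange the data, you can then use arrayformula and similar functions to clean up and split the data as needed. For example, if you want the headers and the data in separate columns, you can enter: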

=arrayformula(regexreplace({iferror(ARRAYFORMULA(REGEXEXTRACT(transpose(IMPORTDATA(B1)),"(\w+):"))),iferror(ARRAYFORMULA(REGEXEXTRACT(transpose(IMPORTDATA(B1)),":""?(.*)""?")))},"""",""))
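To make the two regular expressions in that formula easier to follow, here is a rough Python equivalent applied to one cell at a time. The cell strings in the usage lines are hypothetical; the actual text returned by IMPORTDATA for this endpoint is not shown in this thread, so treat the input shape as an assumption. The split_cell name is also illustrative.

```python
import re


def split_cell(cell):
    """Return (header, value) for one imported cell, with quotes removed.

    Mirrors the formula: "(\\w+):" captures the header, ':"?(.*)"?' captures
    the value, and a failed match becomes an empty string (like IFERROR).
    """
    header = re.search(r"(\w+):", cell)
    value = re.search(r':"?(.*)"?', cell)
    header_text = header.group(1) if header else ""
    value_text = value.group(1) if value else ""
    # Equivalent of REGEXREPLACE(..., """", ""): drop any leftover quotes.
    return header_text.replace('"', ""), value_text.replace('"', "")


if __name__ == "__main__":
    # Hypothetical cells, shaped as the formula appears to expect.
    for cell in ['Block:"3303"', "Lot:22", "no separator"]:
        print(split_cell(cell))
```

The greedy `(.*)` keeps a trailing quote in the captured value, which is why the final quote-stripping step is needed.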


If you want the result transposed into a row, wrap the whole thing in transpose:

 =transpose(arrayformula(regexreplace({iferror(ARRAYFORMULA(REGEXEXTRACT(transpose(IMPORTDATA(B1)),"(\w+):"))),iferror(ARRAYFORMULA(REGEXEXTRACT(transpose(IMPORTDATA(B1)),":""?(.*)""?")))},"""","")))
