How can I extract all URLs (links) from a web page in plain-text form?

Time: 2015-03-19 03:46:50

Tags: php html regex string web-crawler

Basically, I want to extract all URLs from a web page, even if they are not clickable links.

For example, the page source might be:

<html>

<title>Random Website I am Crawling</title>

<body>

Click <a href="http://clicklink.com">here</a> for foobar

Another site is http://foobar.com

</body>

</html>

I would like both URLs to be displayed:

http://clicklink.com and http://foobar.com

I also don't want the surrounding markup (the anchor tag itself) to be included.

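My current script grabs the URLs, but it also seems to grab a bunch of other junk, namely the markup that makes the links clickable, and that cannot be stored in the database.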

Here is my current code.

<?php

$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false, 
                                                                                                PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));

$url="http://www.frozencpu.com/";
$data=file_get_contents($url);
$data = strip_tags($data,"<a>");
$d = preg_split("/<\/a>/",$data);
foreach ( $d as $k=>$u ){
    if( strpos($u, "<a href=") !== FALSE ){
    //echo $u;
    //echo "<BR>";
        $u = preg_replace("/.*<a\s+href=\"/sm","",$u);
        $u = preg_replace("/\".*/","",$u);
        //echo $u;
        //echo "<BR>";
        $db->exec("INSERT INTO urls(url, crawled) VALUES('$u', '0')");
    }
}

?>

Here is a sample of the output:

http://www.facebook.com/pages/FrozenCPUcom/351841771499<BR>http://twitter.com/FrozenCPU<BR>/rss/frozencpu.rss<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cart.html?id=CR9RnD2g<BR>http://www.frozencpu.com/account.html?id=CR9RnD2g<BR>http://www.frozencpu.com/tracking.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/manage_carts.html?id=CR9RnD2g<BR>

It seems fine up to this point.

Then it just junks up big time

&nbsp;&nbsp;<a href='http://www.frozencpu.com/advanced_search.html?id=CR9RnD2g' class=small>Advanced Search<BR>http://www.frozencpu.com/brands/shop_by_brand.html?id=CR9RnD2g<BR>http://www.frozencpu.com/shop_category.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g30/Liquid_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g57/EK_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g59/XSPC_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g60/LutroO_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g12/Accessories.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g40/Air_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g53/Apparel.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g34/Bay_Devices.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g54/Cabinet_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g2/Cables.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g32/Caffeine.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g1/Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g58/CaseLabs_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g45/Custom_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g43/Case_Parts-OEM.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g51/Connectors.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g48/CPU_Heatsinks.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g44/DIYMod_Parts.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g4/Electronics.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g36/Fans.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g47/Fan_Accessories.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g39/Gaming.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g6/Lighting.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g49/Phase_Change.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g11/Power_Supplies.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g55/Screws.html?id=CR9RnD2g<BR>http://w
ww.frozencpu.com/cat/l1/g35/SleevingHeatshrink.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g7/Sound_Dampening.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g52/Switches.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g8/Thermal_Interface.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g31/Travel_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g33/Ultra_Quiet.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g42/Window_Kits.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g50/Custom_Services.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?enable=1&id=CR9RnD2g<BR>http://www.frozencpu.com/products/2770/gc-01/Gift_Certificate.html?id=CR9RnD2g<BR>http://www.frozencpu.com/rebates.html?id=CR9RnD2g<BR>http://www.frozencpu.com/aboutus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/resource.html?id=CR9RnD2g<BR>http://www.frozencpu.com/career.html?id=CR9RnD2g<BR>http://www.frozencpu.com/clearance/list/p1/Clearance-Page1.html?id=CR9RnD2g<BR>http://www.frozencpu.com/contactus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/news.html?id=CR9RnD2g<BR>http://www.frozencpu.com/links.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>http://www.frozencpu.com/media.html?id=CR9RnD2g<BR>http://www.frozencpu.com/account.html?id=CR9RnD2g<BR>http://www.frozencpu.com/manage_carts.html?view_cart=Wish%2dList&wish_list=1&id=CR9RnD2g<BR>http://www.frozencpu.com/new_products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/powder_coating.html?id=CR9RnD2g<BR>http://www.frozencpu.com/press.html?id=CR9RnD2g<BR>http://www.frozencpu.com/rebates.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cart.html?id=CR9RnD2g<BR>http://www.frozencpu.com/sitemap.html?id=CR9RnD2g<BR>http://www.frozencpu.com/testimonials.html?id=CR9RnD2g<BR>http://www.frozencpu.com/tracking.html?id=CR9RnD2g<BR>http://www.frozencpu.com/stores.html?id=CR9RnD2g<BR>


            <a href='http://www.facebook.com/pages/FrozenCPUcom/351841771499' target=<BR>
            <a href='http://twitter.com/FrozenCPU' target=<BR>
            <a href='/rss/frozencpu.rss' target=<BR>https://www.resellerratings.com
<BR>https://www.securitymetrics.com/sitecertsummary.adp?s=67%2e228%2e74%2e232&amp;i=340380<BR>mailto:lori@frozencpu.com?subject=WESTERN%20UNION<BR>http://www.frozencpu.com/products/23382/ex-wat-303/XSPC_Raystorm_RX240_V3_Extreme_Universal_CPU_Water_Cooling_Kit_w_D5_Variant_Pump_Included_and_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/23382/ex-wat-303/XSPC_Raystorm_RX240_V3_Extreme_Universal_CPU_Water_Cooling_Kit_w_D5_Variant_Pump_Included_and_Free_Dead-Water.html?id=CR9RnD2g

                                    The XSPC Raystorm RX240 V3 Universal CPU Water Cooling Kit comes complete with everything you will need to cool your CPU.  This kit is designed to handle your CPU and can be expanded to handle more blocks as well.

The kit uses the newest XSPC CPU block, the Raystorm as the core cooling component.  This block has a pure copper base and is a top o...
                                    3 In Stock, Ships Today Till 6pm EST
                                    $259.99
                                <BR>http://www.frozencpu.com/products/17220/ex-wat-223/XSPC_Copper_Raystorm_AX240_Extreme_Intel_CPU_Water_Cooling_Kit_w_Twin_D5_w_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/17220/ex-wat-223/XSPC_Copper_Raystorm_AX240_Extreme_Intel_CPU_Water_Cooling_Kit_w_Twin_D5_w_Free_Dead-Water.html?id=CR9RnD2g

                                    The RayStorm Copper Twin D5 AX240 kit is the most powerful 240 kit XSPC have ever made. It includes a special Copper edition of our RayStorm block, our fantastic new AX240 radiator and two D5 Vario pumps in series.

The RayStorm Copper has the same great performance as our award winning RayStorm block, but with an all metal design. The acetal top...
                                    7 In Stock, Ships Today Till 6pm EST
                                    $399.99
                                <BR>http://www.frozencpu.com/products/22914/cas-495/PrimoChill_Hasher_-_Rugged_Crypto_Stackable_Mining_Rack_R-HRC.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/22914/cas-495/PrimoChill_Hasher_-_Rugged_Crypto_Stackable_Mining_Rack_R-HRC.html?id=CR9RnD2g

                                    PrimoChill once again provides a good lookin, easy solution to the unimaginable. Introducing, one hell of a crypto rack, The Hasher! 

Built out of rugged, 1in anodized extruded aluminum t-slot, the PrimoChill Hasher is tough but cool enough to keep out of the basement. It combines not only functionality but order to the chaos that other mining r...
                                    5 In Stock, Ships Today Till 6pm EST
                                    $129.99
                                <BR>http://www.frozencpu.com/products/13815/ele-933/Add2PSU_Multiple_Power_Supply_Adapter.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/13815/ele-933/Add2PSU_Multiple_Power_Supply_Adapter.html?id=CR9RnD2g

                                    Small, lightweight, and true Plug N Play, the Add2Psu adapter allows you to add more power to your computer. No cutting wires or soldering, no compromising the integrity or function of your PC.

Now there is a way to add more power to your PC. Finally a true plug and play way to manage additional power for those big video cards, bigger hard drive...
                                    290 In Stock, Ships Today Till 6pm EST
                                    $19.95
                                <BR>http://www.frozencpu.com/products/25635/ex-wat-335/Larkooler_SkyWater_330L_All-In-One_Liquid_Cooling_Kit_LCS0030.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25635/ex-wat-335/Larkooler_SkyWater_330L_All-In-One_Liquid_Cooling_Kit_LCS0030.html?id=CR9RnD2g

                                    The SkyWater 330L is a new liquld cooling system with a variable speed pump and Fans in desktop PC. The water cooling system is designed for the best thermal solution of CPU, the most important component of your PC. The SkyWater 330L provides a low noise at  low speed fans , high performance at high speed fans and reliable liquid cooling system.

...
                                    4 In Stock, Ships Today Till 6pm EST
                                    $129.99
                                <BR>http://www.frozencpu.com/products/26337/ex-blc-1942/Aquacomputer_Kryographics_GTX_980_Full_Coverage_Liquid_Cooling_Block_-_Copper_Acrylic_Glass_23614.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26337/ex-blc-1942/Aquacomputer_Kryographics_GTX_980_Full_Coverage_Liquid_Cooling_Block_-_Copper_Acrylic_Glass_23614.html?id=CR9RnD2g

                                    Combined GPU/RAM/VRM-cooler for graphics cards of the type nvidia GTX 980 with 4 GB RAM according to reference design.
This cooler combines the features of a graphics chip cooler and RAM-coolers in an elegant and very flat watercooler. Additionally the voltage regulators are also cooled effectively.

The kryographics for GTX 980 water block offe...
                                    5 In Stock, Ships Today Till 6pm EST
                                    $129.99
                                <BR>http://www.frozencpu.com/products/19760/bus-348/Lamptron_CW611_36W_-_6_Channel_Aluminum_Liquid_Cooling_Controller_-_Black_CW611.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/19760/bus-348/Lamptron_CW611_36W_-_6_Channel_Aluminum_Liquid_Cooling_Controller_-_Black_CW611.html?id=CR9RnD2g

                                    Introducing the Lamptron CW611 Water Cooling fan controller! The first in a series of advanced control 5.25&#8243; bay devices that allow complete control over your entire PC cooling system.  You can use this controller to be used with fans, liquid cooling pumps, as well as flow meters.  The first in a new series of controllers this is sure to get ...
                                    52 In Stock, Ships Today Till 6pm EST
                                    $99.99
                                <BR>http://www.frozencpu.com/products/9350/fan-583/Noiseblocker_NB-BlackSilentFan_XM2_40mmx10mm_Ultra_Quiet_Fan_-_3800_RPM_-_14_dBA.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/9350/fan-583/Noiseblocker_NB-BlackSilentFan_XM2_40mmx10mm_Ultra_Quiet_Fan_-_3800_RPM_-_14_dBA.html?id=CR9RnD2g

                                    The Noiseblocker NB-BlackSilentFan XM2 40mmx10mm Ultra Quiet Fan, manufactured by Noiseblocker, Germany's quietest fan manufacturer, the BlackSilentFan series features extraordinary life spans and near silent operation.  Using the NB-Longlife advanced sleeve bearing and matched with the NB-EKA drive, the BlackSilentFan series runs more than double ...
                                    20 In Stock, Ships Today Till 6pm EST
                                    $12.95
                                <BR>http://www.frozencpu.com/products/25250/cst-1779/Phanteks_Enthoo_Luxe_Full_Tower_Chassis_w_Window_-_White_PH-ES614L_WT.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25250/cst-1779/Phanteks_Enthoo_Luxe_Full_Tower_Chassis_w_Window_-_White_PH-ES614L_WT.html?id=CR9RnD2g

                                    Staying true to the Phanteks’ Enthoo line, the Luxe features a sandblasted front and top panel. Ambient lighting run from top to front of the case on both sides. Even though smaller in size, the Enthoo Luxe boost many features from the award-winning Enthoo Primo. The Luxe comes pre-installed with a 200mm front fan and 2x PH-F140SP fans. Phanteks’ E...
                                    In Stock, Ships Today Till 6pm EST
                                    $159.99
                                <BR>http://www.frozencpu.com/products/25721/ex-wat-337/MagiCool_DIY_Complete_Single_120mm_Liquid_Cooling_Kit_MC-G12V1.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25721/ex-wat-337/MagiCool_DIY_Complete_Single_120mm_Liquid_Cooling_Kit_MC-G12V1.html?id=CR9RnD2g

                                    The MagiCool DIY Complete Liquid Cooling Kit comes with everything you need to set your system up on liquid.  The CPU block is compatible with all current sockets giving you flexibility for now and for future upgrades as well.  The radiator is a slim profile variant allowing for maximum case compatibility.
Compression fittings are provided for dur...
                                    5 In Stock, Ships Today Till 6pm EST
                                    $124.99
                                <BR>http://www.frozencpu.com/products/26065/ex-blc-1936/Alphacool_NexXxoS_GPX_Nvidia_Geforce_GTX_970_M03_Liquid_Cooling_Blockw_Backplate_11199.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26065/ex-blc-1936/Alphacool_NexXxoS_GPX_Nvidia_Geforce_GTX_970_M03_Liquid_Cooling_Blockw_Backplate_11199.html?id=CR9RnD2g

                                    With the new NexXxoS GPX coolers Alphacool is again a step ahead! Optimum performance and quality in a new cooling design for a great price!

A new sophisticated injection system means the GPU is actively cooled. All other chips are sufficiently cooled by the passive cooler which is also in contact with the watercooling block for extra efficiency...
                                    3 In Stock, Ships Today Till 6pm EST
                                    $94.99
                                <BR>http://www.frozencpu.com/products/14175/bus-285/Alphacool_Heatmaster_II_Liquid_Cooling_PCB_Control_Board_26153.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/14175/bus-285/Alphacool_Heatmaster_II_Liquid_Cooling_PCB_Control_Board_26153.html?id=CR9RnD2g

                                    The new generation of cooling control from Alphacool: The Heatmaster II

The new Alphacool Heatmaster II was developed in Germany over multiple years, and has continuously been improved considering the experiences from the first version. Hence we are now, after a development and testing period of almost 3 years, able to present the best Heatmaste...
                                    4 In Stock, Ships Today Till 6pm EST
                                    $84.99
                                <BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g

                                    EK ZMT (Zero Maintainance Tubing) is a high quality, zero maintainance industrial grade EPDM rubber tubing in stylish matte black.

This tubing is - just like Norprene - designed to withstand harsh conditions for a very long period of time, offering a truly exceptional lifespan even under UV, ozone and heat exposure for many years.

Unlike most...
                                    62 In Stock, Ships Today Till 6pm EST
                                    $2.50
                                <BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g

                                    The XSPC Raystorm DDC Photon EX360 Universal CPU Water Cooling Kit comes complete with everything you will need to cool your CPU.  This kit is designed to handle your CPU and can be expanded to handle more blocks as well.

The kit uses the newest XSPC CPU block, the Raystorm as the core cooling component.  This block has a pure copper base and is...
                                    5 In Stock, Ships Today Till 6pm EST
                                    $254.99
                                <BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g

                                    A new generation of fans joins the Alphacool range. The Susurro, Spanish for Whisper.

A fundamental review of known fan designs was used to manufacture the Susurro. The perfect harmony between the AlphaCool blue and deep blacks make a great impression. The transparent black fan is optimized to cause virtually no noise.

But don’t be persuaded ...
                                    2 In Stock, Ships Today Till 6pm EST
                                    $14.99
                                <BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g

                                    The best Alphacool reservoir mounts of all times!

Many reservoir mounts were designed for the original tube reservoirs from the beginning of the PC water cooling sector. During the last years though, the reservoirs became larger, sized for more capacity and metal was integrated for the end caps. This resulted in heavier reservoirs, making the co...
                                    1 In Stock, Ships Today Till 6pm EST
                                    $10.99
                                <BR>http://www.frozencpu.com/news.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?gu=1&id=CR9RnD2g<BR>http://www.frozencpu.com/help/h25/Ordering_with_a_PO.html?id=CR9RnD2g<BR>http://www.frozencpu.com/testimonials.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/sitemap.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/contactus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/problem.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h15/Legal.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h13.html?id=CR9RnD2g<BR>http://www.getfirefox.com<BR>

2 answers:

Answer 0 (score: 2)

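If you want to look at all URLs inside <a> tags, note in particular that the href attribute is not always the first thing inside the tag, so a check for <a href= misses some links. A tag like this one would be ignored:

<a target=_blank href=http://google.com>

If you want to find all URLs regardless of context, you can ignore the tags and look for a general URL format, like this:

$urls = preg_match_all('/[a-z]+:\/\/[a-zA-Z0-9?+.=%:\/]+/', $content, $matches);

This probably needs a lot of polishing, but it should get you started. Note that preg_match_all returns the number of matches, so the URLs themselves end up in $matches[0]. Note also that this only matches full URLs; links to relative pages such as <a href="index.html"> will obviously not be matched.

Since regular expressions are not a recommended solution for parsing HTML, I fear you will need to resort to a more appropriate solution, such as an HTML parser, to comb through the page and find URLs adequately.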
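A minimal sketch of the parser-based approach, assuming PHP's bundled DOMDocument as the parser (the answer does not name a specific one). The function name extractUrls and the sample markup are made up for illustration. It reads href attributes wherever they appear inside an <a> tag, then scans the remaining page text for URLs that are not links.

```php
<?php
// Hypothetical helper (not from the original answer): collects absolute
// http/https/ftp URLs from <a href> attributes and from plain page text.
function extractUrls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep parser warnings off the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];

    // 1. href attributes, regardless of their position inside the tag.
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if (preg_match('~^(?:https?|ftp)://~i', $href)) {
            $urls[] = $href;
        }
    }

    // 2. URLs written as plain text; textContent is the page text without tags.
    $root = $doc->documentElement;
    $text = $root !== null ? $root->textContent : '';
    if (preg_match_all('~\b(?:https?|ftp)://[^\s<>"\']+~i', $text, $matches)) {
        foreach ($matches[0] as $url) {
            // Drop punctuation that usually belongs to the sentence, not the URL.
            $urls[] = rtrim($url, '.,;:!?)');
        }
    }

    return array_values(array_unique($urls));
}

$sample = '<html><title>Random Website I am Crawling</title><body>'
    . 'Click <a target="_blank" href="http://clicklink.com">here</a> for foobar '
    . 'Another site is http://foobar.com</body></html>';

print_r(extractUrls($sample));
```

Relative links such as /rss/frozencpu.rss are skipped on purpose in this sketch; resolving them against the page URL would be a separate step.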

Answer 1 (score: 1)

To match all kinds of URLs, the following code may help:

<?php

$content = '<html>

<title>Random Website I am Crawling</title>

<body>

Click <a href="http://clicklink.com">here</a> for foobar

Another site is http://foobar.com

</body>

</html>';

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$regex .= "(\:[0-9]{2,5})?"; // Port
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor


$matches = array(); //create array
$pattern = "/$regex/";

preg_match_all($pattern, $content, $matches); 

print_r(array_values(array_unique($matches[0])));
echo "<br><br>";
echo implode("<br>", array_values(array_unique($matches[0])));


/*
 * With your code
*/

$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false, 
                                                                                                PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));
$url="http://www.frozencpu.com/";
$data=file_get_contents($url);
$matches = array();

preg_match_all($pattern, $data, $matches);
$array = array_values(array_unique($matches[0]));

// Bind each URL as a parameter rather than interpolating it into the SQL string
$stmt = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')");
foreach ($array as $foundUrl) {
    $stmt->execute(array($foundUrl));
}

?>
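Two properties of the pattern above are worth keeping in mind. Every part except the host is optional, so strings such as style.css also match, and without the i flag the [a-z] classes stop at the first uppercase letter. The sketch below is a deliberately simplified alternative, not the answer's pattern: it requires a scheme and matches case-insensitively. The function name findAbsoluteUrls and the input string are assumptions made for illustration.

```php
<?php
// Hypothetical helper: a simplified pattern with a mandatory scheme.
function findAbsoluteUrls(string $text): array
{
    $pattern = '~(?:https?|ftp)://'      // scheme is mandatory here
        . '[a-z0-9.-]+\.[a-z]{2,}'       // host name ending in a TLD-like label
        . '(?::\d{2,5})?'                // optional port
        . '(?:/[^\s"\'<>]*)?~i';         // optional path, query and fragment

    preg_match_all($pattern, $text, $matches);
    return array_values(array_unique($matches[0]));
}

// Illustrative input (made up for this example).
$text = 'See http://Example.com/Path/page.html?id=1 or the file style.css';
print_r(findAbsoluteUrls($text));
```

With this input only the full URL is returned; style.css is ignored because it has no scheme.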

Here is the updated code. It seems to work, but it is extremely slow.

<?php

$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false, 
                                                                                                PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));

$url="http://proxylists.connectionincognito.com/";
$content=file_get_contents($url);

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$regex .= "(\:[0-9]{2,5})?"; // Port
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor


$matches = array(); //create array
$pattern = "/$regex/";

preg_match_all($pattern, $content, $matches); 

$unique = array_unique($matches[0]);

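// Prepare both statements once, outside the loop, and bind the URL as a
// parameter instead of interpolating it into the SQL string.
$select = $db->prepare("SELECT 1 FROM urls WHERE url = ?");
$insert = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')");

foreach ($unique as $url) {
    // Insert only if the URL is not stored yet
    $select->execute(array($url));
    $exists = $select->fetchColumn();
    $select->closeCursor();

    if ($exists === false) {
        $insert->execute(array($url));
    }
}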
?>
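If the database round trips contribute to the slowness (an assumption; the regular expression itself may be the larger cost), one option is to check the whole batch with a single SELECT and insert the missing rows inside one transaction. The helper name insertNewUrls is hypothetical, and the sketch assumes the urls(url, crawled) table from the question.

```php
<?php
/**
 * Hypothetical helper: stores only the URLs that are not in the table yet.
 * Assumes a table urls(url, crawled) as in the question.
 * Returns the number of rows inserted.
 */
function insertNewUrls(PDO $db, array $urls): int
{
    $urls = array_values(array_unique($urls));
    if ($urls === []) {
        return 0;
    }

    // One SELECT for the whole batch instead of one per URL.
    $placeholders = implode(',', array_fill(0, count($urls), '?'));
    $select = $db->prepare("SELECT url FROM urls WHERE url IN ($placeholders)");
    $select->execute($urls);
    $existing = array_flip($select->fetchAll(PDO::FETCH_COLUMN));

    // Prepare the INSERT once and reuse it; one transaction for all rows.
    $insert = $db->prepare('INSERT INTO urls (url, crawled) VALUES (?, 0)');
    $inserted = 0;
    $db->beginTransaction();
    try {
        foreach ($urls as $url) {
            if (!isset($existing[$url])) {
                $insert->execute([$url]);
                $inserted++;
            }
        }
        $db->commit();
    } catch (Throwable $e) {
        $db->rollBack();
        throw $e;
    }

    return $inserted;
}
```

It would be called as insertNewUrls($db, $matches[0]) after preg_match_all. For very large pages the list may need to be split into chunks, since databases limit the number of placeholders in one statement.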

Reference:

  

http://php.net/manual/en/function.preg-match.php
