How do I extract URLs from a text file in Python?

Time: 2017-01-19 19:16:17

Tags: python python-2.7 text-files

I have a text file full of URLs and other text, and I want to extract the URLs that come right after:
thumbnailUrl\": \

I used this code:

def get_net_target(page):
    start_link=page.find("thumbnailUrl")
    start_quote=page.find('"',start_link)
    end_quote=page.find('"',start_quote+1)
    url=page[start_quote+1:end_quote]
    print url

my_file = open("data.txt")
page = my_file.read()

print(get_net_target(page))

I want output like this:

https://tse3.mm.bing.net///th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api\
https:\\/\\/tse1.mm.bing.net\\/th?id=OIP.M7ff1f4e880bac2c244c0b6a286cee669o2&pid=Api\

...

But all I get is:

None

A few lines of the data:

webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=RUc0BARkL2P78A5CI7XPWqhCYAA2XaQLP-fHGdfODEY&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3d97C5A1ECB43BCDC1B5739F49555CE0C75CEDF83F%26simid%3d607996336242885612&p=DevEx,5006.1\", \"thumbnailUrl\": \"https:\\/\\/tse2.mm.bing.net\\/th?id=OIP.Me19820ab68b4bcc7ec82756b2b5ecffbo1&pid=Api\", \"datePublished\": \"2011-07-08T12:00:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=gA9S9qCIF1jvD5yA4V9VOqfrJUxdW2_wyacSDR15Yc8&v=1&r=http%3a%2f%2fwww.forumpakistan.com%2fimages%2fcelebrity-profiles%2fShoaib-Malik-1.jpg&p=DevEx,5008.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=IODAmtxi3pYzDGhiJcJgCv0fWHEq8hlJauGxRW5o2c4&v=1&r=http%3a%2f%2fok-khan.blogspot.com%2f2011%2f07%2fshoaib-malik.html&p=DevEx,5007.1\", \"contentSize\": \"48445 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"ok-khan.blogspot.com\\/2011\\/07\\/shoaib-malik.html\", \"width\": 500, \"height\": 647, \"thumbnail\": {\"width\": 231, \"height\": 300}, \"imageInsightsToken\": \"ccid_4Zggq2i0*mid_97C5A1ECB43BCDC1B5739F49555CE0C75CEDF83F*simid_607996336242885612\", \"imageId\": \"97C5A1ECB43BCDC1B5739F49555CE0C75CEDF83F\", \"accentColor\": \"3A6491\"}, {\"name\": \"Pakistani Crickert Player: Shoaib Malik\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=4qc04BUbtNDwiCHco5m3IY_YFqKVaY2q8ZWhX-DvFQs&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3dF690295FD18526BA8225367169A0664405923A09%26simid%3d608039315980946676&p=DevEx,5012.1\", \"thumbnailUrl\": 
\"https:\\/\\/tse3.mm.bing.net\\/th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api\", \"datePublished\": \"2012-12-24T12:00:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=9psh5pXKn2R_2Zn4-iMzpjDFePVuLSNVJhbVjf2uTI0&v=1&r=http%3a%2f%2fi1.tribune.com.pk%2fwp-content%2fuploads%2f2010%2f10%2fshoaib-malik-640x480.jpg&p=DevEx,5014.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=-cUvEUoDmZ1OAI-PVQc4MOfS-ELdt5Im521SJ2ZP4j8&v=1&r=http%3a%2f%2fpakistanicricketplayr44410.blogspot.com%2f2012%2f12%2fshoaib-malik.html&p=DevEx,5013.1\", \"contentSize\": \"51986 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"pakistanicricketplayr44410.blogspot.com\\/2012\\/12\\/shoaib-malik.html\", \"width\": 640, \"height\": 480, \"thumbnail\": {\"width\": 300, \"height\": 225}, \"imageInsightsToken\": \"ccid_y7VohZKB*mid_F690295FD18526BA8225367169A0664405923A09*simid_608039315980946676\", \"imageId\": \"F690295FD18526BA8225367169A0664405923A09\", \"accentColor\": \"98AE1D\"}, {\"name\": \"Pakistani Cricket Players: Shoaib Malik\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=n2Lkz5bg7h-AgbmZE4SnL-_AFBcCgc-_vaiVeAuC84s&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3d320A83F8A63DED3BD4B4EF926CAA3BE901F9DEA2%26simid%3d608028569977424814&p=DevEx,5018.1\", \"thumbnailUrl\": \"https:\\/\\/tse3.mm.bing.net\\/th?id=OIP.Mb6ca65eda578c80e71f4c3b3193c5b67H1&pid=Api\", \"datePublished\": \"2011-04-17T12:00:00\", \"contentUrl\": 
\"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=TwpcQHy-RdAJUStMisg6zBtjt_j60EStRFRAJS1D69Q&v=1&r=http%3a%2f%2fimages.teamtalk.com%2f08%2f10%2f800x600%2fShoaib-Malik_1264846.jpg&p=DevEx,5020.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=xICbhyFdmUBblBavcA3pXPdpbOa-1bJuBvP5H6Z0kms&v=1&r=http%3a%2f%2fcricketplayerspk.blogspot.com%2f2011%2f04%2fshoaib-malik.html&p=DevEx,5019.1\", \"contentSize\": \"51243 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"cricketplayerspk.blogspot.com\\/2011\\/04\\/shoaib-malik.html\", \"width\": 800, \"height\": 600, \"thumbnail\": {\"width\": 300, \"height\": 225}, \"imageInsightsToken\": \"ccid_tspl7aV4*mid_320A83F8A63DED3BD4B4EF926CAA3BE901F9DEA2*simid_608028569977424814\", \"imageId\": \"320A83F8A63DED3BD4B4EF926CAA3BE901F9DEA2\", \"accentColor\": \"416838\"}, {\"name\": \"Shoaib Malik in line for Test comeback after 5 years - Sports\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=7CIa0gvwncEquihLMmMIvtYAAUYZutf8EQr57d8EDO0&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3d8045A5C7203C2203C8238D9E00905FCB328BD4D9%26simid%3d608033376034882300&p=DevEx,5024.1\", \"thumbnailUrl\": \"https:\\/\\/tse2.mm.bing.net\\/th?id=OIP.M65fe5bf16283dc466e93650fbaef1205o1&pid=Api\", \"datePublished\": \"2015-10-06T04:07:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=F2RLPPSfrErnxq7OZt_3mbKbvpJITet7f_kGd90aKlg&v=1&r=http%3a%2f%2fimages.mid-day.com%2fimages%2f2015%2foct%2f6Shoaib-Malik-1.jpg&p=DevEx,5026.1\", \"hostPageUrl\": 
\"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=3V02TER99J6fm2eshh_cv4NCdJELV1DpI1pOmALtDMQ&v=1&r=http%3a%2f%2fwww.mid-day.com%2farticles%2fshoaib-malik-in-line-for-test-comeback-after-5-years%2f16586181&p=DevEx,5025.1\", \"contentSize\": \"119997 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"www.mid-day.com\\/articles\\/shoaib-malik-in-line-for-test-comeback...\", \"width\": 670, \"height\": 746, \"thumbnail\": {\"width\": 269, \"height\": 300}, \"imageInsightsToken\": \"ccid_Zf5b8WKD*mid_8045A5C7203C2203C8238D9E00905FCB328BD4D9*simid_608033376034882300\", \"imageId\": \"8045A5C7203C2203C8238D9E00905FCB328BD4D9\", \"accentColor\": \"304987\"}, {\"name\": \"Gallery > Cricketers > Shoaib Malik > Shoaib Malik high quality! Free ...\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=A9FD1ucKtYszoNQZ2KEhYMvgMwvJ6AA5d-DFInyr9I4&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3dB7AD00B57D67FD1664C7BBA404FF6E2679019517%26simid%3d608007657767896024&p=DevEx,5030.1\", \"thumbnailUrl\": \"https:\\/\\/tse3.mm.bing.net\\/th?id=OIP.M5d9fb4d528228cb5c8b9748bff10365bo1&pid=Api\", \"datePublished\": \"2013-05-18T00:44:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=7jwPNSK-kjHNAXQmqBqznMWCB3u4YPz0uHDFoJizw1U&v=1&r=http%3a%2f%2fpak101.com%2fgallery%2fCricketers%2fShoaib_Malik%2f2011%2f9%2f22%2fShoaib_Malik_Picture_9_xmnqf.jpg&p=DevEx,5032.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\

1 answer:

Answer 0 (score: 0)

This code demonstrates two approaches. The first is similar to yours; the second shows a simpler method that uses a regular expression.

The first approach is worth learning, but the trick is to keep track of your position in the string you are parsing.
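As a minimal sketch of that position tracking, the function below passes a start index to `str.find` instead of slicing the string on every pass. The name `find_thumbnail_urls`, its `key` parameter, and the sample string are made up for illustration and are not from the question. It also returns the URLs rather than printing them: the function in the question has no `return` statement, so `print(get_net_target(page))` prints `None`.

```python
def find_thumbnail_urls(text, key='thumbnailUrl'):
    """Return the URL that follows each occurrence of key (hypothetical helper)."""
    urls = []
    pos = 0  # absolute index in text where the next search starts
    while True:
        key_at = text.find(key, pos)
        if key_at == -1:
            break  # no more keys
        url_start = text.find('http', key_at + len(key))
        if url_start == -1:
            break  # key without a URL after it
        url_end = text.find('"', url_start)
        if url_end == -1:
            break  # unterminated value
        urls.append(text[url_start:url_end])
        pos = url_end  # resume after the URL just collected
    return urls


# Illustrative input, already unescaped; not taken from data.txt.
sample = ('"thumbnailUrl": "https://a.example/th?id=1", '
          '"contentUrl": "http://b.example/x.jpg", '
          '"thumbnailUrl": "https://c.example/th?id=2"')
print(find_thumbnail_urls(sample))
```

Because `pos` is always an absolute index, there is no need to add offsets together after each search.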


Both approaches produce exactly the same output:

data = '''webSearchUrl\": \"https:\\/\\/w ... p:\\/\\/www.bing.com"'''
data = data.replace('\\/', '/')  # turn each escaped "\/" into a plain "/"

print ('Using roughly your approach ...')

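start = 0  # absolute position in data where the next search begins
while True:
    p = data[start:].find('thumbnailUrl')  # offset of the next key, relative to start
    if p == -1: break
    q = data[start+p+12:].find('http')  # 12 == len('thumbnailUrl'); offset of the URL after the key
    r = data[start+p+q+12:].find('"')  # offset of the closing quote, relative to the URL start
    print (data[start+p+q+12:start+p+q+r+12])
    start = start+p+q+r+12  # resume searching from the closing quote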

print ('Using a regular expression ...')

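import re  # importing the module avoids shadowing the built-in compile()

# Capture everything between the quotes that follow thumbnailUrl":
thumbNailRE = re.compile(r'thumbnailUrl":\s+"([^"]+)')
for match in thumbNailRE.findall(data):
    print (match)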
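One caveat: pasting the sample into a normal triple-quoted string lets Python process the escapes, so `\"` becomes `"` before any searching happens. If the file on disk really contains the backslashes as they appear in the question (an assumption; the question does not confirm it), the text returned by `my_file.read()` would need normalizing first. The sketch below shows one way to do that; `extract_thumbnails` and `THUMBNAIL_RE` are illustrative names.

```python
import re

# Same idea as the answer's pattern, with \s* so a missing space is tolerated.
THUMBNAIL_RE = re.compile(r'thumbnailUrl":\s*"([^"]+)')


def extract_thumbnails(raw):
    """Return thumbnail URLs from text that may still contain literal escapes."""
    # Assumption: quotes appear in the file as backslash + quote.
    text = raw.replace('\\"', '"')
    # Collapse any run of backslashes in front of a slash: \\/ or \/ becomes /
    text = re.sub(r'\\+/', '/', text)
    return THUMBNAIL_RE.findall(text)


# A fragment of the question's data, kept raw so the backslashes survive.
raw = (r'\"thumbnailUrl\": \"https:\\/\\/tse2.mm.bing.net\\/th'
       r'?id=OIP.Me19820ab68b4bcc7ec82756b2b5ecffbo1&pid=Api\", '
       r'\"datePublished\": \"2011-07-08T12:00:00\"')

for url in extract_thumbnails(raw):
    print(url)
```

To run it against the file from the question, pass the file contents instead of `raw`, for example the `page` variable read from `data.txt`. Text that is already unescaped passes through the two normalizing steps unchanged.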