Extracting the website name from a URL in Python

Time: 2018-03-14 07:13:50

Tags: python split urlparse

I want to extract the website name from a URL. For example, https://plus.google.com/in/test.html should output "plus google".

More test cases:

  1. WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/AUTO_PARTS_MADISON_OH_7402.HTML
     Output: OH MADISON STORES ADVANCEAUTOPARTS

  2. WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054
     Output: LQ

  3. WWW.LOCATIONS.DENNYS.COM
     Output: LOCATIONS DENNYS

  4. WV.WESTON.STORES.ADVANCEAUTOPARTS.COM
     Output: WV WESTON STORES ADVANCEAUTOPARTS

  5. WOODYANDERSONFORDFAYETTEVILLE.NET /
     Output: WOODYANDERSONFORDFAYETTEVILLE

  6. WILMINGTONMAYFAIRETOWNCENTER.HGI.COM
     Output: WILMINGTONMAYFAIRETOWNCENTER HGI

  7. WHITEHOUSEBLACKMARKET.COM /
     Output: WHITEHOUSEBLACKMARKET

  8. WINGATEHOTELS.COM
     Output: WINGATEHOTELS

string = str(input("Enter the url "))
new_list = list(string)
count=0
flag=0

if 'w' in new_list:
    index1 = new_list.index('w')
    new_list.pop(index1)
    count += 1
if 'w' in new_list:
    index2 = new_list.index('w')
    if index2 != -1 and index2 == index1:
        new_list.pop(index2)
        count += 1
if 'w' in new_list:
    index3= new_list.index('w')
    if index3!= -1 and index3== index2 and new_list[index3+1]=='.':
        new_list.pop(index3)
        count+=1
        flag = 1
if flag == 0:
    start = string.find('/')
    start += 2
    end = string.rfind('.')

    new_string=string[start:end]
    print(new_string)
elif flag == 1:
    start = string.find('.')
    start = start + 1
    end = string.rfind('.')

    new_string=string[start:end]
    print(new_string)

The code above works for some of the test cases, but not all of them. Please help.

Thanks

2 Answers:

Answer 0 (score: 3)

Here is something you can build on; it uses urllib.parse.urlparse:

from urllib.parse import urlparse

tests = ('https://plus.google.com/in/test.html',
         ('WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/'
          'AUTO_PARTS_MADISON_OH_7402.HTML'),
         'WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054')

def extract(url):
    # urlparse will not work without a 'scheme'
    if not url.startswith('http'):
        url = 'http://' + url
    parsed = urlparse(url).netloc
    split = parsed.split('.')[:-1] # get rid of TLD
    if split[0].lower() == 'www':
        split = split[1:]
    ret = ' '.join(split)
    return ret

for url in tests:
    print(extract(url))
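
The tests tuple above covers only three of the inputs from the question. As a rough sketch of how the same urlparse idea might be extended to the rest (uppercase input, no scheme, a stray " /" at the end), something like the following could work. The name extract_name is made up for illustration, and the sketch assumes the host carries no port or user info and that the TLD is always a single label (so something like .co.uk would not be handled).

```python
from urllib.parse import urlparse

def extract_name(url):
    # Hypothetical helper, not part of the answer above.
    url = url.strip()
    # urlparse only fills in netloc when the URL has a scheme or starts
    # with '//', so prefix scheme-less input with '//'.
    if '://' not in url:
        url = '//' + url
    # Assumption: no port or user info in the host part.
    host = urlparse(url).netloc.strip()
    labels = host.split('.')
    # Drop a leading 'www' label regardless of case.
    if labels and labels[0].lower() == 'www':
        labels = labels[1:]
    # Drop the last label (assumed to be a one-label TLD) and join the rest.
    return ' '.join(labels[:-1])

for url in ('https://plus.google.com/in/test.html',
            'WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054',
            'WOODYANDERSONFORDFAYETTEVILLE.NET /',
            'WILMINGTONMAYFAIRETOWNCENTER.HGI.COM'):
    print(extract_name(url))
```

Using netloc rather than hostname keeps the original letter case, which is what the expected outputs in the question show.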

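Answer 1 (score: 1)

This function strips the URL down to the part between the double slash and the next single slash; the rest is just "cleanup":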

def stripURL(url, TwoSlashes, OneSlash):
    # Return the text between the first TwoSlashes and the next OneSlash,
    # or "" if either delimiter is missing.
    try:
        start = url.index(TwoSlashes) + len(TwoSlashes)
        end = url.index(OneSlash, start)
        return url[start:end]
    except ValueError:
        return ""

url = input("URL : ")
if "www." in url:
    url = url.replace("www.", "")
Strip = stripURL(url, "//", "/")
# Strip everything after the last period found (the TLD)
Stripped = Strip[:Strip.rfind(".")]
# Replace any remaining periods in the name with spaces
Stripped = Stripped.replace(".", " ")
print(Stripped)
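
Note that the slicing above returns an empty string whenever the URL has no "//" or no "/" after the host, which is the case for most of the inputs in the question. A minimal sketch of the same string-slicing idea that tolerates those inputs is shown below. The function name site_name is hypothetical, and the sketch assumes a scheme, if present, is always followed by "://" and that the TLD is a single label.

```python
def site_name(url):
    # Hypothetical helper; the name and behaviour are assumptions.
    url = url.strip()
    # Drop the scheme if there is one; without '://' the whole string is kept.
    head, sep, tail = url.partition('://')
    rest = tail if sep else head
    # Keep only the host: everything before the first '/' or '?'.
    host = rest.split('/', 1)[0].split('?', 1)[0].strip()
    # Remove a leading 'www.' regardless of case.
    if host.lower().startswith('www.'):
        host = host[4:]
    # Drop the last dot-separated part (the TLD); name is '' if there is no dot.
    name, _dot, _tld = host.rpartition('.')
    # Replace the remaining periods with spaces.
    return name.replace('.', ' ')

print(site_name('https://plus.google.com/in/test.html'))
print(site_name('WWW.LOCATIONS.DENNYS.COM'))
```

Only a leading "www." is removed here, so a name that merely contains "www." somewhere in the middle is left alone.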