Question

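How can I extract the domain from a string in JS, so that for each string in the list below the output is example.com, except for the last two, where the output should be null, undefined, or an empty string? I am basically just trying to pull the domain out of a string; the test cases below are meant to validate the result.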

var urls = [
    "case 1 http://example.com",
    "case 2 https://example.com",
    "case 3 custume_scheme://example.com",
    "case 4 www.example.com",
    "case 5 www.example.com/staffToIgnore",
    "case 6 www.example.com?=key=leyToIgnore",
    "case 7 www.example.com ignore all those too",
    "case 8 www.example.com www.example2.com",
    "case 9 example.com need to return null",
    "case 10 wwwa.example.com need to return null",
];

The domain extension may be something other than .com; it can consist of [a-z0-9].
Subdomains are allowed.

There are several similar questions, but none of them is as specific as this one, and none of their answers passes all of the cases.

Answer 1

You can do this fairly easily with Lodash. Assuming you discard every string that contains a malformed domain, I set up this plunker, which tells you which strings contain a domain.

var urls = [
        "case 1 http://example.com",
        "case 2 https://example.com",
        "case 3 custume_scheme://example.com",
        "case 4 www.example.com",
        "case 5 www.example.com/staffToIgnore",
        "case 6 www.example.com?=key=leyToIgnore",
        "case 7 www.example.com ignore all those too",
        "case 8 www.example.com www.example2.com",
        "case 9 example.com need to return null",
        "case 10 wwwa.example.com need to return null",
];

_.forEach(urls, function(currentS){
  //If currentS is indeed a string
  if(_.isString(currentS)){
     //If it is a url
     if(isUrl(currentS)){
       $('#urls_list' ).append('<li>'+  currentS.match(/([a-zA-Z])*\.([a-zA-Z]){0,3}(?=\s|\?|\/|$)/)[0] +'</li>');
     } else {
       $('#urls_list' ).append('<li> null </li>');
     }
  }
});

isUrl

//Returns true if string s looks like it contains a domain, else false
function isUrl(s){
  //_.includes takes (collection, value, fromIndex), so each marker needs its own call
  return _.includes(s, 'www.') || _.includes(s, '://');
}

Output:

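currentS.match(/([a-zA-Z])*\.([a-zA-Z]){0,3}(?=\s|\?|\/|$)/)[0] returns only the part you are looking for:

([a-zA-Z])*\.: the domain name followed by a dot (domain.)
([a-zA-Z]){0,3}: the extension, up to three letters (com)
(?=\s|\?|\/|$): a lookahead that requires whitespace, ?, / or the end of the string to follow
[0]: takes the first match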
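
One caveat with indexing `[0]` directly: `match` returns null when nothing matches (for example with an extension longer than three letters), and `null[0]` throws. A minimal sketch of a guarded version in plain JavaScript follows; the helper name `firstDomainMatch` is made up for illustration and is not part of the plunker.

```javascript
// Same pattern as above, stored once so it can be reused.
var domainPattern = /([a-zA-Z])*\.([a-zA-Z]){0,3}(?=\s|\?|\/|$)/;

// Hypothetical helper: returns the matched text, or null when there is no match.
function firstDomainMatch(s) {
  var m = s.match(domainPattern);
  return m ? m[0] : null;
}

console.log(firstDomainMatch("case 5 www.example.com/staffToIgnore"));
console.log(firstDomainMatch("case 1 http://example.info"));
```

Note that the pattern on its own also matches "case 9 example.com need to return null", which is why the separate isUrl check is still needed.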

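In any case, if I were you I would take a look at validator, which is a great library for checking strings. It has a method isURL that will tell you whether a string is a URL. I could not import it into the plunker, so I wrote a custom function instead.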

You can read about _.includes here and _.forEach here.

If you would rather use a regular expression instead of the second _.forEach and _.includes, see this answer by @Daveo.

Answer 2

Found a (mostly) non-regex solution:

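function domainFromUrl(url) {
    // Prefer a "www." prefix; otherwise fall back to a "scheme://" prefix
    var index = url.indexOf("www.");
    if (index != -1) {
        url = url.slice(index + 4);
    }
    else{
        index = url.indexOf("://");
        if (index != -1) {
            url = url.slice(index + 3);
        }
        else{
            // Neither marker found: not treated as a domain
            return null;
        }
    }
    // Keep everything up to the first space, slash or question mark
    return url.split(/[ \/?]/)[0];
}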

Usage

var urls = [
    "case 1 http://example.com",
    "case 2 https://example.com",
    "case 3 custume_scheme://example.com",
    "case 4 www.example.com",
    "case 5 www.example.com/staffToIgnore",
    "case 6 www.example.com?=key=leyToIgnore",
    "case 7 www.example.com ignore all those too",
    "case 8 www.example.com www.example2.com",
    "case 9 example.com need to return null",
    "case 10 wwwa.example.com need to return null"
];

for (var i in urls) {
    console.log(i + ": " + domainFromUrl(urls[i]));
}

Output

0: example.com
1: example.com
2: example.com
3: example.com
4: example.com
5: example.com
6: example.com
7: example.com
8: null
9: null

Answer 3

Use this regular expression:

/(?:[\w-]+\.)+[\w-]+/

Here is a regex demo!

Sample:

var regex = /(?:[\w-]+\.)+[\w-]+/
regex.exec("google.com");                   // ["google.com"]
regex.exec("www.google.com");               // ["www.google.com"]
regex.exec("ftp://ftp.google.com");         // ["ftp.google.com"]
regex.exec("http://www.google.com");        // ["www.google.com"]
regex.exec("http://www.google.com/");       // ["www.google.com"]
regex.exec("https://www.google.com/");      // ["www.google.com"]
regex.exec("https://www.google.com.sg/");   // ["www.google.com.sg"]

If you want to strip the leading 'www' label, try this:

/^[^\.]+\.(.+\..+)$/

Sample:

var regex = /^[^\.]+\.(.+\..+)$/
// The stripped domain is in capture group 1 of the result (result[1]).
regex.exec("google.com");                   // null (there is no leading label to strip)
regex.exec("www.google.com");               // group 1: "google.com"
regex.exec("ftp://ftp.google.com");         // group 1: "google.com"
regex.exec("http://www.google.com");        // group 1: "google.com"
regex.exec("http://www.google.com/");       // group 1: "google.com/" (trailing slash is kept)
regex.exec("https://www.google.com/");      // group 1: "google.com/"
regex.exec("https://www.google.com.sg/");   // group 1: "google.com.sg/"

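Learn regular expressions. They will save you time and lines of code.

PS. I am bad at regular expressions myself; I used a little thing called Google to get this one. You do not really need to understand a regex to use it. There are plenty of good regex examples out there, and you will usually find what you need.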
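
Neither pattern above enforces the question's rule that a domain only counts when it follows a scheme (`something://`) or a `www.` prefix. As a rough sketch of how a single regex could cover the ten test cases from the question, the following requires one of those prefixes and captures what comes after it. The function name `extractDomain` and the exact character classes are my own assumptions, not something taken from the answers.

```javascript
// Hypothetical helper: returns the domain after "scheme://" or "www.", else null.
function extractDomain(text) {
  // Alternative 1: a scheme such as http:// or custume_scheme://, optionally followed by www.
  // Alternative 2: a bare www. at a word boundary (so "wwwa." does not qualify).
  // Group 1: one or more dot-terminated labels plus a final [a-z0-9] extension.
  var pattern = /(?:[a-z][\w+.-]*:\/\/(?:www\.)?|\bwww\.)((?:[a-z0-9-]+\.)+[a-z0-9]+)/i;
  var match = pattern.exec(text);
  return match ? match[1] : null;
}

console.log(extractDomain("case 3 custume_scheme://example.com"));
console.log(extractDomain("case 10 wwwa.example.com need to return null"));
```

Only the first domain in a string is returned, so "case 8 www.example.com www.example2.com" yields example.com.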

Answer 4

Found this answer on StackOverflow:

const getDomain = (url) => {
    var dom = "", v, step = 0;
    for(var i=0,l=url.length; i<l; i++) {
        v = url[i]; if(step == 0) {
            //First, skip 0 to 5 characters ending in ':' (ex: 'https://')
            if(i > 5) { i=-1; step=1; } else if(v == ':') { i+=2; step=1; }
        } else if(step == 1) {
            //Skip 0 or 4 characters 'www.'
            //(Note: Doesn't work with www.com, but that domain isn't claimed anyway.)
            if(v == 'w' && url[i+1] == 'w' && url[i+2] == 'w' && url[i+3] == '.') i+=4;
            dom+=url[i]; step=2;
        } else if(step == 2) {
            //Stop at subpages, queries, and hashes.
            if(v == '/' || v == '?' || v == '#') break; dom += v;
        }
    }
    return dom;
}

It returns the domain without the leading and trailing parts you do not want.
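
When the input is already a well-formed absolute URL rather than free text, the built-in `URL` constructor (available in modern browsers and Node.js) can do the parsing instead of a hand-written loop. This is a different technique from the function above, shown only as a sketch; the name `hostnameFromUrl` is hypothetical.

```javascript
// Hypothetical helper: parses an absolute URL and drops a leading "www." label.
function hostnameFromUrl(url) {
  try {
    return new URL(url).hostname.replace(/^www\./, "");
  } catch (e) {
    // new URL throws a TypeError for strings it cannot parse as an absolute URL
    return null;
  }
}

console.log(hostnameFromUrl("https://www.example.com/path?key=value#top"));
console.log(hostnameFromUrl("www.example.com"));
```

This will not work on the test strings from the question as they stand, because they carry a "case N" prefix and some have no scheme; the URL would have to be isolated first.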
