Question

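I have a PowerShell script (from Mr. Theo) that fetches the page title for each URL in a large text file (6000 lines). My problem is that for certain URLs in the file the script does not work: it never finishes. The structure of the file (input.txt) is: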

2018-11-23 17:10:20;$https://www.super.cz
2018-11-24 21:09:34;$https://www.seznam.cz
2018-11-25 11:20:23;$https://www.kara.cz/muzi
2018-11-26 21:11:00;$https://atlas.centrum.cz (problem row)
2018-11-27 21:09:34;$https://www.seznam.cz

The desired file structure is:

2018-11-23 17:10:20;$https://www.super.cz;$Super.cz
2018-11-24 21:09:34;$https://www.seznam.cz;$Seznam - najdu tam, co neznám
2018-11-25 11:20:23;$https://www.kara.cz/muzi;$Kara - Online obchod Kara
2018-11-27 21:09:34;$https://www.seznam.cz;$Seznam - najdu tam, co neznám

Or, alternatively, this structure:

2018-11-23 17:10:20;$https://www.super.cz;$Super.cz
2018-11-24 21:09:34;$https://www.seznam.cz;$Seznam - najdu tam, co neznám
2018-11-25 11:20:23;$https://www.kara.cz/muzi;$Kara - Online obchod Kara
2018-11-26 21:11:00;$https://atlas.centrum.cz;$ (problem row without title)
2018-11-27 21:09:34;$https://www.seznam.cz;$Seznam - najdu tam, co neznám

Can I export the failing rows or remove them? Could you help me update the script?

$inputFile  = 'C:\Users\user\Desktop\OSTROTA\input.txt'
$outputFile = 'C:\Users\user\Desktop\OSTROTA\urls_title.txt'

# Read the headerless textfile and replace all `;$` into a single ';'
# so we can use ConvertFrom-Csv.
# Collect the output for each delimited output string in a variable
$result = (Get-Content -Path $inputFile) -replace ';\$', ';' | 
    ConvertFrom-Csv -Delimiter ';' -Header date, url | 
    ForEach-Object {
        # put the url and date in variables so we can use them inside the catch block if needed
        $url  = $_.url
        $date = $_.date
        try {
            $page = Invoke-WebRequest -Uri $_.url -Method Get -ErrorAction Stop
            # output a string, delimited by ';$' 
            '{0};${1};${2}' -f $_.date, $_.url, $page.ParsedHtml.title
        }
        catch {
            Write-Warning "An error occurred on Url '$url'.`r`n$($_.Exception.Message)"
            # output the line with the title empty
            '{0};${1};$' -f $date, $url
        }
    }

# show output on screen
$result

# write output to a new headerless text file
$result | Set-Content $outputFile -Force

Answer 1

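The following script uses:

[Environment]::GetFolderPath('Desktop') to resolve the current user's desktop,
splatting to define the parameters more clearly,
Invoke-WebRequest with the parameters -TimeoutSec 1 and -MaximumRedirection 1 to speed up fetching the title and to avoid possible redirect loops,
an alternative output format using a PSCustomObject, commented out at the moment.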

## Q:\Test\2019\07\18\SO_57093988.ps1
$Desktop    = [Environment]::GetFolderPath('Desktop')
$inputFile  = Join-Path $Desktop 'OSTROTA\input.txt'
$outputFile = Join-Path $Desktop 'OSTROTA\urls_title.txt'

$result = (Get-Content -Path $inputFile) | ForEach-Object {
    $date,$url,$title = $_ -split ';\$'
    try {
        $params = @{
            Uri                = $url
            Method             = 'Get'
            ErrorAction        = 'Stop'
            Timeoutsec         = 1
            MaximumRedirection = 1
        }
        $title = (Invoke-WebRequest @params).ParsedHtml.title
        if(-not $title){$title = (([System.Uri]$url) -Split '\.')[-2]+' - najdu tam, co neznám'}
    }
    catch {
        Write-Warning "An error occurred on Url '$url'.`r`n$($_.Exception.Message)"
        # output the line with the title empty
        $title = ' (problem row without title)'
    }
    '{0};${1};${2}' -f $date,$url,$title
    # Alternatively use a PSCustomObject
    #[PSCustomObject]@{
    #    date = $date
    #    url  = '$'+$url
    #    title= if($title){'$'+$title}
    #}
}

# show output on screen
$result

# write output to a new headerless text file
# $result | Set-Content $outputFile -Force

Sample output of both variants (German locale):

> Q:\Test\2019\07\18\SO_57093988.ps1
WARNUNG: An error occurred on Url 'https://atlas.centrum.cz'.
Es wurden zu viele automatische Umleitungen versucht.

date                url                       title
----                ---                       -----
2018-11-23 17:10:20 $https://www.super.cz     $Super.cz
2018-11-24 21:09:34 $https://www.seznam.cz    $seznam - najdu tam, co neznám
2018-11-25 11:20:23 $https://www.kara.cz/muzi $Kara - Online obchod Kara - Muži
2018-11-26 21:11:00 $https://atlas.centrum.cz $ (problem row without title)
2018-11-27 21:09:34 $https://www.seznam.cz    $seznam - najdu tam, co neznám

> Q:\Test\2019\07\18\SO_57093988.ps1
WARNUNG: An error occurred on Url 'https://atlas.centrum.cz'.
Es wurden zu viele automatische Umleitungen versucht.
2018-11-23 17:10:20;$https://www.super.cz;$Super.cz
2018-11-24 21:09:34;$https://www.seznam.cz;$seznam - najdu tam, co neznám
2018-11-25 11:20:23;$https://www.kara.cz/muzi;$Kara - Online obchod Kara - Muži
2018-11-26 21:11:00;$https://atlas.centrum.cz;$ (problem row without title)
2018-11-27 21:09:34;$https://www.seznam.cz;$seznam - najdu tam, co neznám

Answer 2

You have to modify this line inside catch { ... }:

'{0};${1};$' -f $date, $url

If you do not want the row to appear at all, comment the line out:

#'{0};${1};$' -f $date, $url

If you want to add a custom message, pass it after $url and also add ${2} to the format string:

'{0};${1};${2}' -f $date, $url, ' (problem row without title)'
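If the goal is to export the failing rows rather than drop or annotate them, one option is to collect them inside the catch block and write them to a separate file afterwards. The following is only a sketch under stated assumptions: Get-PageTitle is a hypothetical stand-in for the Invoke-WebRequest call so that the example runs offline, and the sample rows and the $failed list are illustrative names, not part of the original script.

```powershell
# Sample rows in the question's format; the second one is deliberately broken
$rows = @(
    '2018-11-23 17:10:20;$https://www.super.cz'
    '2018-11-26 21:11:00;$https://atlas.centrum.cz (problem row)'
)

# Hypothetical stand-in for Invoke-WebRequest (assumption, for offline use only):
# throws for the "problem" URL and returns a fixed title otherwise
function Get-PageTitle([string]$Url) {
    if ($Url -match 'problem row') { throw "Cannot fetch '$Url'" }
    'Some title'
}

# Rows that raised an error end up here instead of in $result
$failed = [System.Collections.Generic.List[string]]::new()

$result = foreach ($row in $rows) {
    $date, $url = $row -split ';\$'
    try {
        # successful rows are emitted in the ';$'-delimited output format
        '{0};${1};${2}' -f $date, $url, (Get-PageTitle -Url $url)
    }
    catch {
        # keep the original row so it can be exported separately
        $failed.Add($row)
    }
}

$result
$failed
```

With this shape, $result can be written to the output file as before, and $failed can be piped to Set-Content with a second file name of your choice.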

As @LotPings mentioned in the comments, some websites may require the -UseBasicParsing parameter on the Invoke-WebRequest cmdlet; without it the requests hang.

In that case the ParsedHtml property will be empty, so you need another way to extract the title. One example using a regular expression:

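try {
    $page = Invoke-WebRequest -Uri $_.url -Method Get -ErrorAction Stop -UseBasicParsing
    # -match only updates $Matches on success, so fall back to an empty title
    # instead of reusing a stale value from an earlier row
    $title = if ($page.Content -match '(?s)<title>(.*?)</title>') { $Matches[1].Trim() } else { '' }
    # output a string, delimited by ';$'
    '{0};${1};${2}' -f $_.date, $_.url, $title
}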

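Here you search for <title>Something something</title> and use a capture group to extract Something something. The whole match is stored in $Matches[0], but you do not need it, so you take the next element of $Matches, which holds the text matched by the capture group.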
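To make the difference between $Matches[0] and $Matches[1] concrete, here is a small offline sketch. The $html string is an assumed sample standing in for $page.Content; it is not fetched from any real site.

```powershell
# Assumed sample HTML standing in for $page.Content
$html = '<html><head><title>Something something</title></head><body></body></html>'

if ($html -match '<title>(.*)</title>') {
    $whole = $Matches[0]   # the whole match, including the <title> tags
    $title = $Matches[1]   # first capture group: only the text between the tags
}
else {
    # no <title> element found: leave both values empty
    $whole = ''
    $title = ''
}

$whole
$title
```

Checking the result of -match before reading $Matches matters when this runs in a loop, because a failed match leaves $Matches holding the values from the previous successful one.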

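Explanation: the catch { ... } block runs because you explicitly told the script to enter it on any error by passing -ErrorAction Stop. That parameter turns the cmdlet's non-terminating errors into terminating ones, which is what triggers the catch { ... } block. In this case the error is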

Invalid URI: The hostname could not be parsed.
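The effect of -ErrorAction Stop described above can be seen without any network access. The sketch below uses Get-Item on a path that is assumed not to exist (the file name is hypothetical) and records whether the catch block was entered in each case.

```powershell
# Hypothetical path that is assumed not to exist on this machine
$missing = Join-Path ([System.IO.Path]::GetTempPath()) 'no-such-file-57093988.txt'

$caughtWithoutStop = $false
try {
    # The error stays non-terminating (and is hidden), so catch is NOT entered
    Get-Item -Path $missing -ErrorAction SilentlyContinue
}
catch {
    $caughtWithoutStop = $true
}

$caughtWithStop = $false
try {
    # -ErrorAction Stop makes the same error terminating, so catch IS entered
    Get-Item -Path $missing -ErrorAction Stop
}
catch {
    $caughtWithStop = $true
    $message = $_.Exception.Message
}

"Caught without Stop: $caughtWithoutStop"
"Caught with Stop:    $caughtWithStop"
```

The same mechanism is why the warning and the row with the empty title are produced for the problem URL in the script.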

Invoke-WebRequest: script does not work as expected
