How do I perform a range lookup in a large DataFrame?

Time: 2019-06-12 11:49:02

Tags: python pandas range

I want to go through a folder and check which timezone each file in it belongs to. For that I have a CSV file:

ip1         ip2           timezone
0           16777215          0
16777216    16777471       +10:00
16777472    16778239       +08:00
16778240    16779263       +11:00
16779264    16781311       +08:00
16781312    16785407       +09:00
...

When a given ip_number lies between ip1 and ip2 (inclusive), the corresponding timezone is in the third column.

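import glob
import os

import pandas as pd

# Lookup table with the columns ip1, ip2, timezone
df = pd.read_csv('IP2LOCATION-LITE-DB11.csv', parse_dates=True)

# Collect the names of all CSV files in the folder
path="Testordner"
os.chdir(path)
result = [i for i in glob.glob('*.{}'.format("csv"))]
os.chdir("..")
for i in result:
    df2 = pd.read_csv("twiceaweek/"+i, parse_dates=True)
    # File names look like "w.x.y.csv"; z1 is the extension
    w1,x1,y1,z1=i.split('.')
    w=int(w1)
    x=int(x1)
    y=int(y1)
    # Integer form of the address w.x.y.1
    ip_number= 16777216*w + 65536*x + 256*y+1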

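I don't know how to match a number against the ranges between ip1 and ip2, or how to join each file's ip_number to them to get my timezone. Do you have any ideas?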

2 Answers:

Answer 0: (score: 0)

You want pd.cut (the fixed-bin-edge counterpart of qcut), with the ip1 values as bin edges:

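# Left-closed bin edges: every ip1, plus one past the last ip2
thresholds = list(df['ip1']) + [df['ip2'].iloc[-1] + 1]

# test: use the midpoint of each range as a sample ip
ips = df[['ip1', 'ip2']].mean(axis=1).astype(int)

# bucketing: right=False gives bins [ip1, next ip1), so an ip equal to
# ip1 lands in its own range rather than in the previous one
buckets = pd.cut(ips, thresholds,
                 right=False,
                 labels=False)

# get the labels:
df['timezone'].values[buckets]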
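
To make the bucketing concrete, here is a minimal self-contained sketch using only the sample rows from the question. The query values and the names edges, queries, codes and zones are made up for illustration. It assumes the ranges are sorted and contiguous (each ip1 equals the previous ip2 + 1), and it uses left-closed bins so that a value equal to ip1 falls into its own range.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "ip1": [0, 16777216, 16777472, 16778240, 16779264, 16781312],
    "ip2": [16777215, 16777471, 16778239, 16779263, 16781311, 16785407],
    "timezone": ["0", "+10:00", "+08:00", "+11:00", "+08:00", "+09:00"],
})

# Left-closed bin edges: every ip1, plus one past the last ip2.
edges = list(df["ip1"]) + [df["ip2"].iloc[-1] + 1]

# Hypothetical query values: inside a range, on a lower bound, on an upper bound.
queries = pd.Series([16777217, 16777472, 16785407])

# labels=False returns the integer position of the bin for each value.
codes = pd.cut(queries, edges, right=False, labels=False)

# Use the bin positions to pick the timezone of the matching row.
zones = df["timezone"].to_numpy()[codes.to_numpy()]
print(list(zones))
```

Values outside all bins would come back as NaN from pd.cut, so real data with gaps or out-of-range addresses would need a check before indexing.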

Answer 1: (score: 0)

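You can use merge_asof. It finds the last row whose key is less than or equal to the searched value, which is what you need here. So, once you have the IP number, look up the timezone with: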

tmp = pd.merge_asof(pd.DataFrame([ip_number], columns=['ip']), df,
                    left_on='ip', right_on='ip1')
tmp = tmp[tmp.ip2 >= ip_number]   # ip2 is the inclusive upper bound
if len(tmp) > 0:
    tz = tmp.at[0, 'timezone']
else:
    tz = ''       # not found
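
Since the question loops over many files, the same idea can be applied to all ip numbers in one call. The following is a sketch under assumptions rather than a definitive implementation: the lookups frame and its values are hypothetical, and the table is the sample from the question. merge_asof requires both key columns to be sorted and of the same dtype.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "ip1": [0, 16777216, 16777472, 16778240, 16779264, 16781312],
    "ip2": [16777215, 16777471, 16778239, 16779263, 16781311, 16785407],
    "timezone": ["0", "+10:00", "+08:00", "+11:00", "+08:00", "+09:00"],
})

# Hypothetical ip numbers; the last one lies beyond the sample table.
lookups = pd.DataFrame({"ip": [16777217, 16779263, 20000000]})

# Both key columns must be sorted ascending for merge_asof.
lookups = lookups.sort_values("ip")
matched = pd.merge_asof(lookups, df, left_on="ip", right_on="ip1")

# Keep the timezone only where the value also respects the upper bound ip2.
matched["timezone"] = matched["timezone"].where(matched["ip"] <= matched["ip2"])
print(matched[["ip", "timezone"]])
```

Rows without a matching range end up with a missing timezone instead of an empty string.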

Alternatively, you can use searchsorted:

# 'left' returns the first row with ip2 >= ip_number
ix = df['ip2'].searchsorted([ip_number], 'left')[0]
if ix == len(df) or df.at[ix, 'ip1'] > ip_number:
    tz = ''        # not found
else:
    tz = df.at[ix, 'timezone']
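
The searchsorted lookup also vectorizes over many ip numbers at once. Below is a minimal sketch with NumPy, again on the sample rows from the question. The ips array and the helper names ix, ixc, found and zones are illustrative assumptions, and the table is assumed to be sorted by ip2.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "ip1": [0, 16777216, 16777472, 16778240, 16779264, 16781312],
    "ip2": [16777215, 16777471, 16778239, 16779263, 16781311, 16785407],
    "timezone": ["0", "+10:00", "+08:00", "+11:00", "+08:00", "+09:00"],
})

# Hypothetical ip numbers: a lower bound, an upper bound, and one out of range.
ips = np.array([16777216, 16781311, 20000000])

ip1 = df["ip1"].to_numpy()
ip2 = df["ip2"].to_numpy()

# Index of the first row whose upper bound ip2 is >= each value.
ix = np.searchsorted(ip2, ips, side="left")

# Clip so that out-of-range positions can still be used for indexing.
ixc = np.minimum(ix, len(df) - 1)

# A hit needs a row to exist and its lower bound ip1 to be <= the value.
found = (ix < len(df)) & (ip1[ixc] <= ips)

# Empty string marks "not found", matching the convention used above.
zones = np.where(found, df["timezone"].to_numpy()[ixc], "")
print(list(zones))
```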