Pandas: replace certain values in each column

Time: 2018-11-13 17:34:49

Tags: python pandas percentile

I have a dataframe that looks like this:

+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI  | DiabetesPedigreeFunction | Age | Outcome  |
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+
| 0 | 6           | 148.0   | 72.0          | 35.0          | 125.0   | 33.6 | 0.627                    | 50  | 1        |
| 1 | 1           | 85.0    | 66.0          | 29.0          | 125.0   | 26.6 | 0.351                    | 31  | 0        |
| 2 | 8           | 183.0   | 64.0          | 29.0          | 125.0   | 23.3 | 0.672                    | 32  | 1        |
| 3 | 1           | 89.0    | 66.0          | 23.0          | 94.0    | 28.1 | 0.167                    | 21  | 0        |
| 4 | 0           | 137.0   | 40.0          | 35.0          | 168.0   | 43.1 | 2.288                    | 33  | 1        |
+---+-------------+---------+---------------+---------------+---------+------+--------------------------+-----+----------+

After looking at the box plot of each variable, I found that they contain outliers.

So, in every column apart from Outcome, I want to replace values greater than the 95th percentile with the value at the 75th percentile, and values less than the 5th percentile with the value at the 25th percentile, of that particular column.

For example, in the Glucose column, I want to replace values above the 95th percentile with the value at the 75th percentile of the Glucose column.

How can I do this using pandas filtering and percentile functions?

Any help would be appreciated.

1 Answer:

Answer 0 (score: 3)

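You can use apply on every column except Outcome, together with the np.clip and np.percentile functions (this clamps each column to its own 25th to 75th percentile range):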

import numpy as np

# Outcome is moved into the index so that apply only touches the feature columns.
percentile_df = df.set_index('Outcome').apply(lambda x: np.clip(x, *np.percentile(x, [25,75]))).reset_index()

>>> percentile_df
   Outcome  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0        1          6.0    148.0           66.0           35.0    125.0  33.6   
1        0          1.0     89.0           66.0           29.0    125.0  26.6   
2        1          6.0    148.0           64.0           29.0    125.0  26.6   
3        0          1.0     89.0           66.0           29.0    125.0  28.1   
4        1          1.0    137.0           64.0           35.0    125.0  33.6   

   DiabetesPedigreeFunction   Age  
0                     0.627  33.0  
1                     0.351  31.0  
2                     0.672  32.0  
3                     0.351  31.0  
4                     0.672  33.0  
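As a possible alternative to apply with np.clip, the same clamping can be sketched with pandas alone, using DataFrame.quantile and DataFrame.clip. This is a minimal sketch, not the answerer's code: the frame below copies only three feature columns from the question for brevity, and the names features, bounds and clipped are introduced here purely for illustration.

```python
import pandas as pd

# Sample rows copied from the question (three feature columns, for brevity).
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

features = df.drop(columns="Outcome")     # leave the label column untouched
bounds = features.quantile([0.25, 0.75])  # one row per quantile, one column per feature

# axis=1 aligns each Series of bounds with the column labels of `features`.
clipped = features.clip(lower=bounds.loc[0.25], upper=bounds.loc[0.75], axis=1)
clipped["Outcome"] = df["Outcome"]        # re-attach the label column (it ends up last)
print(clipped)
```

Because Outcome is dropped and re-attached rather than moved into the index, the original df is not modified.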

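[EDIT] I misread the question at first. Here is a way, using np.select, to replace values below the 5th percentile with the 25th percentile and values above the 95th percentile with the 75th percentile: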

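def cut(column):
    # Conditions are checked in order: upper tail first, then lower tail.
    conds = [column > np.percentile(column, 95),
             column < np.percentile(column, 5)]
    # Replacement for each condition: 75th percentile for the upper tail, 25th for the lower tail.
    choices = [np.percentile(column, 75),
               np.percentile(column, 25)]
    # Values that match neither condition are kept unchanged.
    return np.select(conds, choices, column)

# Outcome is moved into the index so that cut is only applied to the feature columns.
df = df.set_index('Outcome').apply(cut).reset_index()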

>>> df
   Outcome  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0        1          6.0    148.0           66.0           35.0    125.0  33.6   
1        0          1.0     89.0           66.0           29.0    125.0  26.6   
2        1          6.0    148.0           64.0           29.0    125.0  26.6   
3        0          1.0     89.0           66.0           29.0    125.0  28.1   
4        1          1.0    137.0           64.0           35.0    125.0  33.6   

   DiabetesPedigreeFunction   Age  
0                     0.627  33.0  
1                     0.351  31.0  
2                     0.672  32.0  
3                     0.351  31.0  
4                     0.672  33.0
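The two outputs above are identical only because the sample has five rows: in each column the 25th and 75th percentiles fall exactly on the second-smallest and second-largest values, and only the minimum and maximum lie outside the 5th to 95th percentile range. On a longer column the two approaches differ. The following is a minimal sketch on a hypothetical column holding the values 1.0 to 100.0 (not data from the question), counting how many entries each approach changes.

```python
import numpy as np
import pandas as pd

# Hypothetical column for illustration only: 1.0, 2.0, ..., 100.0.
values = pd.Series(np.arange(1, 101, dtype=float))
arr = values.to_numpy()

p5, p25, p75, p95 = np.percentile(arr, [5, 25, 75, 95])

# Approach 1: clamp everything outside the 25th-75th percentile range.
clipped = values.clip(lower=p25, upper=p75)

# Approach 2: only replace the tails beyond the 5th/95th percentiles.
tails_replaced = pd.Series(
    np.select([arr > p95, arr < p5], [p75, p25], default=arr),
    index=values.index,
)

changed_by_clip = int((clipped != values).sum())
changed_by_select = int((tails_replaced != values).sum())
print(changed_by_clip, changed_by_select)
```

Here clipping alters half of the column (50 values), while the np.select version alters only the 10 values in the two tails, which is the behaviour the question asks for.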