Counting rows with NaN

Time: 2017-11-17 19:20:41

Tags: python

I have the following DataFrame:

dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
13  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
17  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
31  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
44  1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
47  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN

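What I want to do is count the rows that are exactly identical, including their NaN values.

The problem is that I am using groupby, which ignores NaN values: rows containing NaN are left out of the count, so I do not get the correct number of exact duplicates for these rows.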
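A minimal sketch of the behavior described above, using a small made-up frame (the names toy, a and b are illustrative only and not part of the original data). With default settings, groupby drops every row whose key contains NaN, while duplicated still treats those rows as equal:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two identical rows containing NaN, two identical rows without
toy = pd.DataFrame({"a": [1.0, 1.0, 2.0, 2.0],
                    "b": [np.nan, np.nan, 3.0, 3.0]})

# Default groupby: rows whose key contains NaN are dropped from the result
sizes = toy.groupby(["a", "b"]).size()
print(sizes)

# duplicated(), by contrast, flags the two NaN rows as duplicates of each other
flags = toy.duplicated(keep=False)
print(flags.tolist())
```

Only the (2.0, 3.0) group shows up in sizes, although all four rows are flagged as duplicates.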

My code is as follows:

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])

    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    #This line should count my repeated rows
    s = aux.groupby(data.columns.tolist(),as_index=False).transform('size')


    return x

If I print the "x" variable I get this result, which shows all the repeated rows:

dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
13  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
17  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
31  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
44  1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
47  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
51  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
53  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN

Now I have to count, from the x result, the rows that are exactly identical.

This should be my correct output:

dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff  num_reps
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN         4
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0         2
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0         3
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN         2

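That is my problem: groupby ignores NaN values, which is why other similar posts about this issue could not help me.

Thanks

2 answers:

Answer 0: (score: 0)

If the DataFrame is named df, you can count the duplicates with a single line of code: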

sum(df.duplicated(keep = False))

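If you want to delete the duplicate rows, use the drop_duplicates method (documentation).

Example:

#data.csv (the # annotations only mark the duplicates and are not part of the file)
col1,col2,col3
a,3,NaN #duplicate
b,9,4   #duplicate
c,12,5
a,3,NaN #duplicate
b,9,4   #duplicate
d,19,20
a,3,NaN #duplicate - 5 duplicate rows

Import data.csv and delete the duplicate rows (by default the first instance of each duplicated row is kept):

import pandas as pd
df = pd.read_csv("data.csv")
print(df.drop_duplicates())
#Output
  col1  col2  col3
0    a     3   NaN
1    b     9   4.0
2    c    12   5.0
5    d    19  20.0

To count the duplicate rows, use the DataFrame's duplicated method with "keep" set to False (documentation). As mentioned above, you can simply use sum(df.duplicated(keep = False)) to do this. Here is a simpler way to demonstrate what the "duplicated" method does:

sum(df.duplicated(keep = False))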
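As a rough illustration of the steps above, the sketch below loads the sample rows from an in-memory string rather than from data.csv (an assumption made only to keep it self-contained, with the # annotations left out) and prints the per-row flags that get summed:

```python
import io
import pandas as pd

# Same sample rows as data.csv, without the "#duplicate" annotations
csv_text = """col1,col2,col3
a,3,NaN
b,9,4
c,12,5
a,3,NaN
b,9,4
d,19,20
a,3,NaN
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep=False marks every member of a duplicate group, including the first one
flags = df.duplicated(keep=False)
n_dupes = int(flags.sum())

# drop_duplicates keeps the first row of each group by default
deduped = df.drop_duplicates()

print(flags.tolist())
print(n_dupes)
print(deduped.index.tolist())
```

The rows whose third column is NaN are still flagged, since duplicated compares NaN values as equal.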

答案 1 :(得分:0)

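I just solved it.

The problem, as I said, was that groupby does not accept NaN values.

So what I did was replace every NaN with 0 using fillna(0); with no NaN left, the rows can be compared and grouped correctly.

This is my new function, which works correctly: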

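def detect_duplicates(data):
    # Every row that has at least one exact duplicate (NaN compares as equal here)
    aux = data[data.duplicated(keep=False)]
    # One representative row per group of identical rows
    x = aux.drop_duplicates()
    # Replace NaN with 0 so groupby no longer drops those rows, then count each group
    s = aux.fillna(0).groupby(data.columns.tolist()).size().reset_index().rename(columns={0:'count'})
    # Note: this relies on the reversed group order matching the row order of x
    x['num_reps'] = s['count'].tolist()[::-1]

    return x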
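Replacing NaN with 0 can merge rows that really contain 0 with rows that contain NaN, and the reversed-list assignment depends on the group order happening to match. A possible alternative, sketched under the assumption that pandas 1.1 or later is available (where groupby accepts dropna=False), keeps NaN as a key of its own. The name count_duplicates and the sample frame are hypothetical, and the result does not preserve the original row labels:

```python
import numpy as np
import pandas as pd

def count_duplicates(data: pd.DataFrame) -> pd.DataFrame:
    """Return one row per group of identical rows, with a num_reps column."""
    cols = data.columns.tolist()
    # Keep only rows that occur more than once (NaN compares as equal here)
    dupes = data[data.duplicated(keep=False)]
    # dropna=False (assumed pandas >= 1.1) keeps groups whose keys contain NaN
    return dupes.groupby(cols, dropna=False).size().reset_index(name="num_reps")

# Hypothetical sample with NaN in both a numeric and a string column
sample = pd.DataFrame({
    "dur":   [3.0, 1.0, 3.0, 2.0, 1.0, 3.0, 2.0, 5.0],
    "cola":  ["tcf", "none", "tcf", np.nan, "none", "tcf", np.nan, "tc"],
    "hours": [np.nan, 38.0, np.nan, 40.0, 38.0, np.nan, 40.0, 1.0],
})

counts = count_duplicates(sample)
print(counts)
```

The unique row (dur 5.0) is excluded, and each remaining group is counted once, NaN included.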