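Sum of CSV values using pandas

Time: 2018-08-28 17:13:24

Tags: python python-3.x pandas csv

I want to sum all the values from column 3 onward in each row and save the result, together with the first and second columns, to a new csv file. I would like to use pandas, which I believe is more efficient.

The values that can be added together lie between 0 and 2.

If a value or character other than 0.5, 1 or 2 is present, it is ignored in the sum.

Example of the csv file: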

https://pastebin.com/WwDWqU3U

encounterId|chartTime|11885|67187|6711|6711|6710|1356|1357|1358|1359|1360|1361|1362|1366|140|140

325|2014-01-01 00:00:00|0
325|2014-01-01 01:00:00|0|0|0
325|2014-01-01 02:00:00|0
325|2014-01-01 03:00:00|0|0|0
325|2014-01-01 04:00:00|0
325|2014-01-01 05:00:00|1
325|2014-01-01 06:00:00|0|0|0
325|2014-01-01 07:00:00|1|0|0.5|1
325|2014-01-01 08:00:00|0
325|2014-01-01 09:00:00|1|0|0
325|2014-01-01 10:00:00|0
325|2014-01-01 11:00:00|1|0|0
325|2014-01-01 12:00:00|0
325|2014-01-01 13:00:00|0|0|0.5|1
325|2014-01-01 14:00:00|0
325|2014-01-01 15:00:00|0

What I am looking for:

323|2013-06-03 00:00:00|0
323|2013-06-03 01:00:00|1
323|2013-06-03 02:00:00|1.5
323|2013-06-03 03:00:00|1.5
323|2013-06-03 04:00:00|0
323|2013-06-03 05:00:00|0.5
323|2013-06-03 06:00:00|0
323|2013-06-03 07:00:00|3.5
323|2013-06-03 08:00:00|0.5

I tried it without pandas, and that gave me some strange results.

4 answers:

Answer 0 (score: 1)

You can sum with the parameter axis=1, as suggested in the earlier answer here.
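
As a minimal sketch of that suggestion (the column names and sample values are made up for illustration), `DataFrame.sum(axis=1)` adds across the columns of each row and skips missing values by default:

```python
import pandas as pd

# Hypothetical sample: three value columns, with gaps where a row is shorter
df = pd.DataFrame({
    "A": [0, 1, 1],
    "B": [0.0, None, 0.0],
    "C": [0.5, None, 0.5],
})

# axis=1 sums across the columns of each row; NaN is skipped by default
row_totals = df.sum(axis=1)
print(row_totals.tolist())
```

Note that this adds every value in the row, so any filtering of out-of-range values has to happen before the sum.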

Answer 1 (score: 1)

Use this:

from io import StringIO
import pandas as pd

# Sample rows with a varying number of value columns
csvfile = StringIO("""323|2013-06-03 00:00:00|0|0|0
323|2013-06-03 01:00:00|1|
323|2013-06-03 02:00:00|1|0|0.5|86
323|2013-06-03 03:00:00|1|0|0.5|0
323|2013-06-03 04:00:00|0
323|2013-06-03 05:00:00|0|0|0.5|0
323|2013-06-03 06:00:00|0
323|2013-06-03 07:00:00|1|0|0.5|2
323|2013-06-03 08:00:00|0|0.5""")

# Supplying names lets read_csv accept rows of different lengths
df = pd.read_csv(csvfile, sep='|', names=['ID','date','A','B','C','D'])

df_out = df.set_index(['ID','date'])

# Keep values in (0, 2], zero out the rest, sum each row, and save
df_out.where((df_out>0) & (df_out<=2), 0)\
      .sum(axis=1)\
      .reset_index()\
      .to_csv('outfile.csv', index=False, header=False)

# In a Jupyter notebook on Windows, print the result with:
# !type outfile.csv
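
The expected output in the question is pipe-delimited, whereas `to_csv` writes commas by default. Here is a small sketch, using a shortened version of the sample data, that passes `sep="|"` and returns the text instead of writing a file. The variable names `raw`, `values`, `totals` and `text` are illustrative only:

```python
from io import StringIO
import pandas as pd

raw = StringIO("""323|2013-06-03 00:00:00|0|0|0
323|2013-06-03 02:00:00|1|0|0.5|86
323|2013-06-03 08:00:00|0|0.5""")

df = pd.read_csv(raw, sep="|", names=["ID", "date", "A", "B", "C", "D"])
values = df.set_index(["ID", "date"])

# Keep values in (0, 2], replace everything else with 0, then sum each row
totals = values.where((values > 0) & (values <= 2), 0).sum(axis=1).reset_index()

# sep="|" writes pipe-delimited text; with no path, to_csv returns a string
text = totals.to_csv(sep="|", index=False, header=False)
print(text)
```

Passing a file name as the first argument to `to_csv` would write the same pipe-delimited rows to disk.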

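Answer 2 (score: 1)

Note that pd.read_csv() throws an error when it reads a csv with a variable number of columns, unless you supply the column names beforehand. This should do it: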

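import pandas as pd
import numpy as np

# Supplying names up front lets read_csv accept rows of different lengths
df = pd.read_csv('sample.txt', names=['Index','Date','Val1','Val2','Val3','Val4'], sep='|')

# Blank out values above 2 so the sum skips them
vals = ['Val1','Val2','Val3','Val4']
df[vals] = df[vals].mask(df[vals] > 2, np.nan)

# Row-wise sum of the value columns (NaN is skipped)
df['Final'] = df.iloc[:,2:].sum(axis=1)

df = df[['Index','Date','Final']]

This gives: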

   Index                 Date  Final
0    323  2013-06-03 00:00:00    0.0
1    323  2013-06-03 01:00:00    1.0
2    323  2013-06-03 02:00:00    1.5
3    323  2013-06-03 03:00:00    1.5
4    323  2013-06-03 04:00:00    0.0
5    323  2013-06-03 05:00:00    0.5
6    323  2013-06-03 06:00:00    0.0
7    323  2013-06-03 07:00:00    3.5
8    323  2013-06-03 08:00:00    0.5

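Here is a more concise approach (very similar to @Scott Boston's answer, but it avoids creating a separate dataframe). Setting the first two columns of the csv as the dataframe's index leaves only float values in the rest of the dataframe, which can then be filtered conditionally: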

df = pd.read_csv('sample.txt', names=['Index','Date','Val1','Val2','Val3','Val4'], sep='|').set_index(['Index','Date'])

df['Final'] = df[(df>0) & (df<=2)].sum(axis=1)

df.reset_index()[['Index','Date','Final']].to_csv('output.csv', index=False, header=False)

This gives:

323,2013-06-03 00:00:00,0.0
323,2013-06-03 01:00:00,1.0
323,2013-06-03 02:00:00,1.5
323,2013-06-03 03:00:00,1.5
323,2013-06-03 04:00:00,0.0
323,2013-06-03 05:00:00,0.5
323,2013-06-03 06:00:00,0.0
323,2013-06-03 07:00:00,3.5
323,2013-06-03 08:00:00,0.5
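
The question also mentions stray characters and says that only 0.5, 1 and 2 should count. The code above may not handle non-numeric text. The following is a sketch under that reading of the rule, not a definitive implementation: it uses made-up sample rows, measures the widest row to build enough column names, coerces text to NaN with `pd.to_numeric`, and keeps only the allowed values with `isin`. The names `raw`, `width`, `vals` and `allowed` are hypothetical.

```python
from io import StringIO
import pandas as pd

# Hypothetical sample with a stray character and out-of-range values
raw = """323|2013-06-03 00:00:00|0|x|0.5
323|2013-06-03 01:00:00|1|86
323|2013-06-03 02:00:00|2|0.5|1|1.5"""

# Find the widest row so read_csv gets enough column names
width = max(line.count("|") + 1 for line in raw.splitlines())
names = ["Index", "Date"] + [f"Val{i}" for i in range(1, width - 1)]

# dtype=str keeps every field as text so nothing is guessed at parse time
df = pd.read_csv(StringIO(raw), sep="|", names=names, dtype=str)

# Non-numeric text becomes NaN; values outside {0.5, 1, 2} become NaN too
vals = df[names[2:]].apply(pd.to_numeric, errors="coerce")
allowed = vals.where(vals.isin([0.5, 1, 2]))

# NaN is skipped by the row-wise sum
df["Final"] = allowed.sum(axis=1)

print(df[["Index", "Date", "Final"]])
```

For a real file, the same width check could be run over the lines of the file before calling `read_csv` on its path.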

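Answer 3 (score: 0)

How about this?

# df is the DataFrame read from the csv; overwrite the third column
# with the row-wise sum of every column from the third onward
df[df.columns[2]] = df[df.columns[2:]].sum(axis=1)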