Question

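I'm writing a function that filters a CSV file based on the function's arguments and then computes the average of one column over the filtered rows. I'm only allowed to use import csv (no pandas), and I can't use lambda or any other "advanced" Python shortcuts. I think I can handle the averaging part easily, but I can't work out how to filter on the parameters I mentioned under those constraints. I would normally use pandas for this, which would make it much easier, but I can't here.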

Here is my code:

def calc_avg(self, specific, filter, logic, threshold):
        
        with open(self.load_data, 'r') as avg_file:
            for row in csv.DictReader(avg_file, delimiter= ','):
                specific = row[specific]
                filter = int(row[filter])
                logic = logic
                threshold = 0
                
                if logic == 'lt':
                    filter < threshold
                    
                elif logic == 'gt':
                    filter > threshold
                    
                elif logic == 'lte':
                    filter <= threshold
                    
                elif logic == 'gte':
                    filter >= threshold

It should work with this call:

print(csv_data.calc_avg("Length_of_stay", filter="SOFA", logic="lt", threshold="15"))

That is the format of the code and the column headers. Sample data:

RecordID SAPS-I SOFA    Length_of_stay  
132539    6      1         5    
132540    16     8         8    
132541    21     11       19    
132545    17     2         4    
132547    14     11        6    
132548    14     4         9    
132551    19     8         6    
132554    11     0        17

Answer 1

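The results of your comparisons are never used. You need to use them in the if statements so that the specific value is included in the average calculation.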

def calc_avg(self, specific, filter, logic, threshold):
    # threshold may arrive as a string such as "15"
    threshold = float(threshold)
    with open(self.load_data, 'r') as avg_file:
        values = []
        for row in csv.DictReader(avg_file, delimiter=','):
            # use new names so the parameters are not overwritten on each row
            value = float(row[specific])
            filter_value = int(row[filter])

            if logic == 'lt' and filter_value < threshold:
                values.append(value)
            elif logic == 'gt' and filter_value > threshold:
                values.append(value)
            elif logic == 'lte' and filter_value <= threshold:
                values.append(value)
            elif logic == 'gte' and filter_value >= threshold:
                values.append(value)
        if len(values) > 0:
            return sum(values) / len(values)
        else:
            return 0
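
To illustrate why the original loop had no effect, here is a minimal sketch contrasting a bare comparison, whose result is discarded, with one whose result is returned and used. The helper name passes_filter is hypothetical and does not come from the question.

```python
def passes_filter(value, logic, threshold):
    """Return True when value satisfies the comparison named by logic."""
    if logic == 'lt':
        return value < threshold
    elif logic == 'gt':
        return value > threshold
    elif logic == 'lte':
        return value <= threshold
    elif logic == 'gte':
        return value >= threshold
    # unknown logic names never match (an assumption of this sketch)
    return False


sofa = 8
sofa < 15                      # evaluates to True, but the result is thrown away
if passes_filter(sofa, 'lt', 15):
    print("row is included")   # the result now controls what happens
```

Keeping the comparison in one small function also means the loop body only has to ask a yes/no question per row.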

Answer 2

Update

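This version resolves logic once and binds the matching function to compare, which is then used while iterating over the rows. This is faster when the data has many rows.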

# written as a function because you don't share the definition of load_data,
# but the main idea can be translated to a class
# (io, csv and the data string are defined in the initial answer below)
def calc_avg(self, specific, filter, logic, threshold):
    if isinstance(threshold, str):
        threshold = float(threshold)
    
    def lt(a, b): return a < b
    def gt(a, b): return a > b
    def lte(a, b): return a <= b
    def gte(a, b): return a >= b
    
    if logic == 'lt': compare = lt
    elif logic == 'gt': compare = gt
    elif logic == 'lte': compare = lte
    elif logic == 'gte': compare = gte
    else: raise ValueError('unknown logic: ' + logic)
    
    with io.StringIO(self) as avg_file: # change to open an actual file
        running_sum = running_count = 0
        for row in csv.DictReader(avg_file, delimiter=','):
            if compare(int(row[filter]), threshold):
                running_sum += int(row[specific])
                # or float(row[specific])
                running_count += 1
        
    if running_count == 0:
        # not even one row passed the filter
        return 0
    else:
        return running_sum / running_count

print(calc_avg(data, 'Length_of_stay', 'SOFA', 'lt', '15'))
print(calc_avg(data, 'Length_of_stay', 'SOFA', 'lt', '2'))
print(calc_avg(data, 'Length_of_stay', 'SOFA', 'lt', '0'))

Output

9.25
11.0
0
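
Since the question calls calc_avg as a method and reads from a real file, the same idea might be moved into a class roughly as follows. This is only a sketch under assumptions: the class name CsvData is hypothetical, and load_data is assumed to hold the path of a comma-separated file, which the question does not confirm. It uses only import csv and plain if/elif branches.

```python
import csv


class CsvData:
    """Hypothetical wrapper; load_data is assumed to be a CSV file path."""

    def __init__(self, load_data):
        self.load_data = load_data

    def calc_avg(self, specific, filter, logic, threshold):
        # the question passes the threshold as a string such as "15"
        threshold = float(threshold)
        running_sum = 0.0
        running_count = 0
        with open(self.load_data, 'r', newline='') as avg_file:
            for row in csv.DictReader(avg_file, delimiter=','):
                value = float(row[filter])
                if logic == 'lt':
                    include = value < threshold
                elif logic == 'gt':
                    include = value > threshold
                elif logic == 'lte':
                    include = value <= threshold
                elif logic == 'gte':
                    include = value >= threshold
                else:
                    # unknown logic names exclude every row in this sketch
                    include = False
                if include:
                    running_sum += float(row[specific])
                    running_count += 1
        if running_count == 0:
            # no row passed the filter
            return 0
        return running_sum / running_count
```

With the sample rows saved as a comma-separated file (the name patients.csv is only an example), CsvData("patients.csv").calc_avg("Length_of_stay", filter="SOFA", logic="lt", threshold="15") has the same shape as the call in the question.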

Initial answer

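To filter the rows, you have to do something with the comparison once you have determined which inequality applies. The code here stores its result in the boolean include.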

You can then keep two variables, running_sum and running_count, and divide the first by the second at the end to return the average.

import io
import csv

# written as a function because you don't share the definition of load_data
# but the main idea can be translated to a class
def calc_avg(self, specific, filter, logic, threshold):
    if isinstance(threshold, str):
        threshold = float(threshold)

    with io.StringIO(self) as avg_file: # change to open an actual file
        running_sum = running_count = 0
        for row in csv.DictReader(avg_file, delimiter=','):
            # your code has: filter = int(row[filter])
            value = int(row[filter]) # avoid overwriting parameters
            
            if logic == 'lt' and value < threshold:
                include = True
            elif logic == 'gt' and value > threshold:
                include = True
            elif logic == 'lte' and value <= threshold: # should it be 'le'
                include = True
            elif logic == 'gte' and value >= threshold: # should it be 'ge'
                include = True
            # note: ast.literal_eval cannot collapse these cases into one line;
            # it only parses literals and does not evaluate comparisons
            else:
                include = False
            
            if include:
                running_sum += int(row[specific])
                # or float(row[specific])
                running_count += 1
        
        # raises ZeroDivisionError if no row passed the filter
        return running_sum / running_count
    

data = """RecordID,SAPS-I,SOFA,Length_of_stay
132539,6,1,5
132540,16,8,8
132541,21,11,19
132545,17,2,4
132547,14,11,6
132548,14,4,9
132551,19,8,6
132554,11,0,17"""


print(calc_avg(data, 'Length_of_stay', 'SOFA', 'lt', '15'))
print(calc_avg(data, 'Length_of_stay', 'SOFA', 'lt', '2'))

Output

9.25
11.0
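
As a sanity check on those two results, the arithmetic can be reproduced by hand from the sample rows. The list below simply restates the (SOFA, Length_of_stay) pairs from the data above; the function name average_where_sofa_below is made up for this illustration.

```python
# (SOFA, Length_of_stay) pairs copied from the sample data
pairs = [(1, 5), (8, 8), (11, 19), (2, 4), (11, 6), (4, 9), (8, 6), (0, 17)]


def average_where_sofa_below(limit):
    """Average Length_of_stay over the rows whose SOFA is below limit."""
    total = 0
    count = 0
    for sofa, stay in pairs:
        if sofa < limit:
            total += stay
            count += 1
    # like the initial answer, this assumes at least one row passes
    return total / count


print(average_where_sofa_below(15))  # all 8 rows pass: 74 / 8 = 9.25
print(average_where_sofa_below(2))   # SOFA 1 and 0 pass: (5 + 17) / 2 = 11.0
```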

Filtering a CSV file using function arguments

2 answers: