Question

Below is a list of tasks imported from the Kaggle home page into a pandas dataframe.

$(".manage_permission_button").click(function() {
    var button = $(this),
        user_id = button.attr('user_id');

    $.ajax({
        type: 'GET',
        url: 'assets/pages/manage_users/myModal.php',
        data: {
            user_id: user_id
        },
        success: function(response) {
            $(response).insertAfter(button);
            $('#myModal').modal('show');
        },
    });
});

$('#myModal').on('hidden.bs.modal', function () {
    $(this).remove();
});

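The sample dataframe generates the first row correctly. I need to loop over the rest of the data. How can I repeat the transpose method every 5 rows?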

Answer 1

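The simplest approach is to use a MultiIndex, but unfortunately the data does not repeat every 5 rows, so the columns drift out of alignment from row 4 onward in the output: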

df.index = [df.index // 5, df.index % 5]
sample = df.unstack()
sample.columns=['task_name', 'task_description', 'task_date', 'task_prize', 'task_teams']

print (sample.head(10))

                                    task_description  \
0  Can you detect fraudulent click traffic for mo...   
1  Can you segment each objects within image fram...   
2          Image classification of fashion products.   
3    Image Classification of Furniture & Home Goods.   
4  Given an image, can you find all of the same l...   
5              Google Landmark Recognition Challenge   
6                                          289 teams   
7                                          Knowledge   
8                       image data, object detection   
9                       Getting Started2 years to go   

                                           task_date  \
0                              Featured13 days to go   
1                             Research2 months to go   
2                              Researcha month to go   
3                              Researcha month to go   
4                              Researcha month to go   
5  Label famous (and not-so-famous) landmarks in ...   
6                ImageNet Object Detection Challenge   
7                                            0 teams   
8                                          Knowledge   
9      tutorial, tabular data, binary classification   

                                      task_prize  \
0                                       $25,000    
1                                        $2,500    
2                                        $2,500    
3                                        $2,500    
4                                     image data   
5                          Researcha month to go   
6  Identify and label everyday objects in images   
7         ImageNet Object Localization Challenge   
8                                        7 teams   
9                                      Knowledge   

                                task_teams  
0                              3,382 teams  
1                                 32 teams  
2                                 67 teams  
3                                238 teams  
4                                  $2,500   
5                               image data  
6                   Research12 years to go  
7           Identify the objects in images  
8  Titanic: Machine Learning from Disaster  
9                             11,169 teams
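
To see why the index arithmetic works when every record really does have exactly five fields, here is a minimal sketch on hypothetical toy data (the values and the single column name value are made up for illustration). It builds the MultiIndex explicitly with pd.MultiIndex.from_arrays, which is assumed to be equivalent to the list assignment used above.

```python
import pandas as pd

# Hypothetical toy data: two records of exactly five fields each, one field per row.
fields = [
    "Task A", "Description A", "Featured13 days to go", "$25,000", "3,382 teams",
    "Task B", "Description B", "Research2 months to go", "$2,500", "32 teams",
]
df = pd.DataFrame({"value": fields})

# Row i belongs to record i // 5 and is field number i % 5 within that record.
df.index = pd.MultiIndex.from_arrays([df.index // 5, df.index % 5])

# unstack() pivots the inner index level (the field number) into columns,
# leaving one row per record.
sample = df["value"].unstack()
sample.columns = ["task_name", "task_description", "task_date", "task_prize", "task_teams"]

print(sample)
```

As soon as one record has six fields instead of five, every later row is shifted by one, which is the misalignment visible in the output above.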

Answer 2

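As @jezrael pointed out, the data is not uniform: some records have five pieces of information and some have six.

To clean it up and load it into a dataframe, you can do the following: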

import requests as r
import pandas as pd

raw = r.get('https://s3.amazonaws.com/todel162/kaggle_unicode1.txt')

# the raw data has some non ascii characters which you could likely ignore.
# and I ignore the last line if it is blank as that breaks the parsing.
data = raw.text.encode('ascii', errors='ignore').decode()
lines = [d.strip() for d in data.split('\n')]
if lines[-1] == '':
    lines = lines[:-1]

# then split the data into sections (one per blank-line-separated block)
# this one line replaces the commented-out for-loop below
blurbs = [l.split('**') for l in '**'.join(lines).split('****')]
# blurbs = []
# blurb = []
# for line in lines:
#     if line == '':
#         blurbs.append(blurb)
#         blurb = []
#     else:
#         blurb.append(line)

# it seems each section can have either 5 or 6 elements, so write a function
# that returns a record in a uniform format, and use
# pandas.DataFrame.from_records to load the records into a dataframe

def get_record(blurb):
    if len(blurb) == 6:
        return blurb
    return blurb[:3] + [''] + blurb[3:]

cols = ['task_name', 'task_description', 'task_date', 'other', 'task_prize', 'task_teams']
df = pd.DataFrame.from_records([get_record(b) for b in blurbs], columns=cols)
df.head()

This outputs the following:

Out[8]:
                                          task_name  \
0  TalkingData AdTracking Fraud Detection Challenge
1        CVPR 2018 WAD Video Segmentation Challenge
2         iMaterialist Challenge (Fashion) at FGVC5
3       iMaterialist Challenge (Furniture) at FGVC5
4               Google Landmark Retrieval Challenge

                                    task_description               task_date  \
0  Can you detect fraudulent click traffic for mo...   Featured13 days to go
1  Can you segment each objects within image fram...  Research2 months to go
2          Image classification of fashion products.   Researcha month to go
3    Image Classification of Furniture & Home Goods.   Researcha month to go
4  Given an image, can you find all of the same l...   Researcha month to go

        other task_prize   task_teams
0                $25,000  3,382 teams
1                 $2,500     32 teams
2                 $2,500     67 teams
3                 $2,500    238 teams
4  image data     $2,500    129 teams

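As you can see, the data is now parsed into the correct columns. From there you can convert types, drop the 'other' column, and so on, and analyze the dataset.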
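
A minimal sketch of that follow-up cleanup, assuming rows in the same six-column shape that get_record produces. The two records below are copied from the sample output (with the truncated descriptions left as shown); stripping non-digits and using pd.to_numeric with errors='coerce' is one possible approach, not the only one.

```python
import pandas as pd

# Two rows in the six-column layout, taken from the sample output above.
cols = ['task_name', 'task_description', 'task_date', 'other', 'task_prize', 'task_teams']
records = [
    ['TalkingData AdTracking Fraud Detection Challenge',
     'Can you detect fraudulent click traffic for mo...',
     'Featured13 days to go', '', '$25,000', '3,382 teams'],
    ['Google Landmark Retrieval Challenge',
     'Given an image, can you find all of the same l...',
     'Researcha month to go', 'image data', '$2,500', '129 teams'],
]
df = pd.DataFrame.from_records(records, columns=cols)

# Drop the padding column once it is no longer needed.
df = df.drop(columns=['other'])

# "3,382 teams" -> 3382 and "$25,000" -> 25000: strip every non-digit character,
# then convert. errors='coerce' turns values with no digits into NaN
# instead of raising.
for col in ['task_teams', 'task_prize']:
    digits = df[col].str.replace(r'\D', '', regex=True)
    df[col] = pd.to_numeric(digits, errors='coerce')

print(df[['task_name', 'task_prize', 'task_teams']])
```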

Repeat a pandas method after every N rows
