Question

I have an interesting problem. I have a dataset like this:

    id,    start,  end   
    1234    200   400
    1235    300   500
    1236    100   900
    1236    200   1200
    1236    300   1400

Main goal: I want to count the number of concurrent sessions for each ID.

at 100, id:1236 has 1 session running
at 200, id:1236 has 2 sessions
at 300, id:1236 has 3 sessions
...
at 1000, id:1236 has 2 sessions
etc

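My solution:

Add columns numbered 1 to 1400 (the min and max of the sessions) to all rows.
Fill the columns between each row's session start and session end with 1.
Then sum all rows of a user to get the result above.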

In pandas:

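    import pandas as pd

    # data and fileName are defined elsewhere (path prefix and file name of the tab-separated input)
    df = pd.read_csv(data + fileName, sep="\t", usecols=[0, 1, 2], names=['id', 'start', 'end'])

    # add one zero-filled column per time step
    for i in range(0, 1440):
        df[str(i)] = 0

    print(df.columns)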

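I can add the columns, and I am now thinking about how to fill these columns with 1 between the session start and session end of each row. Each row can have a different session start and end.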
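The fill step described above can be sketched with NumPy broadcasting instead of a per-column loop. This is only an illustration under assumptions: the sample rows are typed in by hand, the columns are named with integers rather than strings, and a session is treated as active for `start <= t < end` (the question does not say whether the end is inclusive).

```python
import numpy as np
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "id": [1234, 1235, 1236, 1236, 1236],
    "start": [200, 300, 100, 200, 300],
    "end": [400, 500, 900, 1200, 1400],
})

# One column per time step; 1440 matches the loop in the question.
times = np.arange(0, 1440)

# Broadcasting compares every row's (start, end) against every time step.
# Assumption: a session is active when start <= t < end.
start = df["start"].to_numpy()[:, None]
end = df["end"].to_numpy()[:, None]
active = ((start <= times) & (times < end)).astype(int)

# Wide frame of 0/1 flags, then sum the rows that belong to each id.
flags = pd.DataFrame(active, columns=times, index=df.index)
concurrent = flags.groupby(df["id"]).sum()

print(concurrent.loc[1236, [100, 200, 300, 1000]].tolist())  # [1, 2, 3, 2]
```

The result has one row per id and one column per time step, so `concurrent.loc[1236, 300]` reads off the number of sessions id 1236 has running at time 300.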

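Any hints would be helpful. I am only trying this in pandas for now, but later I have to port it to Apache PySpark, where pandas is not available on the worker nodes.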
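Because pandas may not be available on the workers, one possible alternative is to avoid the wide matrix entirely and work with start/end events in plain Python. The sketch below is an assumption-laden illustration, not a PySpark implementation: `concurrent_sessions` is a hypothetical helper name, and a session is again treated as active for `start <= t < end`. Each row contributes +1 at its start and -1 at its end, and a running sum per id gives the concurrent count at every point where it changes.

```python
from collections import defaultdict


def concurrent_sessions(rows):
    """Return {id: [(time, active_count), ...]} at each time the count changes.

    Assumption: a session is active for start <= t < end.
    """
    # Net change in the number of open sessions, per id and per time.
    deltas = defaultdict(lambda: defaultdict(int))
    for sid, start, end in rows:
        deltas[sid][start] += 1   # a session opens
        deltas[sid][end] -= 1     # a session closes

    result = {}
    for sid, by_time in deltas.items():
        running = 0
        timeline = []
        for t in sorted(by_time):
            running += by_time[t]
            timeline.append((t, running))
        result[sid] = timeline
    return result


rows = [
    (1234, 200, 400),
    (1235, 300, 500),
    (1236, 100, 900),
    (1236, 200, 1200),
    (1236, 300, 1400),
]
timeline = concurrent_sessions(rows)
print(timeline[1236])
# [(100, 1), (200, 2), (300, 3), (900, 2), (1200, 1), (1400, 0)]
```

Since this uses only the standard library, the per-id part of the logic could in principle run inside whatever per-key step the Spark job uses; how to wire that up is left open here.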

Answer 1

In pandas you could also do it like this:

    import pdfcrowd

    def generate_pdf_view(request):
        path_to_html_file = os.path.join(settings.PROJECT_ROOT, "templates/mytemplate.html")
        try:
            # create an API client instance
            client = pdfcrowd.Client("<username>", "<API_key>")
            # convert a web page and store the generated PDF to a variable
            pdf = client.convertFile(path_to_html_file)
            # set HTTP response headers
            response = HttpResponse(content_type="application/pdf")
            response["Cache-Control"] = "max-age=0"
            response["Accept-Ranges"] = "none"
            response["Content-Disposition"] = "attachment; filename=myamazingPDF.pdf"
            # send the generated PDF
            response.write(pdf)
        except pdfcrowd.Error as why:
            response = HttpResponse(content_type="text/plain")
            response.write(why)
        return response

where t is the time you are interested in. Just rename the last column accordingly. But I don't know whether this can be ported to PySpark. @Khris
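As a hedged illustration of counting sessions at a single time `t` (not necessarily the code this answer had in mind), the sketch below adds one flag column named after `t` and sums it per id. The column name `active_at_1000` and the half-open rule `start <= t < end` are assumptions made for the example.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "id": [1234, 1235, 1236, 1236, 1236],
    "start": [200, 300, 100, 200, 300],
    "end": [400, 500, 900, 1200, 1400],
})

t = 1000  # the time of interest
col = f"active_at_{t}"  # hypothetical column name, renamed per t

# Assumption: a session is active when start <= t < end.
df[col] = ((df["start"] <= t) & (t < df["end"])).astype(int)

# Number of sessions each id has running at time t.
counts = df.groupby("id")[col].sum()
print(counts)
```

With the sample data, id 1236 has two sessions running at `t = 1000` and the other two ids have none.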
