How do I merge two dataframes?

Time: 2018-08-15 20:28:32

Tags: python pandas dataframe

I have this table1:

   A  B  C  D
0  1  2  k  l
1  3  4  e  r

df.dtypes gives me this:

A int64
B int64
C object
D object

Now, I want to create a table2 that contains only the object columns (C and D), using the command table2=df.select_dtypes(include=[object]).

Then, I want to encode table2 using the command pd.get_dummies(table2).

It gives me this table2:

   C  D
0  0  1
1  1  0

The last thing I want to do is append the two tables (table1 + table2) together, so that the final table looks like this:

   A  B  C  D
0  1  2  0  1
1  3  4  1  0

Can anyone help?

2 Answers:

Answer 0 (score: 1):

This should do it:

table2 = df.select_dtypes(include=[object])
df.select_dtypes(include=[int]).join(table2.apply(lambda x: pd.factorize(x, sort=True)[0]))


It first factorizes the object-dtype columns of table2 (rather than using the dummy generator), and then joins them back onto the int-dtype columns of the original dataframe!
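
For reference, a minimal end-to-end sketch of this approach on the question's data, assuming the original dataframe is named df as in the question (the result name is just for illustration):

import pandas as pd

df = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})

# object columns only (C and D)
table2 = df.select_dtypes(include=[object])

# factorize each object column and join the integer codes
# back onto the int columns of the original dataframe
result = df.select_dtypes(include=[int]).join(
    table2.apply(lambda x: pd.factorize(x, sort=True)[0]))

print(result)
#    A  B  C  D
# 0  1  2  1  0
# 1  3  4  0  1

Note that with sort=True the codes are assigned in alphabetical order of the values, so the 0/1 mapping may be the reverse of the one shown in the question.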

Answer 1 (score: 0):

Assuming the problem you are trying to solve is to have one column whose value is 1 where C is 'e', and one column whose value is 1 where D is 'l'. Otherwise, as described elsewhere, you will get one column for every possible value.

df = pd.DataFrame({'A': [1,2], 'B': [2,4], 'C': ['k','e'], 'D': ['l','r']})
df
   A  B  C  D
0  1  2  k  l
1  2  4  e  r
df.dtypes
A     int64
B     int64
C    object
D    object
dtype: object

Now, if you want to drop 'e' and 'l' because you want k-1 columns, you can use the drop_first parameter:

df = pd.get_dummies(df, drop_first=True)
df
   A  B  C_k  D_r
0  1  2    1    0
1  2  4    0    1

Note that the dtypes are not the same int64 as columns A and B:

df.dtypes
A      int64
B      int64
C_k    uint8
D_r    uint8
dtype: object

If it matters that they all be the same type, you can of course change them appropriately. In the general case you will probably want to keep names like C_k and D_r so you know what each dummy corresponds to. If not, you can always rename based on '_' (the default prefix separator of get_dummies), so you could build a rename dictionary that splits off the part of the column name after the prefix. Or, for a simple case like this, just rename the columns directly.
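
A minimal sketch of that rename-by-prefix idea, assuming you want plain C and D column names (and matching int64 dtypes) back after get_dummies; the lambda-based rename is just one way to split on the separator:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})
df = pd.get_dummies(df, drop_first=True)  # columns become A, B, C_k, D_r

# keep only the part before the default '_' separator, e.g. 'C_k' -> 'C';
# A and B contain no '_', so they are left unchanged
df = df.rename(columns=lambda c: c.split('_')[0])

# optionally cast the dummy columns (uint8 in older pandas, bool in newer
# versions) to int64 so every column has the same dtype
df = df.astype('int64')

print(df)
#    A  B  C  D
# 0  1  2  1  0
# 1  2  4  0  1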