I have this table1:

   A  B  C  D
0  1  2  k  l
1  3  4  e  r

df.dtypes

gives me this:
A int64
B int64
C object
D object
Now I want to create a table2 that contains only the object columns (C and D) with the command

table2 = df.select_dtypes(include=[object])

and then encode table2 with

pd.get_dummies(table2)

That gives me this table2:
C D
0 0 1
1 1 0
The last thing I want to do is join the two tables (table1 + table2) together, so that the final table looks like this:
A B C D
0 1 2 0 1
1 3 4 1 0
Can anyone help?
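For reference, the exact final table described above can be built with get_dummies by keeping only the indicator columns in which a 1 marks 'e' (in C) and 'l' (in D), then renaming them back to C and D; a minimal sketch (the choice of which indicator columns to keep is inferred from the target table shown):

```python
import pandas as pd

# Rebuild table1 from the question
df = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})

# One-hot encode the object columns, then keep only the columns where
# 1 marks 'e' (in C) and 'l' (in D), matching the target table
dummies = pd.get_dummies(df.select_dtypes(include=[object]))
table2 = dummies[['C_e', 'D_l']].astype('int64').rename(columns={'C_e': 'C', 'D_l': 'D'})

# Join the numeric columns back with the encoded columns
result = pd.concat([df[['A', 'B']], table2], axis=1)
print(result)
#    A  B  C  D
# 0  1  2  0  1
# 1  3  4  1  0
```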
Answer 0 (score: 1)
This should do it:
table2 = df.select_dtypes(include=[object])
df.select_dtypes(include=[int]).join(table2.apply(lambda x: pd.factorize(x, sort=True)[0]))
This first factorizes the object-dtype columns of table2 (instead of using the dummy generator) and then joins them back onto the int-dtype columns of the original data frame.
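A self-contained sketch of this factorize-and-join approach (using df from the question as the original table):

```python
import pandas as pd

# table1 from the question
df = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})

# Object-dtype columns only
table2 = df.select_dtypes(include=[object])

# Factorize each object column; sort=True assigns codes in sorted label
# order, so 'e' -> 0, 'k' -> 1 and 'l' -> 0, 'r' -> 1
encoded = table2.apply(lambda x: pd.factorize(x, sort=True)[0])

# Join the encoded columns back onto the int-dtype columns
result = df.select_dtypes(include=[int]).join(encoded)
print(result)
#    A  B  C  D
# 0  1  2  1  0
# 1  3  4  0  1
```

Note that with sort=True the codes run in sorted label order ('e' becomes 0 and 'k' becomes 1), so the 0/1 assignment is the opposite of the table shown in the question; flip it with `1 - encoded` if needed.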
Answer 1 (score: 0)
Assuming the problem you are trying to solve is to have a single column for C in which a 1 replaces the value e, and a single column for D in which a 1 replaces the value l. Otherwise, as noted elsewhere, you get one column for each response possibility.
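To illustrate the "one column per response possibility" default, a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'C': ['k', 'e'], 'D': ['l', 'r']})

# Without drop_first, every distinct value gets its own indicator column
print(pd.get_dummies(df).columns.tolist())
# ['C_e', 'C_k', 'D_l', 'D_r']
```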
Set up the data frame:

df = pd.DataFrame({'A': [1,2], 'B': [2,4], 'C': ['k','e'], 'D': ['l','r']})
df

   A  B  C  D
0  1  2  k  l
1  2  4  e  r

df.dtypes

A     int64
B     int64
C    object
D    object
dtype: object

Now, if you want to drop e and l (because you want k-1 indicator columns per original column), you can use the drop_first parameter:
df = pd.get_dummies(df, drop_first=True)
df

   A  B  C_k  D_r
0  1  2    1    0
1  2  4    0    1

Note that the dtype of the new columns is different from that of the int64 columns A and B:
df.dtypes

A      int64
B      int64
C_k    uint8
D_r    uint8
dtype: object

If it matters that they are all the same type, you can of course change them accordingly. In the general case you probably want to keep names like C_k and D_r so that you know what each dummy corresponds to. If not, you can always rename based on '_' (the default value of prefix_sep in get_dummies): build a rename dictionary by splitting off the part of the column name after the prefix, or, in a simple case like this, just rename the columns directly.
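The dtype change and the rename discussed above might be sketched like this (the int64 cast and the split-on-'_' rename are one possible choice, not the only one):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})
df = pd.get_dummies(df, drop_first=True)

# Cast the dummy columns to the same dtype as A and B
dummy_cols = ['C_k', 'D_r']
df[dummy_cols] = df[dummy_cols].astype('int64')

# Rename by splitting on '_' (the default prefix_sep of get_dummies),
# keeping only the prefix, i.e. the original column name
df = df.rename(columns={c: c.split('_')[0] for c in dummy_cols})
print(df.dtypes)
```

After this, all four columns (A, B, C, D) share the int64 dtype.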