Question

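After converting the UTC timezone in the Time column of my DataFrame and saving the result to a new CSV file, I decided to plot tweet frequency over time. The plot worked while the timezone was UTC, but after converting to Eastern it gives the error below. How can I fix this?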

import pandas as pd
import matplotlib.pyplot as plt


time_interval = pd.offsets.Second(10)

fig, ax = plt.subplots(figsize=(6, 3.5))

ax = (
    pd.read_csv('converted_timezone_tweets.csv', parse_dates=['Time'])
          .resample(time_interval, on='Time')['ID']
          .count()
          .plot.line(ax=ax)
)

plt.show()

The error is:

/scratch/sjn/anaconda/bin/python /scratch2/debate_tweets/temporal_analysis.py
Traceback (most recent call last):
  File "/scratch2/debate_tweets/temporal_analysis.py", line 18, in <module>
    pd.read_csv('converted_timezone_tweets.csv', parse_dates=['Time'])
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
  File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
  File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
  File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
  File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.


Process finished with exit code 1

converted_timezone_tweets.csv looks like this:

,Candidate,ID,Time,Username,Tweet
0,Clinton,788948653016842240,2016-10-19 23:43:11-04:00,Tamayo_castle,Hillary Clinton dresses as Christian Bale at the debate via /r/pics 
1,Clinton,788948666501464064,2016-10-19 23:43:14-04:00,ThinkCenter1968,"It's like I told my kids, a reason U don't want 2 vote 4 Hillary is U want the inheritance I'm leaving U, Right? They changed their minds!"
2,Clinton,788948673594097664,2016-10-19 23:43:16-04:00,21stCenRevolt,When hearing about Saudi Arabia murdering people for being gay. Hillary laughed with glee. She disgusting and disgraceful. #debatenight
3,Both,788948662881751040,2016-10-19 23:43:13-04:00,mikeywan,MEGYN IS A PAID HILLARY WHORE #TrumpPence2016 #TrumpTrain 
4,Both,788948675313696769,2016-10-19 23:43:16-04:00,erwoti,Can't wait to hear @realDonaldTrump call that Nasty Woman (Hillary Clinton) - Madam President  #debatenight #ChrisWallace
5,Clinton,788948671756955650,2016-10-19 23:43:15-04:00,isaac_urner,"The Clinton campaign already has  redirecting to their site. That's what a real campaign looks like.
#badhombres2016"

The same code works for valid_tweets.csv and produces a plot. The rows of valid_tweets.csv look like this:

Candidate,ID,Time,Username,Tweet
Clinton,788948653016842240,2016-10-20 03:43:11+00:00,Tamayo_castle,Hillary Clinton dresses as Christian Bale at the debate via /r/pics
Clinton,788948666501464064,2016-10-20 03:43:14+00:00,ThinkCenter1968,"It's like I told my kids, a reason U don't want 2 vote 4 Hillary is U want the inheritance I'm leaving U, Right? They changed their minds!"
Clinton,788948673594097664,2016-10-20 03:43:16+00:00,21stCenRevolt,When hearing about Saudi Arabia murdering people for being gay. Hillary laughed with glee. She disgusting and disgraceful. #debatenight
Both,788948662881751040,2016-10-20 03:43:13+00:00,mikeywan,MEGYN IS A PAID HILLARY WHORE #TrumpPence2016 #TrumpTrain 
Both,788948675313696769,2016-10-20 03:43:16+00:00,erwoti,Can't wait to hear @realDonaldTrump call that Nasty Woman (Hillary Clinton) - Madam President  #debatenight #ChrisWallace
Clinton,788948671756955650,2016-10-20 03:43:15+00:00,isaac_urner,"The Clinton campaign already has redirecting to their site. That's what a real campaign looks like.
#badhombres2016"

Update: in my first file I have:

import pandas as pd
import matplotlib.pyplot as plt

#2016-10-20 03:43:11+00:00
tweets_df = pd.read_csv('valid_tweets.csv')

tweets_df['Time'] = pd.Index(pd.to_datetime(tweets_df['Time'], utc=True)).tz_localize('UTC').tz_convert('US/Eastern')

tweets_df.to_csv('converted_timezone_tweets.csv', index=False)

In my second file I have:

import pandas as pd
import matplotlib.pyplot as plt


time_interval = pd.offsets.Second(10)

fig, ax = plt.subplots(figsize=(6, 3.5))

ax = (
    pd.read_csv('converted_timezone_tweets.csv', engine='python', parse_dates=['Time'])
          .resample(time_interval, on='Time')['ID']
          .count()
          .plot.line(ax=ax)
)

plt.show()

After using engine='python' as suggested in one of the answers, I get this error:

/scratch/sjn/anaconda/bin/python /scratch2/debate_tweets/temporal_analysis.py
Traceback (most recent call last):
  File "/scratch2/debate_tweets/temporal_analysis.py", line 11, in <module>
    .resample(time_interval, on='Time')['ID']
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 4729, in resample
    base=base, key=on, level=level)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/resample.py", line 969, in resample
    return tg._get_resampler(obj, kind=kind)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/resample.py", line 1091, in _get_resampler
    "but got an instance of %r" % type(ax).__name__)
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

Process finished with exit code 1

I ran vimdiff on the first 5 lines of each CSV, and this is what I got:

Answer 1

The error seems to occur while the C engine is parsing the CSV. I don't know enough to say why, but a possible workaround is to force the Python engine by passing engine='python' to pd.read_csv(). As per the Pandas documentation, pd.read_csv() defaults to using the C engine for speed. Given that your error hints at a problem in the C engine, this may be a good place to start. So, try pd.read_csv('converted_timezone_tweets.csv', parse_dates=['Time'], engine='python'). There was also something on GitHub hinting towards similar problems and fixes.

Answer 2

Per the comment, this code

df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
mask = pd.isnull(pd.to_datetime(df1['Time'], errors='coerce'))
print(df1.loc[mask, 'Time'])

prints

9941      None
27457     None
27458     None
...

This means that converted_timezone_tweets.csv contains several entries whose Time field is the string 'None'.
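As a minimal sketch of that check, the snippet below runs the same coerce-and-mask idea on a made-up three-row sample held in memory. The sample data and the names csv_text, parsed and bad_ids are assumptions for illustration, not part of the original files.

```python
import io

import pandas as pd

# Hypothetical sample: the middle row has 'None' where a timestamp should be.
csv_text = (
    "ID,Time\n"
    "1,2016-10-19 23:43:11-04:00\n"
    "2,None\n"
    "3,2016-10-19 23:43:14-04:00\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# errors='coerce' turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(df["Time"], errors="coerce", utc=True)

# IDs of the rows whose Time could not be parsed.
bad_ids = df.loc[parsed.isna(), "ID"].tolist()
print(bad_ids)
```

Whether read_csv hands back the literal string 'None' or a missing value for that field depends on the pandas version, but either way the row ends up as NaT after coercion, so the mask catches it.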

You may want to go back and investigate these values in the original CSV:

df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
mask = pd.isnull(pd.to_datetime(df1['Time'], errors='coerce'))
tweets_df = pd.read_csv('valid_tweets.csv')
print(tweets_df.loc[mask, 'Time'])

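If these tweets have no Time data, the most sensible thing may be to discard them, since we cannot tell which time interval they belong to. You can drop the problematic rows with df1 = df1.loc[mask, :], where mask marks the rows whose Time parsed successfully: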

import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
df1['Time'] = pd.to_datetime(df1['Time'], errors='coerce')
mask = pd.notnull(df1['Time'])
df1 = df1.loc[mask, :]
df1 = df1.set_index('Time')
counts = df1.resample('10S')['ID'].count()

fig, ax = plt.subplots(figsize=(6, 3.5))
counts.plot.line(ax=ax)
plt.show()

To avoid the parsing error, we call pd.read_csv without the parse_dates parameter (above). pd.read_csv therefore returns a DataFrame whose Time column contains date strings:

df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
#    ID                       Time
# 0   5  2016-10-19 23:43:00-04:00
# 1   5  2016-10-19 23:43:05-04:00
# 2   5  2016-10-19 23:43:10-04:00
# 3   5  2016-10-19 23:43:15-04:00
# ...

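Then we use pd.to_datetime to parse the date strings into datetimes. pd.to_datetime parses the date strings by converting them to UTC, taking the timezone offset into account. The resulting datetimes are naive: no timezone information is attached. This behavior is derived from the underlying NumPy datetime64[ns] data type, which Pandas uses to represent datetimes. (This describes the pandas version used here; newer releases may return timezone-aware values instead.)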
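A minimal sketch of the offset handling, assuming a recent pandas release in which passing utc=True yields timezone-aware UTC values (the older release in the traceback behaved as described above). The two sample strings are made up to match the format in the CSV.

```python
import pandas as pd

# Hypothetical sample strings carrying a -04:00 offset, as in the converted CSV.
times = pd.Series([
    "2016-10-19 23:43:11-04:00",
    "2016-10-19 23:43:14-04:00",
])

# The offset is applied during parsing, so the values come out in UTC.
utc_times = pd.to_datetime(times, utc=True)

# On recent pandas the result is already tz-aware, so tz_convert applies directly.
eastern = utc_times.dt.tz_convert("US/Eastern")

print(utc_times.iloc[0])
print(eastern.iloc[0])
```

In that setting a further tz_localize('UTC') is not needed, and pandas rejects it on values that are already timezone-aware.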

So, to make the datetimes timezone-aware again, you would need to call tz_localize / tz_convert once more:

df1['Time'] = pd.Index(df1['Time']).tz_localize('UTC').tz_convert('US/Eastern')

But this also shows that nothing was gained by calling tz_convert the first time and storing the result in converted_timezone_tweets.csv.

So a better solution (one that does not require calling tz_convert after loading converted_timezone_tweets.csv) is to write converted_timezone_tweets.csv without the timezone offset. You can do this by dropping the timezone offset with a call to tz_localize(None):

df1['Time'] = pd.Index(pd.to_datetime(df1['Time'], utc=True)).tz_localize('UTC').tz_convert('US/Eastern').tz_localize(None)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Build a small example frame with UTC times
N = 10
df = pd.DataFrame({'Time': pd.date_range('2016-10-20 03:43:00', periods=N, freq='5S'),
                   'ID': np.random.randint(N)})

# Convert to US/Eastern, drop the offset, and write the CSV
df1 = df.copy()
df1['Time'] = pd.Index(pd.to_datetime(df1['Time'], utc=True)).tz_localize('UTC').tz_convert('US/Eastern').tz_localize(None)
df1.to_csv('converted_timezone_tweets.csv', index=False)

# Read the CSV back and parse the (offset-free) date strings
df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
df1['Time'] = pd.to_datetime(df1['Time'], errors='coerce')
mask = pd.notnull(df1['Time'])
df1 = df1.loc[mask, :]

# Count IDs per 10-second interval for both frames
df = df.set_index('Time')
df1 = df1.set_index('Time')
counts1 = df1.resample('10S')['ID'].count()
counts = df.resample('10S')['ID'].count()

fig, ax = plt.subplots(figsize=(6, 3.5), nrows=2)
counts.plot.line(ax=ax[0])
counts1.plot.line(ax=ax[1])
plt.show()
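On a recent pandas release, the same round trip might be written as in the sketch below. It assumes that to_datetime(..., utc=True) already returns timezone-aware values, so tz_convert is applied directly. It writes to an in-memory buffer rather than a file, and the two sample rows are made up.

```python
import io

import pandas as pd

# Hypothetical UTC input, in the same format as valid_tweets.csv.
df = pd.DataFrame({
    "ID": [1, 2],
    "Time": ["2016-10-20 03:43:11+00:00", "2016-10-20 03:43:14+00:00"],
})

# Parse as UTC, convert to Eastern wall-clock time, then drop the offset.
df["Time"] = (
    pd.to_datetime(df["Time"], utc=True)
    .dt.tz_convert("US/Eastern")
    .dt.tz_localize(None)
)

# Round-trip through CSV text; a StringIO stands in for the file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Without an offset in the strings, parse_dates yields plain naive datetimes.
reloaded = pd.read_csv(buffer, parse_dates=["Time"])
counts = reloaded.resample(pd.offsets.Second(10), on="Time")["ID"].count()
print(counts)
```

Because the stored strings carry no offset, nothing has to be re-localized after loading. The cost is that the file itself no longer records which timezone its times are in.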

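Note that it may be more attractive to store all time-related data in UTC rather than relative to some other local timezone. That way, if you have many CSV files, you don't need to keep track of which timezone each file's times relate to. From this point of view, it is best to keep valid_tweets.csv, drop converted_timezone_tweets.csv, and convert to US/Eastern only when necessary: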

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the UTC data and drop rows whose Time cannot be parsed
df = pd.read_csv('valid_tweets.csv')
df['Time'] = pd.to_datetime(df['Time'], errors='coerce')
mask = pd.notnull(df['Time'])
df = df.loc[mask, :]

# Convert to US/Eastern only at the point of use
df['Time'] = pd.Index(df['Time']).tz_localize('UTC').tz_convert('US/Eastern')
df = df.set_index('Time')
counts = df.resample('10S')['ID'].count()

fig, ax = plt.subplots(figsize=(6, 3.5))
counts.plot.line(ax=ax)
plt.show()
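The "store UTC, convert on use" idea can be wrapped in a small helper. The sketch below is one possible shape, assuming a recent pandas release where utc=True gives timezone-aware values. The function name tweet_counts, its parameters, and the in-memory sample are hypothetical.

```python
import io

import pandas as pd


def tweet_counts(csv_source, tz="US/Eastern", interval=pd.offsets.Second(10)):
    """Count tweets per interval, converting stored UTC times to tz on load."""
    df = pd.read_csv(csv_source)
    # Unparseable entries (such as 'None') become NaT and are dropped below.
    df["Time"] = pd.to_datetime(df["Time"], errors="coerce", utc=True).dt.tz_convert(tz)
    df = df.dropna(subset=["Time"])
    return df.resample(interval, on="Time")["ID"].count()


# Hypothetical sample standing in for valid_tweets.csv.
sample = io.StringIO(
    "ID,Time\n"
    "1,2016-10-20 03:43:11+00:00\n"
    "2,None\n"
    "3,2016-10-20 03:43:14+00:00\n"
    "4,2016-10-20 03:43:21+00:00\n"
)

counts = tweet_counts(sample)
print(counts)
```

Calling counts.plot.line(ax=ax) on the result would then draw the same kind of plot as above, with the x-axis in Eastern time.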

pd.read_csv fails after converting the timezone
