Question

I have a text file of subject #s (named subjects_visit1.txt) like this:

and an Excel worksheet that looks like this:

Low Connectivity  Subject  CanEstimate  Entropy      High Connectivity  Subject  CanEstimate  Entropy
0.0816764         870      TRUE         0.308933317  -0.0313064         668      TRUE         0.868862941
0.215038          577      TRUE         0.448918189  0.172506           600      TRUE         0.885315753
0.0596745         651      TRUE         0.47695019   0.268619           595      TRUE         0.896439952
0.267082          817      TRUE         0.500621849  -0.0507346         851      TRUE         0.907089718
0.18407           567      TRUE         0.508109648  0.182189           822      TRUE         0.915782923
0.0326328         731      TRUE         0.517241379  0.201325           623      TRUE         0.929279958
0.237822          625      TRUE         0.518493071  0.511613           622      TRUE         0.953520938
0.246291          913      TRUE         0.548079129  0.101731           850      TRUE         0.956564212
0.182494          619      TRUE         0.554281617  0.0195069          823      TRUE         0.958840854
-0.0321676        610      TRUE         0.55939053   0.0610047          632      TRUE         0.960237986
0.198884          655      TRUE         0.581442494  0.155816           770      TRUE         0.973656398
0.029618          631      TRUE         0.620796248  0.0703396          754      TRUE         1.012278949
0.205221          866      TRUE         0.630981714  0.19077            804      TRUE         1.023361826
-0.00397881       842      TRUE         0.658492788  0.115125           830      TRUE         1.033213695
0.193168          880      TRUE         0.665481783  -0.0440176         621      TRUE         1.035325469
0.0187139         838      TRUE         0.670966904  0.231593           603      TRUE         1.087118914
-0.0483586        829      TRUE         0.678253186  0.720004           732      TRUE         1.229303773
0.238947          634      TRUE         0.715214736  0.219465           746      TRUE         1.355378243

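I want to create a new df containing only the data for the subjects in the text file I loaded, but the code below isn't working yet. Is there a problem with the data type of my subject list? Or is it something else?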

import pandas as pd

# load text file of subject #s
subject_list = open('subjects_visit2.txt', 'r')
lines = subject_list.read().split('\n')
subjs = list(lines)

newfile = pd.ExcelFile('amygdala_mPFC_data_pandas.xlsx')
df_ROI1 = newfile.parse("01")

# restrict to subject #s in text file 
print df_ROI1['Subject'].isin(subjs)

df_ROI1 = df_ROI1[df_ROI1['Subject'].isin(subjs)]

Answer 1

You can use the following:

In [4]: from pandas import DataFrame

In [5]: df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})

In [6]: df
Out[6]:
   A  B
0  5  1
1  6  2
2  3  3
3  4  5

In [7]: df[df['A'].isin([3, 6])]
Out[7]:
   A  B
1  6  2
2  3  3

By the way, if you are working in a Notebook environment, it is better to use:

df.head(n=5) # Gives you the first 5 rows of the dataframe
df.sample(n=5) # Gives you a random set of 5 rows of the dataframe

Edit1: What happens if you run the following?

values_list = df_ROI1['Subject'].unique()

if "577" in values_list:
    print ("577 is in the dataframe and is a string")
elif 577 in values_list:
    print ("577 is in the dataframe and is an integer")
else:
    print ("577 is NOT in the dataframe")

EDIT2:

So the mistake was passing the algorithm a string instead of an integer.

Try:

df_ROI1 = df_ROI1[df_ROI1['Subject'].isin([577])] # Without the quotes around 577
df_ROI1.head(n=5)
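
The type mismatch can be reproduced on a small frame. This is only a minimal sketch: the three rows are copied from the table in the question, and subjs_as_text is a hypothetical stand-in for the list read from the text file.

```python
import pandas as pd

# Small stand-in for the Excel sheet; 'Subject' holds integers.
df = pd.DataFrame({
    "Subject": [870, 577, 651],
    "Entropy": [0.308933317, 0.448918189, 0.47695019],
})

# Lines read from a text file are strings, so none of them match.
subjs_as_text = ["577", "651"]  # hypothetical file contents
mask_text = df["Subject"].isin(subjs_as_text)

# Converting each entry to int makes the comparison meaningful.
subjs_as_int = [int(s) for s in subjs_as_text]
mask_int = df["Subject"].isin(subjs_as_int)

print(mask_text.tolist())
print(mask_int.tolist())
print(df[mask_int])
```

With the string list every row is rejected, which is why the filtered frame in the question comes back empty; with the integer list only subjects 577 and 651 remain.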

Answer 2

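You may need to pip install xlrd to read .xlsx files (recent pandas versions read .xlsx through openpyxl instead, since xlrd 2.0 dropped support for that format). Otherwise, save your data as .csv and use pd.read_csv().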

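Also, the data you posted appears to have 8 columns, but I assume it really has only 4, right? If not, there are duplicate variable names that need to be resolved.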
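
If the sheet really does hold both 4-column blocks side by side, one possible way to resolve the duplicate names is to stack the blocks into a single long table. The sketch below is an assumption, not something from the question: it builds the wide frame by hand, assumes the repeated headers were renamed with a ".1" suffix (which is how pandas handles duplicate headers when reading a file), and introduces a hypothetical Group column to record which block each row came from.

```python
import pandas as pd

# Hand-built stand-in for the 8-column sheet (first two rows of the question's data).
wide = pd.DataFrame({
    "Low Connectivity": [0.0816764, 0.215038],
    "Subject": [870, 577],
    "CanEstimate": [True, True],
    "Entropy": [0.308933317, 0.448918189],
    "High Connectivity": [-0.0313064, 0.172506],
    "Subject.1": [668, 600],
    "CanEstimate.1": [True, True],
    "Entropy.1": [0.868862941, 0.885315753],
})

# Split by position: the first four columns are the "low" block, the rest the "high" block.
low = wide.iloc[:, :4].copy()
high = wide.iloc[:, 4:].copy()

# Give both blocks the same column names so they can be stacked.
shared_columns = ["Connectivity", "Subject", "CanEstimate", "Entropy"]
low.columns = shared_columns
high.columns = shared_columns

# 'Group' is a hypothetical label column, not part of the original sheet.
low["Group"] = "Low"
high["Group"] = "High"

long_df = pd.concat([low, high], ignore_index=True)
print(long_df)
```

After this step there is a single Subject column to filter on. The rest of this answer assumes the simpler 4-column layout.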

import pandas as pd

with open('subjects_visit2.txt', 'r') as infile:
    # put contents into a list without the newlines
    subject_list = infile.read().splitlines()

# convert subject_list to a list of integers
subject_list = [int(subject) for subject in subject_list]

# open data file and show 1st 5 rows
df = pd.read_excel('amygdala_mPFC_data_pandas.xlsx')
print(df.head())

# uses .query() which allows easy to read syntax.
# Note: The @ symbol allows access to objects not defined in the data frame
new_df = df.query('Subject in @subject_list')
print(new_df)

The output will look like this:

   Connectivity  Subject CanEstimate   Entropy
0      0.081676      870        True  0.308933
1      0.215038      577        True  0.448918
2      0.059674      651        True  0.476950
3      0.267082      817        True  0.500622
4      0.184070      567        True  0.508110

    Connectivity  Subject CanEstimate   Entropy
1       0.215038      577        True  0.448918
9      -0.032168      610        True  0.559391
28      0.155816      770        True  0.973656
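
One caveat with the integer conversion: int('') raises a ValueError, so a blank line in the text file would break the list comprehension above. (The question's read().split('\n') also leaves an empty string at the end whenever the file ends with a newline.) The following is a hedged sketch that skips blank lines and filters with isin, the boolean-mask counterpart of the .query() call. The io.StringIO object is a stand-in for the real file, and its contents are assumed, since the question does not show them.

```python
import io

import pandas as pd

# Stand-in for open('subjects_visit2.txt'); the contents are assumed.
# Note the blank line in the middle and the trailing newline.
infile = io.StringIO("577\n610\n\n770\n")

# Keep only non-empty lines and convert them to integers.
subject_list = []
for raw_line in infile:
    text = raw_line.strip()
    if text:
        subject_list.append(int(text))

# Three rows from the question's table, used as sample data.
df = pd.DataFrame({
    "Connectivity": [0.0816764, 0.215038, -0.0321676],
    "Subject": [870, 577, 610],
})

# Boolean-mask filter; selects the same rows as df.query('Subject in @subject_list').
new_df = df[df["Subject"].isin(subject_list)]
print(new_df)
```

Subject 770 is in the list but not in this sample frame, so it simply does not appear in the result; isin never raises for values that are missing from the column.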

Select rows based on values in python pandas
