Question

I have two DataFrames and am looking for a way to select (and count) rows of df1 based on the rows in df2.

This is my df1:

      Chromosome  Start position  End position Reference Variant  reads  \
0       chr1       109419841     109419841         C       T      1
1       chr1       197008365     197008365         C       T      1

   variation reads  % variation                 gDNA nomencl  \
0                1          100  Chr1(GRCh37):g.109419841C>T
1                1          100  Chr1(GRCh37):g.197008365C>T

            cDNA nomencl    ...    exon transcript ID          inheritance  \
0  NM_013296.4:c.-258C>T    ...       2   NM_013296.4  Autosomal recessive
1  NM_001994.2:c.*143G>A    ...     UTR   NM_001994.2  Autosomal recessive

  test type                      Phenotype male coverage male ratio covered  \
0   Unknown  Deafness, autosomal recessief             0                  0
1   Unknown          Factor 13 deficientie             0                  0

  female coverage female ratio covered ratio M:F
0               1                    1       0.0
1               1                    1       0.0

df1 has the following columns:

Chromosome                10561 non-null object
Start position            10561 non-null int64
End position              10561 non-null int64
Reference                 10415 non-null object
Variant                   10536 non-null object
reads                     10561 non-null int64
variation reads           10561 non-null int64
% variation               10561 non-null int64
gDNA nomencl              10561 non-null object
cDNA nomencl              10446 non-null object
protein nomencl           9997 non-null object
classification            10561 non-null object
status                    10561 non-null object
gene                      10560 non-null object
Sanger sequencing list    10561 non-null object
exon                      10502 non-null object
transcript ID             10460 non-null object
inheritance               8259 non-null object
test type                 10561 non-null object
Phenotype                 10380 non-null object
male coverage             10561 non-null int64
male ratio covered        10561 non-null int64
female coverage           10561 non-null int64
female ratio covered      10561 non-null int64

This is df2:

 Chromosome  Startposition  Endposition    Bases  Meancoverage  \
0       chr1       11073785     11074022  27831.0    117.927966
1       chr1       11076901     11077064  11803.0     72.411043

   Mediancoverage  Ratiocovered>10X  Ratiocovered>20X Genename Componentnr  \
0            97.0               1.0               1.0   TARDBP           1
1            76.0               1.0               1.0   TARDBP           2

  PositionGenes          PositionGenome                       Position
0      TARDBP.1  chr1.11073785-11074022  comp.1_chr1.11073785-11074022
1      TARDBP.2  chr1.11076901-11077064  comp.2_chr1.11076901-11077064

I want to select all rows from df1 for which some row of df2 has:

the same value of 'Chromosome',
df1['Start position'] >= df2.Startposition,
df1['End position'] <= df2.Endposition.

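If these three conditions are all met by the same row of df2, I want to select the corresponding row in df1.

I have already fused the three columns 'Chromosome', 'Startposition' and 'Endposition' into 'PositionGenome' in order to build a lambda function, but couldn't come up with anything.

So I hope you can help me...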

Answer 1

A short update: in the end I solved this with unix bedtools -wb. If someone can come up with a Python-based solution, I would still be happy.
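
For what it's worth, a pandas-only approach might look like the sketch below. It is not the bedtools method, just one possible way to express the three conditions: merge the two frames on 'Chromosome', then filter on the start and end positions. The small frames here are made-up stand-ins that reuse only the column names from the question, and the use of 'PositionGenes' as the interval label for counting is an assumption.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (values are illustrative, not the real data).
df1 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr2"],
    "Start position": [11073800, 109419841, 11073800],
    "End position": [11073800, 109419841, 11073800],
})
df2 = pd.DataFrame({
    "Chromosome": ["chr1", "chr1"],
    "Startposition": [11073785, 11076901],
    "Endposition": [11074022, 11077064],
    "PositionGenes": ["TARDBP.1", "TARDBP.2"],
})

# Pair every df1 row with every df2 interval on the same chromosome.
# reset_index() keeps the original df1 row label in a column named "index".
pairs = df1.reset_index().merge(df2, on="Chromosome", how="inner")

# Keep only the pairs where the df1 row lies inside the df2 interval.
inside = pairs[
    (pairs["Start position"] >= pairs["Startposition"])
    & (pairs["End position"] <= pairs["Endposition"])
]

# df1 rows that fall into at least one df2 interval.
selected = df1.loc[inside["index"].unique()]

# Number of df1 rows per df2 interval (intervals without hits are absent).
counts = inside.groupby("PositionGenes").size()

print(selected)
print(counts)
```

Note that the merge builds every same-chromosome combination before filtering, so memory use grows with the product of the per-chromosome row counts; for very large inputs an interval-aware tool such as bedtools may remain the more practical choice.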
