Question

I have a dataset like the following:

Visitor ID    Page Id      TimeStamp
1             a            x1
2             b            x2
3             c            x3 
2             d            x4

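Here are the rules for the data:

1) Treat this as website data: visitors come to the site and interact with it. VID is the visitor's unique identifier, Page Id is the ID of the page visited, and TimeStamp is the time of the visit.

2) If a page is refreshed, the timestamp changes, so a new row is created in the dataset with the same VID and Page Id but a different TimeStamp.

3) If the visitor clicks through to another page, both the timestamp and the page ID change. Say he is first on page 'a' and then goes to page 'b': he will then have another record in the dataset with the same VID, but with Page Id = b and a new TimeStamp.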

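Question:

I want to find all the unique VIDs that visited page 'b' after visiting page 'a'. Note that I want this for a particular session or date.

Can someone help with both a SQL and a Pythonic way of doing this?

Thanks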

Answer 1

select distinct visitor_id from table_name where page_id = 'a' and visitor_id in (select distinct visitor_id from table_name where page_id = 'b' and timestamp = 'any day');

Answer 2

The SQL way is:

select distinct(t1.vid) from my_table as t1 
inner join my_table as t2 on t1.vid = t2.vid
where t1.page_id = 'a' and t2.page_id='b' and t1.time < t2.time;

Answer 3

Just to get you (or someone else) started with the Pythonic part:

If you can, get your data into a NumPy record array (e.g. using numpy.genfromtxt):

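import numpy as np

# 'U1' (unicode) rather than 'S1' (bytes) so that comparisons against
# str literals such as 'a' and 'b' also work on Python 3
records = np.array([(1,'a',100),
                    (2,'a',100),
                    (1,'b',200),
                    (1,'a',300)],
                   dtype=dict(names=['vid','pid','time'],
                              formats=['i4','U1','i4']))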

where the field 'time' is some comparable int / float / str or Python datetime.datetime instance. In fact 'x1', 'x2', etc. would work too. Then you can do things like

records_of_interest = records[records['time'] > 200]

Then I would loop over the visitor IDs and check whether their records match your criteria:

target_vids = []
vids = np.unique(records['vid'])
for vid in vids:
    # select this visitor's records
    visits = records[records['vid'] == vid]
    times_a = visits['time'][visits['pid'] == 'a']
    times_b = visits['time'][visits['pid'] == 'b']
    # skip visitors who did not visit both pages
    if times_a.size == 0 or times_b.size == 0:
        continue
    # compare timestamps (not row positions), so the rows need not be sorted:
    # 'b' came after 'a' if the first 'a' visit precedes the last 'b' visit
    if times_a.min() < times_b.max():
        target_vids.append(vid)

target_vids now contains the visitor IDs you want.
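
For comparison, the same check can be done in a single pass without NumPy. What follows is only a minimal sketch, assuming the rows are available as (vid, pid, time) tuples with comparable timestamps; the function name visitors_a_then_b and the optional start/end bounds are hypothetical, added here to cover the "particular session or date" part of the question.

```python
def visitors_a_then_b(rows, first='a', then='b', start=None, end=None):
    """Return sorted visitor IDs that saw `then` strictly after `first`.

    rows: iterable of (vid, pid, time) tuples; times must be comparable.
    start/end: optional inclusive bounds (hypothetical parameters) that
    restrict the check to one session or day.
    """
    earliest_first = {}   # vid -> earliest time on page `first`
    latest_then = {}      # vid -> latest time on page `then`
    for vid, pid, time in rows:
        # ignore rows outside the requested time window
        if start is not None and time < start:
            continue
        if end is not None and time > end:
            continue
        if pid == first:
            earliest_first[vid] = min(time, earliest_first.get(vid, time))
        elif pid == then:
            latest_then[vid] = max(time, latest_then.get(vid, time))
    # keep visitors whose first `first` visit precedes their last `then` visit
    return sorted(vid for vid, t in earliest_first.items()
                  if vid in latest_then and t < latest_then[vid])


rows = [(1, 'a', 100), (2, 'a', 100), (1, 'b', 200), (1, 'a', 300)]
print(visitors_a_then_b(rows))
```

Because it compares the earliest 'a' timestamp with the latest 'b' timestamp, the input does not have to be sorted by time.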

Then, there are also Python interfaces to SQL, which might reduce your problem to a single language...
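
As one illustration of that last point, Python's built-in sqlite3 module can run the self-join from Answer 2 directly. This is a sketch under assumptions: the table name visits, the column names, the integer timestamps and the lo/hi bounds are all made up for the example, with the time window standing in for "a particular session or date".

```python
import sqlite3

# In-memory database with a hypothetical `visits` table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE visits (vid INTEGER, page_id TEXT, time INTEGER)')
conn.executemany('INSERT INTO visits VALUES (?, ?, ?)',
                 [(1, 'a', 100), (2, 'a', 100), (1, 'b', 200), (1, 'a', 300)])

# Self-join: pair every 'a' visit with a later 'b' visit by the same visitor,
# keeping only visits that fall inside the [lo, hi] window
query = """
SELECT DISTINCT t1.vid
FROM visits AS t1
JOIN visits AS t2 ON t1.vid = t2.vid
WHERE t1.page_id = ? AND t2.page_id = ?
  AND t1.time < t2.time
  AND t1.time BETWEEN ? AND ?
  AND t2.time BETWEEN ? AND ?
ORDER BY t1.vid
"""
lo, hi = 0, 1000  # assumed bounds for the session or day of interest
vids = [row[0] for row in conn.execute(query, ('a', 'b', lo, hi, lo, hi))]
print(vids)
conn.close()
```

Passing the page IDs and bounds as query parameters, rather than formatting them into the SQL string, keeps the query safe to reuse with other values.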

Finding unique values in a dataset
