有没有办法一次运行多个事务?

时间:2020-12-19 04:12:52

标签: python sql jupyter-notebook faker datajoint

首先,我想提前道歉。我几个月前才开始学习这个,所以我需要完全分解的东西。我有一个使用 python 和 datajoint 的项目(使编写的 sql 代码更短),我需要在其中创建一个至少有 7 个机场、不同的飞机等等的机场。然后我需要用乘客预订来填充表格。这是我目前所拥有的。

        @schema 
        class Seat(dj.Lookup):
            definition = """
            aircraft_seat : varchar(25)
            """
            contents = [["F_Airbus_1A"],["F_Airbus_1B"],["F_Airbus_2A"],["F_Airbus_2B"],["F_Airbus_3A"], 
            ["F_Airbus_3B"],["F_Airbus_4A"],["F_Airbus_4B"],["F_Airbus_5A"],["F_Airbus_5B"], 
            ["B_Airbus_6A"],["B_Airbus_6B"],["B_Airbus_6C"],["B_Airbus_6D"],["B_Airbus_7A"], 
            ["B_Airbus_7B"],["B_Airbus_7C"],["B_Airbus_7D"],["B_Airbus_8A"],["B_Airbus_8B"], 
            ["B_Airbus_8C"],["B_Airbus_8D"],["B_Airbus_9A"],["B_Airbus_9B"],

这种情况一直持续下去,每架飞机上总共只有 144 个座位。

        @schema
        class Flight(dj.Manual):
            definition = """
            flight_no  : int
            ---
            economy_price : decimal(6,2)
            departure : datetime
            arrival : datetime 
            ---
            origin_code : int
            dest_code : int
            """ 
        @schema
        class Passenger(dj.Manual):
            definition = """
            passenger_id : int
            ---
            full_name : varchar(40)
            ssn : varchar(20)
            """
        @schema
        class Reservation(dj.Manual):
            definition = """
            -> Flight
            -> Seat
            ---
            -> Passenger
    
            """

然后我填充航班和乘客:

        Flight.insert((dict(flight_no = i,
                    economy_price = round(random.randint(100, 1000), 2), 
                    departure = faker.date_time_this_month(),
                    arrival = faker.date_time_this_month(),
                    origin_code = random.randint(1,7),
                    dest_code = random.randint(1,7)))
             for i in range(315))
        Passenger.insert(((dict(passenger_id=i, full_name=faker.name(), 
                   ssn = faker.ssn()))
             for i in range(10000)), skip_duplicates = True)

最后我创建了交易:

        def reserve(passenger_id, origin_code, dest_code, departure):
            with dj.conn().transaction:
             available_seats = ((Seat * Flight - Reservation) & Passenger & 
             {'passenger_id':passenger_id}).fetch(as_dict=True)
        try:
            choice = random.choice(available_seats)
        except IndexError:
            raise IndexError(f'Sorry, no seats available for {departure}')
        name = (Passenger & {'passenger_id': passenger_id}).fetch1('full_name')
        print('Success. Reserving seat {aircraft_seat} at ticket_price {economy_price} for 
        {name}'.format(name=name, **choice))
        Reservation.insert1(dict(choice, passenger_id=passenger_id), ignore_extra_fields=True)
  
     
       reserve(random.randint(1,1000), random.randint(1,7), 
       random.randint(1,7),random.choice('departure'))
       
       Output[]: Success. Reserving seat E_Yak242_24A at ticket_price 410.00 for Cynthia Erickson

       Reservation()
       Output[]: flight_no      aircraft_seat      passenger_id

             66           B_Yak242_7A           441

因此,我每天必须有 10.5 趟航班,而且飞机的客座率至少为 75%,这让我需要进行 30000 多次预订。有没有办法一次做 50 次?我一直在寻找答案,但一直找不到解决方案。谢谢。

1 个答案:

答案 0 :(得分:0)

这里是 DataJoint 的维护者之一。首先,我要感谢您试用 DataJoint;很好奇你是怎么知道这个项目的。

预警,这篇文章会很长,但我觉得这是澄清一些事情的好机会。关于有问题的问题,不确定我是否完全理解您的问题的性质,但让我遵循几点。在确定如何最好地处理您的案例之前,我建议您完整阅读此答案。

TL;DR:计算表是您的朋友。

多线程

由于它已经出现在评论中,因此值得解决的是,截至 2021-01-06,DataJoint 完全是线程安全的(至少从共享连接的观点)。不幸的是,这主要是由于与 PyMySQL 的 issue 存在关系,而 PyMySQL 是 DataJoint 的主要依赖项。也就是说,如果您在每个线程或进程上启动一个新连接,您应该不会遇到任何问题。但是,这是一种昂贵的解决方法,并且不能与事务结合使用,因为它们需要在单个连接中执行操作。说起来……

计算表和作业预留

计算表是您上述解决方案尝试中的一个明显遗漏。计算表提供了一种机制,将其实体与上游父表中的实体相关联,并在插入之前进行附加处理(在计算表类的 make 方法中定义),可以通过调用 {{1} } 方法为每个新条目调用 populate 方法。对您的 make 方法的调用受事务约束,应该可以实现您的要求。有关其使用的更多详细信息,请参阅文档中的 here

此外,为了获得额外的性能提升,还有一个名为 Job Reservation 的功能,它提供了一种方法,可以将多个工作人员集中在一起,以有组织的分布式方式处理大型数据集(使用 make)。我认为这里不需要它,但值得一提,最终取决于您如何查看下面的结果。您可以在我们的文档中here 找到有关此功能的更多信息。

架构设计

根据我对您的初始设计的理解,我有一些建议,我们可以如何改进数据流以提高清晰度和性能,并提供有关如何使用计算表功能的具体示例。如下图所示在我的本地设置中运行,我能够在 29 分钟 54 秒 内处理您对 30k 预订的要求,使用 2 种不同的飞机模型类型,<强>7 个机场,10k 个可能的乘客,550 个可用的航班。最低 75% 的座位容量没有验证只是因为我还没有看到您尝试这样做,但是如果您看到我如何分配座位,您会注意到它几乎就在那里。 :)

免责声明:我应该注意到,下面的设计仍然是对协调正确旅行预订的实际现实挑战的极大简化。采取了相当多的假设,主要是为了教育的利益,而不是提交完整的、即插即用的解决方案。因此,我已明确选择避免将 populate 用于以下解决方案,以便更容易理解。实际上,适当的解决方案可能包括更高级的主题以进一步提高性能,例如longbloblongblob

话虽如此,让我们首先考虑以下几点:

_update

语法和约定 Nits

最后,您提供的代码中只有一些小反馈。关于表定义,您应该在定义中只使用一次 import datajoint as dj # conda install -c conda-forge datajoint or pip install datajoint import random from faker import Faker # pip install Faker faker = Faker() Faker.seed(0) # Pin down randomizer between runs schema = dj.Schema('commercial_airtravel') # instantiate a workable database @schema class Plane(dj.Lookup): definition = """ # Defines manufacturable plane model types plane_type : varchar(25) # Name of plane model --- plane_rows : int # Number of rows in plane model i.e. range(1, plane_rows + 1) plane_columns : int # Number of columns in plane model; to extract letter we will need these indices """ contents = [('B_Airbus', 37, 4), ('F_Airbus', 40, 5)] # Since new entries to this table should happen infrequently, this is a good candidate for a Lookup table @schema class Airport(dj.Lookup): definition = """ # Defines airport locations that can serve as origin or destination airport_code : int # Airport's unique identifier --- airport_city : varchar(25) # Airport's city """ contents = [(i, faker.city()) for i in range(1, 8)] # Also a good candidate for Lookup table @schema class Passenger(dj.Manual): definition = """ # Defines users who have registered accounts with airline i.e. passenger passenger_id : serial # Passenger's unique identifier; serial simply means an auto-incremented, unsigned bigint --- full_name : varchar(40) # Passenger's full name ssn : varchar(20) # Passenger's Social Security Number """ Passenger.insert((dict(full_name=faker.name(), ssn = faker.ssn()) for _ in range(10000))) # Insert a random set of passengers @schema class Flight(dj.Manual): definition = """ # Defines specific planes assigned to a route flight_id : serial # Flight's unique identifier --- -> Plane # Flight's plane model specs; this will simply create a relation to Plane table but not have the constraint of uniqueness flight_economy_price : decimal(6,2) # Flight's fare price flight_departure : datetime # Flight's departure time flight_arrival : datetime # Flight's arrival time -> Airport.proj(flight_origin_code='airport_code') # Flight's origin; by using proj in this way we may rename the relation in this table -> Airport.proj(flight_dest_code='airport_code') # Flight's destination """ plane_types = Plane().fetch('plane_type') # Fetch available plane model types Flight.insert((dict(plane_type = random.choice(plane_types), flight_economy_price = round(random.randint(100, 1000), 2), flight_departure = faker.date_time_this_month(), flight_arrival = faker.date_time_this_month(), flight_origin_code = random.randint(1, 7), flight_dest_code = random.randint(1, 7)) for _ in range(550))) # Insert a random set of flights; for simplicity we are not verifying that flight_departure < flight_arrival @schema class BookingRequest(dj.Manual): definition = """ # Defines one-way booking requests initiated by passengers booking_id : serial # Booking Request's unique identifier --- -> Passenger # Passenger who made request -> Airport.proj(flight_origin_code='airport_code') # Booking Request's desired origin -> Airport.proj(flight_dest_code='airport_code') # Booking Request's desired destination """ BookingRequest.insert((dict(passenger_id = random.randint(1, 10000), flight_origin_code = random.randint(1, 7), flight_dest_code = random.randint(1, 7)) for i in range(30000))) # Insert a random set of booking requests @schema class Reservation(dj.Computed): definition = """ # Defines booked reservations -> BookingRequest # Association to booking request --- flight_id : int # Flight's unique identifier reservation_seat : varchar(25) # Reservation's assigned seat """ def make(self, key): # Determine booking request's details full_name, flight_origin_code, flight_dest_code = (BookingRequest * Passenger & key).fetch1('full_name', 'flight_origin_code', 'flight_dest_code') # Determine possible flights to satisfy booking possible_flights = (Flight * Plane * Airport.proj(flight_dest_city='airport_city', flight_dest_code='airport_code') & dict(flight_origin_code=flight_origin_code, flight_dest_code=flight_dest_code)).fetch('flight_id', 'plane_rows', 'plane_columns', 'flight_economy_price', 'flight_dest_city', as_dict=True) # Iterate until we find a vacant flight and extract details for flight_meta in possible_flights: # Determine seat capacity all_seats = set((f'{r}{l}' for rows, letters in zip(*[[[n if i==0 else chr(n + 64) for n in range(1, el + 1)]] for i, el in enumerate((flight_meta['plane_rows'], flight_meta['plane_columns']))]) for r in rows for l in letters)) # Determine unavailable seats taken_seats = set((Reservation & dict(flight_id=flight_meta['flight_id'])).fetch('reservation_seat')) try: # Randomly choose one of the available seats reserved_seat = random.choice(list(all_seats - taken_seats)) # You may uncomment the below line if you wish to print the success message per processed record # print(f'Success. Reserving seat {reserved_seat} at ticket_price {flight_meta["flight_economy_price"]} for {full_name}.') # Insert new reservation self.insert1(dict(key, flight_id=flight_meta['flight_id'], reservation_seat=reserved_seat)) return except IndexError: pass raise IndexError(f'Sorry, no seats available departing to {flight_meta["flight_dest_city"]}') Reservation.populate(display_progress=True) # This is how we process new booking requests to assign a reservation; you may invoke this as often as necessary 以明确区分主键属性和辅助属性(请参阅您的 --- 表)。出乎意料的是,这并没有在您的情况下引发错误,但应该这样做。我会提交一个问题,因为这似乎是一个错误。

虽然 Flight 暴露在 transaction 上,但很少需要直接调用它。 DataJoint 提供了在内部处理这个的好处,以减少来自用户的管理开销。但是,如果特殊情况需要,该选项仍然可用。对于您的情况,我会避免直接调用它,而是建议使用计算(或导入)表。