Query large table with 50 million rows

Date: 2016-07-11 22:17:18

Tags: mysql sql database mysql-workbench

I'm trying to query a large table (senddb.order_histories) that has close to 50 million rows. This is the MySQL query I am using:

FIRST APPROACH - inner join:

select a.id,
    a.order_number,
    a.sku_id,
    a.fulfillment_status,
    a.modified_by,
    a.created_at,
    a.updated_at
from senddb.order_line_items a
inner join (
    select order_line_item_id,
        order_number,
        order_status,
        order_status_description,
        action,
        modified_by,
        created_at,
        max(updated_at) as updated_at
    from senddb.order_histories
    where order_status in ('x','y','z')
    and fulfillment_location = 'abcd'
    group by order_line_item_id
) as b
on a.id = b.order_line_item_id
and a.fulfillment_status = '2';

EXPLAIN output:

(The EXPLAIN output was attached as an image in the original post.)

SECOND APPROACH - nested select:

select a.id,
    a.order_number,
    a.sku_id,
    a.fulfillment_status,
    a.modified_by,
    a.created_at,
    a.updated_at
from senddb.order_line_items a
where a.fulfillment_status = '2'
and a.id in (
    select b.order_line_item_id from (
        select order_line_item_id,
            order_number,
            order_status,
            order_status_description,
            action,
            modified_by,
            created_at,
            max(updated_at) as updated_at
        from senddb.order_histories
        where order_status in ('x','y','z')
        and fulfillment_location = 'abcd'
        group by order_line_item_id
    ) as b
);

I believe a nested select is a bad approach on large data sets, but I added it here anyway because it worked on my sample set. Either way, both queries eventually time out after 600 seconds with the message: Error Code: 2013. Lost connection to MySQL server during query.

I would like to know if there are any ways to alter the query to make it run faster. I have already tried reducing the columns in the inner select / inner join, but that should not really be an issue IMO. I also looked up a solution that says "create a clustered index", but I wasn't really able to follow it. Any help is appreciated.

TABLE order_histories :

CREATE TABLE `order_histories` (
`id` int(4) unsigned NOT NULL AUTO_INCREMENT,
`order_number` varchar(24) DEFAULT NULL,
`order_status_description` varchar(255) DEFAULT NULL,
`datetime_stamp` datetime DEFAULT NULL,
`action` varchar(32) DEFAULT NULL,
`fulfillment_location` int(8) DEFAULT NULL,
`order_status` int(8) DEFAULT NULL,
`user_id` int(8) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`modified_by` varchar(32) DEFAULT NULL,
`order_line_item_id` int(11) DEFAULT NULL,
`pooled` tinyint(1) DEFAULT '0',
 PRIMARY KEY (`id`),
 KEY `order_histories_ecash_idx` (`order_number`),
 KEY `order_line_item_id` (`order_line_item_id`)
) ENGINE=InnoDB AUTO_INCREMENT=454738178 DEFAULT CHARSET=latin1

TABLE order_line_items :

CREATE TABLE `order_line_items` (
`id` int(4) unsigned NOT NULL AUTO_INCREMENT,
`order_number` varchar(24) DEFAULT NULL,
`sku_id` int(8) DEFAULT NULL,
`original_price` float DEFAULT NULL,
`dept_description` varchar(100) DEFAULT NULL,
`description` varchar(100) DEFAULT NULL,
`quantity_ordered` int(8) DEFAULT NULL,
`gift_indicator` char(1) DEFAULT NULL,
`gift_wrap_flag` char(1) DEFAULT NULL,
`shipping_record_flag` char(1) DEFAULT NULL,
`gift_comments` varchar(100) DEFAULT NULL,
`item_status` char(1) DEFAULT NULL,
`tax_amount` float DEFAULT NULL,
`tax_rate` float DEFAULT NULL,
`upc` varchar(20) DEFAULT NULL,
`final_price` float DEFAULT NULL,
`line_number` int(8) DEFAULT NULL,
`master_line_number` int(8) DEFAULT NULL,
`gift_wrap_flag_type` char(1) DEFAULT NULL,
`color_code` varchar(4) DEFAULT NULL,
`size_id` varchar(6) DEFAULT NULL,
`width_id` varchar(6) DEFAULT NULL,
`brand` varchar(15) DEFAULT NULL,
`vpn` varchar(30) DEFAULT NULL,
`dept_number` int(8) DEFAULT NULL,
`class_number` int(8) DEFAULT NULL,
`non_merch_item` char(1) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`modified_by` varchar(32) DEFAULT NULL,
`chain_id` int(11) DEFAULT NULL,
`fulfillment_location` int(11) DEFAULT NULL,
`fulfillment_date` datetime DEFAULT NULL,
`fulfillment_status` int(11) DEFAULT NULL,
`fulfillment_sales_associate` int(11) DEFAULT NULL,
`gift_wrap_line_number` int(11) DEFAULT NULL,
`shipping_type` int(11) DEFAULT NULL,
`order_track_info_id` int(11) DEFAULT NULL,
`store_tlog_updated` varchar(1) DEFAULT NULL,
`shipping_tlx_code` int(11) DEFAULT NULL,
`store_closed` tinyint(1) DEFAULT NULL,
`flags` int(11) DEFAULT NULL,
`deal_based_index` int(11) DEFAULT NULL,
`tlog_calc_ret_price` float DEFAULT NULL,
`tlog_amount` float DEFAULT NULL,
`tlog_retail_price` float DEFAULT NULL,
`tlog_ext_amount` float DEFAULT NULL,
`tlog_flag_1` int(11) DEFAULT NULL,
`tlog_flag_2` int(11) DEFAULT NULL,
`tlog_flag_3` int(11) DEFAULT NULL,
`time_remaining` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `order_line_items_ecash_idx` (`order_number`),
KEY `order_line_item_fulfillment_location_idx` (`fulfillment_location`),
KEY `order_line_item_fulfillment_status_idx` (`fulfillment_status`),
KEY `upc_idx` (`upc`),
KEY `sku_id_idx` (`sku_id`),
KEY `order_line_items_idx001` (`order_number`,`id`,`fulfillment_status`),
KEY `order_track_info_id` (`order_track_info_id`),
KEY `shipping_type_idx` (`shipping_type`,`non_merch_item`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=11367052 DEFAULT CHARSET=latin1

4 Answers:

Answer 0 (score: 0):

You need to read up on the optimization section of the MySQL docs. It contains a lot of information on how to optimize your queries and data sets. The main idea here is to add indexes to the fields that are used as criteria in the WHERE clauses of your SQL statements.
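
For instance, a composite index covering the filters in the inner query's WHERE clause might look like the sketch below. The index name and column order are illustrative assumptions, not something prescribed by this answer; check the plan with EXPLAIN on your own data before committing to it.

-- Hypothetical covering index for the inner query: equality filter first,
-- then the IN-list column, then the grouping/join column and the aggregated
-- column, so the GROUP BY / MAX can be served largely from the index.
ALTER TABLE senddb.order_histories
    ADD INDEX order_histories_filter_idx
        (fulfillment_location, order_status, order_line_item_id, updated_at);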

Answer 1 (score: 0):

This query can be simplified:

select a.id, 
    a.order_number, 
    a.sku_id,
    a.fulfillment_status, 
    a.modified_by, 
    a.created_at, 
    a.updated_at 
from senddb.order_line_items a
inner join senddb.order_histories b on a.id = b.order_line_item_id
where b.order_status in ('x','y','z')
and b.fulfillment_location = 'abcd'    
and a.fulfillment_status = '2';

Since you're only selecting values from table a, you don't need to select specific values from table b; you can instead just apply your conditions in the WHERE clause. Beyond that, you need to ensure that b.order_line_item_id has an index on it. You can find more about creating indexes here. I'm not an expert in MySQL, but something similar to this should work if senddb.order_histories.order_line_item_id isn't already the primary key.

CREATE INDEX IX_order_histories_order_line_item_id ON order_histories (order_line_item_id);
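
To check whether the index is actually picked up, you can run EXPLAIN on the simplified query. This is just a verification sketch; the actual plan will depend on your data distribution:

EXPLAIN
SELECT a.id
FROM senddb.order_line_items a
INNER JOIN senddb.order_histories b ON a.id = b.order_line_item_id
WHERE b.order_status IN ('x','y','z')
  AND b.fulfillment_location = 'abcd'
  AND a.fulfillment_status = '2';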

Answer 2 (score: 0):

Basically, both of your alternatives are using a "sub-SELECT", not an INNER JOIN.

The syntax of a true JOIN is one of the following:

SELECT ...
FROM X INNER JOIN Y USING (field_list)

... or ...

SELECT ...
FROM X INNER JOIN Y ON (x.field1 = y.field2) ...

But in both cases the objects being joined are tables or views.

I'm going to presume ... admittedly, without checking ... that Nick Larsen's answer #1 adequately re-expresses your original query using JOINs.

(Notice how, in his answer, the shorthand identifiers A and B are introduced as referring to each of the two table-names mentioned in his query.)

Answer 3 (score: 0):

Firstly, you need to decide whether a 50-million-row result set is really what you are asking for. Big data tables are not there so that you can select all their rows; they are there so that you can ask them questions using SQL queries. SQL is a query language, not a data-loading language.

What's your purpose? If you want to copy the data, you can do that by loading it in batches, for example 1000 rows per query in a loop. If you are loading the data for processing, you can do that the same way.
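
A minimal sketch of such a chunked load, assuming you page on the auto-increment id column (the batch size of 1000 is illustrative; the client loop feeds the last id seen back into the next query):

-- First batch: start from id 0; for each subsequent batch, substitute the
-- largest id returned by the previous batch for the 0 below.
SELECT id, order_number, order_line_item_id, updated_at
FROM senddb.order_histories
WHERE id > 0
ORDER BY id
LIMIT 1000;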

If you want to derive statistical information, you can use an outer join and aggregate functions to return a small number of rows. But you shouldn't do even that routinely; what you "should" do is decide what you want from the table and, preferably, run aggregate functions to store the useful information in a separate summary table (for example with CREATE TABLE ... AS SELECT or INSERT ... SELECT in MySQL). You should never need to join the full 50 million records in the first place.
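
As a sketch of that idea, the aggregation from the original question could be materialized once into a summary table. The table name order_history_latest is a hypothetical choice, not something from this answer:

-- Materialize the latest history timestamp per line item so later queries
-- can join against this small table instead of the 50M-row history table.
CREATE TABLE senddb.order_history_latest AS
SELECT order_line_item_id,
       MAX(updated_at) AS updated_at
FROM senddb.order_histories
WHERE order_status IN ('x','y','z')
  AND fulfillment_location = 'abcd'
GROUP BY order_line_item_id;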

Telling you how to use indexes to do the wrong thing faster wouldn't be the right thing here.