Cassandra - How to denormalize two joined tables?

Time: 2015-07-28 15:58:51

Tags: cassandra cql3

I know Cassandra doesn't support joins, so to use Cassandra we need to denormalize tables. I would like to know how. Suppose I have two tables:

<dl>
<dt>Publisher</dt>
<dd>Id : <i>Primary Key</i></dd>
  <dd>Name</dd>
  <dd>TimeStamp</dd>
  <dd>Address</dd>
  <dd>PhoneNo</dd>
  
  <dt>Book</dt>
  <dd>Id : <i>Primary Key</i></dd>
  <dd>Name</dd>
  <dd>ISBN</dd>
  <dd>Year</dd>
  <dd>PublisherId : <i>Foreign Key - References Publisher table's Id</i></dd>
  <dd>Cost</dd>
</dl>

Please let me know how I can denormalize these tables so that the following operations are efficient:
1. Search for all books published by a particular publisher.
2. Search for all publishers who published books in a given year.
3. Search for all publishers who have not published books in a given year.
4. Search for all publishers who have not published any books so far.

I have read a few articles about Cassandra, but I could not work out how to denormalize for the operations above. Please help me.

2 Answers:

Answer 0 (score: 1)

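Designing a whole schema is a rather big task for one question, but in general terms denormalization means you repeat the same data in multiple tables, so that each type of query can get all the data it needs by reading a single row.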

So you would create one table for each type of query, something along these lines:

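  1. Create a table partitioned by publisher id, with book id as a clustering column.
  2. Create a table partitioned by year, with publisher id as a clustering column.
  3. Create a table containing the list of all publishers. In the application you can read this list and programmatically subtract the rows that table 2 returns for the desired year.
  4. I'm not sure what "published till now" means. When you insert a new book, you could check whether the publisher is present in table 3. If it is not, it is a new publisher.

So within each row you repeat all the data you want the query to return (that is, the union of all the columns in your example tables). When you insert a new book, you insert it into all of the tables.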
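As a rough sketch of tables 2 and 3 from the list above (table 1 has the same shape as the booksByPublisher table in the next answer), the schema could look something like the following. The table names publishersByYear and publishers, and the exact column choices, are assumptions made for illustration; they do not come from the original answer.

```cql
-- Table 2 (hypothetical name): one partition per year,
-- one row per publisher that released at least one book in that year.
CREATE TABLE publishersByYear (
  bookYear int,
  publisherid uuid,
  publisherName text,
  PRIMARY KEY (bookYear, publisherid)
);

-- Table 3 (hypothetical name): one row per known publisher.
CREATE TABLE publishers (
  publisherid uuid PRIMARY KEY,
  publisherName text,
  publisherPhoneNo text
);

-- Operation 2: publishers that published in a given year.
SELECT publisherid, publisherName FROM publishersByYear WHERE bookYear = 2015;

-- Operation 3: read the full publisher list; the application then removes
-- every publisherid that the previous query returned.
SELECT publisherid, publisherName FROM publishers;
```

Because an INSERT in Cassandra is an upsert, writing the same (bookYear, publisherid) pair once for every book a publisher releases in that year is harmless: it simply overwrites the same row. The unrestricted SELECT on publishers scans every partition, so this approach presumably only stays practical while the publisher list is modest in size.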

Answer 1 (score: 0)

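That sounds like it could get pretty big, so I'll take the first operation and walk through how I would approach it. You don't have to do it this way; it is just one way to do it. Note that you may have to create a query table for each of your 4 scenarios above. This table solves only the first scenario.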

First, I'll create a type for the publisher address.

CREATE TYPE address (
  street text,
  city text,
  state text,
  postalCode text
);

Next, I'll create a table called booksByPublisher. I'll use my address type for publisherAddress, and I'll build my PRIMARY KEY with publisherid as the partition key, clustering on bookYear and isbn.

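Since you want to be able to query all books by a particular publisher, it makes sense to designate publisherid as the partition key. It may be helpful to have your results sorted by year, or at the very least to be able to look at a specific year for a specific publisher, so I have bookYear as the first clustering key. And of course, to create a unique CQL row for each book within a publisher, I'll add isbn for uniqueness.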

CREATE TABLE booksByPublisher (
  publisherid UUID,
  publisherName text,
  publisherAddress frozen<address>,
  publisherPhoneNo text,
  bookName text,
  isbn text,
  bookYear bigint,
  bookCost bigint,
  bookAuthor text,
  PRIMARY KEY (publisherid, bookYear, isbn)
);

INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Ready Player One','978-0307887443',2005,812,'Ernest Cline');

INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Armada','978-0804137256',2015,1560,'Ernest Cline');

INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (uuid(),'The Berkley Publishing Group',{ street: '375 Hudson Street', city: 'New York', state:'NY', postalcode: '10014'},'212-333-2354','Rainbow Six','978-0425170342',1999,867,'Tom Clancy');

Now I can query all books (out of my 3 rows) published by Crown Publishing (publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b) like this:

aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher 
    WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b;

 publisherid                          | bookyear | isbn           | bookauthor   | bookcost | bookname         | publisheraddress                                                              | publishername    | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+------------------+-------------------------------------------------------------------------------+------------------+------------------
 b7b99ee9-f495-444b-b849-6cea82683d0b |     2005 | 978-0307887443 | Ernest Cline |      812 | Ready Player One | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing |     212-782-9000
 b7b99ee9-f495-444b-b849-6cea82683d0b |     2015 | 978-0804137256 | Ernest Cline |     1560 |           Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing |     212-782-9000

(2 rows)

If I want, I can also query for all books published by Crown Publishing in 2015:

aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
    WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b AND bookyear=2015;

 publisherid                          | bookyear | isbn           | bookauthor   | bookcost | bookname | publisheraddress                                                              | publishername    | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+----------+-------------------------------------------------------------------------------+------------------+------------------
 b7b99ee9-f495-444b-b849-6cea82683d0b |     2015 | 978-0804137256 | Ernest Cline |     1560 |   Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing |     212-782-9000

(1 rows)
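Because bookYear is the first clustering column, rows inside a publisher's partition are stored sorted by year, so a range restriction or a reversed ordering on bookYear should also work once the partition key is fixed. A minimal sketch, reusing the table and publisherid from above (the year bounds are arbitrary example values):

```cql
-- Books by one publisher within a range of years.
SELECT bookName, bookYear, isbn FROM booksByPublisher
  WHERE publisherid = b7b99ee9-f495-444b-b849-6cea82683d0b
    AND bookYear >= 2010 AND bookYear <= 2015;

-- Books by one publisher, newest year first
-- (reverses the stored clustering order).
SELECT bookName, bookYear, isbn FROM booksByPublisher
  WHERE publisherid = b7b99ee9-f495-444b-b849-6cea82683d0b
  ORDER BY bookYear DESC;
```

Both queries stay inside a single partition, which is why they do not need ALLOW FILTERING.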

But I cannot query by bookyear alone:

aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher WHERE bookyear=2015;
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might 
involve data filtering and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW FILTERING"

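Do not follow the error message's suggestion and add ALLOW FILTERING. That might work fine for a table with 3 rows, or even 300, but it won't work for a table with 3 million rows (you'll get a timeout). Cassandra works best when you query by a complete partition key. Since publisherid is our partition key, the queries by publisherid above perform very well. But if you need to query by bookYear, you should create a table that uses bookYear as its partition key.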
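A year-partitioned query table along those lines might be sketched as follows. The table name booksByYear and its column selection are assumptions for illustration; the column types mirror booksByPublisher above.

```cql
-- Hypothetical query table: one partition per publication year.
CREATE TABLE booksByYear (
  bookYear bigint,
  publisherid UUID,
  publisherName text,
  bookName text,
  isbn text,
  bookCost bigint,
  bookAuthor text,
  PRIMARY KEY (bookYear, publisherid, isbn)
);

-- Write each book to both query tables; a logged batch keeps the two
-- denormalized copies from drifting apart if one write fails.
BEGIN BATCH
  INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
  VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Armada','978-0804137256',2015,1560,'Ernest Cline');
  INSERT INTO booksByYear (bookyear, publisherid, publishername, bookname, isbn, bookcost, bookauthor)
  VALUES (2015,b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing','Armada','978-0804137256',1560,'Ernest Cline');
APPLY BATCH;

-- The query that failed above now targets a complete partition key.
SELECT * FROM booksByYear WHERE bookYear = 2015;
```

One caveat with this design: every book from a given year lands in the same partition, so if the catalogue is very large it may be worth adding a second component to the partition key to keep partitions at a manageable size.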