6. Partitioning - Summary - Designing Data-Intensive Applications (Chinese translation)

Summary

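In this chapter we explored different ways of partitioning a large dataset into smaller subsets. Partitioning is necessary when you have so much data that storing and processing it on a single machine is no longer feasible. The goal of partitioning is to spread the data and the query load evenly across multiple machines, avoiding hot spots (nodes with disproportionately high load). This requires choosing a partitioning scheme that is appropriate to your data, and rebalancing the partitions when nodes are added to or removed from the cluster.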

We discussed two main approaches to partitioning:

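Key range partitioning

Keys are sorted, and a partition owns all the keys from some minimum up to some maximum. Sorting has the advantage that range queries are efficient, but there is a risk of hot spots if the application often accesses keys that are close together in the sorted order.

In this approach, partitions are typically rebalanced dynamically by splitting a partition into two subranges when it becomes too big.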
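
The key-range scheme described above can be sketched as a sorted list of partition boundaries that is searched with binary search. This is only a minimal illustration, not how any particular database implements it; the `RangePartitioner` class, its method names, and the example boundaries are all hypothetical.

```python
import bisect


class RangePartitioner:
    """Toy key-range partitioner (hypothetical, for illustration only)."""

    def __init__(self, boundaries):
        # boundaries[i] is the smallest key owned by partition i + 1;
        # partition 0 owns every key below boundaries[0].
        self.boundaries = sorted(boundaries)

    def partition_for(self, key):
        # Binary search: count how many boundaries are <= key.
        return bisect.bisect_right(self.boundaries, key)

    def partitions_for_range(self, low, high):
        # A range query touches only the contiguous partitions covering [low, high].
        return list(range(self.partition_for(low), self.partition_for(high) + 1))

    def split(self, split_key):
        # Splitting the partition that owns split_key adds one boundary;
        # the partitions after it are renumbered.
        if split_key in self.boundaries:
            raise ValueError("split_key is already a boundary")
        bisect.insort(self.boundaries, split_key)


partitioner = RangePartitioner(["g", "n", "t"])
print(partitioner.partition_for("melon"))          # partition that owns "melon"
print(partitioner.partitions_for_range("h", "p"))  # partitions a range scan must visit
partitioner.split("j")                             # the partition holding "g" up to "n" grew too big
print(partitioner.partition_for("melon"))          # "melon" now lives in the new subrange
```

A real system would also move the data of the split partition and track which node hosts each partition; the sketch only shows why sorted boundaries keep range scans confined to a few adjacent partitions.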

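Hash partitioning

A hash function is applied to each key, and a partition owns a range of hashes. This method destroys the ordering of keys, making range queries inefficient, but it may distribute load more evenly.

When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed. Dynamic partitioning can also be used.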
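
As a rough sketch of the fixed-partition approach described above: the key-to-partition mapping never changes, and adding a node only reassigns whole partitions. The partition count of 16, the use of MD5 as a stable hash, the node names, and the helper functions `partition_of`, `initial_assignment`, and `add_node` are assumptions made for this example, not taken from any specific system.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 16  # assumed value; chosen once and never changed


def partition_of(key: str) -> int:
    # Use a stable hash (MD5 here) rather than Python's built-in hash(),
    # which is randomized per process for strings.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def initial_assignment(nodes):
    # Round-robin: every node starts out with several partitions.
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def add_node(assignment, new_node):
    """Return a new assignment in which new_node has taken over whole
    partitions from the most heavily loaded existing nodes."""
    assignment = dict(assignment)
    counts = Counter(assignment.values())
    fair_share = NUM_PARTITIONS // (len(counts) + 1)
    for _ in range(fair_share):
        donor = max(counts, key=counts.get)
        partition = next(p for p in sorted(assignment) if assignment[p] == donor)
        assignment[partition] = new_node
        counts[donor] -= 1
    return assignment


before = initial_assignment(["node0", "node1", "node2", "node3"])
after = add_node(before, "node4")
moved = [p for p in before if before[p] != after[p]]
print(f"{len(moved)} of {NUM_PARTITIONS} partitions moved:", moved)
```

Because `partition_of` depends only on the key and the fixed partition count, no key changes partition when the cluster grows; only the partition-to-node table is updated.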

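Hybrid approaches are also possible, for example with a compound key: one part of the key identifies the partition, and another part determines the sort order.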
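
A compound key of this kind might look as follows: the first part of the key is hashed to pick a partition, and rows are kept sorted by the full key inside that partition, so a range scan over the second part stays on one partition. The `CompoundKeyStore` class, the partition count of 4, and the user/timestamp example data are hypothetical.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4  # assumed value for the example


def partition_of(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


class CompoundKeyStore:
    """Toy store keyed by (partition_key, sort_key); illustration only."""

    def __init__(self):
        # One sorted list of (partition_key, sort_key) tuples per partition.
        self.keys = [[] for _ in range(NUM_PARTITIONS)]
        self.values = {}

    def put(self, partition_key, sort_key, value):
        key = (partition_key, sort_key)
        if key not in self.values:
            # Only the hash of the first key part decides the partition.
            bisect.insort(self.keys[partition_of(partition_key)], key)
        self.values[key] = value

    def range_query(self, partition_key, low, high):
        # A single partition is consulted; within it the keys are sorted,
        # so the matching rows form one contiguous slice.
        keys = self.keys[partition_of(partition_key)]
        start = bisect.bisect_left(keys, (partition_key, low))
        end = bisect.bisect_right(keys, (partition_key, high))
        return [(k[1], self.values[k]) for k in keys[start:end]]


store = CompoundKeyStore()
store.put("alice", 3, "third update")
store.put("alice", 1, "first update")
store.put("bob", 2, "hello")
store.put("alice", 2, "second update")
print(store.range_query("alice", 1, 2))
```

The trade-off is that a range query across different first-part values (for example, all users) still has to visit every partition.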

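We also discussed the interaction between partitioning and secondary indexes. A secondary index also needs to be partitioned, and there are two methods: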

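Document-partitioned indexes (local indexes), where the secondary index entries are stored in the same partition as the primary key and value. This means that only a single partition needs to be updated on write, but a read of the secondary index requires a scatter/gather across all partitions.
Term-partitioned indexes (global indexes), where the secondary index is partitioned separately, by the indexed values. An entry in the secondary index may include records from all partitions of the primary key. When a document is written, several partitions of the secondary index may need to be updated; however, a read can be served from a single partition.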
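
The difference between the two index layouts can be made concrete with a small simulation that reports which partitions each operation touches. The classes `LocalIndexCluster` and `GlobalIndexCluster`, the single indexed field `color`, and the partition count are assumptions for this sketch; it also ignores updates and deletes, which would require removing stale index entries.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed value for the example


def stable_hash(value) -> int:
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class LocalIndexCluster:
    """Document-partitioned: each partition indexes only its own documents."""

    def __init__(self):
        self.docs = [dict() for _ in range(NUM_PARTITIONS)]
        self.index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, doc_id, color):
        p = stable_hash(doc_id) % NUM_PARTITIONS
        self.docs[p][doc_id] = color
        self.index[p][color].add(doc_id)
        return {p}  # partitions touched by the write

    def find(self, color):
        touched = set(range(NUM_PARTITIONS))  # scatter to every partition
        result = set()
        for p in touched:                     # gather the partial results
            result |= self.index[p].get(color, set())
        return result, touched


class GlobalIndexCluster:
    """Term-partitioned: the index is partitioned by the indexed value."""

    def __init__(self):
        self.docs = [dict() for _ in range(NUM_PARTITIONS)]
        self.index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, doc_id, color):
        doc_p = stable_hash(doc_id) % NUM_PARTITIONS
        idx_p = stable_hash(color) % NUM_PARTITIONS
        self.docs[doc_p][doc_id] = color
        self.index[idx_p][color].add(doc_id)
        return {doc_p, idx_p}  # may span two partitions

    def find(self, color):
        idx_p = stable_hash(color) % NUM_PARTITIONS
        return set(self.index[idx_p].get(color, set())), {idx_p}


local, global_ = LocalIndexCluster(), GlobalIndexCluster()
for doc_id, color in [(101, "red"), (202, "silver"), (303, "red")]:
    local.write(doc_id, color)
    global_.write(doc_id, color)
print(local.find("red"))    # same documents, every partition consulted
print(global_.find("red"))  # same documents, one partition consulted
```

With several indexed fields per document, a term-partitioned write could touch one index partition per field, which is why multi-partition writes become the costly side of that design.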

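Finally, we discussed techniques for routing queries to the appropriate partition, which range from simple partition-aware load balancing to sophisticated parallel query execution engines.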

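By design, every partition operates mostly independently, which is what allows a partitioned database to scale to multiple machines. However, operations that need to write to several partitions can be difficult to reason about: for example, what happens if the write to one partition succeeds but the write to another fails? We will address that question in the following chapters.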

