NoSQL – 4 Datastore / Database Types & Their Common Use Cases
• Web App User Session Details & Preferences
• Real-time Recommendations & Advertising
• In-Memory Data Caching
• User Profiles
• Real-Time Big Data
• Content Management
• Unstructured Data w Ever-changing Schemas
• Sensor Logs (IOT)
• User Preferences
• Geographic Information
• Reporting Systems
• Time Series Data
• Logging & Write-Heavy Apps
• Real-time Fraud Detection
• Social Media
• Network & Database Infrastructure Monitoring
• Recommendation Engine
• Data Privacy, Risk & Compliance
• AI & Analytics
1. RDBMS vs 2. NoSQL vs 3. NewSQL/Distributed SQL – Modern DB Families
ACID (RDBMS) vs BASE (NoSQL) – Transaction Processing Differences
• Atomicity: Either the task (or all tasks) within a transaction are performed or none of them are. This is the all-or-none principle. If one element of a transaction fails the entire transaction fails.
• Consistency: The transaction must meet all protocols or rules defined by the system at all times. The transaction does not violate those protocols and the database must remain in a consistent state at the beginning and end of a transaction; there are never any half-completed transactions.
• Isolation: No transaction has access to any other transaction that is in an intermediate or unfinished state. Thus, each transaction is independent unto itself. This is required for both performance and consistency of transactions within a database.
• Durability: Once the transaction is complete, it will persist as complete and cannot be undone; it will survive system failure, power loss and other types of system breakdowns.
• At any given microsecond in a database that uses ACID as its system constraint, all the data are undergoing constant checks to make sure they fulfill those constraints. Such requirements worked quite well for many years in the smaller, vertically scalable, schema-driven, normalized, relational world of the bygone pre-social-networking age. Such past truisms are no longer the case; Unstructured Data, Big Data, non-relational data structures, distributed computing systems and eventual consistency are now becoming more commonplace.
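The all-or-none behaviour described under Atomicity can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names in it and the transfer helper are all hypothetical, chosen only for this illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-none transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failed debit leaves partial work to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects a negative balance here.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical schema and data for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

If a transfer asks for more than the source account holds, the debit fails and the credit that was already applied is rolled back with it, so both balances stay as they were.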
• In the CAP Theorem, Consistency refers to whether every part of the system sees the same data. Does the system reliably follow its defined rules? Do all nodes within a cluster see all the data they are supposed to? This is closely related to the idea presented in ACID.
• Availability means just what it sounds like. Is the given service or system available when requested? Does each request get a response, whether that response reports success or failure?
• Partition Tolerance means that a given system continues to operate even when messages between nodes are lost or part of the system fails. A single node failure should not cause the entire system to collapse.
• In the majority of instances, a distributed system can only guarantee two of the features, not all three. To ignore such a decision could have catastrophic results that include the possibility of all three elements falling apart simultaneously. The constraints of CAP Theorem on database reliability were monumental for new large-scale, distributed, non-relational systems: they often need Availability and Partition Tolerance, so Consistency suffers and ACID collapses.
• Basically Available: This constraint states that the system does guarantee the availability of the data as regards CAP Theorem; there will be a response to any request. But, that response could still be ‘failure’ to obtain the requested data or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account.
• Soft state: The state of the system could change over time, so even during times without input there may be changes going on due to ‘eventual consistency,’ thus the state of the system is always ‘soft.’
• Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves on to the next one. Werner Vogels’s article “Eventually Consistent – Revisited” covers this topic in much greater detail.
• How do the vast data systems of the world such as Google’s BigTable and Amazon’s Dynamo and Facebook’s Cassandra (to name only three of many) deal with a loss of consistency and still maintain system reliability?
• The new style of database transaction processing has allowed for more efficient horizontal scaling at cost-effective levels; checking the consistency of every single transaction at every moment of every change adds gargantuan costs to a system that has literally trillions of transactions occurring. The computing requirements are even more astronomical. Eventual consistency gave organizations such as Yahoo! and Google and Twitter and Amazon, plus thousands (if not millions) more, the ability to interact with customers across the globe, continuously, with the necessary availability and partition tolerance, while keeping their costs down, their systems up, and their customers happy. Of course they would all like to have complete consistency all the time, but as Dan Pritchett discusses in his article “BASE: An Acid Alternative,” there have to be tradeoffs, and eventual consistency allowed for the effective development of systems that could deal with the exponential increase of data due to social networking, cloud computing and other Big Data projects.
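The "soft state" and "eventual consistency" ideas above can be sketched with a toy in-memory model. This is not how any real database is implemented; the ToyReplicatedStore class and its methods are hypothetical and only show why a read can briefly return stale data.

```python
from collections import deque

class ToyReplicatedStore:
    """Hypothetical model: one leader copy, N follower copies, async replication."""

    def __init__(self, followers=2):
        self.leader = {}
        self.followers = [{} for _ in range(followers)]
        self.pending = deque()  # writes accepted by the leader, not yet replicated

    def write(self, key, value):
        # The write is accepted immediately by the leader.
        self.leader[key] = value
        # Followers only catch up later, so the overall state is "soft".
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # Reads are served by a follower and may be stale or missing.
        return self.followers[replica].get(key)

    def replicate(self):
        # Drain the queue: once input stops, every copy converges.
        while self.pending:
            key, value = self.pending.popleft()
            for follower in self.followers:
                follower[key] = value
```

A read issued right after a write may miss the new value; after replicate() has run, every follower returns it.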
Amazon Purpose-built Database – Choices
Amazon NoSQL Database Family – Documentation Links
Amazon DynamoDB Developer Guide (DG) – Links
• Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
• You can use Amazon DynamoDB to create a database table that can store and retrieve any amount of data, and serve any level of request traffic.
• Amazon DynamoDB automatically spreads the data and traffic for the table over a sufficient number of servers to handle the request capacity specified by the customer and the amount of data stored, while maintaining consistent and fast performance.
• Amazon DynamoDB
• DG – DynamoDB – Intro
• DG – Getting Started w DynamoDB
• DG – Working w Tables, Items, Queries, Scans & Indexes
• DG – In-Memory Acceleration w DynamoDB Accelerator (DAX)
• DG – NoSQL Workbench for DynamoDB
DynamoDB Developer Guide (DG) – Best Practice Links
• DG – Best Practices for Designing & Architecting w DynamoDB
• DG – NoSQL Design for DynamoDB
• DG – Best Practices for Designing & Using Partition Keys Effectively
• DG – Best Practices for Using Sort Keys to Organize Data
• DG – Best Practices for Using Secondary Indexes in DynamoDB
Amazon DocumentDB Developer Guide (DG) – Links
• Amazon DocumentDB (with MongoDB compatibility) is a fast, reliable, and fully managed database service that makes it easy for you to set up, operate, and scale MongoDB-compatible databases.
• Amazon DocumentDB (with MongoDB compatibility)
• DG – Amazon DocumentDB – Intro
• DG – Get Started with Amazon DocumentDB
• DG – Amazon DocumentDB Transactions
• DG – Best Practices for Amazon DocumentDB
Amazon Keyspaces (for Apache Cassandra) Developer Guide (DG) – Links
• Wide-column Datastore
• Amazon Keyspaces (for Apache Cassandra) is a scalable, highly available, and managed Apache Cassandra–compatible database service.
• Amazon Keyspaces (for Apache Cassandra)
• DG – Amazon Keyspaces (for Apache Cassandra) – Intro
• DG – Getting Started with Amazon Keyspaces
• DG – Working with Keyspaces, Tables, and Rows in Amazon Keyspaces
• DG – Using NoSQL Workbench with Amazon Keyspaces
Amazon Neptune User Guide (UG) – Links
• Fast, reliable Graph Database built for the cloud
• Amazon Neptune
• UG – Amazon Neptune – Intro
• UG – Getting Started with Graph Databases
• UG – Overview of Amazon Neptune Features
• UG – Managing Your Amazon Neptune Database
• UG – Best Practices: Getting the Most Out of Neptune
Amazon DynamoDB – Design Principles
DynamoDB is a fully managed NoSQL Key-Value & Document database
• DynamoDB is suited for workloads with any amount of data that require predictable read and write performance and automatic scaling from large to small and everywhere in between.
• DynamoDB scales up and down to support whatever read and write capacity you specify per second in provisioned capacity mode. Or you can set it to On-Demand mode and there is little to no capacity planning.
• DynamoDB stores 3 copies of data on SSD drives across 3 AZs in a region.
• DynamoDB’s most common datatypes are B (Binary), N (Number), and S (String)
• Tables consist of Items (rows) and Items consist of Attributes (columns)
READ & WRITE CONSISTENCY
• DynamoDB can be set to support Eventually Consistent Reads (default) and Strongly Consistent Reads on a per-call basis.
1. Eventually Consistent Reads return data immediately, but the data may be stale. Copies of the data are generally consistent within 1 second.
2. Strongly Consistent Reads always read from the leader node for the partition, since it always has an up-to-date copy. The data returned is never stale, but latency may be higher. Copies of the data will be consistent, with a guarantee of 1 second.
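Because consistency is chosen per call, the choice shows up as a single request parameter. The sketch below only builds the parameter dictionary for a GetItem call (TableName, Key and ConsistentRead are GetItem parameter names); the helper name, the table name and the key attribute are hypothetical, and nothing is sent to AWS.

```python
def get_item_request(table, key, strongly_consistent=False):
    """Build GetItem parameters; reads are eventually consistent unless asked otherwise."""
    return {
        "TableName": table,
        "Key": key,
        # False (the default) = eventually consistent, True = strongly consistent
        "ConsistentRead": strongly_consistent,
    }

# Hypothetical table and key, used only for illustration.
order_key = {"OrderId": {"S": "order-123"}}
cheap_read = get_item_request("Orders", order_key)
fresh_read = get_item_request("Orders", order_key, strongly_consistent=True)
```

With an SDK such as boto3, a dictionary like this would typically be passed as keyword arguments to the client's get_item call.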
• A Partition is a slice of your table: DynamoDB divides the table into smaller chunks of data, which speeds up reads for very large tables.
• DynamoDB automatically creates Partitions for:
• Every 10 GB of Data or
• When you exceed the limit of 3,000 RCUs or 1,000 WCUs for a single partition
• When DynamoDB sees a pattern of a hot partition, it will split that partition in an attempt to fix the issue.
• DynamoDB will try to evenly split the RCUs and WCUs across Partitions
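The thresholds above suggest a rough lower bound on how many partitions a table needs. The sketch below is only an estimate built from those three numbers; DynamoDB's real partitioning is internal and may differ, and the function name is hypothetical.

```python
import math

def estimate_partitions(table_size_gb, rcus, wcus):
    """Rough lower bound on partition count from the 10 GB / 3000 RCU / 1000 WCU limits."""
    by_size = math.ceil(table_size_gb / 10)   # one partition per 10 GB of data
    by_reads = math.ceil(rcus / 3000)         # at most 3000 RCUs per partition
    by_writes = math.ceil(wcus / 1000)        # at most 1000 WCUs per partition
    return max(1, by_size, by_reads, by_writes)

def capacity_per_partition(rcus, wcus, partitions):
    """Capacity each partition gets if RCUs and WCUs are split evenly."""
    return rcus / partitions, wcus / partitions
```

For example, a 25 GB table needs at least 3 partitions under this estimate, and 3,000 RCUs split across them leaves 1,000 RCUs for each, which is why a single hot partition can throttle well below the table total.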
PRIMARY KEY DESIGN
• Primary keys define where and how your data will be stored in partitions
• The Key schema can be made up of two keys:
1. Partition Key (PK) is also known as HASH
2. The Sort Key (SK) is also known as RANGE
• When using the AWS DynamoDB API (e.g. via the CLI or an SDK), the PK and SK are referred to by these alternative names for legacy reasons.
Primary key comes in two types:
1. Simple Primary Key (Using only a Partition Key)
2. Composite Primary Key (Using both a Partition and Sort Key)
Key Uniqueness is as follows:
• When creating a Simple Primary Key the PK value must be unique
• When creating a Composite Primary Key the combined PK and SK must be unique
• When using a Sort key, records on the partition are logically grouped together in Ascending order.
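The HASH and RANGE names appear directly in a table definition. The sketch below builds the parameters for a CreateTable call (KeySchema, AttributeDefinitions and BillingMode are CreateTable parameter names). The helper, the table name and the attribute names are hypothetical, both keys are assumed to be strings, and nothing is sent to AWS.

```python
def table_definition(table, partition_key, sort_key=None):
    """Build CreateTable parameters for a simple or composite primary key."""
    key_schema = [{"AttributeName": partition_key, "KeyType": "HASH"}]
    attributes = [{"AttributeName": partition_key, "AttributeType": "S"}]
    if sort_key is not None:
        # Adding a RANGE key turns the simple primary key into a composite one.
        key_schema.append({"AttributeName": sort_key, "KeyType": "RANGE"})
        attributes.append({"AttributeName": sort_key, "AttributeType": "S"})
    return {
        "TableName": table,
        "KeySchema": key_schema,
        "AttributeDefinitions": attributes,
        "BillingMode": "PAY_PER_REQUEST",  # On-Demand, so no capacity numbers here
    }

# Hypothetical tables, used only for illustration.
simple = table_definition("Users", "UserId")
composite = table_definition("Orders", "CustomerId", "OrderDate")
```

In the composite case, many items can share the same CustomerId as long as each (CustomerId, OrderDate) pair is unique.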
• DynamoDB has two types of Indexes:
1. LSI – Local Secondary index
2. GSI – Global Secondary Index
1. LSI – Local Secondary index
• Supports strongly consistent or eventually consistent reads
• Can only be created with the initial table (cannot be modified, and cannot be deleted unless also deleting the table)
• Only Composite
• 10GB or less per partition
• Share capacity units with base table
• Must share Partition Key (PK) with base table.
2. GSI – Global Secondary Index
• Only eventual consistency reads (cannot provide strong consistency)
• Can create, modify, or delete at any time
• Simple & Composite
• Can have whatever attributes as Partition Key (PK) or Sort Key (SK)
• No size restriction per partition
• Has its own capacity settings (does not share with base table)
• Your table(s) should be designed in such a way that your workload primary access patterns do not use Scans. Overall, scans should be needed sparingly, for example for an infrequent report.
• A Scan reads through all items in a table and then returns one or more items, optionally narrowed by filters
• By default returns all attributes for every item (use ProjectionExpression to limit)
• Scans are sequential, and you can speed up a scan through parallel scans using the Segment and TotalSegments parameters
• Scans can be slow, especially with very large tables and can easily consume your provisioned throughput.
• Scans are one of the most expensive ways to access data in DynamoDB.
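A parallel scan is just several Scan requests that each cover one segment of the table. The sketch below builds one parameter dictionary per segment (Segment, TotalSegments and ProjectionExpression are Scan parameter names); the helper, table name and attribute names are hypothetical, and each dictionary would normally be handed to its own worker.

```python
def parallel_scan_requests(table, total_segments, attributes=None):
    """Build one Scan request per segment; segments are numbered from 0."""
    requests = []
    for segment in range(total_segments):
        params = {
            "TableName": table,
            "Segment": segment,
            "TotalSegments": total_segments,
        }
        if attributes:
            # Limit the attributes returned instead of fetching whole items.
            # Attribute names that are reserved words would also need
            # ExpressionAttributeNames, which this sketch leaves out.
            params["ProjectionExpression"] = ", ".join(attributes)
        requests.append(params)
    return requests

# Hypothetical table and attributes, used only for illustration.
scan_jobs = parallel_scan_requests("Orders", 4, ["OrderId", "Total"])
```

Splitting a scan this way makes it finish sooner, but it still reads every item, so it consumes the same throughput in less time.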
• A Query finds items based on primary key values
• Table must have a composite key in order to be able to query
• By default queries are Eventually Consistent (set ConsistentRead to True to make them Strongly Consistent)
• By default returns all attributes for each item found by a query (use ProjectionExpression to limit)
• By default results are sorted ascending by Sort Key (set ScanIndexForward to False to reverse the order to descending)
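The query defaults above map onto a handful of request parameters. The sketch below builds the parameters for a Query call (KeyConditionExpression, ConsistentRead, ScanIndexForward and ProjectionExpression are Query parameter names); the helper, table and attribute names are hypothetical, the partition key is assumed to be a string, and nothing is sent to AWS.

```python
def query_request(table, pk_name, pk_value, newest_first=False,
                  strongly_consistent=False, attributes=None):
    """Build Query parameters for all items sharing one partition key value."""
    params = {
        "TableName": table,
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": pk_name},
        "ExpressionAttributeValues": {":pk": {"S": pk_value}},
        "ConsistentRead": strongly_consistent,  # default: eventually consistent
        "ScanIndexForward": not newest_first,   # False = descending sort key order
    }
    if attributes:
        params["ProjectionExpression"] = ", ".join(attributes)
    return params

# Hypothetical table: a customer's orders, most recent sort key first.
recent_orders = query_request("Orders", "CustomerId", "cust-42",
                              newest_first=True, attributes=["OrderDate", "Total"])
```

Unlike a Scan, this request touches only the items under one partition key value, which is why queries are the preferred access path.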
• DynamoDB has two capacity modes, Provisioned and On-Demand. You can switch between these modes once every 24 hours.
• Provisioned Throughput Capacity is the maximum amount of capacity your application is allowed to read or write per second from a table or index
• Provisioned is suited for predictable or steady state workloads
• RCU stands for Read Capacity Unit
• WCU stands for Write Capacity Unit
• You should enable Auto Scaling with Provisioned capacity mode. In this mode, you set a floor and ceiling for the capacity you wish the table to support. DynamoDB will automatically add and remove capacity between these values on your behalf and throttle calls that go above the ceiling for too long.
• If you go beyond your provisioned capacity, you’ll get an Exception: ProvisionedThroughputExceededException (throttling)
• Throttling is when requests are blocked due to read or write frequency higher than set thresholds. E.g. exceeding set provisioned capacity, partitions splitting, table/index capacity mismatch.
• On-Demand Capacity is pay per request. So you pay only for what you use.
• On-Demand is suited for new or unpredictable workloads
• The throughput is only limited by the default upper limits for a table (40K RCUs and 40K WCUs)
• Throttling can occur if you exceed double your previous peak capacity (high water mark) within 30 minutes. For example, if you previously peaked to a maximum of 30,000 ops/sec, you could not peak immediately to 90,000 ops/sec, but you could to 60,000 ops/sec.
• Since there is no hard limit, On-Demand could become very expensive based on emerging scenarios
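The On-Demand "double your previous peak" rule above is simple arithmetic. The sketch below only encodes that rule as stated; the function names are hypothetical, and actual throttling behaviour depends on more than this one check.

```python
def max_immediate_peak(previous_peak):
    """Highest request rate that stays within double the previous high water mark."""
    return 2 * previous_peak

def risks_throttling(previous_peak, target_rate):
    """True if jumping straight to target_rate exceeds double the previous peak."""
    return target_rate > max_immediate_peak(previous_peak)
```

With a previous peak of 30,000 ops/sec, risks_throttling(30000, 60000) is False while risks_throttling(30000, 90000) is True.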
CALCULATING READS & WRITES
1. CALCULATING READS (RCU)
A read capacity unit represents:
• one strongly consistent read per second,
• or two eventually consistent reads per second,
• for an item up to 4 KB in size.
How to calculate RCUs for strongly consistent reads
1. Round the item size up to the nearest multiple of 4 KB.
2. Divide the rounded size by 4.
3. Multiply by the number of reads per second.
Here’s an example:
• 50 reads at 40KB per item. (40/4) x 50 = 500 RCUs
• 10 reads at 6KB per item. (8/4) x 10 = 20 RCUs
• 33 reads at 17KB per item. (20/4) x 33 = 165 RCUs
How to calculate RCUs for eventually consistent reads
1. Round the item size up to the nearest multiple of 4 KB.
2. Divide the rounded size by 4.
3. Multiply by the number of reads per second.
4. Divide the result by 2.
5. Round up to the nearest whole number.
Here’s an Example:
• 50 reads at 40KB per item. (40/4) x 50 / 2 = 250 RCUs
• 11 reads at 9KB per item. (12/4) x 11 / 2 = 17 RCUs
• 14 reads at 24KB per item. (24/4) x 14 / 2 = 42 RCUs
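The two RCU procedures above can be written as one small function. This is a sketch of the arithmetic only; the function name and parameter names are hypothetical.

```python
import math

def rcus_needed(item_size_kb, reads_per_second, strongly_consistent=True):
    """RCUs for a read workload, following the rounding steps above."""
    # Steps 1-2: round the item size up to a multiple of 4 KB, then divide by 4.
    units_per_read = math.ceil(item_size_kb / 4)
    # Step 3: multiply by the number of reads per second.
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        # Eventually consistent reads cost half as much; round up to a whole unit.
        total = math.ceil(total / 2)
    return total
```

For example, rcus_needed(40, 50) gives 500, and rcus_needed(9, 11, strongly_consistent=False) gives 17.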
2. CALCULATING WRITES (WCU)
A write capacity unit represents:
• one write per second,
• for an item up to 1 KB
How to calculate WCUs
1. Round the item size up to the nearest whole KB.
2. Multiply by the number of writes per second.
Here’s an Example:
• 50 writes at 40KB per item. 40 x 50 = 2000 WCUs
• 11 writes at 1KB per item. 1 x 11 = 11 WCUs
• 18 writes at 500 BYTES per item. 1 x 18 = 18 WCUs
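The write calculation can be sketched the same way. The function name is hypothetical, and the item size is taken in bytes (assuming 1 KB = 1024 bytes) so that sub-kilobyte items like the 500-byte example are easy to express.

```python
import math

def wcus_needed(item_size_bytes, writes_per_second):
    """WCUs for a write workload: one unit per 1 KB (rounded up) per write."""
    # Round the item size up to a whole number of KB, with a minimum of 1.
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return units_per_write * writes_per_second
```

For example, wcus_needed(40 * 1024, 50) gives 2000 and wcus_needed(500, 18) gives 18.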
DynamoDB ACCELERATOR (DAX)
• DynamoDB Accelerator (DAX) is a fully managed, in-memory, write-through cache for DynamoDB that runs in a cluster
• Reads are Eventually Consistent
• Incoming requests are evenly distributed across all of the nodes in the cluster.
• DAX can reduce read response times to microseconds
• DAX is IDEAL for:
• fastest response times possible
• apps that read a small number of items more frequently
• apps that are read intensive
• DAX is NOT IDEAL for:
• Apps that require strongly consistent reads
• Apps that do not require microsecond read response times
• Apps that are write intensive, or that do not perform much read activity
• If you don’t need DAX consider using ElastiCache
• DynamoDB supports transactions via the TransactWriteItems and TransactGetItems API calls.
• Transactions let you read or write items across multiple tables at once and are an all-or-nothing approach (every action in the call must succeed, or none is applied).
• DynamoDB Global tables provide a fully managed solution for deploying multi-region, multi-master databases.
• DynamoDB Streams allows you to set up a Lambda function triggered every time data is modified in a table to react to changes. Streams do not consume RCUs.
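The all-or-nothing transaction behaviour noted above can be pictured as one request carrying several actions. The sketch below builds the parameters for a TransactWriteItems call (TransactItems, Update, ConditionExpression and the expression maps are parameter names of that API). The helper, the Accounts table and its AccountId and Balance attributes are hypothetical, and nothing is sent to AWS.

```python
def transfer_transaction(table, from_id, to_id, amount):
    """Build a two-action TransactWriteItems request: debit one item, credit another."""
    if from_id == to_id:
        # Two actions in one transaction cannot target the same item.
        raise ValueError("source and destination must be different items")
    names = {"#bal": "Balance"}
    values = {":amt": {"N": str(amount)}}  # numbers are sent as strings
    debit = {
        "TableName": table,
        "Key": {"AccountId": {"S": from_id}},
        "UpdateExpression": "SET #bal = #bal - :amt",
        # If this condition fails, the whole transaction is rejected.
        "ConditionExpression": "#bal >= :amt",
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }
    credit = {
        "TableName": table,
        "Key": {"AccountId": {"S": to_id}},
        "UpdateExpression": "SET #bal = #bal + :amt",
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }
    return {"TransactItems": [{"Update": debit}, {"Update": credit}]}
```

Either both balance updates are applied or neither is, which is the same guarantee the Atomicity bullet describes for relational databases.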
• DynamoDB API’s most notable commands via CLI: aws dynamodb <command>
• aws dynamodb get-item returns a set of attributes for the item with the given primary key. If there is no matching item, it does not return any data and there is no Item element in the response.
• aws dynamodb put-item creates a new item, or replaces an old item with a new item. If an item that has the same primary key as the new item already exists in the specified table, the new item completely replaces the existing item.
• aws dynamodb update-item edits an existing item’s attributes, or adds a new item to the table if it does not already exist.
• aws dynamodb batch-get-item returns the attributes of one or more items from one or more tables. You identify requested items by primary key. A single operation can retrieve up to 16 MB of data, which can contain as many as 100 items.
• aws dynamodb batch-write-item puts or deletes multiple items in one or more tables. It can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB.
• aws dynamodb create-table adds a new table to your account. Table names must be unique within each Region.
• aws dynamodb update-table modifies the provisioned throughput settings, global secondary indexes, or DynamoDB Streams settings for a given table.
• aws dynamodb delete-table deletes a table and all of its items.
• aws dynamodb transact-get-items is a synchronous operation that atomically retrieves multiple items from one or more tables (but not from indexes) in a single account and Region. A call can contain up to 25 objects. The aggregate size of the items in the transaction cannot exceed 4 MB.
• aws dynamodb transact-write-items is a synchronous write operation that groups up to 25 action requests. These actions can target items in different tables, but not in different AWS accounts or Regions, and no two actions can target the same item.
• aws dynamodb query finds items based on primary key values. You can query a table or secondary index that has a composite primary key.
• aws dynamodb scan returns one or more items and item attributes by accessing every item in a table or a secondary index.