Storage Primer

Storage is a critical component of modern IT. Choosing the right storage platform and software is an important part of building a good data center, whether it is an internal or an external cloud. Although I understood some storage basics, I had never looked deeply into the different storage technologies available. I recently brushed up on the subject with some reading, and this blog captures part of what I learned.

Storage devices (HDD vs RAID vs SSD)

HDD – A hard disk drive consists of rotating magnetic disks (platters) mounted on a spindle.

RAID (Redundant Array of Independent Disks) – Combines multiple HDDs to provide more reliability, throughput and capacity.

SSD – A solid state drive stores data in flash memory chips and has no moving parts.
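One common way a RAID array gets its reliability is by storing a parity block alongside the data blocks. The sketch below assumes a simple XOR parity scheme with three data disks and one parity disk. The block contents and the helper name `xor_blocks` are made up for illustration, and real RAID levels differ in how they lay out data and parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per data disk, plus a parity block on a fourth disk.
data_disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_disks)

# Simulate losing the second disk and rebuilding it from the survivors
# plus the parity block.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_disks[1])
```

Because XOR is its own inverse, any single missing block can be recomputed from the remaining blocks, which is why the array survives the loss of one disk.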

Storage device performance is measured in terms of throughput (data transfer rate), latency (the delay between issuing an IO request and its completion) and IOPS (IO operations per second). SSDs generally beat HDDs on all three parameters. A RAID array of HDDs can provide throughput and IOPS comparable to an SSD, but the SSD still has lower latency. The main disadvantage of SSDs is their much higher cost.
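The three metrics are related: for a fixed IO size, throughput is roughly IOPS multiplied by the IO size, and latency bounds how many operations can complete per second. Here is a minimal sketch of that arithmetic. The function names and the latency figures are illustrative assumptions, not measurements of any real device.

```python
def throughput_mb_per_s(iops: float, io_size_kb: float) -> float:
    """Approximate throughput in MB/s for a given IOPS rate and IO size.

    Assumes every operation transfers exactly io_size_kb kilobytes and
    uses 1 MB = 1024 KB.
    """
    return iops * io_size_kb / 1024


def iops_from_latency(latency_ms: float, queue_depth: int = 1) -> float:
    """Upper bound on IOPS when each IO takes latency_ms to complete.

    With queue_depth requests in flight at once, up to queue_depth
    operations can finish per latency interval.
    """
    return queue_depth * 1000 / latency_ms


# Hypothetical latencies, chosen only to show the arithmetic.
hdd_iops = iops_from_latency(latency_ms=10)
ssd_iops = iops_from_latency(latency_ms=0.1)

print(hdd_iops, throughput_mb_per_s(hdd_iops, io_size_kb=4))
print(ssd_iops, throughput_mb_per_s(ssd_iops, io_size_kb=4))
```

With small 4 KB operations, a 100x difference in latency turns directly into a 100x difference in both IOPS and throughput, which is why latency matters so much for random-access workloads.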

DAS vs SAN vs NAS

DAS – Direct-attached storage, SAN – Storage area network, NAS – Network-attached storage

The following table gives the major differences between the three types.

[Image storage3: table comparing DAS, SAN and NAS]

Structured, Unstructured and Semi-structured data

Structured data typically lives in enterprise databases for payroll, inventory and so on, and has a pre-defined schema. Examples of semi-structured data are web logs and sensor information. The majority of the data on the internet (70%) is unstructured data, which includes photos, video and documents. Storage devices and database types are chosen based on the data type.
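The difference between the three categories is easiest to see side by side. The sketch below uses made-up records: the `PayrollRow` class, the field names and the sample values are assumptions for illustration only.

```python
import json
from dataclasses import dataclass


@dataclass
class PayrollRow:
    """Structured: every record has the same fields with fixed types."""
    employee_id: int
    name: str
    salary: float


structured = PayrollRow(employee_id=1, name="Alice", salary=5200.0)

# Semi-structured: each record describes its own fields, and the set of
# fields can differ from one record to the next (here "user" is optional).
log_lines = [
    '{"path": "/index.html", "status": 200}',
    '{"path": "/login", "status": 302, "user": "bob"}',
]
events = [json.loads(line) for line in log_lines]
users = [event.get("user") for event in events]

# Unstructured: an opaque byte sequence. Any meaning comes from an
# application that understands the format (these four bytes are the
# start of a JFIF JPEG file).
photo = bytes([0xFF, 0xD8, 0xFF, 0xE0])

print(structured.salary, users, len(photo))
```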

SQL vs NoSQL

Traditionally, structured data was stored in relational databases queried with SQL. Data in cloud-based applications is typically unstructured or semi-structured, so newer NoSQL databases are more common there. The following table gives a brief comparison of the two.

SQL | NoSQL
Primarily relational, with a pre-defined schema that is difficult to change later | Primarily non-relational or distributed; can be document-based, key-value pairs or graph databases
Vertically scalable | Horizontally scalable
Examples are MySQL, Oracle Database, PostgreSQL | Examples are MongoDB, Cassandra; many NoSQL databases are open source
Emphasizes ACID properties (Atomicity, Consistency, Isolation and Durability) | Minimal emphasis on ACID properties; there are alternative ways to enforce them
Most suitable for complex transactions with complex queries | Suitable for simpler transactions
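The schema and ACID rows of the comparison can be made concrete with a small sketch. It uses Python's built-in sqlite3 module for the relational side and a plain dict as a stand-in for a document store. The table, the keys and the `transfer` helper are hypothetical, and a real NoSQL database offers far more than a dict.

```python
import sqlite3

# Relational side: the schema is declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "owner TEXT NOT NULL, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "alice", 100), (2, "bob", 50)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount between accounts atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Bob only has 50, so the debit violates the CHECK constraint and the
# credit that was already applied to alice is rolled back with it.
ok = transfer(conn, src=2, dst=1, amount=80)
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(ok, balances)

# Document-style side: records are looked up by key and need not share
# the same fields, so adding a field requires no schema change.
documents = {}
documents["user:1"] = {"name": "alice", "email": "alice@example.com"}
documents["user:2"] = {"name": "bob", "tags": ["gamer"], "level": 12}
print(sorted(documents["user:2"]))
```

The failed transfer leaves both balances untouched, which is the atomicity part of ACID. The dict side accepts records of any shape, but nothing stops a half-finished update from being visible unless the application handles that itself.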

The following link gives a good comparison between the different NoSQL database types (key-value, document-based, column-based, graph-based). The following link gives real-world examples of NoSQL databases; popular use cases are web applications and gaming. The following picture shows how performance varies with data scale for SQL and NoSQL.

[Image storage5: performance versus data scale for SQL and NoSQL]

Object storage

Object storage is a NoSQL-style type of storage that is widely used in the cloud. The Amazon S3 and Dropbox services are examples of object storage services. Object storage is used mainly for unstructured or semi-structured data. The following are some characteristics of object storage.

  • Objects are identified by a unique key.
  • Multiple metadata attributes can be associated with an object, which makes object retrieval easier.
  • Objects are immutable. When modification is needed, a new version of the object is created.
  • Object storage is highly scalable compared to NAS because it avoids some of the overhead of NAS, such as maintaining file structures and user permissions.
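The first three characteristics above can be sketched as a toy in-memory store. `ToyObjectStore` and its `put`, `get` and `find` methods are hypothetical names for illustration; this is not the API of S3, Swift or any other real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    """One immutable version of an object: payload plus metadata."""
    data: bytes
    metadata: tuple  # sorted (name, value) pairs, kept immutable


class ToyObjectStore:
    """Hypothetical in-memory object store with a flat key namespace."""

    def __init__(self):
        self._versions = {}  # key -> list of StoredObject, oldest first

    def put(self, key, data, **metadata):
        """Store a new version under key and return its version number."""
        obj = StoredObject(bytes(data), tuple(sorted(metadata.items())))
        self._versions.setdefault(key, []).append(obj)
        return len(self._versions[key])

    def get(self, key, version=None):
        """Return the latest version, or a specific 1-based version."""
        versions = self._versions[key]
        return versions[-1] if version is None else versions[version - 1]

    def find(self, **criteria):
        """Return keys whose latest version matches all metadata criteria."""
        return sorted(
            key
            for key, versions in self._versions.items()
            if all(
                dict(versions[-1].metadata).get(name) == value
                for name, value in criteria.items()
            )
        )


store = ToyObjectStore()
store.put("photos/cat.jpg", b"v1 bytes", content_type="image/jpeg", owner="alice")
store.put("photos/cat.jpg", b"v2 bytes", content_type="image/jpeg", owner="alice")
store.put("logs/app.log", b"started", content_type="text/plain", owner="bob")

print(store.get("photos/cat.jpg").data)             # latest version
print(store.get("photos/cat.jpg", version=1).data)  # earlier version is kept
print(store.find(owner="alice"))                    # lookup by metadata
```

Note that "photos/cat.jpg" is just a key that happens to contain a slash. There is no directory tree to maintain, which is part of why this model scales more easily than a file system.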

The following picture illustrates when to use object storage versus traditional file/block storage, based on how frequently the data is used.

[Image storage1: object storage versus file/block storage by frequency of data usage]

Hadoop

Hadoop is not a database but a framework for distributed storage and parallel computing: it pairs a distributed file system (HDFS) with the MapReduce programming model. Hadoop is mainly used for offline data processing and not for real-time processing, in applications like targeted ads, predictive analysis and recommendation engines. Hadoop uses MapReduce to distribute computationally intensive tasks across multiple servers. Hadoop works closely with NoSQL databases, and when someone says Big Data, they usually mean both Hadoop and NoSQL databases. The following picture from here shows the difference between offline and online data processing.

[Image storage2: offline versus online data processing]
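The MapReduce idea that Hadoop relies on can be sketched in a few lines. This is a single-machine toy, assuming a word-count job, with a thread pool standing in for the multiple servers. The function names are illustrative and this is not Hadoop's API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=2):
    """Run the map step in parallel, then shuffle and reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Each chunk plays the role of a block of input stored on a different server.
chunks = ["big data big storage", "data lake data"]
print(word_count(chunks))
```

The map step needs no coordination between chunks, so it can run wherever each piece of data is stored. Only the shuffle moves data between workers.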

AWS Storage services

AWS offers different services depending on the type of storage the user needs. I have put their snapshot below to illustrate the point that there are plenty of storage options and that the choice mainly depends on the application’s needs.

[Image storage4: snapshot of AWS storage services]

In addition to what’s listed above, AWS also provides EMR (Elastic MapReduce), a managed Hadoop service.

OpenStack storage

OpenStack has block storage (Cinder), object storage (Swift) and ephemeral storage managed by Nova.

References:

Pictures used in this blog are from references.
