Building Data Infrastructure

Will Larson (@lethain)

SocialCode (@socialcodeinc)

Growth problems.

A chart showing teams, products, and databases all increasing over time.

A familiar monolith.

Architecture diagram of a monolith composed of a load balancer, Gunicorn web servers, MySQL, Redis, and Celery workers.

Monolithic data problems.

Our requirements.

  1. Decouple publishing and consumption.
  2. Support cheap data exploration.
  3. Facilitate schema changes.
  4. Fail rarely, fail loudly.
  5. Avoid datastore lock-in.

Our solution.

A data pipeline, with services publishing into an input service, Kafka and Zookeeper between the input service and an output service, and datamarts subscribing to the output service.
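As a hedged sketch of the publishing side, assuming the Python requests library; the endpoint URL, topic name, and payload fields here are hypothetical:

import requests

# Hypothetical event a service wants to publish into the pipeline.
event = {
    "topic": "user.signup",  # hypothetical topic name
    "payload": {"firstName": "Ada", "lastName": "Lovelace", "age": 36},
}

# Publish over HTTP. The input service owns validation and the Kafka
# details, so publishers stay decoupled from consumers and datastores.
resp = requests.post("http://input-service.internal/publish", json=event, timeout=2)
resp.raise_for_status()

Publishers only need to speak HTTP and JSON, which keeps requirement 1 (decoupled publishing) and requirement 5 (no datastore lock-in) cheap to honor.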

Why Kafka?

  1. Does pub-sub right (see the consumer sketch below).
  2. Data durability.
  3. Horizontal scalability.
  4. Zookeeper.
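To make the pub-sub point concrete, a minimal consumer sketch, assuming the kafka-python client; the topic and group names are hypothetical:

import json
from kafka import KafkaConsumer

# Each subscriber uses its own group_id, so new consumers scale out
# without coordinating with publishers or with each other.
consumer = KafkaConsumer(
    "events",                         # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="warehouse-loader",      # hypothetical consumer group
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    print(message.offset, message.value)

Because Kafka retains messages on disk, a new datamart can join later and replay history rather than asking teams to re-publish.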

Why JSON Schema? Why HTTP?

{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}
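Validating an event against that schema is a single call with, for example, the Python jsonschema library; a sketch, assuming the schema above is bound to a variable named schema:

from jsonschema import validate, ValidationError

document = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}

try:
    validate(instance=document, schema=schema)  # schema is the JSON above
except ValidationError as err:
    # Reject bad events at the door: fail rarely, fail loudly.
    print("invalid event:", err.message)

JSON Schema gives every team a machine-checkable contract, and HTTP means any language with an HTTP client can publish.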

Have our lives improved?

  1. More, and more specialized, databases.
  2. Reduced coordination cost across teams.
  3. More experimentation and exploration.
  4. Paradigm flexibility (MapReduce, warehousing, ...).

What we've learned so far.

Questions?

Rejected slides

What about "big data" problems?

When an increase in data, whether in quantity or kind, causes your tools or approach to fail.

Details on the components.

  1. Simple HTTP interface (see the combined sketch below).
  2. JSON Schema
  3. Kafka
