Building Data Infrastructure

Will Larson (@lethain)

SocialCode (@socialcodeinc)

Growth problems.

A chart showing teams, products, and databases all increasing over time.

A familiar monolith.

Architecture diagram of a monolith composed of a load balancer, Gunicorn web servers, MySQL, Redis, and Celery workers.

Monolithic data problems.

Our requirements.

  1. Decouple publishing and consumption.
  2. Support cheap data exploration.
  3. Facilitate schema changes.
  4. Fail rarely, fail loudly.
  5. Avoid datastore lock-in.

Our solution.

A data pipeline, with services publishing into an input service, Kafka and Zookeeper between the input service and an output service, and datamarts subscribing to the output service.
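As a hedged sketch of the publishing side, assuming the Python requests library; the endpoint URL, topic name, and payload fields here are hypothetical:

import requests

# Hypothetical event a service wants to publish into the pipeline.
event = {
    "topic": "user.signup",  # hypothetical topic name
    "payload": {"firstName": "Ada", "lastName": "Lovelace", "age": 36},
}

# Publish over HTTP. The input service owns validation and the Kafka
# details, so publishers stay decoupled from consumers and datastores.
resp = requests.post("http://input-service.internal/publish", json=event, timeout=2)
resp.raise_for_status()

Publishers only need to speak HTTP and JSON, which keeps requirement 1 (decoupled publishing) and requirement 5 (no datastore lock-in) cheap to honor.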

Why Kafka?

  1. Does pub-sub right (see the consumer sketch below).
  2. Data durability.
  3. Horizontal scalability.
  4. Zookeeper.
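To make the pub-sub point concrete, a minimal consumer sketch, assuming the kafka-python client; the topic and group names are hypothetical:

import json
from kafka import KafkaConsumer

# Each subscriber uses its own group_id, so new consumers scale out
# without coordinating with publishers or with each other.
consumer = KafkaConsumer(
    "events",                         # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="warehouse-loader",      # hypothetical consumer group
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    print(message.offset, message.value)

Because Kafka retains messages on disk, a new datamart can join later and replay history rather than asking teams to re-publish.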

Why JSON Schema? Why HTTP?

{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}
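Validating an event against that schema is a single call with, for example, the Python jsonschema library; a sketch, assuming the schema above is bound to a variable named schema:

from jsonschema import validate, ValidationError

document = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}

try:
    validate(instance=document, schema=schema)  # schema is the JSON above
except ValidationError as err:
    # Reject bad events at the door: fail rarely, fail loudly.
    print("invalid event:", err.message)

JSON Schema gives every team a machine-checkable contract, and HTTP means any language with an HTTP client can publish.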

Have our lives improved?

  1. More, and more specialized, databases.
  2. Reduced coordination cost across teams.
  3. More experimentation and exploration.
  4. Paradigm flexibility (MapReduce, warehousing, ...).

What we've learned so far.

Questions?

Rejected slides

What about "big data" problems?

When an increase in data, whether in quantity or kind, causes your tools or approach to fail.

Details on the components.

  1. Simple HTTP interface (see the combined sketch below).
  2. JSON Schema
  3. Kafka
