Storing Reliable, Easy-To-Use Archives in Amazon S3

Around six months ago, we rewrote our stats collection and processing pipeline from scratch. The new pipeline follows an event stream architecture: messages sent from embedded Wistia players are published to NSQ, a distributed messaging platform, and workers consume those messages and write the results to the database.

One of the benefits of this architecture is the clear separation and independence between producers, which publish messages to NSQ as they arrive, and consumers, which read messages from NSQ. If the consumers lose their network connections, producers can still write to NSQ, and when the consumers reconnect, they pick up where they left off.

Furthermore, NSQ allows multiple consumers to read from the same topic that producers write to. For our data pipeline, one set of consumers processes messages and saves the results to the database, while a separate consumer compresses and archives messages to S3, giving us a durable, permanent record of everything that goes through the system. We call this consumer the “archiver.”

Here is our philosophy on creating an archive for an event stream:

  • Keep a record of every message the system processes, forever. This lets us re-process data when things go wrong or when we simply want to process it in a new way.
  • Archives should be stored in a human-readable format. In our case, that's JSON: it's structured and easy to read.
  • Archives should just be log files. They're flat, portable, and everyone knows how to work with them.
  • Have these files just a GET request away. We love how easy it is to retrieve files from S3.
  • Use descriptive names for archive logs. We rotate the log files every hour before sending them to S3 so that the name of the file contains the full timestamp of when the data was consumed by the archiver.
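To make that last point concrete, here's a sketch of the kind of hourly filename we mean. The `events` prefix is illustrative, not our real prefix; the timestamp format matches the one used in the script later in this post:

```shell
# Build an hourly-rotated archive name containing the full timestamp
# of when the data was consumed. "events" is a hypothetical prefix.
name="events.$(date -u +'%Y-%m-%d_%H').log.gz"
echo "$name"   # e.g. events.2015-01-20_13.log.gz
```

Because the hour is embedded in the name, finding the files for any date range is just string construction, with no need to list or inspect the bucket.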

This archive proved useful recently, when we had to re-process around 0.3% of approximately 30 days' worth of data from video embeds. Because we only needed to re-process a tiny fraction of the data, we wanted to pull only the parts of the log files we actually needed from S3. Here's what we did:

  1. Output compressed content of log files from S3 to standard out using s3cmd.
  2. Pipe output into zgrep.
  3. Write matching output to file.
  4. Feed output file to NSQ and re-process that data.

Doing this for a single log file is pretty simple. This command saves lines from file.log.gz that match “foo” to output.log:

s3cmd get s3://my-bucket/file.log.gz - | zgrep foo > output.log

However, we needed to do this for roughly 30 days' worth of data, with the archive logs in hourly chunks: many terabytes of data spread across more than 700 files. To do this in a sane way, here's the script we ran, with example start and stop dates filled in ($s3prefix, $grep, and $output hold the S3 path prefix, the search pattern, and the output file):

start=$(date --date '20 jan 2015 00:00' +%s)
stop=$(date --date '24 feb 2015 00:00' +%s)

for t in $(seq ${start} 3600 ${stop}); do
  date=$(date --date @${t} +'%Y-%m-%d_%H')
  s3cmd get "${s3prefix}.${date}.log.gz" - | zgrep "${grep}" >> "${output}"
done

This script loops through the date range one hour at a time, filters the matching lines out of each hourly file, and appends them to the output file. We then fed the output file back into the NSQ event stream for re-processing and resolved the problem.
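The same grep-over-hourly-gzips pattern can be exercised locally without S3. This sketch uses made-up filenames and JSON payloads: it builds two gzipped "hourly" archives, then filters matching lines into a single output file, just as the script above does per S3 object:

```shell
# Create a scratch directory with two fake hourly archives.
dir=$(mktemp -d)
printf '{"event":"play","id":1}\n{"event":"pause","id":2}\n' | gzip > "$dir/events.2015-01-20_00.log.gz"
printf '{"event":"play","id":3}\n' | gzip > "$dir/events.2015-01-20_01.log.gz"

# Filter matching lines from each hourly chunk into one output file.
: > "$dir/output.log"
for f in "$dir"/events.*.log.gz; do
  zgrep '"event":"play"' "$f" >> "$dir/output.log"
done

cat "$dir/output.log"
# {"event":"play","id":1}
# {"event":"play","id":3}
```

Because each hourly file is processed independently, a run that fails partway can be resumed from the last hour that completed, and the loop parallelizes easily if you need it to go faster.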

Having a reliable and easy-to-use archive system is crucial for our data pipeline, since it is a mission-critical piece of our infrastructure. To summarize our philosophy in one sentence: archive your data as human-readable files with descriptive names, and store them somewhere secure yet readily available.

Leave a comment if you have any questions or can improve the script!
