Command Line Tools for Stroom

June 14, 2016 Kasper Jeppesen

tl;dr Stroom Data comes with a set of command line tools that lets you experiment with data held in Stroom using your preferred languages and frameworks.

At the core of Stroom Data lies the idea of the JSON document stream. It is a concept that through its simplicity both sets limitations and opens up possibilities. If you work in a team of data engineers that is planning a 12-month-long project to build a platform capable of ingesting terabytes of data daily for a customer-facing product, then you may find the limitations to be a hindrance. But, if you are a single engineer needing to ingest gigabytes of data daily in order to make them available for future exploratory work that is not yet well understood, then the possibilities should excite you!

Stroom Data is simple by design. In a matter of minutes, you can download the distribution, start the service, begin ingesting data, and start working on your first map-reduce job in JavaScript. But no matter how simple and approachable the built-in tools and processes might be, nothing beats the stack you already know. That is why the Stroom distribution includes a set of command line tools written specifically to move data in and out of Stroom quickly using your language of choice.

If you want to export a stream so you can attack it with your current toolset—whether it be Python, R, Julia, Mathematica or some other awesome system we haven’t even heard of—all you need to do is use the sd_dump tool to write the contents of a stream to a file and sd_load to load your output file back into Stroom.

The following sample shows how simple it is to load a file of raw JSON into a stream.

$ sd_load metrics output.json
  100.0% |XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|
  1000K docs, 220MB data [23229 docs/s, 4.9 MB/s]
  Completed in 43 seconds
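
As a rough illustration of the export, process, and reload loop, the Python sketch below reads a dumped file, counts documents per value of one field, and writes the counts to a new file that could then be handed to sd_load. It assumes the dump holds one JSON document per line, which this post does not actually specify. The file names, the `platform` field, and the function name are all hypothetical.

```python
import json
from collections import Counter


def summarize(dump_path, out_path, field="platform"):
    """Count documents per value of `field` and write one JSON document per value.

    Assumption: dump_path holds one JSON document per line.
    """
    counts = Counter()
    with open(dump_path, encoding="utf-8") as src:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            doc = json.loads(line)
            # Documents without the field (or non-object documents) count as "unknown".
            value = doc.get(field, "unknown") if isinstance(doc, dict) else "unknown"
            counts[str(value)] += 1

    with open(out_path, "w", encoding="utf-8") as dst:
        # One summary document per line, ordered by field value.
        for value, count in sorted(counts.items()):
            dst.write(json.dumps({field: value, "count": count}) + "\n")
    return dict(counts)
```

After dumping a stream with sd_dump, a call such as `summarize("metrics_dump.json", "output.json")` would produce a file to load with sd_load in the same way as the sample above.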

For some use cases, you don’t even need to extract the data in bulk. Instead, you can use the tools sd_read and sd_write to pipe a stream through your own tools and back into a stream in Stroom. The following sample shows how even grep can be used to filter the contents of a stream and write the output to another stream.

$ sd_read metrics | grep "iOS" | sd_write ios_metrics

In this sample, sd_read reads a stream of JSON documents called “metrics” and writes each document as a line of text to standard out. grep filters the lines and passes them on to sd_write, which reads its input from standard in, collects it into batches, and appends it to a stream called ios_metrics.
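
Because grep matches raw text, it also passes any document that mentions iOS in some unrelated field. A small script in your own language can filter on the parsed document instead. The sketch below is a minimal Python stand-in for grep in the pipeline above. It relies only on the behavior described here: one JSON document per line on standard in, and one per line on standard out. The `platform` field and the script name used afterwards are hypothetical.

```python
import json
import sys


def keep(doc, field="platform", wanted="iOS"):
    """Return True if the document is an object whose `field` equals `wanted`."""
    return isinstance(doc, dict) and doc.get(field) == wanted


def filter_lines(lines, predicate=keep):
    """Yield the input lines whose parsed JSON document satisfies `predicate`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if predicate(doc):
            yield line


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Copy matching documents from stdin to stdout, one per line."""
    for kept in filter_lines(stdin):
        stdout.write(kept + "\n")
```

Saved as filter_ios.py with a final line that calls `main()`, the script would take the place of grep:

$ sd_read metrics | python filter_ios.py | sd_write ios_metrics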

The following screencast shows the usage of these tools in further detail.

Client Libraries

Command line tools are great for quick hacks and proof-of-concept explorations of your dataset, but they are not the best foundation for stable, continuously running services. For those, stay tuned for the next release of Stroom Data, which will include the first client libraries for a handful of languages in use at DoubleDutch. These libraries will let you consume and produce data in streams, and they will provide the tools to set up incremental map-reduce services in your favorite language in the same way the built-in services let you write them in JavaScript.

Got a specific language you would love to be writing your Stroom services in? Let us know in our Google group: https://groups.google.com/forum/#!forum/stroomdata

As always, follow us on Twitter @StroomData to stay up to date on releases, screencasts, and other Stroom-related news! Have a look at the GitHub repo https://github.com/DoubleDutch/StroomData for more information and to download the latest release!
