Data Science at the Command Line

In this practical guide, you’ll learn how to leverage the power of the command line for doing data science. By combining small, yet powerful, command-line tools, you can quickly obtain, scrub, explore, and model your data. Even if you’re already comfortable processing data with R or Python, being able to integrate the command line into your existing workflow will make you a more efficient and productive data scientist.

Learn essential concepts and built-in commands of the *nix command line
Get started with your own Data Science Toolbox on either Linux, Mac OS X, or Microsoft Windows
Use classic command-line tools such as grep, sed, and awk
Obtain data from websites, APIs, databases, and spreadsheets
Parallelize and distribute data-intensive pipelines to remote machines, including AWS EC2
Clean data in CSV, JSON, and XML/HTML formats using csvkit, jq, and scrape
Apply dimensionality reduction, clustering, regression, and classification algorithms
Visualize data and results from the command line using gnuplot and ggplot
Turn Bash one-liners and existing Python and R code into reusable command-line tools

120 pages

First published June 1, 2014

27 people are currently reading
371 people want to read

About the author

Jeroen Janssens

5 books, 6 followers
Jeroen Janssens, PhD, is a Senior Developer Relations Engineer at Posit, PBC. His expertise lies in visualizing data, implementing machine learning models, and building solutions using Python, R, JavaScript, and Bash. He’s passionate about open source and sharing knowledge. He’s the author of Python Polars: The Definitive Guide (O’Reilly, 2025) and Data Science at the Command Line (O’Reilly, 2021). Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. He lives with his wife and two kids in Rotterdam, the Netherlands.

Ratings & Reviews

Community Reviews

5 stars: 33 (23%)
4 stars: 62 (43%)
3 stars: 38 (26%)
2 stars: 8 (5%)
1 star: 0 (0%)
Displaying 1 - 17 of 17 reviews
Suhrob
485 reviews, 62 followers
October 6, 2018
I use and love the CLI on a daily basis, but the fact is that it is not suitable for most data analysis tasks.

Although there are some tools (a few introduced in the book) for working with data, you will sooner or later (and preferably sooner) end up redoing everything in Python/R...

The book also did not age well...

csvkit has nowadays been replaced by xsv, and drake has not seen a commit since 2015 (and does not seem very useful anyway).

So what you get:
- some very basic intro to relevant bash tools (curl, sed)
- some outdated tools like csvkit, drake
- some useless curiosities (feedgnuplot)
- some side-stepping of CLI tools in favor of higher-level tools (Python/R/Weka)
- a very good chapter on GNU parallel

So CLI is great, but this book is not.

Venkatesh-Prasad
223 reviews
May 29, 2017
The book provides an easy and simple route to basic data analysis tasks -- scrubbing and exploration. It will be useful to readers who 1) are interested in data analysis and just getting started, 2) have been using tools such as R and Python for data analysis and have wanted simpler ways to scrub and explore data, or 3) are interested in improving their command-line chops in the context of data analysis. However, this is not the book from which to learn data analysis/science.

The author provides a virtualization-based image to try out the tools described in the book. This is great, especially for Windows users. However, it did make me wonder whether, and how well, the solutions would translate to Windows PowerShell, which has some nice features absent from Unix shells. Also, while the book mentions numerous tools, many of them are scripts (e.g., scrape, cols) created by the author, and I wish they were available as installable packages via systems such as apt and port.

While the book tries to cover data modeling, it leans quite a bit on R and Python for this purpose. I would have preferred that it explore more data scrubbing and exploration tricks using tools such as awk.

At <200 pages, the book is a short read but it also can seem a bit light if you are command-line or python/R savvy. That said, as a command-line user, I did learn about some new Unix shell tools/tricks, e.g., crush-tools, seq, parallel, csvsql of csvkit, drake (make for data analysis).

Overall, a good introduction to the command-line based path to data scrubbing and exploration.
Tim Tilberg
9 reviews
August 14, 2020
The first 40 pages unpack dozens of practical ideas and tools I had never previously considered. It introduced me to tools like jq and csvtoolkit (which led to xsv), and many more. It talks about why running data tasks using Unix shell tools can sometimes be significantly more efficient than writing a move in a programming language (full parallelization with an entirely buffered pipeline).

After reading this book, I became significantly more efficient at researching and summarizing flat-file data. A lot of it is based on creating CLI tools from simple Python scripts. As a Rubyist, I found that it was all easily transferable to whatever scripting language you prefer.
m
20 reviews
April 29, 2019
As a backend/infra engineer with a few years' experience and an introductory-level knowledge of machine learning algorithms, I still found this sprinkled with useful snippets, tools, tips, and references. That being said, I found its structure odd (it seems like it would be better off as a cookbook) and thought it was confused in identifying and catering to its target audience, e.g. the early chapters are beginner-friendly (i.e. for readers familiar with data science, but not the command line), but the later ones markedly less so.
José
232 reviews
August 16, 2019
Pretty cool book if you are not already set in your own ways (there is a chapter on modelling data with good tools which I would not use, because I am much more used to different ones). If you already have your own habits, you can still learn quite a few things (at least I did) and get some inspiration to build your own command-line tools for data science!
157 reviews, 3 followers
October 7, 2017
This book is somewhat focused on the tools, which is good, but it also means I need to revisit the book as I explore each tool. The command-line tools it introduces are very interesting, and I will definitely adopt them. It just didn't bring a lot of new ideas for me.
Pritesh Shrivastava
80 reviews, 6 followers
September 15, 2018
I used this book more as a guide to get familiar with the regular command-line workflow. I find an IDE a much better tool for data science than the command line, as it's reproducible. But it's nice to know some of the shortcuts available in Bash scripts.
Ondrej Kokes
56 reviews, 20 followers
July 10, 2019
Not time-tested tools. The book covers coreutils and other Unix tools fairly lightly, then spends much of its time on random tools that will become obsolete sooner or later (as we can already see).

Pick up Unix Power Tools or Classic Shell Scripting instead.
67 reviews, 1 follower
March 25, 2020
Great overview and examples of command-line tools for doing data science. I would have wished for a few more tools and a deeper dive.
13 reviews, 1 follower
February 24, 2015
This book will really help you turn your command line hacking into scalable and well managed data projects.

This book isn't about BIG data; it's about getting hands-on with data on your desktop in a flexible, fast and fun way.
However, the author isn't asking you to give up Hadoop etc.; he's asking whether you'd like another set of tools for another day.

The book is well structured, its flow and style are good, and it provides an easy read.

If you have no command line experience there's a brief intro but if you're a command line veteran the book probably starts to get interesting around page 50 and this gives you a core of around 100 pages. This is a bit short but it's also densely packed.

The book covers several topics including:

* Using and transforming data in plain-text/CSV/JSON formats, with CSV being the main focus
* Tools to wrap R and other tools, including an example of how to build such a wrapper for Weka
* Tools for turning your command-line hacking into scalable and managed projects
* The small changes in mindset that are needed to get the most out of the command line for data science

Basically: If you want to do data science from end-to-end (get data, clean data, explore data, visualize, model and interpret data) and do all this from the command line this book is a very good place to start.

Pros:
Hands on, dozens of tools and examples, Vagrant box provided with everything pre-installed.

Cons:
The book is a little light at times; more info could be given on tools such as drake.

Some of the diagrams span across pages, and in others the differences between types are lost, since the colours aren't significantly different in black-and-white print. I find this quite disappointing.

All in all:

I enjoyed the book and made some real gains from reading it. In a world which has so many BIG solutions for BIG data, it's nice to get a book that provides small, flexible, solid tools that let you have fun with your data at the command line.

I'd recommend buying it if you want some hands-on fun with data at your desktop.
Steven Pennebaker
58 reviews
October 16, 2014
This is an excellent book. Thorough and clear, it has enough basic information for beginners but even intermediate and advanced users will pick up plenty of new tricks. When I've had to solve these types of problems in the past, I've leaned pretty heavily on AWK and, to a lesser extent, XSL (!). This book introduced me to a bunch of utilities that were new to me and reminded me of a few old friends I haven't used in years.
Ravi Sinha
310 reviews, 11 followers
December 29, 2014
Great compilation of well-known, not-so-well-known, and brand-new custom command-line tools for OSEMNing with your data. Reads like a good tutorial. Could use a little bit of refinement, e.g. some commands are used multiple times before they are explained in detail. I installed the tools natively, but you can also install a VM instead. In either case, clone the book's companion repository; it has all the data and the author-supplied command-line tools.
Arthur
96 reviews, 5 followers
August 12, 2014
It is a promising book, mostly for beginners, but an intermediate data scientist will find some good material to learn from or will be inspired to dig into some very advanced topics.
In general, the idea of having your own data science toolkit as a service is great!

The book is still in the making, so this is only my preliminary rating.
Dgg32
146 reviews, 6 followers
November 15, 2014
A good but short demonstration of using command-line tools to do data science. I learned quite a few new ideas from the book. Well worth reading.
230 reviews, 3 followers
March 28, 2016
It is a short but very useful book. Most of the commands are practical and can be used without a lot of adjustments. Highly recommended reading!