In this practical guide, you’ll learn how to leverage the power of the command line for doing data science. By combining small, yet powerful, command-line tools, you can quickly obtain, scrub, explore, and model your data. Even if you’re already comfortable processing data with R or Python, being able to integrate the command line into your existing workflow will make you a more efficient and productive data scientist.
* Learn essential concepts and built-in commands of the *nix command line
* Get started with your own Data Science Toolbox on Linux, Mac OS X, or Microsoft Windows
* Use classic command-line tools such as grep, sed, and awk
* Obtain data from websites, APIs, databases, and spreadsheets
* Parallelize and distribute data-intensive pipelines to remote machines, including AWS EC2
* Clean data in CSV, JSON, and XML/HTML formats using csvkit, jq, and scrape
* Apply dimensionality reduction, clustering, regression, and classification algorithms
* Visualize data and results from the command line using gnuplot and ggplot
* Turn Bash one-liners and existing Python and R code into reusable command-line tools
Jeroen Janssens, PhD, is a Senior Developer Relations Engineer at Posit, PBC. His expertise lies in visualizing data, implementing machine learning models, and building solutions using Python, R, JavaScript, and Bash. He’s passionate about open source and sharing knowledge. He’s the author of Python Polars: The Definitive Guide (O’Reilly, 2025) and Data Science at the Command Line (O’Reilly, 2021). Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. He lives with his wife and two kids in Rotterdam, the Netherlands.
I use and love the CLI on a daily basis, but the fact is that it is not suitable for most data analysis tasks.
Although there are some tools for working with data (a few of which are introduced in the book), you will sooner or later (and usually the sooner the better) end up redoing everything in Python/R...
The book also has not aged well...
csvkit has nowadays been replaced by xsv, and drake has not seen a commit since 2015 (and does not seem very useful anyway).
So what you get:
- a very basic intro to relevant Bash tools (curl, sed)
- some outdated tools like csvkit and drake
- some useless curiosities (feedgnuplot)
- some side-stepping of CLI tools in favor of higher-level tools (Python/R/Weka)
- a very good chapter on GNU parallel (sketched below)
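To give a flavor of what parallel does (my own minimal sketch, with hypothetical filenames, not an example from the book):

    # Compress every log file in the directory concurrently;
    # parallel runs one job per CPU core by default.
    $ parallel gzip ::: *.log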
The book provides an easy and simple route to basic data analysis tasks -- scrubbing and exploration. It will be useful to readers who 1) are interested in data analysis and just getting started, 2) have been using tools such as R and Python for data analysis and want simpler ways to scrub and explore data, or 3) are interested in improving their command-line chops in the context of data analysis. However, this is not the book from which to learn data analysis/science.
The author provides a virtualization-based image to try out the tools described in the book. This is great, especially for Windows users. However, it did make me wonder whether and how well the solutions would translate to Windows PowerShell, which has some nice features absent from Unix shells. Also, while the book mentions numerous tools, many of them are scripts (e.g., scrape, cols) created by the author, and I wish they were available as installable packages via systems such as apt and port.
While the book tries to cover data modeling, it leans quite a bit on R and Python for this purpose. Instead, I would have preferred that it explore more data scrubbing and exploration tricks using tools such as awk.
At <200 pages, the book is a short read, but it can also seem a bit light if you are command-line or Python/R savvy. That said, as a command-line user, I did learn about some new Unix shell tools/tricks, e.g., crush-tools, seq, parallel, csvsql from csvkit, and drake (make for data analysis).
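For example, csvsql lets you run SQL against a plain CSV file (a hedged sketch of my own, with hypothetical file and column names; the table name is derived from the filename):

    # Count rows per city in sales.csv and print the result as CSV.
    $ csvsql --query "SELECT city, COUNT(*) AS n FROM sales GROUP BY city" sales.csv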
Overall, a good introduction to the command-line based path to data scrubbing and exploration.
The first 40 pages unpack dozens of practical ideas and tools I had never previously considered. It introduced me to tools like jq and csvkit (which led me to xsv), and many more. It talks about why running data tasks with Unix shell tools can sometimes be significantly more efficient than writing the equivalent in a programming language (full parallelization with an entirely buffered pipeline).
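The point is that every stage of a Unix pipeline is its own process, so all stages run concurrently while the kernel buffers the data flowing between them. A rough illustration of my own (hypothetical file name, not taken from the book):

    # Decompression, filtering, cutting, sorting, and counting all
    # overlap in time; no stage waits for the previous one to finish.
    $ gunzip -c logs.csv.gz | grep 'ERROR' | cut -d, -f1 | sort | uniq -c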
After reading this book, I became significantly more efficient at researching and summarizing flat-file data. A lot of it is based on creating CLI tools from simple Python scripts. As a Rubyist, I found that it was all easily transferable to whatever scripting language you prefer.
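The pattern, roughly, is: read from stdin, write to stdout, add a shebang, and mark the script executable (a minimal sketch of my own; shout.py is a hypothetical name, not a tool from the book):

    $ cat shout.py
    #!/usr/bin/env python
    # Hypothetical example: uppercase whatever comes in on stdin.
    import sys
    for line in sys.stdin:
        sys.stdout.write(line.upper())
    $ chmod +x shout.py          # make it runnable as a command
    $ echo 'hello' | ./shout.py
    HELLO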
As a backend/infra engineer with a few years' experience and an introductory-level knowledge of machine learning algorithms, I still found this book sprinkled with useful snippets, tools, tips, and references. That being said, I found its structure odd (it seems like it would be better off as a cookbook) and thought it was confused about identifying and catering to its target audience: the early chapters are beginner-friendly (i.e., for readers familiar with data science but not the command line), but the latter ones are markedly less so.
Pretty cool book if you have not already settled into your own ways of doing things (there is a chapter on modelling data with good tools that I would not use, simply because I am much more used to different ones). If you already have your own habits, you can still learn quite a few things (at least I did) and get some inspiration to build your own command-line tools for data science!
This book is rather focused on the tools. That is good, but it also means I need to revisit the book as I explore each tool. The command-line tools it introduces are very interesting, and I will definitely adopt them. It just didn't bring in a lot of new ideas for me.
I used this book more as a guide to get familiar with the regular command-line workflow. I find IDEs a much better tool for data science than the command line, as they are more reproducible. But it's nice to know some of the shortcuts available in Bash scripts.
Not time-tested tools. The book mentions coreutils and other Unix tools fairly lightly, then spends much of its time on random tools that will become obsolete sooner or later (as we can see already).
Pick up Unix Power Tools or Classic Shell Scripting instead.
This book will really help you turn your command-line hacking into scalable and well-managed data projects.
This book isn't about BIG data; it's about getting hands-on with data on your desktop in a flexible, fast, and fun way. However, the author isn't asking you to give up Hadoop etc.; he's asking if you'd like another set of tools for another day.
The book is well structured, its flow and style are good, and it provides an easy read.
If you have no command-line experience there's a brief intro, but if you're a command-line veteran the book probably starts to get interesting around page 50, which gives you a core of around 100 pages. This is a bit short, but it's also densely packed.
The book covers several topics, including:
* Using and transforming data in plain-text/CSV/JSON formats, with CSV being the main focus
* Tools to wrap R and other tools, including an example of how to build such a wrapper for Weka
* Tools for turning your command-line hacking into scalable and managed projects
* The mini-changes in mindset needed to get the most out of the command line for data science
Basically: If you want to do data science end to end (get data, clean data, explore data, visualize, model, and interpret data) and do all of this from the command line, this book is a very good place to start.
Pros: Hands-on, dozens of tools and examples, and a Vagrant box provided with everything pre-installed.
Cons: The book is a little light at times; more info could be given on tools such as drake.
Some of the diagrams span across pages, and in others the distinction between types is lost because the colours aren't sufficiently different in black-and-white print. I find this quite disappointing.
All in all:
I enjoyed the book and have made some real gains from reading it. In a world with so many BIG solutions for BIG data, it's nice to get a book that provides small, flexible, solid tools that let you have fun with your data at the command line.
I'd recommend buying it if you want some hands-on fun with data at your desktop.
This is an excellent book. Thorough and clear, it has enough basic information for beginners but even intermediate and advanced users will pick up plenty of new tricks. When I've had to solve these types of problems in the past, I've leaned pretty heavily on AWK and, to a lesser extent, XSL (!). This book introduced me to a bunch of utilities that were new to me and reminded me of a few old friends I haven't used in years.
Great compilation of well-known, not-so-well-known, and brand-new custom command-line tools for OSEMNing your data. Reads like a good tutorial. Could use a little bit of refinement, e.g., some commands are used multiple times before they are explained in detail. I installed the tools natively, but you can also use a VM instead. In either case, clone the book's companion repository; it has all the data and the author-supplied command-line tools.
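Something like the following should work (the exact URL is my assumption; check the book's website for the canonical location):

    $ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git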
It is a promising book, mostly for beginners, but an intermediate data scientist will find some good material to learn from, or will be inspired to dig into some very advanced topics. In general, having your own data science toolkit as a service is a great idea!
The book is still in the making, so this is only my preliminary rating.