In this practical guide, you’ll learn how to leverage the power of the command line for doing data science. By combining small, yet powerful, command-line tools, you can quickly obtain, scrub, explore, and model your data. Even if you’re already comfortable processing data with R or Python, being able to integrate the command line into your existing workflow will make you a more efficient and productive data scientist.
* Learn essential concepts and built-in commands of the *nix command line
* Get started with your own Data Science Toolbox on Linux, Mac OS X, or Microsoft Windows
* Use classic command-line tools such as grep, sed, and awk
* Obtain data from websites, APIs, databases, and spreadsheets
* Parallelize and distribute data-intensive pipelines to remote machines, including AWS EC2
* Clean data in CSV, JSON, and XML/HTML formats using csvkit, jq, and scrape
* Apply dimensionality reduction, clustering, regression, and classification algorithms
* Visualize data and results from the command line using gnuplot and ggplot
* Turn Bash one-liners and existing Python and R code into reusable command-line tools
Jeroen Janssens, PhD, is a Senior Developer Relations Engineer at Posit, PBC. His expertise lies in visualizing data, implementing machine learning models, and building solutions using Python, R, JavaScript, and Bash. He’s passionate about open source and sharing knowledge. He’s the author of Python Polars: The Definitive Guide (O’Reilly, 2025) and Data Science at the Command Line (O’Reilly, 2021). Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. He lives with his wife and two kids in Rotterdam, the Netherlands.
I use and love the CLI on a daily basis, but the fact is that it is not suitable for most data analysis tasks.
Although there are some tools for working with data (a few of which are introduced in the book), you will sooner or later (and usually the sooner the better) end up redoing everything in Python/R...
The book also has not aged well...
csvkit has nowadays been replaced by xsv, and drake has not seen a commit since 2015 (and does not seem very useful anyway).
So what you get:
- a very basic intro to relevant Bash tools (curl, sed)
- some outdated tools like csvkit and drake
- some useless curiosities (feedgnuplot)
- some side-stepping of CLI tools in favor of higher-level tools (Python/R/Weka)
- a very good chapter on GNU parallel (sketched below)
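To give a flavor of what parallel does (my own minimal sketch, with hypothetical filenames, not an example from the book):

    # Compress every log file in the directory concurrently;
    # parallel runs one job per CPU core by default.
    $ parallel gzip ::: *.log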
The book provides an easy and simple route to basic data analysis tasks -- scrubbing and exploration. It will be useful to readers who 1) are interested in data analysis and just getting started, 2) have been using tools such as R and Python for data analysis and want simpler ways to scrub and explore data, or 3) are interested in improving their command-line chops in the context of data analysis. However, this is not the book from which to learn data analysis/science.
The author provides a virtualization-based image to try out the tools described in the book. This is great, especially for Windows users. However, it did make me wonder whether and how well the solutions would translate to Windows PowerShell, which has some nice features absent from Unix shells. Also, while the book mentions numerous tools, many of them are scripts (e.g., scrape, cols) created by the author, and I wish they were available as installable packages via systems such as apt and port.
While the book tries to cover data modeling, it leans quite a bit on R and Python for this purpose. Instead, I would have preferred that it explore more data scrubbing and exploration tricks using tools such as awk.
At <200 pages, the book is a short read, but it can also seem a bit light if you are command-line or Python/R savvy. That said, as a command-line user, I did learn about some new Unix shell tools/tricks, e.g., crush-tools, seq, parallel, csvsql from csvkit, and drake (make for data analysis).
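For example, csvsql lets you run SQL against a plain CSV file (a hedged sketch of my own, with hypothetical file and column names; the table name is derived from the filename):

    # Count rows per city in sales.csv and print the result as CSV.
    $ csvsql --query "SELECT city, COUNT(*) AS n FROM sales GROUP BY city" sales.csv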
Overall, a good introduction to the command-line based path to data scrubbing and exploration.
The first 40 pages unpack dozens of practical ideas and tools I had never previously considered. It introduced me to tools like jq and csvkit (which led me to xsv), and many more. It talks about why running data tasks with Unix shell tools can sometimes be significantly more efficient than writing the equivalent in a programming language (full parallelization with an entirely buffered pipeline).
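The point is that every stage of a Unix pipeline is its own process, so all stages run concurrently while the kernel buffers the data flowing between them. A rough illustration of my own (hypothetical file name, not taken from the book):

    # Decompression, filtering, cutting, sorting, and counting all
    # overlap in time; no stage waits for the previous one to finish.
    $ gunzip -c logs.csv.gz | grep 'ERROR' | cut -d, -f1 | sort | uniq -c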
After reading this book, I became significantly more efficient at researching and summarizing flat-file data. A lot of it is based on creating CLI tools from simple Python scripts. As a Rubyist, I found that it was all easily transferable to whatever scripting language you prefer.
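The pattern, roughly, is: read from stdin, write to stdout, add a shebang, and mark the script executable (a minimal sketch of my own; shout.py is a hypothetical name, not a tool from the book):

    $ cat shout.py
    #!/usr/bin/env python
    # Hypothetical example: uppercase whatever comes in on stdin.
    import sys
    for line in sys.stdin:
        sys.stdout.write(line.upper())
    $ chmod +x shout.py          # make it runnable as a command
    $ echo 'hello' | ./shout.py
    HELLO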
As a backend/infra engineer with a few years' experience and an introductory-level knowledge of machine learning algorithms, I still found this book sprinkled with useful snippets, tools, tips, and references. That being said, I found its structure odd (it seems like it would be better off as a cookbook) and thought it was confused about identifying and catering to its target audience: the early chapters are beginner-friendly (i.e., for readers familiar with data science but not the command line), but the latter ones are markedly less so.
Pretty cool book if you have not already settled into your own ways of doing things (there is a chapter on modelling data with good tools that I would not use, simply because I am much more used to different ones). If you already have your own habits, you can still learn quite a few things (at least I did) and get some inspiration to build your own command-line tools for data science!
This book is rather focused on the tools. That is good, but it also means I need to revisit the book as I explore each tool. The command-line tools it introduces are very interesting, and I will definitely adopt them. It just didn't bring in a lot of new ideas for me.
I used this book more as a guide to get familiar with the regular command-line workflow. I find IDEs a much better tool for data science than the command line, as they are more reproducible. But it's nice to know some of the shortcuts available in Bash scripts.
Not time-tested tools. The book mentions coreutils and other Unix tools fairly lightly, then spends much of its time on random tools that will become obsolete sooner or later (as we can see already).
Pick up Unix Power Tools or Classic Shell Scripting instead.
This book will really help you turn your command-line hacking into scalable and well-managed data projects.
This book isn't about BIG data; it's about getting hands-on with data on your desktop in a flexible, fast, and fun way. However, the author isn't asking you to give up Hadoop etc.; he's asking if you'd like another set of tools for another day.
The book is well structured, its flow and style are good, and it provides an easy read.
If you have no command-line experience there's a brief intro, but if you're a command-line veteran the book probably starts to get interesting around page 50, which gives you a core of around 100 pages. This is a bit short, but it's also densely packed.
The book covers several topics, including:
* Using and transforming data in plain-text/CSV/JSON formats, with CSV being the main focus
* Tools to wrap R and other tools, including an example of how to build such a wrapper for Weka
* Tools for turning your command-line hacking into scalable and managed projects
* The mini-changes in mindset needed to get the most out of the command line for data science
Basically: If you want to do data science end to end (get data, clean data, explore data, visualize, model, and interpret data) and do all of this from the command line, this book is a very good place to start.
Pros: Hands-on, dozens of tools and examples, and a Vagrant box provided with everything pre-installed.
Cons: The book is a little light at times; more info could be given on tools such as drake.
Some of the diagrams span across pages, and in others the distinction between types is lost because the colours aren't sufficiently different in black-and-white print. I find this quite disappointing.
All in all:
I enjoyed the book and have made some real gains from reading it. In a world with so many BIG solutions for BIG data, it's nice to get a book that provides small, flexible, solid tools that let you have fun with your data at the command line.
I'd recommend buying it if you want some hands-on fun with data at your desktop.
This is an excellent book. Thorough and clear, it has enough basic information for beginners but even intermediate and advanced users will pick up plenty of new tricks. When I've had to solve these types of problems in the past, I've leaned pretty heavily on AWK and, to a lesser extent, XSL (!). This book introduced me to a bunch of utilities that were new to me and reminded me of a few old friends I haven't used in years.
Great compilation of well-known, not-so-well-known, and brand-new custom command-line tools for OSEMNing your data. Reads like a good tutorial. Could use a little bit of refinement, e.g., some commands are used multiple times before they are explained in detail. I installed the tools natively, but you can also use a VM instead. In either case, clone the book's companion repository; it has all the data and the author-supplied command-line tools.
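Something like the following should work (the exact URL is my assumption; check the book's website for the canonical location):

    $ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git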
It is a promising book, mostly for beginners, but an intermediate data scientist will find some good material to learn from, or will be inspired to dig into some very advanced topics. In general, having your own data science toolkit as a service is a great idea!
The book is still in the making, so this is only my preliminary rating.