How to get started with data science in containers
2017-08-24 08:35
http://blog.kaggle.com/2016/02/05/how-to-get-started-with-data-science-in-containers/
The biggest impact on data science right now is not coming from a new algorithm or statistical method. It’s coming from Docker containers. Containers solve a bunch of tough problems simultaneously: they make it
easy to use libraries with complicated setups; they make your output reproducible; they make it easier to share your work; and they can take the pain out of the Python data science stack.
![](http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/02/Bell_jar_apparatus-270x300.jpg)
We use Docker containers at the heart of Kaggle Scripts. Playing around
with Scripts can give you a sense of what you can do with data science containers. But you can also put them to work on your own computer, and in this post I’ll explain how.
Why use containers?
Containers are like ultralight virtual machines. When you restore a normal VM from a snapshot it can take a minute or so to get going, but Docker containers start up in roughly a millisecond. So you can run something inside a container just like you’d run a native binary. Every time you restart the container, its execution environment is identical, which gives you reproducibility. And containers run identically on OS X, Windows and Linux, so collaborating and sharing becomes
much easier than before.
Personally, I think the best thing about containers is that they eliminate the pain of using Python for data science. R and Python are both great for statistics, each with its own strengths and weaknesses, but
one striking difference between them is in how they handle libraries and packages. R’s `install.packages()` mechanism works very smoothly, and conflicts between packages are rare. If you come across a new piece of work that uses a library you don’t have on your system, you can install it from CRAN and be underway in a few moments.
What a contrast with Python. In the Python world, a typical workflow would be something like this: notice that you need library `X`, so call `pip install X`, which also installs dependencies `A`, `B` and `C`. But `B` already exists on your system via `easy_install`, so `pip` cancels itself but only partially removes the new stuff, then `import B` refuses to work ever again. Or you discover that `C` relies on a later build of `numpy`, which you install, only to discover that libraries `Y` and `Z` are linked to an older `numpy` library that just got stomped on. And so on, and so on.
Python installations gradually accrete problems like this, with conflicts building up between libraries, and further conflicts between separate Python setups on the same system. The `virtualenv` system helps a little, but in my experience it just delays the crash. Eventually you reach a point where you have to completely reinstall Python from scratch. And that’s not to mention the hours you can spend getting a new library to work.
If you use Python in a container instead, all those problems vanish. You only have to invest time once in setting up the container: once the build is complete, you’re all set. In fact, if you use one of Kaggle’s
containers, you don’t need to worry about building anything at all. And you can try out new packages without any hassles, because as soon as you exit a container session, it resets itself to a pristine state.
What’s in them exactly?
To run Kaggle Scripts, we put together three Docker containers: `kaggle/rstats` has an R installation with all of CRAN and a dozen extra packages; `kaggle/julia` has a recent build of Julia 0.5 with a set of data science libraries installed; and `kaggle/python` is an Anaconda Python setup with a large set of libraries. To see the details of what’s inside, you can browse the Dockerfiles that are used to build them, which are all open source. We had to split them up into several parts so we could auto-build them on Docker Hub: here are links to Python part 1, part 2, part 3; rcran0 to 22, and rstats; and Julia part 1, part 2.
One side note: we only support Python 3. I mean come on, it’s 2016.
How to get started
Here’s a recipe for setting up the Python container locally. These exact steps are for OS X, but the Windows or Linux equivalents are easy to figure out if you RTFM.
Step one is to head over to the Docker site and install Docker on your system.
They’ve made the install process very easy, so that shouldn’t take more than the twinkling of an eye.
Step two: the default install creates a Linux VM to run your containers, but it’s quite small and struggles to handle a typical data science stack. So make a new one, which in this example I’ll call `docker2`. Obviously, you can tailor the disk-size, cpu-count and memory numbers for your system.
Step three: start it up.
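The shell commands for steps two and three didn’t survive the scrape. A reconstruction, assuming the Docker Toolbox-era `docker-machine` tool with the VirtualBox driver (the sizes below are illustrative, not the author’s exact values):

```shell
# Step two: create a roomier VM for data science images
docker-machine create -d virtualbox \
  --virtualbox-disk-size "50000" \
  --virtualbox-cpu-count "4" \
  --virtualbox-memory "8192" \
  docker2

# Step three: start it and point the current shell at it
docker-machine start docker2
eval "$(docker-machine env docker2)"
```

The last two lines are the pair to rerun from any new terminal window that can’t reach the Docker daemon.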
Later, if you open a new terminal window and Docker complains about "Cannot connect to the Docker daemon. Is the docker daemon running on this host?", then rerunning those two lines should sort it out.
Step four: pull the image you want to use.
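The pull command itself fell out of the page; for the Python image it would presumably be:

```shell
# Fetch the Kaggle Python image from Docker Hub
# (it is large — commenters below report roughly 15GB on disk)
docker pull kaggle/python
```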
You’re now at a point where you can run stuff in the container. Here’s an extra step that will make it super easy: put these lines in your `.bashrc` file (or the Windows equivalent).
Now you can use `kpython` as a replacement for calling `python`, `ikpython` instead of `ipython`, and run `kjupyter` to start a Jupyter notebook session. All of them will have immediate access to the complete data science stack that Kaggle assembled.
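The original snippet for these helpers was lost in extraction. A plausible reconstruction, pieced together from the flags described in the P.S. below and the working command Daniele posts in the comments (the original apparently used `--ip="*"`, which several commenters had to change to `--ip="0.0.0.0"`):

```shell
# Bash helper functions for ~/.bashrc (reconstruction; flags per the P.S. section)
kpython() {
  # run python inside the Kaggle container, mounted at the current directory
  docker run -v "$PWD":/tmp/working -w=/tmp/working --rm -it kaggle/python python "$@"
}

ikpython() {
  # interactive IPython session in the container
  docker run -v "$PWD":/tmp/working -w=/tmp/working --rm -it kaggle/python ipython
}

kjupyter() {
  # Jupyter notebook server on port 8888; commenters found --ip="0.0.0.0"
  # works where the original --ip="*" did not
  docker run -v "$PWD":/tmp/working -w=/tmp/working -p 8888:8888 --rm -it kaggle/python \
    jupyter notebook --no-browser --ip="0.0.0.0" --notebook-dir=/tmp/working
}
```

Treat this as a sketch: the function bodies follow the `-v`, `-w`, `--rm` and `-it` behaviour the P.S. documents, but the exact original text is gone.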
I hope you enjoy using these containers as much as I have. And let me just add one more plug for Kaggle
Scripts—it’s a great way to share ideas and show off what you’ve made.
P.S. Here’s some more detail on how the `.bashrc` entries work. The three commands are Bash functions. The syntax `docker run ... kaggle/python X` will execute command `X` inside the Kaggle Python container. You give the container session access to the directory that you’re currently in by adding `-v $PWD:/tmp/working`, and for convenience `-w=/tmp/working` makes the session start in that working directory. The `--rm` switch tidies up the container session after you exit. By default, Docker sessions hang around in case you want to do a post-mortem on them. Finally, `-it` means that the container’s stdin, stdout and stderr will be attached to your terminal. There are many other options that
you can use, but I’ve found those to be the most useful.
Jamie Hall is a data scientist and engineer at Kaggle. This article is cross-posted from his personal
blog.
Tags: docker, product, python
Pierre-Alain
Excellent post ! Thanks a lot.
It really *was* a pain to install a python data science stack.
Note : I had to change --ip="*" by --ip="0.0.0.0" in .bash_profile to make the kjupyter command work (Mac)
Badrul Alom
I did this, but now I'm getting an error when I try to launch kjupyter: Couldn't get a file descriptor referring to the console
and going to http://0.0.0.0:8888/ just tells Firefox couldn't connect
Fei Zhan
Awesome. Solved my problem as well.
Johnny Chan
In addition to the ip change (great thanks for this!), make sure `/tmp/working` exists. If not, create it with `mkdir /tmp/working`. Now when you run `kjupyter` you may copy and paste the url from console to
a browser: `The Jupyter Notebook is running at: http://0.0.0.0:8888/?token=xxxxxxxxxxxxxxx`.
(I notice that the auto browser pop-up does not include the token bit. You need to physically copy and paste the entire URL string with the token part into the browser).
Michał Wajszczuk
Thanks for insights about Docker!
I have a question: what is the size of the kaggle/python image? My SSD has limited space.
T. Morgan
For me, the image has proven to be about 15GB. It's huge.
Diego Menin
Hi, I'm confused about the "$PWD:/tmp/working -w=/tmp/working"; Where is that tmp/working folder supposed to be?, I couldn't find it anywhere. I imagine that's where the object on the starting page should live,
right?
Gabi Huiber
It seems to me that this is your present working directory in the Docker virtual environment. If this recipe worked for you, when you do 'pwd' you will still see your current pwd path on the host, and no /tmp/working
anywhere. But when you go to the kpython prompt, os.getcwd() will return /tmp/working.
Alex Telfar
Hmm. Don't know what I have done wrong, but I can't seem to get the Jupyter notebooks working in the docker container. When I run your command (kjupyter), I get
socket.gaierror: [Errno -2] Name or service not known
and it tries to take me to some random IP which fails.
I also tried launching it from within kaggle/python environment and i get
No web browser found: could not locate runnable browser.
Any pointers? (using mac and the other commands work fine...)
Dario Lopez Padial
I resolved it in kjupyter with --ip="0.0.0.0"
Samir
Do as per Pierre-Alain suggested for MAC user:
change --ip="*" by --ip="0.0.0.0" in .bash_profile to make the kjupyter command work (Mac)
Jenny Yu
Hi, I downloaded Docker Toolbox (my PC is Windows 7), and followed your example to pull the kaggle/python. I've tried multiple times, but it always freezes (see picture attached). Is there a way around this problem?
Thanks.
César Palma Morante
Did you solve this?
Jenny Yu
No, I didn't solve it. Still a problem.
Sergio Casca
It froze me once because the partition where I was storing the docker images ran out of free space. Hope it's the same simple case.
Adam Levin
Warning, if you have less than 8GB of ram on the machine you try to install this on, you are in for a wild ride.
D8amonk
Any windows users looking to add those commands, remember you've got to vim a .bashrc file with the above (last) snippet pasted in, and then also vim a .bash_profile containing the single line `. .bashrc` so
it gets run every time you open the docker quickstarter.
Andrey Akhmetov
Hey Guys! Was anybody able to run notebooks on Ubuntu/other linux?
Daniele
It's working for me changing --ip parameter:
docker run -v `pwd`:/tmp/working -w=/tmp/working -p 8888:8888 --name kaggle --rm -it kaggle/python jupyter notebook --no-browser --ip="0.0.0.0" --notebook-dir=/tmp/working
M. K.
Hi, Anyone knows how to access jupyter notebook once the connexion is launched? Since bashrc include --no-browser, I appreciate we need to launch the dashboard manually, but how exactly?
My prompt windows says 'The Jupyter Notebook is running at: http://0.0.0.0:888/'. But when I type this into my browser
(Chrome), it tells me it's not accessible. Any help would be greatly appreciated.
Please note:
- kpython and ikpython work fine
- I have Windows
- I have changed ip="*" by --ip="0.0.00" as suggested. Tried 127.0.0.0 as I thought 0.0.0.0 is a Mac-only address, but same issue
- prompt window message ends with "~/.bashrc: line 8: open: command not found" not sure if it's related to the --no-browser thing but thought it could help diagnostic what's wrong
John Zhu
This worked for me on Mac
John Zhu
FOR MACS:
from:
--ip="*"
to:
--ip="0.0.0.0" in .bash_profile
Shan Lin
The image that gets pulled locally doesn't contain any datasets. How do I retrieve a dataset from Kaggle?
Amit
Is there any instruction for setup in docker for mac?
Anneloes Louwe
Nice post! One question: I have TensorFlow installed and working on my (host) computer. However, when I run TensorFlow inside the kaggle container, it uses only CPU. Does anyone know how to fix this?
Vincent
I got the error "docker: Error response from daemon: invalid bind mount spec ..." on my Windows 10. Anyone knows how to solve the problem?
tanventure
Thanks for your notes, very interesting. Just want to let you know the links above: links to Python part 1, part 2, part 3; rcran0 to 22, and rstats; and Julia part 1, part 2, are all broken. Please take a look
and I am keen to read them.
Tanventure
Johnny Chan
Does it mean we need to store all notebooks and kaggle datasets under `/tmp/working`? (and what if the mac gets rebooted and `/tmp` gets flushed away? I'm keen to store both notebooks and datasets somewhere under
my local `$HOME` directory. The problem I'm facing is that within the kjupyter notebook environment I'm only allowed to "see" `/tmp/working` (i.e. can't get to my `$HOME` on the mac). Any top tip I would be very grateful!
Johnny
Chan
Ahhh... I have just solved the problem! The key is the current directory where you invoke the `kjupyter` command, e.g. if I invoke `kjupyter` at `/Users/johnny/kaggle`, then all subdirectories would be
"mapped" to `/tmp/working/` on the docker machine.
Andrew Nyago
I've been running `docker run --rm -it kaggle/rstats` for two days now (internet is slightly slow). I've got all the parts, but there's a file f0b24ff7f2aa that is currently at 6GB and doesn't show how much is left.
Can someone please tell me the maximum size of that file?