
How to get started with data science in containers

http://blog.kaggle.com/2016/02/05/how-to-get-started-with-data-science-in-containers/

The biggest impact on data science right now is not coming from a new algorithm or statistical method. It’s coming from Docker containers. Containers solve a bunch of tough problems simultaneously: they make it easy to use libraries with complicated setups; they make your output reproducible; they make it easier to share your work; and they can take the pain out of the Python data science stack.


We use Docker containers at the heart of Kaggle Scripts. Playing around
with Scripts can give you a sense of what you can do with data science containers. But you can also put them to work on your own computer, and in this post I’ll explain how.


Why use containers?

Containers are like ultralight virtual machines. When you restore a normal VM from a snapshot it can take a minute or so to get going, but Docker containers start up in roughly a millisecond. So you can run something
inside a container just like you’d run a native binary. Every time you restart the container, its execution environment is identical, which gives you reproducibility. And containers run identically on OS X, Windows and Linux, so collaborating and sharing becomes
much easier than before.
Personally, I think the best thing about containers is that they eliminate the pain of using Python for data science. R and Python are both great for statistics, each with its own strengths and weaknesses, but
one striking difference between them is in how they handle libraries and packages. R’s `install.packages()` mechanism works very smoothly, and conflicts between packages are rare. If you come across a new piece of work that uses a library you don’t have on your system, you can install it from CRAN and be underway in a few moments.
What a contrast with Python. In the Python world, a typical workflow would be something like this: notice that you need library `X`, so call `pip install X`, which also installs dependencies `A`, `B` and `C`. But `B` already exists on your system via `easy_install`, so `pip` cancels itself but only partially removes the new stuff, then `import B` refuses to work ever again. Or you discover that `C` relies on a later build of `numpy`, which you install, only to discover that libraries `Y` and `Z` are linked to an older `numpy` library that just got stomped on. And so on, and so on.
Python installations gradually accrete problems like this, with conflicts building up between libraries, and further conflicts between separate Python setups on the same system. The `virtualenv` system helps a little, but in my experience it just delays the crash. Eventually you reach a point where you have to completely reinstall Python from scratch. And that’s not to mention the hours you can spend getting a new library to work.
If you use Python in a container instead, all those problems vanish. You only have to invest time once in setting up the container: once the build is complete, you’re all set. In fact, if you use one of Kaggle’s
containers, you don’t need to worry about building anything at all. And you can try out new packages without any hassles, because as soon as you exit a container session, it resets itself to a pristine state.


What’s in them exactly?

To run Kaggle Scripts, we put together three Docker containers: `kaggle/rstats` has an R installation with all of CRAN and a dozen extra packages, `kaggle/julia` has a recent build of Julia 0.5 with a set of data science libraries installed, and `kaggle/python` is an Anaconda Python setup with a large set of libraries. To see the details of what’s inside, you can browse the Dockerfiles that are used to build them, which are all open source. We had to split them up into several parts so we could auto-build them on Docker Hub: here are links to Python part 1, part 2, part 3; rcran0 to 22, and rstats; and Julia part 1, part 2.
One side note: we only support Python 3. I mean come on, it’s 2016.


How to get started

Here’s a recipe for setting up the Python container locally. These exact steps are for OS X, but the Windows or Linux equivalents are easy to figure out if you rtfm.
Step one is to head over to the Docker site and install Docker on your system.
They’ve made the install process very easy, so that shouldn’t take more than the twinkling of an eye.
Step two: the default install creates a Linux VM to run your containers, but it’s quite small and struggles to handle a typical data science stack. So make a new one, which in this example I’ll call `docker2`.
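The shell snippet for this step did not survive in this copy of the post. A sketch of the `docker-machine` call it likely showed, with example sizes (the exact numbers in the original may have differed):

```shell
# Create a larger VM named docker2 for running containers.
# Disk, CPU and memory values are examples; tune them for your system.
docker-machine create -d virtualbox \
  --virtualbox-disk-size "50000" \
  --virtualbox-cpu-count "4" \
  --virtualbox-memory "8192" \
  docker2
```

This requires Docker Toolbox (which bundles `docker-machine` and VirtualBox) to be installed.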

Obviously, you can tailor the `disk-size`, `cpu-count` and `memory` numbers for your system. Step three: start it up.
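The two start-up lines referenced just below are also missing from this copy; they presumably looked something like:

```shell
# Boot the VM, then point the docker CLI in this shell at it
docker-machine start docker2
eval $(docker-machine env docker2)
```

Rerunning these two lines in any new terminal window connects that shell to the VM.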

Later, if you open a new terminal window and Docker complains about "Cannot connect to the Docker daemon. Is the docker daemon running on this host?", then rerunning those two lines should sort it out.
Step four: pull the image you want to use.
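The pull command itself is missing here; for the Python image it would be:

```shell
# Download the Kaggle Python image.
# Note: a commenter below reports it unpacks to roughly 15GB on disk.
docker pull kaggle/python
```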

You’re now at a point where you can run stuff in the container. Here’s an extra step that will make it super easy: put these lines in your `.bashrc` file (or the Windows equivalent):
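The `.bashrc` snippet has been lost from this copy. Reconstructed from the flags the P.S. below describes (`-v $PWD:/tmp/working`, `-w=/tmp/working`, `--rm`, `-it`) and the options quoted in the comments (`-p 8888:8888`, `--no-browser`, `--ip="*"`, `--notebook-dir=/tmp/working`, the macOS `open` command), the three functions plausibly looked like this:

```shell
# Run python / ipython / jupyter inside the Kaggle container,
# with the current host directory mounted at /tmp/working.
kpython() {
  docker run -v $PWD:/tmp/working -w=/tmp/working --rm -it kaggle/python python "$@"
}

ikpython() {
  docker run -v $PWD:/tmp/working -w=/tmp/working --rm -it kaggle/python ipython
}

kjupyter() {
  # Open a browser tab once the server has had a moment to start
  # ('open' is macOS-only; Windows/Linux users need an equivalent).
  (sleep 3 && open "http://$(docker-machine ip docker2):8888") &
  docker run -v $PWD:/tmp/working -w=/tmp/working -p 8888:8888 --rm -it \
    kaggle/python jupyter notebook --no-browser --ip="*" --notebook-dir=/tmp/working
}
```

As several commenters note below, on some setups `--ip="*"` must be changed to `--ip="0.0.0.0"` for `kjupyter` to work.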

Now you can use `kpython` as a replacement for calling `python`, `ikpython` instead of `ipython`, and run `kjupyter` to start a Jupyter notebook session. All of them will have immediate access to the complete data science stack that Kaggle assembled.
I hope you enjoy using these containers as much as I have. And let me just add one more plug for Kaggle
Scripts—it’s a great way to share ideas and show off what you’ve made.
P.S. Here’s some more detail on how the `.bashrc` entries work. The three commands are Bash functions. The syntax `docker run ... kaggle/python X` will execute command `X` inside the Kaggle Python container. You give the container session access to the directory that you’re currently in by adding `-v $PWD:/tmp/working`, and for convenience `-w=/tmp/working` makes the session start in that working directory. The `--rm` switch tidies up the container session after you exit. By default, Docker sessions hang around in case you want to do a post-mortem on them. Finally, the `-it` means that the container’s stdin, stdout and stderr will be attached to your terminal. There are many other options that you can use, but I’ve found those to be the most useful.
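Putting those flags together, a one-off command in the container (assuming the image has been pulled) might look like:

```shell
# Run a single Python command inside the Kaggle container,
# with the current host directory available as /tmp/working
docker run -v $PWD:/tmp/working -w=/tmp/working --rm -it kaggle/python \
  python -c "import numpy; print(numpy.__version__)"
```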
Jamie Hall is a data scientist and engineer at Kaggle. This article is cross-posted from his personal
blog.

Tags: docker, product, python

Pierre-Alain

Excellent post! Thanks a lot.

It really *was* a pain to install a python data science stack.
Note: I had to change --ip="*" to --ip="0.0.0.0" in .bash_profile to make the kjupyter command work (Mac)

Badrul Alom

I did this, but now I'm getting an error when I try to launch kjupyter: Couldn't get a file descriptor referring to the console

and going to http://0.0.0.0:8888/ just tells Firefox couldn't connect

Fei Zhan

Awesome. Solved my problem as well.

Johnny
Chan

In addition to the ip change (great thanks for this!), make sure `/tmp/working` exists. If not, create it with `mkdir /tmp/working`. Now when you run `kjupyter` you may copy and paste the url from console to
a browser: `The Jupyter Notebook is running at: http://0.0.0.0:8888/?token=xxxxxxxxxxxxxxx`.
(I notice that the auto browser pop-up does not include the token bit. You need to physically copy and paste the entire URL string with the token part to the browser).

Michał Wajszczuk

Thanks for the insights about Docker!
I have a question: what is the size of the kaggle/python image? My SSD has limited space.

T.
Morgan

For me, the image has proven to be about 15GB. It's huge.

Diego Menin

Hi, I'm confused about the "$PWD:/tmp/working -w=/tmp/working"; where is that /tmp/working folder supposed to be? I couldn't find it anywhere. I imagine that's where the object on the starting page should live, right?

Gabi
Huiber

It seems to me that this is your present working directory in the Docker virtual environment. If this recipe worked for you, when you do 'pwd' you will still see your current pwd path on the host, and no /tmp/working
anywhere. But when you go to the kpython prompt, os.getcwd() will return /tmp/working.

Alex Telfar

Hmm. Don't know what I have done wrong, but I can't seem to get the Jupyter notebooks working in the docker container. When I run your command (kjupyter), I get
socket.gaierror: [Errno -2] Name or service not known
and it tries to take me to some random IP which fails.
I also tried launching it from within the kaggle/python environment and I get
No web browser found: could not locate runnable browser.
Any pointers? (using mac and the other commands work fine...)

Dario Lopez Padial

I resolved it in kjupyter with --ip="0.0.0.0"

Samir

Mac users, do as Pierre-Alain suggested:
change --ip="*" to --ip="0.0.0.0" in .bash_profile to make the kjupyter command work (Mac)

Jenny Yu

Hi, I downloaded Docker Toolbox (my PC is Windows 7), and followed your example to pull the kaggle/python. I've tried multiple times, but it always freezes (see picture attached). Is there a way around this problem?
Thanks.

César Palma Morante

Did you solve this?

Jenny Yu

No, I didn't solve it. Still a problem.

Sergio Casca

It froze for me once because the partition where I was storing the Docker images ran out of free space. Hope it's the same simple case.

Adam Levin

Warning, if you have less than 8GB of ram on the machine you try to install this on, you are in for a wild ride.

D8amonk

Any windows users looking to add those commands, remember you've got to vim a .bashrc file with the above (last) snippet pasted in, and then also vim a .bash_profile containing the single line `. .bashrc` so
it gets run every time you open the docker quickstarter.

Andrey Akhmetov

Hey Guys! Was anybody able to run notebooks on Ubuntu/other linux?

Daniele

It's working for me changing --ip parameter:
docker run -v `pwd`:/tmp/working -w=/tmp/working -p 8888:8888 --name kaggle --rm -it kaggle/python jupyter notebook --no-browser --ip="0.0.0.0" --notebook-dir=/tmp/working

M. K.

Hi, anyone know how to access the Jupyter notebook once the connection is launched? Since .bashrc includes --no-browser, I appreciate we need to launch the dashboard manually, but how exactly?

My prompt windows says 'The Jupyter Notebook is running at: http://0.0.0.0:888/'. But when I type this into my browser
(Chrome), it tells me it's not accessible. Any help would be greatly appreciated.

Please note:

- kpython and ikpython work fine

- I have Windows

- I have changed ip="*" by --ip="0.0.00" as suggested. Tried 127.0.0.0 as I thought 0.0.0.0 is a Mac-only address, but same issue

- prompt window message ends with "~/.bashrc: line 8: open: command not found" not sure if it's related to the --no-browser thing but thought it could help diagnostic what's wrong

John Zhu

This worked for me on Mac

John Zhu

FOR MACS:
from:

--ip="*"
to:

--ip="0.0.0.0" in .bash_profile

Shan Lin

The image that gets pulled locally doesn't contain any datasets. How do I retrieve a dataset from Kaggle?

Amit

Are there any instructions for setting this up with Docker for Mac?

Anneloes Louwe

Nice post! One question: I have TensorFlow installed and working on my (host) computer. However, when I run TensorFlow inside the kaggle container, it uses only CPU. Does anyone know how to fix this?

Vincent

I got the error "docker: Error response from daemon: invalid bind mount spec ..." on my Windows 10. Anyone knows how to solve the problem?

tanventure

Thanks for your notes, very interesting. Just want to let you know the links above (links to Python part 1, part 2, part 3; rcran0 to 22, and rstats; and Julia part 1, part 2) are all broken. Please take a look; I am keen to read them.
Tanventure

Johnny
Chan

Does it mean we need to store all notebooks and kaggle datasets under `/tmp/working`? (and what if the mac gets rebooted and `/tmp` gets flushed away? I'm keen to store both notebooks and datasets somewhere under
my local `$HOME` directory. The problem I'm facing is that within the kjupyter notebook environment I'm only allowed to "see" `/tmp/working` (i.e. can't get to my `$HOME` on the mac). Any top tip I would be very grateful!

Johnny
Chan

Ahhh... I have just solved the problem! The key is the current directory where you invoke the `kjupyter` command. E.g. if I invoke `kjupyter` at `/Users/johnny/kaggle`, then all subdirectories would be "mapped" to `/tmp/working/` on the docker machine.

Andrew Nyago

I've been running docker run --rm -it kaggle/rstats for two days now (internet is slightly slow). I've got all the parts but there's a file f0b24ff7f2aa that is currently at 6GB and doesn't show how much is left.
Can someone please inform me of the maximum size of that file?
