A Vagrant Data Science Virtual Ubuntu 18.04 Box

This post is not about game programming, but about setting up a Linux virtual box loaded with data science tools.  The repository can be found here on GitHub.  Below is from the README on GitHub for this box.

Quick Start

  1. Download and install the free VirtualBox virtual machine player.
  2. Download and install the appropriate Vagrant package for your OS.
  3. Clone or unzip this repository somewhere. In this example, I assume you are on Windows and have put the files (including the Vagrantfile) in: D:\VagrantDataScience2018
  4. Open a command line (recommended: Terminal on Mac, PowerShell or Git Bash on Windows, your favorite terminal on Linux).
  5. cd into this project’s root folder. In this example, if using Git Bash, you would type: cd /d/VagrantDataScience2018
  6. Check that everything is in place so far by listing the directory. In Git Bash, that would be: ls -AF

and you should see something like:

$ ls -AF
.git/ bootstrap.sh*  howtos/  LICENSE  README.md  Vagrantfile

It is important that bootstrap.sh and Vagrantfile be there, as well as the howtos directory (or at least the README.md)
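If you would rather not eyeball the listing, the same check can be sketched as a small shell function. This is only an illustration: check_box_files is a hypothetical helper name of my own, not something that ships with the repository.

```shell
# Hypothetical helper (not part of the repository): report any file the
# box needs that is missing from the given directory (default: current one).
check_box_files() {
    dir=${1:-.}
    missing=0
    for f in Vagrantfile bootstrap.sh; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # The howtos directory should be there, or at least README.md.
    if [ ! -d "$dir/howtos" ] && [ ! -f "$dir/README.md" ]; then
        echo "missing: $dir/howtos (or README.md)" >&2
        missing=1
    fi
    return "$missing"
}

# Example use, from the project root:
#   check_box_files . && echo "ready for vagrant up"
```

The function returns non-zero when something is missing, so it can gate the next step with &&.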

  7. In the terminal, run: vagrant up

This will take some time, since a lot gets installed.

  8. If all goes well, you will see the howtos scroll by, and you can log in to the machine by typing: vagrant ssh

Setting Up a Few Things

In your guest OS that you just ssh’d into using “vagrant ssh”, you can find the howtos in the directory:

/vagrant/howtos

These give you quick instructions for setting up the Kaggle app and accessing the guest OS’s web, RStudio, and Jupyter notebook in your Host OS’s web browser.

The howtos are reproduced here for convenience:

Kaggle HOWTO

Set up an account on kaggle.com, and in your profile, generate an API key.

In Guest:

Place the key in ~/.kaggle/kaggle.json

chmod 600 ~/.kaggle/kaggle.json

kaggle -h
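The Kaggle steps above can be wrapped up roughly like this. It is a sketch under assumptions: install_kaggle_key is a hypothetical helper name, and the source path in the example assumes you dropped the downloaded kaggle.json into the project folder on the host so that it shows up under /vagrant in the guest.

```shell
# Hypothetical helper: copy a downloaded API key into place and make it
# readable by the owner only, as the HOWTO's chmod 600 step does.
install_kaggle_key() {
    src=$1                       # path to the downloaded kaggle.json (assumed)
    dest_dir="$HOME/.kaggle"
    mkdir -p "$dest_dir" &&
        cp "$src" "$dest_dir/kaggle.json" &&
        chmod 600 "$dest_dir/kaggle.json"
}

# Example use in the guest (the source path is an assumption):
#   install_kaggle_key /vagrant/kaggle.json && kaggle -h
```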

Jupyter Notebook HOWTO

In guest:

jupyter notebook --generate-config

jupyter notebook password

choose a password; it can’t be blank

jupyter notebook --ip 0.0.0.0

open browser on host, browse to localhost:8888, and type in the password you set if asked

RStudio HOWTO

open browser on host, browse to localhost:8787

username: vagrant

password: vagrant

Web HOWTO

in guest, put html files in /var/www/html

open browser on host, browse to localhost:8080
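To keep the three host-side addresses from the howtos in one place, here is a tiny lookup sketch. The function name service_url is hypothetical; the ports are the ones listed in the howtos above.

```shell
# Hypothetical helper: map a service name to the URL to open on the host.
service_url() {
    case "$1" in
        jupyter) echo "http://localhost:8888" ;;
        rstudio) echo "http://localhost:8787" ;;
        web)     echo "http://localhost:8080" ;;
        *)       echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Example use on the host, once the box is up (requires curl):
#   for s in jupyter rstudio web; do
#       curl -sI "$(service_url "$s")" | head -n 1
#   done
```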

What’s Included

  • Tools: figlet and toilet: for printing banners to the terminal
  • web server: apache2
  • Build Tools: build-essential gfortran gcc-multilib g++-multilib libffi-dev libffi6 libffi6-dbg python-crypto python-mox3 python-pil python-ply libssl-dev zlib1g-dev libbz2-dev libexpat1-dev libbluetooth-dev libgdbm-dev dpkg-dev quilt autotools-dev libreadline-dev libtinfo-dev libncursesw5-dev tk-dev blt-dev libssl-dev zlib1g-dev libbz2-dev libexpat1-dev libbluetooth-dev libsqlite3-dev libgpm2 mime-support netbase net-tools bzip2 p7zip unrar-free npm
  • R and R Studio: r-base r-base-dev gdebi-core rstudio-server-1.1.453-amd64.deb: a statistics and data science scripting language
  • Java: openjdk-11-doc openjdk-11-jdk openjdk-11-jdk-headless openjdk-11-jre openjdk-11-jre-headless
  • Python: python3-pip python3-all python3-all-dev python-all python-all-dev python-pip ipython ipython-notebook
  • Octave: a free Matlab replacement
  • Python packages: awscli bigmler csvkit numpy scipy nose skll matplotlib pandas numexpr tables openpyxl xlsxwriter xlrd feedparser beautifulsoup4 plotly statsmodels dataset pymongo nltk networkx deap pydot2 rpy2 jug nose
  • Jupyter: jupyter-core jupyter-notebook: Python web-based data-science notebook
  • Kaggle: for interacting with kaggle.com data science page
  • cowsay: for the heck of it
  • xml2json-command: for data wrangling
  • Update: added Pandas to Python3 packages.  Don’t know why I forgot to do that before.

Deplorable Mountaineer

Keeping Your Git Repo Local

Introduction

GitHub is a good place to keep your remote Git repositories, especially if you want to share them.  However, if you want to keep them private, it costs extra.  Also, GitHub has a limit on the sizes of files it can store.  You can partly get around this with Git LFS, though that has some convenience costs and the free version has limits.  An alternative would be to have an external hard drive, perhaps one connected to your network, and keep your “remote” Git repo local.

My Setup: An Example Case

Personally, I have an external drive connected through the network to my Windows 10 PC, mapped as my Z: drive, and I use it as a backup drive.  It is also a good place to store remote Git repos.

I also use Atlassian’s Sourcetree for my Git repo management.  This is a GUI front-end to Git that makes working with Git convenient.  It also allows you to use an internal Git system so you do not need to install Git separately.  This system comes with Git Bash, a convenient BASH shell that is useful as a command shell separately from using Git.

Let us suppose I have a project called MyProject that already has some files in it, and I want to maintain a separate Git repo.  It is located in the directory D:\Documents\Repos\MyProject.  You can follow along with your own project if you change the paths and names appropriately.

Keeping your Git repo local with Sourcetree and Git Bash

Stage and Commit the project locally

  1. Open up Sourcetree.
  2. We first want to turn the project into a Git repo.  To do this, open the File menu at the top and select the item Clone/New.
  3. Press the Create button (with a Plus (+) sign icon on it).
  4. Uncheck “Create Repository on Account” since we are going to use a “local” remote.
  5. Use the Browse button, or just type, to set the repo path to D:\Documents\Repos\MyProject (or whatever it is for your situation).
  6. Click the blue Create button.
  7. You get a popup asking if you want to create the Git repo in a directory that already exists.  Yes, do this.  It will not overwrite your information (unless you have a directory or file called “.git” in the directory already).
  8. The “Working Copy” of your repo will appear.  If it doesn’t, select Working Copy from the left sidebar menu.
  9. Now, click the Stage All button, somewhere near the middle of the window, at the top-right of the Unstaged Files panel.  (You will only be able to do this if there are files in your project.  If not, you can add a test file just to try out these steps.)  The file(s) will move to the upper Staged Files panel.
  10. Type in some commit message in the commit box at the bottom (such as, “Initial Commit”).  Then press the Commit button at the bottom-left of the window.

Your project has now been committed, but only locally, that is, within the newly created .git directory in your project directory.
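For those who prefer the terminal, the Sourcetree steps above correspond roughly to the commands below. This is a sketch, not what Sourcetree runs internally: it works in a throwaway temp directory standing in for the project folder, and the name and email passed to git are placeholders.

```shell
# Throwaway stand-in for D:\Documents\Repos\MyProject
project=$(mktemp -d)
echo "hello" > "$project/test.txt"     # a test file so there is something to commit
cd "$project" || exit 1

git init -q                            # Clone/New, then Create
git add -A                             # Stage All
# Placeholder identity so the commit works on a machine with no Git config.
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Initial Commit"      # Commit

commit_count=$(git rev-list --count HEAD)
echo "commits so far: $commit_count"
```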

Create the Remote on your Disk Drive

Now, we get to the real point of this post: creating the remote repository on your disk drive.  My setup is that the external networked hard drive is the Z drive, and I want to put the remote in a directory called Z:\Repos\MyProject.  As before, change this according to your actual situation.

  1. While still in Sourcetree with the Working Copy of MyProject still showing, click the Terminal button near the top-right of the window.
  2. In the Git Bash terminal window that appears (which should automatically have a working directory of /d/Documents/Repos/MyProject, or whatever your project working directory is), type the following to create the new repo (changing path and project name to what you need it to be): git init --bare /z/repos/MyProject

    Note that Git Bash uses /z/ instead of Z:\ for the root of the Z drive, and the directory slashes go forward instead of backward.  The command can be fast or slow depending on how fast your drive is.  If all goes well, you will see a response like “Initialized empty Git repository in Z:/repos/MyProject/”.

  3. While still in the Git Bash terminal window (and still in the same working directory), type the following to clone the new repo: git clone --bare /z/repos/MyProject

    If all goes well, you will see a message like “Cloning into bare repository ‘MyProject.git’…”.  Note that this makes a bare copy named MyProject.git inside the current directory; the project’s commits do not reach the remote until the push below.  It is OK if you get a message like “warning: You appear to have cloned an empty repository.”, because we still need to “push” the last commit to this remote repo.
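Since the Z:\ versus /z/ conversion is easy to get wrong by hand, here is a minimal sketch of it as a shell function. The name to_gitbash_path is hypothetical, and it only handles plain drive-letter paths like the ones in this post.

```shell
# Hypothetical helper: turn a Windows path such as Z:\Repos\MyProject
# into the Git Bash form /z/Repos/MyProject.
to_gitbash_path() {
    # First character is the drive letter; lower-case it.
    drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
    # Skip the "Z:" prefix and flip backslashes to forward slashes.
    rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
    printf '/%s%s\n' "$drive" "$rest"
}

# Example use (single quotes keep the backslashes literal):
#   to_gitbash_path 'Z:\Repos\MyProject'
```

As far as I know, Git Bash also comes with a cygpath utility that does this kind of conversion (cygpath -u), so treat the function above as an illustration of the rule rather than a replacement for it.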

Connect and Push the Local to the Remote

  1. In Sourcetree, select the Repository menu from the top menu and select the “Add Remote…” item.
  2. In the popup, with the Remotes tab active, click the Add button.
  3. For Remote name, check the Default Remote box.
  4. For Url/Path, type: /z/repos/MyProject

    (or the one that you used above)

  5. Ignore the Optional Extended Integration section
  6. Click OK, then click OK again.
  7. As a test, click Push to push your last commit (make a new commit first if you need to). Check the “Push?” checkbox, leave the local and remote branches as master, and leave “Track?” checked.
  8. Click the Push button at the bottom.
  9. Depending on the amount to push and the speed of the disk drive you push to, it might take a while. (My Z drive is slow, being an external networked drive.)
  10. It should finish with no errors: no news is good news!
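The whole round trip (local commit, bare “remote” on disk, add remote, push) can be tried out safely from the command line with something like the following. It is a sketch under assumptions: everything lives in a temp directory, with $project standing in for /d/Documents/Repos/MyProject and $remote for /z/repos/MyProject, the identity passed to git is a placeholder, and the remote is named origin.

```shell
work=$(mktemp -d)
project="$work/MyProject"            # stand-in for /d/Documents/Repos/MyProject
remote="$work/remote/MyProject"      # stand-in for /z/repos/MyProject
mkdir -p "$project" "$work/remote"
echo "hello" > "$project/test.txt"

cd "$project" || exit 1
git init -q
git add -A
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Initial Commit"

git init -q --bare "$remote"         # create the bare repo on the "external drive"
git remote add origin "$remote"      # the Add Remote step
branch=$(git symbolic-ref --short HEAD)   # master or main, depending on your Git
git push -q -u origin "$branch"      # Push, with tracking

# The remote should now point at the same commit as the local branch.
local_head=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$remote" rev-parse "refs/heads/$branch")
echo "local:  $local_head"
echo "remote: $remote_head"
```

If the two hashes printed at the end match, the push landed in the bare repo.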

Conclusion

You now know how to keep a Git repo local.  I considered showing how to set up a GitLab instance in a virtual machine, but that was quite involved, and there are some difficult issues with using networked drives and GitLab.