Various methods to ingest data into Google Colab

Sateesh Babu
6 min read · Feb 2, 2021

Make use of free analytical notebooks to practice AI algorithms in Python.


If you want to learn or practice Python-based AI algorithms without spending a dollar on upgrading your laptop or tablet, then a “Google Colab” notebook is the best option. Data ingestion and preparation are very important steps in a machine learning project.

Since the data comes from different places, it needs to be cleansed and transformed in a way that allows you to analyze it together with data from other sources.

Data ingestion is a process by which data is moved from one or more sources to a destination where it can be stored and further analyzed.

In this blog, we will explore various options for getting data into a cloud-based analytical notebook.

What’s Google Colab Notebook?

Google Colab (short for Colaboratory) is a free cloud-based Jupyter notebook environment from Google Research that allows you to:

1. Write and execute code in Python
2. Document your code with text that supports mathematical equations
3. Create, upload, and share notebooks
4. Import notebooks from and save them to Google Drive
5. Import notebooks from and publish them to GitHub
6. Import external datasets, e.g. from Kaggle
7. Integrate PyTorch, TensorFlow, Keras, and OpenCV
8. Use a free cloud service with free GPU and TPU

The availability of free GPU and TPU runtimes is really the best thing about Google Colab notebooks: it lets you train models in a matter of minutes or seconds. For more details on the CPU and GPU specifications, please refer to the Google Colab research link. There are a few minor setbacks: connecting to a GPU runtime sometimes fails, and R and Scala are not supported yet. So far, I have not faced any major issue while practicing my machine learning algorithms in Python. Another awesome feature is publishing notebooks to GitHub. This has really helped me maintain and access my code repository from anywhere.

Let’s visualize the big picture before we get started with our design.

Prerequisites:

Complete the following prerequisites before you explore the ingestion techniques.
A. Log in to Google Colab
B. Install and import Python packages
C. Mount your Google Drive

A. How to log in to Google Colab?

Follow these steps to create a new notebook or open an existing one:

Step 1: Sign in using your Gmail credentials.

Step 2: Click Open new notebook in the “File” menu.

Step 3: Rename the notebook.

B. Install and Import Python packages

Anaconda’s Jupyter Notebook ships with several pre-installed data libraries. On top of that, Google Colab provides even more pre-installed machine learning and deep learning libraries such as Keras, TensorFlow, and PyTorch. To install any additional library, run the “!pip install” command in a notebook cell.

Then import the packages you need into the Colab notebook.
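Here is a minimal sketch of the install-then-import step. The package names are only an illustrative assumption (the exact list used for this post is not reproduced here), and `missing_packages`, `pip_install_command`, and `install` are hypothetical helper names. In a Colab cell the usual shortcut is `!pip install <package>`; the plain-Python equivalent below runs pip through the interpreter that backs the notebook.

```python
import importlib.util
import subprocess
import sys


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pip_install_command(package):
    """Build the pip command for the interpreter that runs this notebook."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Install one package; same effect as `!pip install <package>` in a cell."""
    subprocess.check_call(pip_install_command(package))


# Illustrative list only (an assumption): modules a data notebook often needs.
wanted = ["pandas", "numpy", "sqlalchemy"]
print("Not installed yet:", missing_packages(wanted))
```

Calling `install(name)` for each entry that `missing_packages` reports would bring the runtime up to date before the imports run.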

C. Mounting your Google Drive

Google Drive can be accessed in multiple ways. One option is to mount your Google Drive in the runtime’s virtual machine.

After running the mount script, click on the link to retrieve the authorization code. Paste the code into the text box and press Enter.

Once you have finished, you can access your Google Drive files under “/content/gdrive/”.
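A minimal sketch of the mount step, assuming the mount point “/content/gdrive” mentioned above. `drive.mount` comes from the `google.colab` package, which exists only inside a Colab runtime, so the import is kept inside the function. `drive_path` is a hypothetical helper, and the “MyDrive” folder name is an assumption (older runtimes showed the same folder as “My Drive”).

```python
import posixpath

MOUNT_POINT = "/content/gdrive"  # same mount point as in the text above


def mount_drive(mount_point=MOUNT_POINT):
    """Mount Google Drive; this only works inside a Colab runtime."""
    from google.colab import drive  # available only on Colab

    drive.mount(mount_point)


def drive_path(*parts, root="MyDrive"):
    """Build a path below the mounted Drive.

    Assumption: personal files appear under "MyDrive"; older runtimes
    exposed the same folder as "My Drive".
    """
    return posixpath.join(MOUNT_POINT, root, *parts)


# mount_drive()  # run this line on Colab and follow the authorization prompt
print(drive_path("datasets", "housing.csv"))
```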

Data Ingestion methods:

There are various ways to get data into a Google Colab notebook. Here I have listed some of them.

  1. Import files from Google Drive
  2. Ingest data from GitHub
  3. Ingest UCI Machine Learning datasets from a web URL
  4. Ingest Kaggle datasets
  5. Import Data from Local Drive
  6. Import Data from Database

Remember: The method that you use, the file size of the data, and the file format can all have an impact on the ingestion and on query performance.

1. Import files from Google Drive

Depending on your use case, you can save the source dataset in Google Drive and then import it into your Colab notebooks.
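As a rough sketch of this option, assuming the Drive is already mounted under “/content/gdrive” and the data sits in a hypothetical “datasets” folder, the files can be listed with `pathlib` and read with pandas. `list_data_files` and `load_csv` are illustrative names, not part of any library.

```python
from pathlib import Path


def list_data_files(folder, pattern="*.csv"):
    """Return the files in `folder` that match `pattern`, sorted by name."""
    return sorted(Path(folder).glob(pattern))


def load_csv(path):
    """Read one CSV file into a pandas DataFrame."""
    import pandas as pd  # third-party; preinstalled on Colab

    return pd.read_csv(path)


# Hypothetical folder on the mounted Drive; adjust to your own layout.
data_dir = Path("/content/gdrive/MyDrive/datasets")
if data_dir.exists():
    for csv_file in list_data_files(data_dir):
        print(csv_file.name, load_csv(csv_file).shape)
```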

2. Ingest data from GitHub

I have downloaded the “Housing” dataset as per the instructions in Aurélien Géron’s book Hands-On Machine Learning with Scikit-Learn and TensorFlow.

To access a private GitHub repository, you have to provide a personal access token.
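One possible way to do this, sketched with the standard library plus pandas: build the raw-content URL of the file, attach a token only when the repository is private, and hand the response to `pd.read_csv`. The helper names are hypothetical, and the repository path for the housing data (`ageron/handson-ml`, `datasets/housing/housing.csv`) is my assumption about where the book’s public repository keeps the file.

```python
import urllib.request


def github_raw_url(owner, repo, branch, path):
    """Build the raw-content URL for a file in a GitHub repository."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"


def github_request(url, token=None):
    """Prepare the request; a token is only needed for private repositories."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"token {token}")
    return request


def load_github_csv(url, token=None):
    """Download a CSV file from GitHub into a pandas DataFrame."""
    import pandas as pd  # third-party; preinstalled on Colab

    with urllib.request.urlopen(github_request(url, token)) as response:
        return pd.read_csv(response)


# Assumed location of the housing data in the book's public repository.
HOUSING_URL = github_raw_url(
    "ageron", "handson-ml", "master", "datasets/housing/housing.csv"
)
print(HOUSING_URL)
# housing = load_github_csv(HOUSING_URL)                   # public repository
# private = load_github_csv(my_url, token="<your token>")  # private repository
```

Keeping the token out of the notebook itself (for example, reading it from a file on your Drive) avoids publishing it by accident when the notebook is shared.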

3. Ingest UCI Machine Learning datasets from the Web

You will also find amazing datasets on the UCI Machine Learning Repository. For example, I have used pandas to import the Iris dataset.
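A minimal sketch of the Iris import. The URL is the repository’s long-standing location for the raw file, and I am assuming it is still valid; the column names are my own labels, since the raw file has no header row.

```python
IRIS_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
)
# The raw file has no header row, so the column names are supplied here.
# These labels are an assumption, chosen for readability.
IRIS_COLUMNS = [
    "sepal_length", "sepal_width", "petal_length", "petal_width", "species",
]


def load_iris(source=IRIS_URL):
    """Read the Iris data from a URL, a local path, or a file-like object."""
    import pandas as pd  # third-party; preinstalled on Colab

    return pd.read_csv(source, header=None, names=IRIS_COLUMNS)


# iris = load_iris()  # downloads the file when run on Colab
# iris.head()
```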

4. Ingest Kaggle datasets

If you want to download a dataset through the Kaggle API, follow these steps. An important prerequisite is to save your Kaggle JSON file (kaggle.json, which holds your API key) in your Google Drive and then mount the drive in Google Colab.

Unzip the files in your current working directory and then delete the zip file. Use pandas to read the datasets for further analysis.
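The steps above could be sketched as follows, assuming the `kaggle` command-line tool is installed in the runtime and that kaggle.json sits in a Drive folder of your choice. The `KAGGLE_CONFIG_DIR` environment variable tells the tool where to look for the key. The function names, the Drive folder, and the dataset slug are placeholders.

```python
import os
import subprocess
import zipfile


def kaggle_download_command(dataset, target_dir="."):
    """Build the Kaggle CLI call that downloads a dataset as a zip file."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", target_dir]


def download_dataset(dataset, config_dir, target_dir="."):
    """Point the Kaggle CLI at the folder holding kaggle.json, then download."""
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    subprocess.check_call(kaggle_download_command(dataset, target_dir))


def unzip_and_remove(zip_path, target_dir="."):
    """Extract every file from the archive, then delete the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        names = archive.namelist()
    os.remove(zip_path)
    return names


# Placeholder values: replace with your own Drive folder and dataset slug.
# download_dataset("<owner>/<dataset>", "/content/gdrive/MyDrive/kaggle")
# print(unzip_and_remove("<dataset>.zip"))
```

`unzip_and_remove` returns the extracted file names, which makes it easy to pick the file to pass to `pd.read_csv`.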

5. Import Data from Local Drive

After importing the files module from google.colab, call files.upload() and choose files from your local directory. It returns a dictionary of the uploaded files, keyed by file name, with each file’s content as bytes. Use pandas to read the required file.

The files are also saved in the current working directory of the Colab runtime.
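A short sketch of the upload flow. `files.upload()` is the Colab call described above; `csv_names` and `read_uploaded_csv` are hypothetical helpers that pick the CSV files out of the returned dictionary and parse them from memory.

```python
import io


def upload_from_local():
    """Open Colab's file picker; returns {file name: file content as bytes}."""
    from google.colab import files  # available only on Colab

    return files.upload()


def csv_names(uploaded):
    """Return the names of the uploaded files that look like CSV files."""
    return sorted(name for name in uploaded if name.lower().endswith(".csv"))


def read_uploaded_csv(uploaded, name):
    """Parse one uploaded file from its in-memory bytes."""
    import pandas as pd  # third-party; preinstalled on Colab

    return pd.read_csv(io.BytesIO(uploaded[name]))


# On Colab:
# uploaded = upload_from_local()
# frames = {name: read_uploaded_csv(uploaded, name) for name in csv_names(uploaded)}
```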

6. Import Data from Database

SQLAlchemy provides a Pythonic way of interacting with databases. The approach is the same for all SQL databases, such as MySQL, Oracle, and PostgreSQL. The steps are as follows:

1. Create the connection string
2. Establish the connection through an “Engine”
3. Define and execute the SQL query
4. Use a fetch method such as “.fetchall()” to get the data

Execute these steps in sequence.
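The four steps could look roughly like this with SQLAlchemy. The credentials are made up to show the URL format, `build_connection_url` and `fetch_rows` are hypothetical helper names, and a real MySQL, Oracle, or PostgreSQL connection also needs the matching database driver installed.

```python
from urllib.parse import quote_plus


def build_connection_url(dialect, user, password, host, port, database):
    """Step 1: create the connection string (the password is URL-escaped)."""
    return f"{dialect}://{user}:{quote_plus(password)}@{host}:{port}/{database}"


def fetch_rows(connection_url, query):
    """Steps 2 to 4: create the engine, run the query, fetch all rows."""
    from sqlalchemy import create_engine, text  # third-party package

    engine = create_engine(connection_url)
    with engine.connect() as connection:
        result = connection.execute(text(query))
        return result.fetchall()


# Made-up credentials, shown only to illustrate the URL format.
print(build_connection_url("postgresql", "analyst", "s3cret!", "localhost", 5432, "sales"))
# rows = fetch_rows("sqlite://", "SELECT 1 AS answer")  # in-memory SQLite demo
```

The in-memory SQLite URL in the last comment is a convenient way to try the flow without a database server; the rows returned by `.fetchall()` can then be passed to pandas if a DataFrame is needed.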

Other Tips:

For Colab notebook tips and data ingestion scripts, please see the notebook on my GitHub.

Conclusion:

Finally, we have successfully created a Google Colab notebook within a matter of a few minutes. Based on your project requirements and data architecture set-up, you can apply the above data ingestion methods before you start practicing your machine learning algorithms (Python scripts).

Google Colab adds collaboration, free GPU and TPU, cloud features, and additional pre-installed ML libraries. With the above data ingestion methods, you can read the data from the original source or copy the datasets into your Google Drive for practice.

Google is really helping reduce the entry barrier into deep learning or running complex stacked machine learning models. So, make use of Colab notebooks.

I will keep updating this post if I explore and learn new ways of ingesting the datasets into Google Colab.

Hope this post has been helpful, stay safe!

The opinions expressed here represent my own and not those of my current or any previous employers.

Sateesh Babu

Sr. Data Architect | Solution Consultant. As a continuous learner, I am always curious about Machine Learning and AI.