Three-Legged Side Projects: A Full-Stack Data Scientist’s Secret Weapon
Editor’s note: Ryan Day is a speaker for ODSC West this October 29th-31st. Be sure to check out his two talks there: “Creating APIs That Data Scientists Will Love with FastAPI, SQLAlchemy, and Pydantic” and “Using APIs in Data Science Without Breaking Anything.”
A full-stack data scientist’s range of expertise has to be quite broad and deeply technical — we are more likely to have a cloud or programming certification on our CV than a PhD. It sometimes feels like I am part data analyst, part software developer, and part cloud architect. It can be challenging at times to continue cultivating rich expertise in such a variety of disciplines. The technique I have found that works best is a three-legged side project.
The Three-Legged Side Project
Side projects are useful in any technical or data-related field. A side project is a self-directed project that you perform on your own time outside of your company’s network with tools that you control yourself. They’re an investment in your skills: evenings and weekends that you spend honing your craft and growing your career.
A three-legged side project focuses on the connections between three related tools or techniques. In a recent side project, I combined REST APIs, data pipelines, and relational databases. Another of my projects combined Jupyter notebooks, Parquet files, and data visualizations. Modern data science stacks require a lot of connectivity between frameworks and software packages, and three components are about the right number to require real integration between the pieces.
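The second of those combinations, for instance, can be exercised in a few notebook lines. Here is a sketch; the file name and columns are hypothetical, and it assumes pandas with a Parquet engine installed:

```python
import pandas as pd

# Load a local Parquet file and chart it inside a Jupyter notebook
df = pd.read_parquet("observations.parquet")
df.plot(x="date", y="value", title="Observations over time")
```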
Here’s a Venn diagram view of what one of those projects looked like:
Here’s how I’ve implemented this side project to practice those three legs: Apache Airflow for the data pipeline, a FastAPI REST API as the data source, and SQLite for the relational database it updates.
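To make that concrete, here is a minimal sketch of what the pipeline leg might look like. The endpoint URL, database file, and table schema are placeholders I’ve invented for illustration, and the DAG assumes Airflow 2.x with its TaskFlow API:

```python
import sqlite3

import pendulum
import requests
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def api_to_sqlite():
    @task
    def extract() -> list[dict]:
        # Pull JSON records from a (hypothetical) FastAPI endpoint
        resp = requests.get("http://localhost:8000/records", timeout=30)
        resp.raise_for_status()
        return resp.json()

    @task
    def load(records: list[dict]) -> None:
        # Upsert the records into a single-file SQLite database
        conn = sqlite3.connect("side_project.db")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, value TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO records (id, value) VALUES (:id, :value)",
            records,
        )
        conn.commit()
        conn.close()

    load(extract())


api_to_sqlite()
```

Each leg stays simple on its own; the learning comes from wiring them together.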
I follow a few soft guidelines in my projects:
- Stick to a single technology ecosystem. I generally stay within the Python ecosystem, which has a range of mature open-source frameworks for any data science task. If I mix and match tech stacks (Python with Java, for instance), it is too easy to get bogged down in configuration details.
- Go wide before going deep. Start by connecting basic “hello world” demonstrations in each of the technologies with the others. This minimum viable side project becomes a working shell that you can build on (see the sketch after this list).
- Use industry-standard tools. Ideally, the tools I learn will bring best-of-breed experience to my employer, and make my own skills valued in the marketplace.
- Develop in the open. Don’t keep your learning to yourself! Blog about it, and share your code with others in the industry (more about that later).
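As an example of that “working shell,” the API leg of the project above can begin as a single hard-coded endpoint. This hypothetical sketch serves the same /records route that the pipeline sketch consumes:

```python
from fastapi import FastAPI

app = FastAPI()


@app.get("/records")
def list_records() -> list[dict]:
    # Hard-coded sample data -- just enough to prove the legs connect
    return [{"id": 1, "value": "hello"}, {"id": 2, "value": "world"}]
```

Run it with uvicorn (`uvicorn main:app`), point the pipeline at it, and you have an end-to-end shell that you can deepen one leg at a time.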
Key Tools for Your Side Project
As I have created side projects over the years, I have developed several go-to technologies. These allow me to follow my soft guidelines and focus on what I am learning instead of the underlying tools:
- GitHub Public Repositories. A well-maintained GitHub public repo keeps my work organized and lets me move quickly. It also allows me to develop in the open and quickly share my work with other data scientists.
- Docker containers. Containerizing your code makes it much easier to experiment with multiple technologies without worrying about installing software in a local environment or maintaining complicated dependencies. Some software frameworks provide a dockerized installation, which is even cleaner than installing with pip or curl.
- SQLite databases. Many side projects require a database. SQLite is dead simple to use — it is a single file in your local filesystem or container — and is widely supported by other tools. Start with SQLite until you have a real need to step up to a client-server database such as PostgreSQL.
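To illustrate that simplicity, here is everything it takes to inspect the database that the hypothetical pipeline above writes:

```python
import sqlite3

# The entire database is this one file; there is no server to install or configure
conn = sqlite3.connect("side_project.db")
for row in conn.execute("SELECT id, value FROM records ORDER BY id"):
    print(row)
conn.close()
```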
The Best of All Possible Worlds: Development Containers
I have recently started using cloud Development Containers for all my side projects. According to the open-source Development Container Specification, “A development container allows you to use a container as a full-featured development environment. It can be used to run an application, to separate tools, libraries, or runtimes needed for working with a codebase, and to aid in continuous integration and testing.” So far I have used GitHub Codespaces, but there are other implementations available.
From the user’s perspective, a Dev Container is basically a powerful Linux system with Visual Studio Code running in the cloud and pre-loaded with many standard tools and libraries. For example, the default GitHub Codespace comes out of the box with Python, Docker, SQLite, and dozens of other tools. It also has a native connection to your GitHub repository.
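As a rough illustration, a minimal .devcontainer/devcontainer.json for a Python side project might look like the following; the image tag, feature, and post-create command are placeholders for your own setup:

```json
{
  "name": "side-project",
  "image": "mcr.microsoft.com/devcontainers/python:3.12",
  "features": {
    "ghcr.io/devcontainers/features/docker-in-docker:2": {}
  },
  "postCreateCommand": "pip install -r requirements.txt"
}
```

Commit that file to your repo, and anyone (including future you) can open the project in a ready-to-code environment.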
If you asked me a year ago what the best development machine for a data scientist was, I would have argued for my MacBook Air, which also has many of the “it just works” benefits that make side projects smooth. Now my answer has changed: the machine doesn’t matter if you’re running a Dev Container in the cloud.
Conclusion
One of the most exciting parts of data science is the speed at which the techniques and technology we use are advancing. Whether you are beginning a data science career or growing your skillset, self-directed side projects are one of the quickest ways to stay on your toes and keep learning.
To get a hands-on demonstration of this learning technique, join one of my virtual tutorials at ODSC West this year: “Creating APIs That Data Scientists Will Love” walks you through using FastAPI, SQLAlchemy, and Pydantic for API development, and “Using APIs in Data Science Without Breaking Anything” demonstrates advanced methods for consuming APIs in your data science projects, including how to create Python software development kits (SDKs).
About the Author:
Ryan Day is an advanced data scientist at the Conference of State Bank Supervisors (CSBS), a non-profit association in the financial services industry. He is an AWS-certified solutions architect and a member of the National Association of Business Economics. He is an experienced open-source developer who participates in the FastAPI project.
Ryan is currently writing a book titled Hands-On APIs for AI and Data Science that demonstrates the value of hands-on side projects. It will be published in April 2025 by O’Reilly Media.
Originally posted on OpenDataScience.com