5 Tools for Getting Started with Data Science on GitHub
Depending on who you ask, the definition of “data scientist” can vary from “Excel expert” to “deep learning engineer” to “MLOps practitioner” — working individually, or as part of a team. Given this broad spectrum of software engineering experience, it can be challenging for data scientists to ensure that their models and experiments are brought into production safely and sustainably. GitHub can help data scientists with their full end-to-end data science lifecycle, as they track and version control both data and code, reproduce experiments, collaborate effectively with their team members, and deploy models to production.
Below are five tools on GitHub that can help accelerate your machine learning development process:
VS Code extensions
First up, we have Visual Studio Code and its extension marketplace. VS Code is a free, lightweight code editor that was built with extensibility in mind: from the UI to the editing experience, almost every part of VS Code can be customized and enhanced. Below is a subset of my favorite extensions and features of the VS Code IDE, but make sure to check out the marketplace for thousands more:
– Python: This extension provides Python language IntelliSense, linting, debugging, code navigation, code formatting, refactoring, variable and test exploration, and much more.
– SQL Tools: This database explorer is a collection of community-managed extensions that offer support for many common relational databases, including MySQL, SQLite, PostgreSQL, MariaDB, Microsoft SQL Server, and many more.
– Draw.io: An extension that lets you view and edit rich diagrams directly within the editor.
– Live Share: real-time collaborative editing within VS Code (either local, or via the browser).
– GitHub Pull Requests: allows you to review and manage GitHub pull requests and issues in Visual Studio Code, including authenticating and connecting to GitHub; listing and browsing PRs from within VS Code; in-editor commenting, and more.
– Source Control Management: perhaps my favorite feature in VS Code. If you’re not a fan of git via the command line, this feature gives you a way to merge changes and create graphics locally.
github.dev
If you are browsing any repo on github.com, pressing the . key on your keyboard will immediately launch you into github.dev: a browser-based editing environment for GitHub. This browser-based IDE gives you a quick way to edit and navigate code, and is especially useful if you want to edit multiple files at a time, or if you want to take advantage of all of the powerful code editing features of Visual Studio Code when making a change.
Many of the VS Code extensions listed in the previous section are web-enabled, and you can even use specialized compute within the browser. Personally, I have used github.dev with the Pyodide extension both for demos, and to run Python courses using the data science stack: it’s a painless way to create a free, transient Python scratch-pad.
GitHub Codespaces
GitHub Codespaces provides cloud-powered development environments for any activity, whether it’s a long-term project or a short-term task like reviewing a pull request or testing a small change. You can work with Codespaces instances in VS Code locally, or in a browser-based editing environment directly from any GitHub repo. Even better, all of the extensions for VS Code automatically work in Codespaces.
You can either use the out-of-the-box Codespace environment, or customize your Codespace instances on a per-project basis, via something called a devcontainer.json file. Example customizations include:
– Setting the Linux-based operating system to use.
– Automatically installing various tools, runtimes, and frameworks.
– Forwarding commonly used ports.
– Setting environment variables.
– Configuring editor settings and installing preferred extensions.
Your existing requirements.txt, Dockerfiles, and conda environment YAMLs are automatically understood by Codespaces, and can be referenced from devcontainer.json. If you aren’t a fan of VS Code, you can even use a variety of front-ends with Codespaces, such as Jupyter notebooks or JupyterLab. To create a new Codespace, just click the “Code” button on any GitHub repo, or head to codespaces.new.
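As a minimal sketch of the customizations listed above, a devcontainer.json for a small Python project might look like the following. The project name, the container image tag, the port, the DATA_DIR variable, and the extension list are illustrative assumptions rather than recommendations, and the file assumes a requirements.txt at the root of the repo.

```json
{
  "name": "data-science-sandbox",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "forwardPorts": [8888],
  "containerEnv": {
    "DATA_DIR": "/workspaces/data"
  },
  "postCreateCommand": "pip install -r requirements.txt",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python", "ms-toolsai.jupyter"],
      "settings": {
        "editor.formatOnSave": true
      }
    }
  }
}
```

Here "image" picks the Linux-based environment, "postCreateCommand" installs tools once the container has been created, "forwardPorts" and "containerEnv" cover ports and environment variables, and the "customizations" block holds editor settings and extensions. The file usually lives at .devcontainer/devcontainer.json in the repo.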
Model and Data Templates
As you create experiments and machine learning models, it is important to clarify the intended use cases of your work and to minimize its use in contexts for which it is not well suited. AI ethics researchers are in the process of creating standards for these best practices, which can be included in your repos the same way as you would include a LICENSE.md or a CONTRIBUTING.md:
– Model Cards (Mitchell et al., 2018): describes the model, its intended uses and potential limitations, the training parameters and experimental information, and the datasets used to train and evaluate results.
– Datasheets for Datasets (Gebru et al., 2021): a markdown file that describes a dataset’s motivation, composition, collection process, and recommended uses. These datasheets facilitate better communication between dataset creators and consumers, and encourage the machine learning community to prioritize transparency and accountability.
An example YAML section from a model card that specifies metadata:
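As a minimal sketch, assuming the front-matter convention used for model cards on the Hugging Face Hub (a YAML block at the top of the README.md), such a section might look like the following. Every value is a placeholder for illustration and does not describe a real model.

```yaml
---
# All values below are illustrative placeholders.
language: en
license: mit
tags:
  - text-classification
datasets:
  - imdb
metrics:
  - accuracy
---
```

The prose sections of the model card (intended uses, limitations, training details) would then follow this block in the same markdown file.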
GitHub Actions
GitHub Actions allow you to automate, customize, and execute software development workflows directly in your repository. You can think of GitHub Actions as supercharged cron jobs that can be used for every step of your machine learning and data science development process, including:
– Consuming and transforming data.
– Appending new data in cloud storage buckets.
– Version controlling datasets.
– Retraining models, and storing performance metrics.
– Generating reports and dashboards.
– Deploying new models.
…and much, much more. You can view and search through Data and Machine Learning Actions in our marketplace, and be sure to take a look at our collection of resources on how to facilitate machine learning operations practices with GitHub.
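To make the "supercharged cron job" idea concrete, here is a minimal sketch of a workflow that retrains a model on a schedule and stores its metrics as a build artifact. The file path (.github/workflows/retrain.yml), the train.py script, its --metrics-out flag, and the metrics.json output are hypothetical names chosen for illustration. The three actions used are published ones, but the version tags are assumptions you should check against their current releases.

```yaml
# .github/workflows/retrain.yml (hypothetical path and workflow name)
name: retrain-model

on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday at 06:00 UTC
  workflow_dispatch:       # also allow manual runs from the Actions tab

jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Retrain and write metrics
        # train.py and its flag are hypothetical; substitute your own entry point
        run: python train.py --metrics-out metrics.json
      - name: Store performance metrics
        uses: actions/upload-artifact@v4
        with:
          name: metrics
          path: metrics.json
```

The same structure extends to the other steps in the list: swap the training step for a data-ingestion, report-generation, or deployment command, and change the trigger under "on" to match when that step should run.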
About the author:
Paige Bailey (@dynamicwebpaige) is the product lead for data science, machine learning, and MLOps at GitHub. Prior to joining GitHub, Paige worked on machine learning developer tools in Microsoft’s developer tools division and was a product manager for machine learning APIs and platforms at Google Brain and DeepMind.