Why Learn Python for Data Processing

ODSC - Open Data Science
4 min read · Jun 15, 2021


Python is one of the most popular languages in the world. It’s used in a lot of different fields, like web services, automation, data science, managing computer infrastructure, and artificial intelligence and machine learning.

Its readable and concise syntax makes it a great option for teaching students their first programming language, but under the façade of an easy and approachable language, there’s a huge amount of power.

Python is easy to learn and to use, sure, but it’s also capable of fantastic feats in demanding environments like video games, banking services, healthcare, or state-of-the-art scientific research.

Python in particular is a well-regarded language for data processing, for several reasons:

  • Python for data processing allows writing code in different styles, without forcing you to settle into a specific way of doing things. It’s very easy to create prototypes and experiment with code. Processing data, particularly from not-very-clean sources, requires a lot of tweaking, back and forth, and effort to capture every possibility.
  • Python 3 greatly improved multilanguage support by making every string in the system Unicode, which helps when processing data in different languages encoded in different character sets.
  • The standard library is very powerful and full of useful modules for working natively with common formats like CSV files, ZIP files, databases, etc.
  • The third-party ecosystem for Python is huge, with incredibly good modules that extend the capabilities of a program. There are libraries to connect to any kind of database, create complex RESTful services, generate machine learning models, draw all kinds of graphs (including interactive ones), produce reports in formats like PDF or Word, analyze geospatial data, create command-line and graphical interfaces, parse data, and everything in between. The composability of these tools makes it easy to combine several of them: for example, to analyze geospatial data, create some graphs with the findings, and then generate a PDF report.
  • The ecosystem also includes powerful tools like integrated environments, such as Jupyter Notebooks, where you can execute code and get instant feedback. At the same time, Python is quite agnostic in terms of the required development environment, working from a simple text editor (I personally use Vim) to advanced options like Visual Studio.
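The Unicode point above can be shown in a couple of lines. This is a minimal sketch with made-up sample text:

```python
# In Python 3 every str is Unicode; bytes only appear when encoding at
# the boundaries (files, sockets), so mixed-language text is safe to
# manipulate in memory regardless of its original character set.
text = 'café データ niño'
encoded = text.encode('utf-8')     # bytes, ready for storage or transport
decoded = encoded.decode('utf-8')  # back to str, identical to the original

print(decoded == text)  # True
print(len(text))        # 13 -- counted per character, not per byte
```

Note that `len(text)` counts characters, while `len(encoded)` would count bytes, which is exactly the distinction that makes multilanguage processing predictable.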

A quick but insightful way of showing the power of Python for data processing is to present how easy it is to operate with common files. Let’s imagine that we have a text file with a bunch of lines containing numbers, and we want to calculate their average and store it in a new file.

example.txt
---
5
4
3
7

We can read the file with a with clause, which will automatically close the file when the block is finished. The file is opened by default in text mode, which allows reading it line by line by iterating over it.

with open('example.txt') as file:
    numbers = [int(line) for line in file]

The numbers list processes the file line by line, transforming each line into an integer, as it’s read as text. This structure, with a loop between brackets, is called a list comprehension in Python and allows generating lists in an easy and readable way.
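List comprehensions can also filter while they convert, which is handy for the not-very-clean sources mentioned earlier. A small sketch, using a hypothetical in-memory list of messy lines instead of a file:

```python
# Hypothetical messy input: blank lines and stray whitespace, as often
# found in real-world data files.
raw_lines = ["5\n", " 4\n", "\n", "3\n", "7 \n"]

# The trailing 'if' clause skips lines that are empty after stripping;
# int() itself tolerates surrounding whitespace.
numbers = [int(line) for line in raw_lines if line.strip()]

print(numbers)  # [5, 4, 3, 7]
```

The same comprehension works unchanged when `raw_lines` is replaced by an open file object, since both are iterables of strings.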

The average can be calculated by adding all the numbers and dividing by how many there are, which is the length of the list.

average = sum(numbers) / len(numbers)
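The standard library also ships a statistics module that computes the same value; a small sketch comparing the two:

```python
import statistics

numbers = [5, 4, 3, 7]

# Manual calculation, as in the article.
average = sum(numbers) / len(numbers)

# statistics.mean gives the same result, and raises a clear
# StatisticsError on an empty list instead of a ZeroDivisionError.
assert average == statistics.mean(numbers)

print(average)  # 4.75
```

For a short script either form is fine; `statistics.mean` mostly buys clearer intent and friendlier error messages.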

Finally, we store the result in a new file. We use the same with clause, but this time opening the file in write mode by adding ‘w’. The file is also written in text mode.

with open('result.txt', 'w') as file:
    file.write(f'Average: {average}')

The f-string replaces the variable average inside the template string by putting it between curly brackets.
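F-strings also accept format specifiers after a colon, which is useful when writing numbers into reports. A quick sketch:

```python
average = 4.75

# ':.2f' renders the value with exactly two decimal places;
# other specifiers control padding, signs, percentages, etc.
line = f'Average: {average:.2f}'

print(line)  # Average: 4.75
```

With a less tidy value such as `19 / 7`, the specifier would round the output to two decimals instead of printing the full float representation.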

And that’s it. Five lines of code that read and write two files, transform the input from text to integers, and perform the calculation. All the code is very easy to follow.

This method of reading from a text input, performing some calculations, and dumping the results to a text output is very useful in data processing, as several such steps can be chained to build complex pipelines. The intermediate results are stored, so in case of an error only the required steps need to be repeated, not the whole run from the start. Because reading and writing are so easy, these saved intermediate files act as checkpoints that avoid reprocessing the data multiple times.
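The chaining idea can be sketched as a tiny two-step pipeline with an intermediate file. This uses a temporary directory and made-up data purely for illustration:

```python
import os
import tempfile

# A two-step pipeline sketch: step one writes an intermediate file,
# step two reads it back. In a real pipeline each step could be rerun
# on its own from the stored input, without repeating earlier steps.
with tempfile.TemporaryDirectory() as workdir:
    intermediate = os.path.join(workdir, 'step1.txt')

    # Step 1: parse raw values and store them as the checkpoint.
    raw = ['5', '4', '3', '7']
    with open(intermediate, 'w') as f:
        f.write('\n'.join(raw))

    # Step 2: read the checkpoint and compute the average.
    with open(intermediate) as f:
        numbers = [int(line) for line in f]
    average = sum(numbers) / len(numbers)

print(average)  # 4.75
```

If step 2 failed, only step 2 would need to run again, reading from the file step 1 already produced.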

This barely scratches the surface, of course. Python’s standard library includes modules to read and write the CSV format, and there are a lot of third-party options for other formats like HTML, PDF, or even Word or Excel.
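As a small taste of the standard library’s CSV support, here is a sketch of a round trip through the csv module, using an in-memory buffer in place of a real file:

```python
import csv
import io

# Write a header row and some values in CSV format.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['value'])
writer.writerows([[5], [4], [3], [7]])

# Read them back; csv handles quoting and line endings for us.
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
values = [int(row[0]) for row in reader]

print(values)  # [5, 4, 3, 7]
```

With a real file, `open('data.csv', newline='')` would take the place of the `StringIO` buffer, as the csv documentation recommends.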

There are also modules that present the information not only in text format but with different kinds of graphics, like the useful Matplotlib, and powerful data manipulation modules like Pandas to crunch the numbers and obtain insightful results.

Original post here.

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform.
