An Overview of Building End-To-End Big Data Reporting & Analytics Systems
In the 21st century, it is no secret that Big Data is driving business growth and development in an unprecedented way across industries such as e-commerce, healthcare, and finance. We have all heard the catchphrase "Data is Gold," and to avoid falling behind their competitors, industries and corporations are capturing and storing as much data as they can. Businesses and enterprises capture data for big data reporting across different streams, such as transactional data, browsing data, and user-interaction data, and through different touchpoints, i.e., web, mobile, and other smart devices.
Building pipelines to collect, record, and collate information is the first step towards building scalable analytics and intelligent big data reporting systems. Once the data is stored, the next and more complex task is to derive insights and perform analytics over the huge piles of data collected on a regular basis. The typical big data reporting workflow in major industries and corporations involves dedicated teams that perform analytics and report on information stored in a structured format, e.g., SQL tables. The main challenge today is the rate and scale at which data is captured and stored, which in turn increases the pressure to dive deeper into that data and derive insights quickly enough to make effective business decisions. To handle this scale and rate of information and to derive insights instantly, the major question that needs to be addressed is: can we build effective, scalable, industry-grade, end-to-end big data systems that perform automatic reporting and descriptive analytics over the captured data?
This problem of building an automatic end-to-end big data reporting system has been a topic of interest in the research community and an area of active research under the theme of Natural Language Interfaces to Databases (NLIDB), with research papers dating back to the 1980s [1]. A rough abstraction of such an automatic NLIDB system is shown in Figure 1 below. There are two main aspects of these NLIDB systems:
- Converting Natural Language (NL) to Structured Query Language (SQL)
- Fetching results from Databases (DB) using the structured query
Figure 1: A rough abstraction of an NLIDB system. Image source: Soumya, M. D., and Patil, B. A., 2017 [2].
One of the most complex tasks in building such an automated big data reporting system is understanding natural language queries effectively, which remains an open area of active research. Once we have converted natural language to a structured query, the remaining work of executing the SQL statement, fetching the results, and presenting them is comparatively easier (a minimal sketch of this execution step is shown below).
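To make the execute-and-fetch step concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table, columns, and the `nl_to_sql` stub are hypothetical illustrations; a real NLIDB system would replace the stub with a full natural-language-to-SQL component.

```python
import sqlite3

# Hypothetical stub: a real system would use a rule-based or learned
# natural-language-to-SQL component instead of a fixed template lookup.
def nl_to_sql(question: str) -> str:
    templates = {
        "total sales per region": "SELECT region, SUM(amount) FROM sales GROUP BY region",
    }
    return templates[question.lower()]

def answer(question: str, conn: sqlite3.Connection):
    sql = nl_to_sql(question)      # step 1: natural language -> SQL
    cursor = conn.execute(sql)     # step 2: run the query against the database
    return cursor.fetchall()       # step 3: fetch results for reporting

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("APAC", 120.0), ("EMEA", 80.0), ("APAC", 40.0)])
    print(answer("Total sales per region", conn))
```

The hard part, as discussed below, is the first step: turning open-ended natural language into the structured query.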
Natural Language Understanding: Human language is complex and ambiguous; sometimes it is hard even for people to understand each other in conversation, so making machines understand natural language is a daunting task.
Let's look at a simple example that shows the ambiguous nature of human language: "I saw a girl with a telescope." This sentence, with the same set of words and without any change, can have two different interpretations (a dependency-parse sketch follows the two readings below):
- First Interpretation: I saw a girl, using a telescope I had.
- Second Interpretation: I saw a girl who had a telescope.
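To see how this ambiguity surfaces in practice, here is a small sketch using spaCy (assuming the en_core_web_sm model is installed). The parser has to commit to a single attachment for "with a telescope", even though both readings are valid.

```python
import spacy

# Assumes the small English model has been installed via:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("I saw a girl with a telescope.")

# Print each token with its dependency label and head.
# The head of "with" reveals which attachment the parser chose:
#   head == "saw"  -> the telescope is the instrument of seeing
#   head == "girl" -> the girl is the one holding the telescope
for token in doc:
    print(f"{token.text:10s} {token.dep_:10s} head={token.head.text}")
```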
Thus, converting free-flowing natural language into a structured query language, where many variations of the input query can map to a single SQL statement, is a complex challenge. The small sketch below illustrates this many-to-one mapping.
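As a purely illustrative sketch (the paraphrases, keywords, table, and columns are hypothetical), the snippet below resolves several natural-language paraphrases to the same SQL statement. Production systems rely on far richer semantic-grammar, parsing, or machine-learned text-to-SQL approaches to cover open-ended input.

```python
# Several surface forms of the same question should resolve
# to the same SQL statement.
CANONICAL_SQL = (
    "SELECT region, SUM(amount) AS total_sales "
    "FROM sales GROUP BY region"
)

PARAPHRASES = [
    "show me total sales by region",
    "what are the sales per region?",
    "break down the sales numbers for each region",
]

def naive_nl_to_sql(question: str) -> str:
    # A toy keyword matcher; real systems use semantic grammars or
    # neural text-to-SQL models rather than hand-written rules.
    q = question.lower()
    if "sales" in q and "region" in q:
        return CANONICAL_SQL
    raise ValueError(f"Cannot map question to SQL: {question!r}")

for p in PARAPHRASES:
    assert naive_nl_to_sql(p) == CANONICAL_SQL
print("All paraphrases map to the same SQL statement.")
```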
The main questions that need to be addressed while building such a scalable, industry-grade reporting and descriptive analytics system are:
- How to build a robust and comprehensive pipeline that captures these nuances effectively in an industrial use-case setting?
- How to handle Natural Language Queries effectively?
If some of these questions interest you and you want to learn more, check out my upcoming talk at ODSC India 2020, "Natural Language Querying for Industry Grade Data Analytics Systems."
For more information on the topic, you can check the suggested reading material below or contact the author.
References
1. Perrault, C. R., and Grosz, B. J. "Natural-Language Interfaces." In Exploring Artificial Intelligence, Morgan Kaufmann, 1988, pp. 133–172.
2. Soumya, M. D., and Patil, B. A. "An Interactive Interface for Natural Language Query Processing to Database Using Semantic Grammar." International Journal of Advance Research, Ideas and Innovations in Technology, vol. 3, 2017, pp. 193–198.
3. Joshi, S. R., Venkatesh, B., Thomas, D., Jiao, Y., and Roy, S. "A Natural Language and Interactive End-to-End Querying and Reporting System." In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, 2020, pp. 261–267.
About the Author (ODSC APAC Speaker):
Piyush Arora has been a Research Scientist with American Express AI Labs, Bangalore, India, since September 2019. Before joining American Express, he was a post-doctoral researcher with the ADAPT Centre, Dublin, Ireland. He completed his PhD in Computer Science in August 2018 at Dublin City University, Ireland. His areas of interest are Information Extraction and Retrieval, Machine Learning, User Search Behavior, Natural Language Processing, Sentiment Analysis, and Deep Learning. During his PhD, he worked in the area of Interactive Information Retrieval, developing novel solutions for improving the overall search experience and user learning. His work mainly focuses on Natural Language Processing and Machine Learning techniques and tools, and their application to different tasks, domains, and languages.