In the last ten years, social media has opened the floodgates for communities across the world to interact 24x7, 365 days a year, from anywhere to anywhere. There are no geographical boundaries, time zones or fixed formats/types of data. Big Data is characterized by 3Vs (Velocity, Variety and Volume), where data could be text, pictures, images, video or audio. This data is unstructured and non-relational, requiring a high-tech workforce to pre-process it and make it available to the decision-maker in the required format. Software such as Hadoop, Ruby, Python and MongoDB have taken centre stage in the Big Data environment. It is a big game where large software houses like Google, Oracle, IBM, Microsoft, Rackspace and Amazon use their cloud services to handle Big Data and charge their clients for relevant information. It gives customers a cutting edge in forecasting trends related to marketing, sales, CRM, SCM and even human likes/dislikes, used for propaganda/election campaigns by political parties.
Emerging cloud computing and Big Data environments are taking away routine jobs in data entry and basic coding, but they are also offering new jobs, though these demand a new skill set. This trend is unstoppable, and we must all learn new skills and embrace cloud computing and Big Data as the future computing environment. It is estimated that India will have a 32% share of the global Big Data market by 2025. Despite initial apprehension about data ownership, security and privacy, cloud computing and Big Data are the new vehicles for business growth, and nobody wants to be left behind in this race. This short article is an attempt to motivate young professionals to “Get Set and Go” in this race of emerging technologies. A brief introduction to the potential of Big Data and its associated data processing tools is given in the succeeding paragraphs.
What is Big Data?
Although "Big Data" is quite a fashionable term nowadays, it is not always clear to many people. Some organizations, and particularly their IT staff, often try to draw attention to their large traditional RDBMS projects by labelling them "Big Data", while there is nothing big about them. Big Data sizes are in the range of terabytes or even petabytes, with a lot of complexity, unstructured data and varied formats. A database of just a few terabytes is not considered Big Data, since it can easily be handled by traditional RDBMS data processing tools and languages like Oracle, SQL, XML, C++ and Java. In large enterprises, the concepts of data warehouses and data mining have been in vogue for over 15 years. These organizations have distributed computing environments using their own leased lines or a private cloud over the internet. Big Data is an approach to describing data problems that are unsolvable using traditional RDBMS tools. Three distinct characteristics of Big Data, popularly called the 3Vs, are briefly described below:
a) Volume: This relates to very high volumes of data, ranging from dozens of terabytes to even petabytes. Indeed, the data is huge and keeps growing continuously and at great speed, round the clock (24x7), 365 days a year.
b) Variety: Data is organized in multiple structures, ranging from raw (unstructured) data through semi-structured data to structured data (stored in rows and columns). To make things even more complex, data can be text, an email with or without attachments, an SMS, a WhatsApp audio clip, a photo, a video or sound clip, and in any language/format.
c) Velocity: Data from registered customers, clients and business partners may come through normal channels in expected formats, but it can arrive at any time through leased data lines or the internet. Moreover, data from social media networks and feedback from online trading can come at any time, and simultaneously, from many locations on the globe.
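The variety problem described above can be sketched in a few lines of Python: records arrive as JSON, CSV or free text and must be normalized into one shape before analysis. The field names and sample feed below are hypothetical, chosen only to illustrate the idea.

```python
import csv
import io
import json

def normalize(record):
    """Coerce a raw record (JSON string, CSV line, or plain text)
    into one common dict shape. Field names here are illustrative."""
    if record.lstrip().startswith("{"):          # looks like JSON
        data = json.loads(record)
        return {"user": data.get("user"), "text": data.get("text")}
    if "," in record:                            # treat as a CSV line
        user, text = next(csv.reader(io.StringIO(record)))
        return {"user": user, "text": text}
    return {"user": None, "text": record}        # fall back to free text

feed = [
    '{"user": "asha", "text": "great service"}',   # JSON from an API
    'ravi,delivery was late',                      # CSV from a log file
    'anonymous feedback with no structure',        # free text
]
records = [normalize(r) for r in feed]
```

Real pipelines face far messier inputs, but the pre-processing step the article mentions is essentially this: many formats in, one format out.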
How big is Big Data? A common question is often asked today is- “ How big is Big data and how Big data technologies can help businesses to succeed?”. Is Big Data size in terabytes, Pico bytes, or even more? To understand this in simple words, Big Data and its key special features have n been briefly discussed in subsequent sections. Every large size data held on a group of servers is not Big Data until it meets two criteria as follows.
· Type of data. We all know that traditional business data related to bank accounts, production, dispatch and stocking of goods in a warehouse may change at any time. In contrast, Big Data represents a “log of records” where each record describes some event like a purchase in a store, customer visit to a retail store, a web- page viewed by online buyer, a sensor fed data at a given moment, customer online feedback or a short message ( like comment ) on a social network
· Scalability and Elasticity. Big Data has a huge volume of data that requires parallel processing and a special approach to storing data on multiple clustered computers (nodes). In addition, a Big Data solution needs automatic scalability and recovery. To cope with ever-growing data volumes, we should not need to change the software each time the volume increases. Instead, the system allocates more nodes, and the data is redistributed among them automatically and seamlessly.
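The redistribution idea can be sketched with plain hash partitioning. This is a deliberately naive model: as the sketch shows, modulo hashing moves most keys when a node is added, which is exactly why production systems prefer consistent hashing.

```python
import zlib

def node_for(key, num_nodes):
    """Decide which node stores a key, using simple hash partitioning.
    (A toy scheme; real clusters use consistent hashing to move
    far fewer keys when the cluster is resized.)"""
    return zlib.crc32(key.encode()) % num_nodes

keys = [f"record-{i}" for i in range(1000)]
before = {k: node_for(k, 4) for k in keys}   # cluster of 4 nodes
after = {k: node_for(k, 5) for k in keys}    # a 5th node is added
moved = sum(1 for k in keys if before[k] != after[k])
```

The point for the reader is the elasticity contract: the application code never changes; only the placement function's view of the cluster does.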
Big Data System Architecture. Customers expect Big Data systems to efficiently handle the data volume, complexity and scalability aspects which have been major issues in the traditional RDBMS client-server environment. A Big Data system must deliver high performance irrespective of the size and complexity of the data. Some desirable features of a Big Data system are:
· Simple Design. A complex system design is more likely to develop faults and becomes harder to understand and debug. To overcome implementation complexity, algorithms and module designs should be simple. The Lambda Architecture is a good example which takes care of most of the desired features.
· Wider Applicability. A Big Data system should support a wide range of applications: financial, banking, e-commerce, social media analytics and scientific applications.
· Quick Response. The vast majority of applications require very low latency, typically within a few milliseconds. The system should be very fast while reading, updating and retrieving information for display on the screen; customers become impatient if the response is slow.
· Resilience. Big Data systems must be fault-tolerant and continue to perform reliably and efficiently even if some servers in a cluster go down randomly. RDBMS-based distributed databases have issues related to the consistency of data, duplication of data, maintaining backup data at multiple locations and concurrency of data. A Big Data system must be robust enough to avoid these limitations and be more human-fault tolerant. Its recovery mechanism should be so efficient that the end-user does not feel any disruption.
· Scalability. Scalability is the ability to maintain consistent performance whenever there is a sudden surge in incoming data. The system should automatically add the required extra resources to the system configuration. In traditional systems, the Application Development Agency (ADA) carries out load tests and stress tests to ensure acceptable response times during peak load, fine-tuning hardware devices and optimizing software design to meet customer requirements.
· Extensibility. There should be no need for any major programming effort when a customer makes a change in business process/rules. Extensible systems allow functionality to be added with a minimal development cost.
· Minimal maintenance. Maintenance is the work required to keep a system running. Big Data system should carry out any scheduled maintenance without any slowdown in response time or any inconvenience to the customers.
Categories of Big Data. There are two categories of Big Data sources:
· Internal Data. The organization generates its own data and controls it. This includes the corporate database (ERP), internal documents (SOPs, instructions), in-house call-centre data, website logs, sensors and controllers.
· External Data. This is public data, or data generated outside the organization; as such, the organization neither owns nor controls it. This data includes social media such as Facebook, statistical data, public domain data and Machine Learning data.
Technical Terms. The Big Data community speaks its own language, and you should be aware of some of its technical terms for easy interaction. In fact, each of these terms could fill a chapter of 8 to 10 pages. Some of the common names and terms are very briefly given below:
· Cloud. It is the delivery of on-demand computing resources on a pay-for-use basis. It is used by many large organizations, particularly for Big Data applications demanding fast scalability, such as automatically adding computers as the load increases.
· Hadoop. It is a framework used for distributed storage of huge amounts of data and parallel data processing. It breaks large data into smaller chunks to be processed separately on different data nodes (servers), then automatically collects the outputs from the multiple nodes and compiles them into a single output.
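The split-process-combine idea behind Hadoop's MapReduce model can be sketched in plain Python. This is a single-machine toy, not the Hadoop API: the two strings in `chunks` stand in for data held on two different nodes, and the three functions mirror the map, shuffle and reduce phases.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one chunk of input.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single output value.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data is big", "data is everywhere"]   # two "data nodes"
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(c) for c in chunks)))
```

On a real cluster each phase runs in parallel across many servers, but the programmer writes essentially just the map and reduce functions; the framework handles the distribution.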
· HDFS. The Hadoop Distributed File System (HDFS) allows multiple files to be stored and processed simultaneously at multiple locations. The customer need not worry that these files could be lying on any cluster, and on any server of that cluster; to the end-user, the data files appear to be in one location.
· NoSQL. Commonly read as "Not Only SQL", it represents a completely different framework of databases that allows for high-performance processing of information at massive scale. This database infrastructure has been well adapted for Big Data applications.
· MongoDB. It is an open-source document database with the capability to handle both structured and unstructured data.
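The schema-free document model that makes such databases suit mixed data can be illustrated without a running server. The in-memory `collection` and `find` helper below are a toy stand-in, not the MongoDB API; note that the two documents do not share the same fields.

```python
# A tiny in-memory stand-in for a document store: each "document" is a
# dict, and documents in one collection need not share a schema.
collection = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},   # structured record
    {"_id": 2, "name": "Ravi", "feedback": "fast delivery",    # extra, unplanned
     "tags": ["retail"]},                                      # fields are fine
]

def find(collection, **criteria):
    """Return documents whose fields match all the given criteria
    (a simplified imitation of a document-database query)."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

match = find(collection, name="Ravi")
```

In a relational table, the second record would force NULL columns or a schema change; in the document model it is simply stored as-is.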
· Apache Spark. It is a framework used for in-memory parallel data processing, which makes near real-time analytics possible.
· Apache Pig. It is a Data Flow Framework based on Hadoop Map Reduce. It is particularly suited for large unstructured data.
· Python. It is a programming language like C++ or Java, but with added features and libraries well suited to Big Data applications.
· Ruby. Ruby is a dynamic, interpreted, reflective, object-oriented, general-purpose programming language. It was designed and developed in the mid-1990s by Yukihiro "Matz" Matsumoto in Japan.
· R / RStudio. R is a powerful programming language particularly designed to handle statistical data and carry out trend-based prediction; RStudio is its popular development environment.
Big Data Applications. Big Data technology is a boon for large enterprises operating globally, helping them grow fast and provide greater satisfaction to their customers and distributors (retail stores). There are many scenarios where Big Data applications are already in place or are potential candidates. Some common examples are briefly given in the subsequent sections.
a) Customer Analytics. Customer analytics is a software tool suitable for CRM functions like tracking customers' likes/dislikes, wish-lists, buying habits and financial capacity. Say a global industrial house has over 10 million customers, hundreds of retailers and five manufacturing plants in five countries. To develop a comprehensive view of customer and retailer requirements, and of collecting, stocking and distributing goods, the corporate headquarters (HQ) needs to analyze a huge amount of data in a variety of formats and from multiple locations. The more data sources the HQ uses, the more accurate the picture it gets for timely decision-making. This is not a one-time exercise but an ongoing process, 24x7 over 365 days. Customer analytics is equally beneficial for customers, retail stores and the corporate HQ. Let us take the case of a customer visiting a retail store and see what information is picked up and transmitted across the network:
· Personal data (if registered: name, gender, age, address, contact phone).
· Demographic data (working woman, 35 years old, married, with two children).
· Buying habits (frequent buyer; clothing and jewellery; high-value shopping).
· Transactional data (when she visits the retail store, e.g. at weekends, and the products she buys each time).
· Web behaviour data (frequency of online trading, her wish-list and the products she puts into her basket when she shops online).
· Data from customer-created texts (customer comments/feedback about the product and delivery system when shopping online).
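Data points like those above are folded into one running profile per customer. The sketch below shows the idea with a hypothetical event stream of (customer, channel, amount) tuples; real systems would ingest far richer events continuously.

```python
from collections import defaultdict

# Hypothetical event stream: (customer_id, channel, amount spent).
events = [
    ("c-101", "store", 120.0),
    ("c-101", "online", 45.5),
    ("c-202", "online", 15.0),
    ("c-101", "store", 60.0),
]

def build_profiles(events):
    """Fold raw events into one summary profile per customer."""
    profiles = defaultdict(lambda: {"visits": 0, "spend": 0.0, "channels": set()})
    for customer, channel, amount in events:
        p = profiles[customer]
        p["visits"] += 1          # how often the customer appears
        p["spend"] += amount      # total value of their purchases
        p["channels"].add(channel)  # store, online, etc.
    return profiles

profiles = build_profiles(events)
```

At HQ scale this same fold runs over millions of customers and many data sources at once, which is what pushes the problem out of a single machine and into Big Data territory.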
b) Industrial Analytics. To avoid expensive downtimes, which impact all related processes such as manufacturing and transportation, one can suitably place sensors to gather field data and carry out proactive maintenance. For this, you could collect and analyze sensor data for several months to form a history of observations. Based on this historical data, the system can identify a set of patterns that are likely to end in a mechanical breakdown. For instance, the system recognizes that the pattern formed by the temperature sensors is similar to a pre-failure situation and alerts the maintenance team to check the machinery. Preventive maintenance is one example of how manufacturers can use Big Data.
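A minimal sketch of the alerting step, where a simple threshold-over-a-window rule stands in for the learned pre-failure pattern (the threshold, window size and sensor readings below are made-up illustrative values):

```python
def should_alert(readings, threshold=90.0, window=3):
    """Flag machinery when the last `window` readings all exceed the
    threshold -- a crude stand-in for a learned pre-failure pattern."""
    recent = readings[-window:]
    return len(recent) == window and all(r > threshold for r in recent)

# Hypothetical temperature feed: a steady rise above 90 degrees
# resembles the pre-failure pattern, so the team should be alerted.
history = [72.0, 75.0, 88.0, 91.5, 93.0, 95.2]
alert = should_alert(history)
```

In practice the "pattern" would be mined from months of history rather than hard-coded, but the runtime check has this shape: compare the live window against a known failure signature and raise an alert on a match.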
c) Business Process Analytics (BPA). Some companies also use Big Data analytics to monitor the performance of their remote employees and improve the efficiency of their processes. Let us take transportation as an example. Companies can collect and store the telemetry data that comes from each truck/car in real time to identify the typical behaviour of each driver. Once the pattern is defined, the system analyzes real-time data, compares it with the pattern and raises an exception signal if there is a mismatch. Thus, the company can ensure safe working conditions for its drivers by enforcing regular and timely halts for rest.
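One simple way to detect a mismatch against a driver's typical pattern is a standard-deviation test. This is only a sketch; production systems use much richer models, and the speed figures below are invented for illustration.

```python
from statistics import mean, stdev

def deviates(history, reading, k=3.0):
    """Flag a live telemetry reading that lies more than k standard
    deviations from this driver's historical pattern."""
    mu, sigma = mean(history), stdev(history)
    return abs(reading - mu) > k * sigma

# Hypothetical highway speeds (km/h) defining one driver's pattern.
speeds = [78, 80, 82, 79, 81, 80, 77, 83]
```

A reading of 120 km/h would fall far outside this driver's pattern and raise the exception signal, while 81 km/h would pass unremarked.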
d) Analytics for Fraud Detection (AFD). Banks can detect credit card fraud in real time. If the system detects that somebody other than the owner is using a credit card, it can block the suspicious activity and notify the owner. For example, if an unauthorized person is trying to withdraw money in Spain while you reside in Texas (USA), the bank can, before declining the transaction, check the user's info on the social network; maybe at that time you are spending your vacation in Spain.
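The Texas/Spain scenario can be sketched as an "impossible travel" rule: two card uses are suspicious if covering the distance between them in the elapsed time would require an impossible speed. The distance table and 900 km/h cut-off below are illustrative assumptions, not real banking logic.

```python
# Hypothetical straight-line distances (km) between locations.
DISTANCES = {("Texas", "Spain"): 8000.0, ("Texas", "Texas"): 0.0}

def is_suspicious(last_place, last_time, place, time, max_kmh=900.0):
    """Flag a card if reaching the new location from the last known one
    in the elapsed time would require faster-than-airliner travel.
    Times are in seconds; 900 km/h approximates a jet's cruise speed."""
    hours = (time - last_time) / 3600.0
    km = DISTANCES.get((last_place, place), 0.0)
    return hours <= 0 or (km / hours) > max_kmh

# Card used in Texas, then one hour later in Spain: ~8000 km/h, impossible.
flag = is_suspicious("Texas", 0, "Spain", 3600)
ok = is_suspicious("Texas", 0, "Texas", 3600)
```

A real system would combine many such signals (merchant category, spending pattern, device fingerprint) before blocking a card, but each signal is a small rule or model of roughly this shape.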
Big data's role in the future.
Large businesses have struggled for over two decades to find new techniques for efficiently capturing information about their customers, products, services, business partners and competitors. Earlier, RDBMS-based capturing, storing, retrieving, processing and displaying of data for decision-making was pretty straightforward, following well-established logic and formats. However, over time, the functioning and data processing of large companies with multiple locations inside and outside the home country have become more complicated. To grow fast in a highly competitive global market, these companies have added more product lines and diversified their product delivery systems. Data issues are not limited to business organizations but apply equally to government organizations dealing with healthcare, agriculture, weather prediction and global warming, and to Research and Development (R&D) organizations.
Today, we are indeed dealing with a lot of complexity, volume and variety of data. Some data is structured and stored in traditional relational databases, while other data, including documents related to customers, service records, and even pictures and videos, is in unstructured form. We also have to consider data generated by machines and sensors deployed inside machinery, on the ground and in outer space. Further external sources are human-generated content on social media and click-stream data from website interactions. Although each data source can be identified and managed separately, the challenge today is how a user makes sense of the intersection of all these different types of data. When you are dealing with so much information in so many different forms, traditional data management is impossible; you have to think about managing data differently.
Issues related to Big Data
· Big Data is still evolving to meet industry requirements, and it will take some time, say 5-10 years, to mature and acquire universally accepted norms and industry standards.
· It takes time to learn and gain good competence in handling Big Data tools, and there is a shortage of professionals to meet the future requirements of the Big Data environment.
· The higher the volume of data entering your organization, the bigger your velocity challenge. The commonly held rule of thumb is that if your data storage and analysis exhibit any of the three V characteristics, you have a Big Data challenge on your hands. Each of the 3V criteria poses its own challenge when analyzing data. It is the responsibility of the service provider or the data-handling organization/IT department to take care of all technical aspects and ensure that the received data is safe, accurate, consistent and clean.
· Big Data may contain some omissions or errors and is not suitable where absolute accuracy is crucial; therefore it does not serve well for book-keeping or transaction handling. Big Data is statistically sound for reflecting emerging trends and can help to identify market risks based on the analysis of customer behaviour, industry benchmarks and product portfolio performance.
The Big Data industry and data science are evolving new solutions very rapidly. This is one of the hottest IT trends of 2018: Big Data analytics is increasingly widespread in multiple industries, from banking and financial services to the healthcare and government sectors, and open-source Big Data tools are the mainstay of any Big Data architect's toolkit. Big Data is not only the right step for the business success of large organizations; it also offers many jobs to energetic professionals who can become highly sought-after and well-paid Big Data consultants. You need good competency in new skills like Hadoop, MongoDB, MapReduce, Ruby, Java, Python and R, which can help you in Business Intelligence (BI) and statistical analysis using the R language. Many jobs in the manufacturing, banking and healthcare sectors await bright data scientists.
You may also refer to my book, “Career challenges during global uncertainty”, available on www.amazon.com.
Dr Sarbjit Singh, Former Principal, Apeejay College of Engineering, Gurgaon, Haryana, India