“Data really powers everything that we do.”
Learning things from data, or making data speak, is data science. Now we are ready to talk about what data science is: it encapsulates some programming skills, some statistical readiness, some visualization techniques, and, last but not least, a lot of business sense. The kind of business sense I particularly care about is the ability and willingness, sometimes eagerness, to translate any business question into a question answerable using the data currently (or soon to be) within one's reach. Data science is the study of where information comes from, what it represents, and how it can be turned into a valuable resource in the creation of business and IT strategies. Mining large amounts of structured and unstructured data to identify patterns can help an organization rein in costs, increase efficiency, recognize new market opportunities, and strengthen its competitive advantage.
“Data Science is the process of collecting, storing, processing, describing and modeling data.”
I will walk you through this process using the CSPDM framework (Collect, Store, Process, Describe, Model), which covers every step of the data science workflow.
1. Collecting Data:
Data collection is the process of gathering information, i.e. data. What you do after collecting the data is entirely up to you: it may be stored, analyzed, or measured.
The data collection process can be either primary (you generate the data yourself) or secondary (you reuse data someone else collected).
The secondary method relates to collecting data that has already been collected by someone else: through a Google search for public data sets, by using Application Programming Interfaces (APIs), or by accessing databases, spreadsheets, and so on. This method is usually the first attempt to get the data you need (as when you Google something).
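For instance, here is a minimal sketch of pulling secondary data through a public API with Python's requests library; the endpoint URL and the response format are hypothetical placeholders:

```python
# A minimal sketch of secondary data collection through a public API.
# The URL and the shape of the JSON response are hypothetical.
import requests

response = requests.get("https://api.example.com/v1/datasets/sales")  # hypothetical endpoint
response.raise_for_status()                # fail loudly on HTTP errors
records = response.json()                  # assume the API returns a JSON list of records
print(f"Fetched {len(records)} records")
```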
Example 1:
Let’s assume the data scientist is working for an e-commerce company. A question the data scientist may be interested in is: which items do customers buy?
Here the data already exists within the organization, and the data scientist must know how to access it using programming, SQL, NoSQL, and so on.
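As a rough sketch, the query below pulls purchase counts from a local SQLite database using Python's built-in sqlite3 module; the file, table, and column names (ecommerce.db, orders, item, quantity) are made up for illustration:

```python
# A minimal sketch of querying internal purchase data.
# Database, table, and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("ecommerce.db")  # assumed local SQLite database
query = """
    SELECT item, SUM(quantity) AS units_sold
    FROM orders
    GROUP BY item
    ORDER BY units_sold DESC;
"""
for item, units_sold in conn.execute(query):
    print(item, units_sold)
conn.close()
```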
Example 2:
Let us assume the data scientist is working for a political party. A question the data scientist may be interested in is: what is people’s opinion about the party’s new agenda?
Here the data exists (in the form of tweets, Facebook posts, and so on) but not within the organization. So the data scientist must have web crawling (scraping) skills in addition to programming skills.
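A minimal scraping sketch with requests and BeautifulSoup follows, assuming the posts sit in p elements with a post-text class on a hypothetical page; real sites differ and have terms of use:

```python
# A minimal scraping sketch. The URL and the CSS selector are hypothetical;
# always check a site's robots.txt and terms of use before scraping.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/public-posts").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")
posts = [p.get_text(strip=True) for p in soup.select("p.post-text")]
print(posts[:5])
```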
Example 3:
Let us assume the data scientist is working with farmers. A question the data scientist may be interested in is: what is the effect of seed type, fertilizer, and irrigation on yield?
Here the data is not available at all, and the data scientist needs to design experiments, grounded in statistics, to collect it.
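As a rough illustration of analysing such a designed experiment, the sketch below runs a one-way ANOVA on yield grouped by seed type with scipy; the yield figures are invented:

```python
# A minimal sketch of analysing a designed experiment: one-way ANOVA
# on crop yield grouped by seed type. All numbers are made up.
from scipy.stats import f_oneway

yield_seed_a = [4.1, 3.9, 4.5, 4.2]   # hypothetical plot yields (tonnes/ha)
yield_seed_b = [4.8, 5.1, 4.7, 5.0]
yield_seed_c = [3.6, 3.8, 3.5, 3.9]

f_stat, p_value = f_oneway(yield_seed_a, yield_seed_b, yield_seed_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # small p suggests seed type affects yield
```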
Skills required:
– Intermediate Programming
– Knowledge of Databases
– Knowledge of Statistics
2. Storing Data:
After collecting data, the next immediate step is to store it. Data storage is the recording of information (data) in a storage medium. The data you store typically falls into three categories:
1) Transactional and Operational Data, which includes patient records, claims, invoices, customer records, etc. This data is structured, generally stored in relational databases, and optimized for SQL queries.
2) Data from Multiple Databases, which includes bank accounts, credit cards, investments, etc. There is a need to integrate this data into a common repository; that is where Data Warehouses come into the picture. Data warehouses accumulate structured data from various databases, curated and optimized for analytics.
3) Unstructured Data, which includes text documents, images, audio, video, web logs, etc. This data has no predefined schema and does not fit neatly into relational tables, so it is stored in Data Lakes (for example, built on Hadoop). It is commonly characterized by:
– High volume
– High variety
– High velocity
Skills Required:
– Programming
– Knowledge of relational databases
– Knowledge of NoSQL databases
– Knowledge of Data Warehouses
– Knowledge of Data Lakes (Hadoop)
3. Processing Data:
Once your data is stored, the next step is to process it: translate the raw, collected data into usable information. This involves:
1) Data Wrangling or Data Munging: the complete process of ETL (Extraction, Transformation, and Loading) is called data wrangling or data munging.
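A minimal ETL sketch with pandas follows; the input file, table, and column names (raw_sales.csv, sales, amount, customer_id) are hypothetical:

```python
# A minimal ETL (wrangling) sketch: extract a CSV, transform it,
# and load it into SQLite. File and column names are hypothetical.
import sqlite3
import pandas as pd

df = pd.read_csv("raw_sales.csv")                  # Extract
df["amount"] = df["amount"].astype(float)          # Transform: fix types
df = df.dropna(subset=["customer_id"])             # Transform: drop bad rows
with sqlite3.connect("warehouse.db") as conn:      # Load
    df.to_sql("sales", conn, if_exists="replace", index=False)
```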
2) Data Cleaning, which includes:
– Filling missing values
– Correcting spelling errors
– Identifying and removing outliers
– Standardizing keyword tags
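A minimal cleaning sketch with pandas; the file and column names (customers.csv, age, city, income) are hypothetical, and the outlier rule shown is the common 1.5×IQR heuristic:

```python
# A minimal data-cleaning sketch. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")
df["age"] = df["age"].fillna(df["age"].median())         # fill missing values
df["city"] = df["city"].str.strip().str.title()          # standardise text labels
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]  # remove outliers (IQR rule)
```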
3) Data scaling, normalizing, and standardizing:
Scaling includes converting data in kilometers to miles or rupees to dollars, etc.
Standardizing includes ensuring the data has zero mean and unit variance.
Normalizing includes rescaling values to lie between zero and one.
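A minimal sketch of all three operations, using plain numpy for the unit conversion and scikit-learn's preprocessing scalers for the other two:

```python
# A minimal sketch of scaling, standardizing, and normalizing a column.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

km = np.array([[1.0], [5.0], [12.0]])                # one feature, three samples

miles = km * 0.621371                                # scaling: unit conversion
standardised = StandardScaler().fit_transform(km)    # zero mean, unit variance
normalised = MinMaxScaler().fit_transform(km)        # values between 0 and 1
print(miles.ravel(), standardised.ravel(), normalised.ravel(), sep="\n")
```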
Skills Required:
– Programming skills
– MapReduce (Hadoop)
– SQL and NoSQL databases
– Basic Statistics

4. Describing Data:
After processing the data, we come to the stage of describing it. Describing data refers to presenting your data to a non-technical audience. We deliver results that answer the business questions we asked when we first started the project, together with the actionable insights found through the data science process.
1) Visualizing data:
Data visualisation is presenting data in a visual form so that it can be understood effectively. Examples include grouped bar charts and scatter plots.
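A minimal matplotlib sketch of both chart types, with made-up numbers:

```python
# A minimal sketch of a grouped bar chart and a scatter plot.
# All figures below are invented for illustration.
import matplotlib.pyplot as plt
import numpy as np

# Grouped bar chart: sales of two products across three quarters.
quarters = np.arange(3)
plt.bar(quarters - 0.2, [10, 14, 12], width=0.4, label="Product A")
plt.bar(quarters + 0.2, [8, 11, 15], width=0.4, label="Product B")
plt.xticks(quarters, ["Q1", "Q2", "Q3"])
plt.legend()
plt.show()

# Scatter plot: relationship between advertising spend and sales.
plt.scatter([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
plt.xlabel("Ad spend")
plt.ylabel("Sales")
plt.show()
```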
2) Summarising Data:
Summarising data helps answer questions about it. Both descriptive statistics (characterising the data at hand) and inferential statistics (generalising beyond it) help describe data better.
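A minimal sketch of descriptive summaries with pandas, on invented sales figures:

```python
# A minimal sketch of descriptive statistics; the numbers are made up.
import pandas as pd

sales = pd.Series([120, 135, 90, 210, 160, 145])
print(sales.describe())          # count, mean, std, min, quartiles, max
print("median:", sales.median())
```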
Skills Required:
– Statistics
– Excel
– Python Programming
– R Programming
– Tableau
5. Modelling Data:
This is the stage most people find the most interesting; many call it “where the magic happens”. In this last step we build models of the data, which come in two broad flavours: statistical modelling and algorithmic modelling.
1. Statistical Modelling
In simple terms, statistical modeling is a simplified, mathematically formalized way to approximate reality (i.e. whatever generates your data) and, optionally, to make predictions from this approximation. The statistical model is the mathematical equation used to do so.
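As a small illustration, the sketch below fits a straight line y ≈ ax + b by least squares with numpy; the data points are invented:

```python
# A minimal sketch: fit a simple linear model as an approximation
# of the data-generating process. The data points are made up.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8])

a, b = np.polyfit(x, y, deg=1)        # least-squares fit of a line
print(f"y ~ {a:.2f}x + {b:.2f}")
print("prediction at x=6:", a * 6 + b)
```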
2. Algorithmic Modelling
This type of modeling focuses on building functions that deal with high-dimensional data. The goals of this modelling include the following (see the sketch below):
– Estimating the function using data and optimization techniques.
– Given a new input, predicting the output.
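A minimal scikit-learn sketch of both goals, using the bundled Iris dataset: fit estimates the function from data, and score evaluates predictions on held-out inputs:

```python
# A minimal sketch of algorithmic modelling with scikit-learn:
# estimate a function from data, then predict outputs for new inputs.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)                      # estimate the function from data
print("accuracy:", model.score(X_test, y_test))  # predictions on unseen inputs
```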
Skills Required:
– Probability theory
– Inferential statistics
– Calculus
– Optimisation algorithms
– Machine learning and Deep learning
– Python packages and frameworks (numpy, scipy, scikit-learn, TF, PyTorch, Keras)
Feel free to leave a message if you have any feedback, and share this with anyone who might find it useful.