Data Science Dictionary
Our Data Science Dictionary is a collection of terms and definitions specifically related to the field of data science. It gives beginners a reference for common terms and their meanings, and more experienced data scientists a place to look up more specialized terminology.
A
View Data Science & Machine Learning Terms beginning with the Letter “A”
A/B Testing
A/B testing is a method of testing where two versions (A and B) of a web page or app are tested to determine which version produces better results.
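A common way to decide whether version B really outperforms version A is a two-proportion z-test. Here is a minimal sketch in Python; the visitor and conversion counts are invented for illustration.

```python
# A minimal sketch of analyzing an A/B test with a two-proportion z-test.
# The counts below are made-up illustration numbers.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 200, 5000   # conversions and visitors for version A
conv_b, n_b = 240, 5000   # conversions and visitors for version B

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled conversion rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))                           # two-sided p-value

print(f"A: {p_a:.1%}  B: {p_b:.1%}  z={z:.2f}  p={p_value:.4f}")
```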
Absolute Error
Absolute Error is the absolute value of the difference between the predicted value and the actual value.
Accuracy
Accuracy is the proportion of predictions that a model gets right; it measures how often the model correctly predicts the value of a given data point.
Active Learning
Active Learning is a type of machine learning in which the algorithm actively chooses which data points it wants labeled next, so that it can learn from the most informative examples.
Ad Hoc Analysis
Ad Hoc Analysis is an exploratory analysis technique used to uncover patterns or relationships in a dataset. The work is called “Ad Hoc” or “On The Fly” because it is done in the moment, in response to a specific question, rather than as part of a recurring report.
AdaBoost
AdaBoost is a machine learning algorithm used to improve the accuracy of a model’s predictions. It iteratively trains a sequence of weak models, increasing the weight of the training examples that earlier models misclassified, and then combines the weighted predictions of all the models.
Adaptive Boosting
Adaptive boosting, also known as AdaBoost, is a machine learning algorithm used to improve the accuracy of a classification model. It works by iteratively training a series of weak classifiers on a dataset, and then combining their predictions to produce a more accurate final prediction.
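As a quick, hedged illustration of AdaBoost in practice, the sketch below uses scikit-learn’s AdaBoostClassifier on a synthetic dataset; the dataset size and parameters are arbitrary.

```python
# A short sketch of AdaBoost with scikit-learn on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting round fits a weak learner (a decision stump by default)
# and re-weights the training examples it misclassified.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```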
Additive Smoothing
Additive smoothing (also called Laplace smoothing) is a technique used to estimate probabilities from counts. It works by adding a small constant to every observed count before normalizing, so that events that never appeared in the data still receive a small, nonzero probability. This prevents a model from treating unseen events as impossible.
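Here is a minimal sketch of additive smoothing on word counts; the vocabulary and counts are invented.

```python
# A minimal sketch of additive (Laplace) smoothing on word counts.
counts = {"cat": 4, "dog": 1, "fish": 0}   # "fish" was never observed
alpha = 1.0                                 # smoothing constant
vocab = len(counts)
total = sum(counts.values())

# Add alpha to every count so no word receives zero probability.
probs = {w: (c + alpha) / (total + alpha * vocab) for w, c in counts.items()}
print(probs)   # "fish" now gets a small, nonzero probability
```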
Algorithm
An algorithm is a step-by-step procedure for solving a problem or accomplishing a task.
Analytics
Analytics is the process of examining data to uncover hidden patterns and insights. It can be used to improve business decisions, understand customers, detect fraud, or understand what is happening in the world.
AngularJS
AngularJS is a JavaScript framework for building web applications. It lets you use HTML as your template language and extends the HTML vocabulary with directives for defining your application’s user interface. It also provides a model–view–controller (MVC) framework that helps you structure and manage your application code.
Artificial Intelligence (AI)
Artificial intelligence (AI) is the ability of a computer program to learn from data, recognize patterns, and make decisions with minimal human intervention.
AutoML
AutoML (automated machine learning) is a set of techniques for automatically selecting, configuring, and optimizing machine learning models.
Average Precision (AP)
Average Precision (AP) is a metric that summarizes a precision-recall curve as the weighted average of the precision values achieved at each recall level. It is commonly used to evaluate ranking and object-detection models.
B
View Data Science & Machine Learning Terms beginning with the Letter “B”
Backpropagation
Backpropagation is an algorithm used to train artificial neural networks. It computes the gradient of the cost function with respect to the weights of the neurons in the network, working backwards from the output layer; gradient descent then uses this gradient to update the weights.
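As a hand-worked illustration, the sketch below applies the chain rule to a single sigmoid neuron; all the numbers (input, target, weight, learning rate) are invented.

```python
# A tiny hand-worked backpropagation step for one sigmoid neuron,
# showing the chain rule that computes dLoss/dw.
import math

x, y_true = 1.5, 1.0          # input and target (illustrative values)
w, b = 0.4, 0.1               # current weight and bias
lr = 0.5                      # learning rate

z = w * x + b                 # pre-activation
y = 1 / (1 + math.exp(-z))    # sigmoid activation
loss = 0.5 * (y - y_true) ** 2

# Chain rule: dL/dw = dL/dy * dy/dz * dz/dw
dL_dy = y - y_true
dy_dz = y * (1 - y)
dz_dw = x
grad_w = dL_dy * dy_dz * dz_dw

w -= lr * grad_w              # gradient descent uses the computed gradient
print(f"loss={loss:.4f}  grad_w={grad_w:.4f}  new w={w:.4f}")
```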
Bagging
Bagging is a technique for improving the accuracy of predictions made by a machine learning algorithm. It works by training multiple models on different bootstrap samples of the data, and then combining their predictions, typically by voting or averaging.
Bayes’ Theorem
Bayes’ Theorem is a mathematical formula that calculates the likelihood that a particular event will happen, based on the conditional probability of that event occurring given that some other event has occurred. Formally, P(A|B) = P(B|A) · P(A) / P(B).
Bayesian Inference
Bayesian Inference is a statistical method that combines prior knowledge with observed data to produce updated (posterior) probabilities about unknown quantities or future events.
Bayesian Network
A Bayesian network, also called a belief network, is a probabilistic graphical model that represents a set of random variables and their conditional dependencies. Variables are represented as nodes in a directed acyclic graph, and dependencies as directed edges between them.
Bias
Bias is a systematic error in judgment or decision making, resulting in judgments that are not completely impartial. It can refer to either statistical bias or cognitive bias. Statistical bias is a systematic deviation of an estimate from the true value being estimated; cognitive bias is a systematic pattern of deviation from rational judgment.
Big Data
Big Data is a term used to describe the large volume of data that organizations collect and store. The popular “three Vs” framing of big data (volume, velocity, and variety) was introduced in 2001 by the analyst Doug Laney, and the term is now used broadly for datasets too large or complex for traditional tools.
Binomial Distribution
A binomial distribution is a statistical distribution that gives the probability of a certain number of successes in a series of n independent Bernoulli trials.
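As a quick illustration, SciPy can evaluate binomial probabilities directly; the coin-flip setup below is invented.

```python
# Probability of exactly k successes in n independent Bernoulli trials.
from scipy.stats import binom

n, p = 10, 0.5                      # 10 fair coin flips
print(binom.pmf(5, n, p))           # P(exactly 5 heads) ~ 0.246
print(binom.cdf(5, n, p))           # P(at most 5 heads)
```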
Bivariate Analysis
Bivariate Analysis is the examination of two sets of data, typically in order to identify any correlations between them. It can be used to inform further analysis, or to gain a better understanding of the data.
Boosting
Boosting is a machine learning technique for improving the accuracy of predictions. It is a type of ensemble learning, which combines the predictions of several individual models in order to produce a stronger overall model.
Bootstrap
Bootstrap is a resampling technique that estimates properties of a statistic from a limited number of observations by repeatedly sampling, with replacement, from the observed data. It is used to build confidence intervals or to approximate the sampling distribution of a statistic.
Boxplot
A boxplot is a graphical representation of the five-number summary of a dataset. The five-number summary includes the minimum, first quartile, median, third quartile, and maximum of a dataset.
Business Intelligence (BI)
Business Intelligence (BI) is the process of gathering, organizing, and analyzing data to help business leaders make better decisions. BI can include everything from analyzing sales data to tracking customer behavior.
C
View Data Science & Machine Learning Terms beginning with the Letter “C”
Categorical
A categorical variable is a type of data that takes one of a limited set of possible values. These values can be either numbers or text, and they are usually organized into categories.
Categorical Variable
A categorical variable is a data type that represents qualitative information. For example, colors (red, blue, green), genders (male, female), and religions (Catholic, Protestant, Muslim) are all categorical variables.
Census
A census is a survey of all the people in a particular area or population, used to collect information about them.
Chi-Square Test
The chi-square test is a statistical hypothesis test used to determine the significance of the difference between the observed and expected values of a given data set. The chi-square statistic is used to compare the frequencies observed in the data against the frequencies expected under the null hypothesis.
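As a minimal sketch, SciPy’s chi2_contingency runs a chi-square test of independence on a contingency table; the observed counts below are invented.

```python
# A chi-square test of independence on a 2x2 contingency table.
from scipy.stats import chi2_contingency

observed = [[30, 10],    # e.g., group A: outcome yes / no
            [20, 20]]    # e.g., group B: outcome yes / no

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}  p={p:.4f}  dof={dof}")
```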
Classification
Classification is the task of taking a set of observations and assigning one of a predetermined number of classes to each observation. It is a core supervised learning technique in data science.
Clustering
Clustering is a technique used in data science to group similar items together. This can be useful for organizing data and understanding relationships between different groups of data.
Coefficient
A coefficient is a number that measures how much one variable changes when another variable changes.
Computational Linguistics
Computational linguistics is the application of artificial intelligence and machine learning techniques to natural language processing tasks. It encompasses areas such as parsing, machine translation, and text classification.
Confidence Interval
A confidence interval is a range of values, calculated from sample statistics, that is likely to contain the true value of a population parameter at a stated confidence level (for example, 95%).
Confusion Matrix
A confusion matrix is a table used to summarize the performance of a classification algorithm. The table shows how many times each class was predicted by the algorithm and how those predictions compare with the actual classes.
Continuous Variable
A continuous variable is a mathematical construct that can take on any value within a given range. In contrast, discrete variables can only take on specific, discrete values. Continuous variables are important for modeling quantities such as height, weight, and time.
Covariance
Covariance is a measure of how two variables change together. It is calculated as the average of the products of the two variables’ deviations from their respective means. Dividing the covariance by the product of the two standard deviations gives the correlation coefficient.
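A quick numerical check of the relationship between covariance and correlation; the data points are invented.

```python
# Covariance, and covariance rescaled into correlation.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y)[0, 1]                        # sample covariance
corr_xy = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(cov_xy, corr_xy)
print(np.corrcoef(x, y)[0, 1])                     # matches corr_xy
```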
Cross Validation
Cross validation is a technique used in data science to estimate how well a model will generalize to unseen data. It works by repeatedly splitting the data into a training set and a testing set: the model is fit on the training set, evaluated on the testing set, and the scores are averaged across the splits.
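A minimal sketch of 5-fold cross validation with scikit-learn; the iris dataset and logistic regression model are just convenient stand-ins.

```python
# k-fold cross validation: fit on k-1 folds, score on the held-out fold, repeat.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```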
D
View Data Science & Machine Learning Terms beginning with the Letter “D”
D3
D3 (Data-Driven Documents) is a JavaScript library for data-driven documents. It helps you create data visualisations using HTML, SVG, and CSS, and it makes it easy to bind data to DOM nodes so you can manipulate and style elements based on your data.
Data
Data is the plural of datum, which is a single piece of information. In data science, data refers to the large set of information that is collected and analyzed. This information can be in any form, including text, numbers, images, or binary code. Data science is the process of extracting insights from that data.
Data Analysis
Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of extracting useful information from it. It is used to gain insights and understanding from data sets, for example by uncovering patterns, trends, and relationships.
Data Analyst
A data analyst is a professional who is responsible for interpreting data and presenting it in a way that is easy to understand. They work with teams of data scientists and engineers to help turn raw data into insights that can be used to make decisions.
Data Breach
A data breach is an incident in which sensitive, confidential, or private data is accessed or released without authorization.
Data Collection
Data collection is the process of gathering data, often from different sources, for analysis. This can be done through surveys, interviews, focus groups, or other methods.
Data Engineer
A data engineer is a professional who creates and maintains the data pipelines that allow a company to make use of data science. Data engineers are responsible for ensuring that data is correctly collected, cleansed, and organized, so that it can be used by data scientists to glean insights.
Data Engineering
Data engineering is the practice of building systems that collect, transform, and organize data into forms that can be used by business analysts, managers, and other decision-makers. It involves creating pipelines, models, and tools that make data more accessible and useful.
Data Frame
A data frame is a rectangular, two-dimensional table of data consisting of rows and columns. The data in each column has the same type, and the order of the columns is defined by the programmer.
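A minimal pandas data frame as a concrete example; the column names and values are invented.

```python
# Rows are observations, columns are typed fields.
import pandas as pd

df = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer": ["Ana", "Ben", "Cho"],
    "amount": [25.0, 40.5, 13.2],
})

print(df.dtypes)                 # each column has a single type
print(df[df["amount"] > 20])     # filter rows by a column condition
```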
Data Governance
Data governance is the process of implementing structure, controls, and processes around the management of data. It aims to ensure that data is consistently accurate, complete, and accessible across the enterprise, and it helps identify and protect sensitive information while maximizing its value.
Data Integration
Data integration is the process of combining data from multiple sources into a single coherent dataset. This can be done manually, but more often it is done with software that can combine the data automatically. The goal of data integration is to make the data easier to analyze.
Data Lake
A data lake is a term used in big data management to describe a storage repository that holds a large volume of raw data in its native format. The data in a data lake can be processed and analyzed by the business users who own it, without having to be restructured in advance.
Data Mining
Data mining is the process of extracting valuable information from large data sets. This information can be used to make decisions about business operations, product development, and other strategic initiatives. Data mining involves using sophisticated algorithms to identify patterns and trends in data.
Data Model
A data model is a conceptual representation of data that defines how data is structured and how it is accessed. A data model can be used to represent data in a database, in an application, or in any other system that stores information.
Data Preparation
Data preparation is the process of getting data ready for analysis. This often includes cleaning up the data, removing outliers, and transforming it into a form that is suitable for the analysis that will be performed.
Data Quality
Data quality is a measure of how accurate and consistent data is. It can be assessed by looking at factors such as the completeness, accuracy, and timeliness of the data. Data quality is of paramount importance, because unreliable data leads to unreliable conclusions.
Data Science
Data science is the study of data and the application of statistical analysis, machine learning, and other computational techniques to extract knowledge from data. It combines science, mathematics, and technology to analyze data and make predictions.
Data Scientist
A data scientist is a professional who extracts knowledge and insights from data in order to solve problems, using extensive scientific, mathematical, and computer programming knowledge to draw meaningful conclusions from vast amounts of data.
Data Structure
A data structure is a way of organizing and storing data so that it can be accessed and used efficiently. Data structures can be simple, like an array of numbers, or more complex, like a tree.
Data Visualization
Data visualization is the process of transforming data into a graphical representation that is easier to understand. This can be done in order to identify patterns, trends, and correlations that would otherwise be hidden in a table of data.
Data Wrangling
Data wrangling is the process of manipulating and cleaning data in order to prepare it for analysis. This can involve anything from removing duplicates to transforming data into a different format. Data wrangling can be a time-consuming process, but it is essential for ensuring that data is ready for analysis.
Database Administrator
A database administrator (DBA) is a professional who is responsible for the design, creation, installation, monitoring, and maintenance of a database. DBAs are also responsible for ensuring that the database is available to users and that data is accurate and secure.
Database Design
Database design is the process of designing the structure of a database. This includes defining the tables, fields, and relationships between them. Database design matters because it lays the foundation for how data is stored, accessed, and related.
Database Management System
A Database Management System (DBMS) is a software system that enables users to create and manage databases. A DBMS typically consists of a graphical user interface (GUI), a database engine, and a storage mechanism. The GUI allows users to create and modify database structures and to enter and edit data.
Decision
A decision is a choice that is made between two or more possibilities. Decisions are a crucial part of the data science and machine learning process, since they allow us to act on the information collected from data.
Decision Tree
A decision tree is a graphical representation of a decision process, used to help explain the logic of a decision. The tree has nodes, which represent choices, and branches, which represent the possible outcomes of each choice. The leaves of the tree represent the end results of the decision process.
Deep Learning
Deep learning is a type of machine learning that uses multiple layers of nonlinear processing units, called neurons, to learn representations of data. Deep learning architectures can learn to represent data in ways that are more accurate and efficient than shallow architectures.
Dependent Variable
A dependent variable (also called the response variable) is a variable whose value is determined by the value of one or more other variables (the independent variables). The dependent variable is usually the focus of a data analysis, since it is the quantity that is being measured or estimated.
Dimension Reduction
Dimension reduction is a data pre-processing technique that reduces the number of dimensions in a dataset while preserving most of the information. It is often used to improve performance and scalability when working with high-dimensional datasets.
Discrete Variable
A discrete variable is a variable that can only take on certain, specific values. Contrast this with a continuous variable, which can take on any value within a given range. Discrete variables are often used in statistical models to represent outcomes that can only happen a whole number of times.
E
View Data Science & Machine Learning Terms beginning with the Letter “E”
E-commerce
E-commerce is a term for the buying and selling of goods and services over the internet. It usually refers to the sale of goods and services by businesses to consumers, but can also describe the purchase of goods and services by businesses from other businesses.
Econometrics
Econometrics is a quantitative research field in economics that uses mathematical and statistical methods to analyze economic data. It utilizes mathematical models and statistical concepts to evaluate economic theories and study a variety of economic questions.
Eigenvector
An eigenvector of a matrix is a nonzero vector whose direction is unchanged when the matrix is applied to it: the matrix simply scales the vector by a factor called the eigenvalue. Eigenvectors are a central tool in linear algebra and appear throughout data science, for example in principal component analysis.
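A quick numpy check that a matrix times an eigenvector equals the eigenvalue times that eigenvector; the matrix is an arbitrary example.

```python
# Verify A @ v == lambda * v for an eigenpair of A.
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]           # first eigenvector (a column)
lam = eigenvalues[0]
print(A @ v, lam * v)            # the two vectors match
```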
Embedding
Embedding is a technique used in machine learning for representing high-dimensional data in a low-dimensional space. It is used to improve the performance of algorithms by reducing the number of parameters that need to be estimated or optimized, which in turn improves the efficiency of machine learning systems.
Endpoint
An endpoint is the point of contact between two systems, or in the context of data science, the point where data is received and sent. In a pipeline, the endpoint is typically the last stage before the data is output.
Enrichment
In data science, enrichment is the process of expanding or enhancing a dataset with additional information. This can be done in order to improve the accuracy of predictions or to gain a deeper understanding of the data. Enrichment can be performed manually, by adding new data points to the dataset, or automatically, by joining in data from other sources.
Ensemble Learning
Ensemble learning is a technique that combines the predictions or classifications of multiple machine learning models in order to produce better results than any single model could achieve on its own.
Ensembling
Ensembling is a technique used in machine learning that consists of combining the predictions of multiple models in order to produce a more accurate overall prediction.
Entropy
Entropy is a measure of the unpredictability of a system. In information theory, the entropy H(X) of a random variable X is the average amount of information conveyed by its possible outcomes, defined as H(X) = −Σ p(x) log p(x). The more uncertain the variable, the higher its entropy.
Epoch
In machine learning, an epoch is one complete pass of the training algorithm through the entire training dataset. Models are typically trained for many epochs, with the model’s parameters updated throughout each pass.
Error
An error is an incorrect result produced by a calculation. In data science, an error is an inconsistency or inaccuracy in data. Errors can be the result of incorrect measurements, incorrect entry of data, or simply a mistake. In order to ensure the accuracy of data, it is important to identify and correct errors.
Euclidean Distance
The Euclidean distance between two points is the length of the straight line between them. It measures distance between points in a multidimensional space and is also known as the “straight line” or “as-the-crow-flies” distance.
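A minimal sketch computing the distance by hand and with numpy; the two points are invented.

```python
# Euclidean distance: square root of the sum of squared differences.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

manual = np.sqrt(np.sum((a - b) ** 2))
print(manual, np.linalg.norm(a - b))     # both give 5.0
```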
Evaluation
Evaluation is the process of assessing how well a model or system is performing, typically by measuring its accuracy, precision, recall, or some other performance metric. Evaluation is an important part of the data science process, as it allows you to determine whether your models are meeting your expectations.
Exact Match
Exact match is a term used in data science to describe a type of search or matching algorithm that compares two strings of text and determines whether or not they are identical.
Exact p-value
The exact p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. Exact p-values are used to assess the strength of the evidence against the null hypothesis.
Expectation Maximization
Expectation Maximization (EM) is a statistical algorithm used to find the maximum likelihood estimate of a parameter in a probabilistic model. EM iteratively maximizes the expected likelihood of the data under the model by adjusting the model’s parameters.
Exploratory Data Analysis
Exploratory data analysis (EDA) is the examination of data to summarize, visualize, and discover patterns. EDA is used to identify which variables are important and to develop hypotheses about the relationships between variables.
F
View Data Science & Machine Learning Terms beginning with the Letter “F”
F-Measure
F-Measure is a statistic used in machine learning to measure the effectiveness of a classification model. It is the harmonic mean of precision and recall: precision is the fraction of predicted positives that are correct, and recall is the fraction of actual positives that are found.
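A minimal sketch computing precision, recall, and F1 from raw counts; the counts are invented for illustration.

```python
# Precision, recall, and their harmonic mean (F1).
tp, fp, fn = 40, 10, 20           # true positives, false positives, false negatives

precision = tp / (tp + fp)        # correct positives / predicted positives
recall = tp / (tp + fn)           # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```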
F-Test
An F-test is a statistical test used to determine the significance of a difference between two variances. It compares the variability of two populations and determines whether there is a significant difference between the two variances.
Facet
A facet is a property or attribute of an object that can be measured or quantified. In data science, facets are often used to group and filter data sets based on specific criteria. For example, a data set might be grouped by country of origin, age group, or income level.
False Negative
A false negative is a result of a test where a condition that is true is incorrectly reported as being false. False negatives can have a significant impact and be costly in data science and machine learning projects, because they represent real cases the model failed to detect.
False Positive
A false positive is a result that incorrectly identifies an event as being positive. In data science and machine learning, it refers to an incorrect classification of an item as positive when, in reality, it is negative.
Feature
A feature is a characteristic of something, often used to describe data. In machine learning, features are the variables that are used to train models and make predictions. Good features are important for making accurate predictions, so it is important to select the most relevant features for your data set.
Feature Engineering
Feature engineering is the process of transforming raw data into a form that is more amenable to analysis or machine learning. This can involve things like aggregating data, transforming variables, or creating new features from existing variables. Feature engineering is an important part of data science, as it can strongly affect model performance.
Feature Selection
Feature selection is the process of choosing which features (or attributes) of a data set to use in order to solve a problem. This is an important step in data science, as it can help reduce the complexity of a problem and improve the accuracy of predictions or models.
Frequency Table
A frequency table is a table that shows how often each value in a data set occurs. Frequency tables are used to display and quickly analyze the distribution of a set of data, listing each value alongside its count.
Frequentist
A frequentist is a statistician who interprets probability as the long-run frequency of events in repeated trials. Frequentist methods, such as hypothesis tests and confidence intervals, rely on the law of large numbers rather than on prior probabilities.
Function
A function is a named block of code that performs a specific task. Functions can be used to calculate things or to make decisions, and they can be reused wherever that task is needed. Functions are a key building block in the programming languages used for data science and machine learning.
Functional Programming
Functional programming is a style of programming in which the programmer focuses on functions instead of objects. In functional programming, functions are treated as first-class citizens, meaning they can be passed around and used like any other variable. Functional programming languages typically emphasize simplicity and purity, meaning that functions avoid side effects.
G
View Data Science & Machine Learning Terms beginning with the Letter “G”
G-Model
The G-Model is a data science model that is used to predict future events. It can be used to predict the behavior of customers, the sales of products, or the outcome of elections.
GATE
GATE is an acronym for “General Architecture for Text Engineering.” GATE is an open-source software framework, developed at the University of Sheffield, for research and development of advanced natural language processing (NLP) applications.
Gaussian Distribution
A Gaussian or normal distribution is a type of probability distribution that is bell-shaped and symmetrical. This distribution is often used in statistics to model real-world data.
Gradient Boosting
Gradient boosting is a machine learning technique that combines a number of weaker models to produce a stronger model. It builds the ensemble sequentially: each new model is trained to correct the remaining errors of the models built so far, following the gradient of the loss function.
Gradient Descent
Gradient descent is a popular optimization algorithm used in machine learning and data science. The goal of gradient descent is to find the minimum value of a function by iteratively stepping in the direction of the negative gradient. The algorithm takes as input an initial guess for the solution and the derivative (gradient) of the function being minimized.
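A minimal sketch minimizing a simple one-dimensional function; the starting point and learning rate are arbitrary.

```python
# Gradient descent on f(x) = (x - 3)^2, which has its minimum at x = 3.
def grad(x):
    return 2 * (x - 3)        # derivative of (x - 3)^2

x = 0.0                       # initial guess
lr = 0.1                      # step size (learning rate)
for _ in range(50):
    x -= lr * grad(x)         # step in the direction of the negative gradient

print(x)                      # converges toward 3
```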
Growth Curve
A growth curve is a graphical representation of the change in some quantity over time. It can show how the quantity changes over the entire range of time being studied, or it can focus on more specific intervals of time.
H
View Data Science & Machine Learning Terms beginning with the Letter “H”
Hadoop
Hadoop is an open-source software framework that supports the processing of large data sets in a distributed computing environment. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
HBase
HBase is an open-source, column-oriented, non-relational database system that provides high-throughput access to large tables. It is designed for fast reads and writes of large datasets distributed across a cluster.
HDFS
HDFS (Hadoop Distributed File System) is a distributed file system that enables high-throughput access to data across large clusters of commodity servers. HDFS is designed to scale to support very large data sets, up to petabytes in size.
Histogram
A histogram is a graphical representation of the distribution of data. It is created by dividing the range of data into a series of equal intervals, and then counting the number of data points that fall into each interval.
HSQLDB
HSQLDB (HyperSQL Database) is a relational database management system (RDBMS) written in Java. It implements the Java Database Connectivity (JDBC) API and can be used to create and populate tables, to query data, and to update data.
Hyperplane
A hyperplane is a mathematical concept used in linear algebra and machine learning: a flat subspace whose dimension is one less than that of the surrounding space (a line in two dimensions, a plane in three). In machine learning, hyperplanes are used to separate different classes of data, for example as the decision boundary of a support vector machine.
I
View Data Science & Machine Learning Terms beginning with the Letter “I”
Independent Variable
An independent variable is a variable that is manipulated by the experimenter in a scientific study. It is typically one factor that is changed while all other factors are kept constant, and it is used in statistical models to explain or predict the dependent variable.
Inductive Reasoning
Inductive reasoning is a type of inference that starts with specific observations and then draws general conclusions from them. It is an important tool in data science and machine learning, where models generalize from observed examples.
Inference
Inference is the process of using known information to draw conclusions about something that is unknown. In data science, inference is often used to draw conclusions from data that has been collected. This can be used to identify trends or patterns in data, or to make predictions about future events.
Inferential Statistics
Inferential statistics are a type of statistics used to make estimations about populations based on samples. This is done by using the data from the sample to calculate a statistic, which is then used to make an inference about the population.
Input
Input refers to the data that is given to a machine learning algorithm in order to learn from it. The input can be in a number of formats, including text, images, and videos.
Insights
Insights are the findings or conclusions that are drawn from data. Insights can be used to make better business decisions, understand customer behavior, and track progress on strategic initiatives.
Instance
In the context of data science, an instance refers to a single occurrence of a set of data. For example, if you have a table of data that includes information on customer orders, each row in the table would be considered an instance of that data.
Interactive
Interactive refers to a mode of data analysis that allows the user to make changes to the data and see the results immediately. This is in contrast to a more traditional mode of data analysis, where the user makes changes to a model and then observes the results.
Interpretability
Interpretability is a measure of how easily a model’s predictions can be explained to humans. Models that are easy to interpret are more likely to be trusted and used in decision-making processes.
Intuitive
Intuitive means easily understood or grasped. Intuitive processes allow for a more user-friendly experience, since users can quickly and easily understand what they need to do in order to complete a task.
Iterative
Iterative means “repeatedly doing something.” In data science, this usually refers to the process of repeatedly running a machine learning or deep learning algorithm on a dataset in order to improve the accuracy of the predictions made by the algorithm.
J
View Data Science & Machine Learning Terms beginning with the Letter “J”
Jaccard Index
The Jaccard Index is a statistic used to measure the similarity of two sets. It is calculated by dividing the number of elements common to both sets by the total number of distinct elements in their union.
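A minimal sketch on two small sets; the elements are invented.

```python
# Jaccard index of two sets: |intersection| / |union|.
a = {"red", "green", "blue"}
b = {"green", "blue", "yellow"}

jaccard = len(a & b) / len(a | b)
print(jaccard)    # 2 shared / 4 total distinct = 0.5
```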
Jacobi Matrix
A Jacobi matrix usually refers to the Jacobian: a matrix of first-order partial derivatives of a vector-valued function. The name also appears in the Jacobi iterative method, which is used in the numerical solution of systems of linear equations.
Jacobian
The Jacobian is the matrix of first-order partial derivatives of a vector-valued function, evaluated at a given point in space. It is used in calculus and vector calculus, for example to help determine the local maxima or minima of a function and to describe how a function behaves near a point.
JavaScript
JavaScript is a programming language that enables developers to create complex websites and applications. JavaScript code is executed in the browser, which makes it a powerful tool for front-end development. Additionally, JavaScript can be used to create back-end functionality with Node.js.
Jitter
Jitter has two common meanings. In signal processing and time-series work, it is a measure of the variability of the time between samples in a data set. In data visualization, jitter is small random variation added to data points (for example, in a scatterplot) to reduce overplotting and make individual points easier to see.
Join
A join is an operation that merges two data tables based on a common attribute (key). The result of the join is a new table whose rows combine the matching rows from both input tables.
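A minimal pandas join as an illustration; the table names and values are invented.

```python
# An inner join: rows are matched on the shared "customer_id" key.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cho"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [25.0, 40.5, 13.2]})

# Inner join keeps only keys that appear in both tables.
merged = customers.merge(orders, on="customer_id", how="inner")
print(merged)
```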
Joint Distribution
A joint distribution is a way of representing the probability of two or more events occurring simultaneously. It is a statistical tool used to analyze the relationship between two or more random variables.
Joint Probability
Joint probability is a measure of the likelihood that two or more events will occur simultaneously. For independent events, it is calculated by multiplying the individual probabilities of each event. For example, if there is a 50% chance of rain on any given day and an independent 40% chance of a thunderstorm, the joint probability of both is 0.5 × 0.4 = 0.20.
Jordan Canonical Form
The Jordan canonical form of a square matrix is a block-diagonal matrix, made up of Jordan blocks, to which the matrix is similar. Each Jordan block has a single eigenvalue repeated along its diagonal and ones on the superdiagonal; the form exists whenever the characteristic polynomial factors completely, for example over the complex numbers.
JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and for machines to parse and generate. JSON is based on a subset of the JavaScript language, which uses curly braces { } to enclose objects, while arrays are denoted by square brackets [ ].
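A quick round-trip example with Python’s standard library; the document content is invented.

```python
# Parse JSON text into Python objects and serialize back again.
import json

text = '{"name": "Ana", "scores": [88, 92], "active": true}'
obj = json.loads(text)            # parse JSON text into a dict
obj["scores"].append(95)
print(json.dumps(obj, indent=2))  # serialize back to JSON text
```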
K
View Data Science & Machine Learning Terms beginning with the Letter “K”
K-means clustering
K-means clustering is a data mining algorithm used to partition a set of data points into k clusters. Data is divided into clusters based on the similarities of the points within each cluster. This algorithm is often used to segment customers into different groups for marketing purposes.
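A minimal sketch with scikit-learn on toy 2-D points; the two blobs are invented.

```python
# k-means with scikit-learn on two obvious blobs of points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one blob
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])   # another blob

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # the two learned centroids
```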
K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a machine learning algorithm that predicts the output value for a given input by finding the k nearest neighbors of that input. For classification the prediction is the majority class among the neighbors, and for regression it is the (possibly distance-weighted) average of their values.
Kappa Statistic
The kappa statistic is a measure, used in machine learning and data science, of the agreement between predicted values and observed values, corrected for the agreement expected by chance. It is calculated as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.
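A small sketch comparing kappa with raw accuracy, using scikit-learn; the label vectors are invented.

```python
# Cohen's kappa corrects raw agreement for chance agreement.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

print(accuracy_score(y_true, y_pred))      # 0.75
print(cohen_kappa_score(y_true, y_pred))   # 0.50: lower, corrects for chance
```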
Kernel
A kernel is a mathematical function that takes two inputs, typically vectors in a high-dimensional space, and outputs a scalar. The kernel function defines a similarity or distance between the two vectors. Kernels are used in algorithms such as support vector machines, where the “kernel trick” lets a model learn nonlinear boundaries without explicitly computing high-dimensional coordinates. (In operating systems, “kernel” means something different: the core component that interfaces between software and hardware.)
Kernel Density Estimation
Kernel density estimation is a technique used to estimate the probability density function of a random variable. It is often used to smooth out noisy data or in cases where the exact distribution of the data is unknown.
Kernel Regression
Kernel regression is a type of nonlinear regression that uses a kernel function to calculate the weight of each input data point. This type of regression is often used for time series data, where the input points are close together in time.
Kolmogorov-Smirnov Statistic
The Kolmogorov-Smirnov statistic is a measure of the difference between two distributions. The Kolmogorov-Smirnov test (K-S test) is a nonparametric test that uses this statistic to determine whether two samples are drawn from statistically different distributions.
Kurtosis
Kurtosis is a measure of the peakedness or flatness of a distribution, determined by calculating the fourth moment of the distribution about its mean. Distributions with high kurtosis are more peaked and heavy-tailed than those with low kurtosis, while distributions with low kurtosis are more spread out.
L
View Data Science & Machine Learning Terms beginning with the Letter “L”
Labeling
Labeling is the process of attaching a label, or name, to a particular instance of data. This can be done manually or through automated means. Labels can be used to help identify and group data, as well as to track changes over time.
Laplace Approximation
The Laplace approximation is a method used in mathematics to approximate integrals and probability distributions by fitting a Gaussian around the maximum of the function. It is named after the mathematician Pierre-Simon Laplace, who first proposed it in 1774, and it relies on the assumption that the function is smooth near its peak.
LASSO
LASSO, the “Least Absolute Shrinkage and Selection Operator,” is a type of regression that adds an L1 penalty on the size of the coefficients. The penalty shrinks some coefficients exactly to zero, so the model automatically selects a subset of the independent variables for inclusion.
Latent Class Analysis
Latent class analysis (LCA) is a statistical technique used to identify unobserved (latent) classes within a population. These latent classes are clusters of data points with similar characteristics, which can then be interpreted as subgroups of the population.
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for topic modeling of text. Each document is represented as a mixture of a finite number of topics, and each topic as a distribution over words. The topics are latent, meaning they are not observed directly but are inferred from the words in the documents.
Latent Variable
In statistics, a latent variable is a hypothetical construct that explains the observed variability in a set of measured variables. Latent variables are unobservable or hidden variables that cannot be directly measured, but rather must be inferred from other observed variables.
Latin Hypercube Sampling
Latin Hypercube Sampling (LHS) is a method for constructing a point sample from a probability distribution. The most common use case for LHS is in Monte Carlo simulations, where the goal is to approximate the distribution of a function by taking repeated samples from it.
Leakage
In machine learning, leakage (or data leakage) occurs when information that would not be available at prediction time — such as the target variable or data from the test set — inadvertently influences model training, producing misleadingly optimistic results. In a security context, leakage refers to confidential or private data being released to unauthorized individuals, which can expose personal information such as names, addresses, phone numbers, and financial details.
Level of Detail
A level of detail (LOD) is a measure of how much information is included in a data set. Larger data sets typically have more detail, while smaller data sets may only include basic information. When working with data, it is important to understand the level of detail it contains.
Lift
Lift is a measure of how much better a model predicts the value of a target variable than a random guess would. Lift is often used as a measure of how good a model is at discriminating between different groups.
Linear Algebra
Linear algebra is the study of vectors, matrices, and the linear equations that relate them. It is a powerful tool for solving problems in physics and engineering, and it has many applications in data science and machine learning.
Linear Regression
Linear regression is a statistical technique that helps us understand how one variable (the dependent variable) changes when other variables (the independent variables) change. It does this by fitting a line through a set of data points, and then using the line to predict the value of the dependent variable.
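A minimal fit with scikit-learn; the data points are invented and roughly follow y = 2x.

```python
# Fit a line to points and use it to predict a new value.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])     # independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])    # dependent variable

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)     # fitted slope and intercept
print(model.predict([[6]]))                 # predict y for a new x
```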
Logarithm
The logarithm is a mathematical function that calculates the power to which a base number must be raised to produce a given number. It is written as log(x) and is read “log of x.”
Logistic Regression
Logistic regression is a machine learning algorithm used for classification. It is a generalized linear model in which the outcome variable is categorical rather than continuous. Logistic regression is used to predict the probability of a particular event occurring, such as whether or not a customer will churn.
M
View Data Science & Machine Learning Terms beginning with the Letter “M”
Machine Learning
Machine learning is a subfield of artificial intelligence that enables computers to learn from data without being explicitly programmed. It focuses on the development of algorithms that can automatically learn to improve performance on a task as they are exposed to more data.
Machine Learning Model
A machine learning model is a representation of the patterns that a machine learning algorithm has learned from data. The model can be used to predict the outcomes of future events, based on the data that was used to train it.
Markov Chain
A Markov chain is a mathematical model for sequences of random events, where the probability of any event depends only on the immediately preceding event. Markov chains are stochastic models used to describe systems that move between states with fixed transition probabilities.
Mathematical Statistics
Mathematical statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It also provides techniques for estimating properties of populations from samples. Mathematical statistics is distinguished from other statistical techniques in that it relies on mathematical theory as a foundation.
MATLAB
MATLAB is a software suite, developed by MathWorks, for high-performance numerical computation, visualization, and programming. It integrates mathematical computing, simulation, and graphical output into a single environment and is used extensively in engineering and scientific fields.
Matrix
A matrix (plural matrices) is a rectangular array of numbers, symbols, or other objects; the individual items in a matrix are called its elements. Matrices are used throughout mathematics and data science. Related: “The Matrix” is an awesome movie from 1999.
Mean
The mean is the average of a set of numbers, a statistical measure of central tendency. It is calculated by adding up the values and dividing the sum by the number of values.
Mean Absolute Error
Mean Absolute Error (MAE) is a measure of the accuracy of predictions made by a model. It is computed by taking the average of the absolute differences between the predicted values and the actual values for each observation.
Mean Squared Error
Mean Squared Error (MSE) is a statistic used to measure the accuracy of predictions made by a machine learning model. It is calculated by taking the sum of the squared differences between the predicted values and the actual values for each data point, and dividing by the number of data points.
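A minimal sketch computing MAE and MSE side by side; the values are invented. Note how squaring makes MSE penalize large errors more heavily.

```python
# MAE and MSE for the same predictions.
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))       # average absolute difference
mse = np.mean(errors ** 2)          # average squared difference

print(mae, mse)
```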
Measurement
A measurement is a quantifiable value that is assigned to a variable; measurement is the process of quantifying a property of an object. Measurements play an important role in data science and machine learning, where they are used to quantify and compare models, features, and data points.
Median
The median is the middle value in a data set when it is sorted in ascending order. If there is an odd number of data points, the median is the middle value; if there is an even number of data points, the median is the average of the two middle values.
Meta-analysis
A meta-analysis is a literature review of qualitative and quantitative studies that have been published on a specific topic. The goal of a meta-analysis is to summarize the findings of these studies, combine their results statistically, and identify patterns in the data.
Metadata
Metadata is data that describes other data. It can include information such as the date a file was created, the author of a document, or the keywords used to identify a piece of content. Metadata can be used to help organize and find information, and to track changes to data over time.
Mode
The mode is the value that occurs most frequently in a data set. It is a statistical measure of the typical value of the data and is most useful when dealing with nominal data.
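A quick contrast of mean, median, and mode on the same invented data, using the standard library.

```python
# Three measures of central tendency on one data set.
import statistics

data = [1, 2, 2, 3, 4, 7, 9]
print(statistics.mean(data))    # 4.0  (sum divided by count)
print(statistics.median(data))  # 3    (middle value when sorted)
print(statistics.mode(data))    # 2    (most frequent value)
```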
Model
A model is a representation of something, built in order to understand it or predict its behavior. In machine learning, a model is a mathematical function that is used to predict the value of a target variable, given a set of input variables.
Modeling
Modeling is the process of creating a mathematical representation of a real-world phenomenon, using past data to understand and predict future outcomes. It can involve creating mathematical models to represent relationships in the data, or using machine learning techniques to build models directly from data.
Monte Carlo method
The Monte Carlo method is a mathematical technique used to calculate the probability of an event by simulating many possible outcomes through random sampling. It is used to solve problems that are difficult or impossible to handle analytically.
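The classic toy example: estimating π by random sampling. The sample size is arbitrary.

```python
# Estimate pi: sample random points in the unit square and count
# the fraction that land inside the quarter circle of radius 1.
import random

n = 1_000_000
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)

print(4 * inside / n)    # approaches 3.14159... as n grows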
Moving Average
A moving average (MA) is a statistical measure that calculates the average value of a given set of data points over a designated window of time. The MA is typically used to smooth out irregularities or fluctuations in the data and to help identify trends. The most common type is the simple moving average, which weights every point in the window equally.
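A minimal pandas sketch of a 3-point simple moving average; the price series is invented.

```python
# A 3-point simple moving average smooths short-term fluctuation.
import pandas as pd

prices = pd.Series([10, 12, 11, 15, 14, 18, 17])
print(prices.rolling(window=3).mean())   # first two values are NaN
```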
Multivariate
Multivariate refers to a dataset or analysis that considers more than one variable at a time. For example, in a multivariate analysis, you might examine the relationship between height and weight in order to understand how they are related.
Multivariate Analysis
Multivariate analysis is a set of statistical techniques used to analyze data that has more than one variable. The goal of multivariate analysis is to identify relationships between the variables and to find patterns in the data.
Mutation
Definition of Mutation Mutation: A mutation in data science is an alteration of the data, which can be intentional or accidental. How is Mutation used? Mutation is a term that is used when talking about machine learning algorithms, particularly genetic algorithms. It refers to a process of randomly making changes or modifications to the data…
N
View Data Science & Machine Learning Terms beginning with the Letter “N”
N-gram
Definition of N-gram An n-gram is a contiguous sequence of n items from a given text or speech. N-grams are used to study linguistic patterns and to generate text samples.
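A minimal sketch of extracting word-level n-grams; the ngrams helper is illustrative, not a standard API:

```python
def ngrams(tokens, n):
    """Return all contiguous n-item sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox".split()
print(ngrams(words, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```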
Naive Bayes Classifier
Definition of Naive Bayes Classifier A naive Bayes classifier is a machine learning algorithm that relies on the assumption that the features of a given object are independent of each other, given its class. This algorithm is often used for text classification tasks, such as sorting email into spam and non-spam folders.
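A minimal scikit-learn sketch of the spam-sorting use case, with made-up training texts and labels:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "meeting at noon", "free money offer", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]  # made-up labels

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)      # word-count features
model = MultinomialNB().fit(X, labels)

print(model.predict(vectorizer.transform(["free money now"])))  # likely ['spam']
```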
Natural Language Processing (NLP)
Definition of Natural Language Processing (NLP) Natural Language Processing (NLP): Natural language processing (NLP) is a field of computer science and linguistics that deals with the interaction between computers and human languages, and with the development of software that can understand natural language.
Network Analysis
Definition of Network Analysis Network Analysis: Network analysis is the process of studying relationships between entities in a network. Network analysis can be used to understand how information flows through a network, or to identify important nodes in a network.
Neural Network
Definition of Neural Network Neural Network: A neural network is a computer system modeled on the workings of the human brain. It is composed of interconnected processing nodes, or neurons, that can learn to recognize patterns of input data. A neural network is a type of machine learning algorithm that is used to model complex…
Non-Negative Matrix Factorization
Definition of Non-Negative Matrix Factorization Non-Negative Matrix Factorization: Non-negative Matrix Factorization (NMF) is a technique used to decompose a non-negative matrix V into the product of two non-negative matrices, V ≈ WH: a matrix W containing the “feature” values and a matrix H containing the “coefficients” associated with each feature.
Nonlinear
Definition of Nonlinear Nonlinear: Nonlinearity means that the output of a system is not proportional to the input. In data science, this can often manifest as curved lines when graphing data points, rather than a straight line. Nonlinear systems can be more difficult to model and predict future outcomes.
Normal Distribution
Definition of Normal Distribution A normal distribution is a type of bell-shaped distribution in which the majority of the data falls around the mean. This distribution is often used in statistics to model real-world data.
Normalization
Definition of Normalization Normalization: Normalization is the process of standardizing data so that it has a consistent meaning across different data sets. This can be done by ensuring that all values in a data set are within a certain range, or by converting all data to a single numerical representation. Normalization can make it easier…
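One common way to bring all values into a consistent range is min-max scaling; a minimal sketch with made-up values:

```python
values = [50, 60, 80, 100]  # made-up values

lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
print(scaled)  # [0.0, 0.2, 0.6, 1.0]
```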
Normalized
Definition of Normalized Normalized: Normalized is a statistical term referring to the process of adjusting values measured on different scales to a common scale. For example, weights can be normalized by dividing each one by a reference value, such as the heaviest weight in the data set, yielding values between 0 and 1.
NoSQL
Definition of NoSQL NoSQL is a type of database that does not conform to the traditional model of a relational database. It is schema-less, which means that you don’t have to predefine the fields and structure of your data when you add it to the database. This makes it easy to change and adapt as…
Null Hypothesis
Definition of Null Hypothesis The null hypothesis is a statement that is presumed to be true until evidence is presented to suggest otherwise. In statistics, the null hypothesis is typically contrasted with the alternative hypothesis, which is the statement accepted if the null hypothesis is rejected. The null hypothesis is…
Numerical
Definition of Numerical Numerical: Numerical data is data that can be represented as a number. This type of data is often used in statistics and machine learning.
O
View Data Science & Machine Learning Terms beginning with the Letter “O”
Objective Function
Definition of Objective Function An objective function (in machine learning) is a mathematical formula used to calculate the relative merit of each possible solution to a problem. The objective function takes into account the inputs and outputs of a problem, as well as any constraints that may exist. It then calculates a score for each…
Oblique Rotation
Definition of Oblique Rotation Oblique Rotation: Oblique rotation is a type of factor rotation used in factor analysis in which the rotated axes are not required to be perpendicular to one another, allowing the underlying factors to be correlated.
Oblique Sampling
Definition of Oblique Sampling Oblique Sampling: Oblique sampling is a type of non-random sampling technique. It is used when the researcher wants to study a specific population but does not have access to all members of that population. Oblique sampling involves selecting units for the study in a way that is not completely random. The…
Observational Study
Definition of Observational Study Observational Study: An observational study is a study in which data is collected without the researcher intervening or manipulating the variables under study.
Octile
Definition of Octile Octile: An octile is a statistical value that divides a set of data into eight equal parts.
One-hot encoding
Definition of One-hot encoding One-hot encoding: One-hot encoding is a technique used in machine learning to represent categorical variables as vectors of binary values. In one-hot encoding, each category is represented by a vector in which a single position is set to 1 and the remaining values are set to 0. For example, if there are three categories, A, B, and…
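Continuing the three-category example, a minimal pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "C", "A"]})
print(pd.get_dummies(df["category"]))
# Each row is 1/True in its own category's column and 0/False elsewhere.
```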
Open Source
Definition of Open Source Open Source: Open source refers to software for which the original source code is made freely available and may be redistributed and modified.
Operational Definition
Definition of Operational Definition Operational Definition: An operational definition is a specific, quantitative definition of a term, which can be used to measure or calculate it. This definition is typically found in a laboratory setting, where a scientist can observe and measure the term’s properties. For example, the operational definition of the speed of light…
Optimization
Definition of Optimization Optimization: Optimization is the process of making something as good as possible. In data science, this usually means finding the best way to use the data to achieve a goal. This can involve choosing the right algorithm to use, or finding the right settings for that algorithm.
Outlier
Definition of Outlier Outlier: An outlier is a data point that is significantly different from the other points in the dataset. Outliers can be caused by errors in data collection or by natural variations in the data. They can be removed from a dataset before analysis, or they can be studied to learn more about…
Overfitting
Definition of Overfitting Overfitting: Overfitting is a phenomenon that can occur in machine learning when a model begins to “fit” the training data too closely, resulting in poorer performance on new data. This can be caused by excessive use of complex models or too little training data, and can often be avoided by using more…
P
View Data Science & Machine Learning Terms beginning with the Letter “P”
P-Value
Definition of P-Value P-Value: The P-value is a statistic that is used to determine the significance of a result. It is the probability of obtaining a result as extreme or more extreme than the one that was actually observed, given that the null hypothesis is true.
PageRank
Definition of PageRank PageRank is a link analysis algorithm created by Google co-founder Larry Page. It assigns a numerical weight to each page in order to determine its importance within the web. The algorithm is based on the idea that important pages will be linked to by other important pages.
Pandas
Definition of Pandas Pandas is a software library written for the Python programming language that enables data analysis. It provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Pandas is free software released under the 3-clause BSD license.
Parametric
Definition of Parametric Parametric: Parametric models are a type of statistical model that rely on mathematical formulas to describe the relationship between the input and output variables. These models can be used to predict future values for the output variable based on past values of the input variable.
Perceptron
Definition of Perceptron A perceptron is a type of artificial neural network that can be used to learn and classify patterns. It is composed of a number of interconnected processing nodes, or neurons, that can be adjusted based on input data. The perceptron can be used to recognize patterns in data and make decisions accordingly.
Perl
Definition of Perl Perl is a programming language that is used for text processing, system administration, and web development. It is known for its powerful regular expression capabilities.
Pivot Table
Definition of Pivot Table A pivot table is a data analysis tool that allows you to reorganize and analyze your data in a new way. With pivot tables, you can group and summarize data by column, or calculate new values based on existing data.
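A minimal pandas sketch of grouping and summarizing with a pivot table; the table and column names are made up:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "revenue": [100, 150, 200, 120],
})
print(pd.pivot_table(sales, values="revenue",
                     index="region", columns="product", aggfunc="sum"))
```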
Poisson Distribution
Definition of Poisson Distribution The Poisson distribution is a discrete probability distribution that models the number of events that occur in a given time interval. The Poisson distribution is often used to model the number of clicks on a website, the number of phone calls received, or the number of emails sent in a given…
Population
Definition of Population Population: A population is the complete set of all individuals or items under consideration. In statistics, a population is often described by its parameters, such as its mean and standard deviation. The population can also simply be defined as the entire set of items or cases to be studied.
Prediction
Definition of Prediction Prediction: A prediction is a statement made about the future, typically based on data and statistical models. In data science, this is often done through the use of machine learning algorithms that are trained on historical data.
Predictive Analytics
Definition of Predictive Analytics Predictive analytics is the practice of using data mining and machine learning techniques to make predictions about future events. Predictive analytics can be used to predict things such as how likely a customer is to churn, or what the probability is that a particular disease will occur in a population.
Predictive Modeling
Definition of Predictive Modeling Predictive Modeling: Predictive modeling is a technique used in data science to make predictions about future events. It involves building a model that can predict the likelihood of an event occurring based on certain factors.
Principal Component Analysis
Definition of Principal Component Analysis Principal component analysis (PCA) is a technique used to reduce the dimensionality of data. It does this by identifying the principal components of the data, which are the dimensions that account for the most variation in the data. PCA can be used to improve the performance of machine learning algorithms,…
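A minimal scikit-learn sketch reducing a small made-up dataset from three dimensions to two:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.5],
              [2.2, 2.9, 0.3],
              [1.9, 2.2, 0.9]])  # made-up 3-dimensional data

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)          # project onto the top 2 components
print(pca.explained_variance_ratio_)    # share of variance each component captures
print(reduced.shape)                    # (4, 2)
```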
Prior Distribution
Definition of Prior Distribution A prior distribution is a probability distribution that is used to assign a probability to each outcome in a problem before any data is observed. This distribution is usually chosen based on experience or intuition.
Probability
Definition of Probability Probability: Probability is a measure of how likely an event is to occur. When all outcomes are equally likely, it is calculated by dividing the number of outcomes in which the event occurs by the total number of possible outcomes. It can be expressed as a number between 0 and 1, or as a percentage between 0% and 100%.
Python
Definition of Python Python is a high-level programming language with strong support for data science and machine learning. It has a wide variety of libraries for data analysis, machine learning, and scientific computing.
Q
View Data Science & Machine Learning Terms beginning with the Letter “Q”
Quality Control
Definition of Quality Control Quality Control: In the context of data science, quality control is the process of ensuring that data is accurate and reliable. This can be done through a variety of techniques, such as checking for inconsistencies, verifying the source of the data, and performing statistical tests. By making sure that data is…
Quant
Definition of Quant Quant: Quant is short for quantitative analyst.
Quantile
Definition of Quantile Quantile: A quantile is a cut point in a data set: a given quantile is the value below which a specified percentage of the data falls.
Quantitative
Definition of Quantitative Quantitative: Quantitative means numerical. Quantitative data is data that can be measured or counted. Quantitative analysis refers to the use of numbers and mathematical models to understand, analyze, interpret, present, and organize data.
Quantization
Definition of Quantization Quantization: Quantization is the process of reducing the number of unique values in a set of data. This is often done by dividing the data into bins and assigning a unique value to each bin. This technique is often used in data science to make sure that data can be processed and…
Quartile
Definition of Quartile Quartile: A quartile is one of the three cut points that divide a data set into four equal parts. The first quartile marks the lowest 25% of the data, the second quartile is the median (50%), and the third quartile marks the point below which 75% of the data falls.
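A minimal NumPy sketch of the three quartile cut points (quartile conventions vary slightly between tools; this uses NumPy's default interpolation):

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8]
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)  # 2.75 4.5 6.25 -- Q2 is the median
```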
Query
Definition of Query Query: A query is a question or request for information. In data science, a query is a request for data that meets certain criteria. For example, you might want to know how many people live in your city or what the average temperature is in December. To answer a query, you need…
Query Optimization
Definition of Query Optimization Query Optimization: Query optimization is a technique used by database administrators to improve the performance of database queries. It involves analyzing the structure of the query and the data in the database, and then choosing an execution plan that will produce the results as quickly as possible.
Querying
Definition of Querying Querying: Querying is the process of extracting information from data, or the act of performing a query.
Quintiles
Definition of Quintiles Quintiles: A quintile is a statistical division of a data set into five parts, each containing one-fifth of the data.
R
View Data Science & Machine Learning Terms beginning with the Letter “R”
R-squared
Definition of R-squared R-squared: R-squared is a statistic that measures how close the data points in a set are to a regression line (how well a model fits the data). It is a number between 0 and 1, with 1 indicating that the model explains all of the variation in the data and 0 indicating that it explains none. It is calculated by…
Random Forest
Definition of Random Forest Random forest: A random forest is a type of decision tree learner that builds a number of decision trees, rather than just one. The individual decision trees are then combined to create the random forest. This approach helps to avoid overfitting the data.
Regression
Definition of Regression Regression: Regression is a technique used to model relationships between variables. It can be used to predict future values based on past values.
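A minimal scikit-learn sketch of fitting a regression on made-up past values and predicting a future one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # past values of the input variable
y = np.array([2.1, 4.0, 6.2, 7.9])   # past values of the target variable

model = LinearRegression().fit(X, y)
print(model.predict([[5]]))          # predicted value for input 5 (roughly 10)
```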
Reinforcement Learning
Definition of Reinforcement Learning Reinforcement Learning: Reinforcement learning is a type of machine learning that allows machines to learn by trial and error. In reinforcement learning, the machine is given feedback after each trial, which allows it to learn which actions lead to positive outcomes. Reinforcement learning is often used to train robots or other…
Resampling
Definition of Resampling Resampling: Resampling is a technique used in data science to create new datasets from existing ones. It involves repeatedly drawing samples from the observed data, with or without replacement; this process underlies methods such as bootstrapping and cross-validation, and is repeated multiple times to create a new dataset that is…
Root Mean Squared Error
Definition of Root Mean Squared Error Root Mean Squared Error: Root Mean Squared Error (RMSE) is a measure of the accuracy of predictions made by a model. It is calculated by taking the square root of the average of the squared differences between the actual values and the predicted values.
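The definition translates directly into a few lines of NumPy; the actual and predicted values are made up:

```python
import numpy as np

actual    = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)  # square root of the mean squared difference
```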
Ruby
Definition of Ruby Ruby is a programming language designed for web development. It has a syntax that is easy to learn and read, making it a popular choice for beginners. Ruby also has powerful features that make it suitable for more advanced applications.
S
View Data Science & Machine Learning Terms beginning with the Letter “S”
S curve
Definition of S curve S curve is a graphical representation of data points that display nonlinearity in the trend. The points on the graph typically follow a smooth curve, indicating that the trend changes at different points in time. The S curve is often used when describing growth or decay over time.
S-sampling
Definition of S-sampling S-sampling: S-sampling is the process of drawing a random sample from a population.
SaaS
Definition of SaaS SaaS: Software as a Service (SaaS) is a software delivery model in which software and its associated data are hosted by the provider. Customers can access and use the software, typically through a web browser, while the provider manages the infrastructure and security…
Samples
Definition of Samples Samples: Samples are a subset of the data that is used to train a machine learning model. The goal is to choose a dataset that will produce the best results on the test dataset.
Sampling
Definition of Sampling Sampling: Sampling is the process of selecting a subset of a population for study. This can be done randomly or using some other method such as stratified sampling.
SAS
Definition of SAS see Statistical Analysis System.
Scalar
Definition of Scalar A scalar is a numerical quantity that has a magnitude but no direction.
Scatterplot
Definition of Scatterplot Scatterplot: A scatterplot is a graphical representation of data in which the points are plotted on a coordinate plane. The data is usually displayed as a series of points, with each point representing a pair of values. The value on the x-axis is typically the independent variable, while the value on the…
Scientific Notation
Definition of Scientific Notation Scientific Notation: Scientific notation is a way of representing very large or very small numbers. In scientific notation, a number is written as a product of two factors: a coefficient and a power of 10. The coefficient is a number between 1 and 10 that multiplies the power of 10. The…
Scikit-learn
Definition of Scikit-learn Scikit-learn: Scikit-learn is a machine learning library for the Python programming language. It features a wide range of algorithms for data mining and machine learning, as well as utilities for manipulating datasets and running models.
Scripting
Definition of Scripting A scripting language is a programming language designed for automating the execution of tasks. Scripting languages are often interpreted instead of compiled, and are used to create small, specific programs or to execute a series of commands.
SD
Definition of SD SD: SD stands for standard deviation.
SE
Definition of SE SE: SE stands for standard error. It is a measure of the variability of a sample statistic such as the mean; the standard error of the mean is calculated as the standard deviation of the set divided by the square root of the number of data points.
Serial Correlation
Definition of Serial Correlation Serial correlation is a measure of how much a variable is related to itself across successive points in time (also called autocorrelation). Serial correlation is often used in time series analysis to determine if there is a trend or pattern in the data.
Shapiro-Wilk Test
Definition of Shapiro-Wilk Test Shapiro-Wilk Test: A Shapiro-Wilk Test is a method used to determine whether a set of data follows a normal distribution. The test is named after its creators, Samuel Shapiro and Martin Wilk.
Shell
Definition of Shell A shell is a user interface that allows computer users to interact with the operating system. Shells usually provide a command-line interface in which the user can type commands, which the shell then executes.
Spatiotemporal Data
Definition of Spatiotemporal Data Spatiotemporal data is data that captures the relationship between spatial and temporal information. This can include things like GPS data, sensor readings, or images. Spatiotemporal data is often used in geography and climate studies, as well as in traffic analysis.
Spearman’s Rank Correlation Coefficient
Definition of Spearman’s Rank Correlation Coefficient Spearman’s Rank Correlation Coefficient: Spearman’s Rank Correlation Coefficient is a measure of the strength of a monotonic relationship between two sets of data. It is calculated by converting each data set to ranks and then computing Pearson’s correlation coefficient on those ranks.
SPSS
Definition of SPSS SPSS (Statistical Package for the Social Sciences) is a software package used for statistical analysis. It was originally designed for social scientists, but has been expanded to accommodate a wider range of data. SPSS can be used to conduct basic statistical analyses, as well as more complex procedures such as multivariate analysis…
SQL
Definition of SQL SQL: SQL is a programming language designed for managing data held in a relational database. It is used to query and update data, as well as to create and manage database objects such as tables, views, and indexes.
Standard Deviation
Definition of Standard Deviation Standard Deviation: Standard deviation is a measure of how spread out a set of data points are. It is calculated by squaring the distance of each data point from the mean, averaging those squared distances, and then taking the square root. This gives us a measure of how far each data point is from…
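The calculation above as a worked NumPy sketch, using made-up data:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()
variance = np.mean((data - mean) ** 2)  # average squared distance from the mean
print(np.sqrt(variance))                # 2.0
print(data.std())                       # same result from the built-in
```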
Standard Error Of The Mean
Definition of Standard Error Of The Mean Standard error of the mean: The standard error of the mean (SEM) is a measure of the standard deviation of the sample mean. It is used to estimate the standard deviation of a population from a sample.
Standard Normal Distribution
Definition of Standard Normal Distribution A normal distribution is a type of distribution that occurs frequently in nature and is symmetrical about its mean. The standard normal distribution is a specific type of normal distribution that has a mean of 0 and a standard deviation of 1.
Standardized Score
Definition of Standardized Score A standardized score is a score that has been adjusted to account for the distribution of scores in the data set. This adjustment allows comparisons of scores from different data sets.
Stata
Definition of Stata Stata is a software package used for statistical analysis. It is known for its powerful scripting language, which allows users to write custom code to carry out complex analyses. Stata also includes a wide range of built-in commands, making it easy to get started with data analysis.
Statistical Analysis System (SAS)
Definition of Statistical Analysis System (SAS) Statistical Analysis System (SAS): A Statistical Analysis System (SAS) is a software application used for statistical analysis. SAS is used to perform a variety of tasks, including data entry, data management, statistical analysis, report generation, and more.
Statistics
Definition of Statistics Statistics: Statistics is the practice of collecting, analyzing, and drawing conclusions from data.
Strata, Stratified Sampling
Definition of Strata, Stratified Sampling Stratified sampling is a type of probability sampling in which the population is divided into strata and a simple random sample is taken from each stratum. This type of sampling is used when the population is heterogeneous and the researcher wants to ensure that all strata are represented in the…
Streaming Data
Definition of Streaming Data Streaming data: Streaming data is a type of data that is continuously updated. It can be used to track real-time changes in data, such as stock prices or website traffic.
Supervised Learning
Definition of Supervised Learning Supervised Learning: Supervised learning is a type of machine learning algorithm where the computer system is provided with a set of training data, and the task is to learn how to predict the correct output values for new data. The algorithm is “supervised” because each answer it provides is validated by…
Support Vector Machines
Definition of Support Vector Machines Support Vector Machines: A support vector machine (SVM) is a machine learning algorithm used to classify data into discrete categories or to predict outcomes. It works by finding the hyperplane that best separates the classes in the training data, maximizing the margin between them; kernel functions allow it to learn nonlinear decision boundaries.
T
View Data Science & Machine Learning Terms beginning with the Letter “T”
T-distribution
Definition of T-distribution The t-distribution is a type of probability distribution used when estimating the mean of a normally distributed population from a small sample with unknown variance. It is a bell curve-like distribution with heavier tails than the normal distribution, and it is used to calculate the standard error of a statistic.
T-test
Definition of T-test T-test: A T-test is a type of statistic that is used to compare two groups of data. It can be used to determine whether the means of the two groups are statistically different from each other.
Tableau
Definition of Tableau Tableau is a data visualization software used to create charts and graphs from data. It can be used to find patterns and trends in the data, as well as to present the data in a visually appealing way.
Temporary table
Definition of Temporary table Temporary table: A temporary table is a table that exists for the duration of a session. Temporary tables are useful for storing intermediate results, or for querying data that is too large to fit in memory.
Time Series Data
Definition of Time Series Data A time series is a sequential set of data points, typically measured at successive points in time. Time series analysis is the process of analyzing time series data in order to identify patterns and forecast future values. Time series data can be used to measure and predict changes in everything…
Time-Series
Definition of Time-Series Time-Series: A time-series is a sequence of data points, usually measured at successive points in time. Time-series analysis is the process of examining these data points in order to identify and understand trends and patterns. Time-series data can be used to predict future events, track performance over time, or measure the impact…
Trend
Definition of Trend Trend: A trend is a general direction in which something is moving or changing. In the context of data science, trends can be observed in datasets over time and can be used to make predictions about the future, or observe how process changes influence outcomes.
Two-sample t-test
Definition of Two-sample t-test Two-sample t-test: A two-sample t-test is a statistical hypothesis test used to determine whether the means of two samples are statistically different from each other. The test statistic is a t statistic, which is the ratio of the differences between the means of the two samples to the standard error of…
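A minimal SciPy sketch comparing the means of two made-up samples:

```python
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]  # made-up samples
group_b = [5.8, 6.0, 5.9, 6.1, 5.7]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # a small p-value suggests the means differ
```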
Type I Error
Definition of Type I Error Type I error: A type I error, also known as a false positive error, is the incorrect rejection of a true null hypothesis.
Type II Error
Definition of Type II Error Type II error: A type II error, also known as a false negative error, is the failure to reject a false null hypothesis.
U
View Data Science & Machine Learning Terms beginning with the Letter “U”
UIMA
Definition of UIMA UIMA (Unstructured Information Management Architecture) is a framework for the development of software systems that analyze natural language content. It provides a collection of components for performing tasks such as tokenization, sentence segmentation, part-of-speech tagging, and named entity extraction. UIMA also includes a runtime environment for deploying and executing these components.
Unique
Definition of Unique Unique: Unique refers to the attribute of a data set that represents a distinct value. In other words, a value is unique if it does not occur more than once in the data set.
Univariate
Definition of Univariate Univariate: A univariate analysis is a data analysis technique that considers only one variable at a time.
Universality
Definition of Universality Universality: Universality is the property of being applicable to a class of objects or phenomena. In data science, this means that a model or algorithm can be used to solve a problem for a wide range of data sets. This makes universality an important property for any data science toolkit.
Unknown
Definition of Unknown Unknown: Unknown is a term used in data science to describe an attribute or value that has yet to be determined. This may be due to a lack of data or because the data is too noisy to be accurately analyzed. In either case, the goal of data science is to identify…
Unsampled
Definition of Unsampled Unsampled: Unsampled data is data that has not been reduced to a sample drawn from a population; it is the complete set of observations. The term is used in statistics and data science to contrast with data obtained by selecting a random sample from a population.
Unsupervised Learning
Definition of Unsupervised Learning Unsupervised Learning: Unsupervised learning is a type of machine learning algorithm that does not rely on labeled data or feedback from humans to learn how to identify patterns in data. These algorithms are typically used to identify patterns in data that have not been labeled or categorized by humans.
Untested
Definition of Untested Untested: Untested refers to an algorithm or technique that has not been used on any data set.
Update
Definition of Update Update: Update is the process of modifying an existing item in a dataset. This may be done to correct errors, add new information, or improve the accuracy of the data.
UPF
Definition of UPF UPF: UPF stands for Uniform Probability Function. It is a mathematical function that assigns a probability to each outcome in a given set of outcomes.
Upstream
Definition of Upstream Upstream: Upstream and Downstream are terms used in data science to describe the flow of data or what order events occur. Data is processed and becomes more refined as it moves through the various processes. Upstream is closer to the source. When discussing a specific process step, Downstream is a term that…
V
View Data Science & Machine Learning Terms beginning with the Letter “V”
Variable
Definition of Variable Variable: A variable is a named entity that can take on different values. In data science, variables are often used to represent the attributes of objects in a dataset. For example, the age of a person might be represented by a variable called “age”.
Variance
Definition of Variance Variance: Variance is a measure of how spread out a set of data points are. It is calculated by taking the difference between each data point and the mean, squaring the results, and then averaging them. This gives you a value that can be compared across different sets of data.
Vector
Definition of Vector Vector: A vector is a mathematical object consisting of an ordered list of numbers, which can be interpreted as a point in a given space, with each component identifying one coordinate.
Vector Space Model
Definition of Vector Space Model Vector Space Model: A vector space model is a mathematical model used in statistics, data mining, and machine learning to represent a set of objects, such as documents, as vectors of feature values, so that the similarity between objects can be measured by comparing their vectors.
Vowel Removal
Definition of Vowel Removal Vowel Removal: Vowel removal is a data preprocessing technique used to remove vowels from text. This can be helpful for improving the accuracy of text-based machine learning models, as well as reducing the size of the training dataset.
W
View Data Science & Machine Learning Terms beginning with the Letter “W”
Walking
Definition of Walking Walking: This is a term used in data science to describe the process of moving data from one location to another. It is often used when transferring data from a local machine to a remote server.
Waterfall Chart
Definition of Waterfall Chart Waterfall chart: A waterfall chart is a type of data visualization that shows how a starting value changes through a series of intermediate increases and decreases to arrive at a final value. It’s often used to illustrate how different factors contribute to a final outcome.
Weight
Definition of Weight Weight: In machine learning, weight is a factor that is assigned to a particular input in order to influence the strength of the associated output.
Weighted Average
Definition of Weighted Average Weighted Average: Weighted average is a calculation that assigns a weight to each value in a set and then divides the sum of the weighted values (each value multiplied by its weight) by the sum of the weights.
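A minimal sketch of this formula, with NumPy's built-in as a cross-check; the values and weights are made up:

```python
import numpy as np

values  = [90, 80, 70]
weights = [0.5, 0.3, 0.2]  # made-up weights

manual = sum(v * w for v, w in zip(values, weights)) / sum(weights)
print(manual)                               # 83.0
print(np.average(values, weights=weights))  # same result
```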
Window Function
Definition of Window Function Window Function: A window function is a set of mathematical functions that are applied to a group of neighboring data points in a dataset. They are used to calculate important statistics such as the average, median, and standard deviation of a set of data points.
Windowed Aggregation
Definition of Windowed Aggregation Windowed Aggregation: Windowed aggregation is the process of computing a statistic over a fixed-size window of data. The window slides along the data set, and the statistic is recomputed for the values inside the window at each position. This can be done with any type of…
Word2vec
Definition of Word2vec Word2vec: Word2vec is a technique used to create a “word embedding”, which is a vector representation of a word. This can be used to better understand the relationships between words, as well as to train machine learning models.
Work
Definition of Work Work: Work is the application of effort to achieve a desired outcome. In the context of data science, work refers to the tasks that need to be completed in order to produce insights or results. This might include data cleaning, analysis, modeling, and interpretation.
Worker
Definition of Worker Worker: A worker is a process or node in a distributed computing system that is responsible for processing a particular set of inputs and generating an output.
X
View Data Science & Machine Learning Terms beginning with the Letter “X”
X Axis
Definition of X Axis X Axis – The X Axis is the horizontal line on a graph that indicates the independent variable.
XML
Definition of XML XML: XML is an abbreviation for “Extensible Markup Language.” It is a markup language used to define the structure of data so that it can be easily accessed and processed by computers.
xlrd
Definition of xlrd xlrd: A Python library for reading data from Excel files.
XOR
Definition of XOR XOR: XOR is a logical operator that produces a true result if and only if exactly one of its operands is true.
XY Scatterplot
Definition of XY Scatterplot XY scatterplot: An XY scatterplot is a graphical representation of data in which the data points are plotted on a two-dimensional coordinate system. The data points are usually symbolized by circles, squares, or crosses, and the coordinates of each point are shown as x- and y-values.
Y
View Data Science & Machine Learning Terms beginning with the Letter “Y”
Y-axis
Definition of Y-axis Y-axis: The Y-axis is the vertical axis in a graph. It represents the value of the data points plotted on the graph.
Y-combinator
Definition of Y-combinator Y-combinator: The Y-combinator is a fixed-point combinator: a higher-order function that takes a function and returns a fixed point of it. It makes recursion possible in settings, such as the lambda calculus, where functions cannot refer to themselves by name.
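A minimal Python sketch, using the strict-evaluation (Z-combinator) form since Python evaluates eagerly; factorial is defined without a named recursive call:

```python
# Strict (Z-combinator) form of the fixed-point combinator.
Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

factorial = Y(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
print(factorial(5))  # 120
```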
Y-intercept
Definition of Y-intercept Y-intercept: The y-intercept is the point at which a line or curve crosses the y-axis. It is the point on the line at which x equals zero.
Z
View Data Science & Machine Learning Terms beginning with the Letter “Z”
Z-score
Definition of Z-score Z-score: A z-score is a statistic that measures how far a particular observation is from the mean, in standard deviations.
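As a formula, z = (x − μ) / σ; a minimal sketch with made-up data:

```python
import statistics

data = [10, 12, 14, 16, 18]         # made-up data
mu = statistics.mean(data)          # 14
sigma = statistics.pstdev(data)     # population standard deviation
print((18 - mu) / sigma)            # about 1.41 standard deviations above the mean
```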
Z-table
Definition of Z-table Z-table: A z-table is a table of values that is used to find the area under a normal curve.
Zero-Sum Game
Definition of Zero-Sum Game Zero-Sum Game: In game theory, a zero-sum game is a mathematical model of a situation in which each participant’s gain or loss of utility is exactly balanced by the losses or gains of the other participants. If the total gains from the interactions of all participants are net zero change, then…