Bayesian Network

A Bayesian network (BN) is a probabilistic graphical model for representing knowledge about an uncertain domain where each node corresponds to a random variable and each edge represents the conditional probability for the corresponding random variables [9].

From: Introduction to Algorithms for Data Mining and Machine Learning, 2019

Mathematical foundations

Xin-She Yang, in Introduction to Algorithms for Data Mining and Machine Learning, 2019

2.5 Bayesian network and Markov models

A Bayesian network (BN) is a probabilistic graphical model for representing knowledge about an uncertain domain where each node corresponds to a random variable and each edge represents the conditional probability for the corresponding random variables [9]. BNs are also called belief networks or Bayes nets. Due to dependencies and conditional probabilities, a BN corresponds to a directed acyclic graph (DAG) where no loop or self connection is allowed. For example, Fig. 2.3 is a BN.

Figure 2.3. A simple Bayesian network example.

Let us use an example to show how it works. This example is very similar to the "earthquake" example by Pearl [112] and the "chair" example by Ben-Gal [9]. In my office, there is an electric fan that I use often in summer and not in other seasons. Imagine a scenario that I try to switch on the fan, but it does not spin. The fan is plugged into an extension socket or plug, and there is a possibility of a plug failure. How do we figure out what the possible causes are?

The fan has a probability of 0.02 of failure, whereas the plug is very old and has a failure probability of 0.2. I also have a mobile phone charger connected to the same plug, and I found that the charger works well. What is the probability that the problem is caused by a faulty fan?

We can represent this scenario as a simple Bayesian network, shown in Fig. 2.3. In this case, the parents of the random variable Fan are the nodes Faulty Fan and Faulty Plug, whereas the child of Fan is No Spin. The two variables Faulty Fan and Faulty Plug are marginally independent; however, they become conditionally dependent, given Fan.
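
To make this concrete, the posterior can be computed by brute-force enumeration over the joint distribution. The following MATLAB sketch is illustrative and rests on two assumptions not stated in the text: the fan fails to spin exactly when the fan or the plug is faulty (a deterministic OR), and the charger works exactly when the plug is sound.

pFF = 0.02;                           % Prior probability of a faulty fan
pFP = 0.20;                           % Prior probability of a faulty plug
num1 = 0; den1 = 0;                   % Accumulators: evidence = no spin
num2 = 0; den2 = 0;                   % Accumulators: evidence = no spin, charger OK
for ff = 0:1                          % 1 = fan is faulty
    for fp = 0:1                      % 1 = plug is faulty
        pr = (ff*pFF + (1-ff)*(1-pFF)) * (fp*pFP + (1-fp)*(1-pFP));
        noSpin = double(ff | fp);     % Deterministic OR noise model (assumption)
        chargerOK = double(fp == 0);  % Charger works iff plug is fine (assumption)
        den1 = den1 + pr*noSpin;             num1 = num1 + pr*noSpin*ff;
        den2 = den2 + pr*noSpin*chargerOK;   num2 = num2 + pr*noSpin*chargerOK*ff;
    end
end
num1/den1    % P(faulty fan | no spin) = 0.02/0.216, approx. 0.093
num2/den2    % P(faulty fan | no spin, charger OK) = 1

Under these assumptions, no spin alone implicates the fan with probability of only about 0.09, since the old plug is the more likely culprit; the working charger then exonerates the plug entirely, and the fan becomes the certain cause. A noisy observation model would give a less extreme posterior.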

The number of probabilities required to completely specify the distributions for a network can be huge. For a set of $n$ binary random variables, the full joint distribution requires $2^n - 1$ probabilities [29]. Even for a small $n = 20$, this number becomes $2^{20} - 1 = 1048575$, which is huge. Thus the complete specification, and exact inference where it is possible at all, can be NP-hard. Therefore, approximate solutions are often sought in practice, and Monte Carlo simulations can be very useful in this case.
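
The directed factorization is precisely what keeps this blow-up in check. As a standard counting argument (not spelled out in the text), a BN stores one conditional probability table per node, so the storage grows with the size of the largest parent set rather than exponentially in $n$:

\[
\underbrace{2^{n} - 1}_{\text{full joint table}}
\qquad \text{versus} \qquad
\underbrace{\sum_{i=1}^{n} 2^{|\mathrm{Pa}(X_i)|}}_{\text{BN over binary variables}},
\]

where $\mathrm{Pa}(X_i)$ denotes the set of parents of node $X_i$. For example, with $n = 20$ binary variables and at most three parents per node, the factored form needs at most $20 \times 2^3 = 160$ conditional probabilities instead of $1048575$ joint probabilities.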

Though BNs are directed acyclic graphs, probabilistic graphical models in general can also have undirected edges; these are the Markov networks or Markov random fields. It is worth pointing out that many conventional machine learning techniques, such as artificial neural networks, Kalman filters, and hidden Markov models, can all be considered particular cases of Bayesian networks, as pointed out by Ben-Gal [9].

Bayesian networks have a diverse range of applications [9,29,84,106], and Bayesian statistics is relevant to modern techniques in data mining and machine learning [106–108]. Interested readers can refer to more specialized literature on information theory and learning algorithms [98] and on the Bayesian approach to neural networks [91].

URL: https://www.sciencedirect.com/science/article/pii/B9780128172162000090

Classification methods

Shen Liu, ... Yang Xie, in Computational and Statistical Methods for Analysing Big Data with Applications, 2016

2.2.3 Bayesian networks

A Bayesian network is a probabilistic graphical model that measures the conditional dependence structure of a set of random variables based on the Bayes theorem:

$P(A \mid B) = \dfrac{P(B \mid A) P(A)}{P(B)}$.

Pearl (1988) stated that Bayesian networks are graphical models that contain information about causal probability relationships between variables and are often used to aid in decision making. The causal probability relationships in a Bayesian network can be suggested by experts or updated using the Bayes theorem and new data being collected. The inter-variable dependence structure is represented by nodes (which depict the variables) and directed arcs (which depict the conditional relationships) in the form of a directed acyclic graph (DAG). As an example, the following DAG indicates that the incidence of two waterborne diseases (diarrhoea and typhoid) depends on three indicators of water samples: total nitrogen, fat/oil and bacteria count, each of which is influenced by another layer of nodes: elevation, flooding, population density and land use. Furthermore, flooding may be influenced by two factors: percolation and rainfall.

There are two components involved in learning a Bayesian network: (i) structure learning, which involves discovering the DAG that best describes the causal relationships in the data, and (ii) parameter learning, which involves learning about the conditional probability distributions. The two most popular methods for determining the structure of the DAG are the DAG search algorithm (Chickering, 2002) and the K2 algorithm (Cooper & Herskovits, 1992). Both of these algorithms assign equal prior probabilities to all DAG structures and search for the structure that maximizes the probability of the data given the DAG, that is, the structure for which $P(\mathrm{data} \mid \mathrm{DAG})$ is maximized. This probability is known as the Bayesian score; a sketch of how it is computed is given after the example below. Once the DAG structure is determined, the maximum likelihood estimator is employed as the parameter learning method. Note that it is often critical to incorporate prior knowledge about causal structures in the parameter learning process. For instance, consider the causal relationship between two binary variables: rainfall (large or small) and flooding (whether there is a flood or not). Clearly, the former influences the latter. Denote the four corresponding events:

$R$: Large rainfall,
$\bar{R}$: Small rainfall,
$F$: There is a flood,
$\bar{F}$: There is no flood.

In the absence of prior knowledge, the four joint probabilities $P(F, R)$, $P(\bar{F}, R)$, $P(F, \bar{R})$ and $P(\bar{F}, \bar{R})$ need to be inferred from the observed data; otherwise, these probabilities can be pre-determined before fitting the Bayesian network to data. Assume that the following statement is made by an expert: if the rainfall is large, the chance of flooding is 60%; if the rainfall is small, the chance of no flooding is four times that of flooding. Then we have the following causal relationship as prior information:

$P(F \mid R) = 0.6$, $P(\bar{F} \mid R) = 0.4$, $P(F \mid \bar{R}) = 0.2$, $P(\bar{F} \mid \bar{R}) = 0.8$.

Furthermore, assume that meteorological data show that the chance of large rainfall is 30%, namely, $P(R) = 0.3$. Then the following contingency table is determined:

          $R$                                              $\bar{R}$
$F$       $P(F,R) = P(F \mid R)P(R) = 0.18$                $P(F,\bar{R}) = P(F \mid \bar{R})P(\bar{R}) = 0.14$
$\bar{F}$ $P(\bar{F},R) = P(\bar{F} \mid R)P(R) = 0.12$    $P(\bar{F},\bar{R}) = P(\bar{F} \mid \bar{R})P(\bar{R}) = 0.56$

which is an example of pre-specified probabilities based on prior knowledge.
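
Returning to structure learning, the Bayesian score factorizes over nodes, and the contribution of a single discrete node can be computed in closed form. The following MATLAB sketch implements the Cooper and Herskovits (1992) marginal likelihood for one node; the function name k2score and the coding of states as 1..r and parent configurations as 1..q are illustrative assumptions, not part of the text.

function s = k2score(x, pa, r, q)
% Log contribution of one node to log P(data | DAG) under the K2 metric:
% sum over parent configurations j of
%   log[(r-1)! / (N_j + r - 1)!] + sum over states k of log(N_jk!)
% x:  N-by-1 observed states of the node, coded 1..r
% pa: N-by-1 parent configuration of each observation, coded 1..q
s = 0;
for j = 1:q
    Njk = histcounts(x(pa == j), 0.5:1:(r + 0.5));   % Counts N_jk, k = 1..r
    s = s + gammaln(r) - gammaln(sum(Njk) + r) + sum(gammaln(Njk + 1));
end
end

A structure search such as K2 evaluates this contribution for candidate parent sets of each node and keeps the combination with the highest total score.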

For the purpose of classification, the naïve Bayes classifier has been extensively applied in various fields, such as the classification of text documents (spam or legitimate email, sports or politics news, etc.) and automatic medical diagnosis (to be introduced in Chapter 8 of this book). Denote by $y$ the response variable, which has $k$ possible outcomes, that is, $y \in \{y_1, \ldots, y_k\}$, and let $x_1, \ldots, x_p$ be the $p$ features that characterize $y$. Using the Bayes theorem, the conditional probability of each outcome, given $x_1, \ldots, x_p$, is of the following form:

$P(y_i \mid x_1, \ldots, x_p) = \dfrac{P(x_1, \ldots, x_p \mid y_i) P(y_i)}{P(x_1, \ldots, x_p)}$.

Note that the naïve Bayes classifier assumes that $x_1, \ldots, x_p$ are mutually independent, conditional on the outcome of $y$. As a result, $P(y_i \mid x_1, \ldots, x_p)$ can be re-expressed as follows:

$P(y_i \mid x_1, \ldots, x_p) = \dfrac{P(y_i) \prod_{j=1}^{p} P(x_j \mid y_i)}{P(x_1, \ldots, x_p)}$,

which is proportional to $P(y_i) \prod_{j=1}^{p} P(x_j \mid y_i)$. The maximum a posteriori decision rule is applied, and the most probable label of $y$ is determined as follows:

$\hat{y} = \operatorname*{argmax}_{i \in \{1, \ldots, k\}} P(y_i) \prod_{j=1}^{p} P(x_j \mid y_i)$.

An illustration of the naïve Bayes classifier is provided here. Let's revisit the 'German credit' dataset. This time, we consider the following features: duration in months, credit amount, gender, number of people being liable, and whether a telephone is registered under the customer's name. The setting of training and test sets remains the same. The following MATLAB code is implemented:

y = creditrisk;                                    % Response variable
x = [duration, credit_amount, male, nppl, tele];   % Features
train = 800;                                       % Size of training sample
xtrain = x(1:train,:);                             % Training sample
ytrain = y(1:train,:);                             % Labels of training sample
xtest = x(train+1:end,:);                          % Test set
ytest = y(train+1:end,:);                          % Labels of test set
nbayes = fitNaiveBayes(xtrain,ytrain);             % Train the naïve Bayes classifier
ypredict = nbayes.predict(xtest);                  % Predict labels for the test set
rate = sum(ypredict == ytest)/numel(ytest);        % Rate of correct classification

Again, 68.5% of the customers in the test set were correctly labelled. The results are summarized in the following table:

                   Predicted
                   Good    Bad
Actual    Good      121     18
          Bad        45     16

For modelling time series data, dynamic Bayesian networks can be employed to evaluate the relationships among variables at adjacent time steps (Ghahramani, 2001). A dynamic Bayesian network assumes that an event can affect another in the future but not vice versa, implying that directed arcs should flow forward in time. A simplified form of dynamic Bayesian network is known as the hidden Markov model. Denote the observation at time $t$ by $Y_t$, where $t$ is the integer-valued time index. As stated by Ghahramani (2001), the name 'hidden Markov' originates from two assumptions: (i) a hidden Markov model assumes that $Y_t$ was generated by some process whose state $S_t$ is hidden from the observer, and (ii) the states of this hidden process satisfy the first-order Markov property, where the $r$th-order Markov property refers to the situation that, given $S_{t-1}, \ldots, S_{t-r}$, $S_t$ is independent of $S_\tau$ for $\tau < t - r$. The first-order Markov property also applies to $Y_t$ with respect to the states, that is, given $S_t$, $Y_t$ is independent of the states and observations at all other time indices. The following figure visualizes the hidden Markov model:

Mathematically, the joint probability of a sequence of states and observations can be expressed as follows:

$P(Y_1, S_1, \ldots, Y_T, S_T) = P(S_1) P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1}) P(Y_t \mid S_t)$.
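
Once transition and emission probabilities are specified, this factorization can be evaluated directly. The following MATLAB sketch, with hypothetical numbers rather than anything from the text, scores one particular state and observation sequence for a two-state model:

A  = [0.9 0.1; 0.2 0.8];         % Transition matrix, A(i,j) = P(S_t = j | S_t-1 = i)
B  = [0.7 0.3; 0.1 0.9];         % Emission matrix,   B(i,m) = P(Y_t = m | S_t = i)
p0 = [0.5 0.5];                  % Initial distribution P(S_1)
S  = [1 1 2 2];                  % A hypothetical state sequence
Y  = [1 1 2 2];                  % A hypothetical observation sequence
p  = p0(S(1)) * B(S(1), Y(1));   % P(S_1) P(Y_1 | S_1)
for t = 2:numel(S)
    p = p * A(S(t-1), S(t)) * B(S(t), Y(t));   % P(S_t | S_t-1) P(Y_t | S_t)
end
p                                % Joint probability of this sequence

In practice the states are hidden, so the forward algorithm sums this quantity over all possible state sequences to obtain the likelihood of the observations alone.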

Hidden Markov models have shown potential in a wide range of data mining applications, including digital forensics, speech recognition, robotics and bioinformatics. The interested reader is referred to Bishop (2006) and Ghahramani (2001) for a comprehensive discussion of hidden Markov models.

In summary, Bayesian networks appear to be a powerful method for combining information from different sources with varying degrees of reliability. More details of Bayesian networks and their applications can be found in Pearl (1988) and Neapolitan (1990).

URL: https://www.sciencedirect.com/science/article/pii/B9780128037324000027

Graphical Models

R.G. Almond, in International Encyclopedia of Education (Third Edition), 2010

Related Graphical Representations

Although this article describes graphical models used in the narrow sense (models on an undirected graph), the term is often used to describe any model that is factored according to a graph. Two other classes of models, Bayesian networks and chain graph models, use directed or semidirected graphs to expand the expressive power of the graphical models. Path diagrams, while related in intent, have slightly different rules for interpretation.

A Bayesian network is a probability model defined over an acyclic directed graph. It is factored by using one conditional probability distribution for each variable in the model, whose distribution is given conditional on its parents in the graph. Variables which are separated in the graph are still independent, but the simple graph separation used for the undirected graph is replaced with the more complicated d-separation, which takes into account the effect of competing explanations for observed values.

A Bayesian network can be converted into an undirected graphical model by connecting all of the nodes that are involved in each factor. This requires that the parents of each node be joined or married. The process of joining parents is known as moralization and the undirected graph corresponding to a given Bayesian network is called a moral graph. Computing the moral graph is the first step of many computational algorithms for Bayesian networks.
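
A minimal MATLAB sketch of moralization on a hypothetical four-node DAG (the adjacency matrix and node numbering are illustrative assumptions):

D = [0 0 1 0;                    % Directed edges: D(i,j) = 1 means i -> j;
     0 0 1 0;                    % nodes 1 and 2 are both parents of node 3,
     0 0 0 1;                    % and node 3 is the parent of node 4
     0 0 0 0];
M = D | D';                      % Drop edge directions (undirected skeleton)
n = size(D, 1);
for c = 1:n
    pa = find(D(:, c));          % Parents of node c
    for a = 1:numel(pa)          % Marry every pair of parents
        for b = a+1:numel(pa)
            M(pa(a), pa(b)) = 1;
            M(pa(b), pa(a)) = 1;
        end
    end
end
M                                % Adjacency matrix of the moral graph

Marrying the parents guarantees that each conditional probability factor of the Bayesian network lies inside a clique of the resulting undirected graph, which is what the subsequent algorithms require.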

Chain graphs use a mixture of directed and undirected graphical edges to describe independence relationships among the variables. A chain graph is divided into a series of blocks; within a block, all edges are undirected, and between blocks, the edges are directed. The blocks often correspond to stages in an experiment; for example, one block could be measurements taken before the start of an educational intervention, and another measurements taken after. A third block could represent background variables about the students and schools. The directed edges flow in the temporal direction: from preintervention to postintervention measures and from background variables to either.

Path diagrams bear an obvious similarity to graphical models (especially the directed Bayesian networks) and, indeed, the work on path diagrams was an inspiration for the later work on graphical models. However, there are some subtle and important differences. Some are obvious: path diagrams include explicit nodes for error terms, while these are usually implicit in graphical models and Bayesian networks; and path diagrams allow some kinds of expressions (double-headed arrows and reciprocal relationships) that are not allowed in graphical models. Other differences are more subtle. In particular, the discipline of structural equation modeling concentrates on modeling the covariance matrix, while graphical models concentrate on modeling the inverse covariance matrix. The implication is that the conditional independence assumptions implicit in the structural equation model are not always clearly expressed; hence, the Markov property may not hold for the path diagram. Most path diagrams have an equivalent representation as a graphical model, but there are some exceptions (which typically are models that are difficult to estimate).

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947013348

Basic Algorithms for Data Mining

Robert Nisbet, ... Gary Miner, in Handbook of Statistical Analysis and Data Mining Applications, 2009

Additional Types of Neural Networks

Following are some additional types of neural nets:

Linear Networks: These networks have two layers: input and output layers. They do not handle complexities well but can be considered as a "baseline model."

Bayesian Networks: Networks that employ Bayesian probability theory, which can be used to control model complexity, optimize weight decay rates, and automatically find the most important input variables.

Probabilistic Networks: These networks consist of three to four layers.

Generalized Regression: These networks train quickly but execute slowly. Probabilistic (PNN) and Generalized Regression (GRNN) neural networks operate in a manner similar to that of Nearest-Neighbor algorithms (see Chapter 12), except the PNN operates only with categorical target variables and the GRNN operates only with numerical target variables. PNN and GRNN networks have advantages and disadvantages compared to MLP networks (adapted from http://www.dtreg.com/pnn.htm):

It is usually much faster to train a PNN/GRNN network than an MLP network.

PNN/GRNN networks often are more accurate than MLP networks.

PNN/GRNN networks are relatively insensitive to outliers (wild points).

PNN networks generate accurate predicted target probability scores.

PNN networks approach Bayes optimal classification.

PNN/GRNN networks are slower than MLP networks at classifying new cases.

PNN/GRNN networks require more memory space to store the model.

Kohonen: This type of neural network is used for classification. It is sometimes called a "self-organizing" neural net. It iteratively classifies inputs, until the combined difference between classes is maximized. This algorithm can be used as a simple way to cluster data, if the number of cases or categories is not particularly large. For data sets with a large number of categories, training the network can take a very long time.

Single-layer networks can be used to solve many logical problems, but only those in which the classes are linearly separable. Figure 7.14 shows a classification problem in which it is possible to separate the classes with a straight line in the space defined by their dimensions.

Figure 7.14. Two pattern classes that are linearly separable.

Figure 7.15 shows two classes that cannot be separated with a straight line (i.e., are not linearly separable).

Figure 7.15. Nonseparable classes.

URL: https://www.sciencedirect.com/science/article/pii/B9780123747655000073

Volume 3

S.D. Brown, A.J. Myles, in Comprehensive Chemometrics, 2009

3.17.6.2 Hybrid Classifiers

Decision tree modeling can also be effectively combined with other predictive classification modeling techniques. Generally, this involves the use of decision trees on the front-end of a hybrid or fused classification model, where the purpose of the decision tree is to reduce the representation of the training data by feature selection or discretization to accommodate a second classification technique. Decision trees can be used to discover informative features, or 'emerging patterns', in microarray data. 70 Emerging patterns can be regarded as the combination of two consecutive decision tree partitions. Discriminant analysis can then be applied to the reduced feature space defined by the emerging patterns – a hybrid classification technique.

3.17.6.2.1 Emerging patterns in the microarray example

For the purposes of illustration, the 'emerging patterns' idea can be applied to the microarray example, though the tree is so small that the example is not ideally suited to it. Using the Entropy scoring criterion here (in contrast with the Gini scoring criterion with proportional priors), the LTC4S and PAX6 (feature #10) genes were selected. Then, linear discriminant and quadratic 71 discriminant models were developed on the two selected features. Figure 23 illustrates the linear and quadratic discriminants (dashed lines) compared to the decision tree selected partitions (solid lines). In this application, a decision tree was used only to identify important features, so the actual classification accuracy achieved by the decision tree is not especially important.

Figure 23. Comparing classifier discriminant boundaries in microarray data. Reproduced with permission from Myles, A. J.; Feudale, R. N.; Liu, Y.; Woody, N. A.; Brown, S. D. An Introduction to Decision Tree Modeling. J. Chemom. 2004, 18, 275–285. Copyright © 2004 John Wiley & Sons Limited.

Decision trees can also be coupled with Bayesian networks. 72,73 For this hybrid classification approach, a decision tree is constructed from $X_T$. The decision-tree-selected features are discretized based on the set of decision rules developed on the training set. The resulting simplified feature space representation (where the most relevant features are identified and mapped into the discretized feature space) is then used to train a discrete-node-based Bayesian network to provide predictive classification. The hybrid 'decision tree-Bayesian network' modeling approach was applied to the thyroid gland data set explored above. A difference between the 'decision tree-Bayesian network' approach and the 'emerging patterns' technique is that the decision tree partitions were used as thresholds to label discrete regions within the feature space, simplifying the continuous-valued features for input to the discrete Bayesian net.
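
The discretization step can be sketched as follows; this is not the authors' code, fitctree and its CutPoint property come from MATLAB's Statistics and Machine Learning Toolbox, and the data are synthetic placeholders rather than the thyroid data.

X = randn(100, 1);                        % One synthetic continuous feature
y = double(X + 0.3*randn(100, 1) > 0);    % Hypothetical binary class labels
tree = fitctree(X, y);                    % Grow a classification tree
cuts = unique(tree.CutPoint(~isnan(tree.CutPoint)));   % Tree split thresholds
xd = discretize(X, [-Inf; cuts; Inf]);    % Map each value to a tree-defined bin

The discrete feature xd can then serve directly as a node in a discrete-node Bayesian network.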

Decision trees can also be used on the back-end of hybrid or fused classification models, although these types of decision tree implementations are not as common as the front-end hybrid designs discussed earlier. The purpose of the decision tree on the back-end of a hybrid classification technique is to model an alternative representation of $X_T$ produced by the front-end classifier. For example, decision trees can be used to model the predictions made by a set of classifiers, 74,75 representing a form of high-level data fusion. The predictions of N classifiers on n training samples are encoded in a new $X_T$ representation. Prior to this point, the features explored have been continuous-valued (i.e., expression level, total thyroid activity), or at least discrete-valued (i.e., 'decision tree-Bayesian network' modeling approach). The ability to model continuous-valued and categorically valued features is an attribute not shared by many other pattern recognition techniques. This capability means that decision trees can be used to model sets of multiclass predictions from a variety of diverse classifiers (a categorical representation of $X_T$), effectively making the outputs of classifier models the input data to the decision tree.

URL: https://www.sciencedirect.com/science/article/pii/B9780444527011000259

Data mining techniques

Xin-She Yang, in Introduction to Algorithms for Data Mining and Machine Learning, 2019

6.6.2 Bayesian networks

A key assumption for naive Bayesian classifiers is that all variable values are conditionally independent, given the target classification. This assumption significantly reduces the complexity of the calculations of the objective functions in terms of posterior probabilities.

However, this assumption may not be true for certain applications such as text documents and speech signals. In this case, Bayesian belief networks can be a good alternative.

Bayesian belief networks (BBNs) use a set of conditional independence assumptions, but do not impose them on all variable values. We have seen some basic ideas of Bayesian networks in Chapter 2, and here we focus on classification problems. In a BBN, nodes represent variables that can be continuous or discrete, and arcs represent causal relationships in terms of conditional probabilities.

For a given structure of a BBN, not every variable is observable. Unobservable variables are called hidden or latent variables. The BBN model has both observable random variables X and hidden random variables Z, which means that the likelihood function $p(S \mid H)$ becomes a function of X and Z, that is, $L(X, Z)$. The maximization of the likelihood $L(X, Z)$ is equivalent to the maximization of the expectation of the logarithmic likelihood function. Thus the method becomes the so-called expectation-maximization (EM) method, which consists of an E step and an M step [36]. In the E step, we define the expectation

(6.20) $Q_L = E[\log L(X, Z)]$,

which starts with arbitrary initial values and repeatedly re-estimates the expectation. The aim of the M step is to use an optimizer to solve the optimization problem

(6.21) maximize $Q_L = E[\log L(X, Z)]$.

The optimizer can be a gradient-based search algorithm.
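
The E and M steps are easiest to see in a model where both have closed forms. The following illustrative MATLAB sketch (not from the text) runs EM for a two-component Gaussian mixture, where the hidden variable Z is the component label of each observation; here the M step has an analytic solution, so no gradient-based optimizer is required.

x = [randn(100,1) - 2; randn(100,1) + 2];    % Synthetic data from two components
mu = [-1; 1]; sg = [1; 1]; w = [0.5; 0.5];   % Initial parameter guesses
for it = 1:50
    % E step: responsibilities, i.e., E[Z | x, current parameters]
    r = [w(1)*normpdf(x, mu(1), sg(1)), w(2)*normpdf(x, mu(2), sg(2))];
    r = r ./ sum(r, 2);
    % M step: parameters that maximize the expected log-likelihood Q_L
    n  = sum(r, 1)';
    mu = (r' * x) ./ n;
    sg = sqrt(sum(r .* (x - mu').^2, 1)' ./ n);
    w  = n / numel(x);
end
[mu sg w]                                    % Estimated means, std devs, weights

Each iteration cannot decrease the observed-data likelihood, which is the sense in which repeatedly maximizing $Q_L$ increases the likelihood.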

The exact form of the likelihood function can be difficult to write down, depending on the structure of the BBN. In some cases, the network structure may not be known in advance, which makes the problem extremely difficult. In this case, some heuristic and metaheuristic approaches may be needed to learn the network structure. Interested readers can consult the more advanced literature.

URL: https://www.sciencedirect.com/science/article/pii/B9780128172162000132

Big data in healthcare applications

Shen Liu, ... Yang Xie, in Computational and Statistical Methods for Analysing Big Data with Applications, 2016

7.1.3 Data mining procedures in healthcare

A standard process for large-scale data analysis was proposed by Peng, Leek, and Caffo (2015), which is displayed in the following figure (Figure 7.1):

Figure 7.1. Standard process for large-scale data analysis, proposed by Peng, Leek, and Caffo (2015).

In the context of clinical medicine, Bellazzi and Zupan (2008) provided a guideline for predictive data mining. They claimed that predictive data mining methods originate from different research fields and often use very diverse modelling approaches. They stated that the following techniques are some of the most commonly used predictive data mining methods:

Decision trees

Logistic regression

Artificial neural networks

Support vector machines

Naive Bayes

Bayesian networks

The k-nearest neighbours

The reader is referred to Chapter 2 of this book for details of these methods.

Note that according to Bellazzi and Zupan (2008), the methods listed above are often an integral part of modern data mining procedures. For the purpose of model selection, they suggested considering both the predictive performance and interpretability of results. If two methods perform similarly, one should choose the one whose results are easier to interpret.

To implement predictive data mining in clinical medicine, the following tasks should be carried out (Bellazzi & Zupan, 2008):

Defining the problem, setting the goals. At the very beginning, the goal has to be set and the specific problem has to be defined with caution, pointing out the direction of subsequent work. Generally speaking, the aim of predictive data mining in clinical medicine is to make data-informed decisions that help physicians improve their prognosis, diagnosis or treatment planning procedures. To achieve this, preliminary tasks need to be performed. For example, one needs to pay attention to the balance between the predictive power and comprehensibility of a model in the first place, as in particular cases the transparency of data analysis is central to a physician. Decisions also need to be made upon what statistics to use for the assessment of model performance. All these considerations have impact on the subsequent tasks of data mining.

Data preparation. Clinical data usually come from dedicated databases which were purposely collected to study a particular clinical problem (e.g. an electrocardiogram database for heart disease research). Recently, large-scale data warehouses like hospital records and health insurance claims have been exploited to solve clinical problems. However, due to confidentiality restrictions, many of these databases are extremely difficult to access.

Modelling and evaluation. The objective is to apply a set of candidate methods to the observed clinical data, and to determine which method is most suitable. As mentioned earlier, each data mining method is evaluated by its predictive performance and comprehensibility. While the former is relatively easy to quantify, the latter is a subjective measure that is evaluated by participating domain experts. Note that algorithms working in a black-box manner (e.g. neural networks) tend not to be preferred by physicians, even if they may achieve better predictive performance.

Deployment and dissemination. Most clinical data mining projects terminate once the predictive model has been constructed and evaluated; it is very rare to see reports on the deployment and dissemination of predictive models. One possible reason is that the complexity of data mining tools and the lack of a user interface may impede dissemination in clinical environments. Bellazzi and Zupan (2008) attributed the difficulty of deploying and disseminating predictive data mining methods to the lack of bridging between data analysis and decision support. Attempts have been made to solve this problem, for example the Predictive Model Markup Language (PMML), which encodes prediction models in Extensible Markup Language (XML)-based documents.

Now we have learnt the potential of big data in healthcare. In the next section, we undertake a case study to demonstrate how big data can contribute to healthcare in a real-life scenario.

URL: https://www.sciencedirect.com/science/article/pii/B9780128037324000076

Introduction

Shen Liu, ... Yang Xie, in Computational and Statistical Methods for Analysing Big Data with Applications, 2016

1.2 What is this book about?

Big data involves a collection of techniques that can help in extracting useful information from data. Aiming at this objective, we develop and implement advanced statistical and computational methodologies for use in various high impact areas where big data are being collected.

In Chapter 2, classification methods will be discussed, which have been extensively implemented for analysing big data in various fields such as customer segmentation, fraud detection, computer vision, speech recognition and medical diagnosis. In brief, classification can be viewed as a labelling process for new observations, aiming at determining to which of a set of categories an unlabelled object would belong. Fundamentals of classification will be introduced first, followed by a discussion on several classification methods that have been popular in big data applications, including the k-nearest neighbour algorithm, regression models, Bayesian networks, artificial neural networks and decision trees. Examples will be provided to demonstrate the implementation of these methods.

Whereas classification methods are suitable for assigning an unlabelled observation to one of several existing groups, in practice groups of data may not have been identified. In Chapter 3, three methods for finding groups in data will be introduced: principal component analysis, factor analysis and cluster analysis. Principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through linear combinations of these variables, whose general objectives are data reduction and interpretation. Factor analysis can be considered an extension of principal component analysis, aiming to describe the covariance structure of all observed variables in terms of a few underlying factors. The primary objective of cluster analysis is to categorize objects into homogeneous groups, where objects in one group are relatively similar to each other but different from those in other groups. Both hierarchical and non-hierarchical clustering procedures will be discussed and demonstrated by applications. In addition, we will study fuzzy clustering methods which do not assign observations exclusively to only one group. Instead, an individual is allowed to belong to more than one group, with an estimated degree of membership associated with each group.

Chapter 4 will focus on computer vision techniques, which have countless applications in many areas such as medical diagnosis, face recognition and verification systems, video camera surveillance, transportation, etc. Over the past decade, computer vision has proven successful in solving real-life problems. For instance, the registration plate of a vehicle using a tollway is identified from the picture taken by the monitoring camera, and the corresponding driver is then notified and billed automatically. In this chapter, we will discuss how big data facilitate the development of computer vision technologies, and how these technologies can be applied in big data applications. In particular, we will discuss deep learning algorithms and demonstrate how this state-of-the-art methodology can be applied to solve large-scale image recognition problems. A tutorial will be given at the end of this chapter for the purpose of illustration.

Chapter 5 will concentrate on spatial datasets. Spatial datasets are very common in statistical analysis, since in our lives there is a broad range of phenomena that can be described by spatially distributed random variables (e.g. greenhouse gas emission, sea level, etc.). In this chapter, we will propose a computational method for analysing large spatial datasets. An introduction to spatial statistics will be provided at first, followed by a detailed discussion of the proposed computational method. The code of MATLAB programs that are used to implement this method will be listed and discussed next, and a case study of an open-pit mining project will be carried out.

In Chapter 6, experimental design techniques will be considered in analysing big data as a way of extracting relevant information in order to answer specific questions. Such an approach can significantly reduce the size of the dataset to be analysed, and potentially overcome concerns about poor quality due to, for example, sample bias. We will focus on a sequential design approach for extracting informative data. When fitting relatively complex models (e.g. those that are non-linear), the performance of a design in answering specific questions will generally depend upon the assumed model and the corresponding values of the parameters. As such, it is useful to consider prior information for such sequential design problems. We argue that this can be obtained in big data settings through the use of an initial learning phase where data are extracted from the big dataset such that appropriate models for analysis can be explored and prior distributions of parameters can be formed. Given such prior information, sequential design is undertaken as a way of identifying informative data to extract from the big dataset. This approach is demonstrated in an example where there is interest in determining how particular covariates affect the chance of an individual defaulting on their mortgage, and we also explore the appropriateness of a model developed in the literature for the chance of a late arrival in domestic air travel. We will also show that this approach can provide a methodology for identifying gaps in big data, which may reveal limitations in the types of inferences that may be drawn.

Chapter 7 will concentrate on big data analysis in the health care industry. Healthcare administrators worldwide are striving to lower the cost of care whilst improving the quality of care given. Hospitalization is the largest component of health expenditure. Therefore, earlier identification of those at higher risk of being hospitalized would help healthcare administrators and health insurers to develop better plans and strategies. In Chapter 7, we will develop a methodology for analysing large-scale health insurance claim data, aiming to predict the number of hospitalization days. Decision trees will be applied to hospital admissions and procedure claims data, which were observed from 242,075 individuals. The proposed methodology performs well in the general population as well as in subpopulations (e.g. elderly people), as the analysis results indicate that it is reasonably accurate in predicting days in hospital.

In Chapter 8 we will study how mobile devices facilitate big data analysis. Two types of mobile devices will be reviewed: wearable sensors and mobile phones. The former are designed with a primary focus on monitoring the health conditions of individuals, while the latter are becoming the central computing and communication devices in our lives. Data collected by these devices often exhibit high-volume, high-variety and high-velocity characteristics, and hence suitable methods need to be developed to extract useful information from the observed data. In this chapter, we will concentrate on the applications of wearable devices in health monitoring, and on a case study in transportation where data collected from mobile devices facilitate the management of road networks.

URL: https://www.sciencedirect.com/science/article/pii/B9780128037324000015

A guide to gene regulatory network inference for obtaining predictive solutions: Underlying assumptions and fundamental biological and data constraints

Sara Barbosa, ... Ralf Takors, in Biosystems, 2018

3.1 Bayesian networks

Bayesian networks provide flexible frameworks to combine different data types and prior knowledge, to avoid overfitting, and to handle incomplete and noisy data (Hecker et al., 2009). Due to the high number of parameters, this type of approach is usually not appropriate for large-scale GRN inference (Sławek and Arodź, 2013; Wu et al., 2016). Bayesian network methods aim at finding the directed acyclic graph that describes the causal interactions among the different gene network components, as well as the local joint probability distributions that convey these interactions (Albert, 2007). The general rule for Bayesian networks is that the probability of a target gene can be described as a conditional probability given the set of parents of that gene (Albert, 2007; Auliac et al., 2008). Markov chain Monte Carlo (MCMC) simulation has been one of the main heuristic search procedures in the context of Bayesian networks. BUGS (https://www.mrc-bsu.cam.ac.uk/software/bugs/), JAGS (http://mcmc-jags.sourceforge.net/), PyMC3 (https://pymc-devs.github.io/pymc3/index.html) and Stan (http://mc-stan.org/) are a few examples of software and packages that rely on MCMC sampling for Bayesian analysis.
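
The Metropolis-Hastings step at the heart of such MCMC samplers fits in a few lines. The following MATLAB fragment is illustrative only, with made-up data: it samples the posterior of a Bernoulli success probability under a uniform prior, a conjugate case where the exact answer is known and can validate the sampler.

k = 7; n = 10;                              % Hypothetical data: 7 successes in 10 trials
logpost = @(t) k*log(t) + (n-k)*log(1-t);   % Log posterior up to a constant
theta = 0.5;                                % Starting point of the chain
samples = zeros(5000, 1);
for i = 1:numel(samples)
    prop = theta + 0.1*randn;               % Symmetric random-walk proposal
    if prop > 0 && prop < 1 && log(rand) < logpost(prop) - logpost(theta)
        theta = prop;                       % Accept the move
    end
    samples(i) = theta;
end
mean(samples)                               % Approximates (k+1)/(n+2), approx. 0.67

Structure-sampling MCMC for networks replaces the scalar parameter with a candidate DAG and the random-walk proposal with local edge changes, but the accept/reject logic is the same.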

Several other tools intended to deal with Bayesian networks have been proposed in the literature. For example, a C++ toolkit for Bayesian analysis, BCM (Thijssen et al., 2016), provides implementations of eleven algorithms for sampling from posterior probability distributions and for evaluating marginal likelihoods. The R package bnlearn (Scutari, 2010) can infer Bayesian networks with either discrete or continuous variables, implements constraint- and score-based algorithms, and is parallelized. Another example is B-Course (Myllymäki et al., 2002), a web-based application that uses a combination of stochastic and greedy search heuristics, automatically discretizes variables, and handles missing data.

To compare different approaches, Werhli et al. (2006) evaluated three methods: relevance networks (RNs), graphical Gaussian models (GGMs) and Bayesian networks. For the Bayesian networks, MCMC methods were adopted. The authors showed that Bayesian networks outperformed GGMs and RNs when considering interventional data (perturbations of nodes). However, no significant difference was found with observational data (nodes unperturbed).

Standard Bayesian networks cannot describe closed loops and feedback mechanisms, since they rely on acyclic graphs (Hecker et al., 2009). Dynamic Bayesian networks (DBNs) overcome this limitation. Compared to static Bayesian networks, the main limitation of DBNs is increased computational complexity (Lee and Tzou, 2009). Several approaches based on DBNs have also been proposed for GRN inference. For example, Vinh et al. (2011) presented GlobalMIT, a Matlab toolbox for learning optimal DBNs. An information-theoretic scoring metric, the mutual information test (MIT), is used as the scoring metric for learning the networks. Another example is the ARTIVA (Auto Regressive Time Varying regulatory models) algorithm (Lèbre et al., 2010), which simultaneously infers the GRN topology and its temporal evolution. The method uses a combination of DBNs and reversible jump MCMC, and simultaneously considers all possible combinations of change points and topologies within different phases. A novel influence score for DBNs was developed by Yu et al. (2004); the score estimates the relative magnitude and the sign (activation or repression) of the interactions and can also prune false-positive links.

Due to their high number of parameters, Bayesian networks are normally restricted to small-scale GRN inference. Liu et al. (2016) proposed one possible way to tackle this problem. Their approach uses conditional mutual information (CMI) to construct an initial network, and then a series of local Bayesian networks (LBNs) is generated by selecting the k-nearest neighbours of each gene as the candidate regulators. This significantly reduces the search space over all possible GRNs. The use of CMI to construct the initial network and the creation of a series of LBNs make it possible to deal with large-scale networks.

URL: https://www.sciencedirect.com/science/article/pii/S0303264718302843

Computational intelligence techniques for medical diagnosis and prognosis: Problems and current developments

Afzal Hussain Shahid, M.P. Singh, in Biocybernetics and Biomedical Engineering, 2019

4.1.6 Bayesian networks

Bayesian networks (BNs) [69], also known as probabilistic causal networks or belief networks, came into existence by merging concepts from probability theory and graph theory to express the relationships between variables [70]. The origin of BNs lies within DM and ML techniques [71,72], which can capture the induced probabilistic influences in big data sets. A BN is considered to be a robust knowledge representation technique as well as an effective technique for reasoning in the presence of uncertainty [73]. BNs can be represented by a directed acyclic graph where the nodes represent the variables and the directed edges represent causality [74]. BNs can effectively discover relationships between variables, distinguishing direct and indirect dependencies [75,76]. BN modeling is extensively used in many fields, for instance, clinical decision support systems [77,78], systems biology [79,80], analysis of complex disease [81], multiple disease interaction [82], and influenza and human immunodeficiency virus (HIV) research [83,84]. Recently, Dutta et al. [85] used the BN model to develop a diagnostic device that estimates the grasp and grip efficiency of patients who survived stroke.

Liu et al. [86] identified the impact of each feature on diagnosing breast cancer (BC) with the help of BN modeling. To construct the BNs, statistical computational methods and the K2 learning algorithm [72] were used. For the ultrasound (US) dataset, cell shape was identified as the most important feature for BC diagnosis. However, for the fine needle aspiration cytology (FNAC) dataset, bare nuclei were the most significant feature for distinguishing malign and benign breast tumors. The authors inferred that there is a strong interdependency between cell size and cell shape. The discovery of probabilistic dependencies between different features assists clinicians in making an accurate diagnosis using only a few features.

Fuster-Parra et al. [87] applied BN modeling to understand and discover the relationships between several cardiovascular risk factors (CVRF). They found relationships between thirteen epidemiological features related to heart age. This knowledge was used for analyzing the cardiovascular risk score (CVRS), cardiovascular lost years (CVLY), and metabolic syndrome (MetS). The resulting BNs were used for making inferences with different reasoning methods (e.g. evidential reasoning, causal reasoning, and intercausal reasoning). Many direct and indirect relations were discovered between different CVRF. The graphical representation of the relations between different CVRF produced by the BNs was found to be intuitive and transparent.

URL: https://www.sciencedirect.com/science/article/pii/S0208521619300452