High-throughput methodologies and machine learning have been central to developing a systems-level understanding of molecular biology. Unfortunately, performing such integrative analyses has traditionally been reserved for bioinformaticians. This is now changing with the appearance of resources that help bench-side biologists become skilled at computational data analysis and at handling large omics data sets.
With the increasing use of advanced technology and the rapidly growing amount of data in bioinformatics, it is vital to introduce effective and efficient methods for handling big data using distributed and parallel computing technologies. Big data analytics can examine large data sets and analyze and correlate genomic and proteomic information.
Supervised, unsupervised, and hybrid machine learning approaches are the most widely used tools for descriptive and predictive analytics on big data. Beyond these, many techniques from mathematics are used in big data analytics.
The problem of big data volume can be partly mitigated by dimensionality reduction. Linear mapping methods, such as principal component analysis (PCA) and singular value decomposition, as well as non-linear mapping methods, such as Sammon’s mapping, kernel principal component analysis, and Laplacian eigenmaps, have been widely used for this purpose.
Another important tool in big data analytics is mathematical optimization. Subfields of optimization, such as constraint satisfaction programming, dynamic programming, and heuristics and metaheuristics, are widely used in AI and machine learning problems. Other important optimization methods include multi-objective and multi-modal optimization methods, such as Pareto optimization and evolutionary algorithms, respectively.
Statistics is considered a counterpart to machine learning; the two are differentiated by their emphasis on a data model versus an algorithmic model, respectively. The two fields have absorbed ideas from each other. Statistical concepts, such as expectation-maximization and PCA, are widely adopted in machine learning problems. Similarly, machine learning techniques, such as probably approximately correct (PAC) learning, are used in applied statistics. Both fields have been heavily used for big data analytics.
Big data analytics is closely related to data mining. Mining big data is more challenging than traditional data mining because of the massive data volume. The common practice is to extend existing data mining algorithms to cope with massive datasets by executing them on samples of the big data and then merging the sample results. Clustering algorithms of this kind include:
- CLARA (Clustering LARge Applications)
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
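The sample-based strategy described above can be sketched in a few lines. The following is a simplified, CLARA-style illustration only: it clusters several random samples with a basic alternating k-medoids routine (not the full PAM swap procedure that CLARA normally uses), scores each candidate set of medoids against the full data set, and keeps the best one. The function names (`clara`, `k_medoids`, `total_cost`) and the parameters `n_samples` and `sample_size` are hypothetical and chosen for this example.

```python
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def k_medoids(points, k, rng, max_iter=20):
    """Simplified k-medoids (alternating assign/update), not full PAM."""
    medoids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assign each point to the cluster of its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[nearest].append(p)
        # In each cluster, pick the member with the smallest total distance.
        new_medoids = []
        for i, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[i])  # keep old medoid if cluster is empty
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(dist(c, q) for q in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """Cluster several small samples; keep the medoids that fit ALL points best."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = k_medoids(sample, k, rng)
        cost = total_cost(points, medoids)  # evaluated on the full data set
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best, best_cost


# Illustrative synthetic data: two Gaussian blobs in 2-D.
demo_rng = random.Random(1)
data = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(500)]
data += [(demo_rng.gauss(10, 1), demo_rng.gauss(10, 1)) for _ in range(500)]
medoids, cost = clara(data, k=2)
print(medoids, round(cost, 2))
```

The expensive clustering step only ever sees `sample_size` points, while the full data set is touched once per sample for scoring, which is what keeps the approach tractable on large inputs.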
Researchers and biologists have also emphasized reducing the computational complexity of data mining algorithms.
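As one concrete instance of the linear dimensionality reduction mentioned earlier, the sketch below projects multi-dimensional rows onto their first principal component. It is a minimal illustration under stated assumptions: it uses power iteration on the sample covariance matrix as a simple way to find the leading component, it assumes the data are not constant (otherwise the covariance is zero and the normalization fails), and the function name `first_principal_component` is hypothetical. Practical analyses would typically rely on an SVD routine from a numerical library instead.

```python
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (unit direction, 1-D scores) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    # Center each column so that variance is measured about the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by cov and renormalize.
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project every centered row onto the direction: d dimensions become 1.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Illustrative data: 3-D rows that mostly vary along a single direction.
demo_rng = random.Random(42)
rows = [(t, 2 * t + demo_rng.gauss(0, 0.1), -t + demo_rng.gauss(0, 0.1))
        for t in range(50)]
direction, scores = first_principal_component(rows)
print([round(x, 3) for x in direction])
```

Each row is reduced from three values to a single score, which is the sense in which dimensionality reduction eases the volume problem.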
Architecture for Big Data Analytics
- Fault-tolerant graph architecture
- Streaming graph architecture
For a traditional, enhanced, or new machine learning method to be applied to big data, the following properties are essential:
Scalable to high volume: The method should be able to handle large chunks of data with low space complexity and low disk overhead.
Robust with high velocity: The method should have low time complexity and be able to digest and process stream data in real time without any degradation in performance.
Transparent to variety: Big data can be semi-structured or unstructured in nature, so the method should be able to handle such data.
Incremental: Typically, machine learning methods operate on entire datasets at once without accounting for the situation where the dataset dynamically grows over time.
Distributed: A machine learning method should allow distributed processing on partial data and merging of the partial results. With big data sources distributed around the world, all data may not be available at a single location for big data analytics.
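The incremental and distributed properties above can be illustrated with a very small example: a running mean and variance that absorbs one observation at a time and whose partial summaries can be merged. This is only a sketch of the idea, not a full learning method. It uses Welford's update for the incremental step and the standard pairwise combination formula for the merge step; the class name `RunningStats` and its methods are hypothetical names chosen for this example.

```python
class RunningStats:
    """Running count, mean, and sum of squared deviations from the mean."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Incremental step: absorb one observation in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Distributed step: combine two partial summaries into a new one."""
        merged = RunningStats()
        merged.n = self.n + other.n
        if merged.n == 0:
            return merged
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.n / merged.n
        merged.m2 = (self.m2 + other.m2
                     + delta * delta * self.n * other.n / merged.n)
        return merged

    @property
    def variance(self):
        """Sample variance (0.0 when fewer than two observations were seen)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Two "sites" each summarize their own partial data; only summaries are merged.
site_a, site_b = RunningStats(), RunningStats()
for x in [2, 4, 4, 4]:
    site_a.update(x)
for x in [5, 5, 7, 9]:
    site_b.update(x)
combined = site_a.merge(site_b)
print(combined.n, combined.mean, combined.variance)
```

Because each site keeps only three numbers, the raw data never has to be moved to a single location, and new observations can be folded in as they arrive.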