**I plan to implement the Random Forest Classification Algorithm on an FPGA. I will then compare
energy consumption and performance of the algorithm vs current CPU implementations.**

### Background

A Random Forest is an ensemble classifier using many decision tree models. The training of this algorithm involves taking a subset of the sample data with replacement in order to grow decision trees that have levels directly related to the input variables. Random Forest Classifiers are highly accurate, scale well to large datasets, and give back information about variable interaction or importance as well.

The purpose of my project is to speedup the classification part of the random forest algorithm (as opposed to the growth of decision trees).

// Brief overview of classification algorithm for (sample in sample_data) { sample_result = 0; for (tree in forest) { local_result = 0; current_node = tree.base_node; //possible 0 for (level < max_tree_depth) { node_index = NUM_NODES_PER_TREE * treeIdx + current_node node_threshold = get_threshold (node_index); sample_var_idx = get_corresponding_var (node_index); sample_var_data = sample.getVar(sample_var_idx); //If at a branching node in the tree if (level < max_tree_depth) { branchLeft = (sample_var_data < node_threshold); current_node = next_node (branchLeft); //If at a leaf node in the tree } else { local_result = get_result (node_index); } } sample_result += local_result; } sample_responses[sample] = process_result (sample_result); }

The classification algorithm above is parallelizable at both the tree level and the sample level. I plan to to make use of pipelining on the FPGA in order to maximize the parallelism I can get from both ends.

### Challenge

At first glance, the random forest classification algorithm may not look like it has too much to gain from implementation on an FPGA or hardware device. For any given given decision tree in the forest, travesing each level of the tree requires only one comparison compared to one lookup of the next appropriate state of the tree. Therefore, the computation to communication ration is inherently low in this algorithm. Each decision also requires that the decision in the previous level has already been made. Another problem is that the many decision trees spit out by the tree-growing algorithms can greatly vary in size and shape so any fixed width solution (such as SIMD parallelsim) can be made inefficient by the data- dependent nature of the method. In addition, machine learning algorithms have to deal with increasingly larger data sets that might not be able to be completely stored on one FPGA memory. While, FPGAs have very fast internal memory bandwidth, they have limited capacity to pass data around between FPGAs. However, the algorithm is massively data parallel. The processing of a sample through each decision tree is completely independent

# Resources

For this task I will be implementing the algorithm in System Verilog on an Altera Cyclone IV FPGA chip. For comparison, I will use a standard C++ implementation of the algorithm using libraries such as OpenMP or AVX vector intrinsics in order to maximize performance of the algorithm on a CPU.

### Goals

From information about current performance gains from dedicated ASIC units on image processing on bolie devices, I expect to be able to deliver performance gains of about one to two orders of magnitude in respect to both time and energy. I expect to provide detailed charts that plot energy consumption and time consumption against dataset size.

Overall, I hope to gain information about the possible performance benefits that can be gained from implementing computer vision algorithms on ASIC chips. I expect to gain information about the possible pitfalls to implementations on the FPGA units and how performance of the algorithm scales as the dataset increases beyond the scope of one FPGA

A possible extension to this project is to explore performance of this algorithm on a GPU.

### Platform Choice

Heterogenous system design already is widely used in order to anticipate workloads that are more relevant to real world application of devices. For example, ASIC units in mobile devices today provide much of the image and video processing capabilities. The frequent use of these features as well as the energy and performance boosts gained by using a fixed function unit outweigh the lack of functionality the ASIC unit is capable of performing.

In order to fuel the trend towards computation intellegence in everyday devices, these devices must be able to perform basic machine learning and computer vision algorithms quickly and efficiently. Since these devices must constantly be aware of their surroundings, the resource drain of running these algorithms is non-trivial. Therefore, it makes sense to use an ASIC unit on these devices to perform common algorithms, such as the random forest classification.

The FPGA is a natural testbed upon which the design for an ASIC unit can be implemented and finely tuned. Therefore, the platform upon which I will implement my optimized solution is the FPGA. However, for comparison I will use an optimized CPU solution.

### Schedule

Week | Expected Goal | Actual Achievement |

3/28 - 4/4 | Come Up With Project Proposal | Decided to implement a Random Forest Classification algorithm on an FPGA |

4/4 - 4/11 | Understand Random Forest Classification Algorithm. Meet with Professor Kitani in order to understand the algorithm and identify parallelism. Meet with ECE T.A. in order to better understand how to program algorithms onto an FPGA device with memory requirements. Research current work in the field. | Researched random forest classification algorithm through online sources. Met with Professor Kitani and understood the steps needed to implement the algorithm on an FPGA. Updated the website with a control flow diagram for my FPGA implementation of the algorithm. Talked with Professor Thomas about System Verilog resources available to me as well as access to the FPGA units in Hammershlag Hall. |

4/11 - 4/18 | Create a C++ implementation of the random forest classification algorithm. Analyze the cost tradeoffs from multi-core parallelism and SIMD parallelsim on performance, as well as exploiting different levels of parallelism in the algorithm. Run the implementation on many CPU devices in order to normalize system performance. | Made a simple C++ implmentation of the random forest classification algorithm. |

4/18 - 4/25 | Implement the algorithm in System Verilog on the Altera FPGA | |

4/25 - 5/2 | Finish implementation onto the Altera FPGA and begin to analyze performance given different datasets | |

5/2 - 5/6 | Finalize Project Report and update site. Finish analysis on data |