Random Forest Classification on an FPGA

15418 Final Project Site by Vivek Krishnan



Summary

I implemented a random forest classification system on an FPGA that outputs classification results at a rate of one sample per clock cycle. I also built an alternate version of the system that can signal and wait for memory responses from off-board memory. To analyze performance gains, I created a CPU implementation of the algorithm using OpenMP and compared execution cycles and power consumption across the two platforms.

Background

A Random Forest is an ensemble classifier built from many decision tree models. Training involves repeatedly taking a subset of the sample data with replacement and growing a decision tree on each subset, where each level of a tree splits on one of the input variables. Random Forest classifiers are highly accurate, scale well to large datasets, and also provide information about variable importance and interaction.

The purpose of my project is to speed up the classification phase of the random forest algorithm (as opposed to the growing of the decision trees).

    // Brief overview of the classification algorithm

    // independent iterations
    for (sample in sample_data) {

        sample_result = 0;

        // independent iterations
        for (tree in forest) {
            local_result = 0;
            current_node = tree.base_node;

            for (level = 0; level < max_tree_depth; level++) {

                node_index      = NUM_NODES_PER_TREE * tree.idx + current_node;
                node_threshold  = get_threshold(node_index);
                sample_var_idx  = get_corresponding_var(node_index);
                sample_var_data = sample.getVar(sample_var_idx);

                // At a branching (internal) node in the tree
                if (level < max_tree_depth - 1) {
                    branchLeft   = (sample_var_data < node_threshold);
                    current_node = next_node(branchLeft);
                // At a leaf node (the last level of the tree)
                } else {
                    local_result = get_result(node_index);
                }
            }
            sample_result += local_result;
        }

        sample_responses[sample] = process_result(sample_result);
    }
       

The classification algorithm above is parallelizable at both the tree level and the sample level. I plan to make use of pipelining on the FPGA in order to maximize the parallelism I can get from both levels.
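As a concrete illustration, here is a minimal sketch of the sample-level parallelism on the CPU side using OpenMP. The flattened-array layout and all of the names below are assumptions for illustration, not the project's actual data structures.

    #include <omp.h>

    /* Minimal sketch of sample-level parallelism with OpenMP.  The
     * flattened node arrays and names are illustrative assumptions. */
    void classify_all(const float *samples,      /* num_samples x num_vars  */
                      int num_samples, int num_vars,
                      const float *thresholds,   /* per-node split threshold */
                      const int   *var_idx,      /* per-node variable index  */
                      const float *leaf_result,  /* per-node leaf output     */
                      int num_trees, int nodes_per_tree, int max_depth,
                      float *responses)
    {
        /* Iterations of the sample loop are fully independent. */
        #pragma omp parallel for schedule(static)
        for (int s = 0; s < num_samples; s++) {
            const float *samp = &samples[s * num_vars];
            float sample_result = 0.0f;

            for (int t = 0; t < num_trees; t++) {
                int node = 0;  /* root of tree t within its node array */
                for (int level = 0; level < max_depth - 1; level++) {
                    int idx = t * nodes_per_tree + node;
                    /* branch left -> child 2n+1, right -> child 2n+2 */
                    node = (samp[var_idx[idx]] < thresholds[idx])
                               ? 2 * node + 1 : 2 * node + 2;
                }
                sample_result += leaf_result[t * nodes_per_tree + node];
            }
            responses[s] = sample_result;  /* process_result omitted */
        }
    }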

The main structure in the random forest algorithm is the decision tree, where each internal node contains a sample-variable index and a threshold that decide whether a sample proceeds down the left branch or the right branch. For each decision tree, the input is a sample of data and the output is a value computed as (weight * samp[idx] + offset), where weight, idx, and offset are determined by the leaf node at which the sample ends up.
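As a small sketch, the leaf evaluation described above could look like the following in C. The leaf_t layout is an assumption; only the (weight * samp[idx] + offset) formula comes from the design.

    /* Illustrative layout for a leaf node; only the formula below is
     * taken from the design, the struct itself is an assumption. */
    typedef struct {
        float weight;   /* scale applied to the chosen sample variable */
        int   idx;      /* which variable of the sample to read        */
        float offset;   /* additive constant                           */
    } leaf_t;

    static inline float eval_leaf(const leaf_t *leaf, const float *samp)
    {
        return leaf->weight * samp[leaf->idx] + leaf->offset;
    }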

Approach

My current design implementation is on the design page.

I targeted the Altera DE-2 FPGA boards with Cyclone IV FPGAs that are available in the labs of Hamerschlag Hall. To get my designs onto the board, I wrote them in the Verilog HDL and synthesized them using Quartus Pro.

At first glance, the random forest classification algorithm may not look like it has much to gain from implementation on an FPGA or other hardware device. For any given decision tree in the forest, traversing each level of the tree requires only one comparison followed by one lookup of the next node of the tree, so the computation-to-communication ratio is inherently low in this algorithm. Each decision also requires that the decision at the previous level has already been made.

In addition, machine learning algorithms have to deal with increasingly large data sets that may not fit entirely in one FPGA's memory. While FPGAs have very fast internal memory bandwidth, they have limited capacity to pass data in and out of the chip.

However, the algorithm is massively data parallel: the processing of each sample through each decision tree is completely independent.

It was interesting to actually design the parallel architecture components for the algorithm's execution; I did not have to worry about mapping code onto an existing architecture and watching unforeseen compilation effects limit the parallel performance of my code.

The only reference code I used in my solution was the set of modules that control data input and output through the USB Blaster provided on the board.

Results

In order to establish a fair standard for the performance of the algorithm on different devices, I used a metric of samples per clock cycle. I consistently used 32-bit data to model operations on floating-point values. For the workload, I varied the number of trees in the forest and the number of samples I put through the machine. To obtain clock-cycle counts, I used a cycle timer to measure execution time of the CPU version and converted that time to clock cycles using each machine's clock frequency. CPU power was similarly estimated from manufacturer performance statistics, while a multimeter was used to measure the FPGA units.
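For concreteness, the time-to-cycles conversion is simply elapsed time multiplied by clock frequency. A minimal sketch follows; the frequency, timing, and sample count are placeholders, not measured values.

    #include <stdio.h>

    /* Sketch of the time-to-cycles conversion used to compare
     * platforms; all numbers below are placeholder values. */
    int main(void)
    {
        double cpu_hz      = 3.2e9;    /* assumed CPU clock frequency, Hz */
        double elapsed_sec = 0.0042;   /* measured execution time          */
        double num_samples = 1.0e6;    /* samples classified in that time  */

        double cycles = elapsed_sec * cpu_hz;
        printf("total cycles:      %.0f\n", cycles);
        printf("samples per cycle: %g\n", num_samples / cycles);
        return 0;
    }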

Here I provide some graphs of speedup and energy efficiency that I recorded.

Performance on the CPU version was limited by the low computation-to-communication ratio, and the problem size for each tree was too small to justify creating a new thread per tree. I found that parallelizing over just one variable yielded better performance than parallelizing over both samples and trees. In addition, almost all of the execution time of the CPU version was spent summing the per-tree results in the reduction phase.
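For reference, here is a minimal sketch of that tree-level reduction using OpenMP's reduction clause; classify_sample() is a hypothetical helper wrapping the per-tree walk from the pseudocode above, not a function from the actual implementation.

    /* Hypothetical helper: walks one tree for one sample (see the
     * pseudocode in Background). */
    float classify_sample(const float *samp, int tree_idx);

    /* Tree-level parallelism for a single sample: per-tree partial
     * results are combined with an OpenMP reduction instead of locks. */
    float classify_one(const float *samp, int num_trees)
    {
        float sample_result = 0.0f;
        #pragma omp parallel for reduction(+ : sample_result)
        for (int t = 0; t < num_trees; t++)
            sample_result += classify_sample(samp, t);
        return sample_result;
    }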

On the FPGA, the limiting factor in parallel performance was the unit's I/O, which had limited capacity to transmit data. A better solution, therefore, was to cap the bandwidth of the system in a way that can be adjusted to accommodate the changing I/O capabilities of different chips. Most of the execution time of this system is spent in the forest computation. Because the sample input arrives at a consistent rate, I did not have to use a series of locks on the summing tree in order to add up all of the results of the computation; I could simply add the input on a bus at each interval and then increment the summing-tree entry the bus was pointing to. As a result, summing the values took no longer than the execution of the random forest itself.
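As a rough illustration of that lock-free summing, here is a cycle-level C model of the behavior (an illustration only; the real design is Verilog on the FPGA, and the names here are assumptions).

    /* Cycle-level C model of the lock-free summing bus described above.
     * Each "cycle", one tree result arrives on the bus tagged with the
     * sample it belongs to and is folded into that sample's accumulator,
     * so accumulation fully overlaps the forest computation. */
    void sum_bus(const float *bus_value,   /* result arriving each cycle */
                 const int   *bus_sample,  /* sample id for each cycle   */
                 int num_cycles,
                 float *sample_sums)       /* one accumulator per sample */
    {
        for (int cycle = 0; cycle < num_cycles; cycle++)
            sample_sums[bus_sample[cycle]] += bus_value[cycle];
    }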