Tuesday, 22 November 2016

Use Faster R-CNN and ResNet code for object detection and image classification with your own training data

I have recently uploaded two repositories to GitHub, both based on publicly available code for state-of-the-art (1) object detection and (2) image classification. I would like to leave a few notes on them here.

(1) Faster R-CNN for object detection (GitHub Link).

You can use your own PASCAL VOC formatted data to train an object detector. The network parameters need to be altered accordingly, as shown in the example files located in:
person_detection_voc2012/py-faster-rcnn/models/pascal_voc/ZF/faster_rcnn_alt_opt/*.pt
In particular, you want to change the following settings in stage1_fast_rcnn_train.pt and stage2_fast_rcnn_train.pt:
num_classes: 2 # in our example, person detection has only two classes: person vs. background
In cls_score -- num_output: 2
In bbox_pred -- num_output: 8 # this value is 4 * num_classes
Also in stage1_rpn_train.pt and stage2_rpn_train.pt:
num_classes: 2
Finally, in faster_rcnn_test.pt:
In cls_score -- num_output: 2
In bbox_pred -- num_output: 8 # this value is 4 * num_classes
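If you want to locate these fields quickly, a simple grep over the model definitions in the directory above does the job:
grep -nE "num_classes|num_output" person_detection_voc2012/py-faster-rcnn/models/pascal_voc/ZF/faster_rcnn_alt_opt/*.pt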
Additionally, you need to modify lib/datasets/pascal_voc.py:
self._classes = ('__background__', # always index 0
                 'person')
And then byte-compile it again from a Python prompt:
import py_compile
py_compile.compile(r'pascal_voc.py')
You can then follow the instructions from this page to train your model.
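For reference, alternating-optimization training in upstream py-faster-rcnn is launched roughly as follows (GPU id and network name are arguments); the exact script and arguments for my repository may differ, so treat this only as a sketch:
# run from the py-faster-rcnn root: train on GPU 0 with the ZF network
cd person_detection_voc2012/py-faster-rcnn
./experiments/scripts/faster_rcnn_alt_opt.sh 0 ZF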

(2) Fine-tuning ResNet for image classification (GitHub Link).

This one is simple to use, and you may want to check this out before attempting to fine-tune a ResNet model.

Example scripts can be found in: finetune-resnet-flower/caffe/examples/flower463/

Network parameters can be found in: finetune-resnet-flower/caffe/models/resnet_flower463/

Note that the parameters in solver50.prototxt may not be optimal (at least they were not for my task). For better performance (at the cost of slower training), you can try increasing stepsize, as shown below:
test_iter: 2000
test_interval: 1000
base_lr: 0.001
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 500
max_iter: 1000000
momentum: 0.9
weight_decay: 0.0005
Also, set the batch size appropriately to match the GPU memory available on your system.
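To launch fine-tuning, a typical Caffe command line looks like the sketch below, run from the repository's caffe directory. Note that the pretrained weights file name (ResNet-50-model.caffemodel) is only an assumption here -- point --weights at whichever ResNet .caffemodel you downloaded:
./build/tools/caffe train \
    --solver=models/resnet_flower463/solver50.prototxt \
    --weights=models/resnet_flower463/ResNet-50-model.caffemodel \
    --gpu=0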

CaffeOnSpark with Hadoop YARN Cluster from scratch

In this blog post, I would like to summarise the steps necessary for installing CaffeOnSpark from scratch -- that is, assuming you have neither Hadoop nor Spark installed and would like to set everything up from scratch (i.e., starting from a clean Ubuntu installation).

The main benefit of using CaffeOnSpark to train a CNN in a distributed manner is that you can increase the effective training mini-batch size, which can improve model quality. In addition, you can save training time by spreading the work across several machines.

Note that the main steps are also available from this page; however, this is a complete step-by-step guide that also covers Hadoop installation and setup.

We assume an Ubuntu 14.04 LTS installation for the rest of this post.

1 Installing Caffe

The steps for installing Caffe can be found on its official website. To ensure you have installed Caffe correctly, run 'make runtest' from a bash prompt and make sure Caffe passes all shipped tests. This step ensures you have things like CUDA and libcudnn configured correctly.
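For completeness, a typical Makefile-based Caffe build on Ubuntu 14.04 looks roughly like this, assuming you have already installed the dependencies and cloned Caffe into ~/caffe:
user@hdpc1:~$ cd caffe
user@hdpc1:~/caffe$ cp Makefile.config.example Makefile.config   # then edit for your CUDA/cuDNN setup
user@hdpc1:~/caffe$ make all -j8
user@hdpc1:~/caffe$ make test -j8
user@hdpc1:~/caffe$ make runtest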

2 Configure a Hadoop cluster, and download & compile CaffeOnSpark

We assume that we have two computers each with one CUDA-enabled GPU. We call them hdpc1 and hdpc2. hdpc1 is the Master node and hdpc2 is a Slave node. For a configuration with more machines, simply add more Slave nodes.

You would need to run the following bash commands on both hdpc1 and hdpc2:

2.1 Create Hadoop users and groups and provide sudo access:
user@hdpc1:~$ sudo addgroup hadoop_group
user@hdpc1:~$ sudo adduser --ingroup hadoop_group hduser1
user@hdpc1:~$ sudo adduser hduser1 sudo
2.2 Switch to hduser1 and use hduser1 for all remaining commands:
user@hdpc1:~$ su hduser1
2.3 Install JDK:
hduser1@hdpc1:~$ sudo apt-get install default-jdk
2.4 In your bashrc file (i.e., ~/.bashrc), make sure you have the correct JAVA_HOME path set, for example:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export PATH=${JAVA_HOME}/bin:${PATH}
2.5 Check that Java has been configured correctly and verify its version:
hduser1@hdpc1:~$ java -version
2.6 Next, we would like to ensure that the hduser1 accounts on both computers can log in to each other remotely via SSH without a password. You can follow the steps outlined in this page. Make sure you have tested this on both machines, as this will also add the RSA fingerprints of both computers to their peers.
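For example, a minimal key-based setup from hdpc1 to hdpc2 looks like this (repeat in the other direction, from hdpc2 to hdpc1):
hduser1@hdpc1:~$ ssh-keygen -t rsa -P ""      # accept the default key location
hduser1@hdpc1:~$ ssh-copy-id hduser1@hdpc2    # copy the public key to hdpc2
hduser1@hdpc1:~$ ssh hdpc2                    # should now log in without a password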

2.7 Download the CaffeOnSpark source code (here we clone into ~/caffeonspark, which the paths below assume):
hduser1@hdpc1:~$ mkdir -p ~/caffeonspark && cd ~/caffeonspark
hduser1@hdpc1:~/caffeonspark$ git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
And set several paths in the bashrc file:
export CAFFE_ON_SPARK=/home/hduser1/caffeonspark/CaffeOnSpark
export HADOOP_HOME=/home/hduser1/caffeonspark/CaffeOnSpark/scripts/hadoop-2.6.4
export PATH=${HADOOP_HOME}/bin:${PATH}
export SPARK_HOME=/home/hduser1/caffeonspark/CaffeOnSpark/scripts/spark-1.6.0-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:${PATH}
2.8 Install Maven:
hduser1@hdpc1:~$ sudo apt-get install maven
And add Maven's path to the bashrc file:
export MAVEN_HOME=/usr/share/maven
export PATH=${MAVEN_HOME}/bin:${PATH}
2.9 Add the host names to the /etc/hosts file on both machines. Note that here hdpc1 and hdpc2 are the host names used in the Hadoop cluster, and OptiPlex-1000 and OptiPlex-1001 are the Ubuntu host names; both must be added:
192.168.46.212 hdpc1 OptiPlex-1000
192.168.46.39 hdpc2 OptiPlex-1001
It is important to make sure there are no additional entries for the two machines above in /etc/hosts; otherwise, Spark programs may hang while running. For example, a typical hosts file on hdpc1 looks like this:
127.0.0.1 localhost
127.0.1.1 OptiPlex-1000

192.168.46.212 hdpc1 OptiPlex-1000
192.168.46.39 hdpc2 OptiPlex-1001

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
2.10 Set environment variables in the bashrc file. Here we have two Spark workers, i.e., SPARK_WORKER_INSTANCES=2, and one GPU on each machine, i.e., DEVICES=1. These settings may be altered when you submit your Spark job later:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib:${LD_LIBRARY_PATH}
export MASTER_URL=spark://hdpc1:7077
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
And load the settings:
hduser1@hdpc1:~$ source ~/.bashrc
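As a quick sanity check, the derived variables should now expand as expected:
hduser1@hdpc1:~$ echo ${MASTER_URL} ${TOTAL_CORES}
spark://hdpc1:7077 2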
2.11 Install Hadoop and Spark. The second command below copies the Hadoop configuration files to the Hadoop configuration directory:
hduser1@hdpc1:~$ ${CAFFE_ON_SPARK}/scripts/local-setup-hadoop.sh
hduser1@hdpc1:~$ cp ${CAFFE_ON_SPARK}/scripts/*.xml ${HADOOP_HOME}/etc/hadoop
hduser1@hdpc1:~$ ${CAFFE_ON_SPARK}/scripts/local-setup-spark.sh
2.12 Create a local directory for HDFS and set access permissions:
hduser1@hdpc1:~$ sudo mkdir -p /app/hadoop/tmp
hduser1@hdpc1:~$ sudo chown hduser1:hadoop_group /app/hadoop/tmp
hduser1@hdpc1:~$ sudo chmod 750 /app/hadoop/tmp
2.13 Modify the Hadoop configuration files in ${HADOOP_HOME}/etc/hadoop as follows:

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hdpc1:9000</value>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>
</configuration>

hdfs-site.xml (dfs.replication should not exceed the number of Datanodes in the cluster; here both machines run a Datanode, so we use 2. The minimum is 1, and the recommended value for a cluster of 3 or more machines is 3.)

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
   <property>
      <name>mapred.job.tracker</name>
      <value>hdpc1:9001</value>
   </property>
</configuration>

yarn-site.xml (we enable YARN log aggregation here so that we can see the full logs if we wish to)

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hdpc1</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.disk-health-checker.enable</name>
      <value>false</value>
    </property>
    <property>
      <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
      <value>98</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>
    <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
    </property>
    <property>
      <description>Where to aggregate logs to.</description>
      <name>yarn.nodemanager.remote-app-log-dir</name>
      <value>/home/hduser1/yarn-logs</value>
    </property>
    <property>
      <name>yarn.log-aggregation.retain-seconds</name>
      <value>259200</value>
    </property>
    <property>
      <name>yarn.log-aggregation.retain-check-interval-seconds</name>
      <value>3600</value>
    </property>
    <!-- Optionally cap the resources YARN may use per node; note that
         yarn.nodemanager.resource.memory-mb is specified in MB:
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>8</value>
    </property>
    -->
</configuration>

2.14 Set the Master and Slaves in the masters and slaves files in ${HADOOP_HOME}/etc/hadoop as follows:

masters

hdpc1

slaves

hdpc1
hdpc2

This makes hdpc1 the Master; both hdpc1 and hdpc2 run Datanodes, so both participate in the computation.
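Since both machines need the same Hadoop configuration, an easy way to keep them in sync is to edit the files on hdpc1 and copy them over (this assumes the identical directory layout on both machines, as set up above):
hduser1@hdpc1:~$ scp ${HADOOP_HOME}/etc/hadoop/*.xml ${HADOOP_HOME}/etc/hadoop/masters ${HADOOP_HOME}/etc/hadoop/slaves hduser1@hdpc2:${HADOOP_HOME}/etc/hadoop/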

2.15 Compile the CaffeOnSpark source code:
hduser1@hdpc1:~$ cd ${CAFFE_ON_SPARK}/caffe-public/
hduser1@hdpc1:~$ cp Makefile.config.example Makefile.config
hduser1@hdpc1:~$ echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
And then open Makefile.config and edit it as appropriate; enable only one of these, depending on your hardware:

Makefile.config

CPU_ONLY := 1  # only if you are building without a GPU
USE_CUDNN := 1 # to use cuDNN (requires a CUDA-capable GPU)

And then run from a bash prompt:
hduser1@hdpc1:~$ cd ${CAFFE_ON_SPARK}
hduser1@hdpc1:~$ make build
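If the build succeeds, the Spark job jar used later should now exist:
hduser1@hdpc1:~$ ls ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar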
3 Start the Hadoop YARN cluster and submit a Spark job

You only need to run the following commands on the Master (hdpc1) machine.

3.1 Format the HDFS file system:
hduser1@hdpc1:~$ ${HADOOP_HOME}/bin/hdfs namenode -format
In case the Namenode is unable to start normally on hdpc1, try cleaning up /app/hadoop/tmp and re-formatting the HDFS file system with the command above, as shown below. If the Namenode still does not start, you can set the access permissions of /app/hadoop to 777.
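Concretely, the recovery steps described above are:
hduser1@hdpc1:~$ ${HADOOP_HOME}/sbin/stop-dfs.sh      # stop DFS first if it is running
hduser1@hdpc1:~$ sudo rm -Rf /app/hadoop/tmp/*
hduser1@hdpc1:~$ ${HADOOP_HOME}/bin/hdfs namenode -format
hduser1@hdpc1:~$ sudo chmod -R 777 /app/hadoop        # last resort only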

3.2 Start DFS:
hduser1@hdpc1:~$ ${HADOOP_HOME}/sbin/start-dfs.sh
You should be able to open http://hdpc1:50070 and see the Hadoop web UI. It should report two live nodes, as well as information such as the remaining disk space on the DFS.

3.3 Start YARN:
hduser1@hdpc1:~$ ${HADOOP_HOME}/sbin/start-yarn.sh
You should now be able to use the 'jps' command to check that the Hadoop daemons have started. For example:

On hdpc1:

hduser1@hdpc1:~/caffeonspark/CaffeOnSpark/scripts/hadoop-2.6.4/etc/hadoop$ jps
13171 Jps
2338 SecondaryNameNode
2523 ResourceManager
2664 NodeManager
2143 DataNode
1983 NameNode

On hdpc2:

hduser1@hdpc2:~/caffeonspark/CaffeOnSpark/scripts$ jps
7132 Jps
1892 DataNode
2031 NodeManager

You can also open http://hdpc1:8088 to see the YARN web UI.


3.4 Copy data to HDFS (we use MNIST as an example here; create the target directory first if it does not already exist):
hduser1@hdpc1:~$ ${CAFFE_ON_SPARK}/scripts/setup-mnist.sh
hduser1@hdpc1:~$ hadoop fs -mkdir -p hdfs:/projects/machine_learning/image_dataset
hduser1@hdpc1:~$ hadoop fs -put -f ${CAFFE_ON_SPARK}/data/mnist_*_lmdb hdfs:/projects/machine_learning/image_dataset/
You can check if the files have been copied by accessing the Hadoop web UI.
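You can also list the directory from the command line to confirm:
hduser1@hdpc1:~$ hadoop fs -ls hdfs:/projects/machine_learning/image_dataset/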

3.5 Begin training on MNIST:
hduser1@hdpc1:~$ hadoop fs -rm -f hdfs:///mnist.model
hduser1@hdpc1:~$ hadoop fs -rm -r -f hdfs:///mnist_features_result
hduser1@hdpc1:~$ spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -devices ${DEVICES} \
        -connection ethernet \
        -model hdfs:///mnist.model \
        -output hdfs:///mnist_features_result
Once training has started, you can see from the YARN web UI that the state of your application turns to ACCEPTED. You can also follow the ApplicationMaster link to open the Spark web UI and track the progress of your application.


You can also check the training progress via the container logs.


When training is done, you will see in the YARN web UI that your application's FinalStatus turns to SUCCEEDED, together with information such as StartTime and FinishTime.


If there is an error, the FinalStatus will instead be FAILED, and you will be able to see the error messages. If the program hangs due to incorrect configuration etc., you can use the following command to kill it:
hduser1@hdpc1:~$ yarn application -kill application_1479197586987_0011
You can use the following command to dump application logs to a text file when your program finishes:
hduser1@hdpc1:~$ yarn logs -applicationId application_1479197586987_0009 >> ~/log09.txt
Finally, you can also check the trained model and results:
hduser1@hdpc1:~$ hadoop fs -ls hdfs:///mnist.model
hduser1@hdpc1:~$ hadoop fs -cat hdfs:///mnist_features_result/*
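If you would like to use the trained model locally with Caffe, you can also copy it out of HDFS, for example:
hduser1@hdpc1:~$ hadoop fs -get hdfs:///mnist.model ./mnist.model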