Tuesday, 25 October 2016

Dealing with compiler problems when installing XGBoost on macOS 10.12


XGBoost is a sexy library in machine learning, currently performing very well in recent Kaggle competitions. This post doesn't intend to describe the machinery of XGBoost, but rather to relate the issues I faced while installing the XGBoost Python package.

The holy pip command


I very much like working on a Unix-like system because installing Python libs is made easy through the pip command.
According to the XGBoost main page:

pip install xgboost

should do the work... but it didn't, at least in my environment. I got the following error:

Obtaining file:///Users/greghor/anaconda2/lib/python2.7/site-packages/xgboost/python-package
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Users/greghor/anaconda2/lib/python2.7/site-packages/xgboost/python-package/setup.py", line 19, in <module>
        LIB_PATH = libpath['find_lib_path']()
      File "/Users/greghor/anaconda2/lib/python2.7/site-packages/xgboost/python-package/xgboost/libpath.py", line 46, in find_lib_path
        'List of candidates:\n' + ('\n'.join(dll_path)))
    __builtin__.XGBoostLibraryNotFound: Cannot find XGBoost Library in the candidate path, did you install compilers and run build.sh in root path?
    List of candidates:
    /Users/greghor/anaconda2/lib/python2.7/site-packages/xgboost/python-package/xgboost/libxgboost.so
    /Users/greghor/anaconda2/lib/python2.7/site-packages/xgboost/python-package/xgboost/../../lib/libxgboost.so
    /Users/greghor/anaconda2/lib/python2.7/site-packages/xgboost/python-package/xgboost/./lib/libxgboost.so
    /Users/greghor/anaconda2/xgboost/libxgboost.so
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /Users/greghor/anaconda2/lib/python2.7/site-packages/xgboost/python-package/



I started by looking into this egg_info error and updated the corresponding components, but it didn't fix my issue.

Manual installation to account for multithreading

The main difficulty when you don't have a background in CS is that all the logs look so cryptic. I must honestly say that I often debug by trial and error without really understanding the underlying logic... this was my approach here, and I ended up following a tutorial explaining how to install XGBoost manually.
This approach has the benefit of enabling multithreading (which was seemingly not the case with pip install). Following the steps described in that tutorial seems to work for 99% of people, unfortunately not for me.
I was still stuck with the same error. Google told me that the root of all evil was probably a problem with the C++ compiler. There are different compilers available out there; I personally use gcc (for C) and g++ (for C++). XGBoost needs to be compiled with recent versions of gcc and g++ (it didn't work with gcc-4.9 and g++-4.9). While up-to-date versions of gcc and g++ were installed on my machine, the plain gcc and g++ commands still pointed to the old versions by default... and this was the key point. It took me a while to figure this out, but you also need to update the symbolic links pointing to the compilers:

cd /usr/local/bin    # assumed: the directory holding gcc-6 and g++-6 (Homebrew default); adjust to your setup
rm gcc g++           # remove the old links, if any
ln -s gcc-6 gcc      # set default gcc to gcc-6
ln -s g++-6 g++      # set default g++ to g++-6
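
Before rebuilding, it can be worth checking what the plain gcc and g++ commands actually resolve to. Here is a minimal sketch in Python 3 (it relies on shutil.which, which Python 2.7 does not have); the helper name resolve_compiler is my own and not part of XGBoost or pip:

```python
import os
import shutil


def resolve_compiler(name):
    """Return (path found on PATH, real file after following symlinks), or None."""
    found = shutil.which(name)
    if found is None:
        return None
    return found, os.path.realpath(found)


if __name__ == "__main__":
    for name in ("gcc", "g++"):
        result = resolve_compiler(name)
        if result is None:
            print(f"{name}: not found on PATH")
        else:
            link, target = result
            print(f"{name}: {link} -> {target}")
```

If the target printed for gcc is not the gcc-6 binary, the symbolic link (or the order of directories in PATH) is probably still pointing to the old compiler.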

This last step finally fixed my issues. XGBoost is now working like a charm! Kaggle folks, watch your backs! Here I come!


Tuesday, 16 February 2016

Learning D3.js with wine production in France

I had wanted to learn data visualization for a long time. Indeed, being able to communicate results in a clear and aesthetic manner is crucial for scientists. Results that cannot be communicated simply do not exist!

The D3.js JavaScript library, mainly developed by Mike Bostock during his Ph.D. in the Stanford Visualization Group, appeared in the early 2010s and quickly became a reference tool for data visualization (referred to as data viz by hipsters). D3.js produces interactive and aesthetically very pleasing charts. On top of that, it is said to be relatively easy to use. That was enough to motivate me to learn it.

 

Where to start?

 

This post is not designed as a tutorial for D3.js; rather, I want to briefly describe my learning curve, starting from (almost) scratch, up to my first data viz. There are plenty of excellent tutorials out there; I especially recommend Scott Murray's tutorial as well as the Dashing D3.js website.

As mentioned above, D3.js is a JavaScript library primarily created for the web. As a physicist, I was not skilled in JavaScript, which, I must say, was an additional difficulty when learning D3.js.

My objective was to implement a data viz displaying geographical data in an interactive manner. Obviously, it makes sense to choose a map as the support for the data. After a few minutes on Google, I ended up with a dataset giving the wine production of each French department (French departments are the equivalent of UK counties). Since I am a wine lover, I think this is a wonderful topic for my first data viz!

 

1) Get the French department’s data

 

Geographical data can be easily manipulated in D3.js with the GeoJSON and TopoJSON formats.

The original datasets are freely available online on the IGN website. However, the data cannot be directly downloaded in a D3.js-friendly format and first need to be converted to GeoJSON or TopoJSON. Fortunately, I found another tutorial where the departments' data were ready to use. This website, created by data scientists based in Paris, was of invaluable help to me. The map presented below is closely inspired by their work. I am really grateful to these people; they were extremely helpful and always ready to answer questions.
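
To give an idea of what such a file looks like, here is a tiny hand-written GeoJSON example, read with Python rather than D3.js. Everything in it is an assumption made for illustration: the property names (code, name) are hypothetical and the polygon is a rough rectangle, not the real outline of the department.

```python
import json

# Hypothetical, heavily simplified feature: real department outlines have many more points.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"code": "33", "name": "Gironde"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-1.2, 44.2], [0.3, 44.2], [0.3, 45.6], [-1.2, 45.6], [-1.2, 44.2]]]
      }
    }
  ]
}
"""

collection = json.loads(geojson_text)

# Each feature carries its own properties; the department code is the key used to attach other data.
codes = [feature["properties"]["code"] for feature in collection["features"]]
print(codes)
```

Each feature pairs a geometry (here a polygon given as longitude/latitude pairs) with free-form properties, and it is one of those properties that serves as the key for linking other data to the department.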

 

2) Link the wine production data to each department

 

D3.js makes it possible to colour departments according to their wine production. To do so, we just need to link the wine production to each department by using a common key. Then, we associate a colour that scales with the wine production (the darker the department, the more wine it produces). Finally, I added a tooltip that displays the exact wine production when the mouse pointer hovers over the department.
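
The logic of this step (join on a common key, then map the value to a colour) can be sketched independently of D3.js. The Python snippet below is only an illustration of the idea, not the code behind the map: the function names are mine, the production figures are invented, and the white-to-dark-red ramp is an arbitrary choice. In D3.js, the colour mapping would typically be handled by one of its scale functions.

```python
# Hypothetical sample data: the department codes are real, the production figures are invented.
departments = [
    {"code": "33", "name": "Gironde"},
    {"code": "34", "name": "Hérault"},
    {"code": "75", "name": "Paris"},
]
production = {"33": 1500.0, "34": 1000.0}  # units: 1000 hl, made-up values


def join_production(departments, production):
    """Attach a production value to each department via the shared code (missing -> 0)."""
    return [dict(d, production=production.get(d["code"], 0.0)) for d in departments]


def colour_for(value, max_value):
    """Linear scale from white (0) to dark red (max_value), returned as a hex string."""
    t = 0.0 if max_value <= 0 else min(max(value / max_value, 0.0), 1.0)
    light, dark = (255, 255, 255), (128, 0, 0)
    rgb = [round(lo + t * (hi - lo)) for lo, hi in zip(light, dark)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


joined = join_production(departments, production)
max_value = max(d["production"] for d in joined)
for d in joined:
    print(d["name"], d["production"], colour_for(d["production"], max_value))
```

Departments without any production figure fall back to 0 and therefore stay white, which is one possible way to handle missing keys.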

The results are presented below. Santé!

 

Wine production in France (units: 1000 hl)

Colour scale legend: 500, 1,000, 1,500