Our Data Science and Artificial Intelligence team has earned a silver medal in a worldwide Kaggle challenge sponsored by Elo. We finished in the top 4%, placing 160th out of 4,200 teams, and we were also the best ranked among the Brazilian entries!
Our Data Science and Artificial Intelligence team rocks! From left to right: Luiz Nonenmacher, Marcelo de Almeida, Matheus Gonzaga, and Alan Delgado. Photo: Matheus Mignoni for Poatek (2019).
Kagglers were tasked with developing algorithms to help identify and serve the most relevant opportunities to customers of Elo, one of the largest payment companies in Brazil. The goal was to improve customers' lives and help Elo reduce unwanted campaigns by offering clients the best experience based on their profiles. Concretely, this meant predicting a given customer's loyalty to a given merchant or category of merchants.
The data available was an anonymized sample from Elo's own database. It included train and test datasets containing customers' transaction time series, indicating, with scaled values, the amount of each transaction, along with other metadata and the merchant each transaction was made with. It also included a separate dataset with information on individual merchants, such as the average amount spent by their clients or their average number of sales (both also scaled). This last dataset made things interesting, because teams had to use their own judgment on how to extract features from it.
The data was spread across five tables, so we first performed a preprocessing step to handle missing values and outliers and aggregate the information into a single dataset. After that, we extracted features from each column according to its data type. These ranged from simple descriptive statistics to information inferred from Brazilian holidays, as sketched below.
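To make this step concrete, here is a minimal sketch of per-customer feature aggregation in pandas, assuming a transactions table with columns like card_id, purchase_amount, purchase_date, and merchant_id (illustrative names, not the exact schema we used), and the python-holidays package for the Brazilian holiday flag:

```python
import pandas as pd
import holidays  # python-holidays package, used for Brazilian holiday dates

br_holidays = holidays.Brazil()

def extract_features(transactions: pd.DataFrame) -> pd.DataFrame:
    """Aggregate each customer's transaction history into one feature row."""
    df = transactions.copy()
    df["purchase_date"] = pd.to_datetime(df["purchase_date"])
    # Flag purchases that fell on a Brazilian holiday.
    df["is_holiday"] = df["purchase_date"].dt.date.map(lambda d: d in br_holidays)

    # Descriptive statistics per customer, plus the holiday ratio.
    return df.groupby("card_id").agg(
        amount_mean=("purchase_amount", "mean"),
        amount_std=("purchase_amount", "std"),
        amount_max=("purchase_amount", "max"),
        n_transactions=("purchase_amount", "size"),
        n_merchants=("merchant_id", "nunique"),
        holiday_ratio=("is_holiday", "mean"),
    ).reset_index()
```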
To predict the loyalty scores, we trained a Light Gradient Boosting Machine (LightGBM) regression model. Using 7-fold cross-validation, we achieved good performance, but not good enough. To improve it, we annotated outliers in the dataset (easily recognizable as instances with a target value below -30) and trained a second LightGBM model without their interference. Using a tuned threshold, we then replaced some of this model's predictions with the values from the original model, trained with the outliers. This blend performed significantly better, and our rank climbed in the second part of the competition, which is scored on a private test dataset, while more heavily overfit models failed.
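The scheme can be sketched roughly as follows, assuming X, y, and X_test are NumPy arrays of engineered features and targets; the parameter values, the use of model A's own prediction as the outlier signal, and the threshold of -10 are all illustrative stand-ins, not our tuned settings:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold

OUTLIER_TARGET = -30  # instances with target below -30 are outliers

def cv_predict(X, y, X_test, params, n_splits=7):
    """Average test predictions over a 7-fold cross-validation."""
    preds = np.zeros(len(X_test))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, _ in kf.split(X):
        model = lgb.LGBMRegressor(**params)
        model.fit(X[train_idx], y[train_idx])
        preds += model.predict(X_test) / n_splits
    return preds

params = {"n_estimators": 500, "learning_rate": 0.05}  # illustrative only

# Model A: trained on everything, including the outliers.
pred_full = cv_predict(X, y, X_test, params)

# Model B: trained with the outliers removed.
mask = y > OUTLIER_TARGET
pred_clean = cv_predict(X[mask], y[mask], X_test, params)

# Where model A itself predicts a very low value, the row is likely an
# outlier, so we keep model A's prediction; otherwise we use model B's.
threshold = -10.0  # tuned on validation data in practice
final = np.where(pred_full < threshold, pred_full, pred_clean)
```

The design intuition is that the outlier-free model is more accurate on typical customers, while the model trained with outliers is the only one able to predict the extreme values, so the threshold decides which model gets to answer for each test row.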
We found there’s a lot to learn from Kaggle competitions. We had to tackle the problem as a team, brainstorm, and run our own experiments, but also read the discussion forums to see what other participants had discovered. We learned techniques and tricks that we will put to use in future projects and competitions, in which we surely plan to take part again.
Kaggle is the world’s largest community of Data Scientists and Machine Learning Engineers. Owned by Google LLC, the platform allows users to find and publish datasets, explore and build models in an online Data Science environment, participate in competitions, and collaborate and discuss with other professionals.