A One-Year Self-Reflection on scikit-learn

Disclosure: I used ChatGPT to proofread earlier drafts of this blog post.

It's been a year since I joined the scikit-learn team, so I decided to write a blog post to document some of my thoughts.

Why Did I Stay in scikit-learn?

I have had a great experience working with the scikit-learn community. When I first started, I knew very little about programming and software development. My lack of basic computer science and software engineering knowledge led to quite a few awkward situations when working with the maintainers.

The maintainers did not mind this and patiently mentored me, helping me understand the contribution workflow and the code base. This mentorship helped me learn new skills and boosted my confidence, which motivated me to take on more responsibilities in the community, such as reviewing pull requests and triaging issues.

Moreover, I realised that it is possible to transform challenges into situations where everyone wins, which I find extremely rewarding and fulfilling.

Furthermore, I have the opportunity to work with a fantastic group: the scikit-learn team.

What Did I Do?

I received the invitation to join the scikit-learn team in December 2024. Prior to this, I was already engaging with the community by authoring pull requests, reviewing other contributions, and participating in issue triaging.

Over the past year, I have contributed to documentation improvements, feature development, and community processes. I have also participated in reviewing pull requests and helping first-time contributors navigate the project. One of the features I later helped implement was temperature scaling for probability calibration, which will be available in scikit-learn 1.8 (see my previous blog post on it).

Some examples of my involvement before joining the team include:

scikit-learn/scikit-learn/27913: Added link to plot_adaboost_multiclass.py example

My first pull request for scikit-learn. Maren Westermann helped me navigate the codebase and understand the CI workflow, which gave me a solid foundation for later contributions. Even though I am now more experienced with the contributing workflow, I still revisit that PR from time to time to remind myself how to support first-time contributors.

scikit-learn/scikit-learn/29709: ENH add support for array API to various metrics

My first involvement with scikit-learn's array API project, which quickly became a mid-term goal for my work in the project.

scikit-learn/scikit-learn/30059: DOC fix back references to removed example

One of the first pull requests I reviewed for scikit-learn. Working with Guillaume Lemaitre and Charlie Xiao, I helped fix a bug on the scikit-learn website. From the issue being reported on the tracker, through analysing the root cause and opening a pull request with the fix, to merging it into the main branch, the entire process took less than an hour.

scikit-learn/scikit-learn/30076: Error on the scikit-learn algorithm cheat-sheet?

One of the first scikit-learn issues I helped triage.

Creating Pull Requests

I originally learned programming by working on online problem sets. When I was working on scikit-learn/scikit-learn/27913, I had no idea what linting was. I saw the GitHub Actions bot's warning and a large cross in the CI/CD workflow, but I did not know how to address the issue, or even whether I needed to. Maren spent a lot of time helping me navigate the codebase and understand the CI/CD messages, and the PR was eventually merged into the main branch. The merge boosted my confidence and motivated me to actively search for other issues where I could help. This led me to scikit-learn/scikit-learn/29709.

The issue reported that the input validation logic in both sklearn.metrics.root_mean_squared_log_error and sklearn.metrics.mean_squared_log_error was supposed to check whether the inputs lie inside the domain of \(y = \log(1 + x)\), that is, \(x > -1\). However, the implementation at that time was checking the domain of \(y = \log(x)\) instead. This turned out to be one of the relatively few issues in scikit-learn that I was able to solve. I commented on the issue thread to confirm the problem and volunteered to work on it. Adrin provided the first round of review. Because the fix was relatively straightforward, it did not take long for him to give the initial approval. He then asked me to add support for the array API to those functions.
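
To make the domain issue concrete, here is a minimal sketch in plain NumPy. It is not the scikit-learn implementation: the names check_log1p_domain and msle_sketch are hypothetical, and the error message is my own wording. The only point it illustrates is that \(\log(1 + x)\) is defined for every \(x > -1\), so inputs between -1 and 0 are valid and should not be rejected.

```python
import numpy as np


def check_log1p_domain(y_true, y_pred):
    """Hypothetical helper: reject inputs outside the domain of log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log(1 + x) is defined only for x > -1, so -1 itself is already invalid.
    if np.any(y_true <= -1) or np.any(y_pred <= -1):
        raise ValueError("inputs must be strictly greater than -1")
    return y_true, y_pred


def msle_sketch(y_true, y_pred):
    """Mean squared logarithmic error, using log1p(x) = log(1 + x)."""
    y_true, y_pred = check_log1p_domain(y_true, y_pred)
    diff = np.log1p(y_true) - np.log1p(y_pred)
    return float(np.mean(diff ** 2))
```

With this check, msle_sketch([-0.5], [0.0]) is accepted, while msle_sketch([-1.0], [0.0]) raises a ValueError. A check written against the domain of \(\log(x)\) would have turned away the first call as well.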

At that time, I had no idea what the array API was. From the merged PR, it seemed that the objective was simply to replace the NumPy alias np with a more abstract namespace xp. However, I did not understand what this change meant for the scikit-learn codebase or why it was an objective. I looked up the meta-issue scikit-learn/scikit-learn/26024, which helped a little. Fortunately, ChatGPT was available at the time, so I used it to ask a few questions and better understand what the array API project was about. Together with Thomas Fan's presentation "scikit-learn on GPUs with Array API" from PyData NYC 2023, this taught me the purpose of the array API project, and I immediately became interested because I found it meaningful and impactful.
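
For readers who are where I was back then, the np-to-xp idea can be sketched roughly as follows. This is a simplified illustration, not scikit-learn code: the function names are hypothetical, and passing xp as an explicit argument is my own shortcut (as far as I understand, the library works out the namespace from the input arrays instead). The sketch assumes only that xp provides asarray and mean, which are part of the array API standard.

```python
import numpy as np


def mean_squared_error_np(y_true, y_pred):
    # NumPy-only version: every call is hard-wired to the np namespace.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))


def mean_squared_error_xp(y_true, y_pred, xp=np):
    # Same computation, but every call goes through the namespace xp.
    # Any array API compatible namespace could be passed here; NumPy is
    # only the default so that the sketch runs without extra dependencies.
    y_true = xp.asarray(y_true)
    y_pred = xp.asarray(y_pred)
    return float(xp.mean((y_true - y_pred) ** 2))
```

With the default namespace, both functions give the same result on the same inputs. The appeal of the second form is that the body of the function no longer cares which array library, or which device, the data lives on.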

The array API project quickly became a mid-term goal for my work in scikit-learn, and I hope to see it completed. Under the mentorship of Adrin Jalali, Olivier Grisel, and Omar Salman, scikit-learn/scikit-learn/29709 was successfully merged into the main branch, and my subsequent PRs gradually improved.

Reviewing Pull Requests

I learned from Adrin, in the Code 4 Thought interview "scikit-learn: Software is People", that every scikit-learn PR requires two approvals. This helped me realise that contributors can support the project not only by writing code but also by reviewing and mentoring others.

At that time, I was still familiarising myself with the project standards and the contributing workflow, so I picked out some simpler PRs to review. Fortunately, Adrin was managing the meta-issue scikit-learn/scikit-learn/26927, which aimed to onboard first-time contributors. Having gone through the same process while working on scikit-learn/scikit-learn/27913, I was able to provide constructive feedback to other first-time contributors by modelling it on the feedback I had received from Maren and Adrin.

Another PR I helped review was scikit-learn/scikit-learn/30059. It was reported that a broken image appeared on the scikit-learn website because some examples had been removed in a previously merged PR. I worked with Guillaume Lemaitre and Charlie Xiao to resolve the issue, and everything from identifying the root cause to creating a fix and merging the PR was done within an hour.

This kind of collaboration gave me a strong sense of accomplishment and encouraged me to become more involved in the community, not only by creating PRs but also by reviewing them and participating in issue triaging. It also gave me opportunities to interact regularly with both the core maintainers and the wider community.

The Future

I would like to see the completion of array API support in scikit-learn, and I know the best way to help achieve this is to stay actively involved in the project. Thanks to the tremendous work of Olivier Grisel, Omar Salman, Tim Head, Lucy Liu, and many other contributors, the project is progressing rapidly. Many scikit-learn estimators and functions now support GPU-backed arrays, with additional support on the way.

I would also like to become more involved in the CI/CD processes of the project, as well as the public release workflow.

Conclusion

In the Code 4 Thought interview "scikit-learn: Software is People", Gael Varoquaux mentioned that he believes diversity of opinion leads to better software. scikit-learn demonstrates this principle well. When collaborating, the team consistently prioritises the long-term interests of the project (such as maintainability, numerical stability, and backward compatibility) over personal ambition or ego.

This is a quality I deeply admire. I am grateful that the team welcomed me and trusted me with greater responsibilities, and I look forward to continuing to contribute to the project and the community in the years ahead.