Disclosure: I used ChatGPT to proofread earlier drafts of this blog post.
It's been a year since I joined the scikit-learn team, so I decided to write a blog post to document some of my thoughts.
Why Did I Stay in scikit-learn?
I have had a great experience working with the scikit-learn community. When I first started, I knew very little about programming and software development. My lack of basic computer science and software engineering knowledge led to quite a few awkward situations when working with the maintainers.
The maintainers did not mind this and patiently mentored me, helping me understand the contribution workflow and the code base. This mentorship helped me learn new skills and boosted my confidence, which motivated me to take on more responsibilities in the community, such as reviewing pull requests and triaging issues.
Moreover, I realised that it is possible to transform challenges into situations where everyone wins, which I find extremely rewarding and fulfilling.
Furthermore, I have the opportunity to work with a fantastic group: the scikit-learn team.
What Did I Do?
I received the invitation to join the scikit-learn team in December 2024. Prior to this, I was already engaging with the community by authoring pull requests, reviewing other contributions, and participating in issue triaging.
Over the past year, I have contributed to documentation improvements, feature development, and community processes. I have also participated in reviewing pull requests and helping first-time contributors navigate the project. One of the features I later helped implement was temperature scaling for probability calibration, which will be available in scikit-learn 1.8 (see my previous blog post about it).
Some examples of my involvement before joining the team include:
- scikit-learn/scikit-learn/27913: Added link to plot_adaboost_multiclass.py example. My first pull request for scikit-learn. Maren Westermann helped me navigate the codebase and understand the CI workflow, which gave me a solid foundation for later contributions. Even though I am now more experienced with the contributing workflow, I still revisit that PR from time to time to remind myself how to support first-time contributors.
- scikit-learn/scikit-learn/29709: ENH add support for array API to various metrics. My first involvement with scikit-learn's array API project, which quickly became a mid-term goal for my work in the project.
- scikit-learn/scikit-learn/30059: DOC fix back references to removed example. One of the first pull requests I reviewed for scikit-learn. Together with Guillaume Lemaitre and Charlie Xiao, we fixed a bug on the scikit-learn website. From reporting the issue on the tracker, analysing the root cause, creating a patching pull request, and merging it into the main branch, the entire process took less than an hour.
- scikit-learn/scikit-learn/30076: Error on the scikit-learn algorithm cheat-sheet? One of the first scikit-learn issues I helped triage.
Creating Pull Requests
I originally learned programming by working on online problem sets. When I was working on scikit-learn/scikit-learn/27913, I had no idea what linting was. I saw the GitHub Actions bot's warning and a large cross in the CI/CD workflow, but I did not know how to address the issue, or even whether I needed to.
Maren spent a lot of time helping me navigate the codebase and understand the CI/CD messages, and the PR was eventually merged into the main branch.
The merge boosted my confidence and motivated me to actively search for other issues where I could help. This led me to scikit-learn/scikit-learn/29709.
It was reported that the input validation logic in both sklearn.metrics.root_mean_squared_log_error and sklearn.metrics.mean_squared_log_error was supposed to check whether the inputs lie inside the domain of \(y = \log(1 + x)\). However, the implementation at that time was checking \(y = \log(x)\) instead.
This turned out to be one of the relatively few issues in scikit-learn that I was able to solve. I commented on the issue thread to confirm the problem and volunteered to work on it.
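To make the domain difference concrete, here is a minimal sketch using only the Python standard library. It is not the scikit-learn implementation: the helper names `in_log_domain` and `in_log1p_domain` are hypothetical and exist only to show which inputs the two checks treat differently.

```python
import math


def in_log_domain(x: float) -> bool:
    """Hypothetical check: log(x) is defined only for x > 0."""
    return x > 0.0


def in_log1p_domain(x: float) -> bool:
    """Hypothetical check: log(1 + x) is defined only for x > -1."""
    return x > -1.0


samples = [-1.5, -1.0, -0.5, 0.0, 2.0]

# Values on which the two domain checks disagree.
disagreements = [x for x in samples if in_log_domain(x) != in_log1p_domain(x)]
print(disagreements)  # [-0.5, 0.0]

# log(1 + x) is perfectly well defined for x = -0.5, since 1 + x = 0.5 > 0.
print(math.log1p(-0.5))
```

Under these assumptions, any value in the interval (-1, 0] is valid for \(y = \log(1 + x)\) but would be rejected by a check written for \(y = \log(x)\).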
Adrin provided the first round of review. Because the fix was relatively straightforward, it did not take long for him to give the initial approval. He then asked me to add support for the array API to those functions.
At that time, I had no idea what the array API was. From the merged PR, it seemed that the objective was simply to replace the NumPy abbreviation np with a more abstract term xp. However, I did not understand what this change meant for the scikit-learn codebase or why it was an objective.
I looked up the meta-issue scikit-learn/scikit-learn/26024, which helped a little. Fortunately, ChatGPT was available at the time, so I used it to ask a few questions and better understand what the array API project was about.
Combined with Thomas Fan's presentation scikit-learn on GPUs with Array API from PyData NYC 2023, this helped me understand the purpose of the array API project, and I immediately became interested because I found it meaningful and impactful.
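For readers who are as puzzled by the np-to-xp change as I was, here is a minimal sketch of the idea. It is an illustration under stated assumptions, not scikit-learn's code: the helper `namespace_of` and both `rmsle_*` functions are hypothetical, and the example only uses NumPy arrays. The `__array_namespace__` method comes from the array API standard and is available on NumPy arrays from NumPy 2.0; the sketch falls back to NumPy for older versions.

```python
import numpy as np


def namespace_of(x):
    """Hypothetical helper: return the array namespace that x belongs to."""
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np  # assumption: older NumPy arrays without the method


def rmsle_np(y_true, y_pred):
    # Hard-wired to NumPy: every call goes through the np module.
    diff = np.log1p(y_true) - np.log1p(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def rmsle_xp(y_true, y_pred):
    # Same computation, but the functions come from whichever namespace
    # the input array reports, so other compliant libraries could work too.
    xp = namespace_of(y_true)
    diff = xp.log1p(y_true) - xp.log1p(y_pred)
    return float(xp.sqrt(xp.mean(diff ** 2)))


y_true = np.asarray([0.0, 1.0, 3.0])
y_pred = np.asarray([0.5, 1.0, 2.0])

print(rmsle_np(y_true, y_pred))
print(rmsle_xp(y_true, y_pred))
```

With NumPy inputs the two functions return the same value. The point of the xp version is that the function body no longer names a specific library, which is what allows arrays from other array API compatible libraries, including GPU-backed ones, to be passed through the same code path.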
The array API project quickly became a mid-term goal for my work in scikit-learn, and I hope to see it completed. Under the mentorship of Adrin Jalali, Olivier Grisel, and Omar Salman, scikit-learn/scikit-learn/29709 was successfully merged into the main branch, and my subsequent PRs gradually improved.
Reviewing Pull Requests
From Adrin's remarks in the Code 4 Thought interview scikit-learn: Software is People, I learned that every scikit-learn PR requires two approvals. This helped me realise that contributors can support the project not only by writing code but also by reviewing and mentoring others.
At that time, I was still familiarising myself with the project standards and the contributing workflow, so I picked some simpler PRs to review. Fortunately, Adrin was managing the meta-issue scikit-learn/scikit-learn/26927, which aimed to onboard first-time contributors.
Having gone through the same process while working on scikit-learn/scikit-learn/27913, I was able to provide constructive feedback to other first-time contributors by mimicking the feedback I had received from Maren and Adrin.
Another PR I helped review was scikit-learn/scikit-learn/30059. It was reported that a broken image appeared on the scikit-learn website because some examples had been removed in a previously merged PR. I worked with Guillaume Lemaitre and Charlie Xiao to resolve the issue: identifying the root cause, creating a fix, and merging the PR all took less than an hour.
This kind of collaboration created a strong sense of accomplishment and encouraged me to become more involved in the community, not only by creating PRs but also by reviewing them and participating in issue triaging. This created opportunities for me to interact regularly with both the core maintainers and the wider community.
The Future
I would like to see the completion of array API support in scikit-learn, and I know the best way to help achieve this is to stay actively involved in the project. Thanks to the tremendous work of Olivier Grisel, Omar Salman, Tim Head, Lucy Liu, and many other contributors, the project is progressing rapidly. Many scikit-learn estimators and functions now support GPU-backed arrays, with additional support on the way.
I would also like to become more involved in the CI/CD processes of the project, as well as the public release workflow.
Conclusion
In the Code 4 Thought interview scikit-learn: Software is People, Gael Varoquaux mentioned that he believes diversity of opinion leads to better software. scikit-learn demonstrates this principle well. When collaborating, the team consistently prioritises the long-term interests of the project—such as maintainability, numerical stability, and backward compatibility—over personal ambition or ego.
This is a quality I deeply admire. I am grateful that the team welcomed me and trusted me with greater responsibilities, and I look forward to continuing to contribute to the project and the community in the years ahead.