[pystatsmodels] Contributing to Statsmodels

bksahu

2018-12-10 14:14:19 UTC

Thank you for helping me out.

Hi there. I am a junior undergraduate in CS with some experience in Python
and a limited knowledge of statistics. I have used basic statsmodels
functions like OLS and really loved them. So, I would like to contribute
to this project. Apart from the docs, could you please give me some
pointers on where I should look to get started with statsmodels?

Hi,
Most issues right now are pretty difficult. We have a "Good as First PR"
label, but it is not well maintained, and it is difficult to guess what is
easy for contributors with different backgrounds.
https://github.com/statsmodels/statsmodels/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+as+First+PR%22
A very useful and not too difficult way to contribute would be to improve
unit test coverage.
Around 40% of our code is unit tests, but there are still gaps in unit
test coverage.
When I have to fix bugs, I often end up changing 1 to 5 lines of code and
adding 30 to 50 lines of unit tests.
We have two areas where unit tests are lacking.
One is in less used functions and options that are not verified against
other packages like R. This would be good to improve with some basic
knowledge of R or another statistics package.
The second is input validation. In many cases our code "assumes" that we
get the correct inputs from the user or caller. We support pandas Series and
DataFrames in general, but outside of the main models and functions we often
don't have unit tests for them. In some cases we handle pandas data but don't
take advantage of the additional information that it provides, e.g. names
and indices.
After browsing the unit test coverage for a bit, it looks better than I
remembered in terms of line coverage.
All main subdirectories have around 90% coverage (more on coveralls, less
on codecov):
https://coveralls.io/github/statsmodels/statsmodels
https://codecov.io/gh/statsmodels/statsmodels/tree/master/statsmodels
One currently untested/unsupported module is
statsmodels.stats.descriptivestats, which is a "fix or delete" case. The
main reason to keep it is if we can support things that pandas `describe`
does not support, but even before adding enhancements it needs cleanup and
unit tests.
https://github.com/statsmodels/statsmodels/issues/2630
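
As a rough illustration of "things that pandas `describe` does not support":
for numeric data `describe` reports count, mean, std, min, quartiles and max,
but not higher moments. The sketch below adds skewness and kurtosis on top of
it. The helper extended_describe is hypothetical and not part of statsmodels
or pandas, and skewness and kurtosis are only one assumed example of a
possible extension.

```python
import pandas as pd


def extended_describe(series):
    """Hypothetical helper: pandas describe output plus skew and kurtosis."""
    base = series.describe()
    extra = pd.Series({"skew": series.skew(), "kurtosis": series.kurt()})
    return pd.concat([base, extra])


if __name__ == "__main__":
    data = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0], name="x")
    print(extended_describe(data))
```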
Josef

Regards,
bksahu