About a quarter ago (April), I posted my first regular update on all of the various projects I'm working on. As side projects tend to go, some fall into and out of favor, and occasionally new ones crop up. As I develop these projects, I post regular updates, but it's helpful to me (and hopefully to some of you) to take an occasional 30,000 ft view of all of them in one place.
The vast majority of these projects are open source, and I'm trying to be better about actually including outside contributors, so if any of these projects pique your interest, please reach out, here or on GitHub.
First off, side projects.
Open Source Projects
My open source projects have also continued to plug along, with some new ones being added, and an increased effort in actually presenting them and getting help with them. You can check out a couple of presentations I've given to that end here and here.
To simplify things, I'll break up these projects into 3 categories:
- Actively Developing: these are projects that others and I are regularly committing new features to, and for which we're seeking more contributors and users.
- Maintenance Mode: these are stable projects without any new features planned, but they are actively maintained and will continue to be updated as required over time.
- Defunct / Toy Projects: these are projects that have fallen into disrepair or have been superseded by another project, or are just one-off toy projects not intended for real long-term use.
Actively Developing
- git-pandas: Still my favorite of the bunch, git-pandas continues to be developed. At this stage, most of the work has been moving towards a v2 release, which has two primary goals: clarity and performance. To that end, I've made two main additions: a unified glob-style syntax and a parallelized cumulative blame function. The next big step is going to be parallelizing at the Repo level. All project directories have loops that iterate over calls to their constituent repos, and these can be parallelized quite cleanly, speeding things up drastically. I'd love help with this, so if you're interested in high-performance numeric Python, joblib, and the like, reach out.
- twitter-pandas: Inspired by git-pandas, twitter-pandas is intended to provide a clean, clear interface to Twitter API data built around pandas DataFrames. There are three of us working on this seriously, and we've gotten most of the tweepy API replicated, so now it's on to the more interesting part: making it actually useful. This encompasses two main thrusts: making the inputs to the methods less complex, and reducing the number of methods to a smaller set of obvious functionality. We do this by picking out use cases and working through them with those two goals in mind, refactoring along the way. Look out for a blog post next week on how this was applied to the friendship methods to make it trivial to find the people you follow who don't follow you back.
- categorical_encoding: category_encoders started out as a pair of blog posts about the concept of encoding categorical variables, and eventually grew into a production-grade, pip-installable library that is compatible with scikit-learn. It's being used in production, and over the past quarter it's gotten some stability and consistency upgrades, better testing, better support for edge cases and missing values, and a bunch of other sort of boring improvements. It's now also available on conda as well as PyPI, so check it out. In the next quarter I'd like to get a good benchmark written, covering both performance (time, CPU, and memory) and quality of encoding. If you're a data scientist interested in that sort of thing, please reach out.
- DummyRDD: DummyRDD has continued to be extremely useful for unit testing pyspark-based software. We've got a bunch of the RDD functionality supported (no DataFrame or DataSet support yet). If you find it useful, let me know; if we're missing something that would help you, let me know; and if you want to help out, there's still a ton to do.
- petersburg: Totally just my personal little experiment, petersburg continues to get development. We support frequency and mixed-mode frequency/classification based estimation now, but I haven't really found a use case with open data where it is extremely useful. Still neat to hack on.
Maintenance Mode
- pypi-publisher: used in production a good bit, apparently (search "pypi-publisher" in the code on GitHub), and by me nearly daily. It does what it's supposed to and meets my needs perfectly, but could use support for things like bdist or conda. If anyone wants to work on that sort of thing, I'd welcome it, but I probably won't get to it personally any time soon.
- gitnoc: I use this about weekly. It's currently a bit slow, so it's on the back burner until the repo-level parallelism in git-pandas is done.
- cookiecutter-flask: people seem to be using this still pretty consistently, which is great. I haven't touched it in a while.
- pygeohash: this is used in production by at least a couple of companies, and is stable, fast, and well supported. I'm not sure what of value I could add to this other than continued maintenance, but if you have ideas, let me know.
- pyculiarity: this code is still kind of ugly, but it works and is being used in a few places. I'd really like to go through it at some point and simplify it a lot, but don't have plans to do so in the near term.
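As a rough illustration of the repo-level parallelism mentioned for git-pandas above (and which gitnoc is waiting on), here is a minimal sketch of fanning a per-repo call out across workers. This is not git-pandas code: `commit_count` and `project_commit_counts` are hypothetical names, the per-repo work is a placeholder, and it uses the standard library's `ThreadPoolExecutor` rather than joblib just to stay dependency-free.

```python
from concurrent.futures import ThreadPoolExecutor


def commit_count(repo_name):
    """Hypothetical stand-in for an expensive per-repo call.

    A real version would inspect a git repository; here the "count"
    is just the length of the name so the example runs anywhere.
    """
    return {"repo": repo_name, "commits": len(repo_name)}


def project_commit_counts(repo_names, max_workers=4):
    """Run the per-repo call for every repo in a project concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input names
        return list(pool.map(commit_count, repo_names))


results = project_commit_counts(["git-pandas", "gitnoc", "pygeohash"])
for row in results:
    print(row["repo"], row["commits"])
```

Because each repo is handled independently, the loop over repos is the natural place to parallelize; whether threads, processes, or joblib is the better fit would depend on how much of the per-repo work is I/O versus CPU.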
Defunct / Toy Projects
- RogerRoger: this is a little monitoring API written in Scala. It's mostly just so that I can learn some Scala and D3, so nothing to worry about at this stage. Later on it may be a neat little utility, but don't hold your breath.
- flink-python-examples: I think this has finally fallen out of sync with flink, so I'm not sure any of the examples still work. I'm not using flink for anything, so I haven't had a free day to devote to updating it.
- incomprehensible: still kinda fun, but it has served its purpose.
- sklearn-extensions: I think there should be a single reference for scikit-learn style third-party packages, but I now think that it may be better off as a website or something like that than as a package. It's a lot of work and a licensing nightmare to keep it all in one package. To that end, scikit-learn-contrib is interesting, so check that out.