Dummy-Spark Release: a pure-Python mock of Spark for testing

In a previous post, I mentioned a little project to provide a pure-Python mock of Apache Spark's RDD object for testing and quick prototyping. Thanks to some help from contributors, we've made a bit of progress, and a good portion of the RDD API is now supported, including using newAPIHadoopRDD with elasticsearch-hadoop, and pulling files from S3.

I've just published the v0.0.2 release, which can be installed with:

pip install dummyrdd==0.0.2

And used like:

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)
rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.count())
print(rdd.map(lambda x: x**2).collect())
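
Since the project also supports pulling files from S3, something like the following should work as well. This is a rough sketch: it assumes the mock's textFile mirrors PySpark's signature and accepts s3n:// paths, that AWS credentials are available in the environment, and the bucket and key names here are hypothetical:

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

# hypothetical bucket/key; assumes textFile accepts s3n:// URLs as in PySpark
lines = sc.textFile('s3n://my-bucket/path/to/file.txt')
print(lines.count())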

In the new release, we've added two small bits of functionality:

  • newAPIHadoopRDD support for elasticsearch-hadoop queries, mocked using elasticsearch-py. The functionality and returned format should be one-to-one with the real thing, for testing out pyspark programs that query ES into RDDs (see the sketch after this list).
  • repartition() implemented for all RDDs
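
As a rough illustration of the elasticsearch-hadoop support, here's what a read against the mock might look like, assuming it mirrors PySpark's newAPIHadoopRDD signature and the usual elasticsearch-hadoop conf keys; the host, port, index, and query below are hypothetical:

from dummy_spark import SparkContext, SparkConf

sc = SparkContext(master='', conf=SparkConf())

# standard elasticsearch-hadoop read conf; values here are placeholders
es_conf = {
    'es.nodes': 'localhost',
    'es.port': '9200',
    'es.resource': 'my_index/my_type',
    'es.query': '{"query": {"match_all": {}}}',
}

rdd = sc.newAPIHadoopRDD(
    inputFormatClass='org.elasticsearch.hadoop.mr.EsInputFormat',
    keyClass='org.apache.hadoop.io.NullWritable',
    valueClass='org.elasticsearch.hadoop.mr.LinkedMapWritable',
    conf=es_conf,
)

# repartition() is also now supported on all RDDs
rdd = rdd.repartition(4)
print(rdd.count())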

These are in addition to the long list of implemented methods, which can be found in the README on GitHub:

https://github.com/wdm0006/DummyRDD

Will

Will has a background in Mechanical Engineering from Auburn, but mostly just writes software now. He was the first employee at Predikto, where, as Chief Scientist, he is currently building out the premier platform for predictive maintenance in heavy industry. When not working on that, he is generally working on something related to Python, data science, or cycling.
