In a previous post, I mentioned a small project that provides a pure-Python mock of Apache Spark's RDD object for testing and quick prototyping. Thanks to help from contributors, a good portion of the RDD API is now supported, including the new Hadoop API (newAPIHadoopRDD in PySpark) with elasticsearch-hadoop, and pulling files from S3.
I've just published the v0.0.2 release, which can be installed as:
pip install dummyrdd==0.0.2
And used like:
from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.count())
print(rdd.map(lambda x: x**2).collect())
In the new release, we've added two small bits of functionality:
- New Hadoop API support for elasticsearch-hadoop functions, mocked using elasticsearch-py. The functionality and returned format should match 1-to-1, which makes it useful for testing pyspark programs that query ES into RDDs.
- repartition implemented for all RDDs
These are in addition to the long list of implemented methods, which can be found in the README on GitHub.
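To make the repartition item above a little more concrete, here is a minimal sketch of what repartitioning can mean when the data lives in plain Python lists. This is not dummyrdd's actual implementation, and real Spark shuffles data across a cluster rather than dealing it out in order. The helper names `split_into_partitions` and `repartition` are hypothetical and exist only for this illustration.

```python
from typing import Any, List


def split_into_partitions(items: List[Any], num_partitions: int) -> List[List[Any]]:
    """Deal items round-robin into num_partitions lists (illustrative only)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions: List[List[Any]] = [[] for _ in range(num_partitions)]
    for index, item in enumerate(items):
        partitions[index % num_partitions].append(item)
    return partitions


def repartition(partitions: List[List[Any]], num_partitions: int) -> List[List[Any]]:
    """Flatten the existing partitions and redistribute them.

    Assumption: a list-backed mock has no cluster to shuffle across,
    so redistributing the flattened data is enough for this sketch.
    """
    flat = [item for part in partitions for item in part]
    return split_into_partitions(flat, num_partitions)


# Start with two partitions, then move to three.
parts = split_into_partitions([1, 2, 3, 4, 5], 2)
reparts = repartition(parts, 3)
print(parts)
print(reparts)
```

The elements themselves are unchanged by repartitioning; only how they are grouped differs, which is why code under test should not rely on a particular element order afterwards.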