Tuesday, November 8, 2016

Bypassing Box Upload Limits by API

Box for large files

Box offers 10 GB of online storage for free, double what anyone else offers, with a maximum individual file size of 2 GB, but you can only upload files up to 250 MB. So how do you upload that 2 GB file? With the Box API, that's how, either with regular ol' requests or their fancy schmancy SDK. First follow the Getting Started instructions, sign up for a developer account and create a temporary key. Then in Python, try this out:

# import the requests package
import requests

# copy your token here
TOKEN = "<your developer token>"

# try to get the top level folder, id: "0", using this command exactly as below:
r = requests.get(url='https://api.box.com/2.0/folders/0',
                 headers={'Authorization': 'Bearer %s' % TOKEN})

# check the response
r
#  <Response [200]>
# success!

# get the output
r.json()
# lots of stuff

# upload a file, using the commands exactly as below, except put the actual id number
# of the desired folder
PAYLOAD = {'attributes': '{"name":"myfile", "parent":{"id":"<id # of desired folder>"}}'}
# open the file in a with-block so the handle is closed once the upload finishes
with open('path/to/myfile', 'rb') as f:
    r = requests.post(url='https://upload.box.com/api/2.0/files/content',
                      headers={'Authorization': 'Bearer %s' % TOKEN},
                      files={'file': f},
                      data=PAYLOAD)

# check the response
r
#  <Response [201]>
# success!
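
Hand-writing the attributes JSON string gets fiddly once a file name contains quotes or other special characters. Here is a minimal sketch that builds the same payload with the standard json module instead. The helper name build_upload_payload and the example file name are my own inventions for illustration, not part of the Box API.

```python
import json


def build_upload_payload(name, folder_id="0"):
    """Build the form data for an upload; folder id "0" is the top level folder."""
    # hypothetical helper: same structure as the hand-written string above
    attributes = {"name": name, "parent": {"id": str(folder_id)}}
    # json.dumps takes care of escaping quotes and other special characters
    return {"attributes": json.dumps(attributes)}


# example file name with embedded quotes, made up for this sketch
PAYLOAD = build_upload_payload('my "big" file.zip', folder_id=0)
```

The resulting dictionary can be passed as the data argument of requests.post in place of the hand-written PAYLOAD.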

References

Check the online Content API reference for full documentation.

Monday, November 7, 2016

Panda Pop

Pandas Offset Aliases

Memorize this table - or just bookmark this link: Pandas Offset Aliases

Offset Aliases

A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as offset aliases (referred to as time rules prior to v0.8.0).

Alias    Description
B        business day frequency
C        custom business day frequency (experimental)
D        calendar day frequency
W        weekly frequency
M        month end frequency
SM       semi-month end frequency (15th and end of month)
BM       business month end frequency
CBM      custom business month end frequency
MS       month start frequency
SMS      semi-month start frequency (1st and 15th)
BMS      business month start frequency
CBMS     custom business month start frequency
Q        quarter end frequency
BQ       business quarter end frequency
QS       quarter start frequency
BQS      business quarter start frequency
A        year end frequency
BA       business year end frequency
AS       year start frequency
BAS      business year start frequency
BH       business hour frequency
H        hourly frequency
T, min   minutely frequency
S        secondly frequency
L, ms    milliseconds
U, us    microseconds
N        nanoseconds
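
To see a few of these aliases in action, here is a short sketch that passes them as the freq argument of pd.date_range and to resample. The dates and variable names are made up for illustration. Newer pandas releases have renamed some of the aliases in this table, so check the docs for your version if one of them raises a warning.

```python
import pandas as pd

# "MS" = month start: the first day of three consecutive months
month_starts = pd.date_range("2016-11-01", periods=3, freq="MS")

# "B" = business day: 2016-11-04 is a Friday, so the weekend is skipped
business_days = pd.date_range("2016-11-04", periods=3, freq="B")

# "D" = calendar day: two weeks of daily values starting on a Tuesday
daily = pd.Series(1.0, index=pd.date_range("2016-11-01", periods=14, freq="D"))

# "W" = weekly: resample the daily values into weeks ending on Sunday
weekly = daily.resample("W").sum()
```

Here weekly counts 6, 7 and 1 days, because the first and last weeks are only partly covered by the daily data.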

Tuesday, November 1, 2016

robotic releases

Basic Auto-Versioning from Git

If you're using the winning workflow and the recommended Python project layout then you've set up a CI server to build releases when you tag them in Git, and you set your version in the __init__.py file of your package. But, "Oh, No!" you did it again. You created the Git tag, but forgot to update your code's __version__ string.

Okay, there is a Python package called Versioneer that handles this for you, and it's pretty awesome. But it turns out it's also pretty easy to roll your own, especially if you're just using Git, because Python has a Git implementation called Dulwich that can do this in just a few lines. Maybe it will get integrated into a future version of Dulwich - I've submitted a PR (#462). Anyway, for now, the easiest way to use this is to copy this file into your package at the top level, import it and then set:

__version__ = get_recent_tags()[0][0][1:]

assuming your tags all start with the letter "v" as in "v0.3". Enjoy!

Monday, October 31, 2016

Carousel Cotton Candy

Version 0.3

I'm super excited to announce the Cotton Candy release of Carousel, version 0.3 on PyPI and GitHub. There were a few more issues I really wanted to close with this release, but I decided to push it forward anyway. So the remaining milestones for v0.3 will get pushed to v0.3.1.

  • issue #62 use Meta class for all layers - currently usage is spotty and inconsistent. I wanted to keep the number of commits to close PR #68 to a minimum (following the winning workflow) so I only implemented Meta classes where I had to. In fact it's not even implemented in the DataSource example below.
  • issue #25 move all folders into project package - this is already how I have set up the PVPower demo. It just makes more sense with new style models to have them all in the same package.
  • issue #63 and issue #22 split calculations into separate parameters - I knew this couldn't be done by v0.3, it was a stretch goal, but I'm super excited about this. By moving dependencies to an attribute of each parameter, the DAG shows which calculations are orthogonal, so we can run them simultaneously. According to issue #22, I had this idea already, but when I saw this presentation at PyData SF 2016 on Airflow by Matt Davis I was even more motivated to make it happen.
  • issue #59 and issue #73, which don't seem related, but both have to do with implementing a Calculator class whose job is to crunch through the calculations, somewhat similar to what DataReaders and FormulaImporter do for their layers. Then the boilerplate uncertainty propagation code in the static class could be applied to any calculator, such as a dynamic calculator, a linear system solver for an acyclic DAG of linear equations or a non-linear system solver for an acyclic DAG of non-linear equations.
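
To illustrate the idea behind issues #63 and #22, here is a minimal sketch of how a dependency DAG reveals which calculations could run at the same time. This is not Carousel code. The function parallel_batches and the calculation names are hypothetical, and the only assumption is that each calculation lists the names of the calculations it depends on.

```python
def parallel_batches(dependencies):
    """Group calculations into batches that only depend on earlier batches."""
    remaining = {name: set(deps) for name, deps in dependencies.items()}
    done = set()
    batches = []
    while remaining:
        # a calculation is ready when everything it depends on is finished
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or missing dependency")
        batches.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return batches


# made-up calculations, each mapped to the calculations it depends on
CALCS = {
    "solar_position": [],
    "air_mass": ["solar_position"],
    "cell_temperature": [],
    "power": ["air_mass", "cell_temperature"],
}

BATCHES = parallel_batches(CALCS)
# [['cell_temperature', 'solar_position'], ['air_mass'], ['power']]
```

Everything inside one batch is independent of the rest of that batch, so those calculations are the candidates for running simultaneously.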

The Parameter class

What's new in Carousel-0.3 (Cotton Candy)? The biggest difference is the introduction of the Parameter class which is now used to specify parameters for data, formulas, outputs, calculations and simulation settings. For example, previously data parameters would be entered as a dictionary of attributes.

Bicycle Bears

class PVPowerData(DataSource):
    """
    Data sources for PV Power demo.
    """
    data_reader = ArgumentReader
    latitude = {"units": "degrees", "uncertainty": 1.0}
    longitude = {"units": "degrees", "uncertainty": 1.0}
    elevation = {"units": "meters", "uncertainty": 1.0}

Cotton Candy

class PVPowerData(DataSource):
    """
    Data sources for PV Power demo.
    """
    latitude = DataParameter(units="degrees", uncertainty=1.0)
    longitude = DataParameter(units="degrees", uncertainty=1.0)
    elevation = DataParameter(units="meters", uncertainty=1.0)

    class Meta:
        data_reader = ArgumentReader

Why the change? The Bicycle Bear version did not have any way to distinguish parameters, like latitude, from attributes of the DataSource, like data_reader. This had two unfortunate side-effects:

  • each layer attribute had to be hardcoded in the base metaclass so that they wouldn't be misinterpreted as parameters
  • and users could not define any custom class attributes, because they would be misinterpreted as parameters and stripped from the class by the metaclass.

The Cotton Candy version makes it easy for the metaclass to determine which class attributes are parameters, which are attributes of the layer and then leaves everything else alone. Every layer now has a corresponding Parameter subclass which also defines some base attributes corresponding to that layer. Any extra parameter attributes are saved in extras. Attributes that apply to the entire layer are now specified in the `Meta` class attribute, similar to Django, Marshmallow and DRF. The similarities are completely intentional as I have been strongly inspired by those projects' code bases. Unfortunately, the Meta class is only partially implemented, but will be the major focus of v0.3.1.
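
For anyone curious how a metaclass can tell these things apart, here is a stripped-down sketch of the general technique. It is not Carousel's actual implementation. The names Parameter, LayerMeta, Layer, parameters and layer_options are all stand-ins I made up for illustration.

```python
class Parameter(dict):
    """Hypothetical stand-in for a parameter: just a bag of attributes."""

    def __init__(self, **attributes):
        super().__init__(attributes)


class LayerMeta(type):
    """Sort class attributes into parameters and layer-wide Meta options."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # anything that is a Parameter instance is a parameter
        cls.parameters = {key: value for key, value in namespace.items()
                          if isinstance(value, Parameter)}
        # public attributes of the nested Meta class apply to the whole layer
        meta = namespace.get("Meta")
        cls.layer_options = {} if meta is None else {
            key: value for key, value in vars(meta).items()
            if not key.startswith("_")}
        # every other class attribute is left alone
        return cls


class Layer(metaclass=LayerMeta):
    """Hypothetical base class for layers."""


class PVPowerData(Layer):
    latitude = Parameter(units="degrees", uncertainty=1.0)
    elevation = Parameter(units="meters", uncertainty=1.0)
    notes = "a custom class attribute"  # not a parameter, so untouched

    class Meta:
        data_reader = "ArgumentReader"  # a string stands in for the reader class
```

Because parameters are recognized by their type, nothing has to be hardcoded in the metaclass and custom attributes like notes survive.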

Separated Formulas

The Formula class is also improved. Now each formula is a Parameter with attributes, rather than a giant dictionary. This improvement is still on the roadmap for the Calculation class. As I said above, it was a stretch goal for this release.

Wednesday, October 12, 2016

Winning Workflow

Intro

There are many blog posts on the topic of effective Git workflows, SO questions and answers, BitBucket tutorials and GitHub guides and an article in the BBC. So why another post on git workflow? None of these workflows seemed right for us, but recently it's just clicked, and I feel like we've finally found the process that works for us. The key was finding the simplest workflow that included the most valuable best practices. In particular, we found that complicated multi-branch strategies were unnecessary, but test driven development (TDD) and continuous integration (CI) were a must.

Setting up Remotes

We start with the assumption that all collaborators fork the upstream repository to their personal profile. Then each person clones their fork to their laptop as origin and adds another remote pointing to the upstream repository. For convenience, they may also create remotes to the forks of their most frequent collaborators.

[myusername@mycomputer ~/Projects]
$ git clone git@github.com:myusername/myrepo.git
[myusername@mycomputer ~/Projects]
$ cd myrepo
[myusername@mycomputer ~/Projects/myrepo]
$ git remote add upstream git@github.com:mycompany/myrepo.git
[myusername@mycomputer ~/Projects/myrepo]
$ git remote add mycollaborator git@github.com:mycollaborator/myrepo.git
[myusername@mycomputer ~/Projects/myrepo]
$ git remote show
  origin
  upstream
  mycollaborator

Ground Rules

The next assumption is that we all keep our version of master synchronized with upstream master. And we never work out of our own master branch! Basically this means at the start of any new work we do the following:

  1. I like to do git fetch --all to get the lay of the land. This combined with
    git log --all --graph --date=short --pretty=format:"%ad %h %s%d [%an]"
    lets me know what everyone is working on, assuming that I've made remotes to their forks.
  2. Then I pull from upstream master to get the latest nightly or release,
  3. and push to origin master to keep my fork current.
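
If you'd rather script these three steps than type them, here is a rough sketch using subprocess. It assumes git is on your path and that the remotes are named origin and upstream as set up above. The --ff-only flag is my own addition as a safeguard, since a master that is never worked on should always fast-forward. The names SYNC_COMMANDS and sync_master are made up for this example.

```python
import subprocess

# the steps above as git commands, in order
SYNC_COMMANDS = [
    ["git", "fetch", "--all"],                           # 1. lay of the land
    ["git", "checkout", "master"],                       # make sure we're on master
    ["git", "pull", "--ff-only", "upstream", "master"],  # 2. latest from upstream
    ["git", "push", "origin", "master"],                 # 3. keep my fork current
]


def sync_master(run=subprocess.check_call):
    """Run each sync command in turn, stopping at the first failure."""
    for command in SYNC_COMMANDS:
        run(command)
```

Calling sync_master() from inside the repository runs the commands for real. Passing a different run callable, such as print, shows them without touching anything.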

Recommended Project Layout

I'm also going to assume that everyone is following the recommended project layout. This means that their project has all dependencies listed in requirements.txt, is developed and deployed in its own virtual environment, includes testing and documentation that aims for >80% coverage, has a boilerplate design that allows testing, documentation and package data to be bundled into a distribution and enables use with a test runner with self discovery, and is written with docstrings for autodocumentation. Nothing is ever perfect, so being vigilant about path clashes, aware of the arcana of Mac OS X [1] or Windows [2] and able to use Stack Overflow to find answers is still important.

Branching, Testing, Pull Requests and Collaboration

  1. Now I switch to a new feature branch with a meaningful name - I'll delete this branch everywhere later so it can be verbose.
  2. The very first code I write is a test or two that demonstrates more or less exactly what we want the feature or bug fix to do. This is one of the most valuable steps because it clearly defines the acceptance criteria. Although it's also important to be thoughtful and flexible - just because your tests pass doesn't necessarily mean the feature is implemented as intended. Some new tests or adjustments may be needed along the way.
  3. Now, before I write any more code, is when I submit a pull request (PR) from my fork's feature branch to upstream/master. So many people are surprised by this. Many collaborators have told me they thought that PRs should be submitted after their work is complete and passing all tests. But in my opinion that defeats the entire point of collaborating on a short iteration cycle.
    • If you wait until the end to submit your work you risk diverging from the feature's intended goals especially if the feature's requirements shift or you've misinterpreted the goals even slightly.
    • Waiting also means you're missing out on collaborating with your teammates and soliciting their feedback mid-project.
    On the other hand, submitting your PR right after you write your tests means:
    • Every push to your fork will trigger a build that runs your tests.
    • Your teammates will get continuous updates so they can monitor your progress in real-time but also on their time so you won't have to hold a formal review, since collaborators can review your work anytime as the commits will all be queued in the PR.
    I think the reason people wait until the end to submit PRs is the same reason they like to write tests at the end. I used to hate seeing my tests fail because it made me feel like I was failing. I think people delay submitting their PRs because they're nervous about having incomplete work reviewed out of context and receiving unfair criticism or harsh judgment. IMO, punitive behavior is dysfunctional and a collaboration killer and should be rooted out with a frank discussion about what mutual success looks like. I also think some people aren't natural collaborators and don't want others interfering with their work. Again, a constructive discussion can help promote new habits, although don't expect people to change overnight. You can take a hard stance on punitive behavior but you can't expect an introvert to feel comfortable sharing themselves freely without some accommodations.
  4. Now comes the really fun part. We hack and collaborate until the tests all pass. But we don't have too much fun - there should be at most 10 commits before we realize we've embarked on an epic that needs to be re-organized, otherwise the PR will become difficult to merge. That will sap morale and waste time. So keep it simple.
  5. The final bit of collaboration is the code review and merging the PR into upstream master. This is fairly easy, since there are
    • already tests that demonstrate what the code should do,
    • only a few commits,
    • and all of the collaborators have been following the commits as they've been queuing in the PR.
    So really the review and merge is a sanity check. Do these tests really demonstrate the feature as intended? Anything else major would have stood out already.
  6. Whoever the repository owner or maintainer is should add the tag and push it to upstream. This triggers the CI to test, build and deploy a new release.

Continuous Integration

This is key. Set up Travis, Circle, AppVeyor or Jenkins on upstream master to test and build every commit, every commit to an open PR and to deploy on every tag. Easy!

Wrapping Up

There are some features of this style that stand out:

  • There is only one master branch. Using CI to deploy only on tags eliminates our need for a dev or staging branch because any commits on master not tagged are the equivalent of the bleeding edge.
  • This method depends heavily on an online hosted Git repo like GitHub or BitBucket, use of TDD, strong collaboration and a CI server like Travis.

Happy Coding!


footnotes

  1. On Mac OS X matplotlib will not work in a virtual environment unless a framework interpreter is used. The easiest way to do this is to run python as PYTHONHOME=/home/you/path/to/project/venv/ python instead of using source venv/bin/activate.
  2. On Windows pip often creates an executable for scripts that is bound to the Python interpreter it was installed with. If the virtual environment was created with system site packages or if the package is not installed in the virtual environment then you may get a confusing path clash. For example, running the nosetests script will use your system Python and therefore the Python path will not include your virtual environment. The solution is to never use system site packages and install all dependencies directly in your virtual environment.

Tuesday, July 19, 2016

Derived Django Database Field

The trick to this is creating a custom field and overloading pre_save. Pay special attention to the self.attname member that is set to the value. The source for DateField is a good example. Make sure that if you add any new attributes to the field in its __init__ method you also add a corresponding deconstruct method.

Monday, July 18, 2016

Mocking Django App

I'm sure this is completely wrong. I needed a Django model for testing, but I don't have a Django app or even a Django project. I'm developing a Django model reader for Carousel, and so I needed a model to test it out with. Sure, I could have created a quick Django project, but that seemed silly, and my first instinct was to import django.db.models, make a model and use it, but this raised:

ImproperlyConfigured: Requested setting DEFAULT_INDEX_TABLESPACE, but settings are not
                      configured. You must either define the environment variable
                      DJANGO_SETTINGS_MODULE or call settings.configure() before accessing
                      settings.

Most normal people would turn back now, but instead I imported django.conf.settings and called settings.configure() just like it said to do. Now I got this error:

AppRegistryNotReady: Apps aren't loaded yet.

So now I felt like I was getting somewhere. But where? Googling told me to import django and run setup which I did and that raised:

RuntimeError: Model class __main__.MyModel doesn't declare an explicit app_label and isn't
              in an application in INSTALLED_APPS.

Wow! Normally RuntimeError is a scary warning, like you dumped your core, but this just said I needed to add the app to settings.INSTALLED_APPS, which makes perfect sense, and it also complained that my model wasn't actually part of an app and even explained how to explicitly declare it. Some more Googling and I discovered that app_label is a model option that can be set in class Meta. So I did as told, and it worked!

from django.db import models
from django.conf import settings
import django

MYAPP = 'myapp.MyApp'
settings.configure()
django.setup()
settings.INSTALLED_APPS.append(MYAPP)


class MyModel(models.Model):
    air_temp = models.FloatField()
    latitude = models.FloatField()
    longitude = models.FloatField()
    timezone = models.FloatField()
    pvmodule = models.CharField(max_length=20)

    class Meta:
        app_label = MYAPP


mymodel = MyModel(air_temp=25.0, latitude=38.0, longitude=-122.0,
                  timezone=-8.0, pvmodule='SPR E20-327')

mymodel.__dict__
#{'_state': <django.db.models.base.ModelState at 0x496b2b0>,
# 'air_temp': 25.0,
# 'id': None,
# 'latitude': 38.0,
# 'longitude': -122.0,
# 'pvmodule': 'SPR E20-327',
# 'timezone': -8.0}

Caveats

So I should stop here and point out that evidently the order of these commands matters, because if I add the fake app to INSTALLED_APPS before calling django.setup() then I get this:

ImportError: No module named myapp

And unfortunately, I just figured this out now, in this post. But this isn't what I originally did. Yes, I'm completely crazy. First I added a fake module called 'myapp' to sys.modules setting it to a mock object, but that didn't work. I got back TypeError: 'Mock' object is not iterable because, as I found out later, there has to be an AppConfig subclass in the app module. But since I didn't know that yet, I did the only logical thing and put the module in a list. What? Yes, did I mention I'm an idiot? This nonsense yielded the following stern warning:

ImproperlyConfigured: The app module [] has no filesystem location, you
                      must configure this app with an AppConfig subclass with a 'path' class
                      attribute.

But this is where I found out about AppConfig, which is covered quite nicely in the Django docs. Following the nice directions, I did as told and subclassed AppConfig, added path and also name, which I learned about from the docs, monkeypatched my mock module with it, and now used the dotted name of the app, myapp.MyApp. I felt like I was getting closer, since I only got: AttributeError: __name__ which seemed like a problem with my pretend module. Another monkeypatch and we have my final ludicrously ridiculous hack.

from django.db import models
from django.conf import settings
import django
from django.apps import AppConfig
import sys
import mock

class MyApp(AppConfig):
    """
    Apps subclass ``AppConfig`` and define ``name`` and ``path``
    """
    path = '.'  # path to app
    name = 'myapp'  # name of app


# make a mock module with ``__name__`` and ``MyApp`` member
myapp_module = mock.Mock(__name__='myapp', MyApp=MyApp)
MYAPP = 'myapp.MyApp'  # full path to app
sys.modules['myapp'] = myapp_module  # register module
settings.configure()
settings.INSTALLED_APPS.append(MYAPP)
django.setup()


class MyModel(models.Model):
    air_temp = models.FloatField()
    latitude = models.FloatField()
    longitude = models.FloatField()
    timezone = models.FloatField()
    pvmodule = models.CharField(max_length=20)

    class Meta:
        app_label = MYAPP


mymodel = MyModel(air_temp=25.0, latitude=38.0, longitude=-122.0,
                  timezone=-8.0, pvmodule='SPR E20-327')

mymodel.__dict__
#{'_state': <django.db.models.base.ModelState at 0x496b2b0>,
# 'air_temp': 25.0,
# 'id': None,
# 'latitude': 38.0,
# 'longitude': -122.0,
# 'pvmodule': 'SPR E20-327',
# 'timezone': -8.0}

Yay?
