Synchronizing a local .tsv file to a google docs spreadsheet

How to share your hard-earned locally obtained/maintained data with knobheads.

Every once in a while one of those knobheads who don’t use org-mode (=rest of the world) require access to your data that’s beyond the occasional one-off export.

These people, being goal driven and sociable, tend to be a real nuisance. They would call & IM you, destroying what’s left of your precious attention span, and incur the wraths of the attention context switch upon you.

Realistically, you’ve got two options - switch to using something like google docs, thus cementing your existsance as no more than a slave to the corporate plankton. Or - you could come up with a way to push data from where it’s convenient for you to edit/generate/obtain/work it and over to their enterprisey la-la land.

Enter sheet_sync.py, a proof of concept for one way synchronization of tabular data in .TSV format locally, to a google docs spreadsheet.

Tested & found to be working on a sheet of > 400 rows.

There were a few points of pain while implementing this technological marvel:

  1. OAuth authentication. In order to access the sheet, you need a google service account, one you could create at the API console. Once you do, don’t forget to save the credentials .json locally (auth.json) & give that account permissions to whatever sheet you’re about to upload to.
  2. Rate limiting. Google’s pretty aggressive about that. Therefore the @rate_limited decorator to prevent you from shooting yourself in the foot.
  3. Synchronization logic. The approach chosen for this baby is by matching rows by composite keys. A composite key, in our case, is when a row matches by multiple specified columns. When extra rows are encountered - they are mercilessly deleted.

 You, dear reader, are more than welcome to use this as foundation for your own instruments of knobhead empowerment.

P.S - Be careful. If the sheet is re-sorted, filtered or otherwise manipulated during data sync, shit hits the fan. This is, for the most part, not meant as a tool for two-way communication.

module.exports = "#!/usr/bin/env python\n\n\"\"\"\nsynchronize data from a local file to a google spreadsheet, where rows are located according to keys.\nachtung! lots of case-specific bits (marked with FIXME). use with caution.\ntested on python 3.6.6\n\"\"\"\nimport gspread,json,sys\n\nimport threading,time # for rate limiting\nfrom functools import wraps\nfrom random import shuffle\n\nshid = sys.argv[2] # google sheet to write to\n\ncfn='auth.json'\n\nfrom oauth2client.service_account import ServiceAccountCredentials\n\nscope = ['https://spreadsheets.google.com/feeds',\n         'https://www.googleapis.com/auth/drive']\n\ndo_not_update = ['notes'] # FIXME: case-specific. the list of columns not to update on the destination sheet.\n\ncredentials = ServiceAccountCredentials.from_json_keyfile_name(cfn, scope)\n\ngc = gspread.authorize(credentials)\nsh = gc.open_by_key(shid)\nworksheet = sh.sheet1\nprint('spreadsheet',sh.id)\nmakeheader=True\nfrom fabfile import sslheader as sheetheader # FIXME: case-specific. this is a list of all columns\nlastcol = chr(ord('A')+len(sheetheader)-1)\n\ndef rate_limited(max_per_second: int):\n    \"\"\"Rate-limits the decorated function locally, for one process.\"\"\"\n    lock = threading.Lock()\n    min_interval = 1.0 / max_per_second\n\n    def decorate(func):\n        last_time_called = time.perf_counter()\n\n        @wraps(func)\n        def rate_limited_function(*args, **kwargs):\n            lock.acquire()\n            nonlocal last_time_called\n            try:\n                elapsed = time.perf_counter() - last_time_called\n                left_to_wait = min_interval - elapsed\n                if left_to_wait > 0:\n                    time.sleep(left_to_wait)\n\n                return func(*args, **kwargs)\n            finally:\n                last_time_called = time.perf_counter()\n                lock.release()\n\n        return rate_limited_function\n\n    return decorate\n\nif makeheader:\n    # establish headers if absent\n    header = worksheet.row_values(1)\n    if header!=sheetheader:\n        print(header,'!=',sheetheader)\n        rng = 'A1:%s1'%lastcol\n        print('rng',rng)\n        cl = worksheet.range(rng)\n        i=0\n        for c in cl:\n            print(i)\n            c.value = sheetheader[i]\n            print(c,'=',sheetheader[i])\n            i+=1\n        worksheet.update_cells(cl)\n        print('headers established')\n    else:\n        print('headers ok')\nelse:\n    header = sheetheader\n    \n# walk over each output row and look for its keys in existing worksheet\ncnt=0\nkeys = ['repo','hostname'] # FIXME: case-specific . these are the keys by which row uniqueness is identified\ncolvals={}\n# obtain key column values\nfor k in keys:\n    kpos = header.index(k)\n    vals = worksheet.col_values(kpos+1)[1:]\n    colvals[k]=vals\n\n\n\nrows = open(sys.argv[1],'r').read().split(\"\\n\")[1:]\nshuffle(rows)\n\nfor row in rows:\n    cnt+=1\n    r = row.split(\"\\t\")\n    if r==['']: continue\n    #assert len(header)==len(r),Exception(header,r)\n    try:\n        rv = dict([(header[i],\n                    (len(r)-1>i and r[i] or None)) \\\n                   for i in range(len(header))])\n    except IndexError as e:\n        print('could not expand',r,'by',header)\n        raise\n\n    # determine if we have row in spread for this\n\n    def calcrowsfound():\n        rowsfound={}\n        for k in keys:\n            rowsfound[k] = [i for i in range(len(colvals[k])) if colvals[k][i]==rv[k]]\n        return rowsfound\n    rowsfound = calcrowsfound()\n    # find the intersection of all these sets\n\n    isect = set.intersection(*[set(i[1]) for i in rowsfound.items()])\n    if not len(isect):\n        @rate_limited(1)\n        def apnd():\n            worksheet.resize(cols=len(header)) # otherwise append misbehaves..\n            worksheet.append_row(r)\n            print('appending',r)\n        apnd()\n    else:\n        while len(isect)>1:\n            todel = isect.pop()\n            print('deleting row',todel,[rv[k] for k in keys])\n            for k in keys:\n                print('colval',k,':',colvals[k][todel])\n                del colvals[k][todel]\n            tdr = todel+1\n            worksheet.delete_row(tdr)\n            rowsfound = calcrowsfound()\n            isect = set.intersection(*[set(i[1]) for i in rowsfound.items()])\n            print('deleted row',tdr,worksheet.row_values(tdr))\n            #isect,[rv[k] for k in keys])\n        rownum = list(isect)[0]+2\n        rng = 'A%s:%s%s'%(rownum,lastcol,rownum)\n        #print('going to update',rng,'=>',r)\n        @rate_limited(1)\n        def rngf():\n            cl = worksheet.range(rng)\n            return cl\n        cl = rngf()\n        i=0\n        changed=False\n        for c in cl:\n            if header[i] in ['valid_fr','valid_to']: # FIXME: case-specific\n                cmpf = lambda x: x.replace(' 0',' ')\n            else:\n                cmpf = lambda x: x\n                \n            if (len(r)-1>i) and cmpf(c.value)!=cmpf(r[i]) and header[i] not in do_not_update:\n                jkey=(isect,'::'.join([rv[k] for k in keys])) \n                print(jkey,'.',c,'=',r[i],'(',i,header[i],')')\n                assert c.value!=r[i], Exception('but why?',c.value==r[i],header[i],type(c.value),type(r[i]))\n                c.value= r[i]\n                changed=True\n            i+=1\n        if changed:\n            @rate_limited(1)\n            def updc():\n                worksheet.update_cells(cl)\n            updc()\n    print(cnt,'/',len(rows))\n"

© 2023 Web GMA R&D Ltd.