[07:01] <edsu_> soultcer: hi :-) i was looking at your tinyarchive/tinyback code and was wondering at the intended workflow of the admin scripts for building a release
[07:03] <edsu> soultcer: also, looking at the problem the tracker is having right now, lots of sporadic database locked errors, and thinking about switching it over to postgres instead
[07:04] <edsu> soultcer: luckily that appears to be a relatively easy change given the way you used web.py
[07:04] <edsu> soultcer: little matter of porting the sqlite database to postgres ...
[07:07] <edsu> soultcer: and ... just wanted to say how it was nice code to read :)
[08:51] <edsu> pft: if you are curious about postgres changes i made https://github.com/edsu/tinyarchive/commit/1f6e95ed60b0655d01d38e7c49fd99b24d0f6bb1
[08:52] <edsu> pft: if you (or anyone else) want to help me kick the tires on it, point your tinyback at http://ec2-54-204-142-70.compute-1.amazonaws.com:8080/
[09:16] <edsu> opened an issue ticket to track the work too https://github.com/ArchiveTeam/tinyarchive/issues/2
[10:31] <GLaDOS> edsu: btw, you now have commit access to the urlteam repos
[10:34] <edsu> thanks!
[10:43] <edsu> still need to figure out how to generate a release
[15:20] <soultcer> edsu: To be honest, I think the tracker code is horrible. The whole tracker + sqlite thing was more of a test implementation and then it worked "okay enough" to let it be
[15:20] <soultcer> About the release process: There are two separate parts to the tinyarchive project: The tracker, and the main database
[15:21] <soultcer> The tracker hands out tasks and temporarily stores their results. The main database holds all the shortlinks. Periodically the finished tasks from the tracker are removed, and imported into the main db
[15:22] <soultcer> The advantage is that you can have a well-connected server with little storage run the tracker, while the main db previously was stored on my home server (unstable internet connection, but way better hardware/storage)
[15:23] <soultcer> The main database is a collection of berkeleydb databases, one database for each "service" (= url shortener). The implementation for that is in the tinyarchive/database.py file.
[15:24] <soultcer> If you want to create your own db, you'll need to import the last release from http://urlte.am/ first. Download it, then use release_import.py to import it. The DB needs about 500 GB if I remember correctly.
[15:25] <soultcer> Then you can import tasks using import.py.
[15:25] <soultcer> And every 6 months or so a new release is created using create_release.py
[15:27] <soultcer> To make it easier to get a new release via BitTorrent, it is a good to have unchanged files if no shortcode has changed. Because the files are also xz-compressed, this requires some tricks.
[15:28] <soultcer> create_release.py compares the newly created files to the files from the previous release (--old-release= switch). If the uncompressed versions match, it just copies the compressed file from the old release instead of recompressing it.
[15:28] <soultcer> This saves time (xz compression is slow) and allows BitTorrent to recognize the file as unchanged.
[15:29] <soultcer> Okay, now some more things about the tracker:
[15:30] <soultcer> I saw a mention in scrollback about allowing "admin rights" for connections from 127.0.0.1: web.py understands the x-forwarded-for header and it is crucial that whatever proxy runs in front of the tracker sends this header.
[15:31] <soultcer> a) Because otherwise every user could create/retrieve/delete tasks and b) because the tracker only hands out 1 task per service per IP address
[15:31] <soultcer> So if the tracker only sees 127.0.0.1 as IP address, it will not hand out tasks correctly.
[15:32] <soultcer> Also, there should only be one thread running for the tracker. I used nginx to make sure that only 1 connection was forwarded at a time.
[15:33] <soultcer> If there is more than 1 connection, it should _not_ corrupt the database, but it might lead to the same task being handed out multiple times, and then one user will not be able to submit his finished task. So it won't break anything, but cause needless work to be done.
[15:35] <soultcer> Now, since you have ported the tracker to postgres, you could use database locking to ensure no tasks are handed out multiple times. Or you could improve the database schema so that handed out tasks are stored in a separate table.
[15:35] <soultcer> Then multiple web.py threads could run at the same time and locking will only be done for a short time, during handing out tasks
[16:59] <pft> this is good stuff
[17:10] <pft> edsu: testing engaged
[17:10] <pft> edsu: if things get out of hand and you need me to shut down the testing just send me a msg, those forward to my phone
[21:33] <edsu> pft: thank you sir
[21:34] <edsu> pft: seems to be doing ok http://ec2-54-204-142-70.compute-1.amazonaws.com:8080/
[21:34] <edsu> there were just a few little wrinkles to getting the sql dump loaded into postgres
[21:34] <edsu> i was contemplating changing the timestamp colums to datetimes
[21:34] <edsu> but that would involve some more work
[21:35] <edsu> maybe that could be an incremental improvement for down the road?
[21:35] <edsu> columns
[21:35] <edsu> bauruine: thanks for helping out w/ the testing too :)
[21:37] <bauruine> edsu, np
[21:41] <pft> yeah, datetimes might be better
[21:41] <pft> might be better gains to be found by doing some of what soultcer said
[21:50] * edsu looks at scrollback
[21:58] <edsu> lots of good info
[21:58] <edsu> so i'm not testing behind varnish, which is a big difference
[22:01] <edsu> i see, the varnish config does forward the ip, which is good :)
[22:04] <edsu> soultcer: you have any suspicions about the locking errors from sqlite? https://github.com/ArchiveTeam/tinyarchive/issues/2
[22:06] <soultcer> That's what happens when two processes try to access the sqlite db at the same time. Ideally that should not happen.
[22:07] <soultcer> Though it does not cause any db corruption. I think there is a config option for sqlite to wait for a couple of milliseconds for the lock instead of throwing an exception
[22:08] <edsu> soultcer: current the tracker seems to block so much varnish thinks it is dead
[22:08] <edsu> currently
[22:09] <soultcer> If varnish is configured to only let 1 concurrent connection through, then this exception should never happen (unless varnish is somehow bypassed)
[22:09] <edsu> i guess the tracker itself is multithreaded, i didn't realize web.py could do that, but top definitely seems to show multiple threads
[22:10] <soultcer> It depends on how web.py is run. webpy is just the web framework, you can run it using pretty much any threaded or non-threaded wsgi-compatible server
[22:10] <edsu> seems to be run without any indication of processes currently
[22:10] * edsu double checks
[22:11] <edsu> yea, python tracker.py 127.0.0.1:7091
[22:11] <edsu> soultcer: if you are curious you can take a peek yourself, i can add your pubkey
[22:12] <edsu> pft, baurine: i have varnish running now in front of the tracker, if you want to point your clients at it
[22:12] <edsu> http://ec2-54-204-142-70.compute-1.amazonaws.com/
[22:12] <soultcer> The easiest way is to just tell varnish to only pass 1 connection at a time through.
[22:15] <edsu> .max_connections
[22:16] <edsu> i guess we can try that, but it still might be nice to upgrade to postgresql
[22:18] <edsu> but perhaps it's a bit more work than i suspected, with the task creation bits
[22:23] <edsu> well .max_connections = 1 does make the database lock errors go away :)
[22:25] <edsu> but it means that varnishncsa is throwing a 503 most of the time, because the queue is full
[22:31] <soultcer> Hehe that's the next problem with sqlite
[22:32] <soultcer> It's slow especially with that borked db layout I made
[22:36] <edsu> sqlite was masking a aggregate query bug that running of postgres showed
[22:36] <edsu> in principle the schema is pretty straightforward and fine imho
[22:37] <edsu> you are too hard on yourself, the code is quite nice too ; the admin workflow is a bit difficult to grok, but i've seen so much worse :-)