#urlteam 2013-11-10,Sun


Time Nickname Message
07:01 🔗 edsu_ soultcer: hi :-) i was looking at your tinyarchive/tinyback code and was wondering about the intended workflow of the admin scripts for building a release
07:03 🔗 edsu soultcer: also, looking at the problem the tracker is having right now, lots of sporadic "database is locked" errors, and thinking about switching it over to postgres instead
07:04 🔗 edsu soultcer: luckily that appears to be a relatively easy change given the way you used web.py
07:04 🔗 edsu soultcer: little matter of porting the sqlite database to postgres ...
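
A minimal sketch of the kind of change involved, assuming the tracker goes through web.py's web.database() helper; the table name, credentials, and query here are illustrative, not the actual tracker code:

    import web

    # sqlite (original): the whole tracker shares one on-disk file
    # db = web.database(dbn="sqlite", db="tracker.db")

    # postgres: same web.py API, only the connection parameters change,
    # so queries written against web.database() mostly keep working
    db = web.database(dbn="postgres", db="tracker",
                      user="tracker", pw="secret", host="127.0.0.1")

    # example query, identical under both backends
    rows = db.select("tasks", where="service = $service",
                     vars={"service": "bitly"})
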
07:07 🔗 edsu soultcer: and ... just wanted to say how it was nice code to read :)
08:51 🔗 edsu pft: if you are curious about postgres changes i made https://github.com/edsu/tinyarchive/commit/1f6e95ed60b0655d01d38e7c49fd99b24d0f6bb1
08:52 🔗 edsu pft: if you (or anyone else) want to help me kick the tires on it, point your tinyback at http://ec2-54-204-142-70.compute-1.amazonaws.com:8080/
09:16 🔗 edsu opened an issue ticket to track the work too https://github.com/ArchiveTeam/tinyarchive/issues/2
10:31 🔗 GLaDOS edsu: btw, you now have commit access to the urlteam repos
10:34 🔗 edsu thanks!
10:43 🔗 edsu still need to figure out how to generate a release
15:20 🔗 soultcer edsu: To be honest, I think the tracker code is horrible. The whole tracker + sqlite thing was more of a test implementation and then it worked "okay enough" to let it be
15:20 🔗 soultcer About the release process: There are two separate parts to the tinyarchive project: The tracker, and the main database
15:21 🔗 soultcer The tracker hands out tasks and temporarily stores their results. The main database holds all the shortlinks. Periodically, the finished tasks are removed from the tracker and imported into the main db
15:22 🔗 soultcer The advantage is that you can have a well-connected server with little storage run the tracker, while the main db previously was stored on my home server (unstable internet connection, but way better hardware/storage)
15:23 🔗 soultcer The main database is a collection of berkeleydb databases, one database for each "service" (= url shortener). The implementation for that is in the tinyarchive/database.py file.
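
A rough sketch of that layout, assuming the bsddb3 bindings; the real implementation lives in tinyarchive/database.py, and the class and method names here are illustrative:

    import os
    import bsddb3

    class ShortlinkDB:
        """One BerkeleyDB database per shortener service."""

        def __init__(self, directory):
            self.directory = directory
            self._dbs = {}

        def _db(self, service):
            # lazily open one database file per service, e.g. bitly.db
            if service not in self._dbs:
                db = bsddb3.db.DB()
                db.open(os.path.join(self.directory, service + ".db"),
                        dbtype=bsddb3.db.DB_BTREE,
                        flags=bsddb3.db.DB_CREATE)
                self._dbs[service] = db
            return self._dbs[service]

        def set(self, service, code, url):
            # map shortcode -> target URL
            self._db(service).put(code.encode(), url.encode())

        def get(self, service, code):
            value = self._db(service).get(code.encode())
            return value.decode() if value is not None else None
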
15:24 🔗 soultcer If you want to create your own db, you'll need to import the last release from http://urlte.am/ first. Download it, then use release_import.py to import it. The DB needs about 500 GB if I remember correctly.
15:25 🔗 soultcer Then you can import tasks using import.py.
15:25 🔗 soultcer And every 6 months or so a new release is created using create_release.py
15:27 🔗 soultcer To make it easier to get a new release via BitTorrent, it is good to have unchanged files if no shortcode has changed. Because the files are also xz-compressed, this requires some tricks.
15:28 🔗 soultcer create_release.py compares the newly created files to the files from the previous release (--old-release= switch). If the uncompressed versions match, it just copies the compressed file from the old release instead of recompressing it.
15:28 🔗 soultcer This saves time (xz compression is slow) and allows BitTorrent to recognize the file as unchanged.
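
A sketch of that comparison, assuming one uncompressed dump per service and using Python's lzma module in place of the xz binary; the function names are illustrative:

    import hashlib
    import lzma
    import shutil

    def content_hash(stream):
        # hash a file object in chunks
        h = hashlib.sha256()
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            h.update(chunk)
        return h.hexdigest()

    def compress_or_reuse(new_txt, old_xz, out_xz):
        with open(new_txt, "rb") as f:
            new_hash = content_hash(f)
        with lzma.open(old_xz, "rb") as f:
            old_hash = content_hash(f)  # hash the *uncompressed* old contents
        if new_hash == old_hash:
            # unchanged: copy the old .xz byte-for-byte, so BitTorrent sees
            # identical pieces and clients keep their existing data
            shutil.copyfile(old_xz, out_xz)
        else:
            # changed: pay the (slow) xz compression cost
            with open(new_txt, "rb") as src, lzma.open(out_xz, "wb") as dst:
                shutil.copyfileobj(src, dst)
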
15:29 🔗 soultcer Okay, now some more things about the tracker:
15:30 🔗 soultcer I saw a mention in scrollback about allowing "admin rights" for connections from 127.0.0.1: web.py understands the x-forwarded-for header and it is crucial that whatever proxy runs in front of the tracker sends this header.
15:31 🔗 soultcer a) Because otherwise every user could create/retrieve/delete tasks and b) because the tracker only hands out 1 task per service per IP address
15:31 🔗 soultcer So if the tracker only sees 127.0.0.1 as IP address, it will not hand out tasks correctly.
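
A sketch of the two checks, assuming a web.py handler; the trusted-proxy test and helper names are assumptions, not the tracker's actual code:

    import web

    def client_ip():
        # prefer the address the proxy reports; fall back to the socket peer
        forwarded = web.ctx.env.get("HTTP_X_FORWARDED_FOR")
        if forwarded and web.ctx.ip == "127.0.0.1":
            # only trust the header when the request really came through the
            # local proxy; take the first (client) address in the chain
            return forwarded.split(",")[0].strip()
        return web.ctx.ip

    def is_admin():
        # admin rights only for local connections that did NOT
        # arrive through the proxy (no forwarding header present)
        return (web.ctx.env.get("HTTP_X_FORWARDED_FOR") is None
                and web.ctx.ip == "127.0.0.1")
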
15:32 🔗 soultcer Also, there should only be one thread running for the tracker. I used nginx to make sure that only 1 connection was forwarded at a time.
15:33 🔗 soultcer If there is more than 1 connection, it should _not_ corrupt the database, but it might lead to the same task being handed out multiple times, and then one user will not be able to submit his finished task. So it won't break anything, but cause needless work to be done.
15:35 🔗 soultcer Now, since you have ported the tracker to postgres, you could use database locking to ensure no tasks are handed out multiple times. Or you could improve the database schema so that handed out tasks are stored in a separate table.
15:35 🔗 soultcer Then multiple web.py threads could run at the same time, and locking would only be needed briefly, while handing out tasks
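
A sketch of that second variant, assuming psycopg2 and a hypothetical tasks table with an assigned_to column (not the actual tracker schema): SELECT ... FOR UPDATE locks the chosen row, so two concurrent requests cannot be handed the same task.

    import psycopg2

    def hand_out_task(conn, service, ip):
        with conn:  # commit on success, roll back on error
            with conn.cursor() as cur:
                # lock one unassigned task row; a concurrent request blocks
                # here instead of receiving the same task
                cur.execute(
                    """SELECT id FROM tasks
                       WHERE service = %s AND assigned_to IS NULL
                       LIMIT 1
                       FOR UPDATE""",
                    (service,))
                row = cur.fetchone()
                if row is None:
                    return None
                cur.execute(
                    "UPDATE tasks SET assigned_to = %s WHERE id = %s",
                    (ip, row[0]))
                return row[0]
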
16:59 🔗 pft this is good stuff
17:10 🔗 pft edsu: testing engaged
17:10 🔗 pft edsu: if things get out of hand and you need me to shut down the testing just send me a msg, those forward to my phone
21:33 🔗 edsu pft: thank you sir
21:34 🔗 edsu pft: seems to be doing ok http://ec2-54-204-142-70.compute-1.amazonaws.com:8080/
21:34 🔗 edsu there were just a few little wrinkles to getting the sql dump loaded into postgres
21:34 🔗 edsu i was contemplating changing the timestamp columns to datetimes
21:34 🔗 edsu but that would involve some more work
21:35 🔗 edsu maybe that could be an incremental improvement for down the road?
21:35 🔗 edsu bauruine: thanks for helping out w/ the testing too :)
21:37 🔗 bauruine edsu, np
21:41 🔗 pft yeah, datetimes might be better
21:41 🔗 pft there might be better gains to be found in doing some of what soultcer said
21:50 🔗 * edsu looks at scrollback
21:58 🔗 edsu lots of good info
21:58 🔗 edsu so i'm not testing behind varnish, which is a big difference
22:01 🔗 edsu i see, the varnish config does forward the ip, which is good :)
22:04 🔗 edsu soultcer: you have any suspicions about the locking errors from sqlite? https://github.com/ArchiveTeam/tinyarchive/issues/2
22:06 🔗 soultcer That's what happens when two processes try to access the sqlite db at the same time. Ideally that should not happen.
22:07 🔗 soultcer Though it does not cause any db corruption. I think there is a config option for sqlite to wait for a couple of milliseconds for the lock instead of throwing an exception
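
The option soultcer means is SQLite's busy timeout; Python's sqlite3 module exposes it as the timeout argument to connect (5 seconds by default), and it can also be set per connection with a pragma:

    import sqlite3

    # wait up to 10 seconds for a competing writer to release the lock
    # instead of raising "database is locked" immediately
    conn = sqlite3.connect("tracker.db", timeout=10.0)

    # equivalent pragma (in milliseconds), usable from any sqlite client
    conn.execute("PRAGMA busy_timeout = 10000")
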
22:08 🔗 edsu soultcer: currently the tracker seems to block so much that varnish thinks it is dead
22:09 🔗 soultcer If varnish is configured to only let 1 concurrent connection through, then this exception should never happen (unless varnish is somehow bypassed)
22:09 🔗 edsu i guess the tracker itself is multithreaded, i didn't realize web.py could do that, but top definitely seems to show multiple threads
22:10 🔗 soultcer It depends on how web.py is run. web.py is just the web framework; you can run it using pretty much any threaded or non-threaded WSGI-compatible server
22:10 🔗 edsu seems to be run without any indication of processes currently
22:10 🔗 * edsu double checks
22:11 🔗 edsu yea, python tracker.py 127.0.0.1:7091
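
For comparison, one way to guarantee a single-threaded run without relying on a proxy, assuming tracker.py exposes its web.py application as a module-level app: the stdlib wsgiref reference server handles one request at a time.

    from wsgiref.simple_server import make_server

    import tracker  # assumption: tracker.py defines app = web.application(...)

    # wsgiref's simple_server serves requests strictly one after another,
    # so the sqlite database never sees concurrent access
    httpd = make_server("127.0.0.1", 7091, tracker.app.wsgifunc())
    httpd.serve_forever()
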
22:11 🔗 edsu soultcer: if you are curious you can take a peek yourself, i can add your pubkey
22:12 🔗 edsu pft, bauruine: i have varnish running now in front of the tracker, if you want to point your clients at it
22:12 🔗 edsu http://ec2-54-204-142-70.compute-1.amazonaws.com/
22:12 🔗 soultcer The easiest way is to just tell varnish to only pass 1 connection at a time through.
22:15 🔗 edsu .max_connections
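
In Varnish VCL that corresponds to a backend setting; a sketch, with the host and port matching the tracker invocation above rather than the actual config:

    backend tracker {
        .host = "127.0.0.1";
        .port = "7091";
        # let only one connection through to the tracker at a time,
        # serializing requests so the sqlite db is never contended
        .max_connections = 1;
    }
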
22:16 🔗 edsu i guess we can try that, but it still might be nice to upgrade to postgresql
22:18 🔗 edsu but perhaps it's a bit more work than i suspected, with the task creation bits
22:23 🔗 edsu well .max_connections = 1 does make the database lock errors go away :)
22:25 🔗 edsu but it means varnish is returning 503 most of the time (visible in varnishncsa), because the queue is full
22:31 🔗 soultcer Hehe that's the next problem with sqlite
22:32 🔗 soultcer It's slow, especially with that borked db layout I made
22:36 🔗 edsu sqlite was masking an aggregate query bug that running on postgres revealed
22:36 🔗 edsu in principle the schema is pretty straightforward and fine imho
22:37 🔗 edsu you are too hard on yourself, the code is quite nice too ; the admin workflow is a bit difficult to grok, but i've seen so much worse :-)
