07:01 <edsu_> soultcer: hi :-) i was looking at your tinyarchive/tinyback code and was wondering about the intended workflow of the admin scripts for building a release
07:03 <edsu> soultcer: also, looking at the problem the tracker is having right now, lots of sporadic "database is locked" errors, and thinking about switching it over to postgres instead
07:04 <edsu> soultcer: luckily that appears to be a relatively easy change given the way you used web.py
07:04 <edsu> soultcer: little matter of porting the sqlite database to postgres ...
07:07 <edsu> soultcer: and ... just wanted to say what nice code it was to read :)
08:51 <edsu> pft: if you are curious about the postgres changes i made https://github.com/edsu/tinyarchive/commit/1f6e95ed60b0655d01d38e7c49fd99b24d0f6bb1
08:52 <edsu> pft: if you (or anyone else) want to help me kick the tires on it, point your tinyback at http://ec2-54-204-142-70.compute-1.amazonaws.com:8080/
09:16 <edsu> opened an issue ticket to track the work too: https://github.com/ArchiveTeam/tinyarchive/issues/2
10:31 <GLaDOS> edsu: btw, you now have commit access to the urlteam repos
10:34 <edsu> thanks!
10:43 <edsu> still need to figure out how to generate a release
15:20 <soultcer> edsu: To be honest, I think the tracker code is horrible. The whole tracker + sqlite thing was more of a test implementation, and then it worked "okay enough" that I let it be
15:20 <soultcer> About the release process: there are two separate parts to the tinyarchive project: the tracker, and the main database
15:21 <soultcer> The tracker hands out tasks and temporarily stores their results. The main database holds all the shortlinks. Periodically the finished tasks are removed from the tracker and imported into the main db
15:22 <soultcer> The advantage is that you can have a well-connected server with little storage run the tracker, while the main db was previously stored on my home server (unstable internet connection, but way better hardware/storage)
15:23 <soultcer> The main database is a collection of berkeleydb databases, one database for each "service" (= url shortener). The implementation for that is in the tinyarchive/database.py file.
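A minimal sketch of that one-database-per-service layout, assuming the bsddb3 bindings; the names here are illustrative only, and tinyarchive/database.py remains the authoritative implementation:

    import os
    from bsddb3 import db  # Berkeley DB bindings for Python

    def open_service_db(data_dir, service):
        # one B-tree database file per URL shortener ("service")
        handle = db.DB()
        handle.open(os.path.join(data_dir, service + ".db"),
                    dbtype=db.DB_BTREE, flags=db.DB_CREATE)
        return handle

    # store shortcode -> long URL for one hypothetical service
    bitly = open_service_db("/data/tinyarchive", "bitly")
    bitly.put(b"2h3k9", b"http://example.com/some/long/path")
    bitly.close()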
15:24 <soultcer> If you want to create your own db, you'll need to import the last release from http://urlte.am/ first. Download it, then use release_import.py to import it. The DB needs about 500 GB if I remember correctly.
15:25 <soultcer> Then you can import tasks using import.py.
15:25 <soultcer> And every 6 months or so a new release is created using create_release.py
15:27 <soultcer> To make it easier to fetch a new release via BitTorrent, it is good to keep files byte-identical when no shortcode in them has changed. Because the files are also xz-compressed, this requires some tricks.
15:28 <soultcer> create_release.py compares the newly created files to the files from the previous release (the --old-release= switch). If the uncompressed versions match, it just copies the compressed file from the old release instead of recompressing it.
15:28 <soultcer> This saves time (xz compression is slow) and allows BitTorrent to recognize the file as unchanged.
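A sketch of that reuse trick, assuming Python 3's stdlib (paths invented; create_release.py is the real implementation):

    import filecmp
    import lzma
    import shutil

    def compress_or_reuse(new_plain, old_plain, old_xz, out_xz):
        if filecmp.cmp(new_plain, old_plain, shallow=False):
            # byte-identical content: copy the old .xz so BitTorrent
            # sees identical pieces and skips the download
            shutil.copyfile(old_xz, out_xz)
        else:
            # content changed: pay the (slow) xz compression cost
            with open(new_plain, "rb") as src, lzma.open(out_xz, "wb") as dst:
                shutil.copyfileobj(src, dst)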
15:29 <soultcer> Okay, now some more things about the tracker:
15:30 <soultcer> I saw a mention in scrollback about allowing "admin rights" for connections from 127.0.0.1: web.py understands the X-Forwarded-For header, and it is crucial that whatever proxy runs in front of the tracker sends this header.
15:31 <soultcer> a) because otherwise every user could create/retrieve/delete tasks, and b) because the tracker only hands out 1 task per service per IP address
15:31
🔗
|
soultcer |
So if the tracker only sees 127.0.0.1 as IP address, it will not hand out tasks correctly. |
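The proxy side of that might look like the following nginx sketch; the tracker address is taken from later in this log, and the exact config is an assumption:

    location / {
        proxy_pass http://127.0.0.1:7091;
        # pass the real client address through to the tracker
        proxy_set_header X-Forwarded-For $remote_addr;
    }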
15:32
🔗
|
soultcer |
Also, there should only be one thread running for the tracker. I used nginx to make sure that only 1 connection was forwarded at a time. |
15:33
🔗
|
soultcer |
If there is more than 1 connection, it should _not_ corrupt the database, but it might lead to the same task being handed out multiple times, and then one user will not be able to submit his finished task. So it won't break anything, but cause needless work to be done. |
15:35 <soultcer> Now, since you have ported the tracker to postgres, you could use database locking to ensure no task is handed out multiple times. Or you could improve the database schema so that handed-out tasks are stored in a separate table.
15:35 <soultcer> Then multiple web.py threads could run at the same time, and the lock would only be held briefly, while handing out tasks
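A rough sketch of the locking approach with psycopg2; the tasks table, its columns, and the claim logic are all invented for illustration and are not the actual tracker schema:

    import psycopg2

    conn = psycopg2.connect("dbname=tracker")

    def claim_task(service, client_ip):
        with conn:  # commits on success, rolls back on error
            with conn.cursor() as cur:
                # table lock held only until this transaction ends,
                # so two threads can never claim the same task
                cur.execute("LOCK TABLE tasks IN EXCLUSIVE MODE")
                cur.execute("SELECT id, payload FROM tasks"
                            " WHERE service = %s AND assigned_ip IS NULL"
                            " LIMIT 1", (service,))
                row = cur.fetchone()
                if row is None:
                    return None  # no open task for this service
                cur.execute("UPDATE tasks SET assigned_ip = %s"
                            " WHERE id = %s", (client_ip, row[0]))
                return row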
16:59 <pft> this is good stuff
17:10 <pft> edsu: testing engaged
17:10 <pft> edsu: if things get out of hand and you need me to shut down the testing, just send me a msg, those forward to my phone
21:33 <edsu> pft: thank you sir
21:34 <edsu> pft: seems to be doing ok http://ec2-54-204-142-70.compute-1.amazonaws.com:8080/
21:34 <edsu> there were just a few little wrinkles in getting the sql dump loaded into postgres
21:34 <edsu> i was contemplating changing the timestamp columns to datetimes
21:34 <edsu> but that would involve some more work
21:35 <edsu> maybe that could be an incremental improvement down the road?
21:35 <edsu> bauruine: thanks for helping out w/ the testing too :)
21:37 <bauruine> edsu, np
21:41 <pft> yeah, datetimes might be better
21:41 <pft> there might be better gains to be found by doing some of what soultcer said
21:50
🔗
|
* |
edsu looks at scrollback |
21:58 <edsu> lots of good info
21:58 <edsu> so i'm not testing behind varnish, which is a big difference
22:01 <edsu> i see, the varnish config does forward the ip, which is good :)
22:04 <edsu> soultcer: do you have any suspicions about the locking errors from sqlite? https://github.com/ArchiveTeam/tinyarchive/issues/2
22:06 <soultcer> That's what happens when two processes try to access the sqlite db at the same time. Ideally that should not happen.
22:07 <soultcer> Though it does not cause any db corruption. I think there is a config option for sqlite to wait a couple of milliseconds for the lock instead of throwing an exception
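The option soultcer is remembering is most likely sqlite's busy timeout, reachable from Python's sqlite3 module or as a pragma (the db filename here is assumed):

    import sqlite3

    # wait up to 5 seconds for a lock instead of raising
    # "OperationalError: database is locked" immediately
    conn = sqlite3.connect("tracker.db", timeout=5.0)
    # the same knob as a pragma, in milliseconds
    conn.execute("PRAGMA busy_timeout = 5000")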
22:08 <edsu> soultcer: currently the tracker seems to block so much that varnish thinks it is dead
22:08
🔗
|
edsu |
currently |
22:09 <soultcer> If varnish is configured to only let 1 concurrent connection through, then this exception should never happen (unless varnish is somehow bypassed)
22:09
🔗
|
edsu |
i guess the tracker itself is multithreaded, i didn't realize web.py could do that, but top definitely seems to show multiple threads |
22:10 <soultcer> It depends on how web.py is run. web.py is just the web framework; you can run it using pretty much any threaded or non-threaded wsgi-compatible server
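For illustration, the same web.py application object can be handed to any WSGI server; a sketch using the standard library's single-threaded wsgiref (the app here is hypothetical, not the real tracker.py):

    import web
    from wsgiref.simple_server import make_server

    urls = ("/", "Index")

    class Index:
        def GET(self):
            return "hello"

    app = web.application(urls, globals())
    # wsgiref's server handles one request at a time, which would
    # sidestep concurrent access to the database entirely
    make_server("127.0.0.1", 7091, app.wsgifunc()).serve_forever()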
22:10 <edsu> it seems to be run without anything configuring threads or processes currently
22:10
🔗
|
* |
edsu double checks |
22:11 <edsu> yea, python tracker.py 127.0.0.1:7091
22:11
🔗
|
edsu |
soultcer: if you are curious you can take a peek yourself, i can add your pubkey |
22:12 <edsu> pft, bauruine: i have varnish running now in front of the tracker, if you want to point your clients at it
22:12 <edsu> http://ec2-54-204-142-70.compute-1.amazonaws.com/
22:12
🔗
|
soultcer |
The easiest way is to just tell varnish to only pass 1 connection at a time through. |
22:15
🔗
|
edsu |
.max_connections |
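In varnish VCL that looks something like the following (backend address assumed from the chat):

    backend tracker {
        .host = "127.0.0.1";
        .port = "7091";
        # never open more than one connection to the tracker,
        # so sqlite never sees concurrent access
        .max_connections = 1;
    }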
22:16
🔗
|
edsu |
i guess we can try that, but it still might be nice to upgrade to postgresql |
22:18
🔗
|
edsu |
but perhaps it's a bit more work than i suspected, with the task creation bits |
22:23
🔗
|
edsu |
well .max_connections = 1 does make the database lock errors go away :) |
22:25
🔗
|
edsu |
but it means that varnishncsa is throwing a 503 most of the time, because the queue is full |
22:31
🔗
|
soultcer |
Hehe that's the next problem with sqlite |
22:32
🔗
|
soultcer |
It's slow especially with that borked db layout I made |
22:36 <edsu> sqlite was masking an aggregate query bug that running on postgres revealed
22:36
🔗
|
edsu |
in principle the schema is pretty straightforward and fine imho |
22:37 <edsu> you are too hard on yourself, the code is quite nice too; the admin workflow is a bit difficult to grok, but i've seen so much worse :-)