01:06 -- JesseW has joined #urlteam
03:35 <bwn> jessew: I've got one more test run finishing up, I'll throw it into crontab tonight
03:35 <bwn> next month should happen automagically
03:54 -- GLaDOS has quit IRC (Quit: Oh crap, I died.)
04:22 <JesseW> excellent!
04:22 <JesseW> Now we just need to think of other places to automatically dump a copy...
04:22 <JesseW> bwn:
04:24 <bwn> hrm
04:27 <JesseW> Google Drive? Does Azure offer free space?
04:28 <JesseW> Are there pastebins that accept 50 MB files?
04:29 <JesseW> Once you've got it in the crontab, feel free to mention it in the ArchiveLabs slack channel.
04:37 <bwn> I can't really think of a good place that would be persistent
04:41 <JesseW> nothing is persistent :-) the primary value is to have copies out of IA's control, so that if they were pressured into doing something skeevy, they would be able to resist by pointing out it would be discovered
04:42 <JesseW> and I think various places are likely to survive various things that might temporarily disrupt IA materials
04:46 <bwn> spraying them around google drive, etc seems a decent start for that
04:47 <JesseW> nods
04:48 <bwn> ia.bak too
04:48 <JesseW> yep, although I'm not sure how that would work exactly
04:49 <bwn> individual items might make that easier
04:50 <JesseW> The schedule I'm thinking of overall would be something like a monthly list of items, and a twice-a-year census of the union of *all* the existing item lists, with non-public details filtered out before publishing.
04:50 <JesseW> It might -- I'm still very uncertain about the tradeoffs there.
04:58 <bwn> nod
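
(An illustrative sketch of the monthly "list of items" dump JesseW describes above: the `internetarchive` Python library can enumerate a collection's item identifiers. This is not the project's actual tooling, and the collection name below is a made-up placeholder.)

    # Sketch only: dump one archive.org item identifier per line.
    # Assumes `pip install internetarchive`; the collection name is a
    # placeholder, not necessarily URLTeam's real collection.
    from datetime import date
    from internetarchive import search_items

    def dump_item_list(collection, outfile):
        """Write one item identifier per line for the given collection."""
        with open(outfile, "w") as f:
            for result in search_items("collection:" + collection):
                f.write(result["identifier"] + "\n")

    if __name__ == "__main__":
        dump_item_list("urlteam-releases",  # hypothetical collection name
                       "item-list-{:%Y-%m}.txt".format(date.today()))
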
05:43 -- tyzoid has joined #urlteam
05:52 -- Start has quit IRC (Quit: Disconnected.)
05:55 -- Start has joined #urlteam
06:27 <tyzoid> I have a small URL shortener (4706 URLs) and I'm wondering how best to format it for ingestion. Or is it out of scope for URLteam?
06:30 -- JesseW has quit IRC (Ping timeout: 370 seconds)
06:50 -- tyzoid has quit IRC (Quit: Page closed)
08:15 -- WinterFox has joined #urlteam
08:42 -- W1nterFox has joined #urlteam
08:48 -- WinterFox has quit IRC (Read error: Operation timed out)
13:39 -- W1nterFox has quit IRC (Read error: Operation timed out)
15:48 -- JesseW has joined #urlteam
16:26 -- j has joined #urlteam
16:26 -- j has quit IRC (Client Quit)
16:43 -- jon_ has joined #urlteam
16:48 -- JesseW has quit IRC (Ping timeout: 370 seconds)
16:55 <jon_> Hi, I arrived here because we noticed that you've started archiving our short URLs at www.theguardian.com/p/{shortcode}
16:55 <jon_> did anyone try to contact us? can we assist?
16:56 <jon_> (I work for the guardian)
17:04 <xmc> hi jon_!
17:04 <xmc> glad to hear from you, sorry i'm not the main driver on urlteam
17:05 <xmc> stick around or leave an email address for JesseW (usa/pacific time) to contact you at; he's doing most of the work in urlteam these days
17:22 <jon_> thanks for the info!
17:22 <xmc> sure thing :)
17:23 <jon_> I can be reached at jonathan.soul at theguardian.com
17:24 <xmc> ok! i'll make sure JesseW gets the message
17:24 <xmc> do you have an agenda or do you just want to help make it go smoothly?
17:27 <jon_> No particular agenda, we just noticed an increase in traffic
17:27 <xmc> kool
17:27 -- JW_work has joined #urlteam
17:27 <jon_> if we can provide any information that will make the process easier or not needed then I can probably find out
17:28 <xmc> are your url shorteners incremental as we've observed? because if so i think the main thing we need is to not be blocked
17:28 <xmc> also we can tune the speed up or down if you have database load issues
17:29 <jon_> not sure off the top of my head, I'll have to ask the team responsible
17:30 <jon_> as far as load goes, I think the current level is fine
17:30 <jon_> enough for us to notice, but not an issue
17:31 <xmc> ok! :)
17:51 <JW_work> hi, I'm here now!
17:52 <JW_work> jon_: Glad the load is appropriate. One bit of info that would be nice to know is: what's the current highest shortcode assigned? Once we get there, I'll make sure to shut it off so we don't waste your resources.
17:53 <JW_work> On a longer term basis, if it were easy for you to provide a full dump of mappings between short URLs and full story URLs, that'd be nice to have.
17:54 <JW_work> tyzoid (if you read the logs): The preferred format is described on the wiki page — but really anything that maps short URLs to full URLs is fine.
17:55 <JW_work> we don't currently *use* the resulting data for anything; we're just storing it for now.
17:55 <JW_work> If your shortener is incremental, probably the easiest way is just for us to scrape it ourselves.
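
(A minimal sketch of what scraping an incremental shortener involves, purely for illustration; URLTeam's real scraping runs through the terroroftinytown tracker and its clients, not a script like this. The alphabet, the ordering of codes, and the one-request-per-second rate are all assumptions to verify per site; the /p/ path is just the Guardian pattern mentioned earlier in this log.)

    # Sketch only: enumerate sequential shortcodes and record redirects.
    # Assumptions: codes come from this alphabet in increasing order, and
    # the shortener answers HEAD requests with a 301/302 redirect.
    import itertools
    import time
    import requests

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed alphabet
    BASE = "https://www.theguardian.com/p/{}"  # pattern noted in this log

    def codes():
        """Yield candidate shortcodes, shortest first."""
        for length in itertools.count(1):
            for chars in itertools.product(ALPHABET, repeat=length):
                yield "".join(chars)

    for code in itertools.islice(codes(), 100):  # small demo batch
        resp = requests.head(BASE.format(code), allow_redirects=False,
                             timeout=30)
        if resp.status_code in (301, 302):
            print(code + "\t" + resp.headers.get("Location", ""))
        time.sleep(1)  # polite rate; jon_ says the current load is fine
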
18:02 -- jon_ has quit IRC (Ping timeout: 268 seconds)
18:09 -- bwn has quit IRC (Read error: Operation timed out)
18:30 -- bwn has joined #urlteam
18:33 -- jon_ has joined #urlteam
18:40 -- bwn has quit IRC (Ping timeout: 244 seconds)
18:47 -- bwn has joined #urlteam
18:49 -- jon_ has quit IRC (Ping timeout: 268 seconds)
21:12 -- Start has quit IRC (Read error: Connection reset by peer)
21:13 -- Start has joined #urlteam
23:43 -- tyzoid has joined #urlteam
23:44 <tyzoid> Hey, so I run a small URL shortener (~4000ish entries), is this out of scope for urlteam to include in the archive?
23:44 <tyzoid> Most of which is spam
23:45 <JW_work> tyzoid: hi, I saw your message earlier; thanks for dropping by again
23:46 <JW_work> as I said earlier (a while after you left): the preferred format is described on the wiki page — but really anything that maps short URLs to full URLs is fine; we don't currently *use* the resulting data for anything, we're just storing it for now; and if your shortener is incremental, probably the easiest way is just for us to scrape it ourselves.
23:46 <JW_work> what's the URL of the shortener?
23:46 <tyzoid> I built an export function for it: http://api.tzd.me/export.php?meta
23:47 <tyzoid> The url shortener is http://tzd.me/
23:48 <tyzoid> otherwise, it is sequential
23:48 <JW_work> the size is small enough that I'll just make a quick project for it in a couple of days; the scrape itself will be done in an hour or two.
23:49 <JW_work> Are you planning on shutting it down?
23:49 <tyzoid> Not in the near future, but it's small enough (and spammy enough) that I don't keep it in my backups.
23:49 <JW_work> (oh, and I forgot to mention — Thank you *very much* for pro-actively reaching out to us! I wish all the other shorteners were so thoughtful)
23:50 <tyzoid> So if the server were to go offline, I would lose that data
23:51 <tyzoid> So would I. I think that the goal of keeping these links is laudable.
23:51 <JW_work> The Internet Archive would be happy to host backups — if you feel like doing it, you could code up a script to run your export API once a month or so, and upload the results to archive.org yourself. That would probably be the best way to preserve it going forward.
23:51 <JW_work> We'll still grab a copy of what's there now, though.
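
(A hypothetical version of the "script to run your export API and upload the results to archive.org" idea, assuming the `internetarchive` library is installed and configured with `ia configure`. The item identifier and metadata below are invented placeholders, not an existing item.)

    # Sketch only: fetch the tzd.me export and upload it to archive.org.
    from datetime import date
    import requests
    from internetarchive import upload

    EXPORT_URL = "http://api.tzd.me/export.php?page=0"

    dump = requests.get(EXPORT_URL, timeout=60)
    dump.raise_for_status()
    fname = "tzd-me-export-{:%Y-%m}.txt".format(date.today())
    with open(fname, "w") as f:
        f.write(dump.text)

    upload("tzd-me-exports",  # placeholder item identifier
           files=[fname],
           metadata={"title": "tzd.me URL shortener export",
                     "mediatype": "data"})
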
23:51 <tyzoid> Sounds good. Is there a resource you can point me to for automating that?
23:52 <tyzoid> Also, is the format from the exporter useful? or would you prefer a different format?
23:53 <tyzoid> I tried to match the format of the other backups, but I don't know if the format has changed recently.
23:53 <JW_work> The format is generally fine — if you wanted to include the silly headers we use at the top, that would be nice.
23:54 <JW_work> But the important thing is just that you're providing a bulk export at all.
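
(The log never shows the actual header lines or the export itself, so the following is a purely invented illustration of the kind of shortcode-to-URL mapping being discussed; every field name here is made up.)

    # hypothetical-shortener-export v0 (illustrative only)
    # exported: 2016-10-31
    # count: 4706
    a1|http://example.com/some/long/page
    a2|http://example.org/another/target
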
23:54 <tyzoid> It should be trivial to add the headers. Would the pagination be an issue?
23:55 <JW_work> it shouldn't be a problem
23:55 <JW_work> as for automatically uploading it to archive.org — it's actually even more trivial than I thought: you can just save it through the wayback machine. :-)
23:55 <JW_work> See: https://web.archive.org/web/20161031235339/http://api.tzd.me/export.php?page=0
23:56 <JW_work> I just went to https://web.archive.org/save/http://api.tzd.me/export.php?page=0 and that will grab a new copy
23:56 <JW_work> You can just dump that into a crontab to run once a month. :-)
23:56 <tyzoid> hmm
23:57 <tyzoid> There's no centralized place to keep all related data right now?
23:57 <JW_work> and if/when you get a 2nd page (which will probably be a while, since you're only about halfway through the first page, it looks like) just add that too.
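
(A sketch of the whole Wayback Machine routine JW_work describes: hit the /save/ endpoint for each export page, stopping when a page comes back empty. The empty-page stopping condition is an assumption about export.php, not something confirmed in the log.)

    # Sketch only: snapshot every export page via the Wayback Machine.
    import itertools
    import requests

    EXPORT = "http://api.tzd.me/export.php?page={}"
    SAVE = "https://web.archive.org/save/" + EXPORT

    for page in itertools.count():
        if not requests.get(EXPORT.format(page), timeout=60).text.strip():
            break  # assumed: pages past the end return an empty body
        requests.get(SAVE.format(page), timeout=120)  # request a snapshot

    # Scheduled monthly via cron, e.g.:
    #   0 3 1 * * /usr/bin/python3 /path/to/save_tzd_export.py
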
23:59 <JW_work> The best way to get it into a sensible centralized place is for us to scrape it (which I'll get to, eventually). But as for saving the data, what I suggested is clearly the easiest.
23:59 <JW_work> If/when you do feel like taking it offline (or, more relevantly, when you decide to no longer accept new URLs), let us know; we'll make a final scrape, and then it'll be available when (eventually) we actually *use* our data. Until then, saving the export in the Wayback Machine will serve as a good non-local backup for you.