#urlteam 2016-10-31, Mon

Time Nickname Message
01:06 🔗 JesseW has joined #urlteam
03:35 🔗 bwn jessew: I've got one more test run finishing up, I'll throw it into crontab tonight
03:35 🔗 bwn next month should happen automagically
03:54 🔗 GLaDOS has quit IRC (Quit: Oh crap, I died.)
04:22 🔗 JesseW excellent!
04:22 🔗 JesseW Now we just need to think of other places to automatically dump a copy...
04:22 🔗 JesseW bwn:
04:24 🔗 bwn hrm
04:27 🔗 JesseW Google Drive? Does Azure offer free space?
04:28 🔗 JesseW Are there pastebins that accept 50 MB files?
04:29 🔗 JesseW Once you've got it in the crontab, feel free to mention it in the ArchiveLabs slack channel.
04:37 🔗 bwn I can't really think of a good place that would be persistent
04:41 🔗 JesseW nothing is persistent :-) the primary value is to have copies outside of IA's control, so that if they were pressured into doing something skeevy, they could resist by pointing out that it would be discovered
04:42 🔗 JesseW and I think various places are likely to survive various things that might temporarily disrupt IA materials
04:46 🔗 bwn spraying them around google drive, etc seems a decent start for that
04:47 🔗 JesseW nods
04:48 🔗 bwn ia.bak too
04:48 🔗 JesseW yep, although I'm not sure how that would work exactly
04:49 🔗 bwn individual items might make that easier
04:50 🔗 JesseW The schedule I'm thinking of overall would be something like a monthly list of items, and a twice-a-year census of the union of *all* the existing item lists, with the results filtered of non-public details before publishing.
04:50 🔗 JesseW It might -- I'm still very uncertain about the tradeoffs there.
04:58 🔗 bwn nod
05:43 🔗 tyzoid has joined #urlteam
05:52 🔗 Start has quit IRC (Quit: Disconnected.)
05:55 🔗 Start has joined #urlteam
06:27 🔗 tyzoid So I have a small url shortener (4706 URLs) and I'm wondering how best to format it for ingestion? Or is it out of scope for URLteam?
06:30 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
06:50 🔗 tyzoid has quit IRC (Quit: Page closed)
08:15 🔗 WinterFox has joined #urlteam
08:42 🔗 W1nterFox has joined #urlteam
08:48 🔗 WinterFox has quit IRC (Read error: Operation timed out)
13:39 🔗 W1nterFox has quit IRC (Read error: Operation timed out)
15:48 🔗 JesseW has joined #urlteam
16:26 🔗 j has joined #urlteam
16:26 🔗 j has quit IRC (Client Quit)
16:43 🔗 jon_ has joined #urlteam
16:48 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
16:55 🔗 jon_ Hi, I arrived here because we noticed that you've started archiving our short urls at www.theguardian.com/p/{shortcode}
16:55 🔗 jon_ did anyone try and contact us? can we assist?
16:56 🔗 jon_ (I work for the guardian)
17:04 🔗 xmc hi jon_!
17:04 🔗 xmc glad to hear from you, sorry i'm not the main driver on urlteam
17:05 🔗 xmc stick around or leave an email address for JesseW (usa/pacific time) to contact you at; he's doing most of the work in urlteam these days
17:22 🔗 jon_ thanks for the info!
17:22 🔗 xmc sure thing :)
17:23 🔗 jon_ I can be reached at jonathan.soul at theguardian.com
17:24 🔗 xmc ok! i'll make sure JesseW gets the message
17:24 🔗 xmc do you have an agenda or do you just want to help make it go smoothly?
17:27 🔗 jon_ No particular agenda, we just noticed an increase in traffic
17:27 🔗 xmc kool
17:27 🔗 JW_work has joined #urlteam
17:27 🔗 jon_ if we can provide any information that would make the process easier (or unnecessary), I can probably find out
17:28 🔗 xmc are your url shorteners incremental, as we've observed? because if so i think the main thing we need is to not be blocked
17:28 🔗 xmc also we can tune the speed up or down if you have database load issues
17:29 🔗 jon_ not sure off the top of my head, I'll have to ask the team responsible
17:30 🔗 jon_ as far as load goes, I think the current level is fine
17:30 🔗 jon_ enough for us to notice, but not an issue
17:31 🔗 xmc ok! :)
17:51 🔗 JW_work hi, I'm here now!
17:52 🔗 JW_work jon_: Glad the load is appropriate. One bit of info that would be nice to know is: what's the current highest shortcode assigned? Once we get there, I'll make sure to shut it off so we don't waste your resources.
17:53 🔗 JW_work On a longer term basis, if it was easy for you to provide a full dump of mappings between short URLs and full story URLs, that'd be nice to have.
17:54 🔗 JW_work tyzoid (if you read the logs): The preferred format is described on the wiki page — but really anything that maps short URLs to full URLs is fine.
17:55 🔗 JW_work we don't currently *use* the resulting data for anything; we're just storing it for now.
17:55 🔗 JW_work If your shortener is incremental, probably the easiest way is just for us to scrape it ourselves.
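
For reference, a dump of the kind described here might look like the sketch below; the shortcodes, URLs, and pipe delimiter are hypothetical examples, and the wiki page remains the authoritative description of the preferred format:

    code1|http://example.com/some/long/article-path
    code2|https://example.org/another/page
    code3|http://example.net/a-third-destination
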
18:02 🔗 jon_ has quit IRC (Ping timeout: 268 seconds)
18:09 🔗 bwn has quit IRC (Read error: Operation timed out)
18:30 🔗 bwn has joined #urlteam
18:33 🔗 jon_ has joined #urlteam
18:40 🔗 bwn has quit IRC (Ping timeout: 244 seconds)
18:47 🔗 bwn has joined #urlteam
18:49 🔗 jon_ has quit IRC (Ping timeout: 268 seconds)
21:12 🔗 Start has quit IRC (Read error: Connection reset by peer)
21:13 🔗 Start has joined #urlteam
23:43 🔗 tyzoid has joined #urlteam
23:44 🔗 tyzoid Hey, so I run a small url shortener (~4000ish entries), is this out of scope for urlteam to include in the archive?
23:44 🔗 tyzoid Most of which is spam
23:45 🔗 JW_work tyzoid: hi, I saw your message earlier; thanks for dropping by again
23:46 🔗 JW_work as I said earlier (a while after you left): The preferred format is described on the wiki page — but really anything that maps short URLs to full URLs is fine. ; we don't currently *use* the resulting data for anything; we're just storing it for now. ; If your shortener is incremental, probably the easiest way is just for us to scrape it ourselves.
23:46 🔗 JW_work what's the URL of the shortener?
23:46 🔗 tyzoid I built an export function for it: http://api.tzd.me/export.php?meta
23:46 🔗 tyzoid The url shortener is http://tzd.me/
23:47 🔗 tyzoid otherwise, it is sequential
23:48 🔗 JW_work the size is small enough that I'll just make a quick project for it in a couple of days; it'll be done in an hour or two.
23:48 🔗 JW_work Are you planning on shutting it down?
23:49 🔗 tyzoid Not in the near future, but it's small enough (and spammy enough) that I don't keep it in my backups.
23:49 🔗 JW_work (oh, and I forgot to mention — Thank you *very much* for pro-actively reaching out to us! I wish all the other shorteners were so thoughtful)
23:49 🔗 tyzoid So if the server were to go offline, I would lose that data
23:50 🔗 tyzoid So would I. I think that the goal of keeping these links is laudable.
23:51 🔗 JW_work The Internet Archive would be happy to host backups — if you feel like doing it, you could code up a script to run your export API once a month or so, and upload the results to archive.org yourself. That would probably be the best way to preserve it going forward.
23:51 🔗 JW_work We'll still grab a copy of what's there now, though.
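
A minimal sketch of the kind of monthly export-and-upload script JW_work suggests here, assuming the `internetarchive` Python library (pip install internetarchive, then `ia configure` for credentials); the item identifier and filename scheme are hypothetical:

    # Fetch the tzd.me export and upload it to a (hypothetical) archive.org item.
    import datetime
    import requests
    from internetarchive import upload

    EXPORT_URL = "http://api.tzd.me/export.php?page=0"

    def backup_export():
        resp = requests.get(EXPORT_URL)
        resp.raise_for_status()
        # Date-stamp the filename so each monthly run adds a new file to the item.
        name = "tzd-me-export-%s.txt" % datetime.date.today().isoformat()
        with open(name, "w") as f:
            f.write(resp.text)
        # "tzd-me-exports" is a made-up item identifier; create your own on archive.org.
        upload("tzd-me-exports", files=[name],
               metadata={"title": "tzd.me URL shortener exports",
                         "mediatype": "data"})

    if __name__ == "__main__":
        backup_export()

Run from cron once a month, this would accumulate dated snapshots in a single item.
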
23:51 🔗 tyzoid Sounds good. Is there a resource you can point me to for a way to automate that?
23:51 🔗 tyzoid Also, is the format from the exporter useful? or would you prefer a different format?
23:52 🔗 tyzoid I tried to match the format of the other backups, but I don't know if the format has changed recently.
23:53 🔗 JW_work The format is generally fine — if you wanted to add the silly headers we use at the top, that would be nice.
23:53 🔗 JW_work But the important thing is just that you're providing a bulk export at all.
23:54 🔗 tyzoid It should be trivial to add the headers. Would the pagination be an issue?
23:54 🔗 JW_work it shouldn't be a problem
23:55 🔗 JW_work as for automatically uploading it to archive.org — it's actually even more trivial than I thought: you can just save it through the wayback machine. :-)
23:55 🔗 JW_work See: https://web.archive.org/web/20161031235339/http://api.tzd.me/export.php?page=0
23:55 🔗 JW_work I just went to https://web.archive.org/save/http://api.tzd.me/export.php?page=0 and that will grab a new copy
23:56 🔗 JW_work You can just dump that into a crontab to run once a month. :-)
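
Concretely, a crontab entry along these lines would do it; the schedule is an arbitrary choice, and the save URL is the one JW_work used above:

    # At 00:10 on the 1st of every month, ask the Wayback Machine for a fresh snapshot.
    10 0 1 * * curl -s "https://web.archive.org/save/http://api.tzd.me/export.php?page=0" > /dev/null
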
23:56 🔗 tyzoid hmm
23:56 🔗 tyzoid There's no centralized place to keep all related data right now?
23:57 🔗 JW_work and if/when you get a 2nd page (which will probably be a while, since it looks like you're only about halfway through the first page), just add that too.
23:57 🔗 JW_work The best way to get it in a sensible centralized place is for us to scrape it (which I'll get to, eventually). But as for saving the data, what I suggested is clearly the easiest.
23:59 🔗 JW_work If/when you do feel like taking it offline (or more relevantly, when you decide to no longer accept new URLs), let us know, we'll make a final scrape, and then it'll be available when (eventually) we actually *use* our data. Until then, saving the export in the Wayback Machine will serve as a good non-local backup for you.
