[01:06] *** JesseW has joined #urlteam
[03:35] jessew: I've got one more test run finishing up, I'll throw it into crontab tonight
[03:35] next month should happen automagically
[03:54] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[04:22] excellent!
[04:22] Now we just need to think of other places to automatically dump a copy...
[04:22] bwn:
[04:24] hrm
[04:27] Google Drive? Does Azure offer free space?
[04:28] Are there pastebins that accept 50 MB files?
[04:29] Once you've got it in the crontab, feel free to mention it in the ArchiveLabs slack channel.
[04:37] I can't really think of a good place that would be persistent
[04:41] nothing is persistent :-) the primary value is to have copies outside IA's control, so if they were pressured into doing something skeevy, they could resist by pointing out it would be discovered
[04:42] and I think various places are likely to survive various things that might temporarily disrupt IA materials
[04:46] spraying them around Google Drive, etc. seems a decent start for that
[04:47] nods
[04:48] ia.bak too
[04:48] yep, although I'm not sure how that would work exactly
[04:49] individual items might make that easier
[04:50] The schedule I'm thinking of overall would be something like a monthly list of items, and a twice-a-year census of the union of *all* the existing item lists, with the results filtered of non-public details before publishing.
[04:50] It might -- I'm still very uncertain about the tradeoffs there.
[04:58] nod
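A minimal sketch of what the monthly item list mentioned at [04:50] might look like, assuming the `internetarchive` Python package; the collection query is a placeholder, since the log doesn't name the actual collection, and any non-public details would still need filtering before publication:

```python
#!/usr/bin/env python3
"""Rough sketch of the monthly item-list idea from the [04:50] messages."""
from datetime import date
from internetarchive import search_items

# Placeholder query -- the real URLTeam collection name is an assumption here.
QUERY = 'collection:UrlteamWebCrawls'

def dump_item_list(path):
    """Write one archive.org item identifier per line to `path`."""
    with open(path, 'w') as f:
        for result in search_items(QUERY):
            f.write(result['identifier'] + '\n')

if __name__ == '__main__':
    # Run from cron once a month: item-list-2016-10.txt, item-list-2016-11.txt, ...
    dump_item_list('item-list-%s.txt' % date.today().strftime('%Y-%m'))
```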
[05:43] *** tyzoid has joined #urlteam
[05:52] *** Start has quit IRC (Quit: Disconnected.)
[05:55] *** Start has joined #urlteam
[06:27] So I have a small url shortener (4706 URLs), and I'm wondering how best to format it for ingestion? Or is it out of scope for URLteam?
[06:30] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:50] *** tyzoid has quit IRC (Quit: Page closed)
[08:15] *** WinterFox has joined #urlteam
[08:42] *** W1nterFox has joined #urlteam
[08:48] *** WinterFox has quit IRC (Read error: Operation timed out)
[13:39] *** W1nterFox has quit IRC (Read error: Operation timed out)
[15:48] *** JesseW has joined #urlteam
[16:26] *** j has joined #urlteam
[16:26] *** j has quit IRC (Client Quit)
[16:43] *** jon_ has joined #urlteam
[16:48] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[16:55] Hi, I arrived here because we noticed that you've started archiving our short urls at www.theguardian.com/p/{shortcode}
[16:55] did anyone try and contact us? can we assist?
[16:56] (I work for the guardian)
[17:04] hi jon_!
[17:04] glad to hear from you, sorry i'm not the main driver on urlteam
[17:05] stick around or leave an email address for JesseW (usa/pacific time) to contact you at, he's doing most of the work in urlteam these days
[17:22] thanks for the info!
[17:22] sure thing :)
[17:23] I can be reached at jonathan.soul at theguardian.com
[17:24] ok! i'll make sure JesseW gets the message
[17:24] do you have an agenda or do you just want to help make it go smoothly?
[17:27] No particular agenda, we just noticed an increase in traffic
[17:27] kool
[17:27] *** JW_work has joined #urlteam
[17:27] if we can provide any information that will make the process easier or not needed then I can probably find out
[17:28] are your url shorteners incremental as we've observed? because if so i think the main thing we need is to not be blocked
[17:28] also we can tune the speed up or down if you have database load issues
[17:29] not sure off the top of my head, I'll have to ask the team responsible
[17:30] as far as load goes, I think the current level is fine
[17:30] enough for us to notice, but not an issue
[17:31] ok! :)
[17:51] hi, I'm here now!
[17:52] jon_: Glad the load is appropriate. One bit of info that would be nice to know is: what's the current highest shortcode assigned? Once we get there, I'll make sure to shut it off so we don't waste your resources.
[17:53] On a longer term basis, if it was easy for you to provide a full dump of mappings between short URLs and full story URLs, that'd be nice to have.
[17:54] tyzoid (if you read the logs): The preferred format is described on the wiki page — but really anything that maps short URLs to full URLs is fine.
[17:55] we don't currently *use* the resulting data for anything; we're just storing it for now.
[17:55] If your shortener is incremental, probably the easiest way is just for us to scrape it ourselves.
[18:02] *** jon_ has quit IRC (Ping timeout: 268 seconds)
[18:09] *** bwn has quit IRC (Read error: Operation timed out)
[18:30] *** bwn has joined #urlteam
[18:33] *** jon_ has joined #urlteam
[18:40] *** bwn has quit IRC (Ping timeout: 244 seconds)
[18:47] *** bwn has joined #urlteam
[18:49] *** jon_ has quit IRC (Ping timeout: 268 seconds)
[21:12] *** Start has quit IRC (Read error: Connection reset by peer)
[21:13] *** Start has joined #urlteam
[23:43] *** tyzoid has joined #urlteam
[23:44] Hey, so I run a small url shortener (~4000ish entries), is this out of scope for urlteam to include in the archive?
[23:44] Most of which is spam
[23:45] tyzoid: hi, I saw your message earlier; thanks for dropping by again
[23:46] as I said earlier (a while after you left): The preferred format is described on the wiki page — but really anything that maps short URLs to full URLs is fine; we don't currently *use* the resulting data for anything, we're just storing it for now; and if your shortener is incremental, probably the easiest way is just for us to scrape it ourselves.
[23:46] what's the URL of the shortener?
[23:46] I built an export function for it: http://api.tzd.me/export.php?meta
[23:46] The url shortener is http://tzd.me/
[23:47] otherwise, it is sequential
[23:48] the size is small enough, I'll just make a quick project for it in a couple of days; it'll be done in an hour or two.
[23:48] Are you planning on shutting it down?
[23:49] Not in the near future, but it's small enough (and spammy enough) that I don't keep it in my backups.
[23:49] (oh, and I forgot to mention — thank you *very much* for proactively reaching out to us! I wish all the other shorteners were so thoughtful)
[23:49] So if the server were to go offline, I would lose that data
[23:50] So would I. I think that the goal of keeping these links is laudable.
[23:51] The Internet Archive would be happy to host backups — if you feel like doing it, you could code up a script to run your export API once a month or so, and upload the results to archive.org yourself. That would probably be the best way to preserve it going forward.
[23:51] We'll still grab a copy of what's there now, though.
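One possible shape for the monthly export-and-upload script suggested at [23:51], assuming the `internetarchive` Python package with configured credentials (`ia configure`); the item identifier and metadata are placeholders, not anything agreed on in the log:

```python
#!/usr/bin/env python3
"""Sketch of the [23:51] idea: fetch the shortener's export once a month
and upload the result to archive.org."""
import urllib.request
from datetime import date
from internetarchive import upload

EXPORT_URL = 'http://api.tzd.me/export.php?page=0'

def fetch_and_upload():
    # Grab the current export from the shortener's API and save it locally.
    filename = 'tzd-me-export-%s.txt' % date.today().strftime('%Y-%m')
    with urllib.request.urlopen(EXPORT_URL) as resp, open(filename, 'wb') as f:
        f.write(resp.read())

    # Upload the file into an archive.org item; the identifier is hypothetical.
    upload('tzd-me-exports',
           files=[filename],
           metadata={'title': 'tzd.me URL shortener exports',
                     'mediatype': 'data'})

if __name__ == '__main__':
    fetch_and_upload()
```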
[23:51] Sounds good. Is there a resource you can point me to for automating that?
[23:51] Also, is the format from the exporter useful? or would you prefer a different format?
[23:52] I tried to match the format of the other backups, but I don't know if the format has changed recently.
[23:53] The format is generally fine — if you wanted to include the silly headers we use at the top, that would be nice.
[23:53] But the important thing is just that you're providing a bulk export at all.
[23:54] It should be trivial to add the headers. Would the pagination be an issue?
[23:54] it shouldn't be a problem
[23:55] as for automatically uploading it to archive.org — it's actually even more trivial than I thought: you can just save it through the wayback machine. :-)
[23:55] See: https://web.archive.org/web/20161031235339/http://api.tzd.me/export.php?page=0
[23:55] I just went to https://web.archive.org/save/http://api.tzd.me/export.php?page=0 and that will grab a new copy
[23:56] You can just dump that into a crontab to do once a month. :-)
[23:56] hmm
[23:56] There's no centralized place to keep all related data right now?
[23:57] and if/when you get a 2nd page (which will probably be a while, since you're only about halfway through the first page, it looks like), just add that too.
[23:57] The best way to get it in a sensible centralized place is for us to scrape it (which I'll get to, eventually). But as for saving the data, what I suggested is clearly the easiest.
[23:59] If/when you do feel like taking it offline (or, more relevantly, when you decide to no longer accept new URLs), let us know, we'll make a final scrape, and then it'll be available when (eventually) we actually *use* our data. Until then, saving the export in the Wayback Machine will serve as a good non-local backup for you.
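A minimal sketch of the Wayback-save automation described at [23:55] and [23:56]: request https://web.archive.org/save/ followed by the export URL so the Wayback Machine grabs a fresh snapshot, and run that from a monthly cron job. The script path in the crontab comment and the page list are assumptions (only page 0 exists at the time of the log):

```python
#!/usr/bin/env python3
"""Sketch of the [23:55]-[23:56] approach: ask the Wayback Machine to save
a new copy of each export page."""
import urllib.request

PAGES = [
    'http://api.tzd.me/export.php?page=0',
    # add 'http://api.tzd.me/export.php?page=1' etc. if/when more pages appear
]

def save_to_wayback(url):
    # Requesting the /save/ endpoint triggers a new capture of `url`.
    save_url = 'https://web.archive.org/save/' + url
    with urllib.request.urlopen(save_url) as resp:
        print(resp.status, url)

if __name__ == '__main__':
    # Example crontab entry to run this once a month (path is assumed):
    #   0 3 1 * * /usr/bin/python3 /home/tyzoid/wayback_save.py
    for page in PAGES:
        save_to_wayback(page)
```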