[00:57] I wrote a little about Archive Team and the Upcoming.org preservation effort here, underneath the Wayne's World GIF. It went out to every Kickstarter backer.
[00:57] https://www.kickstarter.com/projects/waxpancake/the-return-of-upcomingorg/posts/859645
[02:31] so, I can't be the first person who's thought of this, but reading the ruling in the PersonalWeb patent case, I was wondering if there's a tool to extract all the case references, check if they are on RECAP, and if not, let you get them (provided you have an account)
[04:15] Any recommendations for a cloud server, preferably not too expensive, that can handle 2 TB of WARC data? Bonus points if it's in Europe.
[04:17] I'm planning on doing grabs of some large government historical records sites.
[04:18] Digital Ocean doesn't have SSDs that big.
[04:19] wpull lets you roll over to a new WARC every n bytes
[04:21] Ooooh... *runs off to read documentation*
[05:03] Okay, have questions about wpull...
[05:05] If I use the options --warc-file MyFile --warc-max-size 1000000, then I will end up with MyFile and MyFile2 and MyFile3 (etc.), automatically sliced up roughly every 1 MB?
[05:06] But I will still have to handle uploading the chunked WARCs individually, by hand, so I don't run out of disk space.
[05:06] And deleting them after upload too.
[05:09] I guess that would work, but ideally they'd be recombined into a MegaWARC before being available for download on IA.
[05:09] And I don't know where or how that recombination would happen.
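For reference, the rollover invocation being discussed would look something like the sketch below. The URL and filename prefix are placeholders, and note that --warc-max-size takes a byte count, so 1000000 is roughly 1 MB (a real crawl would use something much larger); wpull numbers the sliced output automatically, though the exact segment naming may differ from the MyFile2/MyFile3 guess above depending on version.

```shell
# Hypothetical recursive crawl that rolls over to a new WARC segment
# roughly every 1 GB. Site URL and "MyFile" prefix are placeholders.
wpull http://www.example.org/ \
    --recursive \
    --warc-file MyFile \
    --warc-max-size 1000000000
```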
[05:13] I can think of a few terrible ways to determine whether wpull is done with a file, including lsof to see if it still has it open
[05:13] https://github.com/ArchiveTeam/archiveteam-megawarc-factory could be used to automatically upload them, but it assumes that if a file is in a certain directory, it's ready to be packed and uploaded
[05:13] maybe wpull needs an option to move finished WARCs
[05:24] hm
[05:24] (dirty hack: make wpull mark them executable, change the uploader shell script)
[05:25] that is a tricky problem that might be best solved by having wpull do a move
[05:25] unless there's some magic UNIX utility that can tell you when a file is no longer open
[05:25] that isn't lsof
[05:25] or rather lsof | grep ...
[05:26] for now though I wonder if it would be okay to just check in on the download every few hours
[05:26] and just move the bits that are complete to the upload dir
[05:26] that's up to Asparagir though
[05:28] Because I'm busy babysitting actual babies. :-)
[05:28] Unfortunately, I cannot babysit the server consistently.
[05:29] Need a Mother's Little Helper script to keep it moving smoothly.
[05:29] ah
[05:30] note IA may go down and you'll run out of disk space because there's no backpressure to wpull
[05:31] or, if you get a server in Europe, the upload to IA could be slower than the download from the site
[05:32] :P
[05:32] Arrrrrrgh.
[05:32] I'm adding --move-warc-to to wpull
[05:32] maybe yipdw wants to manage this crawl on my server? I have 1 TB available
[05:33] Thanks!
[05:33] cool
[05:33] or at least looking into it
[05:34] FYI, this site is the Latvian vital records archives. And if this works, then next up can be the various Polish State Archives, which are starting to put scanned images online too.
[05:34] Gobs and gobs of wonderful vital records data.
[05:35] ooo
[05:43] I know, right? High-quality scans of century-old birth, marriage, and death records from all over Europe. Awesome.
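A minimal sketch of that "check in every few hours and move the complete bits" helper. It assumes the newest WARC segment in the crawl directory is the one wpull still has open, and moves everything older to an upload directory. That is only a heuristic (lsof, or the --move-warc-to option being added, would be more reliable), and filenames are assumed not to contain whitespace. Directory paths are illustrative.

```shell
# Move all WARC segments except the newest one (assumed still being
# written by wpull) from a crawl directory to an upload directory.
move_finished_warcs() {
    crawl_dir=$1
    upload_dir=$2
    mkdir -p "$upload_dir"
    # ls -1t lists newest first; tail -n +2 skips the newest segment.
    for f in $(ls -1t "$crawl_dir"/*.warc.gz 2>/dev/null | tail -n +2); do
        mv "$f" "$upload_dir"/
    done
}
```

Run from cron every few hours; a real version would also upload the contents of the upload directory to IA and delete them afterwards, and should pause if uploads fall behind, since (as noted above) there is no backpressure to wpull.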
[05:43] I need chfoo to also poke at that I guess
[05:43] and yeah, tests etc
[05:43] ivan`: https://gist.github.com/yipdw/1bdca9cc4235416a3786
[05:43] maybe?
[05:43] wow
[05:45] hmm nope
[05:46] oh yeah, WARCRecorderParams
[05:46] Goood morning.
[05:46] I will like being back in my own timezone.
[05:47] ah, there we go
[05:49] Asparagir: so, here's a patch for wpull that adds a --move-warc-to DIR option, which tells wpull to move completed WARCs to a given directory: https://gist.github.com/yipdw/1bdca9cc4235416a3786
[05:50] Asparagir: I still need to clean this up and submit it to chfoo for inclusion, but if you'd like, feel free to try that patch and/or have me or someone else get it working on your server
[05:50] (I'm trying it out now; it seems to work)
[05:51] actually, I'm not sure if the -meta and CDXs are moved
[05:51] might be worth running it under ArchiveBot so that one can add ignore patterns?
[05:51] probably not
[05:52] that was the wrong button
[05:52] ivan`: the patch or Asparagir's jobs?
[05:53] Asparagir's jobs
[05:53] oh
[05:53] possibly, up to her I guess
[05:53] I'm okay with either, if you want to sacrifice ArchiveBot space for something this big.
[05:54] But the Latvian archives require a (free) login/password to get access to the images, so...
[05:54] oh
[05:54] I was going to handle that with wget or wpull.
[05:55] Um, unofficially.
[05:55] This is the site: http://www.lvva-raduraksti.lv/en.html
[05:58] oops, there's an off-by-one in my patch
[06:02] Well, thank you for putting together the world's fastest feature request!
[06:04] np
[06:04] I'll toss the patch as-is at chfoo and get around to the tests later
[06:05] ...assuming it actually works this time
[06:07] I have to log off for the night in a minute. Thanks again.
[06:09] sure thing
[06:09] feel free to check back in whenever
[06:11] I will
[09:29] https://twitter.com/GotlandGAME/status/472278925201379328
[19:51] A tech magazine target: http://www.os2museum.com/wp/?p=2478
[19:51] :)
[22:12] i may reupload this: https://archive.org/details/5234_labeled_infographics
[22:12] i will upload it as a zip file
[22:13] because the way that item is, it's just a dump of 5234 image files
[22:14] for some reason it's only 5233 images
[22:14] in my collection
[22:14] that has 5234 in the title
[23:15] anyways i'm at 13k in my godaneinbox
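One way to do the zip repack before reuploading, sketched below using Python's stdlib zipfile command-line interface so no extra tools are needed; the directory name and output filename here are illustrative guesses, not the actual item layout.

```shell
# Pack a directory of images into a single zip for reupload.
# $1: source directory, $2: output zip filename.
pack_item() {
    python3 -m zipfile -c "$2" "$1"
}
```

The resulting zip could then be pushed to the Internet Archive, e.g. with the internetarchive CLI's `ia upload <identifier> <zipfile>`, assuming `ia` is installed and configured.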