#archiveteam 2014-05-30,Fri


Time Nickname Message
00:57 🔗 waxpancak I wrote a little about Archive Team and the Upcoming.org preservation effort here, underneath the Wayne's World GIF. It went out to every Kickstarter backer.
00:57 🔗 waxpancak https://www.kickstarter.com/projects/waxpancake/the-return-of-upcomingorg/posts/859645
02:31 🔗 dashcloud so, I can't be the first person who's thought of this, but reading the ruling in the PersonalWeb patent case, I was wondering if there's a tool to extract all the case references, check if they are on recap, and if not, let you get them (provided you have an account)
04:15 🔗 Asparagir Any recommendations for a cloud server, preferably not too expensive, that can handle 2 TB of WARC data? Bonus points if it's in Europe.
04:17 🔗 Asparagir I'm planning on doing grabs of some large government historical records sites.
04:18 🔗 Asparagir Digital Ocean doesn't have SSD's that big.
04:19 🔗 DFJustin wpull lets you roll over to a new warc every n bytes
04:21 🔗 Asparagir Ooooh... *runs off to read documentation*
05:03 🔗 Asparagir Okay, have questions about wpull...
05:05 🔗 Asparagir If I use the options --warc-file MyFile --warc-max-size 1000000 then I will end up with MyFile and MyFile2 and MyFile3 (etc.) automatically sliced up roughly every 1 MB?
05:06 🔗 Asparagir But I will still have to handle uploading the chunked WARC's individually, by hand, so I don't run out of disk space.
05:06 🔗 Asparagir And deleting them after upload too.
05:09 🔗 Asparagir I guess that would work, but ideally they'd be recombined into a MegaWARC before being available for download on IA.
05:09 🔗 Asparagir And I don't know where or how that recombination would happen.
05:13 🔗 ivan` I can think of a few terrible ways to determine whether wpull is done with a file, including lsof to see if it still has it open
05:13 🔗 ivan` https://github.com/ArchiveTeam/archiveteam-megawarc-factory could be used to automatically upload them, but it assumes that if a file is in a certain directory, it's ready to be packed and uploaded
05:13 🔗 ivan` maybe wpull needs an option to move finished WARCs
05:24 🔗 yipdw hm
05:24 🔗 ivan` (dirty hack: make wpull mark them executable, change the uploader shell script)
05:25 🔗 yipdw that is a tricky problem that might be best solved by having wpull do a move
05:25 🔗 yipdw unless there's some magic UNIX utility that can tell you when a file is no longer open
05:25 🔗 yipdw that isn't lsof
05:25 🔗 yipdw or rather lsof | grep ...
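
A rough sketch of what "lsof | grep" does under the hood on Linux: walk the per-process fd directories under /proc and look for a descriptor that points at the WARC. The function name and the proc_root parameter are made up for illustration and are not part of wpull or any uploader script. It only sees processes the current user is allowed to inspect, and a file can still be reopened right after the check, so treat it as a heuristic.

```python
import os


def is_file_open(path, proc_root="/proc"):
    """Return True if any inspectable process has `path` open (Linux /proc layout)."""
    target = os.path.realpath(path)
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to look at it
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    return True
            except OSError:
                continue  # descriptor was closed while we were looking
    return False
```
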
05:26 🔗 yipdw for now though I wonder if it would be okay to just check in on the download every few hours
05:26 🔗 yipdw and just move the bits that are complete to the upload dir
05:26 🔗 yipdw that's up to Asparagir though
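
The "move the bits that are complete" step could be scripted roughly like this. It assumes the rolled-over parts are named with a zero-padded sequence number such as MyFile-00000.warc.gz (an assumption about wpull's naming, not confirmed in this log) and that only the highest-numbered part can still be open for writing. The function name and directory arguments are hypothetical.

```python
import os
import re
import shutil


def move_finished_warcs(work_dir, upload_dir, prefix="MyFile"):
    """Move every numbered WARC part except the newest into upload_dir."""
    # Match only numbered parts, so a file like MyFile-meta.warc.gz is left alone.
    name_re = re.compile(re.escape(prefix) + r"-(\d+)\.warc\.gz")
    parts = []
    for name in os.listdir(work_dir):
        match = name_re.fullmatch(name)
        if match:
            parts.append((int(match.group(1)), name))
    parts.sort()
    os.makedirs(upload_dir, exist_ok=True)
    moved = []
    for _, name in parts[:-1]:  # the last part may still be in use
        shutil.move(os.path.join(work_dir, name), os.path.join(upload_dir, name))
        moved.append(name)
    return moved
```

Run from cron every few hours, this would feed a directory-watching uploader without anyone having to check in by hand.
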
05:28 🔗 Asparagir Because I'm busy babysitting actual babies. :-)
05:28 🔗 Asparagir Unfortunately, I cannot babysit the server consistently.
05:29 🔗 Asparagir Need a Mother's Little Helper script to keep it moving smoothly.
05:29 🔗 yipdw ah
05:30 🔗 ivan` note IA may go down and you'll run out of disk space because there's no backpressure to wpull
05:31 🔗 ivan` or, if you get a server in Europe, the upload to IA could be slower than the download from the site
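
Since nothing tells the crawler to slow down when uploads stall, a watchdog could at least notice the disk filling up. A minimal sketch, with made-up function names and an arbitrary 10% threshold; what to do when it fires (stop the crawl, alert someone) is left open.

```python
import shutil


def free_fraction(path):
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def should_pause(path, min_free_fraction=0.10):
    """True when free space has dropped below the chosen threshold."""
    return free_fraction(path) < min_free_fraction
```
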
05:32 🔗 yipdw :P
05:32 🔗 Asparagir Arrrrrrgh.
05:32 🔗 yipdw I'm adding --move-warc-to to wpull
05:32 🔗 ivan` maybe yipdw wants to manage this crawl on my server? I have 1TB available
05:33 🔗 Asparagir Thanks!
05:33 🔗 ivan` cool
05:33 🔗 yipdw or at least looking into it
05:34 🔗 Asparagir FYI, this site is the Latvian vital records archives. And if this works, then next up can be the various Polish State Archives, which are starting to put scanned images online too.
05:34 🔗 Asparagir Gobs and gobs of wonderful vital records data.
05:35 🔗 DFJustin ooo
05:43 🔗 Asparagir I know, right? High quality scans of century-old birth, marriage, and death records from all over Europe. Awesome.
05:43 🔗 yipdw I need chfoo to also poke at that I guess
05:43 🔗 yipdw and yeah, tests etc
05:43 🔗 yipdw ivan`: https://gist.github.com/yipdw/1bdca9cc4235416a3786
05:43 🔗 yipdw maybe?
05:43 🔗 amerrykan wow
05:45 🔗 yipdw hmm nope
05:46 🔗 yipdw oh yeah, WARCRecorderParams
05:46 🔗 SketchCow Goood morning.
05:46 🔗 SketchCow I will like being back in my own timezone.
05:47 🔗 yipdw ah, there we go
05:49 🔗 yipdw Asparagir: so, here's a patch for wpull that adds a --move-warc-to DIR option, which tells wpull to move completed WARCs to a given directory: https://gist.github.com/yipdw/1bdca9cc4235416a3786
05:50 🔗 yipdw Asparagir: I still need to clean this up and submit to chfoo for inclusion, but if you'd like, feel free to try that patch and/or have me or someone else get it working on your server
05:50 🔗 yipdw (I'm trying it out now; it seems to work)
05:51 🔗 yipdw actually, I'm not sure if the -meta and CDXs are moved
05:51 🔗 ivan` might be worth running it under archivebot so that one can add ignore patterns?
05:51 🔗 yipdw probably not
05:52 🔗 yipdw that was the wrong button
05:52 🔗 yipdw ivan`: the patch or Asparagir's jobs?
05:53 🔗 ivan` Asparagir's jobs
05:53 🔗 yipdw oh
05:53 🔗 yipdw possibly, up to her I guess
05:53 🔗 Asparagir I'm okay with either, if you want to sacrifice ArchiveBot space for something this big.
05:54 🔗 Asparagir But the Latvian archives requires a (free) login/pw to get access to the images, so...
05:54 🔗 yipdw oh
05:54 🔗 Asparagir I was going to handle that with wget or wpull.
05:55 🔗 Asparagir Um, unofficially.
05:55 🔗 Asparagir This is the site: http://www.lvva-raduraksti.lv/en.html
05:58 🔗 yipdw oops, there's an off-by-one in my patch
06:02 🔗 Asparagir Well, thank you for putting together the world's fastest feature request!
06:04 🔗 yipdw np
06:04 🔗 yipdw I'll toss the patch as-is at chfoo and get around to the tests later
06:05 🔗 yipdw ...assuming it actually works this time
06:07 🔗 Asparagir I have to log off for the night in a minute. Thanks again.
06:09 🔗 yipdw sure thing
06:09 🔗 yipdw feel free to check back in whenever
06:11 🔗 Asparagir I will
09:29 🔗 SketchCow https://twitter.com/GotlandGAME/status/472278925201379328
19:51 🔗 Stiletto A tech. magazine target: http://www.os2museum.com/wp/?p=2478
19:51 🔗 Stiletto :)
22:12 🔗 godane i may reupload this: https://archive.org/details/5234_labeled_infographics
22:12 🔗 godane i will upload it as a zip file
22:13 🔗 godane cause the way that item is, it's just a dump of 5234 image files
22:14 🔗 godane for some reason it's only 5233 images
22:14 🔗 godane in my collection
22:14 🔗 godane that has 5234 in the title
23:15 🔗 godane anyways i'm at 13k in my godaneinbox
