[00:00] is anyone having problems with news.softpedia.com
[00:01] i can view articles anymore after trying to grab them: http://news.softpedia.com/news/ubuntu-based-black-lab-linux-7-0-3-distro-arrives-with-updated-kernel-more-499410.shtml
[00:01] They might have IP banned you
[00:02] hmm, doesnt work over my RDP at OVH
[00:02] ok then there having problems maybe
[00:02] ok
[00:02] its funny cause i could grab the images without any problem
[00:02] Works fine though my home connection though
[00:02] just the articles
[00:03] Is it a home IP you are using?
[00:03] yes
[00:50] *** dashcloud has quit IRC (Read error: Operation timed out)
[00:54] *** dashcloud has joined #archiveteam-bs
[01:14] i'm grabbing Christian History Magazine from https://www.christianhistoryinstitute.org
[01:16] ersi: it actually is pretty fun(ny) when you get an Elastic Beanstalk environment into an unrecoverable stat
[01:16] e
[01:17] like, what happens is (A) you add a status check, which puts the environment into transition; (B) environment fails status checks because it's misconfigured; (C) you can't adjust the configuration because environment is in transition
[01:18] your only way to avoid that is to add the status check last and if you forget that you just have to make a new env and wait for the oh-well timeout to kick in, which is something like 30 minutes to whenever-the-fuck
[01:19] you can't terminate the errant environment because the environment is in transition and the terminate command gets stuck in Amazon CloudFormation
[01:19] tl;dr fuck the cloud
[02:24] *** JesseW has joined #archiveteam-bs
[02:25] "my cloud is stuck, help"
[03:06] *** dashcloud has quit IRC (Read error: Operation timed out)
[03:09] *** dashcloud has joined #archiveteam-bs
[04:26] *** JesseW has quit IRC (Leaving.)
[05:17] *** yipdw has quit IRC (Read error: Operation timed out)
[05:19] *** yipdw has joined #archiveteam-bs
[05:23] *** JesseW has joined #archiveteam-bs
[06:04] *** robink has quit IRC (Ping timeout: 190 seconds)
[06:05] *** robink has joined #archiveteam-bs
[06:43] *** Start_ has joined #archiveteam-bs
[06:43] *** Start has quit IRC (Read error: Connection reset by peer)
[07:19] 6 GB of census differences...
[07:23] 83,416,395 additions/removals/changes (which took 76 seconds to calculate)
[07:31] only 419,461 changes, of which only 15,605 are "interesting" (i.e. not changes to metadata)
[07:36] and about 10,000 of those are changes to slightly more unusual metadata (_dc.xml, _scandata.xml, _bhlmets.xml)
[07:53] oh niiiice
[07:53] if you specify bcrypt-encrypted htpasswd auth for Radicale (a CalDAV server) and bcrypt isn't available, Radicale defaults to auth strategy "none"
[07:54] unless you specify "htpasswd_encryption = bcrypt" and then it will abort on startup
[07:56] JesseW: is the data published somewhere?
[07:58] not yet; it's currently sitting locally on my computer
[08:00] I do intend to ask Jake to upload it to a new IA item (I'd rather it was under his name than mine) -- but I haven't fully settled on the best form yet (and you're the first one to ask if it's published :-) )
[08:00] Here's the 8 "interesting" xml files that changed:
[08:00] http://archive.org/do/AlexandroxEDMPowerPodcast/AlexandroxEdmPowerPodcast.xml
[08:00] http://archive.org/do/AnthologyOfVerklarung/RSSfeed.xml
[08:00] http://archive.org/do/BackInAFlash_20150115/BackInAFlash.xml
[08:00] http://archive.org/do/elpodcastdelbuho/elpodcastdelbuho_rss.xml
[08:00] http://archive.org/do/thebeatvia/config.xml
[08:00] http://archive.org/do/ThisWeeksWeirdNewsRssFeed/ThisweeksweirdnewsRss-Itunes.xml
[08:01] http://archive.org/do/ThoughtsOnTheTable/feed.xml
[08:01] http://archive.org/do/University_SDA_Orlando_Podcast/podcast_rss.xml
[08:01] humph, typo
[08:01] http://archive.org/download/AlexandroxEDMPowerPodcast/AlexandroxEdmPowerPodcast.xml
[08:01] http://archive.org/download/AnthologyOfVerklarung/RSSfeed.xml
[08:01] http://archive.org/download/BackInAFlash_20150115/BackInAFlash.xml
[08:01] http://archive.org/download/elpodcastdelbuho/elpodcastdelbuho_rss.xml
[08:01] http://archive.org/download/thebeatvia/config.xml
[08:01] http://archive.org/download/ThisWeeksWeirdNewsRssFeed/ThisweeksweirdnewsRss-Itunes.xml
[08:01] http://archive.org/download/ThoughtsOnTheTable/feed.xml
[08:01] http://archive.org/download/University_SDA_Orlando_Podcast/podcast_rss.xml
[08:02] here they are with working links
[08:02] they all appear to be podcast feeds
[08:02] so maybe not so interesting. :-/
[08:08] *** vitzli has joined #archiveteam-bs
[08:09] bizarely, one of the files that changed is http://archive.org/download/wikimediadownloads/legal.html
[08:10] which I'm not sure where it is linked from, but seems to be a copy of https://dumps.wikimedia.org/legal.html
[08:13] !ao http://www.theblaze.com/stories/2016/01/25/texas-grand-jury-indicts-center-for-medical-progress-filmmakers-but-not-planned-parenthood/
[08:13] i put in archivebot channel
[08:13] *it
[08:14] JesseW, is it just md5 only or md5-sha1-crc32?
[08:14] sadly, the original census only has md5
[08:15] IA provides md5/sha1/crc32 -- but (to be consistent with the previous census), I only grabbed md5
[08:16] It'd probably be good to grab sha1 and crc32 going forward, I suppose. -- but that would increase the size quite a bit
[08:16] vitzli:
[08:18] I thought you grabbed sha1 too and I wanted to offer the storage for it if it is not needed
[08:19] how much time and bandwidth it takes to do?
[08:19] vitzli: it only took about a day to do the grab -- if you have spare storage lying around, it'd probably be neat to grab the *full* metadata and leave it somewhere
[08:20] That would be *much* larger, though. There are a few identifiers where IA (for /reasons/) effectively put the whole content into the metadata. One of the more eggregious ones (amusingly enough) is http://archive.org/details/nsa
[08:21] But even if you stripped out those outliers, it would likely be at least a few hundred gigabytes, if not larger.
[08:22] metadata as in _files.xml or something else?
[08:23] metadata as in archive.org/metadata/blahblbhahblah
[08:23] which is a generated combination of _meta.xml and _files.xml (IIRC)
[08:29] another amusing change: http://archive.org/download/arcadeflow/28th.html -- this is (presumably the most recent) part of SketchCow's alternate interface for the Internet Arcade
[08:37] I don't know if I could take all metadata, but I could grab and store sha1-md5-crc32-size data from IA
[08:39] cool; it'd be great to have a 3rd person get it working. The two basic tools you need are iamine and jq.
[08:40] I'm heading to sleep soon, but I'll be glad to walk you through it later (and feel free to mention me in the channel when I'm not here -- I'll read the logs)
[08:41] Oh, you'll also need python3.3 or 3.4
[08:41] and GNU parallel
[08:41] I played with jq when found out about IA census, little bit annoying filter language, but good otherwise
[08:42] eh, it's so much nicer than trying to do JSON parsing in a non-CLI-oriented way
[08:42] and it grows on you (or at least it did on me)
[08:43] warn me (and Jake at IA) before you start actually downloading, as it does hit IA's servers noticeably.
[08:44] I will not be doing it in 4 or 5 days
[08:44] and please add any additional details you can think of to the Census page on the Archiveteam wiki -- the more the better.
[08:44] I'll send the email/ irc message
[08:44] I need to dump a bunch of my command line stuff on there
[08:45] I extracted md5s from it half a year ago, it was about 6 or 7 GB in MD5SUMs format - "md5 *filename"
[08:45] so far, I've generated 122G of in-progress files (a lot of them duplicate, and uncompressed, so don't worry that it'll get *that* big)
[08:45] vitzli: nice
[08:46] Hm, I should probably put it in md5sum format, actually.
[08:46] nuked it two weeks ago, still have the data from IA census on my hdd
[08:46] I used a 3 column tab-separated values.
[08:46] identifier \t filename \t md5
[08:48] my census didn't exclude the private files, so it has md5s for (some? most? all?) of the Wayback Machine data, too.
[08:48] they aren't downloadable, but the reported md5s (and I think sha1s and crc32s) *are* available
[08:49] and my census doesn't include any identifiers created after the original census -- Jake said he'll make a new identifier list and provide that, but I told him there was no hurry, as I was more interested in changes.
[08:51] eh, I really should go to sleep. G'night!
[08:52] good night
[08:57] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:00] *** JesseW has quit IRC (Read error: Operation timed out)
[09:02] *** dashcloud has joined #archiveteam-bs
[09:26] SO MUCH CD-ROM BACKING UP
[09:58] *** Stiletto has quit IRC (Read error: Operation timed out)
[10:02] Backin' up backin up~ cause my daddy tought me goood
[10:02] taught, damn it
[10:18] *** megaminxw has joined #archiveteam-bs
[10:18] *** bzc6p has joined #archiveteam-bs
[10:19] megaminxw: When requests are made in the background by javascript means, the browser doesn't show it's loading.
[10:20] im just going to put this down to me not understanding javascript very well
[10:22] now im wondering if its possible to combine warcs because otherwise i wont have a clue what to do about this
[10:26] You can concatenate WARCs with tools like megawarc. But you can also store/upload them as separate files.
[10:27] well, alright
[10:28] im rather new to this whole thing (WHO WOULD HAVE GUESSED) so
[10:35] We're here to share experience.
[10:35] alright
[11:12] *** dashcloud has quit IRC (Read error: Operation timed out)
[11:16] *** dashcloud has joined #archiveteam-bs
[11:28] *** bzc6p has left
[11:40] *** dashcloud has quit IRC (Read error: Operation timed out)
[11:43] *** dashcloud has joined #archiveteam-bs
[12:57] *** megaminxw has quit IRC (Quit: Leaving.)
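(Note on the census formats mentioned at 08:45-08:46: the three-column tab-separated rows, identifier / filename / md5, can be rewritten as md5sum-style lines of the form "md5 *filename". Below is a minimal Python sketch of that conversion. It is not the tooling anyone in the channel actually used. The function name tsv_to_md5sum and the choice to write each path as identifier/filename are assumptions made for illustration.)

```python
def tsv_to_md5sum(lines):
    """Turn 'identifier<TAB>filename<TAB>md5' rows into 'md5 *identifier/filename' lines."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # First field is the identifier, last field is the md5;
        # anything in between is treated as the filename.
        identifier, rest = line.split("\t", 1)
        filename, md5 = rest.rsplit("\t", 1)
        # Assumption: the path is written as identifier/filename.
        # The '*' before the path is md5sum's binary-mode marker.
        out.append(f"{md5} *{identifier}/{filename}")
    return out
```

(Applied to a file opened for reading, for example tsv_to_md5sum(open("census.tsv")), where census.tsv is a hypothetical file name, it returns the converted lines as a list of strings ready to be written out one per line.)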
[13:24] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:28] *** dashcloud has joined #archiveteam-bs
[14:34] *** mr-b has quit IRC (Read error: Operation timed out)
[14:41] *** mr-b has joined #archiveteam-bs
[14:51] *** Stiletto has joined #archiveteam-bs
[14:53] *** slyphic|a is now known as slyphic
[15:19] *** dashcloud has quit IRC (Read error: Operation timed out)
[15:23] *** dashcloud has joined #archiveteam-bs
[16:48] *** vitzli has quit IRC (Leaving)
[17:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:05] *** dashcloud has joined #archiveteam-bs
[17:27] *** JesseW has joined #archiveteam-bs
[18:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[18:02] *** dashcloud has joined #archiveteam-bs
[18:02] *** JesseW has quit IRC (Leaving.)
[18:14] You will all be DELIGHTED to know my smashing into my CD-ROM inbox is going swimmingly.
[18:15] We're well past "send a hard drive to IA because it's just too much data."
[18:16] SketchCow: Do you just sneakernet HDDs to IA when you fly there (what seems like) every week? Or do you fedex/ups them?
[18:17] I Fedex.
[18:17] Or I take it along.
[18:17] Gotcha.
[18:18] When it's stuff like this.
[18:18] Anything over, say, 20gb of data, it becomes more useful to just hd it.
[18:18] heh
[18:18] We're well past 200gb of CD-ROM/DVD-ROM images and scans.
[18:18] For this batch.
[18:18] This is going to be a doozy.
[18:19] What's your upstream?
[18:19] Cable modem.
[18:19] That I use.
[18:20] I won't slow down xbox so I can droop up a bunch of ISOs
[18:20] Gotcha.
[18:20] I do stuff on FOS all the time for this reason. But for this creation of digital materials here in my home, I am just creating them, boxing up the physicals, then heading off to the IA
[18:28] o.O http://www.friendsreunited.co.uk/barack-obama-with-his-mum/People/b4952da8-3b65-4c0a-ae68-a166009d6b5d
[18:29] there i thought fr was a uk based site!
[18:29] for the userwalled http://www.assetstorage.co.uk/AssetStorageService.svc/GetImageFriendly/721510904/400/281/0/0/1/80/ResizeBestFit/0/FRU/649029A5DCE7F31D5C0FBDB1E2A4F1BD/barack-obama-with-his-mum.jpg
[18:30] Why does that remind me of Michael Jackson?
[18:30] afro?
[18:30] Yeah
[18:31] friendsreunited link grab is almost done
[18:45] *** SilSte has joined #archiveteam-bs
[18:45] *** Silvan has quit IRC (Read error: Connection reset by peer)
[19:04] *** PurpleSym has quit IRC (Remote host closed the connection)
[19:25] *** PurpleSym has joined #archiveteam-bs
[19:37] http://phasenoise.livejournal.com/1500.html?nojs=1
[19:37] very cool
[20:37] *** megaminxw has joined #archiveteam-bs
[20:37] *** megaminxw has left
[21:15] *** JetBalsa has quit IRC (Read error: Operation timed out)
[21:48] "This computer will soon stop receiving Google Chrome updates because this Linux system will no longer be supported."
[21:48] grmbl grmbl grmbl
[21:49] lol
[21:52] joepie91: which distro?
[22:20] any, probably
[22:49] phuzion: openSUSE 13.1
[22:50] Huh
[23:43] *** BlueMaxim has joined #archiveteam-bs