#archiveteam-bs 2016-01-26,Tue


Time Nickname Message
00:00 🔗 godane is anyone having problems with news.softpedia.com
00:01 🔗 godane i can't view articles anymore after trying to grab them: http://news.softpedia.com/news/ubuntu-based-black-lab-linux-7-0-3-distro-arrives-with-updated-kernel-more-499410.shtml
00:01 🔗 HCross They might have IP banned you
00:02 🔗 HCross hmm, doesnt work over my RDP at OVH
00:02 🔗 godane ok then they're having problems maybe
00:02 🔗 godane ok
00:02 🔗 godane its funny cause i could grab the images without any problem
00:02 🔗 HCross Works fine through my home connection though
00:02 🔗 godane just the articles
00:03 🔗 HCross Is it a home IP you are using?
00:03 🔗 godane yes
00:50 🔗 dashcloud has quit IRC (Read error: Operation timed out)
00:54 🔗 dashcloud has joined #archiveteam-bs
01:14 🔗 godane i'm grabbing Christian History Magazine from https://www.christianhistoryinstitute.org
01:16 🔗 yipdw ersi: it actually is pretty fun(ny) when you get an Elastic Beanstalk environment into an unrecoverable state
01:17 🔗 yipdw like, what happens is (A) you add a status check, which puts the environment into transition; (B) environment fails status checks because it's misconfigured; (C) you can't adjust the configuration because environment is in transition
01:18 🔗 yipdw your only way to avoid that is to add the status check last and if you forget that you just have to make a new env and wait for the oh-well timeout to kick in, which is something like 30 minutes to whenever-the-fuck
01:19 🔗 yipdw you can't terminate the errant environment because the environment is in transition and the terminate command gets stuck in Amazon CloudFormation
01:19 🔗 yipdw tl;dr fuck the cloud
02:24 🔗 JesseW has joined #archiveteam-bs
02:25 🔗 joepie91 "my cloud is stuck, help"
03:06 🔗 dashcloud has quit IRC (Read error: Operation timed out)
03:09 🔗 dashcloud has joined #archiveteam-bs
04:26 🔗 JesseW has quit IRC (Leaving.)
05:17 🔗 yipdw has quit IRC (Read error: Operation timed out)
05:19 🔗 yipdw has joined #archiveteam-bs
05:23 🔗 JesseW has joined #archiveteam-bs
06:04 🔗 robink has quit IRC (Ping timeout: 190 seconds)
06:05 🔗 robink has joined #archiveteam-bs
06:43 🔗 Start_ has joined #archiveteam-bs
06:43 🔗 Start has quit IRC (Read error: Connection reset by peer)
07:19 🔗 JesseW 6 GB of census differences...
07:23 🔗 JesseW 83,416,395 additions/removals/changes (which took 76 seconds to calculate)
07:31 🔗 JesseW only 419,461 changes, of which only 15,605 are "interesting" (i.e. not changes to metadata)
07:36 🔗 JesseW and about 10,000 of those are changes to slightly more unusual metadata (_dc.xml, _scandata.xml, _bhlmets.xml)
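[Editor's sketch] The additions/removals/changes breakdown above could be computed roughly as follows. This is only an illustration, not JesseW's actual tooling: it assumes each census is a tab-separated listing of identifier, filename, md5 (the layout described later in this log), and the function names and the list of "metadata" suffixes (`_meta.xml`, `_files.xml`) are hypothetical choices made for the example.

```python
# Hypothetical suffixes treated as "uninteresting" metadata (an assumption).
METADATA_SUFFIXES = ("_meta.xml", "_files.xml")


def load_census(lines):
    """Build a {(identifier, filename): md5} map from tab-separated lines."""
    census = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split from both ends so a tab inside a filename does not break parsing.
        identifier, rest = line.split("\t", 1)
        filename, md5 = rest.rsplit("\t", 1)
        census[(identifier, filename)] = md5
    return census


def diff_census(old, new):
    """Return sorted lists of added, removed, and changed (identifier, filename) keys."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in old if k in new and old[k] != new[k])
    return added, removed, changed


def is_interesting(key):
    """A change is 'interesting' here if the file is not a known metadata file."""
    _identifier, filename = key
    return not filename.endswith(METADATA_SUFFIXES)
```

Holding both censuses in dictionaries is the simplest approach but needs enough memory for every row; for listings of this size, sorting both files and streaming through them in step would be the lighter alternative.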
07:53 🔗 yipdw oh niiiice
07:53 🔗 yipdw if you specify bcrypt-encrypted htpasswd auth for Radicale (a CalDAV server) and bcrypt isn't available, Radicale defaults to auth strategy "none"
07:54 🔗 yipdw unless you specify "htpasswd_encryption = bcrypt" and then it will abort on startup
07:56 🔗 yipdw JesseW: is the data published somewhere?
07:58 🔗 JesseW not yet; it's currently sitting locally on my computer
08:00 🔗 JesseW I do intend to ask Jake to upload it to a new IA item (I'd rather it was under his name than mine) -- but I haven't fully settled on the best form yet (and you're the first one to ask if it's published :-) )
08:00 🔗 JesseW Here's the 8 "interesting" xml files that changed:
08:00 🔗 JesseW http://archive.org/do/AlexandroxEDMPowerPodcast/AlexandroxEdmPowerPodcast.xml
08:00 🔗 JesseW http://archive.org/do/AnthologyOfVerklarung/RSSfeed.xml
08:00 🔗 JesseW http://archive.org/do/BackInAFlash_20150115/BackInAFlash.xml
08:00 🔗 JesseW http://archive.org/do/elpodcastdelbuho/elpodcastdelbuho_rss.xml
08:00 🔗 JesseW http://archive.org/do/thebeatvia/config.xml
08:00 🔗 JesseW http://archive.org/do/ThisWeeksWeirdNewsRssFeed/ThisweeksweirdnewsRss-Itunes.xml
08:01 🔗 JesseW http://archive.org/do/ThoughtsOnTheTable/feed.xml
08:01 🔗 JesseW http://archive.org/do/University_SDA_Orlando_Podcast/podcast_rss.xml
08:01 🔗 JesseW humph, typo
08:01 🔗 JesseW http://archive.org/download/AlexandroxEDMPowerPodcast/AlexandroxEdmPowerPodcast.xml
08:01 🔗 JesseW http://archive.org/download/AnthologyOfVerklarung/RSSfeed.xml
08:01 🔗 JesseW http://archive.org/download/BackInAFlash_20150115/BackInAFlash.xml
08:01 🔗 JesseW http://archive.org/download/elpodcastdelbuho/elpodcastdelbuho_rss.xml
08:01 🔗 JesseW http://archive.org/download/thebeatvia/config.xml
08:01 🔗 JesseW http://archive.org/download/ThisWeeksWeirdNewsRssFeed/ThisweeksweirdnewsRss-Itunes.xml
08:01 🔗 JesseW http://archive.org/download/ThoughtsOnTheTable/feed.xml
08:01 🔗 JesseW http://archive.org/download/University_SDA_Orlando_Podcast/podcast_rss.xml
08:02 🔗 JesseW here they are with working links
08:02 🔗 JesseW they all appear to be podcast feeds
08:02 🔗 JesseW so maybe not so interesting. :-/
08:08 🔗 vitzli has joined #archiveteam-bs
08:09 🔗 JesseW bizarrely, one of the files that changed is http://archive.org/download/wikimediadownloads/legal.html
08:10 🔗 JesseW which I'm not sure where it is linked from, but seems to be a copy of https://dumps.wikimedia.org/legal.html
08:13 🔗 godane !ao http://www.theblaze.com/stories/2016/01/25/texas-grand-jury-indicts-center-for-medical-progress-filmmakers-but-not-planned-parenthood/
08:13 🔗 godane i put in archivebot channel
08:13 🔗 godane *it
08:14 🔗 vitzli JesseW, is it just md5 only or md5-sha1-crc32?
08:14 🔗 JesseW sadly, the original census only has md5
08:15 🔗 JesseW IA provides md5/sha1/crc32 -- but (to be consistent with the previous census), I only grabbed md5
08:16 🔗 JesseW It'd probably be good to grab sha1 and crc32 going forward, I suppose. -- but that would increase the size quite a bit
08:16 🔗 JesseW vitzli:
08:18 🔗 vitzli I thought you grabbed sha1 too and I wanted to offer the storage for it if it is not needed
08:19 🔗 vitzli how much time and bandwidth does it take to do?
08:19 🔗 JesseW vitzli: it only took about a day to do the grab -- if you have spare storage lying around, it'd probably be neat to grab the *full* metadata and leave it somewhere
08:20 🔗 JesseW That would be *much* larger, though. There are a few identifiers where IA (for /reasons/) effectively put the whole content into the metadata. One of the more egregious ones (amusingly enough) is http://archive.org/details/nsa
08:21 🔗 JesseW But even if you stripped out those outliers, it would likely be at least a few hundred gigabytes, if not larger.
08:22 🔗 vitzli metadata as in _files.xml or something else?
08:23 🔗 JesseW metadata as in archive.org/metadata/blahblbhahblah
08:23 🔗 JesseW which is a generated combination of _meta.xml and _files.xml (IIRC)
08:29 🔗 JesseW another amusing change: http://archive.org/download/arcadeflow/28th.html -- this is (presumably the most recent) part of SketchCow's alternate interface for the Internet Arcade
08:37 🔗 vitzli I don't know if I could take all metadata, but I could grab and store sha1-md5-crc32-size data from IA
08:39 🔗 JesseW cool; it'd be great to have a 3rd person get it working. The two basic tools you need are iamine and jq.
08:40 🔗 JesseW I'm heading to sleep soon, but I'll be glad to walk you through it later (and feel free to mention me in the channel when I'm not here -- I'll read the logs)
08:41 🔗 JesseW Oh, you'll also need python3.3 or 3.4
08:41 🔗 JesseW and GNU parallel
08:41 🔗 vitzli I played with jq when I found out about the IA census, a little bit annoying filter language, but good otherwise
08:42 🔗 JesseW eh, it's so much nicer than trying to do JSON parsing in a non-CLI-oriented way
08:42 🔗 JesseW and it grows on you (or at least it did on me)
08:43 🔗 JesseW warn me (and Jake at IA) before you start actually downloading, as it does hit IA's servers noticeably.
08:44 🔗 vitzli I will not be doing it for 4 or 5 days
08:44 🔗 JesseW and please add any additional details you can think of to the Census page on the Archiveteam wiki -- the more the better.
08:44 🔗 vitzli I'll send the email/ irc message
08:44 🔗 JesseW I need to dump a bunch of my command line stuff on there
08:45 🔗 vitzli I extracted md5s from it half a year ago, it was about 6 or 7 GB in MD5SUMs format - "md5 *filename"
08:45 🔗 JesseW so far, I've generated 122G of in-progress files (a lot of them duplicate, and uncompressed, so don't worry that it'll get *that* big)
08:45 🔗 JesseW vitzli: nice
08:46 🔗 JesseW Hm, I should probably put it in md5sum format, actually.
08:46 🔗 vitzli nuked it two weeks ago, still have the data from IA census on my hdd
08:46 🔗 JesseW I used a 3 column tab-separated values.
08:46 🔗 JesseW identifier \t filename \t md5
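[Editor's sketch] Converting the three-column layout above into the md5sum-style "md5 *filename" lines vitzli mentions might look like the following. This is a minimal illustration under stated assumptions: the function name is hypothetical, and joining identifier and filename as `identifier/filename` is a guess at a sensible path, not something either speaker specified.

```python
def tsv_to_md5sum(lines):
    """Yield 'md5 *identifier/filename' lines from 'identifier\\tfilename\\tmd5' rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split from both ends so a tab inside a filename does not break parsing.
        identifier, rest = line.split("\t", 1)
        filename, md5 = rest.rsplit("\t", 1)
        # The '*' marks binary mode in md5sum's checksum-file format.
        yield "{} *{}/{}".format(md5, identifier, filename)
```

For example, `"\n".join(tsv_to_md5sum(open("census.tsv")))` would produce the converted listing, where `census.tsv` is a placeholder filename.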
08:48 🔗 JesseW my census didn't exclude the private files, so it has md5s for (some? most? all?) of the Wayback Machine data, too.
08:48 🔗 JesseW they aren't downloadable, but the reported md5s (and I think sha1s and crc32s) *are* available
08:49 🔗 JesseW and my census doesn't include any identifiers created after the original census -- Jake said he'll make a new identifier list and provide that, but I told him there was no hurry, as I was more interested in changes.
08:51 🔗 JesseW eh, I really should go to sleep. G'night!
08:52 🔗 vitzli good night
08:57 🔗 dashcloud has quit IRC (Read error: Operation timed out)
09:00 🔗 JesseW has quit IRC (Read error: Operation timed out)
09:02 🔗 dashcloud has joined #archiveteam-bs
09:26 🔗 SketchCow SO MUCH CD-ROM BACKING UP
09:58 🔗 Stiletto has quit IRC (Read error: Operation timed out)
10:02 🔗 ersi Backin' up backin up~ cause my daddy tought me goood
10:02 🔗 ersi taught, damn it
10:18 🔗 megaminxw has joined #archiveteam-bs
10:18 🔗 bzc6p has joined #archiveteam-bs
10:19 🔗 bzc6p megaminxw: When requests are made in the background by javascript means, the browser doesn't show it's loading.
10:20 🔗 megaminxw im just going to put this down to me not understanding javascript very well
10:22 🔗 megaminxw now im wondering if its possible to combine warcs because otherwise i wont have a clue what to do about this
10:26 🔗 bzc6p You can concatenate WARCs with tools like megawarc. But you can also store/upload them as separate files.
10:27 🔗 megaminxw well, alright
10:28 🔗 megaminxw im rather new to this whole thing (WHO WOULD HAVE GUESSED) so
10:35 🔗 bzc6p We're here to share experience.
10:35 🔗 megaminxw alright
11:12 🔗 dashcloud has quit IRC (Read error: Operation timed out)
11:16 🔗 dashcloud has joined #archiveteam-bs
11:28 🔗 bzc6p has left
11:40 🔗 dashcloud has quit IRC (Read error: Operation timed out)
11:43 🔗 dashcloud has joined #archiveteam-bs
12:57 🔗 megaminxw has quit IRC (Quit: Leaving.)
13:24 🔗 dashcloud has quit IRC (Read error: Operation timed out)
13:28 🔗 dashcloud has joined #archiveteam-bs
14:34 🔗 mr-b has quit IRC (Read error: Operation timed out)
14:41 🔗 mr-b has joined #archiveteam-bs
14:51 🔗 Stiletto has joined #archiveteam-bs
14:53 🔗 slyphic|a is now known as slyphic
15:19 🔗 dashcloud has quit IRC (Read error: Operation timed out)
15:23 🔗 dashcloud has joined #archiveteam-bs
16:48 🔗 vitzli has quit IRC (Leaving)
17:01 🔗 dashcloud has quit IRC (Read error: Operation timed out)
17:05 🔗 dashcloud has joined #archiveteam-bs
17:27 🔗 JesseW has joined #archiveteam-bs
18:01 🔗 dashcloud has quit IRC (Read error: Operation timed out)
18:02 🔗 dashcloud has joined #archiveteam-bs
18:02 🔗 JesseW has quit IRC (Leaving.)
18:14 🔗 SketchCow You will all be DELIGHTED to know my smashing into my CD-ROM inbox is going swimmingly.
18:15 🔗 SketchCow We're well past "send a hard drive to IA because it's just too much data."
18:16 🔗 phuzion SketchCow: Do you just sneakernet HDDs to IA when you fly there (what seems like) every week? Or do you fedex/ups them?
18:17 🔗 SketchCow I Fedex.
18:17 🔗 SketchCow Or I take it along.
18:17 🔗 phuzion Gotcha.
18:18 🔗 SketchCow When it's stuff like this.
18:18 🔗 SketchCow Anything over, say, 20gb of data, it becomes more useful to just hd it.
18:18 🔗 SimpBrain heh
18:18 🔗 SketchCow We're well past 200gb of CD-ROM/DVD-ROM images and scans.
18:18 🔗 SketchCow For this batch.
18:18 🔗 SketchCow This is going to be a doozy.
18:19 🔗 phuzion What's your upstream?
18:19 🔗 SketchCow Cable modem.
18:19 🔗 SketchCow That I use.
18:20 🔗 SketchCow I won't slow down xbox so I can droop up a bunch of ISOs
18:20 🔗 phuzion Gotcha.
18:20 🔗 SketchCow I do stuff on FOS all the time for this reason. But for this creation of digital materials here in my home, I am just creating them, boxing up the physicals, then heading off to the IA
18:28 🔗 SimpBrain o.O http://www.friendsreunited.co.uk/barack-obama-with-his-mum/People/b4952da8-3b65-4c0a-ae68-a166009d6b5d
18:29 🔗 SimpBrain there i thought fr was a uk based site!
18:29 🔗 SimpBrain for the userwalled http://www.assetstorage.co.uk/AssetStorageService.svc/GetImageFriendly/721510904/400/281/0/0/1/80/ResizeBestFit/0/FRU/649029A5DCE7F31D5C0FBDB1E2A4F1BD/barack-obama-with-his-mum.jpg
18:30 🔗 HCross Why does that remind me of Michael Jackson?
18:30 🔗 SimpBrain afro?
18:30 🔗 HCross Yeah
18:31 🔗 SimpBrain friendsreunited link grab is almost done
18:45 🔗 SilSte has joined #archiveteam-bs
18:45 🔗 Silvan has quit IRC (Read error: Connection reset by peer)
19:04 🔗 PurpleSym has quit IRC (Remote host closed the connection)
19:25 🔗 PurpleSym has joined #archiveteam-bs
19:37 🔗 joepie91 http://phasenoise.livejournal.com/1500.html?nojs=1
19:37 🔗 joepie91 very cool
20:37 🔗 megaminxw has joined #archiveteam-bs
20:37 🔗 megaminxw has left
21:15 🔗 JetBalsa has quit IRC (Read error: Operation timed out)
21:48 🔗 joepie91 "This computer will soon stop receiving Google Chrome updates because this Linux system will no longer be supported."
21:48 🔗 joepie91 grmbl grmbl grmbl
21:49 🔗 SimpBrain lol
21:52 🔗 phuzion joepie91: which distro?
22:20 🔗 ersi any, probably
22:49 🔗 joepie91 phuzion: openSUSE 13.1
22:50 🔗 phuzion Huh
23:43 🔗 BlueMaxim has joined #archiveteam-bs
