Time |
Nickname |
Message |
00:00
🔗
|
godane |
is anyone having problems with news.softpedia.com |
00:01
🔗
|
godane |
i can't view articles anymore after trying to grab them: http://news.softpedia.com/news/ubuntu-based-black-lab-linux-7-0-3-distro-arrives-with-updated-kernel-more-499410.shtml |
00:01
🔗
|
HCross |
They might have IP banned you |
00:02
🔗
|
HCross |
hmm, doesnt work over my RDP at OVH |
00:02
🔗
|
godane |
ok then they're having problems maybe |
00:02
🔗
|
godane |
ok |
00:02
🔗
|
godane |
its funny cause i could grab the images without any problem |
00:02
🔗
|
HCross |
Works fine through my home connection though |
00:02
🔗
|
godane |
just the articles |
00:03
🔗
|
HCross |
Is it a home IP you are using? |
00:03
🔗
|
godane |
yes |
00:50
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
00:54
🔗
|
|
dashcloud has joined #archiveteam-bs |
01:14
🔗
|
godane |
i'm grabbing Christian History Magazine from https://www.christianhistoryinstitute.org |
01:16
🔗
|
yipdw |
ersi: it actually is pretty fun(ny) when you get an Elastic Beanstalk environment into an unrecoverable stat |
01:16
🔗
|
yipdw |
e |
01:17
🔗
|
yipdw |
like, what happens is (A) you add a status check, which puts the environment into transition; (B) environment fails status checks because it's misconfigured; (C) you can't adjust the configuration because environment is in transition |
01:18
🔗
|
yipdw |
your only way to avoid that is to add the status check last and if you forget that you just have to make a new env and wait for the oh-well timeout to kick in, which is something like 30 minutes to whenever-the-fuck |
01:19
🔗
|
yipdw |
you can't terminate the errant environment because the environment is in transition and the terminate command gets stuck in Amazon CloudFormation |
01:19
🔗
|
yipdw |
tl;dr fuck the cloud |
02:24
🔗
|
|
JesseW has joined #archiveteam-bs |
02:25
🔗
|
joepie91 |
"my cloud is stuck, help" |
03:06
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
03:09
🔗
|
|
dashcloud has joined #archiveteam-bs |
04:26
🔗
|
|
JesseW has quit IRC (Leaving.) |
05:17
🔗
|
|
yipdw has quit IRC (Read error: Operation timed out) |
05:19
🔗
|
|
yipdw has joined #archiveteam-bs |
05:23
🔗
|
|
JesseW has joined #archiveteam-bs |
06:04
🔗
|
|
robink has quit IRC (Ping timeout: 190 seconds) |
06:05
🔗
|
|
robink has joined #archiveteam-bs |
06:43
🔗
|
|
Start_ has joined #archiveteam-bs |
06:43
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
07:19
🔗
|
JesseW |
6 GB of census differences... |
07:23
🔗
|
JesseW |
83,416,395 additions/removals/changes (which took 76 seconds to calculate) |
07:31
🔗
|
JesseW |
only 419,461 changes, of which only 15,605 are "interesting" (i.e. not changes to metadata) |
07:36
🔗
|
JesseW |
and about 10,000 of those are changes to slightly more unusual metadata (_dc.xml, _scandata.xml, _bhlmets.xml) |
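A rough sketch of how additions/removals/changes between two censuses could be counted, assuming each census is tab-separated text of identifier, filename, md5 (the layout JesseW describes later in this log). The function names and the `METADATA_SUFFIXES` list are hypothetical, and loading everything into a dict only suits small samples, not 83 million rows.

```python
# Assumption: suffixes treated as "metadata" (not "interesting"); illustrative only.
METADATA_SUFFIXES = ("_meta.xml", "_files.xml", "_dc.xml", "_scandata.xml", "_bhlmets.xml")


def load_census(tsv_text):
    """Map (identifier, filename) -> md5 from 3-column tab-separated text."""
    census = {}
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        identifier, filename, md5 = fields
        census[(identifier, filename)] = md5
    return census


def diff_census(old, new):
    """Return (added, removed, changed) sets of (identifier, filename) keys."""
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    changed = {key for key in old.keys() & new.keys() if old[key] != new[key]}
    return added, removed, changed


def is_interesting(key):
    """True when the changed file is not one of the assumed metadata files."""
    _identifier, filename = key
    return not filename.endswith(METADATA_SUFFIXES)
```

For a full-size census, sorting both files and streaming through them in step would avoid holding everything in memory.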
07:53
🔗
|
yipdw |
oh niiiice |
07:53
🔗
|
yipdw |
if you specify bcrypt-encrypted htpasswd auth for Radicale (a CalDAV server) and bcrypt isn't available, Radicale defaults to auth strategy "none" |
07:54
🔗
|
yipdw |
unless you specify "htpasswd_encryption = bcrypt" and then it will abort on startup |
07:56
🔗
|
yipdw |
JesseW: is the data published somewhere? |
07:58
🔗
|
JesseW |
not yet; it's currently sitting locally on my computer |
08:00
🔗
|
JesseW |
I do intend to ask Jake to upload it to a new IA item (I'd rather it was under his name than mine) -- but I haven't fully settled on the best form yet (and you're the first one to ask if it's published :-) ) |
08:00
🔗
|
JesseW |
Here's the 8 "interesting" xml files that changed: |
08:00
🔗
|
JesseW |
http://archive.org/do/AlexandroxEDMPowerPodcast/AlexandroxEdmPowerPodcast.xml |
08:00
🔗
|
JesseW |
http://archive.org/do/AnthologyOfVerklarung/RSSfeed.xml |
08:00
🔗
|
JesseW |
http://archive.org/do/BackInAFlash_20150115/BackInAFlash.xml |
08:00
🔗
|
JesseW |
http://archive.org/do/elpodcastdelbuho/elpodcastdelbuho_rss.xml |
08:00
🔗
|
JesseW |
http://archive.org/do/thebeatvia/config.xml |
08:00
🔗
|
JesseW |
http://archive.org/do/ThisWeeksWeirdNewsRssFeed/ThisweeksweirdnewsRss-Itunes.xml |
08:01
🔗
|
JesseW |
http://archive.org/do/ThoughtsOnTheTable/feed.xml |
08:01
🔗
|
JesseW |
http://archive.org/do/University_SDA_Orlando_Podcast/podcast_rss.xml |
08:01
🔗
|
JesseW |
humph, typo |
08:01
🔗
|
JesseW |
http://archive.org/download/AlexandroxEDMPowerPodcast/AlexandroxEdmPowerPodcast.xml |
08:01
🔗
|
JesseW |
http://archive.org/download/AnthologyOfVerklarung/RSSfeed.xml |
08:01
🔗
|
JesseW |
http://archive.org/download/BackInAFlash_20150115/BackInAFlash.xml |
08:01
🔗
|
JesseW |
http://archive.org/download/elpodcastdelbuho/elpodcastdelbuho_rss.xml |
08:01
🔗
|
JesseW |
http://archive.org/download/thebeatvia/config.xml |
08:01
🔗
|
JesseW |
http://archive.org/download/ThisWeeksWeirdNewsRssFeed/ThisweeksweirdnewsRss-Itunes.xml |
08:01
🔗
|
JesseW |
http://archive.org/download/ThoughtsOnTheTable/feed.xml |
08:01
🔗
|
JesseW |
http://archive.org/download/University_SDA_Orlando_Podcast/podcast_rss.xml |
08:02
🔗
|
JesseW |
here they are with working links |
08:02
🔗
|
JesseW |
they all appear to be podcast feeds |
08:02
🔗
|
JesseW |
so maybe not so interesting. :-/ |
08:08
🔗
|
|
vitzli has joined #archiveteam-bs |
08:09
🔗
|
JesseW |
bizarrely, one of the files that changed is http://archive.org/download/wikimediadownloads/legal.html |
08:10
🔗
|
JesseW |
which I'm not sure where it is linked from, but seems to be a copy of https://dumps.wikimedia.org/legal.html |
08:13
🔗
|
godane |
!ao http://www.theblaze.com/stories/2016/01/25/texas-grand-jury-indicts-center-for-medical-progress-filmmakers-but-not-planned-parenthood/ |
08:13
🔗
|
godane |
i put in archivebot channel |
08:13
🔗
|
godane |
*it |
08:14
🔗
|
vitzli |
JesseW, is it just md5 only or md5-sha1-crc32? |
08:14
🔗
|
JesseW |
sadly, the original census only has md5 |
08:15
🔗
|
JesseW |
IA provides md5/sha1/crc32 -- but (to be consistent with the previous census), I only grabbed md5 |
08:16
🔗
|
JesseW |
It'd probably be good to grab sha1 and crc32 going forward, I suppose. -- but that would increase the size quite a bit |
08:16
🔗
|
JesseW |
vitzli: |
08:18
🔗
|
vitzli |
I thought you grabbed sha1 too and I wanted to offer the storage for it if it is not needed |
08:19
🔗
|
vitzli |
how much time and bandwidth does it take to do? |
08:19
🔗
|
JesseW |
vitzli: it only took about a day to do the grab -- if you have spare storage lying around, it'd probably be neat to grab the *full* metadata and leave it somewhere |
08:20
🔗
|
JesseW |
That would be *much* larger, though. There are a few identifiers where IA (for /reasons/) effectively put the whole content into the metadata. One of the more egregious ones (amusingly enough) is http://archive.org/details/nsa |
08:21
🔗
|
JesseW |
But even if you stripped out those outliers, it would likely be at least a few hundred gigabytes, if not larger. |
08:22
🔗
|
vitzli |
metadata as in _files.xml or something else? |
08:23
🔗
|
JesseW |
metadata as in archive.org/metadata/blahblbhahblah |
08:23
🔗
|
JesseW |
which is a generated combination of _meta.xml and _files.xml (IIRC) |
08:29
🔗
|
JesseW |
another amusing change: http://archive.org/download/arcadeflow/28th.html -- this is (presumably the most recent) part of SketchCow's alternate interface for the Internet Arcade |
08:37
🔗
|
vitzli |
I don't know if I could take all metadata, but I could grab and store sha1-md5-crc32-size data from IA |
08:39
🔗
|
JesseW |
cool; it'd be great to have a 3rd person get it working. The two basic tools you need are iamine and jq. |
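A minimal sketch of pulling the sha1-md5-crc32-size fields out of one archive.org/metadata/ response, assuming the JSON has a top-level "files" list whose entries carry "name", "md5", "sha1", "crc32" and "size" keys. The sample document and the `hash_rows` helper are made up for illustration; this is not the iamine/jq pipeline itself, and it does no downloading.

```python
import json

# Invented sample shaped like a metadata response; real items have many more fields.
SAMPLE = """{"metadata": {"identifier": "example_item"}, "files": [
  {"name": "a.txt", "md5": "aaa", "sha1": "bbb", "crc32": "ccc", "size": "12"},
  {"name": "example_item_files.xml", "source": "metadata"}
]}"""


def hash_rows(metadata_json):
    """Return tab-separated rows: identifier, name, md5, sha1, crc32, size."""
    doc = json.loads(metadata_json)
    identifier = doc.get("metadata", {}).get("identifier", "")
    rows = []
    for entry in doc.get("files", []):
        # Some entries lack hashes; emit empty columns rather than failing.
        columns = [identifier] + [
            str(entry.get(key, "")) for key in ("name", "md5", "sha1", "crc32", "size")
        ]
        rows.append("\t".join(columns))
    return rows


if __name__ == "__main__":
    for row in hash_rows(SAMPLE):
        print(row)
```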
08:40
🔗
|
JesseW |
I'm heading to sleep soon, but I'll be glad to walk you through it later (and feel free to mention me in the channel when I'm not here -- I'll read the logs) |
08:41
🔗
|
JesseW |
Oh, you'll also need python3.3 or 3.4 |
08:41
🔗
|
JesseW |
and GNU parallel |
08:41
🔗
|
vitzli |
I played with jq when found out about IA census, little bit annoying filter language, but good otherwise |
08:42
🔗
|
JesseW |
eh, it's so much nicer than trying to do JSON parsing in a non-CLI-oriented way |
08:42
🔗
|
JesseW |
and it grows on you (or at least it did on me) |
08:43
🔗
|
JesseW |
warn me (and Jake at IA) before you start actually downloading, as it does hit IA's servers noticeably. |
08:44
🔗
|
vitzli |
I will not be doing it in 4 or 5 days |
08:44
🔗
|
JesseW |
and please add any additional details you can think of to the Census page on the Archiveteam wiki -- the more the better. |
08:44
🔗
|
vitzli |
I'll send the email/ irc message |
08:44
🔗
|
JesseW |
I need to dump a bunch of my command line stuff on there |
08:45
🔗
|
vitzli |
I extracted md5s from it half a year ago, it was about 6 or 7 GB in MD5SUMs format - "md5 *filename" |
08:45
🔗
|
JesseW |
so far, I've generated 122G of in-progress files (a lot of them duplicate, and uncompressed, so don't worry that it'll get *that* big) |
08:45
🔗
|
JesseW |
vitzli: nice |
08:46
🔗
|
JesseW |
Hm, I should probably put it in md5sum format, actually. |
08:46
🔗
|
vitzli |
nuked it two weeks ago, still have the data from IA census on my hdd |
08:46
🔗
|
JesseW |
I used a 3 column tab-separated values. |
08:46
🔗
|
JesseW |
identifier \t filename \t md5 |
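Converting that 3-column layout into the "md5 *filename" form vitzli mentions could look roughly like this. Joining identifier and filename into an `identifier/filename` path is an assumption, and `tsv_to_md5sum` is a hypothetical helper name.

```python
def tsv_to_md5sum(line):
    """Turn 'identifier<TAB>filename<TAB>md5' into 'md5 *identifier/filename'."""
    identifier, filename, md5 = line.rstrip("\n").split("\t")
    # Assumption: the item identifier acts as the directory part of the path.
    return f"{md5} *{identifier}/{filename}"


if __name__ == "__main__":
    print(tsv_to_md5sum("nsa\tfoo.pdf\td41d8cd98f00b204e9800998ecf8427e\n"))
```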
08:48
🔗
|
JesseW |
my census didn't exclude the private files, so it has md5s for (some? most? all?) of the Wayback Machine data, too. |
08:48
🔗
|
JesseW |
they aren't downloadable, but the reported md5s (and I think sha1s and crc32s) *are* available |
08:49
🔗
|
JesseW |
and my census doesn't include any identifiers created after the original census -- Jake said he'll make a new identifier list and provide that, but I told him there was no hurry, as I was more interested in changes. |
08:51
🔗
|
JesseW |
eh, I really should go to sleep. G'night! |
08:52
🔗
|
vitzli |
good night |
08:57
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
09:00
🔗
|
|
JesseW has quit IRC (Read error: Operation timed out) |
09:02
🔗
|
|
dashcloud has joined #archiveteam-bs |
09:26
🔗
|
SketchCow |
SO MUCH CD-ROM BACKING UP |
09:58
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
10:02
🔗
|
ersi |
Backin' up backin up~ cause my daddy tought me goood |
10:02
🔗
|
ersi |
taught, damn it |
10:18
🔗
|
|
megaminxw has joined #archiveteam-bs |
10:18
🔗
|
|
bzc6p has joined #archiveteam-bs |
10:19
🔗
|
bzc6p |
megaminxw: When requests are made in the background by javascript means, the browser doesn't show it's loading. |
10:20
🔗
|
megaminxw |
im just going to put this down to me not understanding javascript very well |
10:22
🔗
|
megaminxw |
now im wondering if its possible to combine warcs because otherwise i wont have a clue what to do about this |
10:26
🔗
|
bzc6p |
You can concatenate WARCs with tools like megawarc. But you can also store/upload them as separate files. |
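As a rough illustration of why concatenation works: a .warc.gz is normally a series of independent gzip members, so appending the bytes of one file to another still yields a valid gzip stream. The sketch below assumes gzip-compressed WARCs written that way; `concatenate_files` is a hypothetical helper and does not describe what megawarc does internally.

```python
import shutil


def concatenate_files(sources, destination):
    """Append the raw bytes of each source file, in order, into destination."""
    with open(destination, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    # File names here are placeholders.
    concatenate_files(["part1.warc.gz", "part2.warc.gz"], "combined.warc.gz")
```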
10:27
🔗
|
megaminxw |
well, alright |
10:28
🔗
|
megaminxw |
im rather new to this whole thing (WHO WOULD HAVE GUESSED) so |
10:35
🔗
|
bzc6p |
We're here to share experience. |
10:35
🔗
|
megaminxw |
alright |
11:12
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
11:16
🔗
|
|
dashcloud has joined #archiveteam-bs |
11:28
🔗
|
|
bzc6p has left |
11:40
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
11:43
🔗
|
|
dashcloud has joined #archiveteam-bs |
12:57
🔗
|
|
megaminxw has quit IRC (Quit: Leaving.) |
13:24
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
13:28
🔗
|
|
dashcloud has joined #archiveteam-bs |
14:34
🔗
|
|
mr-b has quit IRC (Read error: Operation timed out) |
14:41
🔗
|
|
mr-b has joined #archiveteam-bs |
14:51
🔗
|
|
Stiletto has joined #archiveteam-bs |
14:53
🔗
|
|
slyphic|a is now known as slyphic |
15:19
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
15:23
🔗
|
|
dashcloud has joined #archiveteam-bs |
16:48
🔗
|
|
vitzli has quit IRC (Leaving) |
17:01
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
17:05
🔗
|
|
dashcloud has joined #archiveteam-bs |
17:27
🔗
|
|
JesseW has joined #archiveteam-bs |
18:01
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:02
🔗
|
|
dashcloud has joined #archiveteam-bs |
18:02
🔗
|
|
JesseW has quit IRC (Leaving.) |
18:14
🔗
|
SketchCow |
You will all be DELIGHTED to know my smashing into my CD-ROM inbox is going swimmingly. |
18:15
🔗
|
SketchCow |
We're well past "send a hard drive to IA because it's just too much data." |
18:16
🔗
|
phuzion |
SketchCow: Do you just sneakernet HDDs to IA when you fly there (what seems like) every week? Or do you fedex/ups them? |
18:17
🔗
|
SketchCow |
I Fedex. |
18:17
🔗
|
SketchCow |
Or I take it along. |
18:17
🔗
|
phuzion |
Gotcha. |
18:18
🔗
|
SketchCow |
When it's stuff like this. |
18:18
🔗
|
SketchCow |
Anything over, say, 20gb of data, it becomes more useful to just hd it. |
18:18
🔗
|
SimpBrain |
heh |
18:18
🔗
|
SketchCow |
We're well past 200gb of CD-ROM/DVD-ROM images and scans. |
18:18
🔗
|
SketchCow |
For this batch. |
18:18
🔗
|
SketchCow |
This is going to be a doozy. |
18:19
🔗
|
phuzion |
What's your upstream? |
18:19
🔗
|
SketchCow |
Cable modem. |
18:19
🔗
|
SketchCow |
That I use. |
18:20
🔗
|
SketchCow |
I won't slow down xbox so I can droop up a bunch of ISOs |
18:20
🔗
|
phuzion |
Gotcha. |
18:20
🔗
|
SketchCow |
I do stuff on FOS all the time for this reason. But for this creation of digital materials here in my home, I am just creating them, boxing up the physicals, then heading off to the IA |
18:28
🔗
|
SimpBrain |
o.O http://www.friendsreunited.co.uk/barack-obama-with-his-mum/People/b4952da8-3b65-4c0a-ae68-a166009d6b5d |
18:29
🔗
|
SimpBrain |
there i thought fr was a uk based site! |
18:29
🔗
|
SimpBrain |
for the userwalled http://www.assetstorage.co.uk/AssetStorageService.svc/GetImageFriendly/721510904/400/281/0/0/1/80/ResizeBestFit/0/FRU/649029A5DCE7F31D5C0FBDB1E2A4F1BD/barack-obama-with-his-mum.jpg |
18:30
🔗
|
HCross |
Why does that remind me of Michael Jackson? |
18:30
🔗
|
SimpBrain |
afro? |
18:30
🔗
|
HCross |
Yeah |
18:31
🔗
|
SimpBrain |
friendsreunited link grab is almost done |
18:45
🔗
|
|
SilSte has joined #archiveteam-bs |
18:45
🔗
|
|
Silvan has quit IRC (Read error: Connection reset by peer) |
19:04
🔗
|
|
PurpleSym has quit IRC (Remote host closed the connection) |
19:25
🔗
|
|
PurpleSym has joined #archiveteam-bs |
19:37
🔗
|
joepie91 |
http://phasenoise.livejournal.com/1500.html?nojs=1 |
19:37
🔗
|
joepie91 |
very cool |
20:37
🔗
|
|
megaminxw has joined #archiveteam-bs |
20:37
🔗
|
|
megaminxw has left |
21:15
🔗
|
|
JetBalsa has quit IRC (Read error: Operation timed out) |
21:48
🔗
|
joepie91 |
"This computer will soon stop receiving Google Chrome updates because this Linux system will no longer be supported." |
21:48
🔗
|
joepie91 |
grmbl grmbl grmbl |
21:49
🔗
|
SimpBrain |
lol |
21:52
🔗
|
phuzion |
joepie91: which distro? |
22:20
🔗
|
ersi |
any, probably |
22:49
🔗
|
joepie91 |
phuzion: openSUSE 13.1 |
22:50
🔗
|
phuzion |
Huh |
23:43
🔗
|
|
BlueMaxim has joined #archiveteam-bs |