[00:09] still requires a login, even if access is complimentary
[00:10] willing to bet watermarked as well
[00:17] hmm
[00:17] v1 actually loaded
[00:56] alard: another error:
[00:56] - Running wget --mirror (at least 873 files)... ERROR (3).
[00:56] Error downloading 'wonshik'.
[00:56] Error downloading from web.me.com.
[00:59] O_o
[00:59] 2011-11-04 00:44:17 ERROR 402: Payment Required.
[01:01] alard: http://pastebin.com/Vgz1M3QM
[01:05] will this work: src/wget listserv.aol.com -m --warc-file=aol
[01:06] bsmith094: you need to give it cookies for a login. I already have a logged-in wget running against it
[01:06] it seems to be downloading something just fine
[01:06] it is downloading a lot of "please log in" pages
[01:08] oy well ok I'll stop it then
[02:41] of course they couldn't just make an mbox-format dump available
[03:18] http://www.reddit.com/r/AskReddit/comments/lzlb5/will_facebook_one_day_become_a_mass_egrave_for/c2wwg1b they forgot the part where everything gets deleted
[03:19] Cameron_D: :D
[03:30] I'm trying to successfully run get-wget-warc.sh, but it's failing with a message of "--with-ssl was given, but GNUTLS is not available". I'm running Ubuntu, I can't find much of use in Synaptic/using apt-get, and my install attempts with gnutls-3.0.5 or nettle-2.4 are failing. Any suggestions?
[03:32] Paradoks: apt-get install libssl-dev
[03:32] That's a metapackage that should pull in what you need
[03:33] Holy Balls, 7:30am flight
[03:33] what the dillyfuck was I thinking
[03:35] :D
[03:37] Underscor: Any other possibilities? It installed, but I ended up with the same "GNUTLS is not available" error.
[03:41] http://weknowmemes.com/wp-content/uploads/2011/10/there-it-goes-the-last-fuck-i-give.gif
[03:54] Paradoks: Try libcurl4-openssl-dev
[03:59] http://imgur.com/gallery/UP18u
[04:00] underscor: Woo! libcurl4-openssl-dev didn't work, but libcurl4-gnutls-dev did. Thanks!
[04:01] It appears to be running. Now off to bed with me. Hopefully the morning mess won't be too bad.
[04:06] Yay
[04:18] hmm cygwin wget doesn't like github.com's certificate
[04:19] http://imgur.com/gallery/NVBoL
[04:20] DFJustin: My wget doesn't either
[04:20] --no-check-certificate
[04:20] yeah already did that, just sayin
[04:20] Oh okay
[04:20] Yeah, my ubuntu 10.10 wget doesn't like it either
[04:20] http://imgur.com/gallery/NVBoL
[04:20] Oh
[04:20] God
[04:21] same link
[04:21] Oops
[04:21] http://imgur.com/gallery/3Ffk5
[04:22] yeah it's not like I was planning on sleeping tonight or anything
[04:24] what are you wgetting from github?
[04:24] first line of get-wget-warc.sh
[04:46] alard: now running the fix ... could you please make the next update handle restarts after forcible stops sensibly? ;)
[04:46] alard: though I suppose ./dld-user.sh johnwooton && ./dld-client.sh chronomex is a reasonable workaround
[08:22] chronomex: dld-single.sh (nickname) (usertoresume)
[08:22] that will properly tell the tracker you finished that user upon completion
[08:23] in fact, dld-client just calls dld-single
[08:23] (after fetching the next username from the tracker)
[08:28] okay
[08:29] dld-user.sh does not properly notify the tracker?
[08:30] no
[08:30] er, I should check
[08:31] I might have them backwards
[08:32] dld-user just gets the user's data. It does not notify the tracker. dld-single calls dld-user and then handles telling the tracker about completion
[08:32] ahk.
[08:32] dld-single (and by extension dld-user) won't redownload upon finding a completed user, correct?
[08:32] dld-user calls dld-me-com.sh to do the actual heavy lifting
[08:32] correct
[08:32] okay sweet
[08:33] it also won't re-download a completed site for a user
[08:33] (it gets data from 4 domains. If it notices any are complete, it won't re-download that site)
[08:33] hm, doesn't seem to notify the tracker if it finds a complete thinger
[08:34] dld-user, I mean
[08:34] dld-user doesn't notify the tracker
[08:34] dld-single does
[08:34] ah shit k
[08:34] * chronomex retard
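(A minimal sketch of the resume workflow just described, pieced together only from this log: dld-single.sh takes a downloader nickname and the user to resume, calls dld-user.sh/dld-me-com.sh for the actual download, and reports completion to the tracker, which dld-user.sh alone does not. Argument order and flags may differ in the real scripts.)

    #!/bin/bash
    # Re-run the user whose download was forcibly stopped, then go back to
    # normal tracker-driven work. "chronomex" is the downloader nickname and
    # "johnwooton" the interrupted MobileMe user mentioned above.
    ./dld-single.sh chronomex johnwooton && ./dld-client.sh chronomex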
[08:39] alard: thanks for making this infrastructure, it's really awesome.
[08:40] I should get some sleep
[08:41] gotta get up somewhat early and get down to registration to pick up my badge.
[08:42] nite
[08:42] what con?
[08:43] youmacon. anime con in detoilet, mi
[08:44] literally on the detroit river waterfront
[08:45] i look down towards the ground and there's the windsor tunnel entrance. i look a bit above that towards the horizon and there's the Ambassador Bridge
[08:45] ah, neat
[08:45] so my poor phone is getting confused with the towers it sees up on 51
[08:45] I've never been to detroit
[09:04] - Downloading (15 files)... done.
[09:04] - Result: 372M
[09:04] christ
[13:20] Mooorning.
[13:21] It'll be interesting to see what that reporter brings to the yard.
[13:24] "On December 1, AOL will shut down its free LISTSERV-based mailing-list hosting operations, the company has told mailing list administrators. 'If your list is still actively used, please make arrangements to find another service prior to the shutdown date and notify your list members of the transition details,' an email notice sent out by AOL stated. At the peak of the service's popularity in the late 1990s, AOL was the third-largest provider of mailing lis
[13:25] [17:18] Someone kindly WARC that bitch
[13:26] ...though I'll admit I have no idea what happened after that.
[13:27] On a separate topic, does MobileMe normally go down from 2a-2:30a Pacific time?
[13:28] SketchCow: I uploaded a listserv archive to batcave. Retrieved via email, not via the web interface, so it may be different from what Coderjoe can get.
[13:30] Ah~
[14:10] OK.
[14:11] I feel we'll need to do some google searches to find the hidden servs.
[14:13] Update: my earlier collection wasn't complete, I've just discovered more lists (that are not on the web page).
[15:42] Yes
[15:42] Exactly.
[15:51] Here's the list that you get if you email the listserv: https://pastebin.com/raw.php?i=i1cMPwWt
[16:16] alard: just as an FYI -- I don't think it's actionable -- the build of wget-warc that's downloaded by the MobileMe tools segfaults on my install of OS X 10.6.8 when retrieving a site
[16:16] I don't yet have any more information than that; I'm putting it through gdb now
[16:17] And the latest wget-without-warc works?
[16:17] haven't tried that one
[16:17] I do have Wget 1.12 installed on this system, which does work
[16:26] hmm, well, at least it's a straightforward crash
[16:26] https://gist.github.com/af6ec51617ac4d97dbbd
[16:28] Is it on a particular domain/user, or does it always crash?
[16:28] always, so far
[16:29] I'm rebuilding wget-warc with debug symbols
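(For reference, a rough sketch of the debug rebuild being described, assuming the wget-warc tarball fetched by get-wget-warc.sh; the configure flags and the web.me.com test URL are illustrative, not taken from the actual tools.)

    # Build without optimisation and with debug info so gdb shows real frames.
    CFLAGS="-g -O0" ./configure --with-ssl
    make
    # Reproduce the crash under gdb; after the segfault, 'bt' prints the
    # backtrace pointing at the offending call.
    gdb --args src/wget --mirror --warc-file=test "http://web.me.com/someuser/"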
no" [16:32] I wonder how often that check generates false negatives [16:33] https://gist.github.com/af6ec51617ac4d97dbbd/37d2c1bbb2230dcdb00710d9f4541dc74567f0c3 [16:33] ok, much more useful backtrace this time [16:35] oh, ok [16:35] " The basename() function returns a pointer to internal static storage space that will be overwritten by subsequent [16:35] calls. The function may modify the string pointed to by path. [16:35] " [16:36] so I guess that can be read "don't trust basename(3) to give you back anything useful" [16:42] Does this fix it? https://gist.github.com/6626eb704974af0c8d1d [16:43] hah [16:43] alard: that's exactly what I just wrote [16:43] Good! [16:43] well [16:43] with one other addition [16:43] on OS X (and probably other platforms), basename(3) is defined in libgen.h [16:43] er, exported from [16:44] that isn't included by warc.c, however, so I'm not entirely sure which basename is being used [16:44] Maybe the gnulib one: lib/basename-lgpl.c:/* basename.c -- return the last element in a file name [16:45] oh, yes, probably [16:46] Does it work now? [16:46] anyway, yeah, that libgen.h inclusion is also required, as using whichever other definition of basename will lead to the same crash [16:47] it does seem to build WARCs correctly now; I'm trying it out on user bdemoss [16:48] alard: is it useful to compare checksums of WARCs to verify that the above modifications are good to go? [16:49] I'm not sure if WARCs contain data that would make such checksums useless, like file retrieval times [16:49] Yes, they do. File retrieval times, random unique record ids. [16:49] hmm [16:50] Maybe you can unzip and look at the first few lines. It should show the filename in the body of warc-info record. [16:50] yeah [16:50] Sorry, with "libgen.h inclusion is also required", do you mean that I should add that include to warc.c? [16:50] yeah, I'll post my full patch [16:51] throwing that in will probably also require changes to configure.ac [16:52] oh wait, libgen is already present in wget-warc HEAD [16:52] weird [16:52] I remember that libgen.h was there, but I removed it later because it worked without it. [16:53] (The github version is not the same as the tar.bz2 version, by the way. The tar is newer.) [16:53] oh [16:53] Maybe I should update that. [16:53] yeah [16:53] in which case, the patch you posted should be sufficient [16:54] So then it's without libgen.h. [16:54] er [16:54] ugh, sorry [16:54] with [16:54] But if it works without libgen, why add it? (I'm not a very experienced c programmer, sorry. :) [16:54] it does't [16:54] doesn't [16:55] at least not on OS X [16:55] Ah, ok. [16:55] I suppose the problem can be patched in gnulib, though [16:55] let me see how hard that'd be [16:56] also, bdemoss' account is a really terrible test case [16:56] it's too huge [16:58] alard: actually, basename-lgpl.c doesn't define basename at all [16:58] I have no idea why the fuck it's named basename-lgpl.c [16:58] No, I just saw that. It has something to do with basenames, though. [17:10] ahh [17:10] libgen is part of SUSv2 and is defined as declaring, among other things, char * basename(char *) [17:10] so, yeah, libgen should be in there anyway [17:10] for UNIX platforms that is [17:12] Okay, so no need to add it? [17:13] no, it needs to be there for wget on systems that conform to SUSv2 [17:13] or rather it should be there [17:13] Ah, yes. (The github repo is now up to date, by the way.) 
[17:13] the fact that warc.c compiles without libgen.h there confuses me
[17:14] do you know of any MobileMe users that don't have a lot of data?
[17:15] oh, are people archiving aol listserv?
[17:15] I hope you all saw that slashdot post
[17:15] it's going away Dec 1
[17:18] 12M lioneltcb
[17:19] oh, hmm
[17:19] - Result: du: illegal option -- -
[17:19] from dld-user.sh:
[17:19] usage: du [-H | -L | -P] [-a | -s | -d depth] [-c] [-h | -k | -m | -g] [-x] [-I mask] [file ...]
[17:19] I guess that's just an options thing, will look at that further
[17:20] yeah, --apparent-size looks like a GNU extension
[17:20] Met the inventor of .tar last week
[17:20] life is good
[17:26] yipdw: Updated the scripts, --apparent-size is now optional.
[17:27] alard: oh, ok, cool
[17:27] I've got a patch that detects if gdu is present and uses that instead
[17:28] alard: anyway, here's the warc.c patch if you're interested -> https://gist.github.com/edf3351f3c95a85788d3
[17:28] it was made against the tarball downloaded by get-wget-warc.sh
[17:33] alard: and here's a patch for du --apparent-size feature detection if you'd like to use it -> https://gist.github.com/20c59248b5bd97d8affd
[17:56] can I help with the AOL grab?
[20:01] balrog: yes.
[20:02] I've got a wget-warc crawling it with a login. I may need the names of other lists to try and poke those as well
[20:02] ok.
[20:02] balrog: but alard was grabbing it through the email archive interface as well
[20:02] btw, there's still a lot that you can get to with the old AOL client
[20:02] that would suck to archive.
[20:26] bits rot over time / { UncorrectableError } / rsync fixes all
[20:26] #accidentalhaiku
[22:03] Downloading web.me.com/mrbambam
[22:03] - Running wget --mirror (at least 29406 files)...
[22:03] you go mrbambam
[22:06] I've been running wget --mirror on web.me.com/khickling for over 7 hours. At least 106592 files. This seems excessive. I still assume that there are users with numbers that dwarf khickling, too.
[22:06] perhaps
[22:20] Paradoks: how big is the WARC?
[22:23] The khickling directory currently has 73,217 items, totalling 4.3GB. The WARC is 2.1 GB, currently. So I don't think they're huge files, and it certainly hasn't been overly stressing to my connection.
[22:31] neat
[22:39] http://bzr.savannah.gnu.org/lh/wget/trunk/revision/2571
[23:17] Anyone happen to have a copy of the file that used to live here?
[23:17] http://www.archiveteam.org/archives/lulupoetry/letter.txt
[23:17] (Long shot, I know)
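(Going back to the du --apparent-size issue earlier in the log: a minimal sketch of how the feature detection might look. This is not alard's change or yipdw's gist, neither of which appears here, and the path at the end is a placeholder.)

    # GNU du understands --apparent-size; BSD/OS X du prints "illegal option"
    # as seen above, so probe first and fall back to coreutils' gdu (from
    # MacPorts/Homebrew) or, failing that, plain du.
    if du --apparent-size -s . >/dev/null 2>&1; then
      DU="du --apparent-size"
    elif command -v gdu >/dev/null 2>&1; then
      DU="gdu --apparent-size"
    else
      DU="du"
    fi
    $DU -hs "$userdir"   # $userdir: wherever the user's files landed (placeholder)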