#archiveteam 2011-11-04,Fri

Time Nickname Message
00:09 🔗 Coderjoe still requires a login, even if access is complimentary
00:10 🔗 Coderjoe willing to bet watermarked as well
00:17 🔗 Coderjoe hmm
00:17 🔗 Coderjoe v1 actually loaded
00:56 🔗 Coderjoe alard: another error:
00:56 🔗 Coderjoe - Running wget --mirror (at least 873 files)... ERROR (3).
00:56 🔗 Coderjoe Error downloading 'wonshik'.
00:56 🔗 Coderjoe Error downloading from web.me.com.
00:59 🔗 Coderjoe O_o
00:59 🔗 Coderjoe 2011-11-04 00:44:17 ERROR 402: Payment Required.
01:01 🔗 Coderjoe alard: http://pastebin.com/Vgz1M3QM
01:05 🔗 bsmith094 will this work src/wget listserv.aol.com -m --warc-file=aol
01:06 🔗 Coderjoe bsmith094: you need to give it cookies for a login. I already have a logged-in wget running against it
01:06 🔗 bsmith094 it seems to be downloading something just fine
01:06 🔗 Coderjoe it is downloading a lot of "please log in" pages
01:08 🔗 bsmith094 oy well ok ill stop it then
02:41 🔗 Coderjoe of course they couldn't just make an mbox-format dump available
03:18 🔗 Cameron_D http://www.reddit.com/r/AskReddit/comments/lzlb5/will_facebook_one_day_become_a_mass_egrave_for/c2wwg1b they forgot the part where everything gets deleted
03:19 🔗 underscor Cameron_D: :D
03:30 🔗 Paradoks I'm trying to successfully run get-wget-warc.sh, but it's failing with a message of "--with-ssl was given, but GNUTLS is not available". I'm running Ubuntu, I can't find much of use in Synaptic/using apt-get, and my install attempts with gnutls-3.0.5 or nettle-2.4 are failing. Any suggestions?
03:32 🔗 underscor Paradoks: apt-get install libssl-dev
03:32 🔗 underscor That's a metapackage that should pull in what you need
03:33 🔗 SketchCow Holy Balls, 7:30am flight
03:33 🔗 SketchCow what the dillyfuck was I thinking
03:35 🔗 underscor :D
03:37 🔗 Paradoks Underscor: Any other possibilities? It installed, but I ended up with the same "GNUTLS is not available" error.
03:41 🔗 SketchCow http://weknowmemes.com/wp-content/uploads/2011/10/there-it-goes-the-last-fuck-i-give.gif
03:54 🔗 underscor Paradoks: Try libcurl4-openssl-dev
03:59 🔗 underscor http://imgur.com/gallery/UP18u
04:00 🔗 Paradoks underscor: Woo! libcurl4-openssl-dev didn't work, but libcurl4-gnutls-dev did. Thanks!
04:01 🔗 Paradoks It appears to be running. Now off to bed with me. Hopefully the morning mess won't be too bad.
04:06 🔗 underscor Yay
04:18 🔗 DFJustin hmm cygwin wget doesn't like github.com's certificate
04:19 🔗 underscor http://imgur.com/gallery/NVBoL
04:20 🔗 underscor DFJustin: My wget doesn't either
04:20 🔗 underscor --no-check-certificate
04:20 🔗 DFJustin yeah already did that, just sayin
04:20 🔗 underscor Oh okay
04:20 🔗 underscor Yeah, my ubuntu 10.10 wget doesn't like it either
04:20 🔗 underscor http://imgur.com/gallery/NVBoL
04:20 🔗 underscor Oh
04:20 🔗 underscor God
04:21 🔗 DFJustin same link
04:21 🔗 underscor Oops
04:21 🔗 underscor http://imgur.com/gallery/3Ffk5
04:22 🔗 DFJustin yeah it's not like I was planning on sleeping tonight or anything
04:24 🔗 closure what are you wgetting from github?
04:24 🔗 DFJustin first line of get-wget-warc.sh
04:46 🔗 chronomex alard: now running the fix ... could you please make the next update handle restarts after forcible stops sensibly? ;)
04:46 🔗 chronomex alard: though I suppose ./dld-user.sh johnwooton && ./dld-client.sh chronomex is a reasonable workaround
08:22 🔗 Coderjoe chronomex: dld-single.sh (nickname) (usertoresume)
08:22 🔗 Coderjoe that will properly tell the tracker you finished that user upon completion
08:23 🔗 Coderjoe in fact, dld-client just calls dld-single
08:23 🔗 Coderjoe (after fetching the next username from the tracker)
08:28 🔗 chronomex okay
08:29 🔗 chronomex dld-user.sh does not properly notify the tracker?
08:30 🔗 Coderjoe no
08:30 🔗 Coderjoe er, I should check
08:31 🔗 Coderjoe i might have them backwards
08:32 🔗 Coderjoe dld-user just gets the user's data. it does not notify the tracker. dld-single calls dld-user and then handles telling the tracker about completion
08:32 🔗 chronomex ahk.
08:32 🔗 chronomex dld-single (and by extension dld-user) won't redownload upon finding a completed user, correct?
08:32 🔗 Coderjoe dld-user calls dld-me-com.sh to do the actual heavy lifting
08:32 🔗 Coderjoe correct
08:32 🔗 chronomex okay sweet
08:33 🔗 Coderjoe it also won't re-download a completed site for a user
08:33 🔗 Coderjoe (it gets data from 4 domains. if it notices any are complete, it won't re-download that site)
08:33 🔗 chronomex hm, doesn't seem to notify the tracker if it finds a complete thinger
08:34 🔗 chronomex dld-user I mean
08:34 🔗 Coderjoe dld-user doesn't notify the tracker
08:34 🔗 Coderjoe dld-single does
08:34 🔗 chronomex ah shit k
08:34 🔗 * chronomex retard
08:39 🔗 chronomex alard: thanks for making this infrastructure, it's really awesome.
08:40 🔗 Coderjoe I should get some sleep
08:41 🔗 Coderjoe gotta get up somewhat early and get down to registration to pick up my badge.
08:42 🔗 chronomex nite
08:42 🔗 chronomex what con?
08:43 🔗 Coderjoe youmacon. anime con in detoilet, mi
08:44 🔗 Coderjoe literally on the detroit river waterfront
08:45 🔗 Coderjoe i look down towards the ground and there's the windsor tunnel entrance. i look a bit above that towards the horizon and there's the ambassador bridge
08:45 🔗 chronomex ah, neat
08:45 🔗 Coderjoe so my poor phone is getting confused with the towers it sees up on 51
08:45 🔗 chronomex I've never been to detroit
09:04 🔗 chronomex - Downloading (15 files)... done.
09:04 🔗 chronomex - Result: 372M
09:04 🔗 chronomex christ
13:20 🔗 SketchCow Mooorning.
13:21 🔗 SketchCow It'll be interesting to see what that reporter brings to the yard.
13:24 🔗 ersi "On December 1, AOL will shut down its free LISTSERV-based mailing-list hosting operations, the company has told mailing list administrators. 'If your list is still actively used, please make arrangements to find another service prior to the shutdown date and notify your list members of the transition details,' an email notice sent out by AOL stated. At the peak of the service's popularity in the late 1990s, AOL was the third-largest provider of mailing lis
13:25 🔗 Paradoks [17:18] <SketchCow> Someone kindly WARC that bitch
13:26 🔗 Paradoks ...though I'll admit I have no idea what happened after that.
13:27 🔗 Paradoks On a separate topic, does MobileMe normally go down from 2a-2:30a Pacific time?
13:28 🔗 alard SketchCow: I uploaded a listserv archive to batcave. Retrieved via email, not via the web interface, so it may be different from what Coderjoe can get.
13:30 🔗 ersi Ah~
14:10 🔗 SketchCow OK.
14:11 🔗 SketchCow I feel we'll need to do some google searches to find the hidden servs.
14:13 🔗 alard Update: my earlier collection wasn't complete, I've just discovered more lists (that are not on the web page).
15:42 🔗 SketchCow Yes
15:42 🔗 SketchCow Exactly.
15:51 🔗 alard Here's the list that you get if you email the listserv: https://pastebin.com/raw.php?i=i1cMPwWt
16:16 🔗 yipdw alard: just as an FYI -- I don't think it's actionable -- the build of wget-warc that's downloaded by the MobileMe tools segfaults on my install of OS X 10.6.8 when retrieving a site
16:16 🔗 yipdw I don't yet have any more information than that; I'm putting it through gdb now
16:17 🔗 alard And the latest wget-without-warc works?
16:17 🔗 yipdw haven't tried that one
16:17 🔗 yipdw I do have Wget 1.12 installed on this system, which does work
16:26 🔗 yipdw hmm, well, at least it's a straightforward crash
16:26 🔗 yipdw https://gist.github.com/af6ec51617ac4d97dbbd
16:28 🔗 alard Is it on a particular domain/user, or does it always crash?
16:28 🔗 yipdw always, so far
16:29 🔗 yipdw I'm rebuilding wget-warc with debug symbols
16:32 🔗 yipdw hm
16:32 🔗 yipdw "checking whether strcasestr works in linear time... no"
16:32 🔗 yipdw I wonder how often that check generates false negatives
16:33 🔗 yipdw https://gist.github.com/af6ec51617ac4d97dbbd/37d2c1bbb2230dcdb00710d9f4541dc74567f0c3
16:33 🔗 yipdw ok, much more useful backtrace this time
16:35 🔗 yipdw oh, ok
16:35 🔗 yipdw " The basename() function returns a pointer to internal static storage space that will be overwritten by subsequent
16:35 🔗 yipdw calls. The function may modify the string pointed to by path.
16:35 🔗 yipdw "
16:36 🔗 yipdw so I guess that can be read "don't trust basename(3) to give you back anything useful"
16:42 🔗 alard Does this fix it? https://gist.github.com/6626eb704974af0c8d1d
16:43 🔗 yipdw hah
16:43 🔗 yipdw alard: that's exactly what I just wrote
16:43 🔗 alard Good!
16:43 🔗 yipdw well
16:43 🔗 yipdw with one other addition
16:43 🔗 yipdw on OS X (and probably other platforms), basename(3) is defined in libgen.h
16:43 🔗 yipdw er, exported from
16:44 🔗 yipdw that isn't included by warc.c, however, so I'm not entirely sure which basename is being used
16:44 🔗 alard Maybe the gnulib one: lib/basename-lgpl.c:/* basename.c -- return the last element in a file name
16:45 🔗 yipdw oh, yes, probably
16:46 🔗 alard Does it work now?
16:46 🔗 yipdw anyway, yeah, that libgen.h inclusion is also required, as using whichever other definition of basename will lead to the same crash
16:47 🔗 yipdw it does seem to build WARCs correctly now; I'm trying it out on user bdemoss
16:48 🔗 yipdw alard: is it useful to compare checksums of WARCs to verify that the above modifications are good to go?
16:49 🔗 yipdw I'm not sure if WARCs contain data that would make such checksums useless, like file retrieval times
16:49 🔗 alard Yes, they do. File retrieval times, random unique record ids.
16:49 🔗 yipdw hmm
16:50 🔗 alard Maybe you can unzip and look at the first few lines. It should show the filename in the body of warc-info record.
16:50 🔗 yipdw yeah
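[Editor's note] A minimal sketch of the check alard suggests above: decompress the WARC and read its first lines, where the warcinfo record should appear. The function name `warc_head` and the file name in the usage comment are hypothetical, and the sketch assumes the WARC is gzip-compressed.

```shell
# Hypothetical helper: print the first N lines (default 20) of a
# gzip-compressed WARC so the leading warcinfo record can be read by eye.
warc_head() {
  gzip -dc "$1" | head -n "${2:-20}"
}

# Usage (file name is illustrative only):
#   warc_head some-user.warc.gz 30
```

Because retrieval times and record ids differ between runs, reading the header this way is a sanity check rather than a byte-for-byte comparison.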
16:50 🔗 alard Sorry, with "libgen.h inclusion is also required", do you mean that I should add that include to warc.c?
16:50 🔗 yipdw yeah, I'll post my full patch
16:51 🔗 yipdw throwing that in will probably also require changes to configure.ac
16:52 🔗 yipdw oh wait, libgen is already present in wget-warc HEAD
16:52 🔗 yipdw weird
16:52 🔗 alard I remember that libgen.h was there, but I removed it later because it worked without it.
16:53 🔗 alard (The github version is not the same as the tar.bz2 version, by the way. The tar is newer.)
16:53 🔗 yipdw oh
16:53 🔗 alard Maybe I should update that.
16:53 🔗 yipdw yeah
16:53 🔗 yipdw in which case, the patch you posted should be sufficient
16:54 🔗 alard So then it's without libgen.h.
16:54 🔗 yipdw er
16:54 🔗 yipdw ugh, sorry
16:54 🔗 yipdw with
16:54 🔗 alard But if it works without libgen, why add it? (I'm not a very experienced c programmer, sorry. :)
16:54 🔗 yipdw it does't
16:54 🔗 yipdw doesn't
16:55 🔗 yipdw at least not on OS X
16:55 🔗 alard Ah, ok.
16:55 🔗 yipdw I suppose the problem can be patched in gnulib, though
16:55 🔗 yipdw let me see how hard that'd be
16:56 🔗 yipdw also, bdemoss' account is a really terrible test case
16:56 🔗 yipdw it's too huge
16:58 🔗 yipdw alard: actually, basename-lgpl.c doesn't define basename at all
16:58 🔗 yipdw I have no idea why the fuck it's named basename-lgpl.c
16:58 🔗 alard No, I just saw that. It has something to do with basenames, though.
17:10 🔗 yipdw ahh
17:10 🔗 yipdw libgen is part of SUSv2 and is defined as declaring, among other things, char * basename(char *)
17:10 🔗 yipdw so, yeah, libgen should be in there anyway
17:10 🔗 yipdw for UNIX platforms that is
17:12 🔗 alard Okay, so no need to add it?
17:13 🔗 yipdw no, it needs to be there for wget on systems that conform to SUSv2
17:13 🔗 yipdw or rather it should be there
17:13 🔗 alard Ah, yes. (The github repo is now up to date, by the way.)
17:13 🔗 yipdw the fact that warc.c compiles without libgen.h there confuses me
17:14 🔗 yipdw do you know of any MobileMe users that don't have a lot of data?
17:15 🔗 balrog oh, are people archiving aol listserv?
17:15 🔗 balrog I hope you all saw that slashdot post
17:15 🔗 balrog it's going away Dec 1
17:18 🔗 alard 12M lioneltcb
17:19 🔗 yipdw oh, hmm
17:19 🔗 yipdw - Result: du: illegal option -- -
17:19 🔗 yipdw from dld-user.sh:
17:19 🔗 yipdw usage: du [-H | -L | -P] [-a | -s | -d depth] [-c] [-h | -k | -m | -g] [-x] [-I mask] [file ...]
17:19 🔗 yipdw I guess that's just an options thing, will look at that further
17:20 🔗 yipdw yeah, --apparent-size looks like a GNU extension
17:20 🔗 SketchCow Met the inventor of .tar last week
17:20 🔗 SketchCow life is good
17:26 🔗 alard yipdw: Updated the scripts, --apparent-size is now optional.
17:27 🔗 yipdw alard: oh, ok, cool
17:27 🔗 yipdw I've got a patch that detects if gdu is present and uses that instead
17:28 🔗 yipdw alard: anyway, here's the warc.c patch if you're interested -> https://gist.github.com/edf3351f3c95a85788d3
17:28 🔗 yipdw it was made against the tarball downloaded by get-wget-warc.sh
17:33 🔗 yipdw alard: and here's a patch for du --apparent-size feature detection if you'd like to use it -> https://gist.github.com/20c59248b5bd97d8affd
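[Editor's note] A rough sketch of the kind of feature detection discussed above, not the patch from the gist. It probes whether `du` accepts the GNU `--apparent-size` option, falls back to `gdu` if that is installed and accepts it, and otherwise uses plain `du`. The function name `pick_du` is hypothetical.

```shell
# Hypothetical helper: print a du command line that works on this system.
pick_du() {
  # GNU du accepts --apparent-size; BSD du (e.g. on OS X) rejects it.
  if du --apparent-size /dev/null >/dev/null 2>&1; then
    echo "du --apparent-size"
  # Some systems install GNU du under the name gdu; probe it the same way.
  elif command -v gdu >/dev/null 2>&1 &&
       gdu --apparent-size /dev/null >/dev/null 2>&1; then
    echo "gdu --apparent-size"
  else
    # Fall back to plain du, which reports disk usage instead of apparent size.
    echo "du"
  fi
}

# Usage (directory name is illustrative only):
#   DU_CMD=$(pick_du)
#   $DU_CMD -sh data/
```

Probing the option directly, rather than checking the OS name, keeps the script working on any system where a GNU-compatible `du` happens to be available.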
17:56 🔗 closure can I help with the AOL grab?
20:01 🔗 Coderjoe balrog: yes.
20:02 🔗 Coderjoe I've got a wget-warc crawling it with a login. I may need the names of other lists to try and poke those as well
20:02 🔗 balrog ok.
20:02 🔗 Coderjoe balrog: but alard was grabbing it through the email archive interface as well
20:02 🔗 balrog btw, there's still a lot that you can get to with the old AOL client
20:02 🔗 balrog that would suck to archive.
20:26 🔗 chronomex bits rot over time / { UncorrectableError } / rsync fixes all
20:26 🔗 chronomex #accidentalhaiku
22:03 🔗 chronomex Downloading web.me.com/mrbambam
22:03 🔗 chronomex - Running wget --mirror (at least 29406 files)...
22:03 🔗 chronomex you go mrbambam
22:06 🔗 Paradoks I've been running wget --mirror on web.me.com/khickling for over 7 hours. At least 106592 files. This seems excessive. I still assume that there are users with numbers that dwarf khickling, too.
22:06 🔗 chronomex perhaps
22:20 🔗 yipdw Paradoks: how big is the WARC?
22:23 🔗 Paradoks The khickling directory currently has 73,217 items, totalling 4.3GB. The WARC is 2.1 GB, currently. So I don't think they're huge files, and it certainly hasn't been overly stressing to my connection.
22:31 🔗 yipdw neat
22:39 🔗 alard http://bzr.savannah.gnu.org/lh/wget/trunk/revision/2571
23:17 🔗 underscor Anyone happen to have a copy of the file that used to live here?
23:17 🔗 underscor http://www.archiveteam.org/archives/lulupoetry/letter.txt
23:17 🔗 underscor (Long shot, I know)
