[00:01] g'day
[01:26] hey Famicoman
[01:27] i found 54 episodes of the lab with leo laporte
[01:51] on myspleen?
[02:07] yes
[02:07] i know there are a lot more out there
[02:08] that's cause there are a lot of dead torrents of it
[02:11] godane, how goes your hackernews backup
[02:13] i decided not to do it
[02:14] i got so much stuff to upload now and if i lose internet/wifi it would be incomplete
[03:53] cool, someone wrote http://metatalk.metafilter.com/22734/Google-Seceder
[04:34] Can someone please do a WGET WARC grab of misc.yero.org/modulez
[04:36] on it
[04:37] it looks like the zip files are on another domain, I guess I have to do something extra
[04:39] I will probably grep the files for "download:" and WARC all those links as well
[04:43] wow. all the actual files were the responsibility of the artist to host.
[04:52] yeah, that's nuts
[09:48] one of you, please PLEASE PLEASE give me some xanga username lists
[09:48] https://archive.org/details/archiveteam-xanga-userlist-20130142 << already grabbing this
[11:20] SketchCow: grabbing the music on that site is going to be problematic because 80% are hosted at ftp://ftp.scenesp.org/pub/modulez/ and it is down
[11:21] http://www.scenesp.org/ftp/modulez/ looks like the domain got expired
[11:22] ftp://ftp.scene.org/pub/mirrors/scenesp.org/ ah here we go
[11:23] ivan`: In case you can't find it elsewhere: there's a comment from yero at http://misc.yero.org/modulez/ saying he has a backup of the stuff at scenesp
[11:24] Meaning that if you only find out-of-date mirrors you should be able to get the rest from him
[11:24] wonder if there's any way to tell what the IP address of that FTP site used to be? The site itself might still be up even if the domain has disappeared
[11:25] [I'm sure there used to be a web page that could give you that kind of historic info on a domain]
[11:26] http://dnshistory.org/dns-records/scenesp.org has two-year-old info
[11:27] Pointing to a domain park
[11:34] zsh: segmentation fault (core dumped) wget --warc-file=misc.yero.org-music --warc-cdx --page-requisites -e 5 60
[11:46] can someone try wget on their machine? http://paste.archivingyoursh.it/raw/sukenuvuti http://paste.archivingyoursh.it/raw/tesatikoke
[12:02] ivan`: will do in a few minutes.
[12:07] thanks
[12:08] also, would anyone like to write a crawler to find livejournal users for greader-grab?
[12:09] it is a simple pulling of friends on pages like http://makaalz.livejournal.com/profile?socconns=pfriends&mode_full_socconns=1&comms=cfriends
[12:18] GLaDOS: are you around to fork a repo into ArchiveTeam?
[12:25] ivan`: da
[12:27] GLaDOS: sec
[12:29] GLaDOS: https://github.com/ludios/greader-directory-grab
[12:30] AM FORK =) https://github.com/ArchiveTeam/greader-directory-grab
[12:30] thanks!
[12:42] wget --warc-file=misc.yero.org-music --warc-cdx --page-requisites -e 5 60
[12:42] that ivan` ?
[12:42] you're missing a url
[12:44] I don't know if your paste got eaten
[12:44] see http://paste.archivingyoursh.it/raw/sukenuvuti
[12:45] the text file is the second paste?
[12:45] yes
[12:45] if it segfaults for you too after the first two URLs, maybe try wget-lua
[12:46] here goes nothing
[12:46] Segmentation fault
[12:52] hmmm I'm building wget-lua :D
[12:53] valgrind says it's a warc-writing bug
[12:53] That wouldn't surprise me
[12:53] uyrgh
[12:53] so this might not help.
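(For reference, the grep-and-fetch plan mentioned at [04:39] amounts to roughly the sketch below. This is only an illustration: the directory and file names are assumptions rather than what ivan` actually ran, and as the log shows it is the ftp:// entries in the resulting URL list that trip the WARC-writing crash.)

    # Mirror the index pages themselves into a WARC.
    wget -e robots=off --mirror --page-requisites \
         --warc-file=misc.yero.org-modulez --warc-cdx \
         http://misc.yero.org/modulez/

    # Pull the externally hosted download links out of the saved HTML...
    grep -rhoE 'href="(https?|ftp)://[^"]+"' misc.yero.org/ \
      | sed -e 's/^href="//' -e 's/"$//' | sort -u > music_urls.txt

    # ...and fetch those into a second WARC.
    wget -e robots=off -i music_urls.txt \
         --warc-file=misc.yero.org-music --warc-cdx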
[12:53] If you look at the source code for wget you will notice there is 0% test coverage of warc support
[12:54] :D
[12:55] yeah, wget-lua also segfaults
[12:56] Doh
[12:57] _music_urls.fixed.txt: Invalid URL http:/r: Unsupported scheme 'http'
[12:57] _music_urls.fixed.txt: Invalid URL ftp:b/stream/noof_-_no_shine.mp3: Unsupported scheme 'ftp'
[12:57] lol wut
[12:57] http unsupported?
[12:57] should have two slashes after the scheme, very silly error message
[12:57] reproducible with: valgrind ./wget-lua --warc-file=misc.yero.org-music --warc-cdx ftp://ftp.scene.org/pub/mirrors/scenesp.org/modulez/bitl/chvalley.it
[12:58] the problem is ftp...
[13:01] http sources work fine
[13:02] https://www.refheap.com/15873
[13:07] Right, trying to grab all non-ftp sources at least.
[13:08] I'll mail the bug-wget list
[13:14] ivan`: I'd reproduce it with the original wget first
[13:16] it is something between ftp tho
[13:19] ivan`: most of these files are 404, I presume that's expected?
[13:24] ersi: I did
[13:24] Smiley: I don't know
[13:24] most likely
[13:29] ivan`: Ah, alright. I just meant that it might look less severe if one would mention wget-lua, reproducibility and all that jazz
[13:29] great stuff though
[13:38] does anyone know of not-too-expensive CD autoloaders/autofeeders?
[13:39] (for bulk reading and archiving CDs or DVDs)
[13:48] the only thing I can find that might possibly work is sony's media changer line
[13:50] and those are discontinued and appear very difficult to get
[13:55] it would be great if outside people could edit http://www.archiveteam.org/index.php?title=Google_Reader without connecting to efnet, which pretty much nobody knows how to do
[13:57] maybe even embedded irc in the wiki, mibbit or whatever efnet can use?
[14:20] ivan`: you want them editing it if they can't figure out IRC?!
[14:20] Smiley: yes
[14:20] (and point them to our pastebin and tell them to paste it there ..... then email you :/
[17:38] lots of 90s pages
[17:38] there are the domains hem1.passagen.se to hem3.passagen.se but they are the same ip and host the same content
[17:38] this looks like a worthy target for archival: http://www.passagen.se/hemsidor/hemsideindex/a.html
[17:42] ~115k homepages
[17:44] quick user list: http://p.defau.lt/?78lwBN4znhMBy6XcP1mcHw
[17:44] (1MB)
[18:31] Schbirid: is it complete?
[19:16] it should be all the names from the hemsideindex
[19:16] nighty
[19:38] Ok so im grabbing all of the websites from that list
[19:39] OH HELLO i hear you take requests
[19:40] would you be so kind as to add the user getoutofmygalley to the list of xanga users to archive?
[19:40] someone i knew that died in 2008
[19:44] winr4r: I asked the project channel to add it
[19:48] ivan`: thank you :)
[19:57] ivan`: yo, what is the IRC channel, i'm fixing the wiki article
[19:59] winr4r: #jenga
[20:00] ta :)
[20:04] oh, apparently xanga.com triggers the wiki's spam filter
[20:04] you can put after the http://
[20:07] ivan`: rolled with that idea and 'd it :)
[21:43] so, I'm starting a dump of ftp://ftp.scene.org/pub/mirrors/scenesp.org/
[21:44] Anyone dumped the kickasstorrents forums?
[22:14] hey Smiley
[22:16] hey winr4r
[22:25] I've been making a copy of the front page of HN every 4 hours. Should I be sending my warc's somewhere for longer term storage?
[22:28] probably
[22:28] IA?
[22:29] jfranusic: that's great. and yeah, seconding IA
[22:30] I'm not sure what the best way to do that is
[22:30] it looks like there are different "collections"?
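(jfranusic's every-four-hours Hacker News grab, plus the "send it to IA" suggestion, can be wired together roughly as below. This is a hedged sketch: the item identifier and metadata values are placeholders, and the upload step assumes the internetarchive command-line client ("ia") is installed and already configured with an archive.org account.)

    # Grab the front page into a timestamped WARC (wget gzips WARCs by default).
    TS=$(date -u +%Y%m%d%H%M)
    wget --page-requisites \
         --warc-header "operator: Archive Team" \
         --warc-file="hn-frontpage-$TS" --warc-cdx \
         https://news.ycombinator.com/

    # Push the result into an archive.org item; identifier and metadata here
    # are placeholder choices, not an agreed-upon convention.
    ia upload "hn-frontpage-$TS" "hn-frontpage-$TS.warc.gz" \
       --metadata="mediatype:web" --metadata="title:HN front page $TS"

    # Run every 4 hours from cron, e.g.:
    #   0 */4 * * * /path/to/grab-hn.sh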
I'm not sure
[22:33] also, do general tools exist for looking at warc's? I'm considering writing my own
[22:33] I want to be able to do things like "expand" a warc into a directory, or just list its contents
[22:35] https://pypi.python.org/pypi/Warcat/1.8
[22:36] haha! that's _exactly_ what I was looking for, thanks winr4r
[22:36] :)
[22:36] Well, here goes…
[22:36] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[22:37] everdred: yahoosucks
[22:38] winr4r: Yes, that is clear to me, but what is the secret word? ;)
[22:39] everdred: yahoosucks
[22:39] :D
[22:43] Everdred (Talk | contribs) New user account
[22:43] * winr4r spins around in leather chair to face everdred, cigar in hand, arms outstretched, "WELCOME ABOARD SON"
[22:47] winr4r: I just bite it; it's for the look — I don't light it.
[23:28] I just sassed a reporter.
[23:29] Reporters love getting sassed.
[23:29] Wanted to interview me about greader feeds.
[23:29] I said "how about the myspace thing"
[23:29] He'll go "well, my assignment....."
[23:29] Yes, because if your editor sends you out to report on the squirrels and you pass a house fire, you need to get to those fuckin' squirrels.
[23:37] Can someone please WGET-WARC http://www.gont.com.ar/ before it disappears
[23:49] okay- working on it now
[23:52] dashcloud: if you need any help i have a debian box i can run stuff on
[23:52] here's the wget warc command I'm using: wget -e robots=off -r -l 0 -m -p --wait 1 --warc-header "operator: Archive Team" --warc-cdx --warc-file www_sitename_com http://www.sitename.com/
[23:52] i'm just not sure how to do a scraping from scratch, but yeah, i can run things
[23:52] do we have a place that hosts a standard or generic wget warc command?
[23:53] hm would be good to divide the effort if possible
[23:55] Took me a while to look up the 5000 switches I need, but I'm grabbing the gont site also.
[23:56] that's why I copied and pasted some of the ones I've used into a file - I just have to remember what the file's called now
[23:56] dashcloud: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
[23:57] dashcloud: that's the closest i've seen. it would be good to have a shorter page though i think
[23:57] also hm, might require using the AT Warrior but a quick way to say "get warcs of this full site" then pool the resources of a few users would be good
[23:59] but how do you parallelize it? otherwise you've got X people downloading exactly the same thing
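(On the parallelization question at the end: one low-tech way to split a grab like the passagen.se one across a few people, without everyone fetching the same pages, is to partition the user list up front and have each volunteer run the standard wget-warc recipe over their own chunk. A sketch, assuming userlist.txt is the ~115k-name list posted earlier; the per-user URL pattern is a guess.)

    # Split the user list into chunks, one per volunteer.
    split -l 30000 userlist.txt chunk_

    # Each person then runs the wget-warc recipe from [23:52] over their chunk.
    while read -r user; do
      wget -e robots=off -r -l 0 -m -p --wait 1 \
           --warc-header "operator: Archive Team" --warc-cdx \
           --warc-file="passagen_${user}" \
           "http://hem.passagen.se/${user}/"   # URL pattern is an assumption
    done < chunk_aa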