#archiveteam 2013-06-18,Tue


Time Nickname Message
00:01 πŸ”— BlueMax g'day
01:26 πŸ”— godane hey Famicoman
01:27 πŸ”— godane i found 54 episodes of the lab with leo laporte
01:51 πŸ”— Famicoman on myspleen?
02:07 πŸ”— godane yes
02:07 πŸ”— godane i know there are lot more out there
02:08 πŸ”— godane thats cause there are lot of dead torrents of it
02:11 πŸ”— omf_ godane, how goes your hackernews backup
02:13 πŸ”— godane i decided not to do it
02:14 πŸ”— godane i got so much stuff to upload now and if i lose internet/wifi it would be incomplete
03:53 πŸ”— ivan` cool, someone wrote http://metatalk.metafilter.com/22734/Google-Seceder
04:34 πŸ”— SketchCow Can someone please do a WGET WARC grab of misc.yero.org/modulez
04:36 πŸ”— ivan` on it
04:37 πŸ”— ivan` it looks like the zip files are on another domain, I guess I have to do something extra
04:39 πŸ”— ivan` I will probably grep the files for "download:" and WARC all those links as well
04:43 πŸ”— Coderjoe wow. all the actual files were the responsibility of the artist to host.
04:52 πŸ”— ivan` yeah, that's nuts
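The "grep the files for download links" step mentioned above could be sketched roughly like this; the 'download' label comes from the chat, while the saved-file layout and the URL pattern are assumptions:

```python
import re
from pathlib import Path

# Sketch only: scan saved HTML files and collect every absolute http/ftp
# URL that appears on a line mentioning "download". The file extension
# and markup are assumptions about what the mirror looks like on disk.
LINK_RE = re.compile(r'\b(?:https?|ftp)://[^\s"\'<>]+')

def collect_download_links(root):
    """Return sorted unique URLs found on 'download' lines under root."""
    links = set()
    for path in Path(root).rglob("*.html"):
        for line in path.read_text(errors="replace").splitlines():
            if "download" in line.lower():
                links.update(LINK_RE.findall(line))
    return sorted(links)
```

The resulting list could then be fed back to a WARC-writing wget run with `-i links.txt`.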
09:48 πŸ”— Smiley one of you, please PLEASE PLEASE give me some xanga username lists
09:48 πŸ”— Smiley https://archive.org/details/archiveteam-xanga-userlist-20130142 << already grabbving this
11:20 πŸ”— ivan` SketchCow: grabbing the music on that site is going to be problematic because 80% are hosted at ftp://ftp.scenesp.org/pub/modulez/ and it is down
11:21 πŸ”— ivan` http://www.scenesp.org/ftp/modulez/ looks like the domain got expired
11:22 πŸ”— ivan` ftp://ftp.scene.org/pub/mirrors/scenesp.org/ ah here we go
11:23 πŸ”— Deewiant ivan`: In case you can't find it elsewhere: there's a comment from yero at http://misc.yero.org/modulez/ saying he has a backup of the stuff at scenesp
11:24 πŸ”— Deewiant Meaning that if you only find out of date mirrors you should be able to get the rest from him
11:24 πŸ”— antomatic wonder if there's any way to tell what the IP address of that FTP site used to be? The site itself might still be up even if the domain has disappeared
11:25 πŸ”— antomatic [I'm sure there used to be a web page that could give you that kind of historic info on a domain]
11:26 πŸ”— Deewiant http://dnshistory.org/dns-records/scenesp.org has two-year-old info
11:27 πŸ”— Deewiant Pointing to a domain park
11:34 πŸ”— ivan` zsh: segmentation fault (core dumped) wget --warc-file=misc.yero.org-music --warc-cdx --page-requisites -e 5 60
11:46 πŸ”— ivan` can someone try wget on their machine? http://paste.archivingyoursh.it/raw/sukenuvuti http://paste.archivingyoursh.it/raw/tesatikoke
12:02 πŸ”— Smiley ivan`: will do in a few minutes.
12:07 πŸ”— ivan` thanks
12:08 πŸ”— ivan` also, would anyone like to write a crawler to find livejournal users for greader-grab?
12:09 πŸ”— ivan` it is a simple pulling of friends on pages like http://makaalz.livejournal.com/profile?socconns=pfriends&mode_full_socconns=1&comms=cfriends
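A crawler along those lines might start from a sketch like this one; the href pattern for friend links is an assumption about the profile-page markup, not a verified LiveJournal format:

```python
import re

# Hypothetical sketch: pull friend usernames out of a fetched LiveJournal
# profile page (the ?socconns=pfriends pages linked above). The href
# pattern is an assumed shape for friend links, not a documented API.
FRIEND_RE = re.compile(r'href="https?://([a-z0-9_-]+)\.livejournal\.com/?"')

def extract_friends(html):
    """Return unique usernames linked from a profile page, in page order."""
    seen = []
    for name in FRIEND_RE.findall(html):
        if name != "www" and name not in seen:
            seen.append(name)
    return seen
```

A breadth-first crawl would fetch each new username's profile page and enqueue any names not seen before.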
12:18 πŸ”— ivan` GLaDOS: are you around to fork a repo into ArchiveTeam?
12:25 πŸ”— GLaDOS ivan`: da
12:27 πŸ”— ivan` GLaDOS: sec
12:29 πŸ”— ivan` GLaDOS: https://github.com/ludios/greader-directory-grab
12:30 πŸ”— GLaDOS AM FORK =) https://github.com/ArchiveTeam/greader-directory-grab
12:30 πŸ”— ivan` thanks!
12:42 πŸ”— Smiley wget --warc-file=misc.yero.org-music --warc-cdx --page-requisites -e 5 60
12:42 πŸ”— Smiley that ivan` ?
12:42 πŸ”— Smiley you're missing a url
12:44 πŸ”— Smiley I don't know if your paste got eaten
12:44 πŸ”— ivan` see http://paste.archivingyoursh.it/raw/sukenuvuti
12:45 πŸ”— Smiley the text file is the second paste?
12:45 πŸ”— ivan` yes
12:45 πŸ”— ivan` if it segfaults for you too after the first two URLs, maybe try wget-lua
12:46 πŸ”— Smiley here goes nothing
12:46 πŸ”— Smiley Segmentation fault
12:52 πŸ”— Smiley hmmm I'm building wget-lua :D
12:53 πŸ”— ivan` valgrind says it's a warc-writing bug
12:53 πŸ”— omf_ That wouldn't surprise me
12:53 πŸ”— Smiley uyrgh
12:53 πŸ”— Smiley so this might not help.
12:53 πŸ”— omf_ If you look at the source code for wget you will notice there is 0% test coverage of warc support
12:54 πŸ”— Smiley :D
12:55 πŸ”— ivan` yeah, wget-lua also segfaults
12:56 πŸ”— Smiley Doh
12:57 πŸ”— Smiley _music_urls.fixed.txt: Invalid URL http:/r: Unsupported scheme 'http'
12:57 πŸ”— Smiley _music_urls.fixed.txt: Invalid URL ftp:b/stream/noof_-_no_shine.mp3: Unsupported scheme 'ftp'
12:57 πŸ”— Smiley lol wut
12:57 πŸ”— Smiley http unsupported?
12:57 πŸ”— ivan` should have two slashes after schema, very silly error message
12:57 πŸ”— ivan` reproducible with: valgrind ./wget-lua --warc-file=misc.yero.org-music --warc-cdx ftp://ftp.scene.org/pub/mirrors/scenesp.org/modulez/bitl/chvalley.it
12:58 πŸ”— Smiley the problem is ftp...
13:01 πŸ”— Smiley http sources work fine
13:02 πŸ”— ivan` https://www.refheap.com/15873
13:07 πŸ”— Smiley Right, trying to grab all non-ftp sources at least.
13:08 πŸ”— ivan` I'll mail the bug-wget list
13:14 πŸ”— ersi ivan`: I'd reproduce it with the original wget first
13:16 πŸ”— Smiley it is something to do with ftp tho
13:19 πŸ”— Smiley ivan`: most of these files are 404, I presume thats expected?
13:24 πŸ”— ivan` ersi: I did
13:24 πŸ”— ivan` Smiley: I don't know
13:24 πŸ”— ivan` most likely
13:29 πŸ”— ersi ivan`: Ah, alright. I just meant that it might look less severe if one would mention wget-lua, reproducibility and all that jazz
13:29 πŸ”— ersi great stuff though
13:38 πŸ”— balrog does anyone know of not-too-expensive CD autoloaders/autofeeders?
13:39 πŸ”— balrog (for bulk reading and archiving CDs or DVDs)
13:48 πŸ”— balrog the only thing I can find that might possibly work is sony's media changer line
13:50 πŸ”— balrog and those are discontinued and appear very difficult to get
13:55 πŸ”— ivan` it would be great if outside people could edit http://www.archiveteam.org/index.php?title=Google_Reader without connecting to efnet, which pretty much nobody knows how to do
13:57 πŸ”— Danneh_ maybe even embedded irc in the wiki, mibbit or whatever efnet can use?
14:20 πŸ”— Smiley ivan`: you want them editing it if they can't figure out IRC?!
14:20 πŸ”— ivan` Smiley: yes
14:20 πŸ”— Smiley (and point them to our pastebin and tell them to paste it there ..... then email you :/
17:38 πŸ”— Schbirid lots of 90s pages
17:38 πŸ”— Schbirid there are the domains hem1.passagen.se to hem3.passagen.se but they are the same ip and host the same content
17:38 πŸ”— Schbirid this looks like a worthy target for archival: http://www.passagen.se/hemsidor/hemsideindex/a.html
17:42 πŸ”— Schbirid ~115k homepages
17:44 πŸ”— Schbirid quick user list: http://p.defau.lt/?78lwBN4znhMBy6XcP1mcHw
17:44 πŸ”— Schbirid (1MB)
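Building that user list could be sketched as follows; one index page per initial letter is an assumption based on the a.html URL above (a–z only, so any Swedish-letter pages would be missed), and the link markup is also an assumption:

```python
import re
import string

# Sketch of harvesting passagen.se usernames from the hemsideindex pages.
# INDEX_URL and HOME_RE encode assumptions about the site layout; hem1,
# hem2 and hem3 were reported to be mirrors, so the host digit is ignored.
INDEX_URL = "http://www.passagen.se/hemsidor/hemsideindex/{}.html"
HOME_RE = re.compile(r'href="http://hem\d\.passagen\.se/([^/"]+)/?"')

def index_urls():
    """The assumed index pages, a.html .. z.html."""
    return [INDEX_URL.format(c) for c in string.ascii_lowercase]

def extract_users(html):
    """Unique usernames linked from one fetched index page."""
    return sorted(set(HOME_RE.findall(html)))
```

Concatenating `extract_users` over every index page would reproduce a list like the ~115k-name paste above.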
18:31 πŸ”— Tephra Schbirid: is it complete?
19:16 πŸ”— Schbirid it should be all the names from the hemsideindex
19:16 πŸ”— Schbirid nighty
19:38 πŸ”— Tephra Ok so im grabbing all of the websites from that list
19:39 πŸ”— winr4r OH HELLO i hear you take requests
19:40 πŸ”— winr4r would you be so kind as to add the user getoutofmygalley to the list of xanga users to archive?
19:40 πŸ”— winr4r someone i knew that died in 2008
19:44 πŸ”— ivan` winr4r: I asked the project channel to add it
19:48 πŸ”— winr4r ivan`: thank you :)
19:57 πŸ”— winr4r ivan`: yo, what is the IRC channel, i'm fixing the wiki article
19:59 πŸ”— ivan` winr4r: #jenga
20:00 πŸ”— winr4r ta :)
20:04 πŸ”— winr4r oh, apparently xanga.com triggers the wiki's spam filter
20:04 πŸ”— ivan` you can put <font></font> after the http://
20:07 πŸ”— winr4r ivan`: rolled with that idea and <nowiki>'d it :)
21:43 πŸ”— dashcloud so, I'm starting a dump of ftp://ftp.scene.org/pub/mirrors/scenesp.org/
21:44 πŸ”— Smiley Anyone dumped the kickasstorrents forums?
22:14 πŸ”— winr4r hey Smiley
22:16 πŸ”— Smiley hey winr4r
22:25 πŸ”— jfranusic I've been making a copy of the front page of HN every 4 hours. Should I be sending my warc's somewhere for longer term storage?
22:28 πŸ”— winr4r probably
22:28 πŸ”— winr4r IA?
22:29 πŸ”— arrith1 jfranusic: that's great. and yeah, seconding IA
22:30 πŸ”— jfranusic I'm not sure what the best way to do that is
22:30 πŸ”— jfranusic it looks like there are different "collections"? I'm not sure
22:33 πŸ”— jfranusic also, do general tools exist for looking at warc's? I'm considering writing my own
22:33 πŸ”— jfranusic I want to be able to do things like "expand" a warc into a directory, or just list its contents
22:35 πŸ”— winr4r https://pypi.python.org/pypi/Warcat/1.8
22:36 πŸ”— jfranusic haha! that's _exactly_ what I was looking for, thanks winr4r
22:36 πŸ”— winr4r :)
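For a feel of what "list a WARC's contents" involves under the hood, here is a minimal sketch that walks uncompressed WARC records by hand and reports each record's type and target URI (real files are usually .warc.gz and a tool like Warcat handles far more edge cases, e.g. header terminators appearing inside a payload):

```python
# Minimal sketch of a WARC lister. Each WARC record is a version line,
# header lines, a blank line, Content-Length bytes of payload, then a
# blank-line record separator (CRLF line endings throughout).
def list_warc(data):
    """Return [(WARC-Type, WARC-Target-URI), ...] for raw WARC bytes."""
    pos = 0
    out = []
    while pos < len(data):
        end = data.index(b"\r\n\r\n", pos)       # end of the header block
        lines = data[pos:end].decode("utf-8").split("\r\n")
        assert lines[0].startswith("WARC/")       # version line, e.g. WARC/1.0
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        out.append((headers.get("WARC-Type"), headers.get("WARC-Target-URI")))
        length = int(headers["Content-Length"])
        pos = end + 4 + length + 4                # payload + record separator
    return out
```

Expanding a WARC into a directory is the same walk, except each payload is written to a file named after its target URI.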
22:36 πŸ”— everdred Well, here goes…
22:36 πŸ”— everdred WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
22:37 πŸ”— winr4r everdred: yahoosucks
22:38 πŸ”— everdred winr4r: Yes, that is clear to me, but what is the secret word? ;)
22:39 πŸ”— underscor everdred: yahoosucks
22:39 πŸ”— underscor :D
22:43 πŸ”— winr4r Everdred (Talk | contribs) New user account
22:43 πŸ”— * winr4r spins around in leather chair to face everdred, cigar in hand, arms outstretched, "WELCOME ABOARD SON"
22:47 πŸ”— everdred winr4r: I just bite it; it's for the look, I don't light it.
23:28 πŸ”— SketchCow I just sassed a reporter.
23:29 πŸ”— SketchCow Reporters love getting sassed.
23:29 πŸ”— SketchCow Wanted to interview me about greader feeds.
23:29 πŸ”— SketchCow I said "how about the myspace thing"
23:29 πŸ”— SketchCow He'll go "well, my assignment....."
23:29 πŸ”— SketchCow Yes, because if your editor sends you out to report on the squirrels and you pass a house fire, you need to get to those fuckin' squirrels.
23:37 πŸ”— SketchCow Can someone please WGET-WARC http://www.gont.com.ar/ before it disappears
23:49 πŸ”— dashcloud okay- working on it now
23:52 πŸ”— arrith1 dashcloud: if you need any help i have a debian box i can run stuff on
23:52 πŸ”— dashcloud here's the wget warc command I'm using: wget -e robots=off -r -l 0 -m -p --wait 1 --warc-header "operator: Archive Team" --warc-cdx --warc-file www_sitename_com http://www.sitename.com/
23:52 πŸ”— arrith1 i'm just not sure how to do a scrape from scratch, but yeah, i can run things
23:52 πŸ”— dashcloud do we have a place that hosts a standard or generic wget warc command?
23:53 πŸ”— arrith1 hm would be good to divide the effort if possible
23:55 πŸ”— DoubleJ Took me a while to look up the 5000 switches I need, but I'm grabbing the gont site also.
23:56 πŸ”— dashcloud that's why I copied and pasted some of the ones I've used into a file- I just have to remember what the file's called now
23:56 πŸ”— arrith1 dashcloud: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
23:57 πŸ”— arrith1 dashcloud: that's the closest i've seen. it would be good to have a shorter page though i think
23:57 πŸ”— arrith1 also hm, might require using the AT Warrior but a quick way to say "get warcs of this full site" then pool the resources of a few users would be good
23:59 πŸ”— dashcloud but how do you parallelize it? otherwise you've got X people downloading exactly the same thing
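One deterministic answer to that question: everyone works from the same URL list and takes only the URLs that hash into their own shard, so no two people fetch the same page. A sketch, where the shard count and worker ids are whatever the group agrees on:

```python
import hashlib

# Sketch of deterministic work-sharing for a group grab: hash each URL
# and keep only those that land in this worker's shard. Every worker can
# compute the split independently from the shared list, with no
# coordination beyond agreeing on num_workers and their own worker_id.
def my_share(urls, num_workers, worker_id):
    """Return the subset of urls assigned to worker_id."""
    def shard(url):
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_workers
    return [u for u in urls if shard(u) == worker_id]
```

Each participant would then run their usual WARC-writing wget over only their own share (e.g. via `-i my_share.txt`); the Warrior's tracker automates the same idea at scale.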
