#archiveteam 2013-06-18,Tue


Time Nickname Message
00:01 πŸ”— BlueMax g'day
01:26 πŸ”— godane hey Famicoman
01:27 πŸ”— godane i found 54 episodes of the lab with leo laporte
01:51 πŸ”— Famicoman on myspleen?
02:07 πŸ”— godane yes
02:07 πŸ”— godane i know there are lot more out there
02:08 πŸ”— godane thats cause there are lot of dead torrents of it
02:11 πŸ”— omf_ godane, how goes your hackernews backup
02:13 πŸ”— godane i decided not to do it
02:14 πŸ”— godane i got so much stuff to upload now and if i lose internet/wifi it would be incomplete
03:53 πŸ”— ivan` cool, someone wrote http://metatalk.metafilter.com/22734/Google-Seceder
04:34 πŸ”— SketchCow Can someone please do a WGET WARC grab of misc.yero.org/modulez
04:36 πŸ”— ivan` on it
04:37 πŸ”— ivan` it looks like the zip files are on another domain, I guess I have to do something extra
04:39 πŸ”— ivan` I will probably grep the files for "download:" and WARC all those links as well
04:43 πŸ”— Coderjoe wow. all the actual files were the responsibility of the artist to host.
04:52 πŸ”— ivan` yeah, that's nuts
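The "grep the files for download links" step mentioned above could be sketched roughly like this; the 'download' label comes from the chat, while the saved-file layout and the URL pattern are assumptions:

```python
import re
from pathlib import Path

# Sketch only: scan saved HTML files and collect every absolute http/ftp
# URL that appears on a line mentioning "download". The file extension
# and markup are assumptions about what the mirror looks like on disk.
LINK_RE = re.compile(r'\b(?:https?|ftp)://[^\s"\'<>]+')

def collect_download_links(root):
    """Return sorted unique URLs found on 'download' lines under root."""
    links = set()
    for path in Path(root).rglob("*.html"):
        for line in path.read_text(errors="replace").splitlines():
            if "download" in line.lower():
                links.update(LINK_RE.findall(line))
    return sorted(links)
```

The resulting list could then be fed back to a WARC-writing wget run with `-i links.txt`.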
09:48 πŸ”— Smiley one of you, please PLEASE PLEASE give me some xanga username lists
09:48 πŸ”— Smiley https://archive.org/details/archiveteam-xanga-userlist-20130142 << already grabbving this
11:20 πŸ”— ivan` SketchCow: grabbing the music on that site is going to be problematic because 80% are hosted at ftp://ftp.scenesp.org/pub/modulez/ and it is down
11:21 πŸ”— ivan` http://www.scenesp.org/ftp/modulez/ looks like the domain got expired
11:22 πŸ”— ivan` ftp://ftp.scene.org/pub/mirrors/scenesp.org/ ah here we go
11:23 πŸ”— Deewiant ivan`: In case you can't find it elsewhere: there's a comment from yero at http://misc.yero.org/modulez/ saying he has a backup of the stuff at scenesp
11:24 πŸ”— Deewiant Meaning that if you only find out of date mirrors you should be able to get the rest from him
11:24 πŸ”— antomatic wonder if there's any way to tell what the IP address of that FTP site used to be? The site itself might still be up even if the domain has disappeared
11:25 πŸ”— antomatic [I'm sure there used to be a web page that could give you that kind of historic info on a domain]
11:26 πŸ”— Deewiant http://dnshistory.org/dns-records/scenesp.org has two-year-old info
11:27 πŸ”— Deewiant Pointing to a domain park
11:34 πŸ”— ivan` zsh: segmentation fault (core dumped) wget --warc-file=misc.yero.org-music --warc-cdx --page-requisites -e 5 60
11:46 πŸ”— ivan` can someone try wget on their machine? http://paste.archivingyoursh.it/raw/sukenuvuti http://paste.archivingyoursh.it/raw/tesatikoke
12:02 πŸ”— Smiley ivan`: will do in a few minutes.
12:07 πŸ”— ivan` thanks
12:08 πŸ”— ivan` also, would anyone like to write a crawler to find livejournal users for greader-grab?
12:09 πŸ”— ivan` it is a simple pulling of friends on pages like http://makaalz.livejournal.com/profile?socconns=pfriends&mode_full_socconns=1&comms=cfriends
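A crawler along those lines might start from a sketch like this one; the href pattern for friend links is an assumption about the profile-page markup, not a verified LiveJournal format:

```python
import re

# Hypothetical sketch: pull friend usernames out of a fetched LiveJournal
# profile page (the ?socconns=pfriends pages linked above). The href
# pattern is an assumed shape for friend links, not a documented API.
FRIEND_RE = re.compile(r'href="https?://([a-z0-9_-]+)\.livejournal\.com/?"')

def extract_friends(html):
    """Return unique usernames linked from a profile page, in page order."""
    seen = []
    for name in FRIEND_RE.findall(html):
        if name != "www" and name not in seen:
            seen.append(name)
    return seen
```

A breadth-first crawl would fetch each new username's profile page and enqueue any names not seen before.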
12:18 πŸ”— ivan` GLaDOS: are you around to fork a repo into ArchiveTeam?
12:25 πŸ”— GLaDOS ivan`: da
12:27 πŸ”— ivan` GLaDOS: sec
12:29 πŸ”— ivan` GLaDOS: https://github.com/ludios/greader-directory-grab
12:30 πŸ”— GLaDOS AM FORK =) https://github.com/ArchiveTeam/greader-directory-grab
12:30 πŸ”— ivan` thanks!
12:42 πŸ”— Smiley wget --warc-file=misc.yero.org-music --warc-cdx --page-requisites -e 5 60
12:42 πŸ”— Smiley that ivan` ?
12:42 πŸ”— Smiley you're missing a url
12:44 πŸ”— Smiley I don't know if your paste got eaten
12:44 πŸ”— ivan` see http://paste.archivingyoursh.it/raw/sukenuvuti
12:45 πŸ”— Smiley the text file is the second paste?
12:45 πŸ”— ivan` yes
12:45 πŸ”— ivan` if it segfaults for you too after the first two URLs, maybe try wget-lua
12:46 πŸ”— Smiley here goes nothing
12:46 πŸ”— Smiley Segmentation fault
12:52 πŸ”— Smiley hmmm I'm building wget-lua :D
12:53 πŸ”— ivan` valgrind says it's a warc-writing bug
12:53 πŸ”— omf_ That wouldn't surprise me
12:53 πŸ”— Smiley uyrgh
12:53 πŸ”— Smiley so this might not help.
12:53 πŸ”— omf_ If you look at the source code for wget you will notice there is 0% test coverage of warc support
12:54 πŸ”— Smiley :D
12:55 πŸ”— ivan` yeah, wget-lua also segfaults
12:56 πŸ”— Smiley Doh
12:57 πŸ”— Smiley _music_urls.fixed.txt: Invalid URL http:/r: Unsupported scheme 'http'
12:57 πŸ”— Smiley _music_urls.fixed.txt: Invalid URL ftp:b/stream/noof_-_no_shine.mp3: Unsupported scheme 'ftp'
12:57 πŸ”— Smiley lol wut
12:57 πŸ”— Smiley http unsupported?
12:57 πŸ”— ivan` should have two slashes after schema, very silly error message
12:57 πŸ”— ivan` reproducible with: valgrind ./wget-lua --warc-file=misc.yero.org-music --warc-cdx ftp://ftp.scene.org/pub/mirrors/scenesp.org/modulez/bitl/chvalley.it
12:58 πŸ”— Smiley the problem is ftp...
13:01 πŸ”— Smiley http sources work fine
13:02 πŸ”— ivan` https://www.refheap.com/15873
13:07 πŸ”— Smiley Right, trying to grab all non-ftp sources at least.
13:08 πŸ”— ivan` I'll mail the bug-wget list
13:14 πŸ”— ersi ivan`: I'd reproduce it with the original wget first
13:16 πŸ”— Smiley it is something to do with ftp tho
13:19 πŸ”— Smiley ivan`: most of these files are 404, I presume thats expected?
13:24 πŸ”— ivan` ersi: I did
13:24 πŸ”— ivan` Smiley: I don't know
13:24 πŸ”— ivan` most likely
13:29 πŸ”— ersi ivan`: Ah, alright. I just meant that it might look less severe if one would mention wget-lua, reproducibility and all that jazz
13:29 πŸ”— ersi great stuff though
13:38 πŸ”— balrog does anyone know of not-too-expensive CD autoloaders/autofeeders?
13:39 πŸ”— balrog (for bulk reading and archiving CDs or DVDs)
13:48 πŸ”— balrog the only thing I can find that might possibly work is sony's media changer line
13:50 πŸ”— balrog and those are discontinued and appear very difficult to get
13:55 πŸ”— ivan` it would be great if outside people could edit http://www.archiveteam.org/index.php?title=Google_Reader without connecting to efnet, which pretty much nobody knows how to do
13:57 πŸ”— Danneh_ maybe even embedded irc in the wiki, mibbit or whatever efnet can use?
14:20 πŸ”— Smiley ivan`: you want them editing it if they can't figure out IRC?!
14:20 πŸ”— ivan` Smiley: yes
14:20 πŸ”— Smiley (and point them to our pastebin and tell them to paste it there ..... then email you :/
17:38 πŸ”— Schbirid lots of 90s pages
17:38 πŸ”— Schbirid there are the domains hem1.passagen.se to hem3.passagen.se but they are the same ip and host the same content
17:38 πŸ”— Schbirid this looks like a worthy target for archival: http://www.passagen.se/hemsidor/hemsideindex/a.html
17:42 πŸ”— Schbirid ~115k homepages
17:44 πŸ”— Schbirid quick user list: http://p.defau.lt/?78lwBN4znhMBy6XcP1mcHw
17:44 πŸ”— Schbirid (1MB)
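Building that user list could be sketched as follows; one index page per initial letter is an assumption based on the a.html URL above (a–z only, so any Swedish-letter pages would be missed), and the link markup is also an assumption:

```python
import re
import string

# Sketch of harvesting passagen.se usernames from the hemsideindex pages.
# INDEX_URL and HOME_RE encode assumptions about the site layout; hem1,
# hem2 and hem3 were reported to be mirrors, so the host digit is ignored.
INDEX_URL = "http://www.passagen.se/hemsidor/hemsideindex/{}.html"
HOME_RE = re.compile(r'href="http://hem\d\.passagen\.se/([^/"]+)/?"')

def index_urls():
    """The assumed index pages, a.html .. z.html."""
    return [INDEX_URL.format(c) for c in string.ascii_lowercase]

def extract_users(html):
    """Unique usernames linked from one fetched index page."""
    return sorted(set(HOME_RE.findall(html)))
```

Concatenating `extract_users` over every index page would reproduce a list like the ~115k-name paste above.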
18:31 πŸ”— Tephra Schbirid: is it complete?
19:16 πŸ”— Schbirid it should be all the names from the hemsideindex
19:16 πŸ”— Schbirid nighty
19:38 πŸ”— Tephra Ok so im grabbing all of the websites from that list
19:39 πŸ”— winr4r OH HELLO i hear you take requests
19:40 πŸ”— winr4r would you be so kind as to add the user getoutofmygalley to the list of xanga users to archive?
19:40 πŸ”— winr4r someone i knew that died in 2008
19:44 πŸ”— ivan` winr4r: I asked the project channel to add it
19:48 πŸ”— winr4r ivan`: thank you :)
19:57 πŸ”— winr4r ivan`: yo, what is the IRC channel, i'm fixing the wiki article
19:59 πŸ”— ivan` winr4r: #jenga
20:00 πŸ”— winr4r ta :)
20:04 πŸ”— winr4r oh, apparently xanga.com triggers the wiki's spam filter
20:04 πŸ”— ivan` you can put <font></font> after the http://
20:07 πŸ”— winr4r ivan`: rolled with that idea and <nowiki>'d it :)
21:43 πŸ”— dashcloud so, I'm starting a dump of ftp://ftp.scene.org/pub/mirrors/scenesp.org/
21:44 πŸ”— Smiley Anyone dumped the kickasstorrents forums?
22:14 πŸ”— winr4r hey Smiley
22:16 πŸ”— Smiley hey winr4r
22:25 πŸ”— jfranusic I've been making a copy of the front page of HN every 4 hours. Should I be sending my warc's somewhere for longer term storage?
22:28 πŸ”— winr4r probably
22:28 πŸ”— winr4r IA?
22:29 πŸ”— arrith1 jfranusic: that's great. and yeah, seconding IA
22:30 πŸ”— jfranusic I'm not sure what the best way to do that is
22:30 πŸ”— jfranusic it looks like there are different "collections"? I'm not sure
22:33 πŸ”— jfranusic also, do general tools exist for looking at warc's? I'm considering writing my own
22:33 πŸ”— jfranusic I want to be able to do things like "expand" a warc into a directory, or just list its contents
22:35 πŸ”— winr4r https://pypi.python.org/pypi/Warcat/1.8
22:36 πŸ”— jfranusic haha! that's _exactly_ what I was looking for, thanks winr4r
22:36 πŸ”— winr4r :)
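For a feel of what "list a WARC's contents" involves under the hood, here is a minimal sketch that walks uncompressed WARC records by hand and reports each record's type and target URI (real files are usually .warc.gz and a tool like Warcat handles far more edge cases, e.g. header terminators appearing inside a payload):

```python
# Minimal sketch of a WARC lister. Each WARC record is a version line,
# header lines, a blank line, Content-Length bytes of payload, then a
# blank-line record separator (CRLF line endings throughout).
def list_warc(data):
    """Return [(WARC-Type, WARC-Target-URI), ...] for raw WARC bytes."""
    pos = 0
    out = []
    while pos < len(data):
        end = data.index(b"\r\n\r\n", pos)       # end of the header block
        lines = data[pos:end].decode("utf-8").split("\r\n")
        assert lines[0].startswith("WARC/")       # version line, e.g. WARC/1.0
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        out.append((headers.get("WARC-Type"), headers.get("WARC-Target-URI")))
        length = int(headers["Content-Length"])
        pos = end + 4 + length + 4                # payload + record separator
    return out
```

Expanding a WARC into a directory is the same walk, except each payload is written to a file named after its target URI.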
22:36 πŸ”— everdred Well, here goes…
22:36 πŸ”— everdred WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
22:37 πŸ”— winr4r everdred: yahoosucks
22:38 πŸ”— everdred winr4r: Yes, that is clear to me, but what is the secret word? ;)
22:39 πŸ”— underscor everdred: yahoosucks
22:39 πŸ”— underscor :D
22:43 πŸ”— winr4r Everdred (Talk | contribs) New user account
22:43 πŸ”— * winr4r spins around in leather chair to face everdred, cigar in hand, arms outstretched, "WELCOME ABOARD SON"
22:47 πŸ”— everdred winr4r: I just bite it; it's for the look, I don't light it.
23:28 πŸ”— SketchCow I just sassed a reporter.
23:29 πŸ”— SketchCow Reporters love getting sassed.
23:29 πŸ”— SketchCow Wanted to interview me about greader feeds.
23:29 πŸ”— SketchCow I said "how about the myspace thing"
23:29 πŸ”— SketchCow He'll go "well, my assignment....."
23:29 πŸ”— SketchCow Yes, because if your editor sends you out to report on the squirrels and you pass a house fire, you need to get to those fuckin' squirrels.
23:37 πŸ”— SketchCow Can someone please WGET-WARC http://www.gont.com.ar/ before it disappears
23:49 πŸ”— dashcloud okay- working on it now
23:52 πŸ”— arrith1 dashcloud: if you need any help i have a debian box i can run stuff on
23:52 πŸ”— dashcloud here's the wget warc command I'm using: wget -e robots=off -r -l 0 -m -p --wait 1 --warc-header "operator: Archive Team" --warc-cdx --warc-file www_sitename_com http://www.sitename.com/
23:52 πŸ”— arrith1 i'm just not sure how to do a scrape from scratch, but yeah, i can run things
23:52 πŸ”— dashcloud do we have a place that hosts a standard or generic wget warc command?
23:53 πŸ”— arrith1 hm would be good to divide the effort if possible
23:55 πŸ”— DoubleJ Took me a while to look up the 5000 switches I need, but I'm grabbing the gont site also.
23:56 πŸ”— dashcloud that's why I copied and pasted some of the ones I've used into a file- I just have to remember what the file's called now
23:56 πŸ”— arrith1 dashcloud: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
23:57 πŸ”— arrith1 dashcloud: that's the closest i've seen. it would be good to have a shorter page though i think
23:57 πŸ”— arrith1 also hm, might require using the AT Warrior but a quick way to say "get warcs of this full site" then pool the resources of a few users would be good
23:59 πŸ”— dashcloud but how do you parallelize it? otherwise you've got X people downloading exactly the same thing
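One deterministic answer to that question: everyone works from the same URL list and takes only the URLs that hash into their own shard, so no two people fetch the same page. A sketch, where the shard count and worker ids are whatever the group agrees on:

```python
import hashlib

# Sketch of deterministic work-sharing for a group grab: hash each URL
# and keep only those that land in this worker's shard. Every worker can
# compute the split independently from the shared list, with no
# coordination beyond agreeing on num_workers and their own worker_id.
def my_share(urls, num_workers, worker_id):
    """Return the subset of urls assigned to worker_id."""
    def shard(url):
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_workers
    return [u for u in urls if shard(u) == worker_id]
```

Each participant would then run their usual WARC-writing wget over only their own share (e.g. via `-i my_share.txt`); the Warrior's tracker automates the same idea at scale.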
