00:45 <omf_> How does the IA take in site grabs that do not have warcs?
00:47 <chronomex> they don't
00:47 <chronomex> well, not into waybackmachine
00:48 <omf_> What if you have all the data that makes the warc
00:48 <omf_> like the transfer time, size, headers, etc...
00:48 <chronomex> then I suppose you could make a warc?
00:49 <omf_> I guess I could write a conversion program.
00:56 <godane> you would need like a wget log of the files being grabbed for this to work
00:56 <godane> in theory
01:06 <chronomex> that won't have headers tho
01:07 <godane> that's why i said in theory
01:07 <godane> was not sure
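The conversion program omf_ describes is feasible whenever the headers and bodies were actually kept: a WARC "response" record is just a short block of WARC headers in front of the captured HTTP message. A minimal stdlib-only sketch of the record layout, per the WARC 1.0 spec (illustrative only; real tools also emit warcinfo and request records plus payload digests, and the URL and payload here are made up):

```python
import uuid
from datetime import datetime, timezone

def make_warc_response(url, http_headers, body):
    """Build one WARC 1.0 'response' record from already-captured data.

    http_headers: raw HTTP status line + headers, CRLF-separated, without
    the trailing blank line; body: the response payload bytes.
    """
    # the record block is the full HTTP message: headers, blank line, body
    payload = http_headers + b"\r\n\r\n" + body
    record_headers = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: " + url.encode("ascii"),
        b"WARC-Date: "
        + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ").encode("ascii"),
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode("ascii") + b">",
        b"Content-Type: application/http;msgtype=response",
        b"Content-Length: " + str(len(payload)).encode("ascii"),
    ]
    # a record ends with two CRLFs before the next one starts
    return b"\r\n".join(record_headers) + b"\r\n\r\n" + payload + b"\r\n\r\n"

rec = make_warc_response(
    "http://example.com/",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html",
    b"<html></html>",
)
```

As godane and chronomex note, the catch is exactly the missing pieces: without a log of the original HTTP headers and fetch times, these fields would have to be reconstructed or faked.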
01:22 <ianweller> what.
01:22 <ianweller> so i went to bed thinking maybe the local warrior that i have running will stop
01:22 <ianweller> nope
01:22 <ianweller> it's on 7010 URLs and counting
01:25 <chronomex> perfect!
01:33 <marczak> Is there a script that I could run instead of using the warrior VM?
01:34 <marczak> I have a few extra IPs I could run from, but won't have a virtualized environment to run under.
01:34 <omf_> marczak, the peeps in #warrior can answer that
01:34 <marczak> great - thanks
01:36 <DrDeke> the answer is "yes" but i don't have a link to it handy
01:38 <marczak> DrDeke: thanks - someone in #warrior is helping out.
02:14 <omf_> For all the new warriors out there we have long-term projects after yahoo and posterous. #urlteam is constantly unfucking the url shorteners so we can find sites without twitter, bitly, etc...
02:15 <omf_> That is our proactive side of saving the web.
02:31 <ersi> mah, don't send people to #warrior when they're asking project-specific questions
02:32 <ersi> marczak: You can run the scripts from: https://github.com/ArchiveTeam/yahoomessages-grab/
02:32 <ersi> those are the stand-alone ones. You'll need to compile wget though (script is checked in there ^) and install the seesaw python package.
03:40 <SketchCow> I think we just exploded the Yahoo
03:40 <pilgrim> well they had it coming
03:46 <godane> i just saved low rider world 2006 clip of attack of the show
03:46 <godane> it was one of the flvsm videos that i couldn't get
03:55 <SketchCow> We just destroyed the Yahoo! backlog
03:57 <DFJustin> and how
03:57 <SketchCow> The graph looks like a zombie death apocalypse
04:00 <SketchCow> root@teamarchive-1:/2/DISCOGS/www.discogs.com/data# du -sh .
04:00 <SketchCow> 40G	.
04:00 <SketchCow> by the way
04:11 <omf_> SketchCow, was that a preventative grab?
04:13 <SketchCow> Yes
04:13 <SketchCow> I'm working with MusicBrainz to get their stuff on archive.
04:13 <SketchCow> And they said "You know, I don't know of any mirrors of discogs.org"
04:14 <DFJustin> might do vgmdb.net while you're at it
04:20 <SketchCow> Show me where you can download the DB and I will.
04:20 <omf_> DFJustin, I already got a grab of vgmdb.net
04:21 <omf_> it is about 8 months old though
04:21 <DFJustin> o/\o
04:22 <omf_> I want to merge some of their data into freebase
06:33 <chronomex> why don't we have all warriors running urlteam in the background all the time?
06:35 <chronomex> :)
06:35 <omf_> It would help
07:12 <omf_> We need to recruit someone who has google fiber, it could be real helpful
07:13 <omf_> just throwing that out there
07:49 <SketchCow> Man, it's going k-razy out there
07:49 <SketchCow> My Hard Drive full of goodness goes out Monday
07:49 <SketchCow> Working now to build up the maximum amount of data on it
07:50 <omf_> You ship hard drives as well as upload? Talk about no stone unturned :)
08:01 <SketchCow> Have to.
08:01 <SketchCow> I send in 400-500gb a hit
08:03 <chronomex> whumph whumph
08:05 <omf_> Do you have shock proof cases for mailing? I always wanted to ask how those work out.
08:06 <chronomex> if I were mailing hdds I'd probably reuse original hdd packing materials
08:06 <chronomex> seems to work
08:26 <ivan`> in case I get hit by a meteor in the next 3 months somebody better remember to scrape all of Reader's *.blogspot.com/atom.xml feeds in addition to the feed URLs they currently use
08:26 <ivan`> e.g. xooglers.blogspot.com/atom.xml gets you completely different content
11:00 <SketchCow> chronomex: I do.
11:52 <ersi> ivan`: Different content than what?
14:54 <omf_> Our clown information is growing nicely. If you have any observations you would like to add: http://www.archiveteam.org/index.php?title=Clown_hosting
16:19 <chazchaz> omf_: Are there any guidelines for including providers in that list?
16:22 <omf_> website url, price point, specs, and any insights into why the service works so well or problems with it
16:23 <omf_> the joyent and DO ones are good examples we have built out
16:23 <omf_> we have vps and cloud providers on there
16:25 <omf_> bandwidth and storage are right up there with price point as important data we need
16:57 <chazchaz> Ok, I added BuyVM
16:59 <omf_> chazchaz, you use them recently?
16:59 <chazchaz> Yeah, I have 2 servers with them.
16:59 <chazchaz> One for over a year
17:03 <omf_> What can you fit in 128mb ram
17:04 <omf_> I cannot think of too much you could run
17:04 <omf_> I could host my photos on there. Cheaper than flickr
17:07 <neurophyr> edis.at has a good 128MB miniVPS option.
17:07 <neurophyr> I run lower-traffic Tor relays and bridges on that kind of box.
17:08 <neurophyr> and it was quite happy to run the yahoomessages-grab script.
17:09 <chazchaz> omf_: They let you burst up to 2x as long as it's available, which seems to be almost all the time. I'm using 150 MB for 40 posterous processes and 2 yahoo-messages processes
17:10 <omf_> chazchaz, you should make a note on the wiki, that is valuable info
17:13 <chazchaz> done
17:14 <omf_> thanks
17:36 <DrDeke> i'm kind of offended that there is a wiki page called "Clown hosting" and my apartment closet isn't eligible to be listed in it ;)
17:37 <DrDeke> outage notifications? pshhh, yeah if you have a VM on it, maybe i'll email you 5 minutes before i decide to take the server apart for some reason
17:39 <chazchaz> Just check it yourself. That's what ping is for, right?
17:41 <DrDeke> exactly!
17:41 <DrDeke> i made a major jump in my level of customer service a couple months ago when i put everyone's email address that i could track down in a google spreadsheet
17:41 <DrDeke> sometimes it gets copy and pasted into a bcc
17:41 <DrDeke> sometimes... =)
17:42 <DrDeke> (nobody is paying, so, you know...)
17:42 <chronomex> 'wall' ought to be acceptable notice for planned maintenance
17:42 <DrDeke> i actually got to do that on a couple servers at my real job last night
17:43 <DrDeke> "Oh, we forgot to mention that part in the email? Well, just shutdown +30 it, the users will be fine."
17:43 <DrDeke> (needless to say, that is not the way it normally works there)
17:43 <DrDeke> since the system these servers are for was going to be completely down anyway, we figured oh well
19:31 <omf_> Did someone already grab the ign forums?
19:41 <Smiley> omf_: ask in #ispygames
19:41 <Smiley> someone was doing work on a lot of that stuff there
19:41 <omf_> that is me
19:42 <omf_> I just checked the scrollback to the 22nd of last month and nothing
19:45 <Smiley> D:
19:45 <Smiley> sorry for being an idiot then ;)
19:46 <omf_> No worries. It is hard to follow so many projects going on.
19:46 <Smiley> aye
19:46 <omf_> I know some forums for some sites were grabbed but nothing about the main ign
19:48 <omf_> The wiki is down
19:49 <omf_> Resource Limit Is Reached errors a few times
19:49 <omf_> seems fine again now
20:29 <SketchCow> It happens.
20:30 <omf_> SketchCow, Is it alright if I start uploading that 4data to you?
20:31 <omf_> It is 102gb
20:31 <omf_> and it will probably take over a week to upload, possibly longer
20:34 <SketchCow> What 4data?
20:34 <SketchCow> I mean, I'm sure we discussed it. What is it?
20:35 <omf_> The 4chandata dump
20:35 <omf_> from that archive site that is closed
20:35 <SketchCow> Oh, of course.
20:35 <SketchCow> Yeah, go ahead. Do you need credentials?
20:35 <omf_> I already got them
20:36 <omf_> I am still waiting on the database dump itself but I am not worried. This guy has come through on everything he said so far
21:19 <Nimbulan> 4 Get your free Psybnc 100 user have it come http://www.multiupload.nl/B11JFCYQH6
21:20 <Marcelo> lol
21:22 <soultcer> In case anyone is wondering: https://www.virustotal.com/en/file/f897432de88adce73b23741da1a133b6a79b8233d50571451dab4b992931d173/analysis/1364160122/
21:23 <chronomex> errrr
21:23 <chronomex> what's that from?
21:23 <soultcer> That's the free Psybnc
21:23 <soultcer> Hm, I wonder if xchat logs bans
21:24 <Marcelo> So many nicknames for this virus.
21:33 <chronomex> is there a ratelimiter on formspring?
21:37 <zenpho> howdy doo! I'm reporting back. Soultcer helped me yesterday with digging into the btinternet stuff (http://archive.org/details/archiveteam-btinternet)
21:38 <soultcer> Did it work?
21:39 <zenpho> yes indeedie! - i wrote some horrible awk scripts to parse the CDX files for stuff I was interested in, download via curl, unpack, and now I'm browsing thru some vintage .au and .wav files ... very cool
21:39 <soultcer> Sweet
21:41 <zenpho> very kind of you to help and encourage me to carry on, i was almost convinced that the megawarc files would have to be downloaded in their entirety (or at least an entire megawarc) to get anything out of them
21:42 <zenpho> i was right about to say "ehh.... it probably doesn't work like that", and give up, but you convinced me. and it's certainly very cool to browse thru this stuff!
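zenpho's workflow works because a CDX index records where each response lives inside the megawarc, so a single record can be pulled out with an HTTP Range request and gunzipped on its own (each record in a .warc.gz is a separate gzip member). A Python sketch of the same idea; the sample CDX line below is invented, and the field order assumes the common 11-column "CDX N b a m s k r M S V g" layout, where S is the compressed record size and V its offset:

```python
import gzip
import urllib.request

# assumed 11-field order: urlkey, timestamp, original-url, mime, status,
# digest, redirect, meta, compressed-length, compressed-offset, warc-filename
def parse_cdx_line(line):
    f = line.split()
    return {"url": f[2], "length": int(f[8]), "offset": int(f[9]), "file": f[10]}

def fetch_record(warc_url, offset, length):
    """Range-request one gzip member out of a megawarc, so the multi-GB
    file never has to be downloaded whole."""
    req = urllib.request.Request(
        warc_url,
        headers={"Range": "bytes=%d-%d" % (offset, offset + length - 1)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())

# hypothetical CDX entry, just to show the parsing
entry = parse_cdx_line(
    "uk,co,btinternet)/~someone/sound.wav 20120101120000 "
    "http://www.btinternet.com/~someone/sound.wav audio/x-wav 200 "
    "AAAA2222BBBB - - 5120 1048576 bt-000.warc.gz"
)
# record = fetch_record("https://archive.org/download/.../bt-000.warc.gz",
#                       entry["offset"], entry["length"])
```

The awk-plus-curl pipeline zenpho describes is the same thing expressed with `curl -r offset-end` against the archive.org download URL.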
21:47 <ersi> Neat :)
21:50 <alard> chronomex: Yes.
21:50 <alard> (I set a rate limit on the tracker, that is.)
21:50 <chronomex> ah
21:51 <alard> But that limit is not reached at the moment. I set it to 20 to be safe, but we're currently at 2-4 per minute.
21:52 <chronomex> I meant running multiple threads on my end
21:53 <alard> I don't know how Formspring behaves.
21:54 <chronomex> ok, I'll just run 1 for now
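The tracker-side limit alard describes (20 item hand-outs per minute) can be pictured as a token bucket. This is only an illustration of the concept, not the tracker's actual code:

```python
import time

class RateLimiter:
    """Minimal token bucket: allow at most `rate` grants per `per` seconds."""

    def __init__(self, rate, per):
        self.rate = rate
        self.per = per
        self.tokens = float(rate)   # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at the bucket size
        self.tokens = min(
            self.rate, self.tokens + (now - self.last) * self.rate / self.per
        )
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

limiter = RateLimiter(rate=20, per=60)  # the 20-per-minute setting mentioned above
grants = sum(limiter.allow() for _ in range(25))
```

With traffic at 2-4 requests per minute, the bucket stays full and the limit never bites, which matches what alard observes.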
22:11 <wp494> would it be possible to get a message asking for assistance on the formspring project in the topic?
22:12 <chronomex> sure, is there a channel for it?
22:12 <alard> wp494: Are we sure that it works?
22:13 <wp494> alard: yep, I've been running 3 concurrent for an hour or two and haven't run into any issues
22:13 <wp494> chronomex: #firespring
22:13 <wp494> and others that pop up on the tracker appear to have no issues
22:15 <alard> wp494: Yes, that's one thing. But does it get everything we want to get?
22:16 <alard> It's a complicated script.
22:16 <wp494> hrm
22:16 <wp494> if you want to hold off on adding to the topic, feel free
22:17 <chronomex> I'm inclined to wait for alard to sign off
22:17 <alard> I've checked one or two warcs and they looked good (with the last version of the script, at least).
22:18 <alard> We could go with full force, but there's a small risk that we need to do things again.
22:18 <alard> I haven't been able to find out about the pagination on the photo albums, for example.
22:18 <chronomex> hm
22:18 <alard> (Because I haven't found a user with enough photos.)
22:21 <wp494> have you tried any triple-digit/close to triple-digit users?
22:21 <wp494> (in file size terms)
22:24 <alard> Good idea. I just did that, but didn't see any user with more than 20 pictures. They're big because of something else.
22:28 <wp494> probably formspringaholics
22:32 <omf_> DFJustin, Did you want a copy of vgmdb?
22:33 <alard> I think Formspring works well enough. Checked another warc with the warc-proxy, no missing pages.
22:34 <alard> If there are people with too many pictures they'll at least be included via the Previous-Next buttons.
22:35 <alard> There are a few pagination things that don't work (the 'who smiled at this' thing, for example), but that's due to Formspring.
22:42 <chronomex> namespace | I'm worried about google groups.
22:42 <chronomex> chronomex | hmmmmmmm
22:42 <chronomex> namespace | It's basically dead as far as I can tell, and to my knowledge is one of the largest usenet archives.
22:42 <chronomex> chronomex | I'm with you there
22:42 <chronomex> chronomex | it'd be good to turn it back into a news spool
22:42 <chronomex> chronomex | the way usenet was meant to be
22:42 <chronomex> yes, ggroups is a worthy opponent
22:43 <namespace> And because it's google, you know that the shutdown is a matter of when, not if.
22:43 <thomasbk> do you think google wouldn't be willing to ship some hard drives to the internet archive if they ever shut ggroups down?
22:43 <namespace> True.
22:43 <namespace> I'd hope they would anyway.
22:43 <chronomex> we'd need to find a crooked googler
23:04 <omf_> From my own research we can piece together sections of usenet history with what is already available
23:04 <omf_> which is better than nothing.
23:05 <DFJustin> omf_: I don't personally want a copy but having one on archive.org would be nice
23:05 <omf_> I am doing a refresh on it now
23:05 <ersi> thomasbk: Always assume the answer to that question is no, unless you're sure
23:06 <ersi> That's my rule of thumb
23:06 <omf_> Universities still have tapes full of usenet archives
23:06 <omf_> it is just finding the tapes and people there who can pull the data out
23:07 <omf_> Another angle would be to get the usenet data loaded into BigQuery
23:07 <chronomex> tapes used to be really expensive
23:07 <DFJustin> from what I read google looked under a lot of rocks to get what they have, I'm not sure there's really a lot more out there
23:12 <thomasbk> anyone have any guesses wrt the legalities of rehosting stuff like the yahoo messages content?
23:13 <chronomex> nope
23:13 <ivan`> ersi: different from what you get from http://xooglers.blogspot.com/feeds/posts/default or http://xooglers.blogspot.com/
23:14 <ersi> ivan`: oh, huh
23:14 <ersi> thomasbk: most of us don't give two fucks about that
23:14 <omf_> I just checked up on my usenet sources
23:14 <omf_> I got partial archives going back over 10 years for some groups
23:15 <omf_> We could do it
23:15 <omf_> add that to what is already on the IA and we would have over 50% of everything as a starting point
23:21 <adamc[a]> The longer we wait, the harder it will be to find older data - makes sense to get started on it
23:22 <omf_> I can start cutting it up to feed to the warrior
23:22 <omf_> We are going to have to hit dozens of different archives
23:23 <omf_> I have been tracking this for a few years and there are more archives online now than before
23:23 <omf_> People are starting to open things up
23:23 <Lord_Nigh> i know google has a usenet archive but it's in their weird google-format (missing original headers etc?) so not super useful?
23:23 <omf_> plus hosting is cheaper for larger data sets
23:23 <Lord_Nigh> also missing all the attachments
23:23 <chronomex> I thought that google usenet posts are retrievable in original form
23:24 <omf_> they are
23:34 <zerovox> So I've been downloading on the yahoo task all day. It's taken about 12 hours to download nearly 10,000 urls on Item threads-b-1036-3. Can anyone check if someone else has submitted this by now? Or how many urls there will be?
23:35 <zerovox> Seems pretty slow, but I guess that's due to the rate limit?
23:49 <namespace> Question: Why isn't there a standard URL shortener algorithm in browsers?
23:50 <chronomex> gzip | base64 or something?
23:50 <namespace> Something like that.
23:51 <namespace> It's totally ridiculous that it's even a service. It's obviously something users want, and it could totally be done client side.
23:51 <namespace> I can't think of a single aspect that requires a server to be involved.
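chronomex's "gzip | base64" idea is easy to prototype, and prototyping it also hints at why it never shipped: for typical URLs the compression framing plus the base64 blow-up tends to leave the "short" form no shorter than the input. A sketch, with zlib's DEFLATE standing in for gzip:

```python
import base64
import zlib

def shorten(url):
    """Deflate the URL and emit a URL-safe base64 token, entirely client-side."""
    packed = zlib.compress(url.encode("utf-8"), 9)
    return base64.urlsafe_b64encode(packed).decode("ascii").rstrip("=")

def expand(token):
    # restore the base64 padding stripped by shorten()
    pad = "=" * (-len(token) % 4)
    return zlib.decompress(base64.urlsafe_b64decode(token + pad)).decode("utf-8")

url = "http://www.archiveteam.org/index.php?title=Clown_hosting"
token = shorten(url)
assert expand(token) == url
# no server involved -- but zlib headers plus the 4/3 base64 expansion
# usually make the token about as long as the original for short URLs
```

A server-side shortener sidesteps this by storing the long URL in a table and handing out a tiny key, which is exactly the lookup step the rest of this conversation is about.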
23:52 <omf_> namespace, do you know why people use url shorteners
23:54 <namespace> omf_: Because it's simple and long urls are ugly?
23:54 <namespace> (Unless it's for shock sites. But then why would you want to archive them?)
23:54 <namespace> That and for twitter.
23:56 <omf_> URL shortening services were invented as a way to add a step in the process which allows data to be collected on the user. This is then sold to ad companies
23:56 <omf_> that is the whole point of bitly etc
23:56 <omf_> It has no benefit to end users
23:56 <namespace> Interesting. Source?
23:56 <dashcloud> okay - while it is a problem, that's not true
23:57 <dashcloud> if you're trying to share a link in a character-constrained environment, you're going to run into the URL issue
23:58 <dashcloud> I don't disagree that folks found it was a great way to get analytics on web traffic