Time | Nickname | Message
00:15 | deathy | any other very near future projects we know about but haven't been started yet? or have things calmed down for now?
02:53 | dashcloud | so, how do you archive twitter accounts? if someone knows, Dan Kaminsky mentioned that https://twitter.com/conorpotpie died recently
03:05 | deathy | don't know anything first hand, but there are some tools mentioned on the wiki: http://www.archiveteam.org/index.php?title=Twitter
03:06 | deathy | out of 3 tools, one is dead/broken link and another requires your login/pass (obviously not possible here..)
03:10 | deathy | and the 3rd one is also useless. URL params and classes used in page not the same anymore.
03:15 | deathy | updated wiki, made it clear 3rd tool is not usable anymore
03:16 | deathy | perhaps it would be good to add one there that actually works..
03:31 | xmc | deathy, dashcloud: archive.is seems to expand the whole page of a twitter account
03:33 | deathy | nope
03:33 | deathy | not really
03:34 | deathy | tried on that account dashcloud mentioned. It was doing at least a few URL requests which I recognized from monitoring the twitter infinite scroll thing
03:34 | deathy | but it maybe got 1-3 additional pages/scrolls
03:34 | deathy | from more than 50 I think on that user at least
03:36 | deathy | (used a lot of PgDn keys while looking at it.. )
03:36 | xmc | ah
03:38 | deathy | from archive.is: "There is 5 minutes timeout, if page is not fully loaded in 5 minutes, the saving considered failed. It is not often, but it happens."
03:38 | deathy | and maybe for extreme cases this could be an issue (if lots of twitter pics): "The stored page with all images must be smaller than 50Mb"
03:42 | deathy | and double/triple-confirmed, from archive.is blog, issues with twitter: http://blog.archive.is/post/51400352393/it-seems-that-twitter-feeds-with-a-lot-of-tweets-500
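For context on why those saves cut off: the profile page only ships the newest screenful of tweets in its HTML, and every further "scroll" is a separate background request returning an older chunk plus a cursor for the next one. The requests deathy describes monitoring could in principle be replayed in a loop until the server reports nothing older, roughly as sketched below. The /i/profiles/show/.../timeline path and the max_position / has_more_items names are reconstructions of what the 2013-era page appeared to request, not a documented API, so treat this as an illustration of the pagination idea rather than a working grabber.

    #!/bin/bash
    # Sketch: replay a profile's infinite-scroll requests until exhausted.
    # Endpoint path and the max_position / has_more_items field names are
    # assumptions about the 2013-era page, not a documented or stable API.
    user="conorpotpie"
    cursor=""                     # empty = start from the newest tweets
    page=0
    while :; do
        url="https://twitter.com/i/profiles/show/${user}/timeline?include_entities=1"
        [ -n "$cursor" ] && url="${url}&max_position=${cursor}"
        out=$(printf 'scroll-%04d.json' "$page")
        curl -s "$url" -o "$out"
        # stop when the response no longer advertises older items
        grep -q '"has_more_items":true' "$out" || break
        cursor=$(grep -o '"max_position":"[0-9]*"' "$out" | grep -o '[0-9]\+' | head -n 1)
        [ -z "$cursor" ] && break # no cursor found: bail out rather than loop forever
        page=$((page + 1))
        sleep 2                   # one account, fetched politely
    done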
04:13 | xmc | mm
07:02 | SketchCow | Where's the hug.
07:05 | * | BlueMaxim hugs SketchCow
08:35 | * | Nemo_bis read "bug"
09:07 | arkiver | why are wikipedia pages saved so badly in the wayback machine?
09:07 | arkiver | http://web.archive.org/save/http://nl.wikipedia.org/wiki/Hoofdpagina
10:19 | jonas_ | hi=) what about getting more yahoo blogs from google cache (and gigablast cache) (or a cdn available for this if any)?
10:45 | m1das | jonas_: join #shipwretched and update your y!b code for new grabs
11:31 | arkiver | looks like the archive.is saves are made like ####
11:31 | arkiver | would make it possible to save
11:31 | arkiver | 14,776,336 combinations
11:31 | arkiver | then run a program on them to keep only the existing ones and discover all the other urls
11:31 | arkiver | and then download
11:44 | arkiver | looks like they also have ##### (with 5) now...
11:45 | arkiver | the #### is full, all of them are used, so that would be good to archive
11:45 | arkiver | will try to start a grab on that... :)
12:09 | arkiver | generated all urls from aaaa-0000
12:10 | arkiver | now starting the url discovery of the first batch: aaaa-a000
12:10 | arkiver | or nah, gonna do aaaa-d000
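arkiver's 14,776,336 is 62^4, i.e. the figure you get if the four-character codes are case-sensitive letters plus digits. Assuming that alphabet, that short links look like https://archive.is/<code>, and that unused codes return a non-200 status (all guesses about archive.is, not confirmed behaviour), the discovery pass arkiver describes is just a very large existence check:

    #!/bin/bash
    # Sketch of the discovery pass: try every 4-character code and record the
    # ones that resolve. The [a-zA-Z0-9] alphabet, the https://archive.is/<code>
    # URL shape and "unused code => non-200" are assumptions, not confirmed.
    alphabet=( {a..z} {A..Z} {0..9} )

    check() {                     # print the code if its short URL exists
        local code="$1"
        status=$(curl -s -o /dev/null -w '%{http_code}' "https://archive.is/${code}")
        [ "$status" = "200" ] && echo "$code"
        sleep 1                   # 14,776,336 candidates: this has to be throttled anyway
    }

    for a in "${alphabet[@]}"; do
      for b in "${alphabet[@]}"; do
        for c in "${alphabet[@]}"; do
          for d in "${alphabet[@]}"; do
            check "${a}${b}${c}${d}" >> existing-codes.txt
          done
        done
      done
    done

At one request per second the full space takes roughly 170 days, which is presumably why arkiver splits it into batches like aaaa-a000 and aaaa-d000 and runs them separately.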
12:37 | antomatic | jonas_: Apparently google cache is really hard to archive from, they rate-limit so aggressively that it's almost impossible to do on any scale
12:41 | BiggieJo1 | pretty much need a massive block of IPs and randomly scatter requests across the block
12:56 | Nemo_bis | so far I'm not having problems with concurrency 4
12:57 | joepie91 | there's an old piece of software in Perl that does Google Cache extraction pretty well afaik
13:06 | ivan` | I waited 2 minutes between requests to google cache and it worked fine
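Two data points in this thread: concurrency 4 is working for Nemo_bis, and one request every two minutes worked for ivan`. A minimal sketch of the slow single-client variant is below, using the public cache-lookup URL form (webcache.googleusercontent.com/search?q=cache:<url>); the two-minute figure is just what ivan` found tolerable, not a documented limit.

    #!/bin/bash
    # Sketch: fetch Google's cached copy of each URL in urls.txt, one request
    # every two minutes (an observed-safe rate, not a documented one).
    mkdir -p cache
    while read -r target; do
        name=$(printf '%s' "$target" | tr -c 'A-Za-z0-9._-' '_')
        wget -q -O "cache/${name}.html" \
             "http://webcache.googleusercontent.com/search?q=cache:${target}"
        sleep 120
    done < urls.txt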
13:06 | antomatic | is that wretch or blogs or both, Nemo_bis ?
13:06 | antomatic | sounds encouraging
13:07 | Nemo_bis | blogs
13:07 | antomatic | cool
13:07 | Nemo_bis | but I see I uploaded only 14 items so far, dunno what's going on for real
13:07 | arkiver | are you talking about getting the yahoo things from google cache?
13:15 | arkiver | archive.is is blocked from the IA for some reason...
13:15 | arkiver | :(
15:46 | Cowering | anyone else blocked from dropbox for too much BW, even when you know it is not true?
16:47 | balrog | Cowering: like blocked completely, or one file blocked?
17:32 | arkiver | etsi.org/deliver/ save almost complete
17:32 | arkiver | working on wikileaks website
18:21 | chavezery | don't know if anyone would want to keep this, but it's here in case: https://bui.pm/ded
18:21 | chavezery | it was a /b/ archive, dead now, images and database stuff is up for grabs
18:23 | chavezery | and with that, it's time for me to leave
18:23 | chavezery | later all o/
18:40 | joepie91 | being archived
19:51 | alexvoda | hello
19:51 | alexvoda | WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
19:52 | BiggieJo1 | looks like the bot is not responding - secret word is yahoosucks
19:52 | alexvoda | thank you
19:53 | alexvoda | also I might as well ask this over here too
19:53 | alexvoda | I want to help archive myopera
19:53 | alexvoda | how can I help?
19:54 | BiggieJo1 | info page is here - http://archiveteam.org/index.php?title=My_Opera
19:54 | alexvoda | yes I know
19:54 | alexvoda | but what can I do
19:55 | alexvoda | It doesn't seem set up for use with the warrior
19:55 | BiggieJo1 | looks like it's not a warrior project, would need to check with Mithrandir who is running that project
19:55 | alexvoda | oh
19:56 | joepie91 | BiggieJo1: wait, we have a bot responding with yahoosucks?
19:57 | alexvoda | I see, hmm there seems to be no contact info on his wiki page
19:57 | joepie91 | irony, a bot responding to a question for the anti-bot system...
19:57 | alexvoda | just a public key
19:58 | alexvoda | does anyone know if he is regularly on IRC?
19:58 | nico_32 | joepie91: where ?
19:58 | joepie91 | nico_32: see what BiggieJo1 said
19:59 | BiggieJo1 | arrgh, nick broken again
19:59 | joepie91 | not broken, just in an alternative state of functioning
19:59 | joepie91 | :)
20:00 | alexvoda | or should I write in the discussion page for myopera or for him?
20:02 | nico_32 | try to write on his discussion page
20:02 | nico_32 | it "should" make mediawiki show him/her/hir a message at next login
20:03 | alexvoda | ok, thanks for the advice
20:13 | DeVan | Just found a webhost with an insane amount of datasheets
20:13 | nico_32 | DeVan: url ?
20:14 | DeVan | I don't know
20:15 | nico_32 | ...
20:15 | DeVan | afraid the swarm will make him close it
20:16 | nico_32 | send it privately and i will make a very slow download
20:17 | nico_32 | with --delay 20s
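nico_32's --delay 20s is a flag of whatever grabber they have in mind; with plain wget the same go-slow mirror looks roughly like this (example.com stands in for the datasheet host, which was deliberately not named in channel):

    # Low-impact mirror: one request at a time, ~20 s between requests, bandwidth
    # capped, everything recorded to a WARC. example.com is a placeholder for the
    # datasheet host, which was only shared privately.
    wget --mirror --no-parent --page-requisites --adjust-extension \
         --wait=20 --random-wait --limit-rate=200k \
         --warc-file=datasheets \
         "http://example.com/datasheets/"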
20:17 | balrog | electronicsandbooks?
20:17 | DeVan | yeah
20:17 | balrog | people have archived that before; it's very very slow
20:20 | DeVan | balrog: not that site
20:22 | nico_32 | last IA crawling: 2011/2012
21:37 | arkhive | would be good maybe to backup xbins (xbox-scene.com file downloads)
21:39 | arkhive | http://www.xbins.org/ and the actual files obtained via IRC and FTP
21:39 | arkhive | http://www.xbox-scene.com/articles/xbins.php
21:39 | arkhive | tutorial.
21:41 | arkhive | I downloaded a whole bunch about ten years ago but all the files probably have newer releases/versions and i never got the whole lot.
21:42 | arkhive | Also, I think there was a limit on how many you could get a day. or hour. Can't remember though.
21:42 | arkiver | arkhive: I will run an url discovery program tomorrow on those and see how big the sites are and how many files they contain
21:43 | arkiver | will then start a grab
21:43 | arkhive | okay. i think they are hosted off site. like not on xbins.org. can't remember though. I was like 13 lol
21:43 | DFJustin | joepie91: oh are you grabbing that /b/ thing, I just emailed jason about it
21:43 | arkhive | but i'll get those CD's i put the xbins files on.
21:44 | joepie91 | DFJustin: yes, one of my boxes is downloading it atm
21:46 | DFJustin | \o/
21:46 | arkhive | I still need to continue my saving of old apps/programs from dead/zombied mobile platforms. good article if i remember right lol. http://www.visionmobile.com/blog/2012/01/the-dead-platform-graveyard-lessons-learned-2/
21:46 | arkhive | but to -bs for me. :)
21:56 | joepie91 | for the record, I have several servers with 500G disk now
21:56 | joepie91 | so anything up to that size, I can fetch
21:56 | joepie91 | (ping me when necessary)
21:59 | yipdw | joepie91: !a https://bui.pm/ded
21:59 | joepie91 | yipdw: it's too big for that
22:00 | yipdw | I should add that to ArchiveBot
22:00 | yipdw | "Sorry, this item is too big"
22:04 | bsmith093 | http://bofh.nikhef.nl/events/ this seems to be a mirror for... everything con related for tech things. how can i get a size, without just dl-ing all of it?
22:06 | arkiver | arkhive: I will do the xbox-scene website first, since those downloads are on-site
22:06 | arkiver | and then I'll take a look at the other one
22:06 | arkiver | :)
22:11 | joepie91 | bsmith093: not, probably
22:11 | joepie91 | unless you script a bit
22:13 | bsmith093 | joepie91: I'm gonna run a wget spider, then dump the log to a url extractor, then dump that into jdownloader, so i at least know how big it is.
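The jdownloader step in that workflow is only there to learn the total size, and that part can be done from the URL list directly: issue a HEAD request per URL and add up the Content-Length headers. A rough sketch, assuming urls.txt is the list extracted from the spider log; responses without a Content-Length contribute nothing, so the result is a lower bound.

    #!/bin/bash
    # Estimate total size of a URL list by summing Content-Length from HEAD
    # requests. urls.txt = list pulled out of the wget --spider log; responses
    # lacking a Content-Length header are simply not counted.
    total=0
    while read -r url; do
        len=$(curl -sIL "$url" | tr -d '\r' |
              awk 'tolower($1) == "content-length:" { n = $2 } END { print n + 0 }')
        total=$((total + len))
    done < urls.txt
    printf 'approx. %d bytes (%d MiB)\n' "$total" $((total / 1024 / 1024))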
22:13 | joepie91 | oh man :P
22:15 | arkiver | bsmith093: I will try to get the size and the number of urls for you tomorrow
22:16 | bsmith093 | ark i'm already running the spider thats faster, isnt it?
22:16 | yipdw | mentioning jdownloader to joepie91 is like talking to a TEA Party member about taxes
22:16 | yipdw | the cool thing about US-centric similes is that they're always worse than you intend
22:17 | yipdw | :P
22:17 | bsmith093 | joepie91: whats wrong with this workflow? seriously I'd love any suggestions :)
22:17 | joepie91 | bsmith093: it just seems a bit... duct-tapey :)
22:17 | joepie91 | aside from the jdownloader bit
22:19 | arkiver | hmm, we can see how accurate jdownloader is, please let me know the size of the site tomorrow
22:20 | * | m1das moves to #archiveteam-bs and opens the popcorn
22:23 | ivan` | bsmith093: I will tell you in a few minutes
22:24 | ivan` | I am using HTTrack and find . -name 'index.html' | xargs cat | grep 'alt="\[ \]">' | perl -p -i -e 's/ +/ /g' | python -c "exec 'import sys\nfor line in sys.stdin: print line.strip().split()[-1]'" | sed 's/G/*1024*1024*1024/g' | sed 's/M/*1024*1024/g' | sed 's/K/*1024/g'
22:25 | ivan` | buggy, ask me for fixed version later if you want it
22:31 | Nemo_bis | regex ftw
22:40 | ivan` | 400GB so far but it's still grabbing indexes
22:40 | ivan` | find . -name 'index.html' | xargs cat | grep 'alt="\[...\]">' | grep -v 'alt="\[DIR\]">' | perl -p -i -e 's/ +/ /g' | python -c "exec 'import sys\nfor line in sys.stdin: print line.strip().split()[-1]'" | sed 's/G/*1024*1024*1024/g' | sed 's/M/*1024*1024/g' | sed 's/K/*1024/g' | python -c "exec 'import sys\nprint sum(int(eval(line.strip(), {\'__builtins__\': None})) for line in sys.stdin)'"
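For anyone reading along, ivan`'s one-liner works by scraping the size column out of the Apache directory-index pages HTTrack has saved, expanding the K/M/G suffixes, and summing. The same idea spelled out a bit more readably is below; it still assumes the stock Apache <pre>-style fancy index, and it is only as precise as the rounded sizes shown in the listings.

    #!/bin/bash
    # Readable restatement of the pipeline above: file rows in an Apache fancy
    # index carry an icon with alt="[..]" (directories show alt="[DIR]"); the
    # last column of such rows is the human-readable size, which we expand and
    # total. Accuracy is limited by the rounded sizes Apache prints.
    find . -name 'index.html' -print0 | xargs -0 cat |
      grep 'alt="\[...\]">' | grep -v 'alt="\[DIR\]">' |
      awk '{
        size = $NF
        n = size + 0
        if      (size ~ /K$/) n *= 1024
        else if (size ~ /M$/) n *= 1024 * 1024
        else if (size ~ /G$/) n *= 1024 * 1024 * 1024
        sum += n
      }
      END { printf "%.1f GiB\n", sum / 1024 / 1024 / 1024 }'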
23:57 | bsmith093 | well ok then, ivan` you can grab it if you eant, holy crap thats big
23:57 | bsmith093 | *want