11:03 <arkiver> anyone else getting error 500 with S3 all the time?
11:20 <Nemo_bis> arkiver: have you checked that https://archive.org/catalog.php?history=1&justme=1 looks ok for you?
11:20 <arkiver> Nemo_bis: 101 waiting to run...
11:21 <arkiver> is that why S3 is acting weird?
11:21 <Nemo_bis> I've had s3 reject my uploads when I had too many in the queue sometimes
11:21 <Nemo_bis> that may be the reason or not
11:22 <arkiver> ah
11:22 <arkiver> maybe it is
11:22 <arkiver> https://catalogd.archive.org/log/316904585
11:22 <arkiver> I got 6 of these errors
11:22 <arkiver> rerunning one and it looks like it's working, so I'm going to rerun them
11:27 <arkiver> Nemo_bis: looks like it's working again. Thank you!
11:36 <midas> s3 had a lot of slowdowns, probably just going mental for a bit
12:22 <midas> getting some bigass originals from rawporter now
12:45 <arkiver> http://www.codespaces.com/
13:19 <Cameron_D> http://blog.snappytv.com/?p=2101
13:19 <Cameron_D> https://blog.twitter.com/2014/snappytv-is-joining-the-flock
13:21 <Cameron_D> doesn't seem like they really host anything
13:21 <Cameron_D> (Nor does it seem like they are closing up yet)
18:36 <schbirid> thanks for cleaning up https://archive.org/details/nycTaxiTripData2013 whoever did that :)
18:42 <DFJustin> you might want to put a better title/description on it
18:53 <schbirid> true
18:54 <SN4T14> I want to make a shitty "acid trip" joke. :p
19:00 <midas> about 1000 more items to go in the rawporter grab from S3
19:10 <swebb> Pixorial video-, photo-sharing service shutting down - Denver... Giving 30 days' notice. http://www.bizjournals.com/denver/news/2014/06/17/pixorial-video-photo-sharing-service-shutting-down.html?ana=twt
19:19 <swebb> "636,000 users, 2.7 million minutes of video and countless photos"
19:27 <ivan`> google doesn't seem to know much of their user content, so I'm guessing it's almost entirely private?
19:27 <ivan`> did spot some "www.pixorial.com/watch/" https://encrypted.google.com/search?q=site:pixorial.com&gbv=1&prmd=ivns&ei=EzmjU5SWA4bxoATxyIKYAg&start=20&sa=N
19:27 <db48x> signups are disabled
19:27 <db48x> no information about their api
19:27 <swebb> Yea, I'm guessing so. It seems like the company is only 6 months old or so.
19:28 <db48x> they don't actually display any of that content except through embedding
19:29 <db48x> wikipedia says founded in 2007 and launched in 2009
19:30 <db48x> hmm, on the other hand: http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b
19:31 <db48x> <source id="mp4Source" src="http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b" type="video/mp4"></source><source id="webmSource" src="http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.webm" type="video/webm"></source>
19:51 <arkiver> so for pixorial
19:51 <arkiver> I think it's best to just scan all the links like http://pixori.al/26e01
19:51 <arkiver> with http://pixori.al/*****
19:51 <arkiver> * = 0-9 or a-f
19:52 <arkiver> http://pixori.al/26e01 goes to http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b
19:54 <arkiver> change http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b to http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b and you have the video
19:54 <arkiver> there are mp4 and webm versions
19:54 <arkiver> mp4: http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.mp4
19:54 <arkiver> webm: http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.webm
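The watch-to-video rewrite arkiver describes is mechanical, so it fits in a few lines. A minimal Python sketch; the function name is ours, and it assumes every /watch/ URL follows this exact layout:

```python
# Derive the direct media URLs from a Pixorial /watch/ page URL,
# per the pattern described above (assumed to hold for all items).
def media_urls(watch_url):
    base = watch_url.replace("/watch/", "/video/")
    return [base + ".mp4", base + ".webm"]

# media_urls("http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b")
# -> ['http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.mp4',
#     'http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.webm']
```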
19:54 <exmic> that's easy enough
19:55 <arkiver> so I don't think it's that hard to do...
19:55 <arkiver> just scanning all the http://pixori.al/***** links
19:55 <arkiver> and you have it :)
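Enumerating the keyspace arkiver proposes is straightforward with itertools; a sketch assuming the five-character codes really are limited to 0-9a-f (the output file name is illustrative):

```python
from itertools import product

HEX = "0123456789abcdef"

# 16**5 = 1048576 candidate short links, one per line
with open("pixorial-5char-urls.txt", "w") as out:
    for combo in product(HEX, repeat=5):
        out.write("http://pixori.al/" + "".join(combo) + "\n")
```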
19:55 <arkiver> let's make a channel
19:56 <arkiver> #pixi-death
19:56 <arkiver> ? :P
19:58 <arkiver> found s3 again :) just like with earbits
20:00 <db48x> hah
20:02 <arkiver> http://pix-wp-assets.s3.amazonaws.com/
20:14 <midas> arkiver: http://pastebin.com/TNR3JXZy
20:14 <midas> open s3 bucket, again
20:14 * db48x revives his aws account
20:15 <midas> running du on it now
20:15 <db48x> was about to ask :)
20:15 <midas> did you find a bucket for the images/movies too, arkiver?
20:15 <SN4T14> arkiver, hate to burst your bubble, but that's 60466176 unique addresses
20:16 <midas> 319591281 s3://pix-wp-assets/
20:16 <midas> SN4T14: so? that's not that much
20:16 <db48x> probably all just wordpress stuff like the name implies
20:17 <midas> yeah
20:17 <db48x> wouldn't surprise me if there were others though
20:17 <arkiver> SN4T14: http://pixori.al/***** is 1048576 addresses...
20:17 <arkiver> 0-9 and a-f
20:17 <schbirid> next archiveteam project will be "s3cmd ls s3://*"
20:17 <db48x> heh
20:18 <schbirid> there were some guys who indexed that years ago btw
20:18 <schbirid> based on keywords
20:18 <schbirid> err, *wordlists
20:18 <SN4T14> arkiver, that's 36 characters, 36^5 = 60466176
20:18 <midas> schbirid: you know, wouldn't surprise me if that was next :p
20:18 <midas> grabbing every s3 bucket that is public
20:18 <SN4T14> Oh, a-f
20:18 <SN4T14> Whoops
20:19 <arkiver> hehe
20:19 <SN4T14> Yeah, then it's definitely doable
20:19 <arkiver> running a crawl of the 1048576 links now
20:19 <arkiver> see what's coming
20:19 <arkiver> the videos are embedded in html so that's easy
20:22 <db48x> even easier; the urls are predictable so we don't even have to parse the html
20:24 <arkiver> tested some urls: the urls with 5 characters are indeed a-f: http://pixori.al/b532f http://pixori.al/3bcbb http://pixori.al/5a5b2 http://pixori.al/dcc94 http://pixori.al/e5cef http://pixori.al/cd75e
20:25 <arkiver> looks like there are no capital letters in any of them
20:25 <arkiver> but, I found this one too: http://myhub.pixorial.com/watch/045ad22847837e2cc150484f7421e950
20:25 <arkiver> it has http://pixori.al/7sAR
20:25 <arkiver> :/
20:26 <schbirid> redirects to the same url
20:26 <schbirid> as lowercase
20:26 <midas> nice
20:27 <arkiver> schbirid: great!
20:27 <arkiver> so http://pixori.al/7sAR is http://pixori.al/7sar
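Given that redirect behaviour, lowercasing codes before queueing avoids fetching the same target twice; a small sketch (it assumes the case-insensitivity schbirid observed holds for all codes, which was only spot-checked here):

```python
# http://pixori.al/7sAR -> http://pixori.al/7sar
def normalize_short_url(url):
    prefix, code = url.rsplit("/", 1)
    return prefix + "/" + code.lower()
```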
20:28 <arkiver> so for http://pixori.al/**** there are 1679616 urls
20:30 <arkiver> wow.
20:30 <arkiver> going fast; running the crawl on the 1048576 urls from http://pixori.al/***** and it's doing 3 urls per second
20:30 <arkiver> well, not extremely fast, but better than earbits
20:30 <SN4T14> A *whole* 3 urls per second. :p
20:31 <SN4T14> Why don't you start multiple instances of your script?
20:31 <arkiver> SN4T14: will do that
20:31 <arkiver> will start 5
20:32 <db48x> it's more than just 0-9a-f
20:32 <db48x> alas
20:32 <arkiver> db48x: do you have an example url?
20:32 <db48x> arkiver: oh, good, we don't have to deal with capitals as well :)
20:33 <arkiver> db48x: oh sorry, I see
20:33 <arkiver> so on http://pixori.al/**** it is 0-9a-z and on http://pixori.al/***** it is 0-9a-f
20:34 <midas> Rawporter data is downloaded
20:34 <arkiver> I tested around 15 urls from http://pixori.al/***** and they were all only up to f, but there could be more than f...
20:34 <arkiver> let me know if anyone finds such a case
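One way to look for codes past 'f' would be to probe a few candidates and see whether they redirect; a hypothetical check using the requests library (not something anyone in the channel actually ran):

```python
import requests

# If any of these return a redirect instead of a 404, the
# five-character space is wider than hexadecimal.
for code in ("g0000", "ggggg", "zzzzz"):
    r = requests.head("http://pixori.al/" + code, allow_redirects=False)
    print(code, r.status_code)
```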
20:34 <midas> f seems logical, being hex
20:35 <midas> Rawporter ended up being 75GB of data
20:35 <arkiver> :)
20:35 <arkiver> of images only?
20:36 <db48x> if the final digit only goes up to f then that's still 26.8 million
20:36 <midas> images and video
20:37 <arkiver> db48x: how do you get that number?
20:37 <arkiver> 16^5 is 1048576
20:38 <db48x> 36^4*16
20:39 <db48x> although it could be two separate namespaces, in which case it's 36^4 + 16^5
20:39 <arkiver> yes....
20:40 <arkiver> it's 36^4 + 16^5, so that's 1679616 + 1048576 = 2728192 urls
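The arithmetic checks out; for the record:

```python
four_char = 36 ** 4   # 0-9a-z on pixori.al/****  -> 1679616
five_char = 16 ** 5   # 0-9a-f on pixori.al/***** -> 1048576
assert four_char + five_char == 2728192
```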
20:40 <arkiver> but maybe http://pixori.al/***, http://pixori.al/** and http://pixori.al/* exist too..
20:44 <schbirid> http://pixori.al/* always redirects to http://myhub.pixorial.com/s/* which then redirects to the actual URL. you can save one redirect by using the myhub URL directly. saved almost 20% on my simple tiny test
20:47 <schbirid> ok, the second test just gave ~4% though :P
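schbirid's shortcut in code form: build the intermediate URL directly and skip the first hop (a sketch; it assumes the /s/ mapping holds for every code):

```python
# Skips the pixori.al hop:
# http://pixori.al/<code> -> http://myhub.pixorial.com/s/<code>
def direct_url(code):
    return "http://myhub.pixorial.com/s/" + code
```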
20:52 <arkiver> cool
20:52 <arkiver> Wayback is also playing the videos very well in the browser
20:52 <arkiver> http://web.archive.org/web/20140619195035/http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b
20:55 <arkiver> going to split the list of links from http://pixori.al/***** into 5 packs
20:55 <arkiver> no, 10 packs
20:55 <arkiver> then download them simultaneously
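Splitting the list into ten packs for parallel crawls can be done with round-robin slicing; a sketch with illustrative file names:

```python
N_PACKS = 10

with open("pixorial-5char-urls.txt") as f:
    urls = f.read().splitlines()

# Round-robin split so each pack gets an even share of the keyspace.
for i in range(N_PACKS):
    with open("pack-%02d.txt" % i, "w") as out:
        out.write("\n".join(urls[i::N_PACKS]) + "\n")
```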
20:55 <arkiver> hopefully, they don't have some kind of stupid banning system...
20:56 <arkiver> will set soTimeoutMs to 20000 milliseconds and timeoutSeconds to 120000 seconds
20:56 <arkiver> to stay safe
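For reference, soTimeoutMs and timeoutSeconds are properties of Heritrix 3's FetchHTTP module; an override with the values arkiver quotes would go in crawler-beans.cxml roughly like this (a sketch, not a tested configuration):

```xml
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- socket timeout: give slow responses 20000 ms before giving up -->
  <property name="soTimeoutMs" value="20000"/>
  <!-- overall per-fetch timeout, in seconds -->
  <property name="timeoutSeconds" value="120000"/>
</bean>
```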
20:57 <arkiver> the videos don't download very fast (I got 50-100 kbps for one video)
20:57 <arkiver> so if a video is multiple GBs...
20:57 <Smiley> can you create url lists?
20:57 <Smiley> damnit, we need a warrior-type project we can just feed lists of urls, like archivebot does.
20:57 <arkiver> haha no need man
20:57 <arkiver> :P
20:57 <arkiver> we got 30 days
20:58 <arkiver> going to start 10 simultaneous crawls on http://pixori.al/***** in a bit
20:59 <Smiley> it'd still be an awesomely useful project
20:59 <arkiver> yep
21:00 <arkiver> Smiley: http://www.onetimebox.org/box/wi7sHp4t4Qrc26w58
21:00 <arkiver> those are the links
21:02 <db48x> I say we do it with the warrior
21:03 <db48x> it's a more sure way to go, and if we get it done faster, people are less likely to delete things
21:03 <arkiver> yeah
21:04 <arkiver> but I have zero knowledge about the warrior...
21:04 <arkiver> :/
21:04 <db48x> but we should start scanning the urls now, before that is set up
21:04 <Smiley> do we have some way of feeding the warrior loads of urls?
21:04 <Smiley> we really need a framework D:
21:04 <db48x> it's pretty easy
21:04 <arkiver> db48x: why scan the urls?
21:04 <Smiley> like a generic one...
21:04 <arkiver> why not do the short urls in the warrior?
21:04 <arkiver> and have people just download those?
21:04 <db48x> we could be finished scanning the urls in a day or two
21:05 <arkiver> yeah
21:05 <arkiver> but I have no idea how to scan urls...
21:05 <db48x> the pipeline for the warrior task may take that long to write and test
21:05 <arkiver> O.o
21:05 <arkiver> I know how to download and use heritrix and such things
21:05 <arkiver> but someone else would have to do the short url scanning
21:05 <db48x> ah
21:06 <arkiver> sorry
21:06 <arkiver> :(
21:06 <db48x> I thought you were just scanning
21:06 <arkiver> no no
21:06 <arkiver> I was already downloading
21:06 <arkiver> 3 urls per second
21:06 <arkiver> would be done in around 5 to 10 days
21:06 <arkiver> with all the urls
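A back-of-the-envelope check on that estimate:

```python
total = 36 ** 4 + 16 ** 5    # 2728192 short urls across both namespaces
rate = 3.0                   # urls per second, as observed
days = total / rate / 86400
print(days)                  # ~10.5 days single-threaded, ~1 day with 10 crawls
```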
21:06 <arkiver> But it's more fun with a warrior project :)
21:06 <db48x> assuming it's only a few million
21:06 <arkiver> #pixi-death
21:07 <arkiver> couldn't think of anything better... :P
21:09 <db48x> #pixofail :D
21:10 <arkiver> :D muuuuch better
21:15 <arkiver> :P
21:34 <db48x> yipdw: ping?
21:35 <yipdw> db48x: hi
21:36 <db48x> yipdw: howdy. how do you usually create a new job for the warriors? fork seesaw-kit?
21:36 <yipdw> yeah
21:38 <yipdw> I usually also subtree merge the readme repo in, but copy is fine too
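For reference, the workflow yipdw describes (clone/fork seesaw-kit, then subtree-merge a readme repo in) looks roughly like this; the remote name and readme repo URL are placeholders:

```
git clone https://github.com/ArchiveTeam/seesaw-kit myproject-grab
cd myproject-grab
git remote add readme <readme-repo-url>   # placeholder; use the actual readme repo
git fetch readme
git merge -s subtree readme/master
```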
21:38 <db48x> I guess you don't actually use a github fork
22:27 <yipdw> you can
22:49 <deathy> wondering if someone has studied CKAN archival/backup... it's a platform for "open data". Not that governments ever collapse and delete some of their published data... but best be safe...
22:50 <Nemo_bis> someone claims it's useful to upload data there in order to not have all eggs in IA's basket https://meta.wikimedia.org/wiki/Talk:Requests_for_comment/How_to_deal_with_open_datasets
22:52 <Nemo_bis> (I simplified)
23:05 <DFJustin> the more baskets the merrier
23:06 <deathy> heh.. "lots of copies keeps stuff safe"
23:07 <SN4T14> Yeah, that's why I always clone myself in three places.
23:07 <db48x> I want to back myself up outside my current light-cone
23:07 <SN4T14> Dude, set up a RAIC array. :p