#archiveteam-bs 2014-05-08,Thu


Time Nickname Message
00:10 🔗 godane i'm uploading more of marxists.org
00:49 🔗 godane i'm uploading the latest DIY Tryin and tekzilla episodes
00:54 🔗 godane also Industrial Worker PDFs from marxists.org are getting uploaded too
02:33 🔗 balrog SketchCow: was there a talk by Brewster alluding to Yahoo destroying culture by burning Geocities?
02:33 🔗 balrog or was that you
02:36 🔗 SketchCow I'm sure both
04:58 🔗 yipdw https://www.youtube.com/watch?v=Tsa4ogtBiv0 is pretty excellent
05:54 🔗 Coderjoe is this fhqwhgads on every track of the album?
05:59 🔗 Coderjoe ok. i'm curious about the downbeat tracks
06:01 🔗 Coderjoe meh
06:17 🔗 Coderjoe Fhqwhgads all the way down
09:32 🔗 drspangle Is it better to download a website as just a collection of raw pages, or to extract the data from it all and create a kind of simulation of their database?
09:38 🔗 schbirid depends what your goal is
09:43 🔗 drspangle I guess
09:44 🔗 drspangle this is not something like.. geocities or whatnot, but a more database-oriented website. every page has standard output, tags, comments, etc
09:46 🔗 drspangle the more I think about it, the more I'm sure that pulling the data out and building a simulation of their database is best
09:46 🔗 drspangle but in the general case, kind of wondering whether there's more of a tendency towards processed or raw data in archiveteam stuff
09:47 🔗 midas archiveteam tends to grab everything
09:47 🔗 midas never know what you have missed until it's gone, right
09:47 🔗 tsp__ drspangle: I'm pretty new to all this, been sorta lurking for a while. Archiveteam seems to just grab everything raw; I tend to grab just the bits I want, so my stuff isn't that useful
09:48 🔗 tsp__ e.g. archiveteam would grab fanfiction stories, the author page, reviews... while I just grab the chapters because it's all I need to process
09:50 🔗 drspangle well, in the fanfic scenario, I'm thinking grab all the data from the page, including the author page, reviews, etc, but then format it into a database, rather than grabbing it all as story0237.htm, story0238.htm, ... author012.htm, author013.htm
09:51 🔗 tsp__ That can work too. It just won't show up on the Wayback Machine, I don't think, because that needs WARCs
09:51 🔗 drspangle sorry, 'warcs'? Not sure of the definition
09:52 🔗 tsp__ Some kind of web archive format. It describes the request and response, including headers
09:52 🔗 drspangle ah, okay
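[editor's note: for context, a WARC file is a sequence of records, each opening with headers describing one capture, followed by the raw HTTP traffic. A minimal sketch of a single response record per the WARC 1.0 spec — the URI, date, record ID, and length here are illustrative only:]

    WARC/1.0
    WARC-Type: response
    WARC-Record-ID: <urn:uuid:...>
    WARC-Target-URI: http://example.com/
    WARC-Date: 2014-05-08T09:52:00Z
    Content-Type: application/http; msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    Content-Type: text/html
    ...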
09:53 🔗 tsp__ The scripts I've seen seem to use wget-lua to do a lot of their link finding, maybe because wget has --page-requisites to grab a lot of the required CSS, JavaScript, images, etc. that wouldn't be grabbed otherwise
09:55 🔗 drspangle yeah
09:56 🔗 drspangle It seems more efficient to pull the data out of the website, if the style is just a tool to display the data
09:56 🔗 drspangle if the style is a feature in its own right, then it does want saving
09:56 🔗 drspangle but I guess defining that difference is stupid, everyone will define it differently, so download it all so they can decide later
09:57 🔗 tsp__ For me, I'm not really sure how to do it properly; I can wget --mirror a static site and be fine, but otherwise I just scrape the stuff I want. I don't think there's a 'how to archive dynamic forums' tutorial yet
09:58 🔗 midas that's why there is #archivebot
09:58 🔗 midas easiest way to feed sites into archive.org
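[editor's note: a minimal sketch tying together the wget flags mentioned above — --mirror to recurse, --page-requisites to pull the CSS/JS/images each page needs, and --warc-file to write the crawl out as a WARC alongside the mirrored tree; example.com is a placeholder:]

    # mirrors the site to ./example.com/ and also writes example-site.warc.gz
    wget --mirror --page-requisites --warc-file=example-site http://example.com/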
10:01 🔗 tsp__ I want to rip the spacebattles forums, particularly the fanfiction forum so I can read its posts with my editor. I can write some scripts to take out the posts I don't want
10:02 🔗 drspangle I'm kind of tempted to just attempt to simulate the website. Download the data, convert it into a simulated database, get a copy of the style, poke the data into where it goes
10:02 🔗 drspangle but pretty sure that's a bad idea and would be discouraged
10:02 🔗 drspangle but would mean easier processing of the data
10:29 🔗 schbirid drspangle: for archiving, grab it as it is. for rehosting/displaying it differently, you might want to parse it
10:56 🔗 drspangle schbirid: ah, okay
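[editor's note: a minimal sketch of the parse-for-display approach schbirid describes, assuming the site has already been mirrored to disk. extract_fields is a hypothetical helper that parses one saved page and prints one id|title|author line (pipe-separated, matching the sqlite3 shell's default import separator):]

    # create the table, then import one row per mirrored story page
    sqlite3 stories.db 'CREATE TABLE IF NOT EXISTS stories(id INTEGER, title TEXT, author TEXT);'
    for f in mirror/story*.htm; do
        ./extract_fields "$f"    # hypothetical HTML parser
    done | sqlite3 stories.db '.import /dev/stdin stories'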
12:12 🔗 schbirid midas: harddisk, harddisk, harddisk? :>
12:12 🔗 midas i've ordered a new 4TB disk for travel :p
12:12 🔗 schbirid woot
12:25 🔗 schbirid you know what would be stupid fun? people running encrypted garbage faucet services to generate lots of encrypted traffic for fun
12:26 🔗 schbirid to annoy secret agencies and to test one's ISP's monthly limits by downloading it to /dev/null
12:26 🔗 ersi monthly limits? lol, how barbarian!
12:26 🔗 schbirid :)
12:37 🔗 schbirid yay for netcat
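[editor's note: a minimal sketch of the 'garbage faucet' joke above, using netcat — /dev/urandom stands in for encrypted traffic, and the client discards everything. The -p flag is for traditional netcat, and example.com is a placeholder:]

    # faucet side: serve endless random bytes on port 9999
    nc -l -p 9999 < /dev/urandom
    # client side: download to /dev/null to burn through the monthly cap
    nc example.com 9999 > /dev/null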
15:04 🔗 SketchCow http://imgur.com/gallery/nKkgfWK
15:09 🔗 exmic oh yes
15:09 🔗 exmic this is exactly what I needed
16:00 🔗 DFJustin http://apnews.excite.com/article/20140507/alibaba_ipo-yahoos_windfall-08fb26a635.html
16:09 🔗 godane a video about the search for James Kim: http://download.cbsnews.com/media/2006/12/05/video2232655.flv
17:54 🔗 Smiley so....
17:54 🔗 Smiley i wish we could pull all the old cartoons from that app on my phone
17:54 🔗 Smiley cartoon HD
18:27 🔗 exmic book i just uploaded about driving the SR-71, read it before someone submits a DMCA or something https://archive.org/stream/SledDriver
19:42 🔗 Smiley So, anyone else up for pulling as many cartoons from cartoon HD as possible before it gets shut down? (I can't imagine that it won't, but there's some old movies on there).
21:08 🔗 midas tried to find the app (iOS) but it seems to be missing here. but I'm willing to grab as much as possible
21:14 🔗 Smiley midas: it's... hidden
21:14 🔗 Smiley http://gappcenter.com/cartoon/index.html
21:14 🔗 Smiley i guess for iDevices you have to be rooted to install it now?
21:22 🔗 midas doesn't say jailbroken only
21:24 🔗 midas right
21:24 🔗 midas it sideloaded on my iphone...
21:24 🔗 midas that's kinda freaky
21:30 🔗 balrog they're exploiting enterprise certs
21:30 🔗 balrog can't see it lasting long
21:30 🔗 Smiley yeah I ain't guaranteeing the security of it.
21:31 🔗 Smiley so i was thinking: run an Android image, set up a mount, pull the movies down in the image and then pull them onto the system via that
21:33 🔗 midas i'm checking out the apk to see what it does, can't be that hard
21:33 🔗 Smiley midas: nod, it's prob just a case of finding the url ;)
