#archiveteam 2014-08-23,Sat


Time	Nickname	Message
00:54 ^🔗	SketchCow	https://archive.org/details/JasonScottPresentationAtAmericanArchivistsMeetingWebArchivingGroupAugust2014
15:48 ^🔗	Smiley	Anyone able to find a cache of https://torrentfreak.com/bbc-fact-shut-down-doctor-who-fansite-140823/
15:48 ^🔗	Smiley	the site mentioned there?
16:09 ^🔗	antomatic	wow, that sounds seriously dodgy
16:10 ^🔗	antomatic	(i.e. fact and bbc just demanding the domain is handed over virtually then and there0
16:10 ^🔗	antomatic	0
16:10 ^🔗	antomatic	hm, i can't type close-brackets for some reason
16:15 ^🔗	midas	they have done something like this with finalgear
16:36 ^🔗	midas	Smiley: http://webcache.googleusercontent.com/search?q=cache:Gh-eo6mhGnEJ:doctorwhomedia.co.uk/+&cd=1&hl=nl&ct=clnk&gl=nl
17:53 ^🔗	filippo	Hello AT. A few days ago I was asking SketchCow for some help with a project I proposed in a HOPE X lightning session: http://ar.chiv.io
17:53 ^🔗	filippo	The very basic idea is a private archive.today (was archive.is) + Pinboard.in
17:54 ^🔗	filippo	With one fundamental feature: archiving every single page you visit
17:55 ^🔗	filippo	The long term plan is community driven plugins for better archiving and db federation to build a global public archive, but let's keep things simple to start with
17:55 ^🔗	filippo	I know I could pull that together with a few lines of scripting, but I don't want the dead HTML, I want to fetch the real page
17:56 ^🔗	filippo	Everything is dynamic, API based, AJAX ridden, and wget is not enough anymore
17:56 ^🔗	filippo	I really like what https://archive.today does, but it's stubbornly closed source, even if free
17:57 ^🔗	filippo	So I wanted to ask if anyone started a project with similar goals already -- load pages in real browsers and save the result -- from which I can get code
17:58 ^🔗	filippo	Anyone? <3
18:09 ^🔗	chfoo	filippo: i'm working on wpull which has experimental phantomjs support that supports saving snapshots into the warc under a non-standard uri. archivebot is using wpull.
18:13 ^🔗	Nemo_bis	ciao filippo
18:26 ^🔗	filippo	chfoo: are you speaking about bare image screenshot or static HTML?
18:27 ^🔗	filippo	(I think I checked out wpull while documenting)
18:27 ^🔗	filippo	Hey Nemo_bis :)
18:27 ^🔗	filippo	If you have a look at the archive.today zips, they are a packet of embedded CSS, static HTML and images
18:28 ^🔗	filippo	I really like that approach, because as long as we have a VM we will be able to render the page, but the text content is available too
18:28 ^🔗	chfoo	filippo: right now wpull takes a DOM snapshot and saves it to a PDF and HTML resource. any phantomjs traffic is proxied and recorded into the warc
18:33 ^🔗	filippo	chfoo: nice! So no scripts left, what about images and external styles? Are they only recorded or also linked/embedded into the HTML?
18:37 ^🔗	chfoo	filippo: it only records it. the idea was that maybe someone will index the WARC file for the special Wpull resources and then rewrite the HTML so it displays properly.
18:44 ^🔗	filippo	chfoo: I see, so one way to build my tool might be to do the post-processing on the WARCs or reuse the code from wpull
18:45 ^🔗	filippo	The only other tool I was able to find up to now is a browser extension, SinglePage
18:49 ^🔗	xmc	filippo: https://github.com/odie5533/WarcProxy ?
18:49 ^🔗	xmc	not exactly what you're talking about but it's ok
18:52 ^🔗	filippo	xmc: neat! I would not use it for privacy concerns but it can be useful
18:52 ^🔗	filippo	(I want to archive from an external server)
18:54 ^🔗	xmc	aye
18:54 ^🔗	xmc	I was running it at home for a while
18:55 ^🔗	xmc	it could use a few patches, such as a way to swap out the output file without restarting it, but that's fine
20:01 ^🔗	SketchCow	Huzzah
23:44 ^🔗	JKLman	I just get a bunch of "Tracker rate limiting is active"
23:44 ^🔗	JKLman	just downloads a file now and then
23:45 ^🔗	JKLman	maybe because a lot of others with low ping are also working on that project?
23:47 ^🔗	Rotab	it's to not crash the site (again)
23:47 ^🔗	JKLman	that's good
23:48 ^🔗	JKLman	but if I look at the stats it looks like 5 or 6 people are running a lot of files fast
