#archiveteam 2014-08-23,Sat

Time Nickname Message
00:54 🔗 SketchCow https://archive.org/details/JasonScottPresentationAtAmericanArchivistsMeetingWebArchivingGroupAugust2014
15:48 🔗 Smiley Anyone able to find a cache of https://torrentfreak.com/bbc-fact-shut-down-doctor-who-fansite-140823/
15:48 🔗 Smiley the site mentioned there?
16:09 🔗 antomatic wow, that sounds seriously dodgy
16:10 🔗 antomatic (i.e. fact and bbc just demanding the domain is handed over virtually then and there0
16:10 🔗 antomatic 0
16:10 🔗 antomatic hm, i can't type close-brackets for some reason
16:15 🔗 midas they have done something like this with finalgear
16:36 🔗 midas Smiley: http://webcache.googleusercontent.com/search?q=cache:Gh-eo6mhGnEJ:doctorwhomedia.co.uk/+&cd=1&hl=nl&ct=clnk&gl=nl
17:53 🔗 filippo Hello AT. A few days ago I was asking SketchCow for some help with a project I proposed in a HOPE X lightning session: http://ar.chiv.io
17:53 🔗 filippo The very basic idea is a private archive.today (was archive.is) + Pinboard.in
17:54 🔗 filippo With one fundamental feature: archiving every single page you visit
17:55 🔗 filippo The long-term plan is community-driven plugins for better archiving and db federation to build a global public archive, but let's keep things simple to start
17:55 🔗 filippo I know I could pull that together with a few lines of scripting, but I don't want the dead HTML, I want to fetch the real page
17:56 🔗 filippo Everything is dynamic, API based, AJAX ridden, and wget is not enough anymore
17:56 🔗 filippo I *really* like what https://archive.today does, but it's stubbornly closed source, even if free
17:57 🔗 filippo So I wanted to ask if anyone started a project with similar goals already -- load pages in real browsers and save the result -- from which I can get code
17:58 🔗 filippo Anyone? <3
18:09 🔗 chfoo filippo: i'm working on wpull which has experimental phantomjs support that supports saving snapshots into the warc under a non-standard uri. archivebot is using wpull.
18:13 🔗 Nemo_bis ciao filippo
18:26 🔗 filippo chfoo: are you speaking about bare image screenshot or static HTML?
18:27 🔗 filippo (I think I checked out wpull while documenting)
18:27 🔗 filippo Hey Nemo_bis :)
18:27 🔗 filippo If you have a look at the archive.today zips, they are a packet of embedded CSS, static HTML and images
18:28 🔗 filippo I really like that approach, because as long as we have a VM we will be able to render the page, but the text content is available too
18:28 🔗 chfoo filippo: right now wpull takes a DOM snapshot and saves it to a PDF and HTML resource. any phantomjs traffic is proxied and recorded into the warc
18:33 🔗 filippo chfoo: nice! So no scripts left, what about images and external styles? Are they only recorded or also linked/embedded into the HTML?
18:37 🔗 chfoo filippo: it only records it. the idea was that maybe someone will index the WARC file for the special Wpull resources and then rewrite the HTML so it displays properly.
18:44 🔗 filippo chfoo: I see, so one way to build my tool might be to do the post-processing on the WARCs or reuse the code from wpull
18:45 🔗 filippo The only other tool I was able to find up to now is a browser extension, SinglePage
18:49 🔗 xmc filippo: https://github.com/odie5533/WarcProxy ?
18:49 🔗 xmc not exactly what you're talking about but it's ok
18:52 🔗 filippo xmc: neat! I would not use it because of privacy concerns but it can be useful
18:52 🔗 filippo (I want to archive from an external server)
18:54 🔗 xmc aye
18:54 🔗 xmc I was running it at home for a while
18:55 🔗 xmc it could use a few patches, such as a way to swap out the output file without restarting it, but that's fine
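
The post-processing idea chfoo and filippo discuss above (take the resources recorded alongside a snapshot, then rewrite the snapshot HTML so it displays on its own) might be sketched roughly as follows. This is not wpull's code or archive.today's method. It assumes the recorded resources have already been pulled out of the WARC into a plain dictionary; the names `Resources`, `to_data_uri` and `inline_resources` are hypothetical, and the regex only handles quoted `src`/`href` attributes.

```python
import base64
import re
from urllib.parse import urljoin

# Hypothetical stand-in for resources already extracted from a WARC:
# absolute URL -> (MIME type, body bytes). Reading the WARC itself is out of scope.
Resources = dict[str, tuple[str, bytes]]

# Matches quoted src="..." or href="..." attributes. A sketch, not a full HTML parser.
ATTR_RE = re.compile(r'''\b(src|href)=(["'])(.*?)\2''', re.IGNORECASE)


def to_data_uri(mime: str, body: bytes) -> str:
    """Encode a recorded resource as a base64 data: URI."""
    encoded = base64.b64encode(body).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_resources(html: str, base_url: str, resources: Resources) -> str:
    """Replace references to recorded resources with embedded data: URIs."""

    def replace(match: re.Match) -> str:
        attr, quote, value = match.groups()
        absolute = urljoin(base_url, value)  # resolve relative references
        if absolute not in resources:
            return match.group(0)  # not recorded: leave the reference alone
        mime, body = resources[absolute]
        if mime.startswith("text/html"):
            return match.group(0)  # keep ordinary links to other pages as links
        return f"{attr}={quote}{to_data_uri(mime, body)}{quote}"

    return ATTR_RE.sub(replace, html)


if __name__ == "__main__":
    recorded = {"http://example.com/logo.png": ("image/png", b"\x89PNG")}
    page = '<img src="logo.png"><a href="/about">about</a>'
    print(inline_resources(page, "http://example.com/", recorded))
```

Embedding everything as data URIs gives a single self-contained HTML file with the text still readable, at the cost of larger files; rewriting references to point at a local replay server would be another option.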
20:01 🔗 SketchCow Huzzah
23:44 🔗 JKLman I just get a bunch of "Tracker rate limiting is active"
23:44 🔗 JKLman it just downloads a file now and then
23:45 🔗 JKLman maybe because a lot of others with low ping are also working on that project?
23:47 🔗 Rotab it's to avoid crashing the site (again)
23:47 🔗 JKLman that's good
23:48 🔗 JKLman but if I look at the stats, it looks like 5 or 6 people are running a lot of files fast
