Time | Nickname | Message
00:54 | SketchCow | https://archive.org/details/JasonScottPresentationAtAmericanArchivistsMeetingWebArchivingGroupAugust2014
15:48 | Smiley | Anyone able to find a cache of https://torrentfreak.com/bbc-fact-shut-down-doctor-who-fansite-140823/
15:48 | Smiley | the site mentioned there?
16:09 | antomatic | wow, that sounds seriously dodgy
16:10 | antomatic | (i.e. fact and bbc just demanding the domain is handed over virtually then and there0
16:10 | antomatic | 0
16:10 | antomatic | hm, i can't type close-brackets for some reason
16:15 | midas | they have done something like this with finalgear
16:36 | midas | Smiley: http://webcache.googleusercontent.com/search?q=cache:Gh-eo6mhGnEJ:doctorwhomedia.co.uk/+&cd=1&hl=nl&ct=clnk&gl=nl
17:53 | filippo | Hello AT. A few days ago I was asking SketchCow for some help with a project I proposed in a HOPE X lightning session: http://ar.chiv.io
17:53 | filippo | The very basic idea is a private archive.today (was archive.is) + Pinboard.in
17:54 | filippo | With one fundamental feature: archiving every single page you visit
17:55 | filippo | The long-term plan is community-driven plugins for better archiving and DB federation to build a global public archive, but let's keep things simple to start
17:55 | filippo | I know I could pull that together with a few lines of scripting, but I don't want the dead HTML, I want to fetch the real page
17:56 | filippo | Everything is dynamic, API-based, AJAX-ridden, and wget is not enough anymore
17:56 | filippo | I *really* like what https://archive.today does, but it's stubbornly closed source, even if free
17:57 | filippo | So I wanted to ask if anyone has already started a project with similar goals -- load pages in real browsers and save the result -- from which I can get code
17:58 | filippo | Anyone? <3
18:09 | chfoo | filippo: i'm working on wpull, which has experimental phantomjs support for saving snapshots into the warc under a non-standard uri. archivebot is using wpull.
18:13 | Nemo_bis | ciao filippo
18:26 | filippo | chfoo: are you speaking about a bare image screenshot or static HTML?
18:27 | filippo | (I think I checked out wpull while documenting)
18:27 | filippo | Hey Nemo_bis :)
18:27 | filippo | If you have a look at the archive.today zips, they are a packet of embedded CSS, static HTML and images
18:28 | filippo | I really like that approach, because as long as we have a VM we'll be able to render the page, but the text content is available too
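The self-contained packet filippo describes can be sketched as a naive post-processing pass that inlines already-fetched stylesheets into the HTML. The function name, the `resources` dict, and the deliberately simple regex are all illustrative assumptions, not archive.today's actual pipeline; a real tool would use a proper HTML parser.

```python
import re

def inline_styles(html: str, resources: dict) -> str:
    """Replace <link rel="stylesheet" href="..."> tags with inline
    <style> blocks, using `resources` (URL -> CSS text) as the cache
    of fetched files. Tags whose CSS was not fetched are left alone."""
    def replace(match):
        css = resources.get(match.group(1))
        return f"<style>{css}</style>" if css is not None else match.group(0)
    return re.sub(r'<link\s+rel="stylesheet"\s+href="([^"]+)"\s*/?>',
                  replace, html)
```

With `resources={"a.css": "body{color:red}"}`, any `<link rel="stylesheet" href="a.css">` in the page becomes `<style>body{color:red}</style>`, so the saved file renders without fetching anything.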
18:28 | chfoo | filippo: right now wpull takes a DOM snapshot and saves it to a PDF and HTML resource. any phantomjs traffic is proxied and recorded into the warc
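For context, WARC is a simple framed format: each record is a block of text headers followed by the raw payload. A minimal sketch of framing a snapshot as a WARC/1.0 record follows; the helper is hypothetical and omits required fields like `WARC-Record-ID`, and it is not wpull's actual implementation.

```python
from datetime import datetime, timezone

def build_warc_record(target_uri: str, payload: bytes,
                      record_type: str = "resource") -> bytes:
    """Frame `payload` as a single (simplified) WARC/1.0 record."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Target-URI: {target_uri}",
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",
    ]
    # A blank line separates the header block from the payload;
    # every record is terminated by two CRLFs.
    head = ("\r\n".join(headers) + "\r\n\r\n").encode("ascii")
    return head + payload + b"\r\n\r\n"
```

Because records are just concatenated, a crawler can append one of these per fetched URL and the result is a readable archive file.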
18:33 | filippo | chfoo: nice! So no scripts left, what about images and external styles? Are they only recorded or also linked/embedded into the HTML?
18:37 | chfoo | filippo: it only records it. the idea was that maybe someone will index the WARC file for the special Wpull resources and then rewrite the HTML so it displays properly.
18:44 | filippo | chfoo: I see, so one way to build my tool might be to do the post-processing on the WARCs or reuse the code from wpull
18:45 | filippo | The only other tool I was able to find up to now is a browser extension, SinglePage
18:49 | xmc | filippo: https://github.com/odie5533/WarcProxy ?
18:49 | xmc | not exactly what you're talking about but it's ok
18:52 | filippo | xmc: neat! I wouldn't use it because of privacy concerns, but it can be useful
18:52 | filippo | (I want to archive from an external server)
18:54 | xmc | aye
18:54 | xmc | I was running it at home for a while
18:55 | xmc | it could use a few patches, such as a way to swap out the output file without restarting it, but that's fine
20:01 | SketchCow | Huzzah
23:44 | JKLman | I just get a bunch of "Tracker rate limiting is active"
23:44 | JKLman | just downloads a file now and then
23:45 | JKLman | maybe because a lot of others with low ping are also working on that project?
23:47
🔗
|
Rotab |
its to not crash the site (again) |
23:47 | JKLman | that's good
23:48 | JKLman | but if I look at the stats it's like 5 or 6 people are running a lot of files fast