#warrior 2018-12-17,Mon

Time Nickname Message
00:04 🔗 Magirot has quit IRC (Quit: Magirot)
00:34 🔗 GitHub181 has joined #warrior
00:34 🔗 GitHub181 has left
00:34 🔗 GitHub163 has joined #warrior
00:34 🔗 GitHub163 has left
02:09 🔗 SentientC has joined #warrior
02:11 🔗 SentientC has left
02:30 🔗 bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
03:05 🔗 balrog has quit IRC (Quit: Bye)
03:14 🔗 balrog has joined #warrior
03:14 🔗 swebb sets mode: +o balrog
05:40 🔗 Kaz has quit IRC (Ping timeout: 265 seconds)
05:40 🔗 Kaz has joined #warrior
06:10 🔗 Magirot has joined #warrior
07:47 🔗 Atom has quit IRC (Read error: Operation timed out)
07:51 🔗 bibop has joined #warrior
07:51 🔗 bibop has left Leaving
07:53 🔗 Atom has joined #warrior
08:12 🔗 Atom has quit IRC (Ping timeout: 252 seconds)
08:13 🔗 Atom has joined #warrior
08:55 🔗 chferfa has quit IRC (Quit: Connection closed for inactivity)
09:23 🔗 AnEggFore has quit IRC (Ping timeout: 255 seconds)
10:02 🔗 surewhyno has joined #warrior
10:03 🔗 surewhyno hey folks. i'm curious, is there a reason why there isn't a version of warrior for web browsers using javascript? i bet it would be easier to convince people who have no technical knowledge to help with archiving that way
10:04 🔗 surewhyno is there a technical reason why it hasn't been done?
10:22 🔗 kiska Because chrome and all other browsers don't give us the raw headers. Warcs rely on request - response pairs. Which don't happen in a browser
10:22 🔗 kiska And also wanna imagine the performance of running a hypervisor in a browser?
10:22 🔗 surewhyno not sure how warcs work. but javascript can see the response headers of a url it fetches
10:22 🔗 surewhyno why run a hypervisor in a browser? O_o
11:10 🔗 kpcyrd surewhyno: unless the target website disables the same origin policy with cross origin resource sharing you can't archive from a browser
11:11 🔗 surewhyno ah...
11:12 🔗 surewhyno i'd have thought that the archiver website would be able to disable that and ignore any policies the archivee sets. i guess i don't know much about browsers heh
11:18 🔗 kpcyrd surewhyno: the target needs to do that, not the source :) but if eg. tumblr does that we would get a large number of clean ips and you can use some javascript black magic to cloak the source
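[Editor's sketch] For readers unfamiliar with the same-origin policy kpcyrd refers to, here is a rough model, written in Python rather than browser code, of the decision a browser makes before letting a page's script read a cross-origin response. The function names `origin_of` and `script_may_read` and the example URLs are hypothetical, and the check is deliberately simplified: it ignores credentials, preflight requests, and header-name case.

```python
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Return the scheme://host[:port] origin of a URL (simplified)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def script_may_read(page_url: str, target_url: str, response_headers: dict) -> bool:
    """Simplified model: may a script on page_url read the response from target_url?"""
    page_origin = origin_of(page_url)
    if page_origin == origin_of(target_url):
        return True  # same origin: always readable
    # Cross-origin: only the *target* server can opt in, via its response headers.
    allowed = response_headers.get("Access-Control-Allow-Origin")
    return allowed == "*" or allowed == page_origin


# Hypothetical archiver page trying to read a blog it does not control.
print(script_may_read("https://archiver.example/app", "https://blog.example/post", {}))
```

The point matching the discussion: nothing the archiver page does changes the result; only a header sent by the target site does.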
11:19 🔗 surewhyno then why not a browser plugin?
11:19 🔗 surewhyno that should be able to bypass any SOP nastiness that prevents JS from doing the archival, and would be easier for people to use and install than a VM
11:21 🔗 kpcyrd a browser plugin should work if you rewrite everything to javascript
11:22 🔗 surewhyno would that be a useful thing to work on? i could probably do that
11:28 🔗 kpcyrd surewhyno: I've looked at tumblr specifically, you can't crawl the blogs unless you have an api key, but you could theoretically use javascript as a proxy to ban evade on 66.media.tumblr.com
11:30 🔗 surewhyno well i was thinking of having it interface with the warrior tracker
11:31 🔗 surewhyno not thinking about tumblr specifically, but for future cases
11:31 🔗 surewhyno especially when there's a short period of time. i mean i asked a few people to run their own warrior but all but one did not want to because either they were worried about running a whole vm or it was too complex
13:49 🔗 surewhyno has quit IRC (Quit: leaving)
16:03 🔗 Chewable has joined #warrior
16:05 🔗 Chewable has quit IRC (Client Quit)
16:06 🔗 Chewable has joined #warrior
16:08 🔗 Chewable has quit IRC (Client Quit)
16:08 🔗 Chewable has joined #warrior
16:08 🔗 Chewable has quit IRC (Client Quit)
16:08 🔗 Chewable has joined #warrior
16:09 🔗 Chewable has quit IRC (Client Quit)
16:09 🔗 Chewable has joined #warrior
16:09 🔗 Chewable has quit IRC (Client Quit)
16:09 🔗 Chewable has joined #warrior
16:09 🔗 Chewable has quit IRC (Client Quit)
16:09 🔗 Chewable has joined #warrior
16:09 🔗 Chewable has quit IRC (Client Quit)
16:09 🔗 Chewable has joined #warrior
16:09 🔗 Chewable has quit IRC (Client Quit)
16:10 🔗 Chewable has joined #warrior
16:10 🔗 Chewable has quit IRC (Client Quit)
16:10 🔗 Chewable has joined #warrior
16:10 🔗 Chewable has quit IRC (Client Quit)
16:10 🔗 Chewable has joined #warrior
16:10 🔗 Chewable has quit IRC (Client Quit)
16:10 🔗 Chewable has joined #warrior
16:10 🔗 Chewable has quit IRC (Client Quit)
16:10 🔗 Chewable has joined #warrior
16:10 🔗 Chewable has quit IRC (Client Quit)
16:12 🔗 Chewable has joined #warrior
16:12 🔗 Chewable has quit IRC (Client Quit)
16:12 🔗 Chewable has joined #warrior
16:12 🔗 Chewable has quit IRC (Client Quit)
16:17 🔗 Chewable has joined #warrior
16:18 🔗 Chewable can anyone see me?
16:23 🔗 Chewable The item I was working on is downloading thousands of URLs and it's taking hours.
16:34 🔗 chfoo item for what project?
16:36 🔗 WantedFre Chewable, I've got two items with 13000+ urls and have been going for over 9 hours.
16:36 🔗 WantedFre This will take time
16:38 🔗 Chewable is it ok if I shutdown the warrior when the project is still running?
16:38 🔗 Chewable It was tumblr btw
16:41 🔗 chfoo shutting down the warrior without letting it finish will abandon the item. you can save and resume the virtual machine if you need to pause for a few hours
16:42 🔗 Chewable Got it. Thank you for the response guys :)
16:46 🔗 teej_ Chewable: Also check the #tumbledown channel for Tumblr stuff.
16:47 🔗 teej_ You'll get answers faster.
16:47 🔗 Chewable Thank you!
17:13 🔗 chimyatta has joined #warrior
17:20 🔗 JAA surewhyno, kiska, kpcyrd: Besides SOP, one big issue is that you can't get the raw data as sent across the wire, which is what's supposed to be stored in the WARCs. Browsers simply don't expose such low-level APIs. In the best case, you get a normalised version of it with transfer encoding removed, capitalisation of the header fields standardised, whitespace in header fields trimmed, etc. That's not
17:20 🔗 JAA acceptable for proper archival.
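[Editor's sketch] To make JAA's point concrete, here is a minimal illustration of how the bytes sent across the wire differ from a normalised view of the same response. The sample response `RAW` and the helpers `dechunk` and `normalise` are invented for illustration; real browser APIs differ in detail, and the de-chunking ignores trailers and chunk extensions.

```python
# A made-up response as it might appear on the wire.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"content-TYPE:   text/plain  \r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n"
    b"6\r\n world\r\n"
    b"0\r\n\r\n"
)


def dechunk(body: bytes) -> bytes:
    """Decode a chunked body (no trailers or chunk extensions handled)."""
    out = b""
    while True:
        size_line, _, body = body.partition(b"\r\n")
        size = int(size_line, 16)
        if size == 0:
            return out
        out += body[:size]
        body = body[size + 2:]  # skip the chunk data and its trailing CRLF


def normalise(raw: bytes) -> bytes:
    """Re-serialise a response roughly as a high-level API would present it."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.split(b"\r\n")
    headers = []
    chunked = False
    for line in header_lines:
        name, _, value = line.partition(b":")
        # Standardise capitalisation and trim whitespace.
        name, value = name.strip().title(), value.strip()
        if name == b"Transfer-Encoding" and value.lower() == b"chunked":
            chunked = True
            continue  # the transfer encoding is removed
        headers.append(name + b": " + value)
    if chunked:
        body = dechunk(body)
    return b"\r\n".join([status_line, *headers]) + b"\r\n\r\n" + body


print(normalise(RAW) == RAW)  # False: the original bytes are not recoverable
```

The body text survives normalisation, but the original header spelling, spacing, and chunk boundaries do not, so the wire-level record a WARC is supposed to hold cannot be rebuilt from the normalised form.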
18:08 🔗 kpcyrd JAA: is there a reason we care that much about the headers? I would assume a correct archival of the response body is sufficient for most users.
18:39 🔗 JL421_2 has joined #warrior
18:47 🔗 Chewable has quit IRC (Leaving)
18:52 🔗 chfoo there's a difference between an archive and just hosting files. it would be like stripping the cover pages off a book. you'd still have the story intact, but you'd lose info about the author, publisher, etc., things that you need to catalog it
20:05 🔗 JL421_2 has quit IRC (Quit: http://www.kiwiirc.com/ - A hand crafted IRC client)
20:26 🔗 WantedFre has quit IRC (Quit: brb)
