[00:00] ranma: Just try loading normal pages.
[00:00] also try loading domains that don't exist
[00:00] Most do not do that. Redirects on nonexistent domains are more common, though.
[00:00] https://www.cs.washington.edu/research/security/web-tripwire.html
[00:01] does Archive Team Warrior download 2-4 copies of a page to compare?
[00:01] no :p
[00:01] arkiver: Thanks, yeah. Or I can modify the warrior; would that be fine? (Wonder where, haha.) There should be a limit on active tasks per task type. Anyone with a good connection should be able to run all tasks at once.
[00:02] when a roommate gets reckless and torrents dangerously, the ISP injects a warning into webpages
[00:02] copyright warning (Comcast)
[00:02] Then it's not a clean internet connection.
[00:02] outside of that it's clean
[00:03] ranma: He should set up the torrent client better, though. Max connections: 40-80, max per torrent: 10-15.
[00:03] or just not use public trackers *facepalm*
[00:03] Those are my personal recommendations. Most clients ship with far too high maximum connection counts.
[00:04] IDK how they detect it. I only use public trackers, since I seed OSS.
[00:05] is going to the HTTP version of that tripwire page accurate?
[00:05] i know the ISP doesn't inject junk into HTTPS connections
[00:06] Not sure what tripwire is, other than the intrusion detection tool on Linux.
[00:06] my ISP has taken to injecting data cap notices into my pages, actually. it doesn't happen anymore because we now have unlimited usage, but I'd much prefer that they leave my data alone
[00:06] the link bwn just pasted
[00:07] also, assuming you guys haven't changed the default VM settings, how big are your Archive Team Warrior VMs?
[00:07] that said I'm not going to start using HTTPS-Everywhere or anything :p
[00:07] ranma: 60GB IIRC
[00:07] it's using all 60GB?
[00:07] ranma: It's a publisher-side thing, not user-side.
[00:07] ranma: It'll only use up to 60GB. It's a sparse file: https://wiki.archlinux.org/index.php/sparse_file
[00:08] (same concept anyway, maybe a different implementation)
[00:08] yeah, i'm doing the URL project due to having little free space
[00:08] *doing just the
[00:09] In my opinion the crawl needs to be managed more efficiently; that would gain more than excluding one unclean-connection user.
[00:09] Q: Is the Warrior's wget different from the usual one?
[00:10] i'm guessing most servers aren't HTTP/2?
[00:10] most servers the project is crawling
[00:10] Nope, only Google, Facebook, etc. And they have to stay backwards compatible.
[00:10] Most people are in the stone age when it comes to technology.
[00:11] Like the Madison leak.
[00:11] ranma: it's an older site, but it just checks whether the HTML has been changed from what it expected. it wouldn't detect periodic things like the data cap warnings frogging mentioned, unless the injection happened while you visited
[00:12] But what you could do is make a transparent proxy that blocks and warns when specific junk is injected
[00:12] and i'm sure there are ways around it, depending on how much anyone would want to invest in doing so
[00:13] ranma: if your connection's stability isn't under your control, it would be greatly appreciated if you did not run a warrior on that connection
[00:13] there are some checks, and yes, they are defeatable checks
[00:13] please use a VPS or something else instead
[00:14] is it not viable to request, for the eventual future, an option to only crawl sites that support HTTPS?
[00:15] *for a client to only crawl HTTPS-able sites
[00:15] file an issue
[00:15] kk
[00:16] here? https://github.com/ArchiveTeam/warrior-code
[00:16] no, https://github.com/ArchiveTeam/seesaw-kit/issues
[00:17] Q: Is the Warrior's wget different from the usual one? Running headless, I can't get to a console to check. xD
[00:18] SSH into the Warrior!
[00:18] it's wget modified with Lua hooks, so yes
[00:19] Okay, I have a problem with plain wget: the ... dedup is not working at all.
[00:19] I even tried modifying the link to literally the same values as in the index file.
[00:21] I don't know what you're referring to
[00:21] an example would be nice
[00:23] I have to study the hooks and the platform more. I started working on my own stuff: got fast, event-driven SSL working. I still need to handle the request/response side and write it out as WARC, or I'll modify existing tools. But it's a nice thing to learn on.
[00:24] wget has WARC support; why not reuse it?
[00:24] I was referring to using wget on a usual Linux distro. I got a segfault after running out of space; now it's running again into a different file, but dedup does not work at all.
[00:24] yeah, the WARC patches were upstreamed
[00:24] it's going to be in wget
[00:25] So now I've got a few hundred thousand pages, but I'd need to filter all the links and run it again.
[00:25] It's in wget, but either there's an unpatched bug or I have a slightly older version.
[00:25] l
[00:25] k
[00:26] GNU Wget 1.17.1 built on linux-gnu.
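For context on the dedup complaints above: in stock GNU Wget, WARC deduplication is driven by CDX files. You write an index with --warc-cdx on the first crawl, then feed it back with --warc-dedup on the next; payloads whose digest matches an indexed record should be written as small revisit records instead of full copies. A minimal sketch of that workflow, wrapped in Python (the URL and file names are hypothetical; the flag names are per the GNU Wget manual):

```python
import subprocess

# First pass: capture to crawl-1.warc.gz and write a CDX index (crawl-1.cdx).
subprocess.run([
    "wget", "--mirror", "--page-requisites",
    "--warc-file=crawl-1", "--warc-cdx",
    "https://example.com/",
])

# Second pass: hand the CDX back in. Responses already listed in
# crawl-1.cdx should become revisit records rather than full copies.
subprocess.run([
    "wget", "--mirror", "--page-requisites",
    "--warc-file=crawl-2", "--warc-cdx",
    "--warc-dedup=crawl-1.cdx",
    "https://example.com/",
])
```

If the second pass still stores full copies of unchanged pages, that points at the kind of version-specific bug suspected above.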
[00:39] has anyone proposed backing up minus.com?
[00:39] (imgur clone)
[00:39] or is it too far gone to capture?
[00:44] * Yoshimura notes, about the reusing... I can make it into my picture from start to finish.
[00:45] ranma: IDK, one could also make compact archives (recompressing)
[00:45] Archiving is more about knowledge than personal stuff, at least to me. Personal stuff will go, and people will die.
[00:46] ah
[00:50] ACTION also notes that he had 50 threads of wget running. I can imagine that using events and a single thread could be more efficient. wget by some magic uses only 7 MB of RAM; mine uses 10 MB (different SSL library), but each additional parallel download adds about 2 MB. Packed into a small, efficient VM image, it could run headless virtually anywhere, which would reach a much wider audience. Also, being disk-space efficient, people with small laptop drives or small boards ...
[02:00] *** Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[02:01] *** Yoshimura has joined #warrior
[03:19] *** bwn has quit IRC (Ping timeout: 492 seconds)
[03:20] *** BnA-Rob1n has quit IRC (Ping timeout: 244 seconds)
[03:22] *** BnA-Rob1n has joined #warrior
[03:27] *** Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[03:31] *** Yoshimura has joined #warrior
[03:41] 14516=200 http://duxrampant.yuku.com/forum/view/id/94/mode/or/addtags/conversions/?view=forum_blockview&view=forum_blockview&view=forum_blockview&view=forum_blockview&view=forum_blockview&view=forum_tableview&view=forum_blockview&view=forum_blockview&view=forum_blockview&view=forum_tableview&view=forum_blockview.
[03:41] Stuck on this; it's repeatedly downloading the same thing.
[03:43] It's nonstop switching between tableview and blockview.
[03:44] Recursive loop.
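That trap (the same page reachable under an ever-growing ?view=... query string) is the classic argument for canonicalizing URLs before queueing them. A minimal sketch, assuming a simple visited-set crawler rather than the Warrior's actual pipeline (function and variable names here are illustrative):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def canonicalize(url):
    """Collapse duplicate query parameters (last value wins) and sort keys,
    so blockview/tableview loop variants of one page compare equal."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(sorted(params.items()))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

seen = set()

def should_fetch(url):
    """Queue a URL only if its canonical form has not been seen before."""
    key = canonicalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Under this normalization, every variant of the yuku.com URL above reduces to a single key with one view parameter, so the crawler fetches it once and moves on.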
[04:09] See you later; leave messages if anything.
[04:12] *** bwn has joined #warrior
[05:17] *** Honno has joined #warrior
[06:55] *** ariscop has quit IRC (Quit: Leaving)
[07:20] *** Honno has quit IRC (Read error: Operation timed out)
[07:31] *** bwn has quit IRC (Read error: Operation timed out)
[07:41] *** ariscop has joined #warrior
[08:11] *** bwn has joined #warrior
[09:58] *** Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[10:26] anyone else having trouble with the warrior lagging browsers?
[10:29] oh, i think i know what it's doing
[10:36] and it's hard to measure because it's locking up my browser :/
[10:40] itemLog.data = processCarriageReturns(itemLog.data + msg.data); < ever-growing CPU usage
[10:59] https://github.com/ArchiveTeam/seesaw-kit/pull/99
[13:27] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[13:30] *** GLaDOS has joined #warrior
[13:56] *** Honno has joined #warrior
[13:59] *** Start has quit IRC (Quit: Disconnected.)
[14:29] *** Yoshimura has joined #warrior
[14:54] *** Start has joined #warrior
[16:06] *** Start has quit IRC (Quit: Disconnected.)
[16:23] * Yoshimura ranma: I think there should be a way to limit upload speed; trickle can do that. Limiting bandwidth for the whole virtual machine helps with download, but if upload is much lower, it's unusable. You might file one more issue :P
[17:05] VirtualBox can do that too
[17:29] *** Honno has quit IRC (Read error: Operation timed out)
[17:39] yipdw_: Provide a link or information.
[17:41] http://archiveteam.org/index.php?title=ArchiveTeam_Warrior
[17:50] *** Start has joined #warrior
[17:52] *** bwn has quit IRC (Ping timeout: 246 seconds)
[17:55] That link only covers an overall bandwidth limit, not upload.
[17:56] Limiting upload to 1M would mean limiting download to 1M as well, or in aggregate; I'm not sure how it's implemented. Either way, that would be counterproductive.
[18:05] *** Honno has joined #warrior
[18:06] upload is gzipped and download is whatever you get from the site (usually not gzipped)
[18:09] *** Start has quit IRC (Quit: Disconnected.)
[18:17] *** Honno_ has joined #warrior
[18:21] *** Honno has quit IRC (Read error: Operation timed out)
[18:22] *** Start has joined #warrior
[18:25] *** bwn has joined #warrior
[18:29] *** Honno has joined #warrior
[18:38] *** Honno_ has quit IRC (Read error: Operation timed out)
[19:44] *** Start has quit IRC (Quit: Disconnected.)
[21:47] *** ariscop has quit IRC (Leaving)
[22:14] *** Start has joined #warrior
[22:19] *** chfoo- has quit IRC (Read error: Operation timed out)
[22:19] *** chfoo- has joined #warrior
[22:19] *** svchfoo1 sets mode: +o chfoo-
[22:54] *** ariscop has joined #warrior
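On the upload-limiting question at [16:23]: trickle shapes a single process's socket traffic, which amounts to a token-bucket rate limiter in front of send(). A minimal sketch of the same idea (rate and chunk size are arbitrary, and this is illustrative, not trickle's actual implementation):

```python
import socket
import time

class UploadThrottle:
    """Token bucket: allow on average at most `rate` bytes per second."""
    def __init__(self, rate):
        self.rate = rate                 # bytes per second
        self.tokens = rate               # start with one second of budget
        self.last = time.monotonic()

    def wait_for(self, nbytes):
        """Deduct `nbytes`; if the bucket goes negative, sleep off the debt."""
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        if self.tokens < 0:
            time.sleep(-self.tokens / self.rate)

def throttled_sendall(sock, data, throttle, chunk=16384):
    """Send `data` in chunks, paying for each chunk from the bucket first."""
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        throttle.wait_for(len(piece))
        sock.sendall(piece)
```

Throttling only the upload path this way leaves downloads at full speed, which is the asymmetry being asked for above.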