#warrior 2016-04-11,Mon


Time Nickname Message
00:00 🔗 Yoshimura ranma: Just try loading normal pages.
00:00 🔗 Frogging also try loading domains that don't exist
00:00 🔗 Yoshimura Most do not do that. Redirects on nonexistent domains are more common, though
00:00 🔗 bwn https://www.cs.washington.edu/research/security/web-tripwire.html
00:01 🔗 ranma does Archive Team Warrior download 2-4 copies of a page to compare?
00:01 🔗 Frogging no :p
00:01 🔗 Yoshimura arkiver: Thanks, yeah. Or I can modify the warrior, would that be fine? (Wonder where, haha.) There should be a limit on active tasks that depends on the task; someone with a good connection should be able to run all tasks at once.
00:02 🔗 ranma when a roomie gets careless and torrents dangerously, the ISP injects a warning into webpages
00:02 🔗 ranma copyright warning (comcast)
00:02 🔗 Yoshimura Then it's not a clean internet connection.
00:02 🔗 ranma outside of that it's clean
00:03 🔗 Yoshimura ranma: He should set up the torrent client better, though. Max connections: 40-80, max per torrent: 10-15
00:03 🔗 ranma or just not use public trackers *facepalm*
00:03 🔗 Yoshimura These are my personal recommendations. Most clients have far too high default max connection counts.
00:04 🔗 Yoshimura IDK how they detect it. I use only public trackers as I seed OSS.
00:05 🔗 ranma is going to the http version of that tripwire page accurate?
00:05 🔗 ranma i know the isp doesn't inject junk in https connections
00:06 🔗 Yoshimura Not sure what tripwire is, other than the intrusion detection tool on Linux
00:06 🔗 Frogging my ISP has taken to injecting data cap notices into my pages, actually. it doesn't happen anymore because we now have unlimited usage, but I'd really much prefer that they leave my data alone
00:06 🔗 ranma the link bwn just pasted
00:07 🔗 ranma also, assuming you guys haven't changed the default vm settings, how much space are your Archive Team Warrior VM sizes?
00:07 🔗 Frogging that said I'm not going to start using HTTPS-Everywhere or anything :p
00:07 🔗 Frogging ranma: 60GB IIRC
00:07 🔗 ranma it's using all 60gb?
00:07 🔗 Yoshimura ranma: It's a publisher thing, not a user thing
00:07 🔗 Frogging ranma: It'll only use up to 60GB. It's a sparse file: https://wiki.archlinux.org/index.php/sparse_file
00:08 🔗 Frogging (same concept anyway, maybe different implementation)
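A minimal Python sketch of the sparse-file behaviour Frogging describes, assuming the host filesystem supports sparse files; the filename and the 60 GB figure are illustrative only:

    import os

    # Create a sparse file: seek far past the start and write a single byte.
    # Most Linux filesystems only allocate blocks for data actually written,
    # so the apparent size and the space used on disk can differ wildly.
    path = "warrior-disk.img"             # illustrative filename
    with open(path, "wb") as f:
        f.seek(60 * 1024**3 - 1)          # pretend this is the 60 GB warrior disk
        f.write(b"\0")

    st = os.stat(path)
    print("apparent size:", st.st_size)             # ~60 GB
    print("used on disk: ", st.st_blocks * 512)     # a few KB (st_blocks is in 512-byte units)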
00:08 🔗 ranma yeah, i'm doing the URL project due to having little free space
00:08 🔗 ranma *doing just the
00:09 🔗 Yoshimura In my opinion the crawl needs to be managed more efficiently; that would give more of an advantage than worrying about one unclean-net user.
00:09 🔗 Yoshimura Q: Is the Warrior's wget different from the usual one?
00:10 🔗 ranma i'm guessing most servers aren't http/2?
00:10 🔗 ranma most servers the project is crawling
00:10 🔗 Yoshimura Nope, only Google, Facebook, etc. Also, they have to be backwards compatible.
00:10 🔗 Yoshimura Most people are in the stone age when it comes to technology.
00:11 🔗 Yoshimura Like the Ashley Madison leak.
00:11 🔗 bwn ranma: it's an older site, but it just checks whether the HTML has been changed from what it expected. It wouldn't detect periodic things like the data cap warnings Frogging mentioned, unless one happened to be injected when you visited
00:12 🔗 Yoshimura But what you could do is make a transparent proxy that blocks and warns if specific junk is injected
00:12 🔗 bwn and i'm sure there's ways around it, depending on how much anyone would want to invest in doing so
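A rough Python sketch of the check bwn and Yoshimura describe: compare a page fetched over plain HTTP against a known-good reference (reduced here to a hash) and warn if it differs. The URL and hash are placeholders; this is only an illustration of the idea, not the tripwire project's actual code:

    import hashlib
    import urllib.request

    # Illustrative values only: a static page fetched over plain HTTP and the
    # SHA-256 of a known-good copy of it (e.g. captured over a trusted link).
    URL = "http://example.com/"
    EXPECTED_SHA256 = "replace-with-hash-of-known-good-copy"

    def page_hash(url):
        """Fetch a page and return the SHA-256 hex digest of its body."""
        with urllib.request.urlopen(url) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    if page_hash(URL) != EXPECTED_SHA256:
        print("page differs from the reference copy -- possible in-flight modification")

    # As noted above, this only catches injection that happens on the request
    # you actually make, and dynamic pages will differ anyway.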
00:13 🔗 yipdw_ ranma: if your connection stability isn't under your control it would be greatly appreciated if you did not run a warrior on that connection
00:13 🔗 yipdw_ there are some checks, and yes they are defeatable checks
00:13 🔗 yipdw_ please use a VPS or something else instead
00:14 🔗 ranma is it not viable to request for the eventual future an option to only crawl sites that are https-able?
00:15 🔗 ranma *for a client to only crawl https-able sites
00:15 🔗 yipdw_ file an issue
00:15 🔗 ranma kk
00:16 🔗 ranma here? https://github.com/ArchiveTeam/warrior-code
00:16 🔗 yipdw_ no https://github.com/ArchiveTeam/seesaw-kit/issues
00:17 🔗 Yoshimura Q: Is the Warrior's wget different from the usual one? Running headless, I cannot log in to a console to check (or try to check) xD
00:18 🔗 ranma ssh to Warrior!
00:18 🔗 yipdw_ it's wget modified with Lua hooks so yes
00:19 🔗 Yoshimura Okay, I have a problem with normal wget: the ... dedup is not working at all.
00:19 🔗 Yoshimura I even tried modifying the link to literally the same values as in the index file.
00:21 🔗 yipdw_ I don't know what you're referring to
00:21 🔗 yipdw_ an example would be nice
00:23 🔗 Yoshimura I have to study the hooks and the platform more. I started working on my own stuff; got fast, evented SSL working. I need to handle the request/response side and turn it into WARC, or I will modify existing stuff. But it's a nice thing to learn on.
00:24 🔗 yipdw_ wget has WARC support, why not reuse it
00:24 🔗 Yoshimura I was referring to using wget on a usual Linux distro. Got a segfault, ran out of space; now running again to a different file, but dedup does not work at all.
00:24 🔗 yipdw_ yeah the WARC patches were upstreamed
00:24 🔗 yipdw_ it's going to be in wget
00:25 🔗 Yoshimura So now I've got a few hundred thousand pages, but I would need to filter all the links and run it again.
00:25 🔗 Yoshimura It's in wget, but either there is an unpatched bug or I've got a slightly older version.
00:25 🔗 yipdw_ l
00:25 🔗 yipdw_ k
00:26 🔗 Yoshimura GNU Wget 1.17.1 built on linux-gnu.
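For reference, wget's WARC support (including 1.17.x, as far as I can tell) exposes dedup options: a first pass can write a CDX index with --warc-cdx, and a later pass can point --warc-dedup at that index so already-captured responses are stored as small revisit records instead of full copies. A hedged sketch of the invocation, wrapped in Python; the URL and filenames are placeholders:

    import subprocess

    # Placeholder URL and filenames; a real crawl would add its own options.
    subprocess.run([
        "wget", "--recursive",
        "--warc-file=crawl-pass2",        # writes crawl-pass2.warc.gz
        "--warc-cdx",                     # also write a CDX index for this pass
        "--warc-dedup=crawl-pass1.cdx",   # CDX index produced by the previous pass
        "http://example.com/",
    ], check=True)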
00:39 🔗 ranma has anyone proposed backing up minus.com?
00:39 🔗 ranma (imgur clone)
00:39 🔗 ranma or is it too far gone to capture?
00:44 🔗 * Yoshimura notes, about the reusing: I can shape it to my own picture from start to finish.
00:45 🔗 Yoshimura ranma: IDK, one could also make compact archives (recompressing)
00:45 🔗 Yoshimura Archiving is more about knowledge than personal stuff, at least to me. Personal stuff will go away and people will die.
00:46 🔗 ranma ah
00:50 🔗 * Yoshimura also notes that he had 50 threads of wget running. He can imagine that using events and a single thread could be more efficient. Wget by some magic uses only 7 MB of RAM; his does 10 MB (different SSL), but each additional parallel download adds +2 MB. Packed into a small, efficient virtual machine, it could run headless virtually anywhere, reaching a much wider audience of people. Also, being disk-space efficient, people with small laptop drives or small boar
02:00 🔗 Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
02:01 🔗 Yoshimura has joined #warrior
03:19 🔗 bwn has quit IRC (Ping timeout: 492 seconds)
03:20 🔗 BnA-Rob1n has quit IRC (Ping timeout: 244 seconds)
03:22 🔗 BnA-Rob1n has joined #warrior
03:27 🔗 Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
03:31 🔗 Yoshimura has joined #warrior
03:41 🔗 Yoshimura 14516=200 http://duxrampant.yuku.com/forum/view/id/94/mode/or/addtags/conversions/?view=forum_blockview&view=forum_blockview&view=forum_blockview&view=forum_blockview&view=forum_blockview&view=forum_tableview&view=forum_blockview&view=forum_blockview&view=forum_blockview&view=forum_tableview&view=forum_blockview.
03:41 🔗 Yoshimura Stuck on this. It's repeatedly downloading the same thing.
03:43 🔗 Yoshimura It's nonstop switching between tableview and blockview.
03:44 🔗 Yoshimura Recursive loop.
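The loop comes from the page linking back to itself with the view parameter appended yet again on every visit, so each link looks like a new URL. A crawler can break cycles like this by normalising URLs before consulting its seen set; a small Python sketch (collapsing duplicate query parameters is just one possible policy, not what the Warrior scripts actually do):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def normalize(url):
        """Collapse repeated query parameters (keeping the last value) so URLs
        that differ only by duplicated parameters map to the same key."""
        parts = urlsplit(url)
        query = urlencode(dict(parse_qsl(parts.query)))
        return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

    seen = set()

    def should_fetch(url):
        key = normalize(url)
        if key in seen:
            return False
        seen.add(key)
        return True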
04:09 🔗 Yoshimura See you later, leave messages if anything.
04:12 🔗 bwn has joined #warrior
05:17 🔗 Honno has joined #warrior
06:55 🔗 ariscop has quit IRC (Quit: Leaving)
07:20 🔗 Honno has quit IRC (Read error: Operation timed out)
07:31 🔗 bwn has quit IRC (Read error: Operation timed out)
07:41 🔗 ariscop has joined #warrior
08:11 🔗 bwn has joined #warrior
09:58 🔗 Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
10:26 🔗 ariscop anyone else having trouble with the warrior lagging browsers?
10:29 🔗 ariscop oh i think i know what it's doing
10:36 🔗 ariscop and it's hard to measure because it's locking up my browser :/
10:40 🔗 ariscop itemLog.data = processCarriageReturns(itemLog.data + msg.data); < ever growing cpu usage
10:59 🔗 ariscop https://github.com/ArchiveTeam/seesaw-kit/pull/99
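The line ariscop quotes rebuilds the whole itemLog.data string and re-runs processCarriageReturns over it on every incoming message, so the per-message cost grows with the size of the log. A rough Python illustration of the difference between reprocessing everything and processing only the new chunk (process_carriage_returns here is a simplified stand-in, not the warrior UI's actual routine):

    def process_carriage_returns(text):
        # Stand-in for the real routine: keep only what follows the last \r on
        # each line, roughly how a terminal renders overwritten progress bars.
        return "\n".join(line.rsplit("\r", 1)[-1] for line in text.split("\n"))

    # Quadratic: each message re-scans everything accumulated so far.
    log = ""
    def append_slow(msg):
        global log
        log = process_carriage_returns(log + msg)

    # Linear: only the new chunk is processed before being appended.
    chunks = []
    def append_fast(msg):
        chunks.append(process_carriage_returns(msg))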
13:27 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
13:30 🔗 GLaDOS has joined #warrior
13:56 🔗 Honno has joined #warrior
13:59 🔗 Start has quit IRC (Quit: Disconnected.)
14:29 🔗 Yoshimura has joined #warrior
14:54 🔗 Start has joined #warrior
16:06 🔗 Start has quit IRC (Quit: Disconnected.)
16:23 🔗 * Yoshimura ranma: I think there should be a way to limit upload speed; trickle can do that. Limiting bandwidth for the whole virtual machine helps with download, but if upload is much lower, then it's unusable. You might file one more issue :P
17:05 🔗 yipdw_ virtualbox can do that too
17:29 🔗 Honno has quit IRC (Read error: Operation timed out)
17:39 🔗 Yoshimura yipdw_: Provide link or information.
17:41 🔗 bwn http://archiveteam.org/index.php?title=ArchiveTeam_Warrior
17:50 🔗 Start has joined #warrior
17:52 🔗 bwn has quit IRC (Ping timeout: 246 seconds)
17:55 🔗 Yoshimura That link only has an overall bandwidth limit, not an upload limit.
17:56 🔗 Yoshimura Limiting upload to 1M would mean limiting download to 1M as well (or the aggregate; not sure how it's implemented), which would be counterproductive.
18:05 🔗 Honno has joined #warrior
18:06 🔗 xmc upload is gzipped and download is whatever you get from the site (usually not gzipped)
18:09 🔗 Start has quit IRC (Quit: Disconnected.)
18:17 🔗 Honno_ has joined #warrior
18:21 🔗 Honno has quit IRC (Read error: Operation timed out)
18:22 🔗 Start has joined #warrior
18:25 🔗 bwn has joined #warrior
18:29 🔗 Honno has joined #warrior
18:38 🔗 Honno_ has quit IRC (Read error: Operation timed out)
19:44 🔗 Start has quit IRC (Quit: Disconnected.)
21:47 🔗 ariscop has quit IRC (Leaving)
22:14 🔗 Start has joined #warrior
22:19 🔗 chfoo- has quit IRC (Read error: Operation timed out)
22:19 🔗 chfoo- has joined #warrior
22:19 🔗 svchfoo1 sets mode: +o chfoo-
22:54 🔗 ariscop has joined #warrior
