[00:00] *** j08nY has quit IRC (Quit: Leaving)
[00:22] MrRadar2: what URL?
[00:25] *** sheaf has quit IRC (Quit: sheaf)
[01:23] *** bitBaron has joined #archiveteam-bs
[01:35] it does seem like something's up with imzy
[01:36] i'm using the warrior script, and i haven't seen any of my threads actually upload any data
[01:40] my urlte.am and eroshare threads are archiving just fine though...
[03:11] *** th1x has joined #archiveteam-bs
[03:17] *** dashcloud has quit IRC (Remote host closed the connection)
[03:18] *** dashcloud has joined #archiveteam-bs
[04:43] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:48] *** Sk1d has joined #archiveteam-bs
[04:57] *** crusher has quit IRC (Ping timeout: 268 seconds)
[05:10] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[05:19] *** Aranje has joined #archiveteam-bs
[05:25] *** th1x has quit IRC (Read error: Operation timed out)
[05:46] *** Aranje has quit IRC (Three sheets to the wind)
[06:09] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[06:22] *** schbirid has joined #archiveteam-bs
[06:40] *** kyounko has joined #archiveteam-bs
[06:41] *** voidsta has quit IRC (Remote host closed the connection)
[06:43] *** voidsta- has joined #archiveteam-bs
[06:44] *** voidsta- has quit IRC (Client Quit)
[06:44] *** voidsta- has joined #archiveteam-bs
[06:49] *** voidsta- is now known as voidsta
[06:52] *** Jonison has joined #archiveteam-bs
[07:17] *** j08nY has joined #archiveteam-bs
[07:44] *** SHODAN_UI has joined #archiveteam-bs
[07:59] *** jtn2 has joined #archiveteam-bs
[08:14] *** jrwr has quit IRC (Read error: Operation timed out)
[08:15] *** robogoat has quit IRC (Read error: Operation timed out)
[08:15] *** robogoat has joined #archiveteam-bs
[08:39] *** j08nY has quit IRC (Quit: Leaving)
[09:10] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[09:19] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[10:28] *** logchfoo2 starts logging #archiveteam-bs at Wed Jun 21 10:28:19 2017
[10:28] *** logchfoo2 has joined #archiveteam-bs
[11:10] *** SHODAN_UI has joined #archiveteam-bs
[11:48] *** victorbje has joined #archiveteam-bs
[12:27] *** C4K3_ has joined #archiveteam-bs
[12:30] *** C4K3 has quit IRC (Ping timeout: 260 seconds)
[12:36] *** icedice has joined #archiveteam-bs
[12:37] *** th1x has joined #archiveteam-bs
[12:56] arkiver: For example https://www.imzy.com/api/accounts/profiles/daylen?check=true
[12:57] I can send you one of the partial WARCs if that would help
[12:57] They all have that ?check=true parameter
[13:10] *** vbdc has joined #archiveteam-bs
[13:11] getting rate-limited when doing the upload; 120 connections seems like a small limit. Anything I can do to help work around this bottleneck?
[13:34] MrRadar: If I'm logged in and view my profile, the ?check=true API call comes back with an empty 200 OK; for someone else's profile it is an empty 401. I suspect this is an authenticated API call and isn't suitable for archiving.
[13:35] The 206 when unauthenticated is weird, though. I can check with weffey...
[13:36] *** vbdc has quit IRC (Ping timeout: 268 seconds)
[13:42] arkiver, MrRadar: weffey says any ?check call should be skipped -- it's an auth'd call to see if an object exists (and whether the user has permissions to it) without getting the full payload.
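The actual Imzy grab scripts do this kind of filtering in wget-lua callbacks; as a rough, hypothetical Python illustration of the rule weffey describes (skip any API URL carrying the check parameter):

```python
from urllib.parse import urlparse, parse_qs

def should_skip(url):
    """Skip Imzy API existence probes: per weffey, any ?check call is an
    authenticated check for an object's existence/permissions and carries
    no archivable payload."""
    query = parse_qs(urlparse(url).query)
    return query.get("check") == ["true"]

# The URL MrRadar quoted above would be filtered out:
assert should_skip("https://www.imzy.com/api/accounts/profiles/daylen?check=true")
assert not should_skip("https://www.imzy.com/api/accounts/profiles/daylen")
```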
[13:47] *** crusher_ has joined #archiveteam-bs
[13:48] vbdc: That's just the limit for FOS
[13:48] We've tried raising it in the past but it actually slows down due to the server's disk I/O getting saturated
[13:49] We could possibly add another rsync target
[13:49] If someone wants to volunteer one
[14:04] MrRadar: what are the requirements for adding an rsync target? Might be able to host one
[14:21] *** icedice has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
[14:22] *** icedice has joined #archiveteam-bs
[14:28] victorbje: IIRC at least 500 Mbit/s Internet and several TB of storage. arkiver can give you more details
[14:35] i wonder how hard it would be to redirect the path the warrior uses to cache the scraped files based on file size
[14:36] My biggest bottleneck right now is my lack of RAID disks and / or that it's a pretty slow drive
[14:37] so i was thinking of using a ramdisk to cache all the small files and selectively throw large ones to disk
[14:39] So according to the Tilt API, the US and Australia are states, Canada and the UK are provinces, and Ireland's a county (no, not "country"). ¯\_(ツ)_/¯
[14:41] lol
[14:41] crusher_: if you're referring to FOS, that actually wouldn't help too much. FOS runs the "megawarc factory" which combines individual items together into "megawarcs", so when there are projects with tons of small files it uses pretty much all the I/O resources it can
[14:41] I see.
[14:42] on average how big can those file-balls get?
[14:42] 40 GB is the usual size
[14:42] Some projects with extra-large items use 80 GB
[14:42] *spits cereal* that's what a warrior spits out per thread?
[14:42] that doesn't sound right...
[14:42] no
[14:43] No, that's the size of a megawarc
[14:43] Individual items can range from a few KB to a dozen or so GB depending on the project
[14:43] *** bitBaron has joined #archiveteam-bs
[14:44] that makes more sense
[14:44] You can get a sense of it from here: http://fos.textfiles.com/pipeline.html
[14:44] The "inbox" is the items waiting to be megawarc'ed
[14:44] The outbox holds megawarcs waiting to be uploaded to IA
[14:45] Another useful page here lists the items as they are uploaded: http://fos.textfiles.com/ARCHIVETEAM/
[14:47] interesting
[14:48] so on the client side, a ramdisk could be useful for small files provided the connection isn't saturated, correct?
[14:48] Nice. I've never seen that before.
[14:49] From what I understand, the data is saved to a temporary file, gzipped, and then concatenated onto the end of the result WARC file
[14:50] So I'm not sure how much a ramdisk would help, especially since Linux in general has a very good disk cache
[14:50] (As long as it has enough free memory)
[14:51] so in essence, i should allocate more ram to the warriors and let them do their thing
[14:51] whatever it takes to give the poor HDD some breathing room
[14:51] Yes, it's worth a try. I think it does flush the file out to disk when it's done downloading it, but when it gets read back it should be reading from the cache
[14:52] right now it's getting hammered to 100% with non-sequential I/O
[14:52] probably because it's running 10 warriors but....
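For readers wondering why "gzip, then concatenate" produces a valid WARC: each record is compressed as its own gzip member, and gzip readers decompress back-to-back members as one continuous stream, so appending never requires rewriting the file. A minimal sketch (the file name and payloads are made up):

```python
import gzip

# Append two records as separate gzip members, the way per-record
# compression works for .warc.gz output.
with open("example.warc.gz", "wb") as out:
    for record in (b"record one\n", b"record two\n"):
        out.write(gzip.compress(record))

# A gzip reader treats the concatenated members as one continuous stream.
with gzip.open("example.warc.gz", "rb") as f:
    assert f.read() == b"record one\nrecord two\n"
```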
[14:52] (shhh details)
[14:55] *** odemg has quit IRC (Read error: Operation timed out)
[15:07] looking at the ram usage, i can see that of the 400 MB allocated, it's only using between 60 and 82 MB
[15:07] (400 each)
[15:08] Check out what top says inside the VM
[15:12] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[15:15] *** odemg has joined #archiveteam-bs
[15:18] crusher_: You can also try running the scripts directly to reduce the overhead.
[15:18] Yeah, having 1 kernel schedule the I/O for everything would probably do a better job than 10 kernels that aren't aware of what the others are doing
[15:20] yeah...
[15:21] scrolling, catching up
[15:22] *** odemg has quit IRC (Read error: Operation timed out)
[15:22] crusher_: ram disk won't really provide any benefit
[15:22] I've seen projects where we've maxed out multiple gigabit links constantly; you wouldn't be able to get it into RAM, megawarc it, then offload it quickly enough
[15:22] *** odemg has joined #archiveteam-bs
[15:23] i'm talking client (warrior) side
[15:23] ah, warrior-side I've run stuff in RAM before
[15:24] as long as you run under capacity - knowing that some items will obviously be a long way outside the average, and cause issues
[15:24] For example, items for Eroshare have *huge* variation in size. From a few MB to 15+ GB
[15:25] that's what i'm currently using most of the warriors on
[15:25] But even for projects like Yahoo Answers, which is mostly in the neighborhood of 100 MB per item, I've had a few GB-sized items
[15:25] it seems to be the only current project that isn't 100% saturated or done
[15:27] i'd help with newsgrabber, but the warrior seems to freeze and do nothing on that one...
[15:28] hmm
[15:28] what does the webui show when it's frozen?
[15:28] current project screen is blank
[15:28] oh hm
[15:28] the "available projects" screen shows it is working on newsgrabber
[15:29] newsgrabber has a ton of requirements; it's possible that the warrior install script doesn't get them all
[15:29] hmm.
[15:34] so if i were to run them in the host OS, how difficult of a process would that be
[15:35] not too much work, all the setup instructions are in the git repo
[15:38] *** LastNinja has quit IRC (Read error: Operation timed out)
[15:44] dumb question, but there are 12 pages of projects... Which one should i be looking for?
[15:44] For people using OpenVPN: https://guidovranken.wordpress.com/2017/06/21/the-openvpn-post-audit-bug-bonanza/
[15:45] crusher_: I'd start from the wiki homepage, where the currently active projects are listed.
[15:45] ah, so there's no way to run them warrior-like in the host
[15:45] it's all manual?
[15:46] Not sure what you mean by "all", but yes, more things are manual than in the warrior. The code doesn't update automatically, for example, and there is no "ArchiveTeam's Choice" equivalent for scripts.
[15:46] right
[15:47] For most projects, you clone the corresponding git repository and execute something like run-pipeline pipeline.py --concurrent N NICK
[15:47] Once you have the dependencies installed, that is.
[15:47] You may also want to use --disable-web-server, depending on your setup.
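If you end up running several pipelines by hand, a launcher along these lines keeps the invocations straight. This is a hypothetical sketch: the repo names, ports, and concurrency values are placeholders, and it assumes seesaw's run-pipeline is on your PATH and accepts a --port flag for the web UI (giving each pipeline its own port, as noted just below).

```python
import subprocess

NICK = "yournick"  # your tracker nickname (placeholder)

# (checkout directory, web UI port, concurrent downloads) -- all placeholders
PIPELINES = [
    ("urlteam2-grab", 8001, 10),
    ("yahooanswers-grab", 8002, 3),
]

# Start every pipeline, each in its own project checkout.
procs = [
    subprocess.Popen(
        ["run-pipeline", "pipeline.py",
         "--concurrent", str(concurrent),
         "--port", str(port),  # distinct port per pipeline so the UIs don't collide
         NICK],
        cwd=repo,
    )
    for repo, port, concurrent in PIPELINES
]

for p in procs:
    p.wait()
```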
[15:48] if you had a spare i5 with 8 GB of RAM and a 300/300 internet connection, what would you run?
[15:48] If you run multiple pipelines and you *do* want to run the web UI, make sure you assign each pipeline a different port
[15:49] I'd run URLTeam, Yahoo Answers, Eroshare, and maybe Imzy (though I suspect that needs a script update)
[15:50] how many concurrent runs for each?
[15:51] this machine is 100% available
[15:51] I'm running 10 on URLTeam and 3 on Yahoo Answers.
[15:51] github.com/archiveteam/newsgrabber-warrior
[15:51] Yahoo bans pretty quickly if you go too fast.
[15:52] something i noticed with urlteam on the warrior was that it was constantly running out of tasks to do
[15:52] Imzy was running fine with 6 concurrent threads before, but I haven't tried again since the latest updates.
[15:53] newsgrabber will never run out of jobs, that's part of the fun
[15:59] is the "time left" the time until the service shuts down or until all items are done at the current speed?
[15:59] in the web ui, top right
[16:00] *** bitBaron has joined #archiveteam-bs
[16:01] Until the service shuts down
[16:01] Though they sometimes stay up for hours or even days past their official shutdown time
[16:02] or shut down sooner than they said they would
[16:02] Or occasionally they whitelist us to let us access the service after the official shutdown
[16:07] all right, thanks
[16:27] Speaking of which, I heard from weffey that Imzy will probably go dark at *around* 06:00 UTC on 2017-06-23, depending on other scheduling constraints.
[16:28] well, either the script broke or there's nothing left to grab
[16:28] crusher_: The Imzy script needs an update to ignore the ?check=true URLs and then a requeue
[16:29] I'm sure arkiver will get around to it when he has time
[17:24] http://sarahcandersen.com/post/162085779429 ;-)
[17:30] An interesting sci-fi story on that subject from the author of the story that Arrival was based on: http://subterraneanpress.com/magazine/fall_2013/the_truth_of_fact_the_truth_of_feeling_by_ted_chiang
[17:31] It's food for thought
[17:53] *** jrwr has joined #archiveteam-bs
[17:58] Is there a tool to download wordpress sites? (other than wget)
[18:05] what would such a tool do that wget doesn't do?
[18:09] wpull :P
[18:09] same sort of thing as the tumblr tools I've seen. a big one is dealing with pagination changing between crawls, and tags. any kind of parsing (here are the comments, here is the post, here are the tags) would be extra. i'd also be happy being pointed at a good wget config, though.
[18:10] so, if I visit a site again after 1 new blog post, I'd rather not download the ENTIRE blog's worth of index, which is what will happen right now
[18:13] hmm, good point
[18:15] This is actually a problem with lots of stuff, not just wordpress :(
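No such incremental tool turned up in the channel; as one hedged sketch of what it could look like, assuming the target blog runs WordPress 4.7+ with the REST API enabled (the site URL and timestamp below are placeholders): fetch only the posts published after the previous crawl instead of re-walking the whole index.

```python
import requests

def posts_since(site, last_crawl_iso):
    """Fetch only posts newer than the last crawl via the WordPress REST API."""
    posts, page = [], 1
    while True:
        r = requests.get(
            f"{site}/wp-json/wp/v2/posts",
            params={"after": last_crawl_iso, "per_page": 100, "page": page},
            timeout=30,
        )
        if r.status_code == 400:  # WordPress returns 400 once you page past the end
            break
        r.raise_for_status()
        batch = r.json()
        if not batch:
            break
        posts.extend(batch)
        page += 1
    return posts

# e.g. posts_since("https://example.com", "2017-06-01T00:00:00")
```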
[18:44] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[18:48] Was watching the slides on how Jason got sued for two billion dollars, found this on the IA: https://archive.org/details/ModeleskiCompOrder
[18:48] Pretty much he has been marked insane by the courts
[18:49] "Founder Paul Andrew Mitchell, an advanced systems development consultant for 35 years, has spent the past sixteen years since 1990 A.D. doing a detailed investigation of the United States Constitution, federal statute laws, and the important court cases."
[18:49] AD
[18:49] LOL
[18:50] Yep
[18:50] I'm reading the whole thing now
[18:50] holy shit
[18:51] the top of page 3 is fucking gold
[18:51] GOLD
[18:52] Another "entertaining" "sued by an insane idiot" story is the time "game studio" Digital Homicide sued Jim Sterling for $10M+ for trashing one of their garbage shovelware games: https://www.youtube.com/watch?v=qS-LXvhy1Do
[18:52] SketchCo1: This guy is a hoot
[18:54] "Defendant Mitchell shall undertake formal competency restoration procedures at a qualified federal medical center" <-- what does *that* mean?
[18:54] *** kisspunch has quit IRC (Quit: ZNC - http://znc.in)
[18:54] In hindsight, I'm sure it was frustrating for him to deal with this baloney :/
[18:54] When the case was ongoing
[18:54] Right
[18:54] *** kisspunch has joined #archiveteam-bs
[18:55] I feel bad for both parties, but for different reasons.
[18:55] he was deemed a "Mass Mailer" by the courts as well
[18:55] https://www.plainsite.org/dockets/1z7yzelvr/washington-western-district-court/usa-v-modeleski/
[18:55] my god
[18:55] there is so much content
[18:55] *** powerKitt has joined #archiveteam-bs
[18:55] wait, that whole site links back to the IA
[18:55] that's interesting
[18:56] IA spat out an rsync error trying to transfer the files for something I uploaded via torrent, and I stupidly deleted the files off my hard drive since I thought it was done. Is there any way they can be recovered from a backup on IA's end?
[18:56] https://catalogd.archive.org/log_show.php?task_id=682448038&full=1
[19:03] ok wait, re: the topic: we found a way to scrape arbitrary Domino's orders by enumerating URLs...
[19:03] wat
[19:04] lol
[19:04] LOL
[19:04] yeah, we were trying to automate ordering pizza like sensible programmers and typo'd something, and got someone else's order?
[19:04] make a warrior
[19:04] it will be good data for the future
[19:05] find the pizza order and time that gets it to you the fastest
[19:11] Hmm, so re: valhalla, I don't feel like the approach will work, because it ultimately needs you to run some weird VM and it's a pain. Would anyone object if I wrote a (compatible) Windows program?
[19:11] for newsgrabber?
[19:11] jrwr: for IA.BAK
[19:12] It's definitely trading off total space available and reliability of that space
[19:13] But I feel like increasing redundancy can compensate for worse reliability? It's not clear; transfers aren't free, if anyone has numbers to plug in
[19:14] Also there are decent arguments against writing a 'compatible' program
[19:16] I'm thinking here of the success of things like @Home, which is something like "double click to install, press OK" and then it runs forever across reboots by default
[19:17] not a bad idea kisspunch
[19:17] spreading the love is key
[19:17] I thought it really just used git + some magic
[19:18] I thought it was using git-annex
[19:18] Ya
[19:18] Anyway yes, step 2 is writing the program
[19:18] Yep
[19:18] I wanted to sound out peeps for whether they will object even once my program works though :)
[19:18] No, we always love anything new
[19:18] just don't expect much support besides the basics
[19:19] That's totally fine
[19:19] I approve, but I do suggest making it cross-platform as well
[19:19] I generally like that sort of thing, but any particular reason?
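For reference, the git-annex mechanics being guessed at above, in very rough strokes: the existing IA.BAK client boils down to cloning a shard repository, pulling content down with git-annex, and periodically re-verifying it, which is what a compatible reimplementation would have to mirror. A loose sketch using the standard git-annex CLI (the repo URL is a placeholder; the real shard scripts add registration, encryption, and scheduling on top of this):

```python
import subprocess

SHARD_REPO = "https://example.org/iabak-shard1.git"  # placeholder URL

def run(args, cwd=None):
    """Run a command and raise if it fails."""
    subprocess.run(args, cwd=cwd, check=True)

run(["git", "clone", SHARD_REPO, "shard"])
run(["git", "annex", "get", "."], cwd="shard")  # download the shard's file content
run(["git", "annex", "fsck"], cwd="shard")      # re-hash and verify the local copies
```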
[19:20] *** icedice has quit IRC (Ping timeout: 260 seconds)
[19:38] i am a geo guy, if you want a map
[19:40] of those pizzas
[19:59] *** ruunyan has quit IRC (Read error: Operation timed out)
[20:00] *** ruunyan has joined #archiveteam-bs
[20:15] how many machines does mundus2018 have...
[20:25] *** powerKitt has quit IRC (Quit: Page closed)
[20:26] *** schbirid has quit IRC (Quit: Leaving)
[20:28] *** kisspunch has quit IRC (Quit: ZNC - http://znc.in)
[20:29] *** kisspunch has joined #archiveteam-bs
[20:29] *** Jonison2 has joined #archiveteam-bs
[20:31] *** Jonison has quit IRC (Ping timeout: 260 seconds)
[20:52] *** Jonison2 has quit IRC (Quit: Leaving)
[20:53] *** SHODAN_UI has joined #archiveteam-bs
[21:12] *** _Crusher_ has joined #archiveteam-bs
[21:12] *** crusher_ has quit IRC (Quit: Page closed)
[21:12] *** _Crusher_ is now known as crusher
[21:18] *** Jonison has joined #archiveteam-bs
[21:20] Could someone with a Japanese IP address help me grab a file? It's geoip-filtered for some reason.
[21:20] File is here: http://dambo.mydns.jp/uploader/giga/file/GigaPp8347.wav.html
[21:20] "Password" is YM1980BD
[21:21] *** crusher2 has joined #archiveteam-bs
[21:21] Can you repost that link?
[21:21] http://dambo.mydns.jp/uploader/giga/file/GigaPp8347.wav.html
[21:22] There's a copy on YouTube but I'd prefer to get the original uncompressed version if possible
[21:24] I'll have it in.... About half an hour
[21:24] Thanks!
[21:25] "Password" is YM1980BD in case you missed that too
[21:25] I saw that, just was hoping to avoid typing the URL into the browser :P
[21:52] *** crusher has quit IRC (Ping timeout: 492 seconds)
[21:59] *** Crusher has joined #archiveteam-bs
[22:08] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[22:09] *** Crusher_ has joined #archiveteam-bs
[22:09] *** Crusher has quit IRC (Read error: Connection reset by peer)
[22:19] *** Jonison has quit IRC (Ping timeout: 260 seconds)
[22:41] Imgur has started redirecting direct links on the desktop
[22:47] E.g. from i.imgur.com/blah to imgur.com/blah ?
[22:55] *** sheaf has joined #archiveteam-bs
[22:57] sort of... I think it's more of a server-side rewrite http://www.fastquake.com/images/screen-imgurredir-20170621-183414.png
[22:58] *** sheaf has quit IRC (Remote host closed the connection)
[22:59] Redirecting from the direct image view to the image embedded in a page?
[22:59] yes
[22:59] Yeah, they've been trying to get away from being hotlink-friendly.
[23:00] it's concerning to me
[23:00] Running an image host is basically a sucker's game. I'm surprised they've lasted this long, honestly.
[23:00] same
[23:00] they might be on the way down
[23:03] Any idea why the urlte.am warrior likes to report "no items available"?
[23:04] It seems like that would be something you'd expect to have loads of items available
[23:06] just wait, you'll get items
[23:06] the tracker doesn't generate items as fast as people take them
[23:06] i do, it's just that my machine is outpacing it
[23:07] oh i see
[23:07] so in other words, pick a different project, this one's covered
[23:07] right?
[23:10] I'm not sure
[23:10] Is the vine one still shut down?
[23:11] looks like it http://tracker.archiveteam.org/vine/
[23:11] I requeued imzy
[23:11] and also queued posts
[23:11] I've been told though that doing URLTeam is useful. I don't know the details of how the tracker works or where the bottleneck really is
[23:12] I just know that with high concurrency it often can't get items
[23:12] but it still runs most of the time
[23:13] arkiver: i'm still getting the same "Server returned 0" error
[23:14] arkiver: Did you see the comments from earlier about ignoring ?check=true URLs?
[23:15] how would i go about doing that?
[23:15] and no, not really
[23:15] MrRadar: we're already skipping those if a 206 is received
[23:15] I tested it and it really should work
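An illustrative restatement of arkiver's rule, response side this time (the function shape is hypothetical; the real logic lives in the project's wget-lua callbacks): a 206 on a ?check probe means skip rather than retry.

```python
def handle_response(url, status):
    """Decide what to do with a fetched Imzy URL, per the rule above."""
    if "check=true" in url and status == 206:
        return "skip"   # auth'd existence probe; nothing archivable here
    if status in (200, 301, 302, 404):
        return "keep"   # record the response as-is
    return "retry"      # treat anything else as transient

assert handle_response(
    "https://www.imzy.com/api/accounts/profiles/daylen?check=true", 206) == "skip"
```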
[23:15] OK. I'll make sure my scripts are fully updated
[23:15] yeah, the server is being hammered a little right now
[23:15] I'm making an update, though, to skip some URLs
[23:19] all my threads are sleeping from the error
[23:19] aside from a couple that say they are being limited
[23:20] or maybe there's something wrong with your connection
[23:20] mine do connect
[23:20] hmm.
[23:20] i can do eroshare and urlte.am just fine
[23:21] I have updated imzy
[23:22] i see it
[23:23] am i the only one with a bunch of batch scripts to do basic control over the warriors? xD
[23:26] odd...
[23:26] arkiver: Nope. Still the same post-reboot
[23:28] Damn, I just came up with the perfect Imzy channel name: #thelastimzy (a reference to the 2007 film The Last Mimzy)
[23:28] haha that's good
[23:28] Of course the project is nearly done at this point
[23:29] well, i can still *connect* to their site, so i'm not blocked or anything
[23:30] the only errors i'm getting are a pair of 422s on their splash page for some gifs
[23:31] i'm loading up Wireshark to see if that tells me anything
[23:33] it must be something on my end, there are other warriors getting through...
[23:33] Man, the huge items from Eroshare are really blocking up FOS's rsync connections.
[23:34] i've still got two that are going to take another hour
[23:40] *** lucysun has joined #archiveteam-bs
[23:41] can someone help me find archives of AOL forums or chat rooms from 1995 and before - does this even exist?
[23:45] lucysun: archive team downloaded a bunch of the file collections from AOL groups, some of them have logs I think: https://archive.org/search.php?query=subject%3A%22aol+files%22
[23:52] arkiver: i still have to narrow it down to see if these packets are for the imzy warriors or not,
[23:53] but i'm getting loads of FCS errors from an IP that points right at archive.org
[23:54] specifically a map telling me how many books were scanned in the last 12 hours.
[23:56] would you like a short packet capture?
[23:57] *** antomatic has quit IRC (Read error: Operation timed out)