[00:07] *** BlueMaxim has joined #archiveteam-bs
[00:33] *** Start has quit IRC (Quit: Disconnected.)
[01:43] xmc: here's some more juiciness from the kittyforums: http://www.howardforums.com/showthread.php/1864458-Verizon-MVNO-Puppy-Wirelss-Discussion?p=15935702#post15935702
[01:43] let me see if i can find the beginning of it all
[01:44] hum
[01:45] http://www.howardforums.com/showthread.php/1857887-Customer-Service-with-the-new-Page-Plus-support-center?p=15826603#post15826603
[01:45] my goodness that's a long fucking thread
[01:45] ignore that thread
[01:45] but the second is a history on i think the owner of kittywireless
[01:45] fair
[01:46] ugh, i don't care enough to read up on this shit
[01:46] well, from someone that sounds like they have a bit of an axe to gring
[01:46] not now, and maybe not ever
[01:46] grind
[01:46] yea
[01:46] yep
[01:46] drama drama drama
[01:46] you sounded half intrigued about it one time :)
[01:46] mhm
[01:46] or was it just about it shutting down?
[01:50] just 10 more seconds of your time: you should see the eyecancer (and dodginess) the Prepaid PIN reseller site was xD
[01:50] https://web.archive.org/web/20140104141402/http://kittywireless.com/
[01:50] @ xmc
[01:50] i mean, i'm interested in archiving these things
[01:50] but not so much reading them :)
[01:50] ah
[01:50] yow
[01:51] yep
[01:52] owner is/was quite a character ("was" in light of the company shutting down)
[02:11] *** kristian_ has quit IRC (Remote host closed the connection)
[02:18] *** Start has joined #archiveteam-bs
[03:52] Frogging: re.
archivebot: it looks like it makes 3 synchronous calls for every log line, at pipeline/archivebot/control.py:273
[03:53] hincrby, zadd, and publish
[04:00] we could probably use twisted, though I am concerned it will create a great profusion of waiting tasks
[04:09] actually it needs to have just one thread per job for sending logs, that has a queue in it
[04:11] FalconK: threading gets tricky because it introduces the need for a supervisor
[04:12] I wanted a way to integrate with the wpull event loop, which is what grab-site does
[04:12] this seems to be the simplest possible case of it
[04:12] then it was "well just use grab-site, all you need to do is switch the communication protocol to websockets" but then I never looked at it again
[04:12] I'm not talking about anything at all except sending logs to redis asynchronously
[04:13] so a single daemonic thread should do the trick
[04:13] also the existing code just does pass on ConnectionError so it's not especially reliable as it stands anyway
[04:14] it'll drop log lines, but there's no separate thread to monitor
[04:14] the settings listener thread on the other hand does occasionally die for reasons I haven't been able to determine
[04:16] yeah, I have no idea either
[04:16] something else I was toying around with was using multiprocessing to start a redis log shipper process
[04:16] as multiprocessing seems more robust in a Python program
[04:17] but I never got around to looking into it further
[04:17] if a separate thread works, though, that'd be good
[04:17] it's certainly easier!
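[editor's sketch] The "one daemon thread per job with a queue in it" idea discussed above might look roughly like the following. This is only an illustration under stated assumptions, not the code from ArchiveBot or from the fork linked later in the log: the class name `LogShipper`, the `send` callable, and the `maxsize` default are all hypothetical, and `send` stands in for whatever performs the three Redis calls (hincrby, zadd, publish).

```python
import queue
import threading


class LogShipper:
    """Hypothetical helper: ship log lines from one daemon thread."""

    _STOP = object()  # sentinel that tells the worker thread to exit

    def __init__(self, send, maxsize=10000):
        # `send` is an assumed callable standing in for the three Redis
        # calls named in the log (hincrby, zadd, publish).
        self._send = send
        self._queue = queue.Queue(maxsize=maxsize)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, line):
        """Enqueue a line without blocking; return False if it was dropped."""
        try:
            self._queue.put_nowait(line)
            return True
        except queue.Full:
            return False

    def close(self, timeout=5.0):
        """Let the worker drain what is already queued, then stop it."""
        self._queue.put(self._STOP)
        self._thread.join(timeout)

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                return
            try:
                self._send(item)
            except ConnectionError:
                # Same trade-off as described in the log: a failed send
                # drops the line rather than stalling the crawl.
                pass
```

The job's main loop would then only call `shipper.log(line)`, which returns immediately, while the worker thread absorbs the latency of the network round trips. Because the thread is daemonic it does not keep the process alive, which is also why it cannot be relied on to flush the queue unless `close()` is called.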
[04:17] I'll write it up right now, make a new pipeline, and run some unimportant job on it
[04:18] run some long unimportant job
[04:18] I could re-do infinity.disney.com
[04:18] or something
[04:18] I only noticed the settings listener lockups on like month-long jobs, which is not what archivebot was ever designed to do but that complaint is like Tim Berners-Lee complaining about porn on the internet
[04:20] I'd like some sort of highly-accelerated failure test suite for this sort of stuff but I think to get that you need some idea of what the failure causes are
[04:20] and I don't
[04:21] chfoo has a huhhttp library that might be that, though
[04:21] https://github.com/chfoo/huhhttp maybe
[04:29] hey, is it a sin to share the same redis connection between threads?
[04:30] maybe I need to mutex it or open a second one
[04:30] it's fine
[04:30] https://github.com/andymccurdy/redis-py#thread-safety
[04:33] awesome!
[04:33] this is a really simple change then
[04:44] changes are at https://github.com/falconkirtaran/ArchiveBot
[04:49] yipdw: would you prefer that ArchiveBot not be used for large jobs?
[04:49] Frogging: well, I can say it was never built to handle gazillions of URLs
[04:49] the README states this
[04:50] but design considerations and actual use align approximately never so what I prefer is irrelevant
[04:50] a
[04:52] I see
[04:53] i'm sure people will keep tweaking it to handle !a internet though
[04:57] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:57] !ig 5tnqxpy5c5isnrjrgixu3jgcy ^https?://forum\.teksyndicate\.com/.*/INBOX\.EXE$
[05:02] *** tomwsmf_ has quit IRC (Ping timeout: 255 seconds)
[05:04] *** Sk1d has joined #archiveteam-bs
[05:20] good enough reason to archive?
[05:20] "The US Secret Service censored YG's album Fuck Donald Trump. Some censored lyrics made it into this interview"
[05:21] or should that be left out of an !explain
[05:21] sounds fine to me
[05:49] re-ask: what's up with teksyndicate?
[05:52] aha.
https://www.reddit.com/r/TekSyndicate/comments/502aao/for_everyone_going_what_the_fuck_is_happening_on/
[05:53] http://www.twitlonger.com/show/n_1sp29to
[06:14] well. this sounds like a bit of a shitstorm
[06:14] (the Tek business)
[06:15] *** Frogging sets mode: +o joepie91
[06:48] just a bit :)
[07:10] *** metalcamp has joined #archiveteam-bs
[07:38] *** schbirid has joined #archiveteam-bs
[07:42] *** Genericen has joined #archiveteam-bs
[07:45] *** ravetcofx has quit IRC (Ping timeout: 370 seconds)
[07:48] *** Genericen has quit IRC (Quit: zzz)
[08:08] *** RichardG has quit IRC (Read error: Connection reset by peer)
[08:10] *** RichardG has joined #archiveteam-bs
[08:25] *** Honno has joined #archiveteam-bs
[08:31] *** RichardG has quit IRC (Read error: Connection reset by peer)
[08:31] *** RichardG_ has joined #archiveteam-bs
[09:06] *** kristian_ has joined #archiveteam-bs
[09:53] *** brayden_ has joined #archiveteam-bs
[09:53] *** swebb sets mode: +o brayden_
[09:53] *** brayden has quit IRC (Read error: Connection reset by peer)
[10:14] *** zerkalo has quit IRC (Ping timeout: 260 seconds)
[10:19] *** zerkalo has joined #archiveteam-bs
[11:21] *** VADemon has joined #archiveteam-bs
[11:28] *** RichardG_ has quit IRC (Ping timeout: 370 seconds)
[11:33] *** dashcloud has quit IRC (Read error: Operation timed out)
[11:36] *** dashcloud has joined #archiveteam-bs
[11:42] I'm grabbing rave tapes because why not
[11:43] also, hiphop mixtape maniacs are helping me find missing ones
[11:50] *** kristian_ has quit IRC (Quit: Leaving)
[12:36] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:41] can archivebot archive this? https://app.box.com/shared/8ch5r5nms1
[12:42] youtube user last active in 2008, supposedly dead
[12:43] "can...?"
as in i have no idea how well it can deal with scripts
[12:44] *** brayden_ has quit IRC (Read error: Operation timed out)
[12:46] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:47] i'll check ranma
[12:49] *** dashcloud has joined #archiveteam-bs
[12:50] *** brayden has joined #archiveteam-bs
[12:50] *** swebb sets mode: +o brayden
[12:52] guess not?
[12:52] the archive should be about 48MB
[12:52] uploaded it to IA
[12:53] https://archive.org/details/Piano_201609
[12:53] ok
[12:58] *** Genericen has joined #archiveteam-bs
[13:04] *** RichardG has joined #archiveteam-bs
[13:09] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:13] *** dashcloud has joined #archiveteam-bs
[14:12] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[14:36] *** jspiros has joined #archiveteam-bs
[14:40] *** kristian_ has joined #archiveteam-bs
[16:05] *** dashcloud has quit IRC (Read error: Operation timed out)
[16:09] *** jspiros has quit IRC (leaving)
[16:09] *** dashcloud has joined #archiveteam-bs
[16:13] *** jspiros has joined #archiveteam-bs
[17:18] *** JesseW has joined #archiveteam-bs
[18:10] *** VADemon has quit IRC (Quit: left4dead)
[18:11] *** tomwsmf_ has joined #archiveteam-bs
[18:14] *** VADemon has joined #archiveteam-bs
[18:59] *** zenguy_pc has joined #archiveteam-bs
[19:02] *** Matt_Lock has joined #archiveteam-bs
[19:02] hi
[19:02] Hi again.
[19:03] so when you go to the 'SHOW ALL' page on for example https://archive.org/details/archiveteam-fanfiction-warc-08 you get https://archive.org/download/archiveteam-fanfiction-warc-08
[19:03] you can then have a look at the *.cdx.gz file, which contains a list of URLs that were saved in this WARC
[19:03] https://archive.org/download/archiveteam-fanfiction-warc-08/archiveteam-fanfiction-warc-08.cdx.gz in this case
[19:04] Matt_Lock: I was semi-involved in repackaging the text-only one.
[19:04] 187 megs... this is going to be a lot of downloading.
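[editor's sketch] Once a *.cdx.gz file like the one mentioned above has been downloaded, it can be searched without unpacking it to disk. The snippet below is a minimal sketch, assuming the file is a gzip-compressed text index with one record per line and an optional header row starting with "CDX"; the function name `find_in_cdx` is hypothetical, and it deliberately does not interpret the individual columns.

```python
import gzip


def find_in_cdx(cdx_path, needle):
    """Yield records from a .cdx.gz file whose text contains `needle`.

    The match is a plain case-insensitive substring test on the whole
    record, so it works on either the URL key or the original URL.
    """
    needle = needle.lower()
    # Stream line by line so the whole index never has to fit in memory.
    with gzip.open(cdx_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.lstrip().startswith("CDX"):
                continue  # header row naming the columns, if present
            if needle in line.lower():
                yield line.rstrip("\n")
```

A call such as `list(find_in_cdx("archiveteam-fanfiction-warc-08.cdx.gz", "<part of the URL path>"))` would return the matching records, where the search text is a fragment of the address being looked for, such as a story or profile ID. An empty result for one file only says the URL is not in that particular WARC.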
[19:05] The repackaged form should be *somewhat* easier to find things in, although it's organized by fandom, not by author.
[19:05] these URLs are also all in the wayback machine though
[19:05] I've looked at the text-only version. It is useful, but doesn't contain all the fics I want
[19:05] not wayback machine. Robots.txt is on the site
[19:05] Matt_Lock: do you have specific URLs you are looking for?
[19:06] I have the url for the guy's profile page: https://www.fanfiction.net/u/1155973/MatrixExplosion
[19:07] thanks
[19:07] *** edsu has joined #archiveteam-bs
[19:07] *** swebb sets mode: +o edsu
[19:08] Matt_Lock: do you have titles of the stories you are looking for?
[19:08] https://www.fanfiction.net/s/9924634/
[19:08] Also this story
[19:09] titles are Unbound (by matrixExplosion), Time can't heal every pain (linked above), and anything else by matrixExplosion (if it exists)
[19:09] "Rise Of Naruto: Shinigami's Touch" was on the txt only archive, not the others
[19:10] any idea about the dates they were written, and the dates they were deleted?
[19:10] I have it written somewhere
[19:10] "I'm Gonna Be Hokage!"?
[19:10] one sec
[19:10] I have I'm gonna be hokage, thanks, though
[19:11] *** schbirid has quit IRC (Quit: Leaving)
[19:11] Ah, here: this guy started writing Shinigami in 2008, and it was last updated in 2010. That fic for example, disappeared some time between Jul 15, 2013 and Jul 5, 2013, based on the dates of reviews
[19:11] you may be able to use the .cdx.idx files (which are much smaller) to figure out which megawarc to look in
[19:16] Will do later, thanks. Computer's acting up right now.
[19:16] Thanks for the help.
[19:17] Hope this works out.
[19:17] good luck -- if you do find copies, dump them on IA
[19:17] I'll leave asking how to do that for when (read: if) I do.
[19:17] nods
[19:18] at a minimum, you can dump it in a pastebin, then use web.archive.org/save/ to copy it into the Wayback Machine
[19:19] Got it.
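[editor's sketch] The web.archive.org/save/ suggestion above amounts to requesting the page address appended to that prefix. A minimal sketch follows, assuming a plain GET is enough to trigger a capture and that the page is publicly reachable; the helper names `save_url` and `request_capture` are hypothetical, and nothing here accounts for rate limits or login-only features.

```python
import urllib.parse
import urllib.request

SAVE_PREFIX = "https://web.archive.org/save/"


def save_url(page_url):
    """Build the address that asks the Wayback Machine to capture page_url."""
    scheme = urllib.parse.urlsplit(page_url).scheme
    if scheme not in ("http", "https"):
        raise ValueError("expected an http(s) URL, got %r" % page_url)
    return SAVE_PREFIX + page_url


def request_capture(page_url, timeout=60):
    """Send the capture request and return the HTTP status code.

    Assumption: a successful response means the capture was accepted;
    this sketch does not verify that a snapshot actually exists.
    """
    with urllib.request.urlopen(save_url(page_url), timeout=timeout) as resp:
        return resp.status
```

Only `request_capture` touches the network; `save_url` can be used on its own to produce a link to paste into a browser.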
[19:23] Actually, as it turns out... I have ~800 fics saved on my computer in epub and azw3 formats. 13 have been deleted. I *know* at least 1 of those 13 is on my computer, not on the text only archive. How do I add it to the archive?
[19:28] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[19:29] *** JesseW has joined #archiveteam-bs
[19:35] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[19:59] *** alembic has quit IRC (Ping timeout: 244 seconds)
[20:01] *** alembic has joined #archiveteam-bs
[20:46] Is anybody still there who I was talking to? Because I've downloaded all 18 .cdx.idx files, and the guy's profile page doesn't seem to be listed in any of them, and nor are any of the fics that I know the names and/or IDs of. What's going on?
[20:49] it may be that they never were archived
[20:50] if the user url isn't in any of the cdx then it means it wasn't archived
[20:50] *** alembic has quit IRC (Read error: Connection reset by peer)
[20:51] the idx files are partial
[20:51] Seems plausible. Is a 830 gigabyte "Scrape" supposed to be just some files or as many as possible?
[20:51] if it's not in the idx files, it doesn't mean it's not archived
[20:51] Is there a way to check that doesn't involve downloading 830 gigs??
[20:53] *** alembic has joined #archiveteam-bs
[20:53] arkiver: really?
[20:53] why are those partial
[20:54] the *.cdx.gz files are full
[20:54] ah k
[20:54] Matt_Lock: try to download those then
[20:55] luckcolor: that's what he's trying to avoid
[20:56] ah i wasn't thinking that those were 800 gb
[20:56] then yeah
[20:56] * luckcolor derped
[20:57] The cdx.gz files seem to be just(!) 180 ish megs each, so probably only 3 or 4 gigs total, assuming part 0 is representative (I'm looking for files like archiveteam-fanfiction-warc-00.cdx.gz , right)?
[20:58] yeah
[21:03] Have to go now.
Bye, and thanks for the help
[21:03] *** Matt_Lock has quit IRC (Quit: Page closed)
[21:09] Np :P
[21:27] *** metalcamp has quit IRC (Ping timeout: 506 seconds)
[21:30] *** JesseW has joined #archiveteam-bs
[21:37] *** Aranje has joined #archiveteam-bs
[21:41] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[21:42] *** dashcloud has joined #archiveteam-bs
[21:46] *** tfgbd_znc has joined #archiveteam-bs
[22:14] *** ravetcofx has joined #archiveteam-bs
[22:26] *** VADemon has quit IRC (Quit: left4dead)
[22:29] *** zenguy_pc has quit IRC (Excess Flood)
[22:30] *** JesseW has quit IRC (Read error: Operation timed out)
[22:31] *** Honno has quit IRC (Read error: Operation timed out)
[22:32] *** zenguy_pc has joined #archiveteam-bs
[22:37] *** Genericen has quit IRC (Remote host closed the connection)
[22:56] *** zenguy_pc has quit IRC (Excess Flood)
[22:58] *** zenguy_pc has joined #archiveteam-bs
[23:04] *** zenguy_pc has quit IRC (Ping timeout: 255 seconds)
[23:06] *** zenguy_pc has joined #archiveteam-bs
[23:22] *** zenguy_pc has quit IRC (Ping timeout: 255 seconds)
[23:23] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:26] *** JesseW has joined #archiveteam-bs
[23:26] *** dashcloud has joined #archiveteam-bs
[23:59] *** robink has quit IRC (Ping timeout: 246 seconds)