[00:02] mmm. anxiety has a tendency to make the "kinda uncomfortably long time" thing end up being measured in seconds or low single-digit minutes, but I do get what you're talking about, and I'll see if I can stretch that to longer. thanks again
[00:04] SketchCow: I also wanted to clarify one last thing. ex.ua has been officially closed, so the web interface no longer works. with rover.info, that's not the case. I personally believe rover.info will close too. the thing is, they're one and the same website - once you login to rover.info you see the EX logo and the site is identical. IMHO, rover.info is what we should have been mirroring all along. it
[00:04] does require cookie management, but it provides complete access to the site.
[00:04] SketchCow: right now we're just getting references to uploaded files, not the conversations. it's arguable this is a "99% vs 100%" thing, but it's also arguable that archiving the file references and archiving the conversations produces a very different dataset.
[00:04] yes. patience is a skill that must be learned.
[00:05] i336_: did you read me earlier when i said "trust us, the what.cd data is safe (if inaccessible)"?
[00:05] xmc: yes, and I was really happy to learn about that :P
[00:05] ok
[00:06] so you're saying that rover.info might also happen next?
[00:06] i am not saying that
[00:06] ok.
[00:06] *** Stiletto has joined #archiveteam-bs
[00:06] i have said nothing about ex.ua or rover.info
[00:06] right
[00:08] SketchCow: just reread what I said, I don't think I made this part clear - the API methods I've found only provide access to file listings. if we hit rover.info, we get a) files b) folders c) collections d) threads e) individual messages f) user avatars ...etc etc etc. I must be frank and say that I'm really frustrated that we aren't indexing that.
[00:09] with the API methods we only get files, folders, and the first 100 items in collections. some users have collections with thousands of folders in them.
(collections contain folders contain files)
[00:09] hey maybe you should make a channel for this project and discuss it there with interested parties
[00:10] xmc: SketchCow isn't in #exexbaby
[00:10] then he's not interested in the project
[00:11] or he hasn't heard of it yet
[00:11] my money's on the first
[00:11] I can understand that. I'm poking him because I don't know how else to convince arkiver that rover.info is a good idea
[00:12] that doesn't make any sense
[00:12] what happened was that, I first found the XSPFs, then I found the r_view URLs, and then the next day I found out about rover.info. each discovery completely superseded the one before it
[00:12] -> #exexbaby
[00:12] ok
[00:23] You're concerning arkiver
[00:23] that takes talent.
[00:31] *** krazedkat has quit IRC (Quit: Leaving)
[00:33] * i336_ is sad now
[00:42] Pick yourself up and work on a coherent, limited, directed effort to save some portion of ex.ua.
[00:42] * i336_ sighs in frustration
[00:43] the thing is, I've been trying to say that ex.ua is not where the data is. rover.info is another domain run by the same company that provides full access to the site content. all the conversations etc,
[00:43] s/,/./
[00:43] I was wrong in thinking ex.ua was where it's at. rover.info IS. it requires cookie management to access, but provides data that supersedes ex.ua.
[00:43] unfortunately I did not make this discovery initially.
[00:44] I don't really know what else to say. :(
[00:44] I'm sorry if there's some thing here that I'm not getting or something I'm not picking up on.
[00:45] another thing - saving rover.info is actually simpler than saving ex.ua
[00:46] just save all URLs that return 200, ignore everything that sends a 302
[00:46] if there's pagination in the page, send &per=200&p=... until you're on the last page, and then move on
[00:46] done
[00:47] i336: I think you've made the point about rover.info several times now.
given the timezone differences, just chill and wait for people to respond
[00:48] if they're interested, they'll join #exexbaby. if not, you're just spamming the channel and people will tune you out
[00:48] alright then. sorry for the spam. I'll wait it out then.
[00:48] and I'll try and get in touch with arkiver
[00:51] It's OK, only one user had to die
[00:53] lol
[00:55] *** robink has quit IRC (Read error: Connection reset by peer)
[00:59] I'm curious if johansch can come back in here, or if he's gone for good
[01:00] Read topic, insert knowledge
[01:00] * i336_ +1 knowledge!
[01:00] * i336_ updates question
[01:00] do bans ever get revoked?
[01:00] they tend to age out manually
[01:01] right. I see.
[01:01] speaking of
[01:01] *** xmc sets mode: -b *!uid118096@*
[01:01] *** xmc sets mode: -b *!4f8dff3d@ag-255-61.sta.ji.cz
[01:01] *** xmc sets mode: -b *!*Thunderbi@*.res.bhn.net
[01:02] *** xmc sets mode: -b *!*webchat@*.res.bhn.net
[01:02] i336_: the best i can see doing with 50GB per month cap is use archivebot
[01:04] yeah
[01:05] *** ndiddy has joined #archiveteam-bs
[01:12] so i found this website: http://radio.garden/
[01:13] *** robink has joined #archiveteam-bs
[01:15] https://archive.org/details/cratediggers?&sort=publicdate is going to slowly grow overnight
[02:04] *** ndiddy has quit IRC (Quit: Leaving)
[02:11] *** BlueMaxim has quit IRC (Quit: Leaving)
[02:11] *** BlueMaxim has joined #archiveteam-bs
[02:17] *** Stiletto has quit IRC (Read error: Operation timed out)
[02:22] "It also allegedly obtained and allegedly posted the allegedly uncut 5-minute "footage" of President George W. Bush allegedly sitting in a Florida classroom as the 9/11 attacks happened."
[02:22] that's a lot of allegedly
[02:29] I noticed that too.
[02:44] *** robink has quit IRC (Read error: Connection reset by peer)
[02:47] I wrote a thing that is now going through 2.5 million items in the "audio uploads" section
[02:47] And if it has a cover, it's going into another collection
[02:47] Lotta audiobooks, turns out
[02:48] Occasional calls to jihad
[02:56] *** hook54321 has quit IRC (Quit: Updating details, brb)
[02:57] *** hook54321 has joined #archiveteam-bs
[02:58] *** hook54321 has quit IRC (Client Quit)
[02:58] *** robink has joined #archiveteam-bs
[03:00] *** hook54321 has joined #archiveteam-bs
[03:08] *** hook54321 has quit IRC (Quit: Updating details, brb)
[03:11] SketchCow: we're now at 2.4 TB for the exua grab, can you please start the upload on FOS?
[03:14] *** Stiletto has joined #archiveteam-bs
[03:15] arkiver: I'm currently doing some work on figuring out the concrete details about how to parse rover.info's HTML. is there any interest in also adding rover.info to the crawl project?
[03:17] arkiver: if there is, I've taken a look at setting up my own project tracker, and it's admittedly over my head. I have a small request to ask. could you setup an unlisted project on the tracker that sends me say 5 or 6 specific URLs every time I request it, so I can have a go at building and tuning wget+lua to fetch rover.info?
[03:17] (I realize sending the same URLs every time may be tricky)
[03:26] woops, wrong channel (again), my apologies. I'll try and be more attentive
[03:26] *** hook54321 has joined #archiveteam-bs
[03:30] what are the most reliable efnet servers?
[03:47] Your own leaf server, lol
[03:51] leaf server...?? what's that?
[03:54] *** lain_ has joined #archiveteam-bs
[04:07] hook54321: I use irc.choopa.net, seems okay.
[04:19] arkiver: What is IN it
[04:26] What. Is. In. It
[04:42] Kenshin: Hey hey
[04:42] SketchCow: what's up?
[04:43] Are you doing parallel ex.ua stuff
[04:43] nope
[04:43] Good
[04:43] Avoid.
We are not doing this project into IA servers
[04:43] k
[04:44] i usually only touch projects that can be hit hard
[04:44] normal projects the guys already hit hard enough
[04:44] Understood
[04:49] Are you able to kill a project off the tracker?
[04:51] We're really seeing the best of me today
[04:59] yes, I do
[04:59] but then, so does arkiver, so might just let him make the call
[05:00] I am deleting all data arriving on the machine
[05:01] related to exua
[05:01] So if we can stop it, that would be good
[05:01] O_o
[05:01] uh ok
[05:01] SketchCow: is it gone yet? can I have a copy?
[05:01] exua was a needless distraction from gov backup
[05:02] if there's anything left I wouldn't mind downloading it... it'll take me a really long time but I'd appreciate it
[05:02] No.
[05:02] :(
[05:19] *** SketchCow sets mode: +b *!*i336@*.lnse3.ken.bigpond.net.au
[05:19] *** i336_ was kicked by SketchCow (i336_)
[05:24] Wow, 9gb came in after the delete
[05:24] I wonder if the pipes have stopped.
[05:31] * xmc slurps
[05:32] root@teamarchive0:/1/CHFOO/warrior/exua# du -sh .
[05:32] 2.1G .
[05:32] So that's still happening.
[05:32] (I mean, I realize the jobs have stopped)
[05:34] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:42] *** Sk1d has joined #archiveteam-bs
[05:46] In other news, I have scripts sorting audio on archive.org pretty intensely.
[06:41] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[06:54] SketchCow: what's ex.ua?
[06:55] shhh
[06:57] nvm, I searched in the logs
[06:57] go back to sleep now if I woke you up :P
[06:58] I'm doing this massive language sort using scripts
[06:58] Why isn't it going on archive.org?
[07:03] *** kristian_ has joined #archiveteam-bs
[07:05] It's garbage
[07:08] Like, pastebin but more useless?
[07:09] Yes
[07:09] Megaupload, but for the Ukraine
[07:10] Yeah, I won't ever need to access anything from it then
[07:10] Busted multiple times
[07:10] Nor will most people
[07:11] Still working by a trick of IP geofiltering
[07:11] Busted for what exactly? Piracy?
[07:11] Probably so mobbed up it has a seat in the restaurant you knock people out of if it wants it
[07:11] megapiracy
[07:11] petabytes
[07:12] In the roughest way, I tried to think "well, if it had some Ukrainian culture, maybe"
[07:12] But it doesn't.
[07:14] It's kinda sad that some internet culture is largely hosted exclusively on pastebin and similar sites
[07:15] Are there any Ukrainian people involved in archiveteam?
[07:38] *** GE has joined #archiveteam-bs
[08:33] *** ravetcofx has quit IRC (Read error: Operation timed out)
[08:56] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:30] *** kristian_ has quit IRC (Quit: Leaving)
[10:57] SketchCow: it was all metadata and preview images.
[10:58] Project is removed from the tracker and github
[11:27] *** GE has quit IRC (Remote host closed the connection)
[12:47] *** VADemon has joined #archiveteam-bs
[13:00] *** GE has joined #archiveteam-bs
[14:24] *** ravetcofx has joined #archiveteam-bs
[14:30] Again, sorry for the misunderstanding and miscommunication, arkiver
[14:30] All on me.
[14:59] *** ravetcofx has quit IRC (Read error: Operation timed out)
[15:25] *** Start has quit IRC (Quit: Disconnected.)
[16:11] *** Honno has joined #archiveteam-bs
[16:19] what exploded while i was sleeping
[16:28] Everybody's dead
[16:29] that explains why it's so cold here
[16:30] And delicious
[16:30] damn fine cup of coffee
[16:30] speaking of, i went to the diner where twin peaks was filmed
[16:30] as billed, they have good coffee and cherry pie
[16:32] As I begin the basic run against the 2,500,000 items in the audio inbox, some amazing crap is starting to emerge.
[16:32] Best of all, my routine is running without me, classifying languages and moving bulks of uploads never regarded before.
[16:32] what kind of tasks do you perform on the things?
[16:33] Well, when someone uploads a lot, I can't just do a "move them all" command because it chokes the metadata manager.
[16:33] ah
[16:33] So I have a script that's got the list of one guy who uploaded 10,189 russian language audiobooks and is now shifting those over to a collection, over an hour.
[16:33] Another is searching for Arabic language items and classifying them, because they tend to be a bit of a mess.
[16:34] I'm also finding where one guy uploaded a pile of one theme
[16:34] beautiful
[16:34] And I have scripts that say "find everything from this guy and make a collection"
[16:34] https://archive.org/details/audioboo_ru
[16:34] useful, that
[16:35] 6,449 audiobooks in russian.
[16:35] holy moly
[16:35] I know it will eventually have 10,500 in there.
[16:35] New ones come in with every refresh you do, due to that script.
[16:36] https://archive.org/details/cratediggers is my workspace. It'll grow and shrink
[16:39] Because I'll notice trends like "oh, someone uploaded 2300 2-hour chill mixes"
[16:39] And that becomes a collection
[16:39] Also, this find a language trick is now running in the texts uploads section, classifying thousands of Arabic texts and mislabelled arabic texts.
[16:40] how does it figure the language?
[16:40] Finds arabic characters
[16:40] oh, easy enough
[16:40] that's right you go for the automated 85% solution because it's better than the manual 100%-but-actually-only-3%-gets-done solution
[16:41] Always
[16:41] new collection of all this audio horseshit is http://archive.org/details/folksoundomy by the way
[16:41] I'm throwing in ecollections and adding new ones and so on
[16:43] *** Honno has quit IRC (Quit: Leaving)
[16:44] *** Honno has joined #archiveteam-bs
[17:51] I'm happy to say Russ Kick is joining us
[17:51] He is going to Archivebot archivebot like Scotty
[17:53] The memory hole guy? (quick google)
[17:53] correct
[18:55] Keep an eye out for him
[20:02] *** BlueMaxim has joined #archiveteam-bs
[20:03] *** BlueMaxim has quit IRC (Client Quit)
[20:03] *** BlueMaxim has joined #archiveteam-bs
[20:05] arkiver: can you run multiple FTP chunkers on one FTP server to speed it up?
[20:06] not right now
[20:06] but I need to update the FTP project
[20:06] to make discovery also warrior compatible
[20:06] Chunking 200tb will take a long time
[20:06] agree
[20:06] that will be after the 23 december though
[20:06] depends on the number of files?
[20:07] True. Hopefully it's like 100 2tb files :p
[20:10] I've looked at NOAA
[20:10] It's not.
[20:10] Few KB txt files, csv files
[20:10] you probably know, but it's no large files
[20:10] Occasional zip files
[20:10] *not
[20:11] Fun
[20:11] generally it's a file per day, week, month, etc... for each and every type of dataset
[20:11] of which there are hundreds
[20:11] I tried squirting some through archivebot, It didn't like it much.
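(Editor's note on the "Finds arabic characters" answer at [16:40]: the script itself is not shown in this log, so the following is only a minimal sketch of how such a check might look. The function names, the code-point ranges chosen, and the 0.5 threshold are all assumptions made for illustration. Note that a check like this detects Arabic script, not the Arabic language, so Persian or Urdu text would also match.)

```python
# Hypothetical sketch: the names and threshold below are illustrative
# assumptions, not taken from the script discussed in the log.

# Unicode blocks that hold Arabic-script characters.
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_arabic_char(ch):
    """Return True if the character's code point falls in an Arabic block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def arabic_ratio(text):
    """Fraction of alphabetic characters in text that are Arabic script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_arabic_char(ch) for ch in letters) / len(letters)


def looks_arabic(text, threshold=0.5):
    """Crude classifier: mostly Arabic-script letters means 'Arabic'."""
    return arabic_ratio(text) >= threshold
```

(Applied to an item's title or description, `looks_arabic` would give the kind of quick, imperfect sort described above; digits and punctuation are ignored so that mostly numeric titles do not skew the ratio.)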
[20:11] and separate date-sorted files per locale
[20:11] ie, individual weather station or town
[20:11] Igloo: my local grab site very promptly fell over too
[20:12] Times out on dir listing
[20:12] My big dedicated is currently crunching through the same URL list
[20:12] However, I have noticed inconsistencies with the FTP server
[20:12] From one IP the directory is valid
[20:12] From another it is not.
[20:12] They may IP and connection limit
[20:13] We had that problem massively on one FTP grab before
[20:13] Maybe. I've got 2 concurrent conns
[20:13] From one IP at the mo.
[20:13] I need to think how to effectively chunk this up in a manner which is going to be manageable to a) download b) get into IA
[20:14] Waiting to see how big this test WARC is and go from there really
[20:22] Climate change work, go to #cheetoflee
[21:05] *** Start has joined #archiveteam-bs
[21:10] *** Ravenloft has joined #archiveteam-bs
[21:24] *** Meroje has quit IRC (Quit: bye!)
[21:25] *** Meroje has joined #archiveteam-bs
[21:40] *** Start has quit IRC (Quit: Disconnected.)
[21:44] *** Start has joined #archiveteam-bs
[21:51] *** glass3 has joined #archiveteam-bs
[22:11] *** Start has quit IRC (Remote host closed the connection)
[22:15] *** Start has joined #archiveteam-bs
[22:23] *** Start has quit IRC (Quit: Disconnected.)
[23:04] *** GE has quit IRC (Remote host closed the connection)
[23:43] *** Start has joined #archiveteam-bs
[23:55] *** DiscantX has joined #archiveteam-bs