[00:13] *** wp494 has quit IRC (Ping timeout: 633 seconds)
[00:13] *** wp494 has joined #archiveteam-bs
[00:15] *** hdch has joined #archiveteam-bs
[01:40] *** jut has quit IRC (Read error: Connection reset by peer)
[01:45] *** jut has joined #archiveteam-bs
[02:41] *** DFJustin has quit IRC (Ping timeout: 260 seconds)
[04:06] JAA: Anything that can be done to help save the UOL forums?
[04:08] *** exoire has quit IRC (Read error: Operation timed out)
[04:08] jodizzle: Idk. The only hope to grab all of it at this point would probably be a warrior project, and that would have to start very soon.
[04:09] The footer claims 7.4M threads with 180M posts...
[04:09] I get less than that adding up the numbers, but it's still pretty large.
[04:10] And it's a custom forum software, so we don't have existing code we can reuse.
[04:13] That's unfortunate.
[04:14] I'd be willing to try some targeted/hand scraping, but unfortunately I don't have any experience writing Warrior projects.
[04:15] And it sounds like the issue is mostly just that the site is so large?
[04:16] Well, I guess that's usually the issue...
[04:21] *** guest has joined #archiveteam-bs
[04:36] *** qw3rty115 has joined #archiveteam-bs
[04:40] ... continuing the not a really good idea conversation about keeping local copies of things ...
[04:40] i get that it'd be a problem with something illegal or very dangerous (criticism of cartel or islamic terrorism) but the truecrypt website just contained things like the format used, user manuals, etc.
[04:41] (Just FYI, this channel is also logged publicly and well-known.)
[04:42] *** qw3rty114 has quit IRC (Read error: Operation timed out)
[04:42] is there some sort of drama behind truecrypt shutting down that i'm not aware of? i thought they just wanted the project to die.
i don't think they care about going after people with backups of the site
[04:43] to the best of my knowledge at least
[04:44] there was a LOT of drama around truecrypt shutting down
[04:44] *** odemgi_ has joined #archiveteam-bs
[04:44] I *think* it's reasonably well explained on the wikipedia page?
[04:44] (haven't checked recently)
[04:44] ok checking wikipedia
[04:45] Not much drama on there, but yeah, the shutdown was weird.
[04:45] it was weird
[04:46] It seems that the authors pretty much want everyone to forget that TrueCrypt ever existed or something.
[04:46] my guess (personally) was that they were getting harassed by law enforcement and just didn't want to deal with it anymore
[04:46] but at least veracrypt is a thing now and is well maintained
[04:46] *** odemgi has quit IRC (Read error: Operation timed out)
[04:46] *** odemg has quit IRC (Ping timeout: 265 seconds)
[04:58] *** odemg has joined #archiveteam-bs
[05:19] *** ndiddy has quit IRC (Quit: nighty night)
[05:38] *** DFJustin has joined #archiveteam-bs
[05:38] *** swebb sets mode: +o DFJustin
[06:24] *** riley has joined #archiveteam-bs
[06:45] *** jut has quit IRC (Ping timeout: 252 seconds)
[06:46] *** jut has joined #archiveteam-bs
[07:54] *** hdch has quit IRC (Ping timeout: 265 seconds)
[08:03] What's the URL for UOL forums?
[08:05] Nm it's on Reddit
[08:16] *** hdch has joined #archiveteam-bs
[08:16] *** tomaspark has quit IRC (Read error: Connection reset by peer)
[08:22] *** LFlare has joined #archiveteam-bs
[08:28] *** macrosoft has joined #archiveteam-bs
[08:29] Ha, I forgot mil is 1000
[09:07] *** wp494 has quit IRC (Ping timeout: 265 seconds)
[09:07] *** wp494 has joined #archiveteam-bs
[09:31] *** ubahn has joined #archiveteam-bs
[09:40] *** ubahn_ has joined #archiveteam-bs
[09:42] *** ubahn has quit IRC (Ping timeout: 260 seconds)
[09:44] On beta UOL, the index pages are truncated by a lot. What is archive bot doing?
[09:46] following orders, for better or worse
[09:48] *** exoire has joined #archiveteam-bs
[09:58] *** hook54321 has quit IRC (Quit: Connection closed for inactivity)
[10:11] *** exoire has quit IRC (Read error: Operation timed out)
[10:22] *** ubahn has joined #archiveteam-bs
[10:23] *** ubahn_ has quit IRC (Read error: Operation timed out)
[10:26] *** ubahn_ has joined #archiveteam-bs
[10:26] UOL seems small. Easier to not use warriors, imho
[10:28] *** ubahn has quit IRC (Ping timeout: 360 seconds)
[11:09] *** hdch has quit IRC (Ping timeout: 265 seconds)
[11:56] *** odemgi_ has quit IRC (Remote host closed the connection)
[12:07] *** BlueMax has quit IRC (Quit: Leaving)
[12:10] *** hook54321 has joined #archiveteam-bs
[12:51] *** macrosoft has quit IRC (Ping timeout: 265 seconds)
[13:21] *** benjinsmi has joined #archiveteam-bs
[13:25] *** benjins has quit IRC (Read error: Operation timed out)
[13:30] *** jut has quit IRC (Ping timeout: 252 seconds)
[13:32] *** jut has joined #archiveteam-bs
[13:40] *** Hani has quit IRC (Read error: Connection reset by peer)
[14:38] *** Despatche has joined #archiveteam-bs
[14:39] *** guest has quit IRC (Quit: y ppl s)
[14:53] *** n00b928 has joined #archiveteam-bs
[14:53] *** n00b928 has quit IRC (Client Quit)
[15:38] *** hyku has joined #archiveteam-bs
[15:47] From my understanding this is the place for help correct?
[16:10] *** Hani has joined #archiveteam-bs
[16:29] Does anyone in BE/NL have space for 30 tonnes of books? https://old.reddit.com/r/Archiveteam/comments/ac2f8j/about_30_tones_of_technical_books_and_articles/
[16:30] The Vlaamse Vereniging voor Industriële Archeologie needs to get rid of 50k books and magazines.
[16:37] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[17:00] *** benjinsmi has quit IRC (Leaving)
[17:00] *** benjins has joined #archiveteam-bs
[17:34] *** ubahn_ has quit IRC (Quit: ubahn_)
[17:49] *** hook54321 has quit IRC (Quit: Connection closed for inactivity)
[17:57] SketchCow: see what JAA wrote ^
[17:58] what are these UOL forums
[17:58] no page on the wiki
[17:59] arkiver: http://forum.jogos.uol.com.br/ http://forum.esporte.uol.com.br/ http://forum.televisao.uol.com.br/ http://forum.tecnologia.uol.com.br/ http://beta.forum.jogos.uol.com.br/
[18:00] I'll create a wiki page.
[18:00] *** jut has quit IRC (Ping timeout: 252 seconds)
[18:03] *** jut has joined #archiveteam-bs
[18:08] *** wp494 has quit IRC (Ping timeout: 260 seconds)
[18:08] *** wp494 has joined #archiveteam-bs
[18:09] bringing the books up with IA
[18:23] arkiver: The Televisão and Tecnologia forums were archived through ArchiveBot in November, and the archives seem to be complete (based on random clicking around in the WBM). The Esporte forums job back then crashed, but my new job yesterday seems to have completed; I didn't check yet how complete that is though. The jobs for the two Jogos forums are still running, and those are by far the largest
[18:23] ones.
[18:23] It's actually one forum, but some subforums are on the beta subdomain for whatever reason.
[18:23] *** ubahn has joined #archiveteam-bs
[18:23] It seems that thread IDs are shared between the first four forums I linked, and the beta thing uses separate IDs.
[18:23] I'm not 100 % sure on that though.
[18:24] *** ubahn has quit IRC (Client Quit)
[18:24] *** ubahn has joined #archiveteam-bs
[18:24] "Shared IDs" means that IDs are globally unique rather than per forum. However, each ID only works on the specific forum where it exists.
[18:25] *** ubahn has quit IRC (Client Quit)
[18:25] Thread IDs go to roughly 4.5 million on those four forums, despite the footer claiming that there are 7.4 million threads.
(The beta forums are small in comparison at only 213k threads.)
[18:26] Not sure where that discrepancy comes from.
[18:52] JAA: we can easily archive these threads, for example http://beta.forum.jogos.uol.com.br/t/14913819/bannon-e-eduardo-bolsonaro-encontro and http://beta.forum.jogos.uol.com.br/t/14913819/
[18:56] and http://forum.jogos.uol.com.br/qual-console-comprar-ps4-ou-xbox-one_t_3341884 as http://forum.jogos.uol.com.br/_t_3341884
[18:56] from http://forum.jogos.uol.com.br/_t_3341884 we can get http://forum.jogos.uol.com.br/qual-console-comprar-ps4-ou-xbox-one_t_3341884 again
[18:57] but not sure we can get http://beta.forum.jogos.uol.com.br/t/14913819/bannon-e-eduardo-bolsonaro-encontro from http://beta.forum.jogos.uol.com.br/t/14913819/
[19:00] https://archive.org/details/@yurigagarin Look at this lad, this absolute unit
[19:04] *** Mateon1 has joined #archiveteam-bs
[19:09] *** Despatche has quit IRC (Quit: Error: Connection reset by peer)
[19:21] JAA: making a project for at least http://forum.jogos.uol.com.br/
[19:21] can I make you admin?
[19:26] *** hook54321 has joined #archiveteam-bs
[19:27] *** chimyatta has quit IRC (Ping timeout: 252 seconds)
[19:28] *** chimyatta has joined #archiveteam-bs
[19:35] *** Stilett0 has joined #archiveteam-bs
[19:36] *** Stiletto has quit IRC (Read error: Operation timed out)
[19:51] arkiver: The job for beta.* seems to be coming to an end now. efutp9w41j8cssacnrxgiycqp
[19:51] good
[19:52] and the not-beta?
[20:05] https://tracker.archiveteam.org/uolforums/
[20:05] https://github.com/ArchiveTeam/uolforums-grab
[20:08] arkiver: Sure. I won't have much time for coding or testing, but I can help with the queue management.
[20:09] yeah, don't have time for the queue stuff
[20:09] I'm adding 100000 items now, and then you can do the rest
[20:10] Ack
[20:10] The non-beta AB job is nowhere near finishing, by the way.
[20:11] And I'm not entirely convinced the beta AB job got everything.
Either it errored out on the pagination and will still continue for a while, or it probably missed some stuff. 700k URLs for 213k threads just doesn't seem right.
[20:12] yeah
[20:13] shall I queue thread 0-999999 for jogos now?
[20:13] threads*
[20:18] queued.
[20:18] FOS is the target
[20:19] now the default project
[20:40] Sweet, thanks.
[21:00] *** adinbied has joined #archiveteam-bs
[21:00] *** BasDub has joined #archiveteam-bs
[21:02] *** Mateon1 has quit IRC (Read error: Operation timed out)
[21:03] *** DasBub has quit IRC (Read error: Operation timed out)
[21:08] *** Mateon1 has joined #archiveteam-bs
[21:10] *** VerfiedJ has joined #archiveteam-bs
[21:15] *** macrosoft has joined #archiveteam-bs
[21:19] *** adinbied has quit IRC (Quit: Leaving)
[21:22] *** antomatic has quit IRC (Read error: Operation timed out)
[21:33] *** DosBob has joined #archiveteam-bs
[21:37] *** BasDub has quit IRC (Ping timeout: 252 seconds)
[21:37] *** DosBob is now known as DasBub
[21:39] *** Stiletto has joined #archiveteam-bs
[21:45] *** Stilett0 has quit IRC (Read error: Operation timed out)
[21:49] *** tomaspark has joined #archiveteam-bs
[22:05] *** Dj-Wawa has joined #archiveteam-bs
[22:08] arkiver: Can I have access to uolforums-items (and maybe also -grab)?
[22:12] *** Wizzito has joined #archiveteam-bs
[22:12] So many 0/0.1 MBs on the UOL tracker
[22:12] ... we've done 2 GB so far, eh?
[22:16] *** antomatic has joined #archiveteam-bs
[22:16] *** swebb sets mode: +o antomatic
[22:19] most threads seem empty
[22:26] *** BlueMax has joined #archiveteam-bs
[22:27] *** Stilett0 has joined #archiveteam-bs
[22:29] *** Stiletto has quit IRC (Read error: Operation timed out)
[22:30] yes Wizzito we try to keep #archiveteam quiet, so people who aren't very active can catch up on new stuff
[22:30] ok
[22:30] without having to read past pages and pages of random discussions
[22:34] *** DasBub has quit IRC (Quit: rebeught)
[22:35] The small items seem legit.
Even very old threads seem to use larger IDs. I've only seen a handful of threads under 1 million.
[22:36] *** DasBub has joined #archiveteam-bs
[22:37] arkiver: Access to uolforums-items please? We're already halfway done with the first 100k items. :-)
[22:38] *** exoire has joined #archiveteam-bs
[22:39] Also, http://jsuol.com.br/p/forum/j/funcoes_admin.js?1.1.17 is a candidate for the ignore list.
[22:40] Hmm, what's with errors like this? http://forum.jogos.uol.com.br/_t_221480 "Erro buscando a página 1 do tópico 221480"
[22:40] Apparently that's returned as an HTTP 200.
[22:41] Wait no, now I get a redirect to the homepage.
[22:45] Reducing the rate to see if we see fewer of those errors then.
[22:46] arkiver: ^ What do we want to do about that?
[22:46] Abort when there's such an error?
[22:55] yes
[22:56] I think I've seen a different error before as well, but I don't remember the message.
[22:56] you have access now
[22:56] *** Stilett0 has quit IRC (Ping timeout: 246 seconds)
[22:58] Here's an example HTML page with that error from http://forum.jogos.uol.com.br/_t_692873 : https://transfer.sh/aGKYU/F%C3%B3rum%20UOL%20Jogos%20::%20%C3%8Dndice%20do%20f%C3%B3rum.html
[23:00] *** Stiletto has joined #archiveteam-bs
[23:19] *** Stilett0 has joined #archiveteam-bs
[23:21] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:26] Pausing the project. arkiver, can you add a fix for that error?
[23:27] Oh, we're pausing the UOL grab? Can I turn off my Warrior if that's the case?
[23:31] Just tried provoking that error under load on an existing topic, but that didn't seem to work. Not sure if that means we didn't miss content though. In any case, we should abort when we get the error probably.
[23:32] Wizzito: You can do that whenever you want, or you can let it run and automatically resume when we're ready again. Up to you really.
[23:33] is there a channel for the UOL project?
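A minimal sketch of the "abort when we get the error" check discussed above, assuming the only signals are the two seen in the log: the quoted error text arriving with an HTTP 200, and a redirect from a thread URL to the forum homepage. The function names and the redirect-code set are hypothetical; the log does not show how the actual uolforums-grab code handles this.

```python
import re
from typing import Optional
from urllib.parse import urljoin, urlsplit

# Error text as quoted in the log; the surrounding page markup is unknown here.
SOFT_ERROR_RE = re.compile(r"Erro buscando a página \d+ do tópico \d+")

# Assumption: any of the usual redirect status codes may be used by the site.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_soft_error(status: int, body: str) -> bool:
    """True when a 200 response carries the error text instead of a thread."""
    return status == 200 and SOFT_ERROR_RE.search(body) is not None


def redirects_to_homepage(requested_url: str, location: str) -> bool:
    """True when a non-root request is redirected to the root of the same host."""
    source = urlsplit(requested_url)
    # Resolve relative Location values such as "/" against the request URL.
    target = urlsplit(urljoin(requested_url, location))
    return (
        target.netloc == source.netloc
        and target.path in ("", "/")
        and source.path not in ("", "/")
    )


def should_abort(status: int, body: str, requested_url: str,
                 location: Optional[str] = None) -> bool:
    """Decide whether an item should be aborted rather than saved as-is."""
    if is_soft_error(status, body):
        return True
    if status in REDIRECT_CODES and location:
        return redirects_to_homepage(requested_url, location)
    return False
```

Aborting instead of saving means the item can be retried later, so an error page is never recorded as the real content of a thread.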
[23:38] Nope
[23:38] Considering more of UOL will go down soon most likely (as mentioned earlier in #archiveteam), we might want to create one though.
[23:40] Any good ideas for a channel name?
[23:49] *** chimyatta has quit IRC (Quit: quitting)
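A rough sketch of the ID-based queueing discussed earlier in this log (short thread URLs such as `_t_3341884`, and thread IDs 0-999999 queued in chunks). The two URL shapes are taken from the links quoted in the log; the helper names and the batch size are hypothetical and do not reflect the tracker's real item format.

```python
from typing import Iterator


def thread_url(thread_id: int, beta: bool = False) -> str:
    """Build the slug-less thread URL form quoted in the log."""
    if beta:
        return f"http://beta.forum.jogos.uol.com.br/t/{thread_id}/"
    return f"http://forum.jogos.uol.com.br/_t_{thread_id}"


def id_batches(start: int, stop: int, size: int) -> Iterator[range]:
    """Split the half-open ID range [start, stop) into consecutive batches."""
    for low in range(start, stop, size):
        yield range(low, min(low + size, stop))


if __name__ == "__main__":
    # Assumed batch size; the log only mentions a first load of 100000 items.
    for batch in id_batches(0, 1_000_000, 100_000):
        print(batch.start, batch.stop - 1, thread_url(batch.start))
```

Enumerating IDs rather than crawling index pages sidesteps the truncated index pages mentioned in the log, at the cost of requesting many IDs that turn out to be empty.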