#archiveteam-bs 2019-09-25,Wed


Time Nickname Message
00:48 🔗 dxrt has quit IRC (Ping timeout: 252 seconds)
00:51 🔗 chazchaz has quit IRC (Read error: Operation timed out)
00:57 🔗 dxrt has joined #archiveteam-bs
00:57 🔗 Fusl____ sets mode: +o dxrt
00:57 🔗 Fusl sets mode: +o dxrt
00:57 🔗 Fusl_ sets mode: +o dxrt
01:01 🔗 chazchaz has joined #archiveteam-bs
01:12 🔗 ats has joined #archiveteam-bs
01:12 🔗 ats_ has quit IRC (Read error: Operation timed out)
01:17 🔗 bluefoo has quit IRC (Read error: Connection reset by peer)
01:19 🔗 bluefoo has joined #archiveteam-bs
01:24 🔗 larryv has quit IRC (Quit: larryv)
02:05 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
02:50 🔗 SketchCow Just posted https://archive.org/details/ouyalibrary
02:50 🔗 SketchCow See you in an unspecified CIA jail
03:30 🔗 qw3rty2 has joined #archiveteam-bs
03:35 🔗 qw3rty has quit IRC (Ping timeout: 745 seconds)
03:37 🔗 odemgi has joined #archiveteam-bs
03:40 🔗 odemgi_ has quit IRC (Ping timeout: 252 seconds)
03:45 🔗 Stiletto has quit IRC (Ping timeout: 360 seconds)
04:02 🔗 Stiletto has joined #archiveteam-bs
05:00 🔗 fredgido has quit IRC (Ping timeout: 612 seconds)
05:01 🔗 fredgido has joined #archiveteam-bs
05:15 🔗 fredgido_ has joined #archiveteam-bs
05:22 🔗 fredgido has quit IRC (Read error: Operation timed out)
06:19 🔗 deevious has quit IRC (Quit: deevious)
06:28 🔗 joshua_ has quit IRC (Remote host closed the connection)
06:33 🔗 DigiDigi has quit IRC (Remote host closed the connection)
07:28 🔗 eythian has quit IRC (Ping timeout: 258 seconds)
07:34 🔗 eythian has joined #archiveteam-bs
09:42 🔗 K4k has quit IRC (Read error: Operation timed out)
09:43 🔗 K4k has joined #archiveteam-bs
09:52 🔗 deevious has joined #archiveteam-bs
10:41 🔗 bluefoo has quit IRC (Read error: Connection reset by peer)
11:01 🔗 bluefoo has joined #archiveteam-bs
11:20 🔗 killsushi has quit IRC (Quit: Leaving)
13:22 🔗 second has joined #archiveteam-bs
13:26 🔗 deevious has quit IRC (Quit: deevious)
13:43 🔗 DigiDigi has joined #archiveteam-bs
13:44 🔗 fredgido has joined #archiveteam-bs
13:50 🔗 fredgido_ has quit IRC (Read error: Operation timed out)
15:22 🔗 JH881326 has joined #archiveteam-bs
15:40 🔗 eythian has quit IRC (Remote host closed the connection)
15:40 🔗 SketchCow Heyyyyy godane - I finally got one for you.
15:40 🔗 SketchCow https://community.apan.org/wg/tradoc-g2/fmso/p/fmso-file-cabinet
15:40 🔗 SketchCow I got all the books in the "bookshelf" up. But these monographs. Your best skills in transferring them over, please
15:41 🔗 SketchCow also https://community.apan.org/wg/tradoc-g2/fmso/p/oe-watch-issues
15:42 🔗 eythian has joined #archiveteam-bs
15:57 🔗 super3_ is now known as super3
16:07 🔗 SketchCow I completely forgot I had that thing running against the archivebot collection, and it's been doing the work! Down to page 4
16:25 🔗 Sauce has joined #archiveteam-bs
16:30 🔗 Stilettoo has joined #archiveteam-bs
16:31 🔗 Stiletto has quit IRC (Ping timeout: 255 seconds)
16:45 🔗 ShellyRol has quit IRC (Ping timeout: 492 seconds)
16:46 🔗 ShellyRol has joined #archiveteam-bs
17:04 🔗 ShellyRol has quit IRC (Remote host closed the connection)
17:04 🔗 ShellyRol has joined #archiveteam-bs
17:06 🔗 Sauce has quit IRC (C:\exit.exe)
17:07 🔗 jamiew has joined #archiveteam-bs
17:13 🔗 godane SketchCow: i will check; i'll do it the same way i did the ERIC archives
17:13 🔗 godane find the pdfs and filter out the html pages after
17:15 🔗 godane i will call it the APAN archives or something like that
17:17 🔗 godane oh this may be good
17:17 🔗 godane looks like i may be able to grab every id with a pdf
17:17 🔗 godane even stuff like OEWatch
17:18 🔗 godane even though i'm doing it with the fmso path
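The "find the pdfs and filter out the html pages after" step godane describes could be sketched roughly as below. This is only an illustration: godane's actual tooling is not shown in the log, the function name `split_pdfs` is hypothetical, and it assumes PDFs can be recognised by a `.pdf` suffix on the URL path, which may not hold for every APAN link.

```python
from urllib.parse import urlsplit


def split_pdfs(urls):
    """Separate crawled URLs into PDF documents and everything else.

    Assumption: a PDF is any URL whose path ends in ".pdf"
    (case-insensitive); query strings are ignored.
    """
    pdfs, others = [], []
    for url in urls:
        path = urlsplit(url).path.lower()
        (pdfs if path.endswith(".pdf") else others).append(url)
    return pdfs, others
```

Checking the `Content-Type` of each response would be more reliable than the suffix test, at the cost of one request per URL.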
17:32 🔗 Stilettoo is now known as Stiletto
17:44 🔗 second Is archlinux just storing packages here https://archive.org/search.php?query=creator%3A%22Arch+Linux%22
17:45 🔗 second along with history
18:04 🔗 zhongfu has quit IRC (Remote host closed the connection)
18:05 🔗 zhongfu has joined #archiveteam-bs
18:33 🔗 jamiew has quit IRC (Quit: My iMac has gone to sleep. ZZZzzz…)
18:42 🔗 PurpleSym has quit IRC (Remote host closed the connection)
18:44 🔗 bluefoo has quit IRC (Read error: Operation timed out)
18:51 🔗 DogsRNice has joined #archiveteam-bs
18:51 🔗 ShellyRol has quit IRC (Read error: Connection reset by peer)
18:51 🔗 ShellyRol has joined #archiveteam-bs
18:54 🔗 systwi has joined #archiveteam-bs
19:03 🔗 bluefoo has joined #archiveteam-bs
20:34 🔗 systwi_ has joined #archiveteam-bs
20:36 🔗 ats has quit IRC (Quit: new motherboard (caution: sharks))
20:42 🔗 systwi has quit IRC (Read error: Operation timed out)
20:43 🔗 MaximeleG has joined #archiveteam-bs
20:44 🔗 MaximeleG has quit IRC (Client Quit)
20:50 🔗 MaximeleG has joined #archiveteam-bs
20:50 🔗 MaximeleG has quit IRC (Client Quit)
20:58 🔗 jleclanch I'm still working on the archival of the legacy battle.net forums. So far I have all of EU, TW and KR discovered (~5.5M URLs discovered), and I'm almost done crawling the US forums (will probably be another 5-10M URLs). Could someone more knowledgeable guide me through the process of actually archiving it all once I have all the URLs?
20:59 🔗 jleclanch I have a data extractor written in python that I can run on a page to extract the relevant data in JSON format. I wanted to build a proper archive of this, not just HTML dumps
21:16 🔗 Fusl jleclanch: you mean us.battle.net? that's been done already
21:18 🔗 Fusl as for archiving, we've done that on mips since they started blocking archivebot pipelines
21:22 🔗 odemgi has quit IRC (Read error: Connection reset by peer)
21:22 🔗 odemgi has joined #archiveteam-bs
21:24 🔗 jleclanch done where, do you have info about it?
21:24 🔗 jleclanch and yeah, like these forums: https://eu.battle.net/forums/en/wow/12309726/?page=5
21:40 🔗 Fusl oh wow i didn't know these were a thing
21:41 🔗 Fusl i thought us.battle.net was the only one
21:42 🔗 jleclanch Fusl: there's us.bnet, eu.bnet, kr.bnet and tw.bnet. I believe I have an exhaustive list of every single one of them (publicly available, that is)
21:42 🔗 jleclanch Fusl: metadata on every single forum: https://dpaste.de/KNmh
21:45 🔗 Fusl thanks, i'll start a mips job for those
21:45 🔗 jleclanch Fusl: what's a mips job?
21:46 🔗 Fusl like #archivebot but on a special server with lots of ips and horse power
21:47 🔗 jleclanch Fusl: I see. Is it possible to get a dump of the HTML of all the pages collected? I'm interested in extracting the data and making it all searchable
21:47 🔗 jleclanch I have URL lists if you want
21:47 🔗 Fusl sure
21:47 🔗 jleclanch let me upload them to drive, 1s
21:49 🔗 jleclanch Fusl: kr.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/lgtj5EAe/urls.kr.txt.xz
21:49 🔗 jleclanch Fusl: tw.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/XUfmFZvd/urls.tw.txt.xz
21:49 🔗 jleclanch (eu is still compressing)
21:49 🔗 jleclanch Fusl: eu.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/L8bVuss2/urls.eu.txt.xz
21:50 🔗 jleclanch I should have US ready in a couple of days at most.
21:51 🔗 Fusl i'll throw all of them + the category starting point urls into mips to grab all of it
21:51 🔗 jleclanch Actually I say all URLs, none of these contain category URLs (topic listings). I'll be able to generate those at some point, but I already actually have all of them in a sqlite db
21:51 🔗 jleclanch but it's all post URLs, including pages inside the posts
21:53 🔗 Fusl thanks
21:53 🔗 jleclanch I'll ping back when US is ready. Fusl: where can I get a hold of all the pages afterwards, in a way that won't require me to scrape archive.org?
21:53 🔗 Fusl i'll dump them onto a storage server, you can download it with rsync from there for a few weeks until they are purged
21:54 🔗 jleclanch sweet, thank you :)
21:54 🔗 jleclanch I can also generate category page URLs if you want, btw (I have total topic counts, so I can figure out how many pages there are in a category)
21:55 🔗 Fusl no worries, i generated those using the json
21:55 🔗 jleclanch well by exhaustive I mean including ?page=x, ?page=x+1, etc
21:56 🔗 jleclanch eg. this is the last page of the wow general discussion forum: https://us.battle.net/forums/en/wow/984270/?page=22090 (and loading it now probably will crash)
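The exhaustive category listing jleclanch describes (deriving every `?page=x` URL from a forum's total topic count) could look something like the sketch below. It is an assumption-laden illustration, not the script either participant used: `category_page_urls` is a hypothetical name, and `topics_per_page` has to be supplied by the caller because the log does not state how many topics the forums show per page.

```python
import math


def category_page_urls(base_url, topic_count, topics_per_page):
    """Return every ?page=N URL for one forum category.

    base_url        -- category URL without a query string
    topic_count     -- total number of topics in the category
    topics_per_page -- topics listed per page (assumed, not from the log)
    """
    if topics_per_page <= 0:
        raise ValueError("topics_per_page must be positive")
    # An empty category still has one (empty) listing page.
    pages = max(1, math.ceil(topic_count / topics_per_page))
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]
```

Calling it once per forum in the metadata list and concatenating the results would give a flat URL list in the same shape as the `urls.*.txt` files above.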
22:08 🔗 Fusl feel free to follow the scraping progress here http://103.230.141.2:29000/ and here https://atdash.meo.ws/d/grabsitemips/grab-site-mips-pipeline?orgId=1&var-ident=e8cf4322&var-ident=ef4e18eb&var-ident=e16f9714&var-ident=d0b72ae6
22:09 🔗 jleclanch neat
22:29 🔗 BlueMax has quit IRC (Quit: Leaving)
22:38 🔗 ats has joined #archiveteam-bs
22:43 🔗 killsushi has joined #archiveteam-bs
