Time |
Nickname |
Message |
00:48
🔗
|
|
dxrt has quit IRC (Ping timeout: 252 seconds) |
00:51
🔗
|
|
chazchaz has quit IRC (Read error: Operation timed out) |
00:57
🔗
|
|
dxrt has joined #archiveteam-bs |
00:57
🔗
|
|
Fusl____ sets mode: +o dxrt |
00:57
🔗
|
|
Fusl sets mode: +o dxrt |
00:57
🔗
|
|
Fusl_ sets mode: +o dxrt |
01:01
🔗
|
|
chazchaz has joined #archiveteam-bs |
01:12
🔗
|
|
ats has joined #archiveteam-bs |
01:12
🔗
|
|
ats_ has quit IRC (Read error: Operation timed out) |
01:17
🔗
|
|
bluefoo has quit IRC (Read error: Connection reset by peer) |
01:19
🔗
|
|
bluefoo has joined #archiveteam-bs |
01:24
🔗
|
|
larryv has quit IRC (Quit: larryv) |
02:05
🔗
|
|
DogsRNice has quit IRC (Read error: Connection reset by peer) |
02:50
🔗
|
SketchCow |
Just posted https://archive.org/details/ouyalibrary |
02:50
🔗
|
SketchCow |
See you in an unspecified CIA jail |
03:30
🔗
|
|
qw3rty2 has joined #archiveteam-bs |
03:35
🔗
|
|
qw3rty has quit IRC (Ping timeout: 745 seconds) |
03:37
🔗
|
|
odemgi has joined #archiveteam-bs |
03:40
🔗
|
|
odemgi_ has quit IRC (Ping timeout: 252 seconds) |
03:45
🔗
|
|
Stiletto has quit IRC (Ping timeout: 360 seconds) |
04:02
🔗
|
|
Stiletto has joined #archiveteam-bs |
05:00
🔗
|
|
fredgido has quit IRC (Ping timeout: 612 seconds) |
05:01
🔗
|
|
fredgido has joined #archiveteam-bs |
05:15
🔗
|
|
fredgido_ has joined #archiveteam-bs |
05:22
🔗
|
|
fredgido has quit IRC (Read error: Operation timed out) |
06:19
🔗
|
|
deevious has quit IRC (Quit: deevious) |
06:28
🔗
|
|
joshua_ has quit IRC (Remote host closed the connection) |
06:33
🔗
|
|
DigiDigi has quit IRC (Remote host closed the connection) |
07:28
🔗
|
|
eythian has quit IRC (Ping timeout: 258 seconds) |
07:34
🔗
|
|
eythian has joined #archiveteam-bs |
09:42
🔗
|
|
K4k has quit IRC (Read error: Operation timed out) |
09:43
🔗
|
|
K4k has joined #archiveteam-bs |
09:52
🔗
|
|
deevious has joined #archiveteam-bs |
10:41
🔗
|
|
bluefoo has quit IRC (Read error: Connection reset by peer) |
11:01
🔗
|
|
bluefoo has joined #archiveteam-bs |
11:20
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
13:22
🔗
|
|
second has joined #archiveteam-bs |
13:26
🔗
|
|
deevious has quit IRC (Quit: deevious) |
13:43
🔗
|
|
DigiDigi has joined #archiveteam-bs |
13:44
🔗
|
|
fredgido has joined #archiveteam-bs |
13:50
🔗
|
|
fredgido_ has quit IRC (Read error: Operation timed out) |
15:22
🔗
|
|
JH881326 has joined #archiveteam-bs |
15:40
🔗
|
|
eythian has quit IRC (Remote host closed the connection) |
15:40
🔗
|
SketchCow |
Heyyyyy godane - I finally got one for you. |
15:40
🔗
|
SketchCow |
https://community.apan.org/wg/tradoc-g2/fmso/p/fmso-file-cabinet |
15:40
🔗
|
SketchCow |
I got all the books in the "bookshelf" up. But these monographs. Your best skills in transferring them over, please |
15:41
🔗
|
SketchCow |
also https://community.apan.org/wg/tradoc-g2/fmso/p/oe-watch-issues |
15:42
🔗
|
|
eythian has joined #archiveteam-bs |
15:57
🔗
|
|
super3_ is now known as super3 |
16:07
🔗
|
SketchCow |
I completely forgot I had that thing running against the archivebot collection, and it's been doing the work! Down to page 4 |
16:25
🔗
|
|
Sauce has joined #archiveteam-bs |
16:30
🔗
|
|
Stilettoo has joined #archiveteam-bs |
16:31
🔗
|
|
Stiletto has quit IRC (Ping timeout: 255 seconds) |
16:45
🔗
|
|
ShellyRol has quit IRC (Ping timeout: 492 seconds) |
16:46
🔗
|
|
ShellyRol has joined #archiveteam-bs |
17:04
🔗
|
|
ShellyRol has quit IRC (Remote host closed the connection) |
17:04
🔗
|
|
ShellyRol has joined #archiveteam-bs |
17:06
🔗
|
|
Sauce has quit IRC (C:\exit.exe) |
17:07
🔗
|
|
jamiew has joined #archiveteam-bs |
17:13
🔗
|
godane |
SketchCow: i will check, and do it the same way i did the ERIC archives |
17:13
🔗
|
godane |
find the pdfs and filter out the html pages after |
17:15
🔗
|
godane |
i will call it the APAN archives or something like that |
17:17
🔗
|
godane |
oh this may be good |
17:17
🔗
|
godane |
looks like i may be able to grab every id with a pdf |
17:17
🔗
|
godane |
even stuff like OEWatch |
17:18
🔗
|
godane |
even though i'm doing it with the fmso path |
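The "find the pdfs and filter out the html pages after" step godane describes could look roughly like the sketch below. It is only an illustration, not godane's actual tooling: the function name `keep_pdf_urls` and the example URLs are hypothetical, and it assumes a PDF can be recognised by a `.pdf` suffix on the URL path, which may not hold for every link on the site.

```python
from urllib.parse import urlparse


def keep_pdf_urls(urls):
    """Return URLs whose path ends in .pdf, de-duplicated, in first-seen order.

    Assumption: PDFs are identifiable by a ".pdf" path suffix. Links that
    serve PDFs from extension-less paths would be missed by this filter.
    """
    seen = set()
    kept = []
    for raw in urls:
        url = raw.strip()
        if not url or url in seen:
            continue
        # Compare only the path so query strings do not hide the suffix.
        if urlparse(url).path.lower().endswith(".pdf"):
            seen.add(url)
            kept.append(url)
    return kept


if __name__ == "__main__":
    # Hypothetical crawl output mixing HTML pages and PDFs.
    crawled = [
        "https://example.org/docs/index.html",
        "https://example.org/docs/report.pdf",
        "https://example.org/docs/REPORT2.PDF?download=1",
        "https://example.org/docs/report.pdf",
    ]
    for url in keep_pdf_urls(crawled):
        print(url)
```

Filtering on the parsed path rather than the raw string keeps URLs with query parameters from slipping through or being dropped by accident.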
17:32
🔗
|
|
Stilettoo is now known as Stiletto |
17:44
🔗
|
second |
Is archlinux just storing packages here https://archive.org/search.php?query=creator%3A%22Arch+Linux%22 |
17:45
🔗
|
second |
along with history |
18:04
🔗
|
|
zhongfu has quit IRC (Remote host closed the connection) |
18:05
🔗
|
|
zhongfu has joined #archiveteam-bs |
18:33
🔗
|
|
jamiew has quit IRC (Quit: My iMac has gone to sleep. ZZZzzz…) |
18:42
🔗
|
|
PurpleSym has quit IRC (Remote host closed the connection) |
18:44
🔗
|
|
bluefoo has quit IRC (Read error: Operation timed out) |
18:51
🔗
|
|
DogsRNice has joined #archiveteam-bs |
18:51
🔗
|
|
ShellyRol has quit IRC (Read error: Connection reset by peer) |
18:51
🔗
|
|
ShellyRol has joined #archiveteam-bs |
18:54
🔗
|
|
systwi has joined #archiveteam-bs |
19:03
🔗
|
|
bluefoo has joined #archiveteam-bs |
20:34
🔗
|
|
systwi_ has joined #archiveteam-bs |
20:36
🔗
|
|
ats has quit IRC (Quit: new motherboard (caution: sharks)) |
20:42
🔗
|
|
systwi has quit IRC (Read error: Operation timed out) |
20:43
🔗
|
|
MaximeleG has joined #archiveteam-bs |
20:44
🔗
|
|
MaximeleG has quit IRC (Client Quit) |
20:50
🔗
|
|
MaximeleG has joined #archiveteam-bs |
20:50
🔗
|
|
MaximeleG has quit IRC (Client Quit) |
20:58
🔗
|
jleclanch |
I'm still working on the archival of the legacy battle.net forums. So far I have all of EU, TW and KR discovered (~5.5M URLs discovered), and I'm almost done crawling the US forums (will probably be another 5-10M URLs). Could someone more knowledgeable guide me through the process of actually archiving it all once I have all the URLs? |
20:59
🔗
|
jleclanch |
I have a data extractor written in python that I can run on a page to extract the relevant data in JSON format. I wanted to build a proper archive of this, not just HTML dumps |
21:16
🔗
|
Fusl |
jleclanch: you mean us.battle.net? thats been done already |
21:18
🔗
|
Fusl |
as for archiving, we've done that on mips since they started blocking archivebot pipelines |
21:22
🔗
|
|
odemgi has quit IRC (Read error: Connection reset by peer) |
21:22
🔗
|
|
odemgi has joined #archiveteam-bs |
21:24
🔗
|
jleclanch |
done where, do you have info about it? |
21:24
🔗
|
jleclanch |
and yeah, like these forums: https://eu.battle.net/forums/en/wow/12309726/?page=5 |
21:40
🔗
|
Fusl |
oh wow i didnt know these were a thing |
21:41
🔗
|
Fusl |
i thought us.battle.net was the only one |
21:42
🔗
|
jleclanch |
Fusl: there's us.bnet, eu.bnet, kr.bnet and tw.bnet. I believe I have an exhaustive list of every single one of them (publicly available, that is) |
21:42
🔗
|
jleclanch |
Fusl: metadata on every single forum: https://dpaste.de/KNmh |
21:45
🔗
|
Fusl |
thanks, ill start a mips job for those |
21:45
🔗
|
jleclanch |
Fusl: what's a mips job? |
21:46
🔗
|
Fusl |
like #archivebot but on a special server with lots of ips and horse power |
21:47
🔗
|
jleclanch |
Fusl: I see. Is it possible to get a dump of the HTML of all the pages collected? I'm interested in extracting the data and making it all searchable |
21:47
🔗
|
jleclanch |
I have URL lists if you want |
21:47
🔗
|
Fusl |
sure |
21:47
🔗
|
jleclanch |
let me upload them to drive, 1s |
21:49
🔗
|
jleclanch |
Fusl: kr.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/lgtj5EAe/urls.kr.txt.xz |
21:49
🔗
|
jleclanch |
Fusl: tw.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/XUfmFZvd/urls.tw.txt.xz |
21:49
🔗
|
jleclanch |
(eu is still compressing) |
21:49
🔗
|
jleclanch |
Fusl: eu.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/L8bVuss2/urls.eu.txt.xz |
21:50
🔗
|
jleclanch |
I should have US ready in a couple of days at most. |
21:51
🔗
|
Fusl |
ill throw all of them + the category starting point urls into mips to grab all of it |
21:51
🔗
|
jleclanch |
Actually, I say all URLs, but none of these contain category URLs (topic listings). I'll be able to generate those at some point; I already have all of them in a sqlite db |
21:51
🔗
|
jleclanch |
but it's all post URLs, including pages inside the posts |
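For anyone wanting to work with the `.xz` URL lists linked above, a minimal sketch of reading and merging them follows. It assumes each file is plain text with one URL per line, which the log does not confirm; the helper names `iter_urls` and `merge_url_lists` are hypothetical.

```python
import lzma


def iter_urls(path):
    """Yield non-empty lines from an xz-compressed text file.

    Assumption: the file holds one URL per line, UTF-8 encoded.
    """
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()
            if url:
                yield url


def merge_url_lists(paths):
    """Combine several URL lists, dropping duplicates but keeping order."""
    seen = set()
    merged = []
    for path in paths:
        for url in iter_urls(path):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    # File names as linked in the log; they must be downloaded first.
    urls = merge_url_lists(["urls.kr.txt.xz", "urls.tw.txt.xz", "urls.eu.txt.xz"])
    print(len(urls), "unique URLs")
```

Holding every URL in a set is fine for a few million entries; for much larger lists a sorted external merge may be a better fit.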
21:53
🔗
|
Fusl |
thanks |
21:53
🔗
|
jleclanch |
I'll ping back when US is ready. Fusl: where can I get a hold of all the pages afterwards, in a way that won't require me to scrape archive.org? |
21:53
🔗
|
Fusl |
ill dump them onto a storage server, you can download it with rsync from there for a few weeks until they are purged |
21:54
🔗
|
jleclanch |
sweet, thank you :) |
21:54
🔗
|
jleclanch |
I can also generate category page URLs if you want, btw (I have total topic counts, so I can figure out how many pages there are in a category) |
21:55
🔗
|
Fusl |
no worries, i generated those using the json |
21:55
🔗
|
jleclanch |
well by exhaustive I mean including ?page=x, ?page=x+1, etc |
21:56
🔗
|
jleclanch |
eg. this is the last page of the wow general discussion forum: https://us.battle.net/forums/en/wow/984270/?page=22090 (and loading it now probably will crash) |
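The exhaustive `?page=x` enumeration jleclanch mentions can be sketched from a topic count as below. This is an illustration under assumptions, not the script either participant used: the function name `category_page_urls` is hypothetical, the number of topics per page is not stated in the log and must be supplied by the caller, and page numbering is assumed to start at 1.

```python
def category_page_urls(base_url, topic_count, topics_per_page):
    """Return every paginated listing URL for one forum category.

    Assumptions: pages are numbered from 1, and the forum shows a fixed
    number of topics per page (topics_per_page is a guess the caller
    must provide; the log does not state it).
    """
    if topics_per_page <= 0:
        raise ValueError("topics_per_page must be positive")
    # Integer ceiling division; an empty category still has one page.
    pages = max(1, (topic_count + topics_per_page - 1) // topics_per_page)
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]


if __name__ == "__main__":
    # Category URL taken from the log; the counts here are made up.
    base = "https://us.battle.net/forums/en/wow/984270/"
    urls = category_page_urls(base, topic_count=101, topics_per_page=50)
    print(len(urls), urls[-1])
```

Integer ceiling division avoids floating-point rounding on very large topic counts.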
22:08
🔗
|
Fusl |
feel free to follow the scraping progress here http://103.230.141.2:29000/ and here https://atdash.meo.ws/d/grabsitemips/grab-site-mips-pipeline?orgId=1&var-ident=e8cf4322&var-ident=ef4e18eb&var-ident=e16f9714&var-ident=d0b72ae6 |
22:09
🔗
|
jleclanch |
neat |
22:29
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
22:38
🔗
|
|
ats has joined #archiveteam-bs |
22:43
🔗
|
|
killsushi has joined #archiveteam-bs |