#archiveteam-bs 2020-06-25,Thu


Time Nickname Message
00:24 πŸ”— BlueMax has joined #archiveteam-bs
00:30 πŸ”— lennier2 has joined #archiveteam-bs
00:39 πŸ”— sivoais_ has joined #archiveteam-bs
00:41 πŸ”— justcool3 has joined #archiveteam-bs
00:43 πŸ”— lennier1 has quit IRC (Ping timeout: 745 seconds)
00:43 πŸ”— lennier2 is now known as lennier1
00:49 πŸ”— sivoais has quit IRC (se.hub irc.efnet.nl)
00:49 πŸ”— Mateon1 has quit IRC (se.hub irc.efnet.nl)
00:49 πŸ”— Ctrl has quit IRC (se.hub irc.efnet.nl)
00:49 πŸ”— Meli has quit IRC (se.hub irc.efnet.nl)
01:20 πŸ”— OrIdow6 has quit IRC (Read error: Connection reset by peer)
01:20 πŸ”— OrIdow6 has joined #archiveteam-bs
01:24 πŸ”— Mateon1 has joined #archiveteam-bs
01:29 πŸ”— Ctrl has joined #archiveteam-bs
01:30 πŸ”— fredgido_ has quit IRC (Read error: Connection reset by peer)
01:34 πŸ”— fredgido_ has joined #archiveteam-bs
02:31 πŸ”— vitzli has joined #archiveteam-bs
02:33 πŸ”— vitzli has quit IRC (Client Quit)
02:42 πŸ”— wyatt8750 has joined #archiveteam-bs
02:43 πŸ”— wyatt8740 has quit IRC (Ping timeout: 260 seconds)
03:31 πŸ”— DogsRNice has quit IRC (Read error: Connection reset by peer)
03:41 πŸ”— qw3rty__ has quit IRC (Ping timeout: 265 seconds)
03:55 πŸ”— wyatt8750 has quit IRC (Read error: Operation timed out)
03:55 πŸ”— wyatt8740 has joined #archiveteam-bs
04:16 πŸ”— HP_Archiv has joined #archiveteam-bs
04:45 πŸ”— OrIdow6 has quit IRC (Quit: Leaving.)
06:07 πŸ”— britmob_ has quit IRC (Read error: Connection reset by peer)
06:08 πŸ”— britmob has joined #archiveteam-bs
06:11 πŸ”— dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
06:18 πŸ”— dashcloud has joined #archiveteam-bs
06:24 πŸ”— Terbium has quit IRC (Read error: Connection reset by peer)
06:25 πŸ”— Terbium has joined #archiveteam-bs
06:39 πŸ”— qw3rty has joined #archiveteam-bs
06:50 πŸ”— justcool3 has quit IRC (Quit: Connection closed for inactivity)
07:20 πŸ”— OrIdow6 has joined #archiveteam-bs
09:26 πŸ”— Arcorann Some time ago I asked about archiving someone's Twitter and someone mentioned socialbot? Would like to try it out (someone I follow is planning to close their account)
09:54 πŸ”— OrIdow6 has quit IRC (Remote host closed the connection)
09:54 πŸ”— OrIdow6 has joined #archiveteam-bs
10:01 πŸ”— Ryz Hello Arcorann, what Twitter account do you want to archive? o:
10:22 πŸ”— Arcorann Handle is @CINDERELLAGlRLS
10:40 πŸ”— Ryz Ah, found the source saying they're shutting down; it's now being run through socialbot and will go through AB afterwards
10:40 πŸ”— Arcorann Thanks
10:42 πŸ”— asdf01010 has joined #archiveteam-bs
10:45 πŸ”— asdf0101 has quit IRC (Read error: Operation timed out)
10:45 πŸ”— asdf01010 is now known as asdf0101
11:09 πŸ”— wessel152 has joined #archiveteam-bs
11:35 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
12:09 πŸ”— lunik13 has quit IRC (Ping timeout: 265 seconds)
12:23 πŸ”— lunik13 has joined #archiveteam-bs
12:28 πŸ”— dashcloud has quit IRC (Read error: Operation timed out)
12:34 πŸ”— HP_Archiv has quit IRC (Quit: Leaving)
14:26 πŸ”— Arcorann has quit IRC (Read error: Connection reset by peer)
14:53 πŸ”— pew has quit IRC (Ping timeout: 265 seconds)
14:53 πŸ”— pew has joined #archiveteam-bs
15:59 πŸ”— dashcloud has joined #archiveteam-bs
16:14 πŸ”— Dj-Wawa has quit IRC (Quit: Dj-Wawa)
16:15 πŸ”— Dj-Wawa has joined #archiveteam-bs
16:19 πŸ”— DogsRNice has joined #archiveteam-bs
16:21 πŸ”— Panasonic has quit IRC (Read error: Connection reset by peer)
17:10 πŸ”— Dj-Wawa has quit IRC (Dj-Wawa)
17:21 πŸ”— Dj-Wawa has joined #archiveteam-bs
17:51 πŸ”— jodizzle Looks like arkiver is purging
18:03 πŸ”— ripdog has quit IRC (Remote host closed the connection)
18:14 πŸ”— logchfoo3 starts logging #archiveteam-bs at Thu Jun 25 18:14:03 2020
18:14 πŸ”— logchfoo3 has joined #archiveteam-bs
18:14 πŸ”— pew has joined #archiveteam-bs
18:17 πŸ”— VoynichCr has joined #archiveteam-bs
18:17 πŸ”— Dj-Wawa has joined #archiveteam-bs
18:18 πŸ”— i0npulse has joined #archiveteam-bs
18:18 πŸ”— roxfan has joined #archiveteam-bs
18:19 πŸ”— atomicthu has joined #archiveteam-bs
18:22 πŸ”— atomicthu JAA: hello
18:24 πŸ”— atomicthu i have no idea how suitable what i'm doing is for archive team's work, i've just got a wpull oneliner that scrapes an individual SA subforum, downloads the forum and threads, plus page requisites
18:25 πŸ”— atomicthu every page downloaded has my username embedded in it
18:25 πŸ”— atomicthu due to being logged in and such
18:25 πŸ”— JAA Are you producing WARC files or plain .html etc.?
18:25 πŸ”— atomicthu both
18:26 πŸ”— JAA Good. The WARC files are what we're after. But yes, being logged in is always a bit problematic.
18:26 πŸ”— atomicthu I saw a post earlier saying the admins aren't sure what the future of the site is and are looking at options for offsites. To save time I turned off grabbing stuff like "show user's posts in this thread" and clickable links to quoted posts, so those will be broken.
18:27 πŸ”— atomicthu no idea if wpull's database allows a rescrape with different parameters to grab stuff like that. if it does i might turn off external page requisites too
18:28 πŸ”— atomicthu I'm not getting Archives either, despite paying for access, because it looks like that functionality is broken serverside
18:29 πŸ”— atomicthu Old threads that have continued posting and haven't fallen off the boards will still be present
18:29 πŸ”— atomicthu the one I've been running since yesterday was only scraping The Firing Range as a test, I've just added YOSPOS and The Dorkroom since that's pretty much where I posted when I was on there
18:31 πŸ”— JAA It's not possible to rescrape that directly just by changing options after the fact, but the URLs can be pulled from the DB file and then grabbed in a separate wpull run.
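
A minimal sketch of the "pull the URLs from the DB file" step JAA describes. The table and column names (`queued_urls`, `url_strings`, `url_string_id`, `status`) and the `skipped` status value are assumptions about wpull's SQLite layout, not something confirmed in this log, so inspect the schema of the real file before relying on them. The `forums.example.com` URLs and the `make_demo_db` helper are made up for illustration.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database that mimics the ASSUMED wpull layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE url_strings (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE NOT NULL
        );
        CREATE TABLE queued_urls (
            id INTEGER PRIMARY KEY,
            url_string_id INTEGER NOT NULL REFERENCES url_strings (id),
            status TEXT NOT NULL
        );
        INSERT INTO url_strings (id, url) VALUES
            (1, 'https://forums.example.com/forumdisplay.php?forumid=1'),
            (2, 'https://forums.example.com/showthread.php?threadid=2'),
            (3, 'https://forums.example.com/showthread.php?threadid=2&userid=3');
        INSERT INTO queued_urls (url_string_id, status) VALUES
            (1, 'done'), (2, 'done'), (3, 'skipped');
        """
    )
    return conn


def urls_with_status(conn, status):
    """Return the URLs whose queue entry has the given status, oldest first."""
    rows = conn.execute(
        "SELECT u.url FROM queued_urls AS q "
        "JOIN url_strings AS u ON u.id = q.url_string_id "
        "WHERE q.status = ? ORDER BY q.id",
        (status,),
    )
    return [url for (url,) in rows]


if __name__ == "__main__":
    # Print the URLs the first crawl did not fetch, one per line.
    for url in urls_with_status(make_demo_db(), "skipped"):
        print(url)
```

Against a real crawl, the connection would point at the database file from the first run instead of the demo database, and the printed list could be saved to a text file and handed to a second wpull run through `--input-file`.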
18:32 πŸ”— JAA Right, I've been wondering about those "Archives"... Does that mean that over half of the content on the forums is just impossible to access?
18:32 πŸ”— JAA Well, over half the threads, almost three quarters of the posts.
18:32 πŸ”— JAA Going by the numbers on the homepage, that is.
18:34 πŸ”— atomicthu yeah basically if a thread didn't get posted in for long enough it falls into the archives
18:35 πŸ”— atomicthu there are very long-running threads in some forums, but in some cases threads go through iterations of being closed and a new one created, and the old one would fall into archives
18:36 πŸ”— atomicthu JAA: a hacky workaround might be to just stuff the scraped files into a local web server, and then run a second scrape on that :v
18:37 πŸ”— atomicthu workaround for rescrape with requisites, rather
18:38 πŸ”— JAA Eww
18:39 πŸ”— atomicthu i've had to kill the scraper a few times when it hung on a graceful stop. I just realized I don't know if that ruins the WARC or not (--warc-append is turned on)
18:40 πŸ”— JAA It shouldn't, except it might. Which wpull version are you using?
18:41 πŸ”— atomicthu 2.0.1
18:41 πŸ”— JAA Yeah, that's what I thought.
18:41 πŸ”— JAA I recommend using 2.0.3 instead. It's not on PyPI but needs to be installed from GitHub.
18:42 πŸ”— atomicthu ok
18:42 πŸ”— JAA pip install git+https://github.com/ArchiveTeam/wpull.git@v2.0.3
18:42 πŸ”— atomicthu one more question: what's the best way to get it to download page requisites from anything on any "somethingawful.com" subdomain, but not scrape pictures from imgur etc?
18:43 πŸ”— JAA What killing wpull does ruin is the log files. So keep any tmp-*.log.gz files you may have in the directory.
18:43 πŸ”— atomicthu got it
18:44 πŸ”— JAA `--page-requisites --span-hosts-allow page-requisites --domains somethingawful.com` should do it I think.
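
A sketch of how the options JAA lists could be assembled into a full command line. Only `--page-requisites`, `--span-hosts-allow page-requisites`, `--domains` and `--warc-append` come from this conversation; `--recursive`, `--warc-file`, the start URL and the WARC name are assumptions about what atomicthu's one-liner looks like, and `build_wpull_command` is a made-up helper.

```python
import shlex


def build_wpull_command(start_url, warc_name, domain):
    """Return an argv list for a recursive crawl restricted to one domain."""
    return [
        "wpull",
        start_url,
        "--recursive",
        # Fetch images, CSS and scripts needed to render each page...
        "--page-requisites",
        # ...and allow only those requisites to be fetched from other hosts...
        "--span-hosts-allow", "page-requisites",
        # ...while still limiting every fetched URL to this domain and its subdomains.
        "--domains", domain,
        "--warc-file", warc_name,
        "--warc-append",
    ]


if __name__ == "__main__":
    cmd = build_wpull_command(
        "https://forums.somethingawful.com/", "sa-forums", "somethingawful.com"
    )
    # Print a shell-safe version of the command for copy and paste.
    print(shlex.join(cmd))
```

The intent, as far as the thread goes, is that requisites may come from any `somethingawful.com` subdomain while images hosted on imgur and similar sites are left alone.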
18:44 πŸ”— atomicthu thanks.
18:49 πŸ”— atomicthu I might restart the scrape and move things to a big linode. Possibly grab the whole forum if I'm able.
18:50 πŸ”— atomicthu what amount of RAM/vps size should I budget here?
18:54 πŸ”— JAA You only need (base system +) ~100-200 MB RAM for wpull. The resource that will likely be limiting is the CPU due to the HTML parsing, compressing for WARCs, etc. wpull is single-threaded (for practical purposes), so you won't gain anything by having more than 1 core either.
18:54 πŸ”— atomicthu ok.
18:58 πŸ”— atomicthu can, or should I, indicate in the useragent that the scrape is for Archive Team along with a request to PM me if I need to slow it down? I know I'm not using whatever official solution exists for this
18:58 πŸ”— atomicthu my own was "fuck lowtax, this is atomicthumbs archiving selected subforums, if i'm slamming the server please PM and I'll slow it down" :v
18:59 πŸ”— atomicthu it's behind cloudflare but I have no idea what % of my requests actually hit a dynamic page
19:01 πŸ”— atomicthu thank you for all your help btw
19:12 πŸ”— Meli has joined #archiveteam-bs
19:12 πŸ”— JAA It's generally a good idea to identify yourself. You're logged in though, so they'd already know who you are in theory. Adding your profile URL to the UA might still be good.
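
One way to put together a user agent along the lines JAA suggests, as a rough sketch. The `name (+URL)` layout is just a common convention, and the profile URL, wording and `build_user_agent` helper are placeholders rather than anything taken from this log.

```python
def build_user_agent(purpose, contact_url, note=""):
    """Return a user-agent string that says who is crawling and how to reach them."""
    user_agent = f"{purpose} (+{contact_url})"
    if note:
        user_agent += f" {note}"
    return user_agent


if __name__ == "__main__":
    print(
        build_user_agent(
            "personal forum archive crawl",
            "https://forums.example.com/member.php?userid=0",  # placeholder profile URL
            "PM me if this is too fast and I will slow it down",
        )
    )
```

The resulting string would be passed to wpull with `--user-agent`.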
19:50 πŸ”— godane has joined #archiveteam-bs
20:12 πŸ”— mtntmnky_ has joined #archiveteam-bs
20:13 πŸ”— mtntmnky has quit IRC (Remote host closed the connection)
20:20 πŸ”— roxfan has quit IRC (Remote host closed the connection)
20:20 πŸ”— schbirid has quit IRC (Quit: Leaving)
21:23 πŸ”— mtntmnky has joined #archiveteam-bs
21:33 πŸ”— atomicthu oh christ i never paid for no-ads and doing that now would mean giving lowtax money
21:34 πŸ”— mtntmnky_ has quit IRC (Remote host closed the connection)
21:53 πŸ”— atomicthu gonna fire this thing up once python 3.6 finishes building on the linode.
21:53 πŸ”— atomicthu is there any reason I shouldn't use --html-parser libxml2-lxml on wpull
21:54 πŸ”— atomicthu got postres installed for sqlalchemy
21:54 πŸ”— atomicthu *gres
21:54 πŸ”— lennier2 has joined #archiveteam-bs
21:55 πŸ”— lennier1 has quit IRC (Read error: Operation timed out)
21:56 πŸ”— lennier2 is now known as lennier1
22:01 πŸ”— Jens has joined #archiveteam-bs
22:04 πŸ”— JAA atomicthu: wpull only works with SQLite (even though it uses SQLAlchemy). It uses syntax (INSERT OR IGNORE) that to my knowledge is not supported by anything else.
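
For readers who have not met the syntax JAA mentions, a small standalone demonstration of `INSERT OR IGNORE` using Python's built-in `sqlite3` module. The `seen_urls` table and the URLs are invented for this example and are not wpull's own schema.

```python
import sqlite3


def add_urls(conn, urls):
    """Insert URLs, silently skipping any that violate the UNIQUE constraint."""
    conn.executemany(
        "INSERT OR IGNORE INTO seen_urls (url) VALUES (?)",
        [(url,) for url in urls],
    )


def count_urls(conn):
    """Return how many distinct URLs are stored."""
    return conn.execute("SELECT COUNT(*) FROM seen_urls").fetchone()[0]


def make_db():
    """Create an in-memory database with a single table of unique URLs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seen_urls (id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL)")
    return conn


if __name__ == "__main__":
    conn = make_db()
    add_urls(conn, ["https://example.com/a", "https://example.com/b"])
    # The duplicate of /a is ignored instead of raising sqlite3.IntegrityError.
    add_urls(conn, ["https://example.com/a", "https://example.com/c"])
    print(count_urls(conn))
```

Other databases spell the same idea differently (PostgreSQL has `INSERT ... ON CONFLICT DO NOTHING`, MySQL has `INSERT IGNORE`), which is why a statement written this way does not carry over to Postgres unchanged.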
22:04 πŸ”— atomicthu thank you
22:05 πŸ”— JAA The libxml2-lxml parser might fail in some corner cases of invalid HTML, but it should work fine for the vast majority of sites and is much faster than html5lib.
22:21 πŸ”— atomicthu sweet
22:21 πŸ”— atomicthu i'm running it on a dedicated cpu instance so hopefully it'll mostly be waiting for the forum
23:07 πŸ”— Arcorann has joined #archiveteam-bs
23:19 πŸ”— JAA http://www.tigris.org/ "The site will be decommissioned and shut down on 1-July-2020"
23:19 πŸ”— JAA Yeah... I know that, now let me access the stuff.
23:19 πŸ”— JAA Someone is doing dev work on the prod system, it seems.
23:22 πŸ”— atomicthu INFO Converting links in file ‘forums.somethingawful.com/js/vb/forums.combined.js?1476414227’ (type=None).
23:22 πŸ”— atomicthu ERROR Fatal exception.
23:22 πŸ”— atomicthu [stack trace]
23:22 πŸ”— atomicthu Exception: Unknown link type.
23:22 πŸ”— atomicthu maybe that's a reason not to use lxml
23:23 πŸ”— JAA Hmm, no idea, never used -k before as I typically only save WARCs and use --delete-after for the rest.
23:24 πŸ”— atomicthu hm
23:24 πŸ”— JAA That seems like a bug though, and also type really shouldn't be None.
23:24 πŸ”— atomicthu This is my first time using WARCs. Are they as easily browsable as a local mirror of the site, given the right tool?
23:24 πŸ”— JAA Yes, with pywb.
23:26 πŸ”— JAA In theory, you could also produce your familiar tree structure from it.
23:26 πŸ”— JAA WARCs contain the raw HTTP requests and responses, so really any processing can be done with it later if desired.
23:26 πŸ”— JAA I'm not aware of a tool to do it, but it should be possible by reusing some wpull innards.
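
To make "raw HTTP requests and responses" concrete, here is a deliberately simplified reader for an uncompressed WARC stream, using only the standard library. It is a sketch of the record layout (a version line, named headers, a blank line, then `Content-Length` bytes of payload), not a full implementation of the format; real processing is better left to an established WARC library. `iter_warc_records` and `make_record` are made-up helper names, and the sample record is fabricated.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes, e.g. a raw HTTP response.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(warc_type, target_uri, payload):
    """Build a minimal WARC record for demonstration purposes."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
    data = make_record("response", "http://example.com/", http_response)
    for headers, payload in iter_warc_records(io.BytesIO(data)):
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

For a compressed `.warc.gz` file, opening it with `gzip.open(path, "rb")` and passing the result to the same function should work, since Python's `gzip` module reads concatenated gzip members as one stream.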
23:26 πŸ”— atomicthu yeah it looks like i could just... crawl warcserver if i wanted to
23:26 πŸ”— atomicthu and convert the links?
23:27 πŸ”— JAA That would also work if your archive isn't too large (since it'll be fairly inefficient).
23:30 πŸ”— Maylay has quit IRC (Read error: Operation timed out)
23:39 πŸ”— BlueMax has joined #archiveteam-bs
23:42 πŸ”— Maylay has joined #archiveteam-bs
23:45 πŸ”— arkiver jodizzle: yeah fuck that
